
Web Scraping with Python: Complete Guide Using Requests, BeautifulSoup, and Selenium


You need data that is not available through APIs -- product prices from competitor websites, research papers from academic portals, job listings from hiring platforms, or news articles from multiple sources. Copying data manually is slow, error-prone, and impossible at scale. Web scraping automates this process, but choosing the wrong tool or approach leads to blocked requests, broken parsers, and legal headaches.

This guide covers the complete Python web scraping stack: requests + BeautifulSoup for static pages, Selenium for JavaScript-rendered content, and Scrapy for large-scale crawling. You'll learn practical techniques for handling real-world challenges.


Quick Start: requests + BeautifulSoup

The most common stack for scraping static HTML pages:

import requests
from bs4 import BeautifulSoup
 
# Fetch the page
url = 'https://books.toscrape.com/'
response = requests.get(url)
response.raise_for_status()  # Raise error for bad status codes
 
# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
 
# Extract book titles
books = soup.select('article.product_pod h3 a')
for book in books[:5]:
    print(book['title'])

Installation

# Install required packages
# pip install requests beautifulsoup4 lxml

Parsing HTML with BeautifulSoup

Finding Elements

from bs4 import BeautifulSoup
 
html = """
<div class="products">
    <div class="product" id="p1">
        <h2 class="name">Widget A</h2>
        <span class="price">$9.99</span>
        <p class="desc">A useful widget</p>
    </div>
    <div class="product" id="p2">
        <h2 class="name">Widget B</h2>
        <span class="price">$14.99</span>
        <p class="desc">A better widget</p>
    </div>
</div>
"""
 
soup = BeautifulSoup(html, 'html.parser')
 
# By tag
print(soup.find('h2').text)  # "Widget A"
 
# By class
prices = soup.find_all('span', class_='price')
for p in prices:
    print(p.text)  # "$9.99", "$14.99"
 
# By ID
product = soup.find(id='p2')
print(product.find('h2').text)  # "Widget B"
 
# CSS selector
names = soup.select('.product .name')
for n in names:
    print(n.text)

Common Selectors

| Method | Example | Finds |
| --- | --- | --- |
| find('tag') | soup.find('h2') | First h2 element |
| find_all('tag') | soup.find_all('a') | All anchor elements |
| find(class_='x') | soup.find(class_='price') | First element with class |
| find(id='x') | soup.find(id='main') | Element with ID |
| select('css') | soup.select('div.product > h2') | All CSS selector matches |
| select_one('css') | soup.select_one('#header') | First CSS match |

Extracting Data

from bs4 import BeautifulSoup
 
html = '<a href="/page/2" class="next" data-page="2">Next Page</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')
 
# Text content
print(link.text)           # "Next Page"
print(link.get_text(strip=True))  # "Next Page" (stripped)
 
# Attributes
print(link['href'])        # "/page/2"
print(link.get('class'))   # ['next']
print(link['data-page'])   # "2"
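One detail worth knowing: bracket access like link['title'] raises a KeyError when the attribute is missing, while .get() returns None or a supplied default. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<a href="/page/2">Next Page</a>'  # note: no title attribute
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# Bracket access raises KeyError for a missing attribute
try:
    link['title']
except KeyError:
    print('title attribute missing')

# .get() returns None, or a default, instead of raising
print(link.get('title'))          # None
print(link.get('title', 'N/A'))   # "N/A"
print(link['href'])               # "/page/2"
```

Using .get() keeps scrapers from crashing on pages where some items lack an attribute.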

Handling Headers and Sessions

Many websites block requests without proper headers:

import requests
from bs4 import BeautifulSoup
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
 
# Use a session for cookies and persistent headers
session = requests.Session()
session.headers.update(headers)
 
response = session.get('https://books.toscrape.com/')
soup = BeautifulSoup(response.text, 'html.parser')
print(f"Found {len(soup.select('article.product_pod'))} books")

Pagination

Following Next Page Links

import requests
from bs4 import BeautifulSoup
 
base_url = 'https://books.toscrape.com/catalogue/'
all_books = []
url = 'https://books.toscrape.com/catalogue/page-1.html'
 
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
 
    # Extract books from current page
    for book in soup.select('article.product_pod'):
        title = book.select_one('h3 a')['title']
        price = book.select_one('.price_color').text
        all_books.append({'title': title, 'price': price})
 
    # Find next page
    next_btn = soup.select_one('li.next a')
    if next_btn:
        url = base_url + next_btn['href']
    else:
        url = None
 
    print(f"Scraped {len(all_books)} books so far...")
 
print(f"Total: {len(all_books)} books")

Scraping JavaScript Pages with Selenium

When a page renders content with JavaScript, requests only receives the initial HTML before any scripts run, so the data you want may not be in the response at all. Use Selenium to control a real browser that executes the JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
# Setup headless Chrome
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
 
try:
    driver.get('https://quotes.toscrape.com/js/')
 
    # Wait for content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
    )
 
    quotes = driver.find_elements(By.CLASS_NAME, 'quote')
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, 'text').text
        author = quote.find_element(By.CLASS_NAME, 'author').text
        print(f'"{text}" -- {author}')
finally:
    driver.quit()

Tool Comparison

| Tool | Best For | Speed | JavaScript | Learning Curve |
| --- | --- | --- | --- | --- |
| requests + BeautifulSoup | Static HTML pages | Fast | No | Easy |
| Selenium | JavaScript-rendered pages | Slow | Yes | Medium |
| Scrapy | Large-scale crawling | Fast | With plugins | Steep |
| Playwright | Modern JS apps | Medium | Yes | Medium |
| httpx + selectolax | High-performance parsing | Very fast | No | Easy |

Scrapy for Large-Scale Scraping

For crawling thousands of pages, Scrapy provides built-in concurrency, retry logic, and data pipelines:

# spider.py
import scrapy
 
class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']
 
    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
            }
 
        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
 
# Run with: scrapy runspider spider.py -o books.json

Data Cleaning and Storage

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# Scrape data
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
 
books = []
for item in soup.select('article.product_pod'):
    title = item.select_one('h3 a')['title']
    price_text = item.select_one('.price_color').text
    price = float(price_text.replace('£', ''))
    rating_class = item.select_one('p.star-rating')['class'][1]
 
    books.append({
        'title': title,
        'price': price,
        'rating': rating_class,
    })
 
# Convert to DataFrame
df = pd.DataFrame(books)
print(df.head())
 
# Save to CSV
df.to_csv('books.csv', index=False)
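Sites often encode data in presentation-oriented ways; on books.toscrape.com, the rating is a CSS class name like "Three". A small normalization helper, sketched under the assumption that raw fields arrive as strings (clean_book and RATING_MAP are illustrative names, not part of any library):

```python
# Map the site's word-based rating classes to integers for numeric analysis
RATING_MAP = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def clean_book(raw):
    """Normalize one scraped record: trimmed title, numeric price and rating."""
    return {
        'title': raw['title'].strip(),
        'price': float(raw['price'].lstrip('£')),
        'rating': RATING_MAP.get(raw['rating'], 0),  # 0 = unrecognized rating
    }

book = clean_book({'title': ' A Light in the Attic ', 'price': '£51.77', 'rating': 'Three'})
print(book)  # {'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3}
```

Keeping cleaning logic in one function makes it easy to unit-test separately from the scraping code.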

Ethical Scraping Practices

| Practice | Description |
| --- | --- |
| Check robots.txt | Respect the site's crawling rules |
| Add delays | Use time.sleep() between requests (1-3 seconds) |
| Identify yourself | Set a descriptive User-Agent string |
| Cache responses | Don't re-scrape pages you already have |
| Check for APIs | Many sites have public APIs that are faster and allowed |
| Read ToS | Some sites explicitly prohibit scraping |
| Rate limit | Don't overwhelm servers with concurrent requests |

import time
import requests
 
def polite_scrape(urls, delay=2):
    """Scrape with delays between requests."""
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(response.text)
        time.sleep(delay)  # Be polite
    return results
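Checking robots.txt can also be automated with the standard library's urllib.robotparser. A sketch that parses an example policy inline rather than fetching a live file (in practice you would call rp.set_url(...) and rp.read() against the real https://site/robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt policy, parsed directly to show the API
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch('MyScraper/1.0', 'https://example.com/catalogue/'))  # True
print(rp.can_fetch('MyScraper/1.0', 'https://example.com/private/x'))   # False
print(rp.crawl_delay('MyScraper/1.0'))                                  # 2
```

Calling can_fetch() before each request keeps the scraper honest about the site's stated rules.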

Analyzing Scraped Data

After collecting data via web scraping, PyGWalker helps you explore and visualize the scraped dataset interactively in Jupyter:

import pandas as pd
import pygwalker as pyg
 
df = pd.read_csv('scraped_data.csv')
walker = pyg.walk(df)

For running scraping scripts iteratively in Jupyter with AI assistance, RunCell provides an AI-powered environment where you can debug selectors and test parsing logic interactively.

FAQ

What is the best Python library for web scraping?

For static pages, requests + BeautifulSoup is the simplest and most popular choice. For JavaScript-rendered pages, use Selenium or Playwright. For large-scale crawling with thousands of pages, Scrapy provides built-in concurrency and pipeline management.

How do I scrape a website that uses JavaScript?

Use Selenium or Playwright to control a headless browser that executes JavaScript. Alternatively, check the browser's Network tab for API endpoints that return JSON data -- scraping the API directly is faster and more reliable than browser automation.
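As an illustration of the API-first approach: suppose the Network tab reveals an endpoint that returns JSON shaped like the (hypothetical) payload below. Extracting the data then needs no HTML parsing at all:

```python
import json

# Hypothetical JSON response, shaped like what a site's internal
# quotes API might return (inspect the Network tab to find the real one)
payload = json.loads("""
{
  "quotes": [
    {"text": "Simplicity is the ultimate sophistication.", "author": {"name": "Leonardo da Vinci"}},
    {"text": "Stay hungry, stay foolish.", "author": {"name": "Steve Jobs"}}
  ],
  "has_next": false
}
""")

# Walk the parsed structure directly -- no selectors, no brittle markup
for q in payload['quotes']:
    print(f'"{q["text"]}" -- {q["author"]["name"]}')
```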

Is web scraping legal?

Web scraping legality depends on the jurisdiction, the website's terms of service, and how the data is used. Scraping publicly available data is generally legal in many jurisdictions, but always check the site's robots.txt and ToS. Avoid scraping personal data or copyrighted content.

How do I avoid getting blocked while scraping?

Use delays between requests (1-3 seconds), rotate User-Agent strings, respect robots.txt, use sessions for cookies, and avoid making too many concurrent requests. If you're consistently blocked, check if the site has a public API instead.

How do I handle pagination in web scraping?

Find the "next page" link or button in the HTML, extract its URL, and follow it in a loop until no more pages exist. Alternatively, if pages use query parameters (e.g., ?page=2), iterate through page numbers directly.
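The query-parameter approach can be sketched by building each page URL directly (the site and page parameter below are hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical listing site that paginates with ?page=N
BASE = 'https://example.com/listings'

def page_urls(last_page):
    """Build the URL for each page number up front."""
    return [f'{BASE}?{urlencode({"page": n})}' for n in range(1, last_page + 1)]

print(page_urls(3))
# ['https://example.com/listings?page=1',
#  'https://example.com/listings?page=2',
#  'https://example.com/listings?page=3']
```

This works when the total page count is known or discoverable; otherwise, follow "next" links as shown earlier and stop when none remains.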

Conclusion

Python's web scraping ecosystem covers every scenario: requests + BeautifulSoup for quick static page scraping, Selenium for JavaScript-heavy sites, and Scrapy for production-scale crawling. Start with the simplest tool that works, add complexity only when needed, and always scrape ethically -- respect robots.txt, add delays, and check for APIs first. Store your scraped data in structured formats like CSV or DataFrames for immediate analysis.
