
Web Scraping with Python: Complete Guide Using Requests, BeautifulSoup, and Selenium


You need data that is not available through APIs -- product prices from competitor websites, research papers from academic portals, job listings from hiring platforms, or news articles from multiple sources. Copying data manually is slow, error-prone, and impossible at scale. Web scraping automates this process, but choosing the wrong tool or approach leads to blocked requests, broken parsers, and legal headaches.

This guide covers the complete Python web scraping stack: requests + BeautifulSoup for static pages, Selenium for JavaScript-rendered content, and Scrapy for large-scale crawling. You'll learn practical techniques for handling real-world challenges.


Quick Start: requests + BeautifulSoup

The most common stack for scraping static HTML pages:

import requests
from bs4 import BeautifulSoup
 
# Fetch the page
url = 'https://books.toscrape.com/'
response = requests.get(url)
response.raise_for_status()  # Raise error for bad status codes
 
# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
 
# Extract book titles
books = soup.select('article.product_pod h3 a')
for book in books[:5]:
    print(book['title'])

Installation

# Install required packages
# pip install requests beautifulsoup4 lxml

Parsing HTML with BeautifulSoup

Finding Elements

from bs4 import BeautifulSoup
 
html = """
<div class="products">
    <div class="product" id="p1">
        <h2 class="name">Widget A</h2>
        <span class="price">$9.99</span>
        <p class="desc">A useful widget</p>
    </div>
    <div class="product" id="p2">
        <h2 class="name">Widget B</h2>
        <span class="price">$14.99</span>
        <p class="desc">A better widget</p>
    </div>
</div>
"""
 
soup = BeautifulSoup(html, 'html.parser')
 
# By tag
print(soup.find('h2').text)  # "Widget A"
 
# By class
prices = soup.find_all('span', class_='price')
for p in prices:
    print(p.text)  # "$9.99", "$14.99"
 
# By ID
product = soup.find(id='p2')
print(product.find('h2').text)  # "Widget B"
 
# CSS selector
names = soup.select('.product .name')
for n in names:
    print(n.text)

Common Selectors

| Method | Example | Finds |
| --- | --- | --- |
| find('tag') | soup.find('h2') | First h2 element |
| find_all('tag') | soup.find_all('a') | All anchor elements |
| find(class_='x') | soup.find(class_='price') | First element with class |
| find(id='x') | soup.find(id='main') | Element with ID |
| select('css') | soup.select('div.product > h2') | All CSS selector matches |
| select_one('css') | soup.select_one('#header') | First CSS match |

Extracting Data

from bs4 import BeautifulSoup
 
html = '<a href="/page/2" class="next" data-page="2">Next Page</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')
 
# Text content
print(link.text)           # "Next Page"
print(link.get_text(strip=True))  # "Next Page" (stripped)
 
# Attributes
print(link['href'])        # "/page/2"
print(link.get('class'))   # ['next']
print(link['data-page'])   # "2"
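One detail worth knowing: bracket access like link['title'] raises a KeyError when the attribute is missing, while .get() returns None or a supplied default. A minimal sketch:

```python
from bs4 import BeautifulSoup

html = '<a href="/page/2">Next Page</a>'  # note: no title attribute
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# Bracket access raises KeyError for a missing attribute
try:
    link['title']
except KeyError:
    print('title attribute missing')

# .get() returns None, or a default, instead of raising
print(link.get('title'))          # None
print(link.get('title', 'N/A'))   # "N/A"
print(link['href'])               # "/page/2"
```

Using .get() keeps scrapers from crashing on pages where some items lack an attribute.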

Handling Headers and Sessions

Many websites block requests without proper headers:

import requests
from bs4 import BeautifulSoup
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
 
# Use a session for cookies and persistent headers
session = requests.Session()
session.headers.update(headers)
 
response = session.get('https://books.toscrape.com/')
soup = BeautifulSoup(response.text, 'html.parser')
print(f"Found {len(soup.select('article.product_pod'))} books")

Pagination

Following Next Page Links

import requests
from bs4 import BeautifulSoup
 
base_url = 'https://books.toscrape.com/catalogue/'
all_books = []
url = 'https://books.toscrape.com/catalogue/page-1.html'
 
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
 
    # Extract books from current page
    for book in soup.select('article.product_pod'):
        title = book.select_one('h3 a')['title']
        price = book.select_one('.price_color').text
        all_books.append({'title': title, 'price': price})
 
    # Find next page
    next_btn = soup.select_one('li.next a')
    if next_btn:
        url = base_url + next_btn['href']
    else:
        url = None
 
    print(f"Scraped {len(all_books)} books so far...")
 
print(f"Total: {len(all_books)} books")

Scraping JavaScript Pages with Selenium

When a page renders content with JavaScript, requests only receives the initial HTML before any scripts run, so the data you want may not be in the response at all. Use Selenium to control a real browser that executes the JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
# Setup headless Chrome
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
 
try:
    driver.get('https://quotes.toscrape.com/js/')
 
    # Wait for content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
    )
 
    quotes = driver.find_elements(By.CLASS_NAME, 'quote')
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, 'text').text
        author = quote.find_element(By.CLASS_NAME, 'author').text
        print(f'"{text}" -- {author}')
finally:
    driver.quit()

Tool Comparison

| Tool | Best For | Speed | JavaScript | Learning Curve |
| --- | --- | --- | --- | --- |
| requests + BeautifulSoup | Static HTML pages | Fast | No | Easy |
| Selenium | JavaScript-rendered pages | Slow | Yes | Medium |
| Scrapy | Large-scale crawling | Fast | With plugins | Steep |
| Playwright | Modern JS apps | Medium | Yes | Medium |
| httpx + selectolax | High-performance parsing | Very fast | No | Easy |

Scrapy for Large-Scale Scraping

For crawling thousands of pages, Scrapy provides built-in concurrency, retry logic, and data pipelines:

# spider.py
import scrapy
 
class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']
 
    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
            }
 
        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
 
# Run with: scrapy runspider spider.py -o books.json

Data Cleaning and Storage

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# Scrape data
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
 
books = []
for item in soup.select('article.product_pod'):
    title = item.select_one('h3 a')['title']
    price_text = item.select_one('.price_color').text
    price = float(price_text.replace('£', ''))
    rating_class = item.select_one('p.star-rating')['class'][1]
 
    books.append({
        'title': title,
        'price': price,
        'rating': rating_class,
    })
 
# Convert to DataFrame
df = pd.DataFrame(books)
print(df.head())
 
# Save to CSV
df.to_csv('books.csv', index=False)
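Sites often encode data in presentation-oriented ways; on books.toscrape.com, the rating is a CSS class name like "Three". A small normalization helper, sketched under the assumption that raw fields arrive as strings (clean_book and RATING_MAP are illustrative names, not part of any library):

```python
# Map the site's word-based rating classes to integers for numeric analysis
RATING_MAP = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

def clean_book(raw):
    """Normalize one scraped record: trimmed title, numeric price and rating."""
    return {
        'title': raw['title'].strip(),
        'price': float(raw['price'].lstrip('£')),
        'rating': RATING_MAP.get(raw['rating'], 0),  # 0 = unrecognized rating
    }

book = clean_book({'title': ' A Light in the Attic ', 'price': '£51.77', 'rating': 'Three'})
print(book)  # {'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3}
```

Keeping cleaning logic in one function makes it easy to unit-test separately from the scraping code.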

Ethical Scraping Practices

| Practice | Description |
| --- | --- |
| Check robots.txt | Respect the site's crawling rules |
| Add delays | Use time.sleep() between requests (1-3 seconds) |
| Identify yourself | Set a descriptive User-Agent string |
| Cache responses | Don't re-scrape pages you already have |
| Check for APIs | Many sites have public APIs that are faster and allowed |
| Read ToS | Some sites explicitly prohibit scraping |
| Rate limit | Don't overwhelm servers with concurrent requests |

import time
import requests
 
def polite_scrape(urls, delay=2):
    """Scrape with delays between requests."""
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(response.text)
        time.sleep(delay)  # Be polite
    return results
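Checking robots.txt can also be automated with the standard library's urllib.robotparser. A sketch that parses an example policy inline rather than fetching a live file (in practice you would call rp.set_url(...) and rp.read() against the real https://site/robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt policy, parsed directly to show the API
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines())

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch('MyScraper/1.0', 'https://example.com/catalogue/'))  # True
print(rp.can_fetch('MyScraper/1.0', 'https://example.com/private/x'))   # False
print(rp.crawl_delay('MyScraper/1.0'))                                  # 2
```

Calling can_fetch() before each request keeps the scraper honest about the site's stated rules.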

Analyzing Scraped Data

After collecting data via web scraping, PyGWalker helps you explore and visualize the scraped dataset interactively in Jupyter:

import pandas as pd
import pygwalker as pyg
 
df = pd.read_csv('scraped_data.csv')
walker = pyg.walk(df)

For running scraping scripts iteratively in Jupyter with AI assistance, RunCell provides an AI-powered environment where you can debug selectors and test parsing logic interactively.

FAQ

What is the best Python library for web scraping?

For static pages, requests + BeautifulSoup is the simplest and most popular choice. For JavaScript-rendered pages, use Selenium or Playwright. For large-scale crawling with thousands of pages, Scrapy provides built-in concurrency and pipeline management.

How do I scrape a website that uses JavaScript?

Use Selenium or Playwright to control a headless browser that executes JavaScript. Alternatively, check the browser's Network tab for API endpoints that return JSON data -- scraping the API directly is faster and more reliable than browser automation.
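As an illustration of the API-first approach: suppose the Network tab reveals an endpoint that returns JSON shaped like the (hypothetical) payload below. Extracting the data then needs no HTML parsing at all:

```python
import json

# Hypothetical JSON response, shaped like what a site's internal
# quotes API might return (inspect the Network tab to find the real one)
payload = json.loads("""
{
  "quotes": [
    {"text": "Simplicity is the ultimate sophistication.", "author": {"name": "Leonardo da Vinci"}},
    {"text": "Stay hungry, stay foolish.", "author": {"name": "Steve Jobs"}}
  ],
  "has_next": false
}
""")

# Walk the parsed structure directly -- no selectors, no brittle markup
for q in payload['quotes']:
    print(f'"{q["text"]}" -- {q["author"]["name"]}')
```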

Is web scraping legal?

Web scraping legality depends on the jurisdiction, the website's terms of service, and how the data is used. Scraping publicly available data is generally legal in many jurisdictions, but always check the site's robots.txt and ToS. Avoid scraping personal data or copyrighted content.

How do I avoid getting blocked while scraping?

Use delays between requests (1-3 seconds), rotate User-Agent strings, respect robots.txt, use sessions for cookies, and avoid making too many concurrent requests. If you're consistently blocked, check if the site has a public API instead.

How do I handle pagination in web scraping?

Find the "next page" link or button in the HTML, extract its URL, and follow it in a loop until no more pages exist. Alternatively, if pages use query parameters (e.g., ?page=2), iterate through page numbers directly.
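The query-parameter approach can be sketched by building each page URL directly (the site and page parameter below are hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical listing site that paginates with ?page=N
BASE = 'https://example.com/listings'

def page_urls(last_page):
    """Build the URL for each page number up front."""
    return [f'{BASE}?{urlencode({"page": n})}' for n in range(1, last_page + 1)]

print(page_urls(3))
# ['https://example.com/listings?page=1',
#  'https://example.com/listings?page=2',
#  'https://example.com/listings?page=3']
```

This works when the total page count is known or discoverable; otherwise, follow "next" links as shown earlier and stop when none remains.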

Conclusion

Python's web scraping ecosystem covers every scenario: requests + BeautifulSoup for quick static page scraping, Selenium for JavaScript-heavy sites, and Scrapy for production-scale crawling. Start with the simplest tool that works, add complexity only when needed, and always scrape ethically -- respect robots.txt, add delays, and check for APIs first. Store your scraped data in structured formats like CSV or DataFrames for immediate analysis.
