Web Scraping with Python: Complete Guide Using Requests, BeautifulSoup, and Selenium
You need data that is not available through APIs -- product prices from competitor websites, research papers from academic portals, job listings from hiring platforms, or news articles from multiple sources. Copying data manually is slow, error-prone, and impossible at scale. Web scraping automates this process, but choosing the wrong tool or approach leads to blocked requests, broken parsers, and legal headaches.
This guide covers the complete Python web scraping stack: requests + BeautifulSoup for static pages, Selenium for JavaScript-rendered content, and Scrapy for large-scale crawling. You'll learn practical techniques for handling real-world challenges.
Quick Start: requests + BeautifulSoup
The most common stack for scraping static HTML pages:
```python
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = 'https://books.toscrape.com/'
response = requests.get(url)
response.raise_for_status()  # Raise an error for bad status codes

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract book titles
books = soup.select('article.product_pod h3 a')
for book in books[:5]:
    print(book['title'])
```

Installation

```bash
# Install required packages
pip install requests beautifulsoup4 lxml
```

Parsing HTML with BeautifulSoup
Finding Elements
```python
from bs4 import BeautifulSoup

html = """
<div class="products">
  <div class="product" id="p1">
    <h2 class="name">Widget A</h2>
    <span class="price">$9.99</span>
    <p class="desc">A useful widget</p>
  </div>
  <div class="product" id="p2">
    <h2 class="name">Widget B</h2>
    <span class="price">$14.99</span>
    <p class="desc">A better widget</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# By tag
print(soup.find('h2').text)  # "Widget A"

# By class
prices = soup.find_all('span', class_='price')
for p in prices:
    print(p.text)  # "$9.99", "$14.99"

# By ID
product = soup.find(id='p2')
print(product.find('h2').text)  # "Widget B"

# CSS selector
names = soup.select('.product .name')
for n in names:
    print(n.text)
```

Common Selectors
| Method | Example | Finds |
|---|---|---|
| `find('tag')` | `soup.find('h2')` | First `h2` element |
| `find_all('tag')` | `soup.find_all('a')` | All anchor elements |
| `find(class_='x')` | `soup.find(class_='price')` | First element with class |
| `find(id='x')` | `soup.find(id='main')` | Element with ID |
| `select('css')` | `soup.select('div.product > h2')` | CSS selector matches |
| `select_one('css')` | `soup.select_one('#header')` | First CSS match |
Extracting Data
```python
from bs4 import BeautifulSoup

html = '<a href="/page/2" class="next" data-page="2">Next Page</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# Text content
print(link.text)                  # "Next Page"
print(link.get_text(strip=True))  # "Next Page" (stripped)

# Attributes
print(link['href'])       # "/page/2"
print(link.get('class'))  # ['next']
print(link['data-page'])  # "2"
```

Handling Headers and Sessions
Many websites block requests without proper headers:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Use a session for cookies and persistent headers
session = requests.Session()
session.headers.update(headers)

response = session.get('https://books.toscrape.com/')
soup = BeautifulSoup(response.text, 'html.parser')
print(f"Found {len(soup.select('article.product_pod'))} books")
```

Pagination
Following Next Page Links
```python
import requests
from bs4 import BeautifulSoup

base_url = 'https://books.toscrape.com/catalogue/'
all_books = []
url = 'https://books.toscrape.com/catalogue/page-1.html'

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract books from the current page
    for book in soup.select('article.product_pod'):
        title = book.select_one('h3 a')['title']
        price = book.select_one('.price_color').text
        all_books.append({'title': title, 'price': price})

    # Find the next page
    next_btn = soup.select_one('li.next a')
    if next_btn:
        url = base_url + next_btn['href']
    else:
        url = None
    print(f"Scraped {len(all_books)} books so far...")

print(f"Total: {len(all_books)} books")
```

Scraping JavaScript Pages with Selenium
When a page renders its content with JavaScript, requests only receives the initial HTML, before any scripts run. Use Selenium to control a real browser that executes the JavaScript:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up headless Chrome
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    driver.get('https://quotes.toscrape.com/js/')

    # Wait for the content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
    )

    quotes = driver.find_elements(By.CLASS_NAME, 'quote')
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, 'text').text
        author = quote.find_element(By.CLASS_NAME, 'author').text
        print(f'"{text}" -- {author}')
finally:
    driver.quit()
```

Tool Comparison
| Tool | Best For | Speed | JavaScript | Learning Curve |
|---|---|---|---|---|
| requests + BeautifulSoup | Static HTML pages | Fast | No | Easy |
| Selenium | JavaScript-rendered pages | Slow | Yes | Medium |
| Scrapy | Large-scale crawling | Fast | With plugins | Steep |
| Playwright | Modern JS apps | Medium | Yes | Medium |
| httpx + selectolax | High-performance parsing | Very fast | No | Easy |
Scrapy for Large-Scale Scraping
For crawling thousands of pages, Scrapy provides built-in concurrency, retry logic, and data pipelines:
```python
# spider.py
import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
            }

        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

# Run with: scrapy runspider spider.py -o books.json
```

Data Cleaning and Storage
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scrape data
url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = []
for item in soup.select('article.product_pod'):
    title = item.select_one('h3 a')['title']
    price_text = item.select_one('.price_color').text
    price = float(price_text.replace('£', ''))
    rating_class = item.select_one('p.star-rating')['class'][1]
    books.append({
        'title': title,
        'price': price,
        'rating': rating_class,
    })

# Convert to a DataFrame
df = pd.DataFrame(books)
print(df.head())

# Save to CSV
df.to_csv('books.csv', index=False)
```

Ethical Scraping Practices
| Practice | Description |
|---|---|
| Check robots.txt | Respect the site's crawling rules |
| Add delays | Use time.sleep() between requests (1-3 seconds) |
| Identify yourself | Set a descriptive User-Agent string |
| Cache responses | Don't re-scrape pages you already have |
| Check for APIs | Many sites have public APIs that are faster and allowed |
| Read ToS | Some sites explicitly prohibit scraping |
| Rate limit | Don't overwhelm servers with concurrent requests |
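The "Check robots.txt" practice above can be automated with the standard library's `urllib.robotparser`. A minimal sketch, here feeding `parse()` a hand-written robots.txt for illustration (a real scraper would call `set_url('<site>/robots.txt')` followed by `read()` to fetch the live file):

```python
from urllib.robotparser import RobotFileParser

# Parse a (hypothetical) robots.txt that disallows /private/ for all agents
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('MyScraper/1.0', 'https://example.com/catalogue'))     # True
print(rp.can_fetch('MyScraper/1.0', 'https://example.com/private/data'))  # False
```

Call `can_fetch()` before each request and skip any URL it rejects.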
```python
import time
import requests

def polite_scrape(urls, delay=2):
    """Scrape with delays between requests."""
    results = []
    for url in urls:
        response = requests.get(url)
        results.append(response.text)
        time.sleep(delay)  # Be polite
    return results
```

Analyzing Scraped Data
After collecting data via web scraping, PyGWalker helps you explore and visualize the scraped dataset interactively in Jupyter:
```python
import pandas as pd
import pygwalker as pyg

df = pd.read_csv('scraped_data.csv')
walker = pyg.walk(df)
```

For running scraping scripts iteratively in Jupyter with AI assistance, RunCell provides an AI-powered environment where you can debug selectors and test parsing logic interactively.
FAQ
What is the best Python library for web scraping?
For static pages, requests + BeautifulSoup is the simplest and most popular choice. For JavaScript-rendered pages, use Selenium or Playwright. For large-scale crawling with thousands of pages, Scrapy provides built-in concurrency and pipeline management.
How do I scrape a website that uses JavaScript?
Use Selenium or Playwright to control a headless browser that executes JavaScript. Alternatively, check the browser's Network tab for API endpoints that return JSON data -- scraping the API directly is faster and more reliable than browser automation.
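The second approach usually reduces to `requests.get(api_url).json()` plus a little field extraction. A sketch with a hypothetical payload (the `products` structure below is invented for illustration, not any real site's API):

```python
import json

# Hypothetical response body, shaped like a site's internal product API
# (a real scraper would obtain this via requests.get(api_url).json())
sample = '{"products": [{"name": "Widget A", "price": 9.99}, {"name": "Widget B", "price": 14.99}]}'

def extract_products(payload):
    """Pull just the fields we care about out of the JSON body."""
    data = json.loads(payload)
    return [(p['name'], p['price']) for p in data['products']]

print(extract_products(sample))  # [('Widget A', 9.99), ('Widget B', 14.99)]
```

Structured JSON like this needs no HTML parsing at all, which is why the API route is both faster and less brittle.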
Is web scraping legal?
Web scraping legality depends on the jurisdiction, the website's terms of service, and how the data is used. Scraping publicly available data is generally legal in many jurisdictions, but always check the site's robots.txt and ToS. Avoid scraping personal data or copyrighted content.
How do I avoid getting blocked while scraping?
Use delays between requests (1-3 seconds), rotate User-Agent strings, respect robots.txt, use sessions for cookies, and avoid making too many concurrent requests. If you're consistently blocked, check if the site has a public API instead.
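The User-Agent rotation mentioned above can be sketched with `itertools.cycle`; the strings in the pool are placeholders, so substitute current, realistic ones in practice:

```python
import itertools

# A small pool of User-Agent strings to rotate through (values are illustrative)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]
ua_pool = itertools.cycle(USER_AGENTS)

def next_headers():
    """Build request headers with the next User-Agent in the rotation."""
    return {'User-Agent': next(ua_pool), 'Accept-Language': 'en-US,en;q=0.9'}

print(next_headers()['User-Agent'])  # a different string on each call
print(next_headers()['User-Agent'])
```

Pass the result as `headers=` on each `requests.get()` call, or merge it into a `requests.Session` between batches.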
How do I handle pagination in web scraping?
Find the "next page" link or button in the HTML, extract its URL, and follow it in a loop until no more pages exist. Alternatively, if pages use query parameters (e.g., ?page=2), iterate through page numbers directly.
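The query-parameter variant can be sketched as a loop that stops at the first empty page. Here `fetch_page` is a caller-supplied stand-in for whatever function actually downloads and parses `?page=N` (a real one would wrap `requests.get`):

```python
def scrape_numbered_pages(fetch_page, max_pages=50):
    """Collect items from ?page=1, ?page=2, ... until a page comes back empty.

    fetch_page(n) returns the list of items parsed from page n.
    """
    items = []
    for n in range(1, max_pages + 1):
        page_items = fetch_page(n)
        if not page_items:  # an empty page means we ran out
            break
        items.extend(page_items)
    return items

# Simulated site: two full pages, one partial page, then nothing
fake_pages = {1: ['a', 'b'], 2: ['c', 'd'], 3: ['e']}
print(scrape_numbered_pages(lambda n: fake_pages.get(n, [])))  # ['a', 'b', 'c', 'd', 'e']
```

The `max_pages` cap guards against sites that return the last page repeatedly instead of an empty one.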
Conclusion
Python's web scraping ecosystem covers every scenario: requests + BeautifulSoup for quick static page scraping, Selenium for JavaScript-heavy sites, and Scrapy for production-scale crawling. Start with the simplest tool that works, add complexity only when needed, and always scrape ethically -- respect robots.txt, add delays, and check for APIs first. Store your scraped data in structured formats like CSV or DataFrames for immediate analysis.