Python Web Scraping: A Complete Guide to Requests, BeautifulSoup, and Selenium

Name: Soren Atelier

Updated 2026/2/10

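The data you need isn't available through an API: product prices on a competitor's site, research papers on an academic portal, job listings on a hiring platform, or news articles from multiple sources. Copying data by hand is slow and error-prone, and at scale it is simply not feasible. Web scraping automates the process, but choosing the wrong tool or approach leads to blocked requests, broken parsers, and legal trouble.

This guide covers the full Python web scraping stack: requests + BeautifulSoup for static pages, Selenium for JavaScript-rendered content, and Scrapy for large-scale crawling. You will learn practical techniques for real-world challenges such as request headers, pagination, data cleaning, and polite crawling.

Quick Start: requests + BeautifulSoup

The most common combination for scraping static HTML pages: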

import requests
from bs4 import BeautifulSoup
 
# Fetch the page
url = 'https://books.toscrape.com/'
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raise an exception for 4xx/5xx status codes
 
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
 
# Extract the first five book titles
books = soup.select('article.product_pod h3 a')
for book in books[:5]:
    print(book['title'])

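Installation

# Install the packages used in the quick start
# pip install requests beautifulsoup4 lxml
# (lxml is an optional, faster parser; the examples in this guide use the built-in 'html.parser')
# Later sections additionally use: pip install selenium scrapy pandas pygwalker

Parsing HTML with BeautifulSoup

Finding Elements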

from bs4 import BeautifulSoup
 
html = """
<div class="products">
    <div class="product" id="p1">
        <h2 class="name">Widget A</h2>
        <span class="price">$9.99</span>
        <p class="desc">A useful widget</p>
    </div>
    <div class="product" id="p2">
        <h2 class="name">Widget B</h2>
        <span class="price">$14.99</span>
        <p class="desc">A better widget</p>
    </div>
</div>
"""
 
soup = BeautifulSoup(html, 'html.parser')
 
# Find by tag
print(soup.find('h2').text)  # "Widget A"
 
# Find by class name
prices = soup.find_all('span', class_='price')
for p in prices:
    print(p.text)  # "$9.99", then "$14.99"
 
# Find by ID
product = soup.find(id='p2')
print(product.find('h2').text)  # "Widget B"
 
# CSS selectors
names = soup.select('.product .name')
for n in names:
    print(n.text)  # "Widget A", then "Widget B"

Common Selectors

Method	Example	Finds
`find('tag')`	`soup.find('h2')`	The first h2 element
`find_all('tag')`	`soup.find_all('a')`	All anchor elements
`find(class_='x')`	`soup.find(class_='price')`	The first element with that class
`find(id='x')`	`soup.find(id='main')`	The element with that ID
`select('css')`	`soup.select('div.product > h2')`	All elements matching the CSS selector
`select_one('css')`	`soup.select_one('#header')`	The first element matching the CSS selector

Extracting Data

from bs4 import BeautifulSoup
 
html = '<a href="/page/2" class="next" data-page="2">Next Page</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')
 
# Text content
print(link.text)           # "Next Page"
print(link.get_text(strip=True))  # "Next Page" (leading/trailing whitespace stripped)
 
# Attributes
print(link['href'])        # "/page/2"
print(link.get('class'))   # ['next']
print(link['data-page'])   # "2"
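
Real pages are rarely as tidy as the snippet above: an element or attribute may be missing on some items. The following is a minimal sketch of defensive extraction, using made-up HTML in which the second product has no link and neither has a price. It relies on `select_one()` returning `None` when nothing matches, and on `.get()` returning `None` for an absent attribute instead of raising `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: Widget B has no link, and no item has a price
html = """
<div class="product"><h2 class="name">Widget A</h2><a href="/a">Details</a></div>
<div class="product"><h2 class="name">Widget B</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for product in soup.select("div.product"):
    name = product.select_one(".name")
    link = product.select_one("a")        # None when the product has no link
    price = product.select_one(".price")  # None here, since no price is present
    rows.append({
        "name": name.get_text(strip=True) if name else None,
        "href": link.get("href") if link else None,
        "price": price.get_text(strip=True) if price else None,
    })

print(rows)
```

Checking for `None` before reading `.text` or an attribute keeps one malformed item from aborting the whole scrape.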

Handling Request Headers and Sessions

Many sites block requests that lack browser-like headers:

import requests
from bs4 import BeautifulSoup
 
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
 
# Use a session to persist cookies and headers across requests
session = requests.Session()
session.headers.update(headers)
 
response = session.get('https://books.toscrape.com/')
soup = BeautifulSoup(response.text, 'html.parser')
print(f"Found {len(soup.select('article.product_pod'))} books")

Handling Pagination

Following the Next-Page Link

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
 
all_books = []
url = 'https://books.toscrape.com/catalogue/page-1.html'
 
while url:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # Parse the raw bytes so BeautifulSoup detects the page encoding itself
    soup = BeautifulSoup(response.content, 'html.parser')
 
    # Extract the books on the current page
    for book in soup.select('article.product_pod'):
        title = book.select_one('h3 a')['title']
        price = book.select_one('.price_color').text
        all_books.append({'title': title, 'price': price})
 
    # Find the next page; the href is relative, so resolve it against the current URL
    next_btn = soup.select_one('li.next a')
    url = urljoin(url, next_btn['href']) if next_btn else None
 
    print(f"Scraped {len(all_books)} books so far...")
 
print(f"Total: {len(all_books)} books")

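Scraping JavaScript Pages with Selenium

When a page renders its content with JavaScript, requests only receives the raw HTML as served, before any script has run. Use Selenium to drive a real browser: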

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
# Set up headless Chrome
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
 
try:
    driver.get('https://quotes.toscrape.com/js/')
 
    # Wait up to 10 seconds for the first quote to appear
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'quote'))
    )
 
    quotes = driver.find_elements(By.CLASS_NAME, 'quote')
    for quote in quotes:
        text = quote.find_element(By.CLASS_NAME, 'text').text
        author = quote.find_element(By.CLASS_NAME, 'author').text
        print(f'"{text}" -- {author}')
finally:
    driver.quit()

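Tool Comparison

Tool	Best for	Speed	JavaScript	Learning curve
`requests` + `BeautifulSoup`	Static HTML pages	Fast	No	Easy
`Selenium`	JavaScript-rendered pages	Slow	Yes	Moderate
`Scrapy`	Large-scale crawling	Fast	Requires a plugin	Steeper
`Playwright`	Modern JS apps	Medium	Yes	Moderate
`httpx` + `selectolax`	High-performance parsing	Very fast	No	Easy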

Large-Scale Scraping with Scrapy

For crawling thousands of pages, Scrapy provides built-in concurrency, retry logic, and item pipelines:

# spider.py
import scrapy
 
class BookSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['https://books.toscrape.com/']
 
    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
            }
 
        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
 
# Run with: scrapy runspider spider.py -o books.json
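
Scrapy's concurrency and retry behavior is controlled through its settings. The dictionary below is a sketch of a conservative configuration. The setting names are Scrapy's own, but every value is an assumption chosen for illustration and should be tuned per site.

```python
# Illustrative values only; the right numbers depend on the target site.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # skip URLs that robots.txt disallows
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap parallel requests to one domain
    "DOWNLOAD_DELAY": 1.0,                # base delay in seconds between requests
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to observed latency
    "RETRY_TIMES": 3,                     # extra attempts after a failed download
    "RETRY_HTTP_CODES": [429, 500, 502, 503, 504],
}
```

Assigning this dictionary to a `custom_settings` class attribute on `BookSpider` would apply it to that spider only; in a full Scrapy project the same keys can go in `settings.py`.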

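Cleaning and Storing Data

import re
 
import pandas as pd
import requests
from bs4 import BeautifulSoup
 
# Fetch the page
url = 'https://books.toscrape.com/'
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse the raw bytes: if response.text is decoded with a wrongly guessed
# encoding, the pound sign can come out as 'Â£' and break float() below
soup = BeautifulSoup(response.content, 'html.parser')
 
books = []
for item in soup.select('article.product_pod'):
    title = item.select_one('h3 a')['title']
    price_text = item.select_one('.price_color').text
    # Keep only digits and the decimal point, e.g. "£51.77" -> 51.77
    price = float(re.sub(r'[^\d.]', '', price_text))
    # The rating is the second CSS class, e.g. <p class="star-rating Three">
    rating_class = item.select_one('p.star-rating')['class'][1]
 
    books.append({
        'title': title,
        'price': price,
        'rating': rating_class,
    })
 
# Convert to a DataFrame
df = pd.DataFrame(books)
print(df.head())
 
# Save as CSV
df.to_csv('books.csv', index=False)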
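
Cleaning logic is easier to test when it lives in small functions that do not touch the network. The sketch below normalizes one scraped record. The names `parse_price`, `clean_record`, and `RATING_WORDS` are hypothetical helpers, not part of any library, and the code assumes commas in prices are thousands separators.

```python
import re

# Maps the rating word found in the CSS class to a number
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text):
    """Return the first decimal number in `text` as a float, or None if absent."""
    # Assumption: a comma is a thousands separator, so it is simply dropped
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

def clean_record(raw):
    """Normalize one scraped record with 'title', 'price', and 'rating' keys."""
    return {
        "title": raw["title"].strip(),
        "price": parse_price(raw["price"]),
        "rating": RATING_WORDS.get(raw["rating"]),  # None for unknown words
    }

# A record with stray whitespace and a mis-decoded currency symbol
sample = {"title": "  A Light in the Attic ", "price": "Â£51.77", "rating": "Three"}
cleaned = clean_record(sample)
print(cleaned)
```

Because the price parser ignores everything except the number, it tolerates both a clean "£" and a mis-decoded "Â£" prefix.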

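Ethical Scraping Practices

Practice	Description
Check robots.txt	Follow the site's crawling rules
Add delays	Call `time.sleep()` between requests (1-3 seconds)
Identify yourself	Set a descriptive User-Agent string
Cache responses	Don't re-fetch pages you already have
Check for an API	Many sites offer a public API that is faster and explicitly permitted
Read the terms of service	Some sites explicitly prohibit scraping
Limit your rate	Don't overload the server with concurrent requests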
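
The robots.txt check in the first row can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inline sample so that it runs without network access. The rules, the `example.com` URLs, and the `my-scraper` agent name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical rules for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_robots_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch("my-scraper", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")  # seconds requested by the site, or None
print(allowed, blocked, delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` fetches and parses the real file in place of the inline sample.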

import time
import requests
 
def polite_scrape(urls, delay=2):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = []
    for url in urls:
        response = requests.get(url, timeout=10)
        results.append(response.text)
        time.sleep(delay)  # Be polite: don't hammer the server
    return results
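
The "cache responses" practice can be sketched as a small on-disk cache keyed by a hash of the URL. Here `cached_fetch` is a hypothetical helper, and `fetch` is any callable that takes a URL and returns the page body as text. The sketch never expires entries, which a real scraper would probably want.

```python
import hashlib
from pathlib import Path

def cached_fetch(url, fetch, cache_dir="cache"):
    """Return the body for `url`, calling `fetch(url)` only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe file name
    path = cache / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")  # cache hit: no request is made
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body
```

With requests, a call might look like `cached_fetch(url, lambda u: requests.get(u, timeout=10).text)`. Re-running a script during development then hits the disk instead of the site.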

Analyzing Scraped Data

Once you have collected your data, PyGWalker can help you explore and visualize the scraped dataset interactively in Jupyter:

import pandas as pd
import pygwalker as pyg
 
df = pd.read_csv('scraped_data.csv')
walker = pyg.walk(df)

To iterate on scraping scripts in Jupyter with AI assistance, RunCell provides an AI-powered environment where you can debug selectors and test parsing logic interactively.

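Frequently Asked Questions

What is the best Python library for web scraping?

For static pages, requests + BeautifulSoup is the simplest and most popular choice. For JavaScript-rendered pages, use Selenium or Playwright. For large-scale jobs that crawl thousands of pages, Scrapy provides built-in concurrency and pipeline management.

How do I scrape a website that uses JavaScript?

Use Selenium or Playwright to drive a headless browser that executes the JavaScript. Alternatively, check the browser's Network tab for API endpoints that return JSON; calling the API directly is faster and more reliable than browser automation.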
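
Once the Network tab reveals a JSON endpoint, calling it can be sketched as below. The function takes a `requests.Session`-like object. The endpoint URL, the `page` parameter, and the `items` key are all assumptions, since every site names these differently.

```python
def fetch_items(session, api_url, page=1):
    """Request one page of JSON from a hypothetical API and return its items."""
    response = session.get(api_url, params={"page": page}, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    payload = response.json()
    # "items" is an assumed key; inspect the real payload in the Network tab
    return payload.get("items", [])
```

A call might look like `fetch_items(requests.Session(), "https://example.com/api/items", page=2)`, with the placeholder URL replaced by the endpoint you actually observed.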

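Is web scraping legal?

The legality of web scraping depends on the jurisdiction, the site's terms of service, and how the data is used. In many jurisdictions scraping publicly available data is generally legal, but always check the site's robots.txt and terms of service. Avoid scraping personal data or copyrighted content.

How do I avoid getting blocked while scraping?

Add a delay between requests (1-3 seconds), rotate User-Agent strings, respect robots.txt, use a session to handle cookies, and avoid sending too many concurrent requests. If you keep getting blocked, check whether the site offers a public API.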
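
Two of these ideas, rotating the User-Agent and backing off when the server signals throttling, can be combined in one helper. This is a sketch under assumptions: `fetch_with_backoff` is a hypothetical name, the User-Agent strings are placeholders, and treating 429 and 503 as "slow down" signals is a common convention rather than a rule every site follows.

```python
import random
import time

# Placeholder strings for illustration; substitute real, current browser UAs
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def fetch_with_backoff(get, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `get(url, headers=..., timeout=...)`, retrying on 429/503 responses."""
    for attempt in range(max_attempts):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        response = get(url, headers=headers, timeout=10)
        if response.status_code not in (429, 503):
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter: roughly 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return response  # still throttled after the final attempt
```

Passing `session.get` from a `requests.Session` as the `get` argument keeps cookies across retries. The `sleep` parameter exists only so the waiting can be stubbed out in tests.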

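How do I handle pagination when scraping?

Find the "next page" link or button in the HTML, extract its URL, and follow it in a loop until no pages remain. Alternatively, if the site uses a query parameter (such as ?page=2), iterate over the page numbers directly.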
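
The query-parameter approach can be sketched as a loop that stops at the first empty page. Both function names are hypothetical. `fetch_page_items` stands for whatever code downloads one URL and returns the list of items parsed from it, and the `page` parameter name is an assumption about the target site.

```python
from urllib.parse import urlencode

def page_url(base_url, page, **params):
    """Build a URL such as https://example.com/search?q=python&page=2."""
    return f"{base_url}?{urlencode({**params, 'page': page})}"

def crawl_pages(base_url, fetch_page_items, max_pages=50, **params):
    """Collect items page by page until a page comes back empty."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page_items(page_url(base_url, page, **params))
        if not batch:
            break  # an empty page is taken to mean we ran past the last one
        items.extend(batch)
    return items
```

The `max_pages` cap is a safety net for sites that return the last page again, rather than an empty one, when the page number is out of range.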

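Summary

Python's web scraping ecosystem covers the common scenarios: requests + BeautifulSoup for quick static-page scraping, Selenium for JavaScript-heavy sites, and Scrapy for production-scale crawling. Start with the simplest tool that works, add complexity only when you need it, and always scrape ethically: respect robots.txt, add delays, and check for an API first. Store scraped data in a structured format such as CSV or a DataFrame so it is ready for analysis.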
