Web Scraping with Scrapy
Scrapy is a complete crawling framework rather than a single-purpose library. Where requests and BeautifulSoup hand you raw HTML and leave orchestration to you, Scrapy provides an asynchronous engine, a request scheduler, automatic retries, configurable concurrency, and an item pipeline — the infrastructure that production crawls need. This guide walks through building a real spider and the settings that keep it fast and polite. For the broader context on production architecture, see Scaling & Deploying Python Web Scrapers.
Installation and Project Structure
Scrapy installs as a single package and ships a command-line tool that scaffolds projects.
pip install scrapy
scrapy startproject bookstore
This generates a project with a predictable layout: spiders/ holds your crawlers, items.py defines the data schema, pipelines.py processes scraped records, and settings.py controls concurrency, throttling, and middleware. That separation of concerns is the whole point — fetching, parsing, and storage stay in distinct, testable layers.
Writing Your First Spider
A spider declares where to start, how to follow links, and how to parse responses. Scrapy calls your parse method with each downloaded response and lets you yield either extracted items or new requests to follow.
import scrapy


class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "in_stock": bool(book.css("p.instock.availability")),
            }

        # Follow pagination automatically
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Run it and export the results in one command:
scrapy crawl books -o books.json
Scrapy handles the request queue, concurrency, and retries while your code focuses purely on extraction.
Selectors: CSS and XPath
Scrapy's response object exposes both CSS and XPath selectors. CSS is concise for class- and tag-based selection; XPath is more powerful for traversing relationships and matching on text. The ::text and ::attr() pseudo-selectors extract text and attributes directly. If you are coming from BeautifulSoup, the mental model is similar — see Parsing HTML with BeautifulSoup — but Scrapy selectors are backed by the fast parsel library and integrate with the framework's response handling.
# CSS
response.css("h3 a::attr(title)").get()
# Equivalent XPath
response.xpath("//h3/a/@title").get()
Use .get() for the first match and .getall() for a list of all matches. When nothing matches, .get() returns None and .getall() returns an empty list instead of raising, which keeps parsing code resilient.
Items and Pipelines
For anything beyond a quick export, define a schema with Item and process records through pipelines. An item declares the fields you expect; a pipeline validates, cleans, deduplicates, or stores each scraped record as it flows through the engine.
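# items.py
import scrapy


class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    in_stock = scrapy.Field()


# pipelines.py
class PriceCleanPipeline:
    def process_item(self, item, spider):
        # Strip the currency symbol and convert to a number.
        # The selector may have found nothing, so guard against None.
        price = item.get("price")
        if price is not None:
            item["price"] = float(price.replace("£", ""))
        return item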
Enable the pipeline in settings.py with ITEM_PIPELINES = {"bookstore.pipelines.PriceCleanPipeline": 300}. Pipelines are where storage logic belongs — see Storing and Exporting Scraped Data for writing to databases from a pipeline.
Concurrency, Throttling, and Politeness
Scrapy is concurrent by default, which makes responsible configuration essential. The key settings in settings.py:
# settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5 # base delay between requests
AUTOTHROTTLE_ENABLED = True # adapt delay to server response time
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
RETRY_TIMES = 3 # retry transient failures
ROBOTSTXT_OBEY = True # respect robots.txt by default
AutoThrottle is particularly valuable: it adjusts the delay based on observed server latency, slowing down when the target is under load. The built-in retry middleware complements it by retrying transient failures, including 429 and 503 responses by default, up to RETRY_TIMES times. Together they cover much of the backoff logic you would otherwise write by hand. For evading more aggressive defenses, integrate proxy and header rotation from Rotating Proxies and Managing IP Blocks.
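To make the adjustment concrete, here is a simplified model of the delay update described in Scrapy's AutoThrottle documentation. The target delay is the response latency divided by AUTOTHROTTLE_TARGET_CONCURRENCY, and the next delay is the average of the previous delay and that target. Non-200 responses are not allowed to lower the delay, and the result stays between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY. The function name and signature are invented for illustration, and the real extension may differ in detail.

```python
def next_delay(prev_delay, latency, status=200,
               target_concurrency=4.0, min_delay=0.5, max_delay=60.0):
    """Return the delay to use after one response (illustrative model only)."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target = latency / target_concurrency
    # Move halfway from the previous delay toward the target.
    candidate = (prev_delay + target) / 2.0
    # Never go below DOWNLOAD_DELAY or above AUTOTHROTTLE_MAX_DELAY.
    candidate = min(max(min_delay, candidate), max_delay)
    # Error responses often return quickly; do not let them speed the crawl up.
    if status != 200 and candidate <= prev_delay:
        return prev_delay
    return candidate


fast_ok = next_delay(prev_delay=5.0, latency=2.0)                 # healthy server: delay shrinks
fast_error = next_delay(prev_delay=5.0, latency=2.0, status=503)  # error: delay is kept
slow_ok = next_delay(prev_delay=0.5, latency=8.0)                 # slow server: delay grows
floor = next_delay(prev_delay=0.5, latency=0.4)                   # clamped at min_delay
```

The defaults mirror the settings shown above (a target concurrency of 4.0 and a DOWNLOAD_DELAY of 0.5). The 60-second ceiling is Scrapy's documented default for AUTOTHROTTLE_MAX_DELAY.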
Common Mistakes to Avoid
- Blocking the event loop: Scrapy is asynchronous. Calling time.sleep() or synchronous requests inside a spider stalls the entire engine. Use DOWNLOAD_DELAY and yield requests instead.
- Disabling AutoThrottle then setting concurrency too high: this is the fastest route to an IP ban. Let AutoThrottle adapt, or tune DOWNLOAD_DELAY conservatively.
- Putting storage logic in parse: keep parsing pure and move persistence into pipelines so it is reusable and testable.
- Ignoring response.follow: building absolute URLs by hand is error-prone; response.follow resolves relative links for you.
- Forgetting to handle missing fields: always assume a selector may return None and validate before downstream processing.
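To illustrate the response.follow point: relative links resolve against the URL of the page they appear on, not the site root. The standard library's urljoin applies the same kind of resolution, so it serves here as a stand-in that runs without a crawl. The page URL and hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# The page currently being parsed (hypothetical).
page_url = "https://books.toscrape.com/catalogue/page-2.html"

# Hand-built URL: gluing the href onto the site root loses the /catalogue/ path.
naive = "https://books.toscrape.com/" + "page-3.html"

# Proper resolution keeps the directory of the current page.
resolved = urljoin(page_url, "page-3.html")

# Parent-directory and already-absolute hrefs are handled as well.
parent = urljoin(page_url, "../index.html")
absolute = urljoin(page_url, "https://example.com/other")
```

The naive string points at a page that does not exist under the catalogue path, which is exactly the kind of silent error that hand-built URLs tend to introduce.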
Frequently Asked Questions
Does Scrapy render JavaScript?
No — Scrapy fetches raw HTML and does not execute JavaScript. For dynamic, JS-rendered sites, use a headless browser such as Playwright, or integrate scrapy-playwright to combine the two.
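As a rough sketch of the scrapy-playwright route, the settings below follow that plugin's documented setup. They assume the scrapy-playwright package and a Playwright browser are installed, and exact requirements may vary between plugin versions.

```python
# settings.py (assumes scrapy-playwright is installed)

# Route HTTP and HTTPS downloads through the Playwright-backed handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# The plugin needs Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider, opt individual requests into browser rendering:
#     yield scrapy.Request(url, meta={"playwright": True})
```

Only requests that carry the playwright meta key are rendered in a browser, so static pages can keep using the ordinary, much cheaper downloader.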
Is Scrapy overkill for small projects?
For a few pages, yes — requests and BeautifulSoup are simpler. Scrapy pays off when you crawl many linked pages, need retries and throttling, or run the job repeatedly. See Scrapy vs BeautifulSoup: Which to Use.
How do I schedule Scrapy crawls?
Run spiders from cron or a task scheduler, or use Scrapyd / a managed service to deploy and schedule spiders with an API. The crawl itself stays the same code.
Can Scrapy resume an interrupted crawl?
Yes. Enable a persistent job directory with JOBDIR so the scheduler and deduplication filter survive restarts, letting a stopped crawl pick up where it left off.
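A minimal sketch of that setting follows. The directory name is a hypothetical example; each crawl job should get its own directory, not one shared between spiders or unrelated runs.

```python
# settings.py
# Hypothetical path: Scrapy stores the pending request queue and the
# seen-request fingerprints here so they survive a restart.
JOBDIR = "crawls/books-1"
```

The same value can be passed for a single run with scrapy crawl books -s JOBDIR=crawls/books-1. Stop the crawl with a single Ctrl-C and wait for it to shut down gracefully so the state is written out, then issue the same command to resume.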