
Web Scraping with Scrapy

Scrapy is a complete crawling framework rather than a single-purpose library. Where requests and BeautifulSoup hand you raw HTML and leave orchestration to you, Scrapy provides an asynchronous engine, a request scheduler, automatic retries, configurable concurrency, and an item pipeline — the infrastructure that production crawls need. This guide walks through building a real spider and the settings that keep it fast and polite. For the broader context on production architecture, see Scaling & Deploying Python Web Scrapers.

[Figure: Scrapy architecture. The engine sits at the center, exchanging requests and responses with the scheduler, the downloader and its middlewares (which fetch pages from the web), and the spider (whose parse callbacks yield items), then sends items to the item pipelines.]
Scrapy's engine coordinates the scheduler, downloader, spider, and item pipelines.

Installation and Project Structure

Scrapy installs as a single package and ships a command-line tool that scaffolds projects.

pip install scrapy
scrapy startproject bookstore

This generates a project with a predictable layout: spiders/ holds your crawlers, items.py defines the data schema, pipelines.py processes scraped records, and settings.py controls concurrency, throttling, and middleware. That separation of concerns is the whole point — fetching, parsing, and storage stay in distinct, testable layers.

Writing Your First Spider

A spider declares where to start, how to follow links, and how to parse responses. Scrapy calls your parse method with each downloaded response and lets you yield either extracted items or new requests to follow.

import scrapy

class BookSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "in_stock": bool(book.css("p.instock.availability")),
            }

        # Follow pagination automatically
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it and export the results in one command:

scrapy crawl books -o books.json

Scrapy handles the request queue, concurrency, and retries while your code focuses purely on extraction.

Selectors: CSS and XPath

Scrapy's response object exposes both CSS and XPath selectors. CSS is concise for class- and tag-based selection; XPath is more powerful for traversing relationships and matching on text. The ::text and ::attr() pseudo-selectors extract text and attributes directly. If you are coming from BeautifulSoup, the mental model is similar — see Parsing HTML with BeautifulSoup — but Scrapy selectors are backed by the fast parsel library and integrate with the framework's response handling.

# CSS
response.css("h3 a::attr(title)").get()
# Equivalent XPath
response.xpath("//h3/a/@title").get()

Use .get() for the first match and .getall() for a list of every match. When nothing matches, .get() returns None and .getall() returns an empty list instead of raising, which keeps parsing code resilient.

Items and Pipelines

For anything beyond a quick export, define a schema with Item and process records through pipelines. An item declares the fields you expect; a pipeline validates, cleans, deduplicates, or stores each scraped record as it flows through the engine.

# items.py
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    in_stock = scrapy.Field()

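# pipelines.py
class PriceCleanPipeline:
    def process_item(self, item, spider):
        # The selector may have matched nothing, so guard against a missing price.
        raw_price = item.get("price")
        if raw_price is not None:
            # Strip the currency symbol before converting, e.g. "£51.77" -> 51.77
            item["price"] = float(raw_price.replace("£", "").strip())
        return item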

Enable the pipeline in settings.py with ITEM_PIPELINES = {"bookstore.pipelines.PriceCleanPipeline": 300}. Pipelines are where storage logic belongs — see Storing and Exporting Scraped Data for writing to databases from a pipeline.

Concurrency, Throttling, and Politeness

Scrapy is concurrent by default, which makes responsible configuration essential. The key settings in settings.py:

# settings.py
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5            # base delay between requests
AUTOTHROTTLE_ENABLED = True    # adapt delay to server response time
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
RETRY_TIMES = 3                # retry transient failures
ROBOTSTXT_OBEY = True          # respect robots.txt (the startproject template enables this)

AutoThrottle is particularly valuable: it adjusts the download delay based on observed response latency, slowing down when the target is under load. The built-in retry middleware re-sends requests that fail with transient errors such as 429 or 503, up to RETRY_TIMES attempts, so together they cover much of the backoff logic you would otherwise write by hand. For evading more aggressive defenses, integrate proxy and header rotation from Rotating Proxies and Managing IP Blocks.
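To see how latency turns into a delay, the following is a simplified sketch of the delay-update rule as the AutoThrottle documentation describes it. It is not Scrapy's actual code: the function name next_delay is hypothetical, and the default arguments mirror the settings shown above plus an assumed maximum delay of 60 seconds.

```python
def next_delay(current_delay, latency, status=200, *,
               target_concurrency=4.0, min_delay=0.5, max_delay=60.0):
    """Return the delay to use for the next request to the same site.

    Simplified model of AutoThrottle: aim for latency / target_concurrency,
    move halfway toward that target, and stay within [min_delay, max_delay].
    """
    target_delay = latency / target_concurrency
    # Average the current and target delays, but never go below the target.
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp to the configured bounds (min_delay plays the role of DOWNLOAD_DELAY).
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses often return quickly; do not let them speed the crawl up.
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay


fast_server = next_delay(0.5, latency=0.4)                # stays at the 0.5 s floor
slow_server = next_delay(0.5, latency=8.0)                # backs off to 2.0 s
after_error = next_delay(2.0, latency=0.4, status=503)    # remains at 2.0 s
print(fast_server, slow_server, after_error)
```

The last case shows why a burst of quick 503 responses does not make the crawl speed up: a non-200 response is never allowed to reduce the delay.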

Common Mistakes to Avoid

  • Blocking the event loop: Scrapy is asynchronous. Calling time.sleep() or making synchronous HTTP calls inside a spider callback stalls the entire engine. Use DOWNLOAD_DELAY and yield requests instead.
  • Disabling AutoThrottle then setting concurrency too high: this is the fastest route to an IP ban. Let AutoThrottle adapt, or tune DOWNLOAD_DELAY conservatively.
  • Putting storage logic in parse: keep parsing pure and move persistence into pipelines so it is reusable and testable.
  • Ignoring response.follow: building absolute URLs by hand is error-prone; response.follow resolves relative links for you.
  • Forgetting to handle missing fields: always assume a selector may return None and validate before downstream processing.

Frequently Asked Questions

Does Scrapy render JavaScript? No — Scrapy fetches raw HTML and does not execute JavaScript. For dynamic, JS-rendered sites, use a headless browser such as Playwright, or integrate scrapy-playwright to combine the two.

Is Scrapy overkill for small projects? For a few pages, yes — requests and BeautifulSoup are simpler. Scrapy pays off when you crawl many linked pages, need retries and throttling, or run the job repeatedly. See Scrapy vs BeautifulSoup: Which to Use.

How do I schedule Scrapy crawls? Run spiders from cron or another task scheduler, or deploy them to Scrapyd or a managed service and trigger runs through its API. The spider code stays the same.

Can Scrapy resume an interrupted crawl? Yes. Enable a persistent job directory with JOBDIR so the scheduler and deduplication filter survive restarts, letting a stopped crawl pick up where it left off.