
Scrapy vs Playwright for Single-Page Apps

Single-page apps break naive HTML scrapers because the content arrives only after JavaScript runs. This guide, part of Web Scraping with Scrapy, explains when plain Scrapy still wins and when you truly need a browser.

[Figure: Scrapy versus Playwright decision path for single-page apps. Start from a single-page app (an empty HTML shell plus JavaScript) and check the Network tab for a hidden JSON API. If one exists (the usual case), use Scrapy to call it directly: fast, low memory, clean JSON. If the page needs rendering, use Playwright or the scrapy-playwright hybrid.]
For an SPA, prefer Scrapy calling the hidden JSON API; fall back to Playwright rendering only when client-side execution is unavoidable.

For most single-page apps, you do not need a browser at all. The content that renders on screen is almost always delivered by a background JSON API that you can call directly with Scrapy — faster, cheaper, and far more stable than driving a headless browser. Reach for Playwright only when rendering depends on client-side execution you cannot replay: heavy anti-bot JavaScript, canvas or WebGL output, or state assembled across many interdependent requests. The pragmatic middle ground is scrapy-playwright, which renders just the pages that need a browser inside a normal Scrapy crawl.

Why SPAs Defeat Plain HTML Parsing

A single-page app ships a nearly empty HTML shell and a bundle of JavaScript. The browser runs that JavaScript, which fetches data over fetch or XMLHttpRequest and injects it into the DOM. When Scrapy downloads the page, it sees only the shell — the product grid, the reviews, the prices are all absent because no JavaScript ran. This is the same wall that motivates the browser-automation tooling in Advanced Scraping Techniques and Anti-Bot Evasion.

There are two ways through the wall. The first is to render the page in a real browser so the JavaScript executes and the DOM fills in; this is what Playwright does, and it is covered end to end in Using Playwright for Modern Web Automation. The second is to skip the browser and call the same API the JavaScript calls. The second path is almost always the better default: it is typically an order of magnitude faster, uses a fraction of the memory, and returns clean structured JSON instead of HTML you have to parse.

Option A: Find the Hidden API with Scrapy

Open your browser's developer tools, switch to the Network tab, filter to XHR/Fetch, and reload the page. The requests that return JSON matching the on-screen content are the endpoints you want. Copy one as cURL, note its headers and query parameters, and reproduce it in Scrapy. Because you are talking to the API directly, pagination, filtering, and sorting usually become simple query-string changes — the same technique you would use to work through Handling Pagination and Infinite Scroll.

from collections.abc import Iterator

import scrapy


class ProductApiSpider(scrapy.Spider):
    name = "product_api"
    # Send browser-like headers and ask for JSON explicitly.
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36",
            "Accept": "application/json",
        }
    }

    def start_requests(self) -> Iterator[scrapy.Request]:
        # Pagination is just a query-string change: request pages 1 through 5.
        for page in range(1, 6):
            url = f"https://example.com/api/products?page={page}&page_size=50"
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response: scrapy.http.Response) -> Iterator[dict]:
        # The body is already structured JSON, so no HTML parsing is needed.
        payload = response.json()
        for item in payload.get("results", []):
            yield {
                "id": item["id"],
                "title": item["name"],
                "price": item["price"],
            }

This spider never renders anything. It reads JSON straight from the API the SPA uses, which makes it fast, deterministic, and easy to run at scale. Once the data lands, route it into durable storage as described in Storing and Exporting Scraped Data.
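The spider above requests a fixed five pages. When the page count is not known in advance, one option is to derive each next URL from the previous response. The sketch below uses only the standard library and assumes the same hypothetical endpoint shape as the spider: `page` and `page_size` query parameters and a `results` list in the payload. The helper names `with_query` and `next_page_url` are illustrative, and the stop rule (a page shorter than `page_size` is the last one) is an assumption that a real API may not honor.

```python
from typing import Any, Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url: str, **overrides: object) -> str:
    """Return url with the given query parameters set or replaced.

    Repeated keys in the original query are collapsed to their last value,
    which is fine for simple page/sort/filter parameters.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(params)))


def next_page_url(url: str, payload: dict[str, Any]) -> Optional[str]:
    """Return the URL of the following page, or None once a page comes back short."""
    params = dict(parse_qsl(urlsplit(url).query))
    page = int(params.get("page", 1))
    page_size = int(params.get("page_size", 50))
    # Assumption: a page with fewer than page_size results is the last one.
    if len(payload.get("results", [])) < page_size:
        return None
    return with_query(url, page=page + 1)
```

Inside a spider's parse callback, this could be used as `next_url = next_page_url(response.url, payload)`, followed by yielding a new `scrapy.Request(next_url, callback=self.parse)` whenever `next_url` is not `None`.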

Option B: The scrapy-playwright Hybrid

Sometimes the hidden API is signed with a token generated by obfuscated client JavaScript, or the content genuinely depends on browser rendering. Rather than rewrite your whole project around a browser, add scrapy-playwright so Scrapy renders only the requests that need it while everything else stays on the fast HTTP path. You keep Scrapy's scheduling, retries, and item pipelines, and pay the browser cost only where it is unavoidable.

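from collections.abc import Iterator

import scrapy
from scrapy_playwright.page import PageMethod


class SpaHybridSpider(scrapy.Spider):
    name = "spa_hybrid"
    custom_settings = {
        # Let scrapy-playwright handle downloads; it renders only requests that opt in.
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # scrapy-playwright requires the asyncio-based Twisted reactor.
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DEFAULT_REQUEST_HEADERS": {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
        },
    }

    def start_requests(self) -> Iterator[scrapy.Request]:
        yield scrapy.Request(
            "https://example.com/dashboard",
            meta={
                # Opt this request in to browser rendering.
                "playwright": True,
                # PageMethod is scrapy-playwright's documented way to queue page actions;
                # here, wait until the report rows exist before returning the response.
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "div.report-row"),
                ],
            },
            callback=self.parse,
        )

    def parse(self, response: scrapy.http.Response) -> Iterator[dict]:
        # The response body is the rendered DOM, so ordinary CSS selectors work.
        for row in response.css("div.report-row"):
            yield {
                "label": row.css("::attr(data-label)").get(),
                "value": row.css("span.value::text").get(),
            }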

Only the requests carrying meta={"playwright": True} are rendered in a browser; the rest of the crawl runs as ordinary Scrapy. That selective rendering is what keeps the hybrid affordable at volume. If the target throws browser challenges, combine this with the evasion techniques from Bypassing Cloudflare and Akamai Protections.
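One way to keep that opt-in decision in a single place is a small helper that builds the request meta from the URL. This is a minimal sketch, not part of scrapy-playwright: the function name `request_meta` and the path prefixes in `RENDERED_PREFIXES` are hypothetical, and a real project would list whichever routes it has confirmed need a browser.

```python
from urllib.parse import urlsplit

# Hypothetical path prefixes that only render correctly in a browser.
RENDERED_PREFIXES = ("/dashboard", "/reports")


def request_meta(url: str) -> dict[str, object]:
    """Return Scrapy request meta that opts in to Playwright only where needed."""
    path = urlsplit(url).path
    # Simple prefix match; "/dashboard-old" would also match "/dashboard".
    if path.startswith(RENDERED_PREFIXES):
        return {"playwright": True}
    # An empty meta leaves the request on the plain HTTP path.
    return {}
```

A spider could then write `scrapy.Request(url, meta=request_meta(url))` for every URL and let the helper decide which ones pay the browser cost.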

Edge Cases and Caveats

  • Always check for a hidden API first. Driving a browser to scrape data that is already available as clean JSON can cost 10–50x the resources for a worse result. The Network tab is your first stop, not your last resort.
  • APIs change without warning. A private endpoint can alter its schema or auth overnight. Version your parsing and alert on empty results so a silent break does not corrupt your dataset.
  • scrapy-playwright needs the asyncio reactor. You must set TWISTED_REACTOR to the asyncio selector reactor, or the handler will not initialize. This is the most common setup error.
  • Browsers do not scale linearly. Each Playwright context consumes real memory and CPU. Cap concurrency and reuse contexts; a hundred parallel browsers will exhaust a modest server.
  • Signed or short-lived tokens. If the API requires a token minted by client JavaScript, you may still need one browser render to harvest the token, then replay cheap HTTP calls with it.
  • Respect robots.txt and rate limits either way. Rendering does not make aggressive crawling acceptable. Keep DOWNLOAD_DELAY and concurrency polite regardless of which path you choose.
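The advice to alert on empty results can be sketched as a small validation step that runs before any item is yielded. The example assumes the payload shape used by the API spider earlier (a `results` list whose items carry `id`, `name`, and `price`); the `SchemaDrift` exception and `validate_payload` function are hypothetical names, and how the failure is reported (log, metric, alert) is left to the project.

```python
from typing import Any

# Fields the API spider reads from each item; adjust to the real endpoint.
REQUIRED_FIELDS = ("id", "name", "price")


class SchemaDrift(ValueError):
    """Raised when a response no longer matches the shape the parser expects."""


def validate_payload(payload: Any) -> list[dict[str, Any]]:
    """Return the result items, or raise SchemaDrift if the payload looks wrong."""
    if not isinstance(payload, dict) or "results" not in payload:
        raise SchemaDrift("response has no 'results' key")
    results = payload["results"]
    # An empty list is treated as suspicious here; a paginated crawl may
    # prefer to allow it on the final page.
    if not results:
        raise SchemaDrift("'results' is empty")
    for item in results:
        if not isinstance(item, dict):
            raise SchemaDrift(f"unexpected item type: {type(item).__name__}")
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise SchemaDrift(f"item is missing fields: {missing}")
    return results
```

Raising instead of silently yielding partial items means a schema change shows up as an error in the crawl log rather than as quietly corrupted rows in the dataset.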

Frequently Asked Questions

Do I always need Playwright to scrape a single-page app? No — usually the opposite. Most SPAs load their content from a background JSON API you can call directly with Scrapy, which is faster and more stable than rendering. Only reach for Playwright when the data depends on client-side execution you cannot replay, such as token signing or canvas rendering.

What is scrapy-playwright and when should I use it? It is a Scrapy download handler that renders selected requests in a Playwright browser while the rest of the crawl runs over plain HTTP. Use it when a minority of pages genuinely need JavaScript execution but you still want Scrapy's scheduling, retries, and item pipelines for everything else.

How do I find the hidden API an SPA uses? Open your browser's developer tools, go to the Network tab, filter to XHR or Fetch requests, and reload the page. Look for responses returning JSON that matches the on-screen content, then reproduce that request — with its headers and query parameters — in your scraper.

Is Playwright slower than Scrapy for the same site? Yes, substantially. A headless browser must download assets, execute JavaScript, and build the DOM, which costs far more time and memory than a single HTTP request for JSON. When both approaches can get the same data, the direct-API Scrapy path wins on speed, cost, and reliability.