Data Extraction Patterns and Working with APIs

Once you can parse rendered HTML, a harder truth sets in: the cleanest copy of the data you want is rarely in the visible markup. Modern pages hydrate themselves from JSON payloads, embed machine-readable application/ld+json blocks for search engines, and receive tidy typed objects in response to background fetch calls. Scraping the rendered markup means fighting CSS-class churn and layout changes; reading the data source directly means one stable request and a dictionary you can trust. This guide is the bridge between HTML parsing and running scrapers at scale: it sits between The Complete Guide to Python Web Scraping and Scaling & Deploying Python Web Scrapers, and it teaches you to find and read the underlying data instead of the page around it.

[Figure: One page, four data sources. Rendered HTML lives in the DOM tree and is read with BeautifulSoup (fragile selectors). JSON-LD lives in <script> tags and is read with json and extruct (stable schema). A private JSON API lives in XHR / fetch calls and is read with requests and json (clean, typed). GraphQL lives at POST /graphql and is read with a query body (exact fields).]
The same page can expose its data in four different places — each wants a different reader.

Stop Scraping the Page, Start Reading the Source

A rendered HTML element is a presentation of data, not the data itself. When a site restyles its product grid, your div.product-card__price--v2 selector breaks even though nothing about the underlying price changed. The fix is to move one layer down the stack. The same page usually carries the same values as JSON-LD, ships them in a JavaScript state blob, or fetches them from an endpoint that returns clean JSON.

The workflow you already know — fetch with a realistic User-Agent, then parse — still applies. What changes is what you parse. Compare a fragile selector against the structured alternative:

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def price_from_markup(url: str) -> str | None:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    tag = soup.select_one("span.price_color")   # breaks on any redesign
    return tag.get_text(strip=True) if tag else None

That approach is covered end to end in Parsing HTML with BeautifulSoup. It is the right tool when the data genuinely only exists as visible text. The rest of this guide is about the far more common case where it does not.
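
The JavaScript state blob mentioned above is often the easiest alternative to reach, because it sits in the HTML you already fetched. Here is a minimal sketch, assuming the page assigns its state to a variable named `window.__INITIAL_STATE__`. That name, the helper `state_from_html`, and the sample markup are illustrative assumptions; real sites use many different variable names.

```python
from __future__ import annotations

import json
import re

# Hypothetical marker: adjust the variable name to what the target page uses.
STATE_MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def state_from_html(html: str) -> dict | None:
    """Return the JSON object assigned to the state variable, or None."""
    match = STATE_MARKER.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever JavaScript follows it (a semicolon, more code).
        state, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return state if isinstance(state, dict) else None


SAMPLE = (
    '<script>window.__INITIAL_STATE__ = '
    '{"product": {"name": "Mug", "price": 12.5}};</script>'
)
print(state_from_html(SAMPLE))
```

Using `raw_decode` rather than a greedy regular expression avoids guessing where the object ends when the JSON itself contains braces or semicolons inside strings.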

Structured Data Is Already in the Page

Most commercial pages ship a block of structured data specifically so that Google, Bing, and social crawlers can read it. That block is a gift: it is a JSON object with a documented schema.org vocabulary, sitting in the HTML you already downloaded, and it changes far less often than the visible layout. Finding it is a single selector; parsing it is a single json.loads.

import json
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def read_structured_data(url: str) -> list[dict]:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")
    blocks = soup.find_all("script", type="application/ld+json")
    return [json.loads(b.string) for b in blocks if b.string]

The full treatment of these <script> blocks, plus microdata and Open Graph tags, lives in Extracting JSON-LD and Structured Data. JSON-LD is usually the quickest source to check on any e-commerce, recipe, article, or event page, because it needs no request beyond the one you already made.
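
A parsed block may be a single object, an array of objects, or a wrapper holding an `@graph` list, so it helps to normalise the blocks before searching for a type. The sketch below works on already-parsed blocks such as the output of `read_structured_data`. The helper names and the sample data are hypothetical.

```python
from __future__ import annotations

from typing import Iterator


def iter_jsonld_nodes(blocks: list) -> Iterator[dict]:
    """Yield every JSON-LD object, unwrapping arrays and @graph lists."""
    for block in blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            if not isinstance(item, dict):
                continue
            graph = item.get("@graph")
            if isinstance(graph, list):
                yield from (node for node in graph if isinstance(node, dict))
            else:
                yield item


def first_of_type(blocks: list, wanted: str) -> dict | None:
    """Return the first node whose @type equals or contains `wanted`."""
    for node in iter_jsonld_nodes(blocks):
        types = node.get("@type")
        type_list = types if isinstance(types, list) else [types]
        if wanted in type_list:
            return node
    return None


# Hypothetical parsed blocks covering the three shapes.
SAMPLE_BLOCKS = [
    {"@context": "https://schema.org", "@type": "BreadcrumbList"},
    [{"@type": "Organization", "name": "Shop"}],
    {"@graph": [{"@type": ["Product", "Thing"], "name": "Mug",
                 "offers": {"price": "12.50"}}]},
]

product = first_of_type(SAMPLE_BLOCKS, "Product")
print(product["name"] if product else "no Product node")
```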

Parsing JSON and XML the Server Hands You

When a request returns JSON or XML directly — a REST endpoint, an RSS feed, a sitemap, a data export — there is no markup to parse at all. The job becomes navigating a nested structure and pulling out the fields you need. Python's standard library reads JSON natively, and small helpers like xmltodict collapse XML into the same dict-and-list shape.

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
}

def fetch_json(url: str) -> dict:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()          # requests parses the body for you

data = fetch_json("https://api.example.com/v1/products?page=1")
for product in data.get("results", []):
    print(product["name"], product["price"])

For JSONPath queries, XML namespaces, streaming large payloads, and the exact tool for each format, see Parsing JSON and XML Responses.
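
For the XML side, here is a small sketch that reads page URLs out of a sitemap. It uses the standard library's `xml.etree.ElementTree` rather than xmltodict, so it has no extra dependency. The namespace URI is the one defined by the sitemap protocol; the sample document and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sitemap files declare this default namespace, so every tag must be
# looked up with it; a bare "url/loc" path would match nothing.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE_SITEMAP))
```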

Reverse-Engineering the Private API Behind a Page

The most valuable endpoints are the undocumented ones the site's own frontend calls. When you scroll a listing or open a product, the browser fires an XHR or fetch request that returns exactly the data the UI renders — paginated, typed, and free of markup. Replaying that request in Python is faster and sturdier than driving a browser or parsing HTML, because you are talking to the same interface the app does.

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://example.com/listing",
}

def call_hidden_api(page: int) -> dict:
    resp = requests.get(
        "https://example.com/internal/api/products",
        params={"page": page, "limit": 50},
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

The systematic method — reading the Network panel, replaying requests, and handling auth tokens — is in Reverse-Engineering Private APIs. It is the highest-leverage skill in this whole guide.
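
Replaying one request is rarely enough; listings are paginated and the endpoint has an unknown rate limit. The sketch below walks pages until one comes back empty and pauses between requests. It assumes 1-based page numbers and a `results` list in the response envelope. Both are assumptions to check against the real response, and `iter_all_items` is a hypothetical helper.

```python
import time
from typing import Callable, Iterator


def iter_all_items(
    fetch_page: Callable[[int], dict],
    items_key: str = "results",   # assumed envelope key
    delay: float = 1.0,           # seconds to wait between requests
    max_pages: int = 100,         # hard stop in case the API never runs dry
) -> Iterator[dict]:
    """Yield items page by page until an empty page is returned."""
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        items = payload.get(items_key, [])
        if not items:
            break
        yield from items
        time.sleep(delay)
```

A function such as `call_hidden_api` above can be passed directly as `fetch_page`, for example `for item in iter_all_items(call_hidden_api): ...`. Taking the fetcher as a parameter keeps the loop testable without touching the network.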

Querying GraphQL Endpoints Directly

A growing number of sites expose a single GraphQL endpoint instead of many REST routes. That looks intimidating but is often easier to scrape: you send a POST with a query describing exactly the fields you want, and the server returns exactly those and nothing else. No over-fetching, no scraping around unrelated markup.

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Content-Type": "application/json",
}

QUERY = """
query Products($first: Int!) {
  products(first: $first) {
    edges { node { name priceRange { minVariantPrice { amount } } } }
  }
}
"""

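def graphql_products(endpoint: str, first: int = 20) -> dict:
    resp = requests.post(
        endpoint,
        json={"query": QUERY, "variables": {"first": first}},
        headers=HEADERS,
        timeout=15,
    )
    resp.raise_for_status()
    payload = resp.json()
    # GraphQL servers commonly report failures in an "errors" list
    # while still answering with HTTP 200, so check for it explicitly.
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL errors: {payload['errors']}")
    return payload["data"]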

Query construction, introspection, and cursor-based pagination are covered in Scraping GraphQL Endpoints.
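
As a taste of cursor-based pagination, the sketch below follows `pageInfo` cursors until the server reports no further page. It assumes the schema follows the Relay-style connection convention (`edges`, `pageInfo`, an `after` argument); many GraphQL APIs do, but not all. `PAGED_QUERY`, `iter_products`, and the `execute` callable are hypothetical names.

```python
from __future__ import annotations

from typing import Callable, Iterator

# Hypothetical query: assumes a Relay-style "products" connection.
PAGED_QUERY = """
query Products($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { name } }
    pageInfo { hasNextPage endCursor }
  }
}
"""


def iter_products(
    execute: Callable[[dict], dict],
    first: int = 20,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield nodes from every page; `execute` maps variables to the "data" dict."""
    cursor = None
    for _ in range(max_pages):
        data = execute({"first": first, "after": cursor})
        connection = data["products"]
        for edge in connection["edges"]:
            yield edge["node"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            break
        cursor = page_info["endCursor"]
```

In a real scraper `execute` would wrap a `requests.post` call that sends `PAGED_QUERY` together with the variables and returns the `data` part of the response.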

Choosing the Right Source for a Page

Given a target page, work down this ladder and stop at the first source that has your data cleanly:

  1. A private JSON API — the cleanest, most stable option when it exists. Start by finding hidden API endpoints in network traffic.
  2. A GraphQL endpoint — nearly as clean; you control the field selection.
  3. JSON-LD structured data — no network archaeology needed; it is in the HTML you already have.
  4. Rendered HTML — the fallback for data that genuinely exists nowhere else.

The trade-off is discovery effort versus durability. Parsing HTML needs no discovery but is the most fragile; a private API takes perhaps ten minutes in the Network panel but rarely breaks. On any non-trivial project the API route pays for itself the first time the site's markup is redesigned.

def choose_strategy(has_api: bool, has_jsonld: bool) -> str:
    if has_api:
        return "replay the private/GraphQL request"
    if has_jsonld:
        return "parse the ld+json block"
    return "fall back to HTML selectors"

Storing What You Extract

Every source in this guide converges on the same output: Python dicts and lists ready to validate and persist. Because API and JSON-LD data arrives already typed and nested, the natural next step is to flatten and store it. That hand-off to durable storage — CSV, JSON Lines, SQLite, PostgreSQL, or Parquet — is covered in Storing and Exporting Scraped Data.

import csv

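def rows_to_csv(rows: list[dict], path: str) -> None:
    if not rows:
        return
    # Collect keys from every row, in first-seen order, so a record with an
    # extra field does not make DictWriter raise ValueError.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)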
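
CSV needs flat rows, while API and JSON-LD records are nested. One possible way to bridge that gap is to flatten nested dicts into dotted column names, or to skip flattening and write JSON Lines instead. Both helpers below are hypothetical sketches; lists are left as they are, which suits JSON Lines but would need further handling for CSV.

```python
import io
import json


def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Turn {"offers": {"price": 1}} into {"offers.price": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def write_jsonl(rows, fh) -> int:
    """Write one JSON document per line to an open text file; return the count."""
    count = 0
    for row in rows:
        fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


record = {"name": "Mug", "offers": {"price": "12.50", "currency": "EUR"}}
print(flatten(record))

buffer = io.StringIO()
write_jsonl([record], buffer)
print(buffer.getvalue(), end="")
```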

Common Pitfalls

  • Rendering the page when the JSON was right there. Reaching for Selenium or Playwright before checking the Network panel wastes CPU and time. Look for a JSON source first; render only when there truly is none.
  • Dropping headers on API calls. Private endpoints often gate on Accept, Referer, X-Requested-With, or a token header. Copy the request the browser actually sent, headers and all, then trim.
  • Assuming JSON-LD is always one object. A page can contain several ld+json blocks, and each may be a single object or an array (or a @graph list). Iterate defensively.
  • Ignoring pagination shape. REST offsets, cursor tokens, and GraphQL edges/pageInfo cursors all page differently. Read the response envelope before looping.
  • Hammering an undocumented endpoint. A private API has no published rate limit, which means it has an unknown one. Throttle and back off exactly as you would for HTML requests.
  • Trusting types blindly. JSON-LD often stringifies numbers ("price": "19.99"). Cast and validate at the boundary before doing arithmetic.
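
The last pitfall can be sketched as a small boundary function that converts whatever the source provides into a `Decimal` or rejects it. The name `to_price` is hypothetical, and the sketch deliberately does not try to interpret thousands separators or currency symbols; such values come back as `None` so they can be handled explicitly.

```python
from __future__ import annotations

from decimal import Decimal, InvalidOperation


def to_price(value: object) -> Decimal | None:
    """Cast a JSON value to Decimal; return None if it is not a plain number."""
    if value is None:
        return None
    try:
        # str() first, so a float such as 19.99 keeps its short decimal form.
        number = Decimal(str(value).strip())
    except InvalidOperation:
        return None
    # Decimal accepts "NaN" and "Infinity"; neither is a usable price.
    return number if number.is_finite() else None


print(to_price("19.99"), to_price(19.99), to_price("N/A"))
```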

Frequently Asked Questions

How do I know whether a site has a private API to scrape? Open the browser DevTools Network panel, filter to Fetch/XHR, and interact with the page — scroll, paginate, open a detail view. Any request that returns JSON matching what you see on screen is a candidate endpoint you can replay directly in Python.
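
If the Network panel is busy, one option is to export the session as a HAR file from DevTools and filter it offline. The sketch below lists the requests whose responses were JSON. It relies on the standard HAR layout (`log.entries[].request` and `response.content.mimeType`); the function name and the sample data are hypothetical.

```python
def json_candidates(har: dict) -> list[tuple[str, str]]:
    """Return (method, url) for every HAR entry whose response was JSON."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType", "")
        if "json" in mime.lower():
            request = entry.get("request", {})
            found.append((request.get("method", ""), request.get("url", "")))
    return found


# Hypothetical, heavily trimmed HAR structure.
SAMPLE_HAR = {"log": {"entries": [
    {"request": {"method": "GET", "url": "https://example.com/listing"},
     "response": {"content": {"mimeType": "text/html"}}},
    {"request": {"method": "GET", "url": "https://example.com/internal/api/products?page=1"},
     "response": {"content": {"mimeType": "application/json; charset=utf-8"}}},
]}}

for method, url in json_candidates(SAMPLE_HAR):
    print(method, url)
```

With a real export, load the file first with `json.load` and pass the resulting dict in.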

Is calling a site's private API legal? Reading a publicly reachable endpoint is technically no different from loading the page that calls it, but the same rules apply as for any scraping: respect the terms of service, do not bypass authentication you were not granted, honor robots.txt, and throttle politely. Treat undocumented endpoints as a convenience, not a license.

Why prefer JSON-LD over parsing the visible HTML? JSON-LD is published as machine-readable structured data with a documented schema.org vocabulary, so it is both cleaner and far more stable than presentation markup. A site can redesign its entire product grid without touching the ld+json block that feeds search engines.

Do I still need BeautifulSoup if I am reading APIs? Often yes — to locate the <script type="application/ld+json"> blocks inside a page, and as a fallback for values that only exist in visible markup. For pure JSON or XML endpoints you can skip it entirely and parse the response body directly.

What is the difference between a REST API and a GraphQL endpoint for scraping? A REST API exposes many URLs, each returning a fixed shape; you page through them with query parameters. A GraphQL endpoint is a single URL you POST queries to, choosing exactly which fields come back. GraphQL avoids over-fetching but requires you to write the query and handle cursor pagination.

Can I mix these techniques in one scraper? Yes, and mature scrapers usually do. A common pattern is to page a listing through a private JSON API, then enrich each item with JSON-LD pulled from its detail page, and finally store the merged records — each source used where it is cleanest.
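
The merge step of that pattern might look like the sketch below. It assumes each listing item carries a `url` field pointing at its detail page and that `fetch_detail` returns a dict of extra fields (for example a JSON-LD Product node) or `None`. All names here are hypothetical.

```python
from typing import Callable, Iterable, Optional


def enrich(
    listing_items: Iterable[dict],
    fetch_detail: Callable[[str], Optional[dict]],
    url_key: str = "url",   # assumed field holding the detail-page URL
) -> list[dict]:
    """Merge detail-page fields into each listing item."""
    merged = []
    for item in listing_items:
        detail = fetch_detail(item[url_key]) or {}
        # Listing fields win on conflict; detail data only fills the gaps.
        merged.append({**detail, **item})
    return merged
```

Letting the listing win on conflicts is a design choice made for this sketch; reverse the order of the two dicts if the detail page is the more trustworthy source.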