Extracting JSON-LD and Structured Data with Python

Search engines do not guess what a page is about — they read structured data the site publishes on purpose. That same data is the scraper's shortcut. Instead of chasing CSS classes through the visible markup, you read a documented application/ld+json block, a set of microdata attributes, or the Open Graph meta tags, and get a typed object back. This guide is part of Data Extraction Patterns and Working with APIs, and it covers the three embedded formats you will meet most often, using BeautifulSoup, the standard-library json module, and the extruct library that unifies all three.

[Figure: Extracting JSON-LD from an HTML page. BeautifulSoup finds the script tag whose type is application/ld+json, and json.loads() parses its .string into a Python dict of clean typed fields such as "@type": "Product", "name": "Widget", "price": "19.99", and "ratingValue": 4.6.]
JSON-LD travels inside the HTML but parses as pure JSON — skip the DOM guesswork.

When to Use Structured Data Extraction

Reach for embedded structured data before writing a single visible-markup selector when:

  • The page is an e-commerce product, recipe, article, event, or business listing — these almost always carry JSON-LD for rich search results.
  • You need stable extraction that survives redesigns. Structured data changes far less often than presentation markup.
  • You want fields that are awkward to select visually — SKU, currency code, ISO dates, @id references — but are explicit in the structured block.
  • You are enriching records already pulled from an API and want a second, independent source of truth.

Fall back to visible-HTML parsing only when a page publishes no structured data at all, or when the block omits a field you need.

Prerequisites

Use Python 3.10 or newer. Install a parser, the HTTP client, and extruct:

pip install requests beautifulsoup4 lxml extruct

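requests fetches pages, beautifulsoup4 with the lxml parser locates the blocks, and extruct extracts several structured-data syntaxes (including JSON-LD, microdata, RDFa, and Open Graph) in one call when you want them all at once. The w3lib helper imported in step 4 is installed as an extruct dependency, so it needs no separate install.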

1. Find and Parse JSON-LD Blocks

JSON-LD lives in <script type="application/ld+json"> tags. A page may have several, and each may be a single object, an array of objects, or an object with a @graph list. Parse defensively and flatten to a list.

import json
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def extract_jsonld(url: str) -> list[dict]:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    items: list[dict] = []
    for block in soup.find_all("script", type="application/ld+json"):
        if not block.string:
            continue
        try:
            parsed = json.loads(block.string)
        except json.JSONDecodeError:
            continue                        # skip malformed blocks, keep going
        if isinstance(parsed, dict) and "@graph" in parsed:
            items.extend(parsed["@graph"])
        elif isinstance(parsed, list):
            items.extend(parsed)
        else:
            items.append(parsed)
    return items
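
The flattening step can also be written as a standalone helper that is easy to test without a network call. The following is a minimal sketch, not part of the function above: flatten_jsonld is a hypothetical name, and it additionally handles two cases the inline version skips, an array that contains @graph wrappers and stray non-object values.

```python
import json


def flatten_jsonld(parsed: object) -> list[dict]:
    """Return every dict node in a parsed JSON-LD value, unwrapping arrays and @graph."""
    if isinstance(parsed, list):
        nodes: list[dict] = []
        for entry in parsed:
            nodes.extend(flatten_jsonld(entry))   # arrays may hold @graph wrappers too
        return nodes
    if isinstance(parsed, dict):
        graph = parsed.get("@graph")
        if isinstance(graph, list):
            return flatten_jsonld(graph)          # the wrapper itself is dropped
        return [parsed]
    return []                                     # strings, numbers, null: not nodes


# Illustrative data, not taken from a real site.
sample = json.loads("""
[
  {"@context": "https://schema.org",
   "@graph": [{"@type": "Organization", "name": "Example Co"},
              {"@type": "Product", "name": "Widget"}]},
  {"@type": "BreadcrumbList"},
  "stray string"
]
""")
nodes = flatten_jsonld(sample)
print([node["@type"] for node in nodes])   # ['Organization', 'Product', 'BreadcrumbList']
```

One trade-off to be aware of: unwrapping @graph discards the wrapper's @context, which is usually fine when you only filter by @type.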

2. Filter by schema.org Type

Structured blocks are self-describing through their @type. Once you have the flat list, pick the nodes you care about — Product, Article, Recipe, Event, and so on.

def nodes_of_type(nodes: list[dict], schema_type: str) -> list[dict]:
    matches = []
    for node in nodes:
        if not isinstance(node, dict):
            continue                        # stray strings or numbers are not nodes
        node_type = node.get("@type", "")
        # @type can be a string or a list of strings
        types = node_type if isinstance(node_type, list) else [node_type]
        if schema_type in types:
            matches.append(node)
    return matches

data = extract_jsonld("https://example.com/product/widget")
products = nodes_of_type(data, "Product")

Pulling specific fields such as price, availability, and ratings out of these Product nodes is a topic of its own — see Scraping schema.org Product Data for the full field map and the nested offers/aggregateRating handling.
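
As a small taste of why the nesting needs care, here is a hedged sketch that reads a few fields from one Product node. The function names first_offer and summarize_product are hypothetical, the sample node is illustrative, and the sketch assumes the common schema.org property names offers, price, priceCurrency, aggregateRating, and ratingValue.

```python
def first_offer(product: dict) -> dict:
    """Return the first offer of a Product node, whether offers is a dict or a list."""
    offers = product.get("offers") or {}
    if isinstance(offers, list):
        offers = offers[0] if offers else {}
    return offers if isinstance(offers, dict) else {}


def summarize_product(product: dict) -> dict:
    """Project a Product node onto a flat record; missing fields become None."""
    offer = first_offer(product)
    rating = product.get("aggregateRating")
    if not isinstance(rating, dict):
        rating = {}
    return {
        "name": product.get("name"),
        "price": offer.get("price"),
        "currency": offer.get("priceCurrency"),
        "rating": rating.get("ratingValue"),
    }


# Illustrative node, shaped like the output of nodes_of_type(data, "Product").
product = {
    "@type": "Product",
    "name": "Widget",
    "offers": [{"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}],
    "aggregateRating": {"@type": "AggregateRating", "ratingValue": 4.6},
}
print(summarize_product(product))
```

Taking only the first offer is a simplification. Pages that list several offers may need a rule for choosing among them.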

3. Read Microdata and Open Graph Tags

Not every site uses JSON-LD. Older or CMS-driven pages often use microdata (itemprop/itemscope attributes) or, for social previews, Open Graph meta tags. Open Graph is trivial to read with BeautifulSoup:

def open_graph(url: str) -> dict[str, str]:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    og: dict[str, str] = {}
    for meta in soup.find_all("meta"):
        prop = meta.get("property", "")
        if prop.startswith("og:") and meta.get("content"):
            og[prop[3:]] = meta["content"]      # strip the "og:" prefix
    return og
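
Microdata takes a little more work because the values sit on the visible elements themselves. The sketch below is one possible approach, not a full microdata parser: it uses the standard-library html.parser so it runs without extra installs, it flattens every itemprop into a single dict and ignores itemscope nesting, and it assumes each property element contains only text. FlatMicrodataParser and flat_microdata are hypothetical names.

```python
from html.parser import HTMLParser


class FlatMicrodataParser(HTMLParser):
    """Collect itemprop values into one flat dict, ignoring itemscope nesting."""

    # Attributes that carry the value on elements such as meta, link, img, and time.
    VALUE_ATTRS = ("content", "href", "src", "datetime")

    def __init__(self) -> None:
        super().__init__()
        self.props: dict[str, str] = {}
        self._open_prop = None      # name of the property whose text is being collected
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = attr.get("itemprop")
        if not name:
            return
        for key in self.VALUE_ATTRS:
            if attr.get(key):
                self.props[name] = attr[key]    # value lives in an attribute
                return
        self._open_prop, self._buffer = name, []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Assumption: a property element holds only text, so the next end tag closes it.
        if self._open_prop is not None:
            self.props[self._open_prop] = "".join(self._buffer).strip()
            self._open_prop = None


def flat_microdata(html: str) -> dict[str, str]:
    parser = FlatMicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.props


# Illustrative markup, not taken from a real site.
html = (
    '<div itemscope itemtype="https://schema.org/Product">'
    '<span itemprop="name">Widget</span>'
    '<meta itemprop="sku" content="W-1">'
    '<span itemprop="price"> 19.99 </span>'
    "</div>"
)
print(flat_microdata(html))   # {'name': 'Widget', 'sku': 'W-1', 'price': '19.99'}
```

For nested items or repeated properties, a library that implements the microdata algorithm properly is the safer choice.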

4. Extract Every Format at Once with extruct

When you do not know in advance which syntax a site uses, extruct reads all of them in a single pass and hands back a dictionary keyed by format. It is the pragmatic choice for crawling many unfamiliar domains.

import extruct
import requests
from w3lib.html import get_base_url

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def extract_all(url: str) -> dict:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    base_url = get_base_url(resp.text, resp.url)
    return extruct.extract(
        resp.text,
        base_url=base_url,
        syntaxes=["json-ld", "microdata", "opengraph"],
    )

result = extract_all("https://example.com/product/widget")
print(result["json-ld"])       # list of JSON-LD nodes
print(result["microdata"])     # list of microdata items
print(result["opengraph"])     # list of Open Graph properties
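
When crawling unfamiliar domains, a quick summary of which syntaxes actually returned data helps decide which extractor to rely on per site. The following sketch relies only on the result being a dictionary of lists keyed by syntax, as described above. syntaxes_found is a hypothetical helper, and fake_result is a hand-written stand-in whose items do not reproduce extruct's real per-item layout.

```python
def syntaxes_found(result: dict) -> dict[str, int]:
    """Map each syntax key to the number of items extracted, dropping empty ones."""
    return {syntax: len(items) for syntax, items in result.items() if items}


# Hand-written stand-in for the dictionary returned by extract_all().
fake_result = {
    "json-ld": [{"@type": "Product", "name": "Widget"}],
    "microdata": [],
    "opengraph": [{"og:title": "Widget"}],
}
print(syntaxes_found(fake_result))   # {'json-ld': 1, 'opengraph': 1}
```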

Performance and Scaling Considerations

Structured-data extraction is cheap: you are already downloading and parsing the HTML, and one find_all plus a json.loads on a small <script> block costs far less than evaluating dozens of selectors against the DOM. A few things keep it fast at volume:

  • Prefer lxml as the parser. It is generally the fastest of the parsers BeautifulSoup supports for locating the script tags; the trade-offs are laid out in BeautifulSoup vs lxml: Which Parser Is Faster.
  • Skip extruct when you only need JSON-LD. A direct find_all + json.loads is lighter than running every extractor. Reserve extruct for unknown or mixed sites.
  • Cache nothing you do not need. Extract the target nodes, discard the rest of the tree, and hand clean dicts to storage rather than holding whole BeautifulSoup objects in memory across a crawl.
  • The output is already storage-ready. Because these blocks are typed dicts, they flow straight into Storing and Exporting Scraped Data with minimal cleanup.

Common Errors and Fixes

json.JSONDecodeError: Expecting value — the block contains trailing commas, HTML comments, or CDATA wrappers that are not valid JSON. Strip surrounding //<![CDATA[ markers and wrap the parse in a try/except so one broken block does not abort the page:

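import re

raw = block.string.strip()
# Drop a leading "<!--" or "//<![CDATA[" and a trailing "-->" or "//]]>" wrapper
raw = re.sub(r"^(?:<!--|(?://\s*)?<!\[CDATA\[)\s*", "", raw)
raw = re.sub(r"\s*(?:-->|(?://\s*)?\]\]>)$", "", raw)
try:
    parsed = json.loads(raw)
except json.JSONDecodeError:
    parsed = None                       # still malformed (e.g. trailing commas)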
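
The try/except above gives up on a block with trailing commas. If that loses too much data, one possible workaround is a lenient retry. This is a hedged sketch: loads_lenient is a hypothetical helper, and the regular expression is a heuristic that could also alter a string value that happens to contain a comma directly before a closing brace or bracket, which is why it runs only after the strict parse has failed.

```python
import json
import re

# A comma followed only by whitespace and a closing brace or bracket.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def loads_lenient(raw: str):
    """Parse JSON strictly first; on failure retry once with trailing commas removed."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = _TRAILING_COMMA.sub(r"\1", raw)
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None                 # broken in some other way; caller skips it


print(loads_lenient('{"@type": "Product", "tags": ["a", "b",],}'))
# {'@type': 'Product', 'tags': ['a', 'b']}
```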

TypeError: 'NoneType' object is not subscriptable means block.string is None, because the script tag holds more than one child node or is empty. Guard with if not block.string: continue, or use block.get_text() as a fallback.

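KeyError: '@type' means a node has no @type and was indexed directly with node["@type"]. A list-valued @type does not raise, but it silently fails a comparison against a string. Always read the type with .get("@type", "") and normalize a list-or-string into a list before comparing.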

Empty results on a page you can see has rich snippets — the JSON-LD is injected by JavaScript after load, so it is absent from the initial HTML. Either find the API that supplies it, or render the page first with a headless browser, then run the same extractor on the rendered HTML.
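
Before reaching for a headless browser, it is worth confirming that the tag really is missing from the raw response. The following is a minimal heuristic sketch: jsonld_in_raw_html is a hypothetical name, and the pattern assumes the type attribute is quoted, so a False result is a strong hint that the block is injected client-side rather than a guarantee.

```python
import re

# Matches an opening script tag whose quoted type attribute is application/ld+json.
_LD_JSON_TAG = re.compile(
    r"<script[^>]+type\s*=\s*[\"']application/ld\+json[\"']",
    re.IGNORECASE,
)


def jsonld_in_raw_html(html: str) -> bool:
    """True if the raw response already contains a JSON-LD script tag."""
    return bool(_LD_JSON_TAG.search(html))


# Illustrative responses: one server-rendered, one that relies on JavaScript.
static_page = '<head><script type="application/ld+json">{"@type": "Product"}</script></head>'
js_shell = '<body><div id="app"></div><script src="/bundle.js"></script></body>'

print(jsonld_in_raw_html(static_page))   # True
print(jsonld_in_raw_html(js_shell))      # False
```

In a crawl, this check could run on resp.text so that only pages returning False are queued for rendering.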

Frequently Asked Questions

What is the difference between JSON-LD, microdata, and RDFa?
All three encode schema.org structured data, but JSON-LD keeps it in a separate <script> block (cleanest to parse), while microdata and RDFa annotate the visible HTML with attributes. JSON-LD is now the format Google recommends and the one you will encounter most.

Do I need extruct, or is BeautifulSoup enough?
For JSON-LD alone, BeautifulSoup plus json.loads is enough and lighter. Use extruct when a page might use microdata or RDFa, or when you are crawling many sites and do not want to write a separate extractor for each syntax.

Why is the JSON-LD block missing from my downloaded HTML?
Some sites inject structured data with JavaScript after the initial page load, so it is not in the raw response requests receives. Confirm by searching the response text for ld+json; if it is absent, the data is added client-side and you will need to render the page or find its source API.

Can one page contain more than one JSON-LD block?
Yes. Pages routinely ship several (one for the organization, one for breadcrumbs, one for the product), and each can be an object, an array, or a @graph container. Always iterate over every <script type="application/ld+json"> and flatten before filtering by type.