Scraping schema.org Product Data from JSON-LD
This walkthrough zooms in on one high-value target from Extracting JSON-LD and Structured Data: pulling price, availability, and ratings out of Product nodes across e-commerce pages.
Most online stores publish a schema.org Product block so their listings qualify for rich search results, and that block is usually the most reliable place to read price and stock from. The price and availability live nested inside an offers object, ratings live inside aggregateRating, and numbers are frequently stored as strings. Handle those three shapes correctly and a single request yields a clean product record that survives most page redesigns, for any retailer that ships the block in its raw HTML.
The Shape of a Product Node
A typical Product node looks like this once parsed into a Python dict:
{
    "@type": "Product",
    "name": "Aeron Chair",
    "sku": "AER-001",
    "brand": {"@type": "Brand", "name": "Herman Miller"},
    "offers": {
        "@type": "Offer",
        "price": "1395.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.7",
        "reviewCount": "212"
    }
}
Three things trip people up. First, price is nested: it is node["offers"]["price"], not node["price"]. Second, offers can be a single object or a list of offers (multiple sellers or variants). Third, availability is a URL, not a boolean, such as https://schema.org/InStock, https://schema.org/OutOfStock, or https://schema.org/PreOrder. A robust parser normalizes all three.
A Robust Product Parser
The function below fetches a page, isolates the Product node, and flattens the nested offers and aggregateRating into a single typed record. It casts stringified numbers and reduces the availability URL to a short status.
import json

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}


def _first_offer(offers: dict | list) -> dict:
    """offers may be one Offer object or a list of them; take the first."""
    if isinstance(offers, list):
        return offers[0] if offers else {}
    return offers or {}


def _to_float(value: object) -> float | None:
    try:
        return float(str(value))
    except (TypeError, ValueError):
        return None


def _to_int(value: object) -> int | None:
    """Counts such as reviewCount should come back as whole numbers."""
    try:
        return int(float(str(value)))
    except (TypeError, ValueError, OverflowError):
        return None


def parse_product(url: str) -> dict | None:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    product = None
    for block in soup.find_all("script", type="application/ld+json"):
        if not block.string:
            continue
        try:
            parsed = json.loads(block.string)
        except json.JSONDecodeError:
            continue
        # A block may hold one node, a list of nodes, or an @graph wrapper.
        nodes = parsed.get("@graph", [parsed]) if isinstance(parsed, dict) else parsed
        for node in nodes if isinstance(nodes, list) else [nodes]:
            if not isinstance(node, dict):
                continue  # skip stray strings or numbers inside a list
            types = node.get("@type", "")
            types = types if isinstance(types, list) else [types]
            if "Product" in types:
                product = node
                break
        if product:
            break
    if not product:
        return None

    offer = _first_offer(product.get("offers", {}))
    rating = product.get("aggregateRating", {}) or {}
    availability = str(offer.get("availability", "")).rsplit("/", 1)[-1]  # -> "InStock"
    return {
        "name": product.get("name"),
        "sku": product.get("sku"),
        "price": _to_float(offer.get("price")),
        "currency": offer.get("priceCurrency"),
        "in_stock": availability == "InStock",
        "availability": availability or None,
        "rating": _to_float(rating.get("ratingValue")),
        "review_count": _to_int(rating.get("reviewCount")),
    }


if __name__ == "__main__":
    record = parse_product("https://example.com/product/aeron-chair")
    print(record)
The output is a flat dict (price is a float, in_stock is a bool, rating is a float), ready to validate and write to disk via Storing and Exporting Scraped Data.
Scaling Across Many Product Pages
Because every retailer that uses schema.org shares the same field names, one parser works across sites with only small tweaks. Loop a list of product URLs, collect the dicts, and hand the batch to storage. If you gather listings from a JSON API rather than crawling category pages, pair this with Parsing JSON and XML Responses to get the URLs first, then enrich each with its JSON-LD.
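The loop described here could look roughly like the sketch below. It assumes a parser with the same signature as parse_product (URL in, dict or None out) and takes it as an argument so the loop can be tried without network access. The names scrape_products and delay_seconds are illustrative, not part of any library.

```python
import time
from typing import Callable, Iterable, Optional


def scrape_products(
    urls: Iterable[str],
    parser: Callable[[str], Optional[dict]],
    delay_seconds: float = 1.0,  # hypothetical politeness delay between requests
) -> tuple[list[dict], list[tuple[str, str]]]:
    """Run parser over each URL and return (records, failures)."""
    records: list[dict] = []
    failures: list[tuple[str, str]] = []
    for index, url in enumerate(urls):
        if index and delay_seconds > 0:
            time.sleep(delay_seconds)  # pause before every request but the first
        try:
            record = parser(url)
        except Exception as exc:  # one bad page should not stop the batch
            failures.append((url, repr(exc)))
            continue
        if record is None:
            failures.append((url, "no Product node found"))
            continue
        # Keep the source URL with each record so rows stay traceable.
        records.append({"url": url, **record})
    return records, failures
```

Calling scrape_products(url_list, parse_product) returns the successful records plus a list of (url, reason) pairs, so failed pages can be retried or inspected separately instead of aborting the whole run.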
Edge Cases and Caveats
- Multiple offers. Marketplaces list several sellers under offers as a list. Decide whether you want the lowest price (min over the list) or the featured one (first). The helper above takes the first; adapt as needed.
- AggregateOffer instead of Offer. Some pages use @type: AggregateOffer with lowPrice/highPrice fields rather than a single price. Check for those keys before falling back to price.
- Missing aggregateRating. New or unreviewed products omit ratings entirely. Always use .get(...) with defaults so a missing block returns None rather than raising.
- Stringified and localized numbers. Prices may arrive as "1,395.00" or "1.395,00" depending on locale. Strip thousands separators before casting, and never assume a . is the decimal point for non-US stores.
- Currency mismatches. priceCurrency can differ from the page's displayed currency if the site geolocates. Always store the currency alongside the number rather than assuming one.
- JavaScript-injected blocks. If the Product JSON-LD is absent from the raw HTML, it is rendered client-side; render the page first or read the product API directly.
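The first two cases can be folded into one helper. The sketch below is one possible approach, not the only one: pick_price and its prefer_lowest flag are hypothetical names, and it assumes every offer in a list shares one currency, since taking a minimum across mixed currencies would be meaningless.

```python
from typing import Optional


def _as_float(value: object) -> Optional[float]:
    try:
        return float(str(value))
    except (TypeError, ValueError):
        return None


def pick_price(offers: object, prefer_lowest: bool = True) -> tuple:
    """Return (price, currency) from an Offer, a list of offers, or an AggregateOffer."""
    candidates = offers if isinstance(offers, list) else [offers]
    priced = []
    for offer in candidates:
        if not isinstance(offer, dict):
            continue
        # AggregateOffer carries lowPrice/highPrice; check it before price.
        raw = offer.get("lowPrice")
        if raw is None:
            raw = offer.get("price")
        price = _as_float(raw)
        if price is not None:
            priced.append((price, offer.get("priceCurrency")))
    if not priced:
        return (None, None)
    if prefer_lowest:
        # Assumes all candidates share a currency (see lead-in).
        return min(priced, key=lambda pair: pair[0])
    return priced[0]  # the first listed offer, often the featured one
```

Passing prefer_lowest=False reproduces the take-the-first behavior, while the default picks the cheapest offer that has a parseable price.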
Frequently Asked Questions
Where exactly is the price in a schema.org Product?
Inside the offers object, as product["offers"]["price"]. It is usually a string, so cast it to a float, and remember offers can be a list of multiple offers rather than a single object.
How do I tell if a product is in stock from JSON-LD?
Read offers.availability, which is a schema.org URL such as https://schema.org/InStock or https://schema.org/OutOfStock. Take the segment after the last slash and compare it to "InStock".
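As a small sketch of that comparison: the helper below reduces the value to its last token and also tolerates http:// URLs, a bare "InStock", and a prefixed form such as "schema:InStock", which some pages may emit. The function names and the BUYABLE set are illustrative assumptions; which statuses count as purchasable is a decision for your own project.

```python
from typing import Optional

# Assumption: an illustrative choice of statuses treated as purchasable.
BUYABLE = {"InStock", "LimitedAvailability", "OnlineOnly"}


def availability_status(value: object) -> Optional[str]:
    """Reduce a schema.org availability value to its final token, e.g. "InStock"."""
    if not isinstance(value, str) or not value.strip():
        return None
    token = value.strip().rstrip("/")
    # Peel off everything up to the last "/", "#", or ":" in turn.
    for separator in ("/", "#", ":"):
        token = token.rsplit(separator, 1)[-1]
    return token or None


def is_buyable(value: object) -> bool:
    return availability_status(value) in BUYABLE
```

Keeping the raw status next to the boolean is usually worthwhile, because PreOrder and OutOfStock both map to "not in stock" yet mean different things downstream.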
Why are the ratings missing on some pages?
Products with no reviews omit the aggregateRating block entirely, so a direct product["aggregateRating"] lookup raises a KeyError. Use .get("aggregateRating", {}) and default the values to None rather than assuming the block is always present.
Can I use this same parser on different e-commerce sites?
Largely yes: schema.org field names are standardized, so name, offers.price, and aggregateRating.ratingValue mean the same thing everywhere. You mainly adjust for whether offers is a list, whether the type is AggregateOffer, and for locale-specific number formatting.
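For the number-formatting adjustment, one heuristic is sketched below. It assumes the last "." or "," is the decimal separator only when one or two digits follow it; anything else is treated as a thousands separator. That assumption breaks for prices with three decimal places, so treat parse_localized_number (a hypothetical name) as a starting point rather than a general solution.

```python
import re
from typing import Optional


def parse_localized_number(text: object) -> Optional[float]:
    """Parse "1,395.00", "1.395,00", or "1395.00" into a float (heuristic)."""
    if isinstance(text, bool):
        return None
    if isinstance(text, (int, float)):
        return float(text)
    if not isinstance(text, str):
        return None
    # Drop currency symbols, spaces, and letters; keep digits, separators, sign.
    cleaned = re.sub(r"[^0-9.,-]", "", text)
    negative = cleaned.startswith("-")
    cleaned = cleaned.replace("-", "")
    position = max(cleaned.rfind("."), cleaned.rfind(","))
    # Heuristic: 1 or 2 trailing digits mean the last separator is the decimal mark.
    if position != -1 and len(cleaned) - position - 1 in (1, 2):
        whole, fraction = cleaned[:position], cleaned[position + 1:]
    else:
        whole, fraction = cleaned, ""
    digits = re.sub(r"[.,]", "", whole)
    if not digits and not fraction:
        return None
    value = float(f"{digits or '0'}.{fraction or '0'}")
    return -value if negative else value
```

Under this rule "1.395" is read as 1395, not 1.395, which suits most retail prices; if you know a store's locale, an explicit per-site setting is safer than guessing.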