Parsing JSON and XML API Responses in Python
When an endpoint hands you JSON or XML directly, there is no HTML to fight — the work shifts to navigating a nested structure and pulling out exactly the fields you need. This guide is part of Data Extraction Patterns and Working with APIs, and it covers the everyday tools for both formats: the standard-library json module and JSONPath for JSON, and xmltodict and lxml for XML feeds, sitemaps, and legacy SOAP-style responses.
When to Use Each Tool
Pick the parser by the format and the depth of the query you need:
- json (standard library): the default for any JSON body. requests calls it for you via response.json(). Reach past it only when navigation gets awkward.
- JSONPath (jsonpath-ng): when you need to pull values from deeply nested or variable JSON with a query expression instead of a chain of ["key"][0]["key"] lookups.
- xmltodict: the quickest way to turn XML into ordinary Python dicts and lists. Ideal for RSS/Atom feeds, sitemaps, and simple XML APIs.
- lxml: when XML is large, uses namespaces, or needs real XPath queries and streaming. It is also one of the fastest XML parsers available for Python.
Prerequisites
Use Python 3.10 or newer and install the HTTP client plus the three optional parsing libraries:
pip install requests xmltodict lxml jsonpath-ng
The json module ships with Python, so JSON parsing needs no install. A refresher on how these responses arrive over the wire is in Understanding HTTP Requests and Responses.
1. Parse a JSON Response
For well-behaved JSON APIs, response.json() is all you need. Send an explicit Accept header, check the status, then walk the parsed dict.
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/json",
}

def fetch_json(url: str, params: dict | None = None) -> dict:
    resp = requests.get(url, headers=HEADERS, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()

payload = fetch_json("https://api.example.com/v2/products", {"page": 1})
for item in payload.get("results", []):
    print(item["id"], item["name"])
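The loop above indexes item["id"] directly, which raises KeyError as soon as one record omits a field. A minimal sketch of a more defensive walk follows. The sample payload and the nested pricing.amount field are hypothetical, chosen only to illustrate the pattern.

```python
from typing import Optional

# Hypothetical payload: the "pricing" object and its "amount" field are
# illustrative assumptions, not part of any real API.
payload = {
    "results": [
        {"id": 1, "name": "Lamp", "pricing": {"amount": 19.5}},
        {"id": 2, "name": "Desk"},                    # no "pricing" key
        {"id": 3, "name": "Chair", "pricing": None},  # explicit JSON null
    ]
}


def price_of(item: dict) -> Optional[float]:
    """Return the nested amount, or None when any level is missing or null."""
    # `or {}` also covers an explicit null, which .get("pricing", {}) would
    # return as None and then fail on the second .get().
    pricing = item.get("pricing") or {}
    return pricing.get("amount")


rows = [
    (item.get("id"), item.get("name"), price_of(item))
    for item in payload.get("results", [])
]
```

Records with a missing or null pricing object come back with None in the last column instead of aborting the loop, so the caller decides whether to skip or log them.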
2. Query Deep JSON with JSONPath
When the value you want is buried several levels down, or its position varies, a JSONPath expression is far more readable than a long chain of subscripts and defends against missing keys.
from jsonpath_ng.ext import parse

# Compile the expression once at import time and reuse it for every payload.
PRICE_EXPR = parse("$.results[*].variants[*].price")

def extract_prices(payload: dict) -> list[float]:
    # $.results[*].variants[*].price walks every result and every variant
    return [match.value for match in PRICE_EXPR.find(payload)]

prices = extract_prices(payload)
Once you have flat values like these, turning a nested payload into rows for analysis is its own task — see Flattening Nested JSON with pandas.
3. Parse XML with xmltodict
xmltodict collapses XML into the same dict-and-list shape as JSON, so the rest of your pipeline treats both formats identically. This is perfect for RSS feeds and sitemaps.
import requests
import xmltodict

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "application/xml",
}

def parse_rss(url: str) -> list[dict]:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    doc = xmltodict.parse(resp.content)  # bytes, so the XML encoding declaration is honored
    items = doc["rss"]["channel"].get("item", [])  # a feed with no items has no "item" key
    items = items if isinstance(items, list) else [items]  # single item edge case
    return [{"title": i.get("title"), "link": i.get("link")} for i in items]
Note the single-item guard: XML with one child element parses to a dict, not a one-element list.
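The guard can be factored into a small helper so every repeated element is normalized the same way. This is a sketch only: as_list is a hypothetical name, and the sample dicts are hand-written to mirror the shapes xmltodict is assumed to produce for zero, one, and many item children.

```python
from typing import Any


def as_list(value: Any) -> list:
    """Normalize a value that may be missing, a single node, or a list of nodes."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Hand-written stand-ins for parsed <channel> elements (assumed shapes).
empty_channel = {"title": "Feed"}                                   # no <item>
one_item = {"title": "Feed", "item": {"title": "A"}}                # one <item>
many_items = {"title": "Feed", "item": [{"title": "A"}, {"title": "B"}]}

titles = [
    [item.get("title") for item in as_list(channel.get("item"))]
    for channel in (empty_channel, one_item, many_items)
]
```

xmltodict.parse also accepts a force_list argument, for example force_list=("item",), which makes the named elements parse as lists even when only one is present. A helper like the one above is still useful for the case where the element is absent altogether.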
4. Parse Large or Namespaced XML with lxml
For big documents, XPath queries, or XML that declares namespaces, lxml is the right tool. It parses quickly and lets you target elements precisely.
import requests
from lxml import etree

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def parse_sitemap(url: str) -> list[str]:
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    root = etree.fromstring(resp.content)
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

urls = parse_sitemap("https://example.com/sitemap.xml")
The ns dictionary maps a short prefix to the namespace URI so the path .//sm:loc matches the namespaced <loc> elements. People often forget this step and query .//loc instead, which silently returns nothing.
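The difference between a qualified and an unqualified query is easy to check offline. The sketch below parses a tiny inline sitemap (invented for the example) and runs three lookups. It falls back to the standard library's ElementTree, whose findall behaves the same way for these paths, so the snippet still runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Invented two-URL sitemap; note the default namespace on <urlset>.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

root = etree.fromstring(SITEMAP)
ns = {"sm": SITEMAP_NS}

# Unqualified name: looks for <loc> in no namespace, so nothing matches.
unqualified = root.findall(".//loc")

# Prefix plus mapping: matches the namespaced elements.
qualified = [el.text for el in root.findall(".//sm:loc", ns)]

# Same query with the namespace URI written inline in braces (Clark notation).
clark = [el.text for el in root.findall(".//{%s}loc" % SITEMAP_NS)]
```

The prefix sm is arbitrary. Only the URI has to match the one declared in the document.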
Performance and Scaling Considerations
- Stream, don't slurp, huge feeds. For XML measured in hundreds of megabytes, use lxml.etree.iterparse and clear each element after processing so memory stays flat instead of loading the whole tree.
- response.json() beats a manual json.loads(response.text) by skipping a redundant decode step; let requests do it.
- Compile JSONPath expressions once. parse(...) builds an expression object; reuse it across records in a loop rather than re-parsing the string each iteration.
- Prefer response.content over response.text for XML. Passing bytes lets the parser honor the document's own encoding declaration instead of guessing.
- Feed the output straight to storage. Parsed dicts are storage-ready; persist them incrementally as described in Storing and Exporting Scraped Data rather than buffering everything in memory.
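The streaming advice in the first bullet can be sketched as a generator. iter_locs is a hypothetical helper name, and the inline sitemap stands in for a large file. The standard-library fallback is there only so the sketch runs without lxml; its iterparse accepts the same arguments used here.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_locs(source):
    """Yield each <loc> URL from a sitemap without building the full tree."""
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == NS + "url":
            loc = elem.find(NS + "loc")
            if loc is not None and loc.text:
                yield loc.text.strip()
            elem.clear()  # drop the children and text of the processed <url>


# Small invented document standing in for a multi-hundred-megabyte file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

streamed = list(iter_locs(io.BytesIO(SAMPLE)))
```

In real use, source would be an open file or a path rather than a BytesIO. Cleared elements remain attached to the root as empty stubs, so for extremely large inputs it may also be worth removing processed siblings from the parent.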
Common Errors and Fixes
json.decoder.JSONDecodeError: Expecting value — the response was not JSON at all, usually an HTML error or login page returned with a 200. Check response.headers["Content-Type"] before parsing, and inspect response.text[:200] when it fails.
KeyError deep in a nested payload — a key is absent for some records. Replace chained subscripts with .get(...) defaults or a JSONPath query. .get returns its default (None unless you supply one) and a JSONPath query returns an empty list, so neither raises.
lxml.etree.XMLSyntaxError — the document is not well-formed. For messy XML, parse with etree.XMLParser(recover=True) to skip broken nodes. A related failure, ValueError: Unicode strings with encoding declaration are not supported, means you passed a str that contains an encoding declaration; pass resp.content (bytes) instead.
Namespaced XPath returns an empty list — you queried //loc on a document that namespaces its elements. Register the namespace in a prefix map and query //sm:loc as shown above.
requests.exceptions.JSONDecodeError on an empty body — some endpoints answer 204 No Content. Guard with if resp.status_code == 204 or not resp.content: return {} before calling .json().
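Several of the fixes above can be combined into one guard placed in front of .json(). The sketch below assumes an object that behaves like a requests.Response (status_code, headers, content, text, json()). safe_json is a hypothetical helper, and the SimpleNamespace objects are stand-ins so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Any


def safe_json(resp: Any) -> Any:
    """Parse a response body as JSON, with guards for common non-JSON replies."""
    # 204 No Content, or any empty body: nothing to parse.
    if resp.status_code == 204 or not resp.content:
        return {}
    content_type = resp.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        # Often an HTML error or login page served with a 200.
        raise ValueError(
            f"Expected JSON, got {content_type!r}: {resp.text[:200]!r}"
        )
    return resp.json()


# Stand-in responses for demonstration only (not real requests objects).
no_content = SimpleNamespace(status_code=204, headers={}, content=b"", text="")
html_page = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "text/html; charset=utf-8"},
    content=b"<html>Login</html>",
    text="<html>Login</html>",
)
json_ok = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "application/json"},
    content=b'{"results": []}',
    text='{"results": []}',
    json=lambda: {"results": []},
)

try:
    safe_json(html_page)
except ValueError as exc:
    html_error = str(exc)
```

Matching on the substring json rather than the exact string application/json also accepts vendor types such as application/problem+json. Whether to return {} or raise on an empty body is a design choice that depends on the caller.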
Frequently Asked Questions
Should I use response.json() or the json module directly?
Use response.json() — it decodes the body and parses it in one step with the correct encoding. Fall back to json.loads() only when the JSON arrives as a string from somewhere other than a requests response.
When is xmltodict better than lxml?
xmltodict is best for small-to-medium XML you want as plain dicts with no XPath, such as feeds, sitemaps, and simple APIs. Switch to lxml for large documents, namespaces, real XPath queries, or streaming with iterparse.
How do I handle XML namespaces in XPath?
Declare a prefix-to-URI mapping and use that prefix in the query, e.g. root.findall(".//sm:loc", {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}). An unqualified query such as .//loc matches nothing even though the elements are visibly present, and using a prefix without supplying the mapping raises an error instead of returning results.
Why does my JSON parse fail even though the request succeeded?
A 200 status does not guarantee a JSON body — anti-bot pages, redirects, and error pages often return HTML with a success code. Verify the Content-Type header and peek at the first bytes of response.text before parsing.
What is JSONPath and do I need it?
JSONPath is a query language for JSON, like XPath for XML. You do not strictly need it, but for deeply nested or irregular payloads a single expression such as $.results[*].variants[*].price is far clearer and safer than nested loops and subscripts.