Scraping GraphQL Endpoints with Python
A growing number of sites replace scattered REST endpoints with a single GraphQL API: one URL, usually /graphql, that answers precise queries describing exactly the fields you want. For a scraper this is a gift — you ask for the data you need and nothing else, in a predictable JSON shape. This guide is part of Data Extraction Patterns & APIs, and it covers the full workflow: discovering the endpoint, running introspection to learn the schema, building queries with variables, POSTing them with httpx, and passing the authentication headers that protected APIs require.
When to Use This Approach
GraphQL scraping is the right tool when the signals below appear:
- You see a request to /graphql. In the Network tab, data arrives via a POST to a single endpoint whose payload is a query string. Finding it works exactly like any other hidden call; see Reverse-Engineering Private APIs in Python.
- One request returns nested data. GraphQL responses often contain deeply nested objects (a user with their posts and each post's comments) in a single call.
- You want to minimize requests. Because you select fields precisely, you can fetch a whole screen of data in one round trip instead of several REST calls.
- The schema is stable. GraphQL types change more deliberately than page markup, so queries tend to survive redesigns.
If the site exposes plain JSON REST endpoints instead, that simpler path is covered separately. When you do get GraphQL data back, reshape the nested result using the techniques in Parsing JSON and XML Responses.
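To check the first signal programmatically, the sketch below tests whether a request body copied from the Network tab has the usual GraphQL shape. The helper name looks_like_graphql is hypothetical, and the check is a heuristic: requests that use persisted queries omit the query field and would not match.

```python
import json


def looks_like_graphql(body: str) -> bool:
    """Return True if a captured request body looks like a GraphQL POST payload."""
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return False
    # Some frontends batch several operations into one JSON array.
    operations = parsed if isinstance(parsed, list) else [parsed]
    if not operations:
        return False
    return all(
        isinstance(op, dict) and isinstance(op.get("query"), str)
        for op in operations
    )


if __name__ == "__main__":
    print(looks_like_graphql('{"query": "{ __typename }"}'))
    print(looks_like_graphql('{"page": 2, "limit": 50}'))
```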
Prerequisites
You need Python 3.10 or newer and an HTTP client that speaks JSON cleanly. httpx is used throughout.
python -m pip install "httpx[http2]>=0.27"
You do not need a dedicated GraphQL library — a query is just a string you POST as JSON. Keeping it to raw httpx makes the mechanics obvious and avoids hiding the request behind an abstraction.
1. Confirm the Endpoint and Method
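In practice, nearly every GraphQL call is a POST to one URL with a JSON body containing a query field (and optionally variables and operationName). Some servers also accept GET for read-only queries, so check which method the site's own frontend uses. Copy a real request from the Network tab to confirm the URL and any headers, then reproduce the smallest possible query.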
import httpx

GRAPHQL_URL = "https://api.example.com/graphql"


def run_query(query: str, variables: dict | None = None) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    payload: dict = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    with httpx.Client(http2=True, timeout=20) as client:
        response = client.post(GRAPHQL_URL, json=payload, headers=headers)
        response.raise_for_status()
        return response.json()


if __name__ == "__main__":
    result = run_query("{ __typename }")
    print(result)
A response of { "data": { "__typename": "Query" } } confirms the endpoint is live and accepting queries. The root type name can differ from Query on some schemas.
2. Run Introspection to Learn the Schema
Unlike REST, GraphQL can describe itself. An introspection query returns every type, field, and argument the API exposes, so you can build valid queries without guessing. Many production servers disable introspection, but when it is available it is the fastest way to map the schema.
INTROSPECTION_QUERY = """
query IntrospectionQuery {
  __schema {
    queryType { name }
    types {
      name
      kind
      fields { name type { name kind ofType { name kind } } }
    }
  }
}
"""


def list_types() -> list[str]:
    result = run_query(INTROSPECTION_QUERY)
    schema = result["data"]["__schema"]
    return [t["name"] for t in schema["types"] if not t["name"].startswith("__")]


if __name__ == "__main__":
    for name in list_types():
        print(name)
If introspection is disabled you will get an error like GraphQL introspection is not allowed. In that case, fall back to reading the query strings the site's own frontend sends — every field it requests is one you know exists.
3. Build a Query with Variables
Never interpolate values straight into a query string — GraphQL has first-class variables for that. Declare them in the operation signature and pass them in a separate variables object. This keeps queries reusable and avoids quoting bugs.
PRODUCT_QUERY = """
query GetProduct($id: ID!) {
  product(id: $id) {
    id
    name
    price
    inStock
    category { name }
  }
}
"""


def get_product(product_id: str) -> dict:
    result = run_query(PRODUCT_QUERY, variables={"id": product_id})
    if "errors" in result:
        raise RuntimeError(result["errors"])
    return result["data"]["product"]


if __name__ == "__main__":
    product = get_product("SKU-1024")
    print(f"{product['name']}: {product['price']}")
Request only the fields you actually use. A lean selection set is faster for the server and less likely to trip rate limits.
4. Handle Authentication
Protected GraphQL APIs authenticate the same way as REST ones: a bearer token in an Authorization header, or a session cookie set at login. For cookie-based sessions, sign in once and reuse the client so cookies carry over — the pattern is detailed in Managing Cookies and Sessions.
import httpx

GRAPHQL_URL = "https://api.example.com/graphql"


def run_authenticated(query: str, token: str, variables: dict | None = None) -> dict:
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    payload: dict = {"query": query}
    if variables is not None:
        payload["variables"] = variables
    with httpx.Client(http2=True, timeout=20) as client:
        response = client.post(GRAPHQL_URL, json=payload, headers=headers)
        response.raise_for_status()
        return response.json()
If the API rejects your token with an UNAUTHENTICATED error in the errors array even though the HTTP status is 200, the token is missing or expired — GraphQL commonly returns application errors with a 200 status, so always inspect the body.
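One way to make that body inspection routine is to route every response through a small checker. The sketch below is a minimal version with hypothetical names (GraphQLError, unwrap_data). It assumes the server reports machine-readable codes under extensions.code, which is a common Apollo-style convention and not something every API follows.

```python
class GraphQLError(RuntimeError):
    """Raised when a response body carries a non-empty errors array."""

    def __init__(self, errors: list[dict]):
        self.errors = errors
        super().__init__("; ".join(e.get("message", "unknown error") for e in errors))

    @property
    def codes(self) -> set[str]:
        # Assumption: codes live under extensions.code (not required by the spec).
        return {
            e["extensions"]["code"]
            for e in self.errors
            if isinstance(e.get("extensions"), dict) and "code" in e["extensions"]
        }

    @property
    def is_auth_failure(self) -> bool:
        return bool(self.codes & {"UNAUTHENTICATED", "FORBIDDEN"})


def unwrap_data(result: dict) -> dict:
    """Return result["data"], raising GraphQLError if the body reports errors."""
    errors = result.get("errors")
    if errors:
        raise GraphQLError(errors)
    return result["data"]


if __name__ == "__main__":
    body = {
        "data": None,
        "errors": [{"message": "Token expired",
                    "extensions": {"code": "UNAUTHENTICATED"}}],
    }
    try:
        unwrap_data(body)
    except GraphQLError as exc:
        print("auth failure:", exc.is_auth_failure, "| message:", exc)
```

This version is deliberately strict: a response that carries both partial data and errors is treated as a failure. Relax that if partial results are useful to you.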
5. Fetch Lists and Paginate
Listing queries return connections of many records, and those are almost always paginated with cursors. The full loop — reading pageInfo, following endCursor, and stopping on hasNextPage — is covered in depth in Handling GraphQL Pagination and Cursors. The single-page shape looks like this:
LIST_QUERY = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id name price } }
    pageInfo { endCursor hasNextPage }
  }
}
"""


def first_page(page_size: int = 20) -> list[dict]:
    result = run_query(LIST_QUERY, variables={"first": page_size, "after": None})
    edges = result["data"]["products"]["edges"]
    return [edge["node"] for edge in edges]


if __name__ == "__main__":
    for node in first_page():
        print(node["name"], node["price"])
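As a compact preview of the cursor loop, the sketch below follows endCursor until hasNextPage is false. It is an illustration under assumptions, not the dedicated guide's implementation: iter_nodes is a hypothetical name, the query executor is passed in as a callable so the loop can be tried without a network, and max_pages is an added safety cap.

```python
from collections.abc import Callable, Iterator


def iter_nodes(
    run: Callable[[str, dict], dict],
    query: str,
    connection: str,
    page_size: int = 20,
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield every node of a cursor-paginated connection, one page at a time."""
    cursor = None
    for _ in range(max_pages):  # hard cap so a misbehaving API cannot loop forever
        result = run(query, {"first": page_size, "after": cursor})
        page = result["data"][connection]
        for edge in page["edges"]:
            yield edge["node"]
        info = page["pageInfo"]
        if not info["hasNextPage"]:
            break
        cursor = info["endCursor"]


if __name__ == "__main__":
    # Two canned pages stand in for a live API.
    fake_pages = {
        None: {"edges": [{"node": {"id": "1"}}, {"node": {"id": "2"}}],
               "pageInfo": {"endCursor": "c1", "hasNextPage": True}},
        "c1": {"edges": [{"node": {"id": "3"}}],
               "pageInfo": {"endCursor": "c2", "hasNextPage": False}},
    }

    def fake_run(query: str, variables: dict) -> dict:
        return {"data": {"products": fake_pages[variables["after"]]}}

    print([n["id"] for n in iter_nodes(fake_run, "unused", "products", page_size=2)])
```

With the functions from this guide, the call would be iter_nodes(run_query, LIST_QUERY, "products").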
Performance and Scaling Considerations
GraphQL lets you collapse many REST calls into one, which cuts request volume dramatically — but each query can be heavier for the server to resolve, so keep selection sets tight and page sizes modest. Reuse a single Client to pool connections. When you crawl many independent queries, move to concurrent POSTs using the semaphore-gated pattern in Asynchronous Scraping with asyncio and HTTPX, and cache responses during development so you iterate on parsing without re-hitting the endpoint. Watch for server-side complexity limits: some APIs reject queries that request too many nested fields at once.
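The connection reuse and semaphore gating mentioned above can be sketched as follows. This is a minimal illustration with hypothetical names (make_sender, run_many), not the pattern from the linked guide. The client argument is assumed to be one shared httpx.AsyncClient, and the sender is injected so the gating logic can be exercised without a network.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any

Sender = Callable[[dict], Awaitable[dict]]


def make_sender(client: Any, url: str) -> Sender:
    """Wrap one shared async client so every POST reuses its connection pool."""
    async def send(payload: dict) -> dict:
        response = await client.post(url, json=payload)
        response.raise_for_status()
        return response.json()
    return send


async def run_many(
    payloads: list[dict], send: Sender, max_concurrency: int = 5
) -> list[dict]:
    """POST many GraphQL payloads with a cap on requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def gated(payload: dict) -> dict:
        async with semaphore:  # waits here once max_concurrency requests are active
            return await send(payload)

    # gather returns results in the same order as the input payloads.
    return await asyncio.gather(*(gated(p) for p in payloads))
```

In real use you would open the client once, for example with async with httpx.AsyncClient(http2=True, timeout=20) as client, and then await run_many(payloads, make_sender(client, GRAPHQL_URL)) inside that block.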
Common Errors and Fixes
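400 Bad Request on POST. The JSON body is malformed or the query has a syntax error. Print the errors array from the response body; GraphQL error messages usually point to the exact line and column. Because run_query calls raise_for_status(), catch httpx.HTTPStatusError and read exc.response.text to see that body.
HTTP 200 but an errors key in the body. GraphQL returns application errors with a 200 status. Always check result.get("errors") before reading result["data"]. On failure, data is often null or absent, so indexing into it raises a TypeError or KeyError.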
UNAUTHENTICATED / FORBIDDEN in errors. The token or session is missing or expired. Refresh it and confirm the Authorization header is actually being sent.
Cannot query field "x" on type "Y". The field name is wrong. Re-run introspection or copy an exact query from the site's frontend to get valid field names.
Persisted-query errors like PersistedQueryNotFound. The API only accepts pre-registered query hashes, not raw query strings. You must send the hash the frontend uses, captured from the Network tab. Endpoints locked down this tightly may also sit behind anti-bot protections.
Frequently Asked Questions
Do I need a GraphQL client library?
No. A query is just a string in a JSON POST body, so raw httpx is enough and keeps the request transparent. Libraries like gql add schema validation and typed results, which help on large projects but are optional for scraping.
Why does the server return a 200 status with errors?
GraphQL treats field-level failures as part of a normal response, so the HTTP layer reports 200 while the errors array carries the problem. Always inspect the body rather than trusting the status code alone.
What if introspection is disabled?
Read the queries the site's own frontend sends in the Network tab. Every field it requests is a valid field, so you can reconstruct the parts of the schema you need without introspection.
How do I fetch more than one page of results?
Use cursor-based pagination: request pageInfo { endCursor hasNextPage }, then feed endCursor back as the after variable until hasNextPage is false. The dedicated guide on cursor pagination walks through the complete loop.