Handling GraphQL Pagination and Cursors
This walkthrough extends Scraping GraphQL Endpoints with Python with the one piece every real list query needs: following cursor-based pagination to fetch every record instead of just the first page.
Quick answer: GraphQL connections paginate with an opaque cursor, not a page number. You request pageInfo { endCursor hasNextPage } alongside your data, pass the returned endCursor back into the query's after variable, and repeat while hasNextPage is true. Start with after: null for the first page, and stop the loop the moment hasNextPage turns false.
Why Cursors Instead of Page Numbers
Offset pagination (?page=3) breaks when the underlying list changes between requests: insert or delete a row and every later page shifts, so you skip or duplicate records. Relay-style connections solve this with a cursor, an opaque token that marks a fixed position in the result set. Because the cursor points at a specific record rather than a numeric offset, the sequence stays stable even as data changes underneath you. This is the same reliability problem that offset-based pagination and infinite scroll face on rendered pages, solved at the API layer.
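The difference is easy to see in a small in-memory simulation. The sketch below is not how any real server implements cursors: it assumes records are sorted by id and uses the last seen id as a stand-in for the opaque cursor. The function names offset_page and cursor_page are hypothetical.

```python
def offset_page(rows, page, size):
    """Offset pagination: slice by numeric position."""
    start = page * size
    return rows[start:start + size]


def cursor_page(rows, after, size):
    """Cursor pagination (simplified): return rows whose id sorts after the cursor."""
    remaining = [r for r in rows if after is None or r > after]
    return remaining[:size]


rows = [10, 20, 30, 40, 50, 60]  # record ids, sorted ascending

# Offset paging: read page 0, then a new row lands at the front of the list.
offset_seen = offset_page(rows, 0, 2)        # [10, 20]
changed = [5] + rows                         # a row is inserted between requests
offset_seen += offset_page(changed, 1, 2)    # [20, 30]: id 20 shows up twice

# Cursor paging: same insert, but we resume after the last id we saw.
cursor_seen = cursor_page(rows, None, 2)                  # [10, 20]
cursor_seen += cursor_page(changed, cursor_seen[-1], 2)   # [30, 40]

print(offset_seen)  # [10, 20, 20, 30]
print(cursor_seen)  # [10, 20, 30, 40]
```

The offset run returns id 20 twice because the insert pushed it onto the second page; the cursor run resumes from the record itself and is unaffected.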
A Relay connection has a consistent shape:
- edges: a list where each entry wraps a node (your actual record) and its cursor.
- pageInfo: metadata containing at least endCursor (the cursor of the last edge) and hasNextPage (whether more records exist).
Your query asks for the fields you want on each node, plus pageInfo, and takes two arguments: first (how many records per page) and after (the cursor to start after).
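To make that shape concrete, here is roughly what one parsed connection might look like as a Python dict, with a helper that pulls out the three things a pagination loop needs. The product values and cursor strings are made up for illustration, and unpack_connection is a hypothetical helper name.

```python
# Illustrative response fragment; every value here is invented.
sample_connection = {
    "edges": [
        {"cursor": "Y3Vyc29yOjE=", "node": {"id": "1", "name": "Kettle", "price": 24.5}},
        {"cursor": "Y3Vyc29yOjI=", "node": {"id": "2", "name": "Toaster", "price": 31.0}},
    ],
    "pageInfo": {"endCursor": "Y3Vyc29yOjI=", "hasNextPage": True},
}


def unpack_connection(connection):
    """Return (nodes, end_cursor, has_next_page) from one connection dict."""
    nodes = [edge["node"] for edge in connection["edges"]]
    page_info = connection["pageInfo"]
    return nodes, page_info["endCursor"], page_info["hasNextPage"]


nodes, end_cursor, has_next = unpack_connection(sample_connection)
print([n["name"] for n in nodes])  # ['Kettle', 'Toaster']
print(end_cursor, has_next)        # Y3Vyc29yOjI= True
```

Note that endCursor matches the cursor of the last edge, which is why a query can often skip the per-edge cursor field and read only pageInfo.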
The Pagination Query
PAGINATED_QUERY = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges {
      node {
        id
        name
        price
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
"""
On the first call, after is null (Python None), which tells the server to start from the beginning. Each subsequent call passes the previous response's endCursor.
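As a quick check of what actually goes over the wire, the sketch below builds the JSON request body for a first page and for a later page. The shortened QUERY string, the build_body helper, and the cursor value are all illustrative assumptions, not part of any particular API.

```python
import json

# Shortened placeholder query; any paginated query with $first and $after would do.
QUERY = (
    "query GetProducts($first: Int!, $after: String) { "
    "products(first: $first, after: $after) { pageInfo { endCursor hasNextPage } } }"
)


def build_body(after=None, first=50):
    """Return the JSON request body for one page."""
    return json.dumps({"query": QUERY, "variables": {"first": first, "after": after}})


first_body = build_body()               # first page: no cursor yet
next_body = build_body("Y3Vyc29yOjI=")  # later page: made-up cursor value

print(json.loads(first_body)["variables"])  # {'first': 50, 'after': None}
print('"after": null' in first_body)        # True
```

Python's None serializes to JSON null, so there is no need for a separate first-page query without the after argument.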
The Full Loop in Python
The loop is small once the shape is clear: run the query, collect the nodes, read pageInfo, and either continue with the new cursor or stop. A guard on the iteration count prevents an infinite loop if a server ever misreports hasNextPage.
import httpx

GRAPHQL_URL = "https://api.example.com/graphql"

PAGINATED_QUERY = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id name price } }
    pageInfo { endCursor hasNextPage }
  }
}
"""


def run_query(client: httpx.Client, variables: dict) -> dict:
    """POST the query and return the products connection, raising on any error."""
    response = client.post(GRAPHQL_URL, json={"query": PAGINATED_QUERY, "variables": variables})
    response.raise_for_status()
    result = response.json()
    # GraphQL can report failures in the body even when the HTTP status is 200.
    if "errors" in result:
        raise RuntimeError(result["errors"])
    return result["data"]["products"]


def fetch_all_products(page_size: int = 50, max_pages: int = 1000) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    products: list[dict] = []
    cursor: str | None = None  # None is sent as null: start from the beginning

    # http2=True needs the optional dependency: pip install "httpx[http2]"
    with httpx.Client(http2=True, timeout=20, headers=headers) as client:
        for _ in range(max_pages):  # hard cap in case hasNextPage is misreported
            connection = run_query(client, {"first": page_size, "after": cursor})
            edges = connection["edges"]
            if not edges:
                break  # empty page: stop even if hasNextPage claimed otherwise
            products.extend(edge["node"] for edge in edges)
            page_info = connection["pageInfo"]
            if not page_info["hasNextPage"]:
                break
            cursor = page_info["endCursor"]
    return products


if __name__ == "__main__":
    all_products = fetch_all_products()
    print(f"collected {len(all_products)} products across all pages")
Reusing one Client pools the connection so every page after the first skips the TCP and TLS handshake. Once you have the flat list of nodes, reshape or store it using the patterns in Parsing JSON and XML Responses.
Edge Cases and Caveats
- Cursors are opaque — never build them yourself. A cursor is often a base64-encoded token whose format is an implementation detail. Only ever pass back a value the server gave you.
- hasNextPage can be optimistic. Some servers return true on the final page and then an empty edges list on the next call. Break the loop if a page returns zero edges, regardless of the flag.
- Respect first limits. Many APIs cap page size (often 100). Requesting first: 1000 may return an error or silently clamp to the maximum, so keep it modest.
- Backward pagination exists too. Relay also defines last and before with a startCursor for paging in reverse. Most scraping only needs forward paging with first/after.
- Total counts are not guaranteed. A totalCount field is a common extension but not part of the Relay spec, so do not rely on it being present for loop control; use hasNextPage.
- Pace long crawls. Thousands of sequential pages add up; when volume is high, move to concurrent requests with the semaphore approach in Asynchronous Scraping with asyncio and HTTPX.
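Several of these caveats can be folded into one defensive loop. The sketch below is transport-agnostic: it takes any fetch_page(first, after) callable that returns a connection dict, so it can be exercised without a network. The paginate and fake_fetch names, the MAX_PAGE_SIZE cap of 100, and the repeated-cursor check are assumptions for illustration, not requirements of the Relay spec.

```python
from typing import Callable, Iterator, Optional

MAX_PAGE_SIZE = 100  # assumed server-side cap; check the real API's limit


def paginate(
    fetch_page: Callable[[int, Optional[str]], dict],
    page_size: int = 50,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield nodes page by page, stopping on any sign that the list is exhausted."""
    first = min(page_size, MAX_PAGE_SIZE)  # stay within the assumed cap
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard upper bound on requests
        connection = fetch_page(first, cursor)
        edges = connection["edges"]
        if not edges:
            break  # empty page: ignore an optimistic hasNextPage
        for edge in edges:
            yield edge["node"]
        page_info = connection["pageInfo"]
        next_cursor = page_info.get("endCursor")
        if not page_info.get("hasNextPage") or next_cursor is None:
            break
        if next_cursor == cursor:
            break  # cursor did not advance: avoid refetching the same page
        cursor = next_cursor


def fake_fetch(first: int, after: Optional[str]) -> dict:
    """Stand-in server that always claims hasNextPage, even at the end."""
    data = ["a", "b", "c"]
    start = 0 if after is None else int(after)
    chunk = data[start:start + first]
    return {
        "edges": [{"node": {"id": item}} for item in chunk],
        "pageInfo": {"endCursor": str(start + len(chunk)), "hasNextPage": True},
    }


collected = [node["id"] for node in paginate(fake_fetch, page_size=2)]
print(collected)  # ['a', 'b', 'c']
```

Writing the loop as a generator lets the caller process or store nodes as they arrive instead of holding every page in memory.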
Frequently Asked Questions
What exactly is a cursor?
An opaque token that marks a record's position in the result set. You treat it as a black box: read it from pageInfo.endCursor and pass it straight back as the after variable. Its internal format is the server's business, not yours.
How do I know when to stop paginating?
Loop while pageInfo.hasNextPage is true and feed endCursor into after each time. As a safety net, also stop if a page returns no edges, since some servers report hasNextPage: true one page too long.
Can I paginate faster with concurrent requests?
Cursor pagination is inherently sequential — you need each page's endCursor before you can request the next. To parallelize, split the work along another dimension (categories, date ranges, separate queries) and paginate each stream concurrently with the async approach described above.
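A rough sketch of that idea: each category is paged sequentially, while several categories run concurrently under a shared semaphore. To keep the example runnable offline, fetch_page is a fake coroutine over in-memory data; in a real crawl it would be an HTTP call, and the category argument assumes the target query accepts some such filter. All names here are hypothetical.

```python
import asyncio

# In-memory stand-in for a server that exposes one product list per category.
FAKE_DATA = {"books": ["b1", "b2", "b3"], "toys": ["t1", "t2"]}


async def fetch_page(category, first, after):
    """Fake network call returning one connection page for a category."""
    await asyncio.sleep(0)  # yield control, as a real request would
    items = FAKE_DATA[category]
    start = 0 if after is None else int(after)
    chunk = items[start:start + first]
    end = start + len(chunk)
    return {
        "edges": [{"node": {"id": item}} for item in chunk],
        "pageInfo": {"endCursor": str(end), "hasNextPage": end < len(items)},
    }


async def crawl_category(category, semaphore, page_size=2):
    """Page through one category sequentially; the semaphore limits in-flight requests."""
    nodes, cursor = [], None
    while True:
        async with semaphore:
            connection = await fetch_page(category, page_size, cursor)
        nodes.extend(edge["node"] for edge in connection["edges"])
        page_info = connection["pageInfo"]
        if not connection["edges"] or not page_info["hasNextPage"]:
            return nodes
        cursor = page_info["endCursor"]


async def crawl_all(categories, max_concurrency=2):
    """Run one sequential pagination stream per category, concurrently."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(*(crawl_category(c, semaphore) for c in categories))
    return dict(zip(categories, results))


results = asyncio.run(crawl_all(["books", "toys"]))
print({name: len(nodes) for name, nodes in results.items()})  # {'books': 3, 'toys': 2}
```

Within each stream the cursor chain stays strictly ordered; only the independent streams overlap.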
The response has no pageInfo. How do I page then?
That endpoint likely uses a non-relay style — perhaps offset arguments like limit/offset, or a simple nextToken. Inspect a real request from the site's frontend to see its exact pagination arguments, using the discovery method in Reverse-Engineering Private APIs in Python.
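For reference, the two alternatives mentioned above usually reduce to loops like the following. These are sketches under assumptions: the fetch callables stand in for whatever request the real endpoint needs, and the "items" response key is hypothetical, so adapt the field names to what the frontend's requests actually show.

```python
def paginate_offset(fetch, limit=50, max_pages=1000):
    """limit/offset style: advance the offset until a short page comes back."""
    items, offset = [], 0
    for _ in range(max_pages):
        batch = fetch(limit, offset)
        items.extend(batch)
        if len(batch) < limit:
            break  # a short (or empty) page means the list is exhausted
        offset += limit
    return items


def paginate_token(fetch, max_pages=1000):
    """nextToken style: pass the token back until the server stops returning one."""
    items, token = [], None
    for _ in range(max_pages):
        page = fetch(token)
        items.extend(page["items"])  # "items" is an assumed key name
        token = page.get("nextToken")
        if not token:
            break
    return items


# Fake backends so the loops can be run without a network.
DATA = list(range(5))


def fake_offset_fetch(limit, offset):
    return DATA[offset:offset + limit]


def fake_token_fetch(token):
    start = 0 if token is None else int(token)
    end = start + 2
    return {"items": DATA[start:end], "nextToken": str(end) if end < len(DATA) else None}


print(paginate_offset(fake_offset_fetch, limit=2))  # [0, 1, 2, 3, 4]
print(paginate_token(fake_token_fetch))             # [0, 1, 2, 3, 4]
```

The offset variant inherits the shifting-page problem described earlier, so deduplicate by id if the underlying list can change during the crawl.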