Handling GraphQL Pagination and Cursors

This walkthrough extends Scraping GraphQL Endpoints with Python with the one piece every real list query needs: following cursor-based pagination to fetch every record instead of just the first page.

Each page returns an endCursor that feeds the next request's after variable; the loop ends when hasNextPage is false.

Quick answer: GraphQL connections paginate with an opaque cursor, not a page number. You request pageInfo { endCursor hasNextPage } alongside your data, pass the returned endCursor back into the query's after variable, and repeat while hasNextPage is true. Start with after: null for the first page, and stop the loop the moment hasNextPage turns false.

Why Cursors Instead of Page Numbers

Offset pagination (?page=3) breaks when the underlying list changes between requests: insert a row and every later page shifts, so you skip or duplicate records. Relay-style connections solve this with a cursor, an opaque token that marks a fixed position in the result set. Because the cursor points at a specific record rather than a numeric offset, the sequence stays stable even as data changes underneath you. This is the same reliability problem that offset-based pagination and infinite scroll face on rendered pages, solved at the API layer.
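
To make the failure mode concrete, here is a small in-memory simulation. It is a toy sketch, not a real API: offset_page and cursor_page are hypothetical helpers, and the record id doubles as the cursor purely for illustration (real cursors are opaque tokens issued by the server).

```python
def offset_page(rows, page, size):
    """Offset pagination: slice by numeric position."""
    start = page * size
    return rows[start:start + size]


def cursor_page(rows, after, size):
    """Cursor pagination: return the rows that sort after the cursor."""
    remaining = [r for r in rows if after is None or r > after]
    return remaining[:size]


rows = [10, 20, 30, 40, 50, 60]  # record ids, sorted ascending

# Offset: read page 0, then a new row lands at the front before page 1.
offset_seen = offset_page(rows, 0, 2)             # [10, 20]
shifted = [5] + rows                               # the insert shifts every position
offset_seen += offset_page(shifted, 1, 2)          # [20, 30], so 20 is duplicated

# Cursor: same insert, but page 2 starts after id 20, not at position 2.
cursor_seen = cursor_page(rows, None, 2)           # [10, 20]
cursor_seen += cursor_page(shifted, cursor_seen[-1], 2)  # [30, 40]

print(offset_seen)  # [10, 20, 20, 30]
print(cursor_seen)  # [10, 20, 30, 40]
```

The offset reader sees id 20 twice, while the cursor reader picks up exactly where it left off.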

A Relay connection has a consistent shape:

edges — a list where each entry wraps a node (your actual record) and its cursor.
pageInfo — metadata containing at least endCursor (the cursor of the last edge) and hasNextPage (whether more records exist).

Your query asks for the fields you want on each node, plus pageInfo, and takes two arguments: first (how many records per page) and after (the cursor to start after).
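
As a rough picture of what comes back, the connection can be described with a few typed dictionaries. The type names and the sample values below are illustrative assumptions, not taken from any particular API; only the edges, node, cursor, and pageInfo field names come from the shape described above.

```python
from typing import Optional, TypedDict


class PageInfo(TypedDict):
    endCursor: Optional[str]
    hasNextPage: bool


class Edge(TypedDict):
    node: dict
    cursor: str


class Connection(TypedDict):
    edges: list[Edge]
    pageInfo: PageInfo


# Hypothetical response for products(first: 2); ids, names, and cursors are made up.
sample: Connection = {
    "edges": [
        {"node": {"id": "p1", "name": "Kettle"}, "cursor": "Y3Vyc29yOjE="},
        {"node": {"id": "p2", "name": "Toaster"}, "cursor": "Y3Vyc29yOjI="},
    ],
    "pageInfo": {"endCursor": "Y3Vyc29yOjI=", "hasNextPage": True},
}

# The records you care about live under each edge's node.
nodes = [edge["node"] for edge in sample["edges"]]
# endCursor matches the cursor of the last edge and feeds the next request.
next_cursor = sample["pageInfo"]["endCursor"]
```

If you do not need per-record cursors, you can omit cursor from the query and rely on pageInfo.endCursor alone.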

The Pagination Query

PAGINATED_QUERY = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges {
      node {
        id
        name
        price
      }
    }
    pageInfo {
      endCursor
      hasNextPage
    }
  }
}
"""

On the first call, after is null (Python None), which tells the server to start from the beginning. Each subsequent call passes the previous response's endCursor.
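
The difference between the first and later calls is easiest to see in the serialized request body. The sketch below assumes a helper named build_body (hypothetical) and a made-up cursor value; it only builds JSON and sends nothing.

```python
import json
from typing import Optional

QUERY = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id name price } }
    pageInfo { endCursor hasNextPage }
  }
}
"""


def build_body(first: int, after: Optional[str]) -> str:
    """Serialize the POST body; json.dumps turns Python None into JSON null."""
    return json.dumps({"query": QUERY, "variables": {"first": first, "after": after}})


first_body = build_body(50, None)            # first page: after is null
next_body = build_body(50, "Y3Vyc29yOjUw")   # later page: made-up cursor for illustration

print(json.loads(first_body)["variables"])   # {'first': 50, 'after': None}
print(json.loads(next_body)["variables"])    # {'first': 50, 'after': 'Y3Vyc29yOjUw'}
```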

The Full Loop in Python

The loop is small once the shape is clear: run the query, collect the nodes, read pageInfo, and either continue with the new cursor or stop. A guard on the iteration count prevents an infinite loop if a server ever misreports hasNextPage.

import httpx

GRAPHQL_URL = "https://api.example.com/graphql"

PAGINATED_QUERY = """
query GetProducts($first: Int!, $after: String) {
  products(first: $first, after: $after) {
    edges { node { id name price } }
    pageInfo { endCursor hasNextPage }
  }
}
"""


def run_query(client: httpx.Client, variables: dict) -> dict:
    """POST one page request and return the products connection."""
    response = client.post(GRAPHQL_URL, json={"query": PAGINATED_QUERY, "variables": variables})
    response.raise_for_status()
    result = response.json()
    # GraphQL servers commonly report query errors in the body with HTTP 200.
    if "errors" in result:
        raise RuntimeError(result["errors"])
    return result["data"]["products"]


def fetch_all_products(page_size: int = 50, max_pages: int = 1000) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    products: list[dict] = []
    cursor: str | None = None
    with httpx.Client(http2=True, timeout=20, headers=headers) as client:
        for _ in range(max_pages):
            connection = run_query(client, {"first": page_size, "after": cursor})
            edges = connection["edges"]
            if not edges:
                # Some servers report hasNextPage: true one page too long.
                break
            products.extend(edge["node"] for edge in edges)
            page_info = connection["pageInfo"]
            if not page_info["hasNextPage"]:
                break
            cursor = page_info["endCursor"]
    return products


if __name__ == "__main__":
    all_products = fetch_all_products()
    print(f"collected {len(all_products)} products across all pages")

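Reusing one Client pools the connection so every page after the first skips the TCP and TLS handshake. Note that http2=True needs HTTPX's optional HTTP/2 dependency (pip install "httpx[http2]"); drop the flag if it is not installed. Once you have the flat list of nodes, reshape or store it using the patterns in Parsing JSON and XML Responses.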

Edge Cases and Caveats

Cursors are opaque — never build them yourself. A cursor is often a base64-encoded token whose format is an implementation detail. Only ever pass back a value the server gave you.
hasNextPage can be optimistic. Some servers return true on the final page and then an empty edges list on the next call. Break the loop if a page returns zero edges, regardless of the flag.
Respect first limits. Many APIs cap page size (often 100). Requesting first: 1000 may return an error or silently clamp to the maximum, so keep it modest.
Backward pagination exists too. Relay also defines last and before with a startCursor for paging in reverse. Most scraping only needs forward paging with first/after.
Total counts are not guaranteed. A totalCount field is a common extension but not part of the Relay spec, so do not rely on it being present for loop control; use hasNextPage.
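Pace long crawls. Thousands of sequential pages add up, so leave a short pause between requests on large jobs. A single connection has to be walked page by page, but when volume is high you can paginate several independent connections concurrently with the semaphore approach in Asynchronous Scraping with asyncio and HTTPX.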
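
Several of these caveats can be folded into one defensive loop. The version below is a sketch under stated assumptions: paginate and fake_fetch are hypothetical names, and the fake server stands in for a real endpoint that wrongly keeps reporting hasNextPage: true. The repeated-cursor check is an extra safeguard, not something the Relay spec requires.

```python
from typing import Callable, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 1000) -> list[dict]:
    """Collect nodes from a connection, stopping on any sign the end was reached."""
    nodes: list[dict] = []
    cursor: Optional[str] = None
    seen_cursors: set[str] = set()
    for _ in range(max_pages):
        connection = fetch_page(cursor)
        edges = connection["edges"]
        if not edges:                     # optimistic hasNextPage: an empty page means done
            break
        nodes.extend(edge["node"] for edge in edges)
        page_info = connection["pageInfo"]
        cursor = page_info.get("endCursor")
        if not page_info.get("hasNextPage") or cursor is None:
            break
        if cursor in seen_cursors:        # server repeated a cursor: stop rather than loop
            break
        seen_cursors.add(cursor)
    return nodes


def fake_fetch(after: Optional[str]) -> dict:
    """Fake server: three records, page size 2, always claims another page exists."""
    data = [{"id": 1}, {"id": 2}, {"id": 3}]
    start = 0 if after is None else int(after)
    chunk = data[start:start + 2]
    return {
        "edges": [{"node": n} for n in chunk],
        "pageInfo": {"endCursor": str(start + len(chunk)), "hasNextPage": True},
    }


result = paginate(fake_fetch)
print([n["id"] for n in result])  # [1, 2, 3]
```

Passing the fetch function in as an argument keeps the stopping logic testable without a network connection.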

Frequently Asked Questions

What exactly is a cursor? An opaque token that marks a record's position in the result set. You treat it as a black box: read it from pageInfo.endCursor and pass it straight back as the after variable. Its internal format is the server's business, not yours.

How do I know when to stop paginating? Loop while pageInfo.hasNextPage is true and feed endCursor into after each time. As a safety net, also stop if a page returns no edges, since some servers report hasNextPage: true one page too long.

Can I paginate faster with concurrent requests? Cursor pagination is inherently sequential — you need each page's endCursor before you can request the next. To parallelize, split the work along another dimension (categories, date ranges, separate queries) and paginate each stream concurrently with the async approach described above.
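
One way to sketch that split is to run one sequential pagination loop per stream and cap the number of requests in flight with a semaphore. Everything here is illustrative: FAKE_DATA, fetch_page, and paginate_category are hypothetical, and fetch_page simulates the server in memory where a real version would POST the GraphQL query.

```python
import asyncio
from typing import Optional

# Hypothetical in-memory data standing in for one connection per category.
FAKE_DATA = {
    "books": [{"id": f"b{i}"} for i in range(5)],
    "games": [{"id": f"g{i}"} for i in range(3)],
}


async def fetch_page(category: str, after: Optional[str], first: int = 2) -> dict:
    """Stand-in for an HTTP call; returns a Relay-shaped connection."""
    await asyncio.sleep(0)  # yield control, as a network call would
    rows = FAKE_DATA[category]
    start = 0 if after is None else int(after)
    chunk = rows[start:start + first]
    end = start + len(chunk)
    return {
        "edges": [{"node": r} for r in chunk],
        "pageInfo": {"endCursor": str(end), "hasNextPage": end < len(rows)},
    }


async def paginate_category(category: str, limit: asyncio.Semaphore) -> list[dict]:
    """Walk one category page by page; the semaphore caps requests in flight."""
    nodes: list[dict] = []
    cursor: Optional[str] = None
    while True:
        async with limit:
            connection = await fetch_page(category, cursor)
        nodes.extend(edge["node"] for edge in connection["edges"])
        if not connection["edges"] or not connection["pageInfo"]["hasNextPage"]:
            return nodes
        cursor = connection["pageInfo"]["endCursor"]


async def main() -> dict[str, list[dict]]:
    limit = asyncio.Semaphore(2)  # at most two requests at once across all streams
    categories = list(FAKE_DATA)
    results = await asyncio.gather(*(paginate_category(c, limit) for c in categories))
    return dict(zip(categories, results))


collected = asyncio.run(main())
print({k: len(v) for k, v in collected.items()})  # {'books': 5, 'games': 3}
```

Each stream stays strictly sequential internally; only the streams themselves overlap.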

The response has no pageInfo. How do I page then? That endpoint likely uses a non-Relay style, perhaps offset arguments like limit/offset, or a simple nextToken. Inspect a real request from the site's frontend to see its exact pagination arguments, using the discovery method in Reverse-Engineering Private APIs in Python.
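
For the limit/offset case, a loop along these lines is one possibility. It is a sketch only: fetch_all_by_offset and fake_fetch are hypothetical, the argument names depend on what the real endpoint expects, and the short-page stop condition assumes the server actually honors the requested limit.

```python
from typing import Callable


def fetch_all_by_offset(fetch: Callable[[int, int], list[dict]],
                        limit: int = 50, max_pages: int = 1000) -> list[dict]:
    """Page with limit/offset until a short or empty page signals the end."""
    items: list[dict] = []
    for page in range(max_pages):
        batch = fetch(limit, page * limit)
        items.extend(batch)
        if len(batch) < limit:   # fewer rows than requested: treat as the last page
            break
    return items


# Fake backend with five records, standing in for the real endpoint.
RECORDS = [{"id": i} for i in range(5)]


def fake_fetch(limit: int, offset: int) -> list[dict]:
    return RECORDS[offset:offset + limit]


all_items = fetch_all_by_offset(fake_fetch, limit=2)
print(len(all_items))  # 5
```

If the server silently clamps limit to a smaller value, every page looks short and this loop stops early, so confirm the effective page size against a real response first.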