Reverse-Engineering Private APIs in Python
Many modern websites do not embed their data in the HTML you first download. The page arrives nearly empty, and JavaScript then fetches the real content from a private, undocumented JSON API. Calling those same endpoints directly is usually faster, cleaner, and more reliable than parsing rendered HTML: you get structured data instead of brittle CSS selectors, and you skip running a browser entirely. This guide is part of Data Extraction Patterns & APIs, and it walks through discovering those endpoints in the browser Network tab, replaying the requests in Python with httpx, and handling the auth tokens and headers that make them work.
When to Use This Approach
Reach for private-API scraping when the signals point to a data-driven frontend:
- The page loads content after the initial HTML. You see a spinner, then data appears — a classic sign of an XHR/fetch call.
- View-source shows no data. If the product prices or listings are missing from the raw HTML but visible in the browser, they arrive over the network afterward.
- The Network tab shows JSON. Filtering to Fetch/XHR reveals responses with Content-Type: application/json.
- You want speed and stability. One JSON request can replace a full browser render plus HTML parsing, and JSON schemas change far less often than page markup.
If instead the data is baked into the initial HTML, parse it directly with the techniques in Understanding HTTP Requests and Responses. And when a site uses a GraphQL endpoint rather than plain REST, follow the dedicated walkthrough in Scraping GraphQL Endpoints instead.
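The "view-source shows no data" check can be scripted. The sketch below assumes you already have the raw HTML as a string (for example from a plain GET) and a value you can see in the rendered page; the function name and the sample markup are hypothetical.

```python
import html
import re


def appears_in_raw_html(raw_html: str, visible_value: str) -> bool:
    """Return True if a value seen in the browser is present in the raw HTML."""

    def normalize(text: str) -> str:
        # Decode entities such as &amp; and collapse whitespace so that
        # formatting differences do not hide a match.
        return re.sub(r"\s+", " ", html.unescape(text)).strip().lower()

    return normalize(visible_value) in normalize(raw_html)


# Hypothetical sample: an empty shell that JavaScript fills in later.
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
# Hypothetical sample: a server-rendered page with the price in the markup.
rendered = "<html><body><span class='price'>$19.99</span></body></html>"

print(appears_in_raw_html(shell, "$19.99"))     # False
print(appears_in_raw_html(rendered, "$19.99"))  # True
```

A False result suggests the value arrives over the network afterward, which is the cue to look for a JSON endpoint.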
Prerequisites
You need Python 3.10 or newer and a modern HTTP client. httpx is used throughout because it supports HTTP/2 and has a clean typed API, but the code translates almost directly to requests, minus the HTTP/2 option.
python -m pip install "httpx[http2]>=0.27"
A Chromium-based browser (Chrome, Edge, or Brave) gives you the best Network tab and a "Copy as cURL" option that captures every header exactly.
1. Open the Network Tab and Reproduce the Action
Open DevTools (F12), switch to the Network panel, and filter to Fetch/XHR. Then perform the action that loads the data you want — scroll the list, click a tab, or submit a search. Each request that appears is a candidate. Click one and inspect the Response sub-tab: if it contains the JSON you are after, you have found your endpoint. The full mechanics of filtering and hunting are covered in Finding Hidden API Endpoints in Network Traffic.
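When the log is long, one option is to export it from the Network panel as a HAR file and search the response bodies offline. The sketch below assumes the standard HAR layout (log.entries, each with request and response.content); the function name and the sample data are hypothetical.

```python
import base64
from typing import Any


def find_entries_containing(har: dict[str, Any], needle: str) -> list[tuple[str, str]]:
    """Return (method, url) for every HAR entry whose response body contains needle."""
    matches: list[tuple[str, str]] = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        body = content.get("text") or ""
        # HAR flags bodies that were stored as base64 with an "encoding" field.
        if content.get("encoding") == "base64":
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        if needle in body:
            request = entry.get("request", {})
            matches.append((request.get("method", ""), request.get("url", "")))
    return matches


# Hypothetical two-entry capture: a script file and a JSON response.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://www.example.com/app.js"},
                "response": {"content": {"text": "console.log('hi')"}},
            },
            {
                "request": {"method": "GET", "url": "https://api.example.com/v1/products?page=1"},
                "response": {"content": {"text": '{"items": [{"name": "Blue Widget"}]}'}},
            },
        ]
    }
}

print(find_entries_containing(sample_har, "Blue Widget"))
# [('GET', 'https://api.example.com/v1/products?page=1')]
```

With a real export you would load the file first, for example with json.loads on the file's text, and search for a value you can see on the page.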
Note three things from the request: the URL (including query parameters), the method (GET or POST), and the request headers.
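Splitting the captured URL into a base URL and a parameter dict makes it easier to vary one parameter at a time later. This is a small sketch using only the standard library; the limit and sort parameters in the sample URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def split_captured_url(captured_url: str) -> tuple[str, dict[str, str]]:
    """Split a URL copied from DevTools into a base URL and its query parameters."""
    parts = urlsplit(captured_url)
    # keep_blank_values preserves parameters such as "?q=" that the API may expect.
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return base_url, params


base_url, params = split_captured_url(
    "https://api.example.com/v1/products?page=1&limit=20&sort=price"
)
print(base_url)  # https://api.example.com/v1/products
print(params)    # {'page': '1', 'limit': '20', 'sort': 'price'}
```

Note that converting to a dict keeps only the last value of a repeated key; if the endpoint repeats a parameter, keep the list of pairs from parse_qsl instead.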
2. Replay the Request in Python
Start with the smallest possible request — just the URL and a realistic User-Agent. Many public-facing APIs need nothing more.
import httpx


def fetch_json(url: str) -> dict:
    # A realistic browser User-Agent plus an Accept header asking for JSON.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
    }
    with httpx.Client(http2=True, timeout=15) as client:
        response = client.get(url, headers=headers)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
        return response.json()


if __name__ == "__main__":
    data = fetch_json("https://api.example.com/v1/products?page=1")
    print(f"received {len(data.get('items', []))} items")
If this returns your JSON, you are done: reshape it and move on. If it returns 401, 403, or an HTML error page, the endpoint expects more than a bare request, usually extra headers (step 3) or authentication (step 4).
3. Add the Headers That Matter
APIs distinguish real browser traffic from scripts using a handful of headers. Copy them from the Network tab request and add only the ones that change the outcome. The usual suspects:
- Referer. Many endpoints reject requests that did not "come from" their own site.
- Origin. Checked on cross-origin POST requests.
- X-Requested-With: XMLHttpRequest. A legacy marker some backends still require.
- Accept-Language. Occasionally used to gate or shape the response.
import httpx


def fetch_with_context(url: str) -> dict:
    # Headers copied from the browser request; trim to the ones that matter.
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.example.com/products",
        "X-Requested-With": "XMLHttpRequest",
    }
    with httpx.Client(http2=True, timeout=15) as client:
        response = client.get(url, headers=headers)
        response.raise_for_status()
        return response.json()
Add headers one at a time and re-test. Once the request succeeds, drop any header that is not required — a lean request is easier to maintain.
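The "drop any header that is not required" step can be automated once you have a working request. The sketch below takes the check as a callable, so it is not tied to any HTTP client; minimal_headers and fake_check are hypothetical names, and in real use the callable would send the request and report whether it returned the expected JSON.

```python
from typing import Callable


def minimal_headers(
    captured: dict[str, str],
    request_succeeds: Callable[[dict[str, str]], bool],
) -> dict[str, str]:
    """Drop every captured header whose removal does not break the request."""
    required = dict(captured)
    for name in list(captured):
        trial = {k: v for k, v in required.items() if k != name}
        # Leave the header out only if the request still works without it.
        if request_succeeds(trial):
            required = trial
    return required


def fake_check(headers: dict[str, str]) -> bool:
    # Stand-in for a real request: pretend the backend needs these two headers.
    return "Referer" in headers and "User-Agent" in headers


captured = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json, text/plain, */*",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.example.com/products",
    "X-Requested-With": "XMLHttpRequest",
}
print(sorted(minimal_headers(captured, fake_check)))  # ['Referer', 'User-Agent']
```

Each check is a real request against the site when wired to a live client, so pause between attempts rather than running the loop at full speed.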
4. Handle Authentication Tokens
Endpoints behind a login use one of two common schemes. The first is a bearer token in an Authorization header, often issued by a separate login call and refreshed periodically. The second is a session cookie set when you sign in. For cookie-based auth, log in once and reuse the session so cookies persist across requests, exactly as described in Managing Cookies and Sessions.
import httpx


def fetch_authenticated(url: str, bearer_token: str) -> dict:
    # Use a bearer token you already hold.
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Authorization": f"Bearer {bearer_token}",
    }
    with httpx.Client(http2=True, timeout=15) as client:
        response = client.get(url, headers=headers)
        response.raise_for_status()
        return response.json()


def login_and_fetch(login_url: str, data_url: str, username: str, password: str) -> dict:
    # Log in first, then reuse the same client so the token applies to later calls.
    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36")
    headers = {"User-Agent": ua, "Accept": "application/json"}
    with httpx.Client(http2=True, timeout=15, headers=headers) as client:
        auth = client.post(login_url, json={"username": username, "password": password})
        auth.raise_for_status()
        # The field name varies by site; check the login response in the Network tab.
        token = auth.json()["access_token"]
        client.headers["Authorization"] = f"Bearer {token}"
        data = client.get(data_url)
        data.raise_for_status()
        return data.json()
Watch for short-lived tokens: if requests start failing with 401 after a few minutes, the token expired and you need to re-run the login step.
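One way to handle expiry is to retry once with a fresh token whenever a request comes back 401. The sketch below is duck-typed: it assumes client behaves like an httpx.Client (a get method and a mutable headers mapping) and that login is a hypothetical callable returning a new token string.

```python
from typing import Any, Callable


def get_with_relogin(client: Any, url: str, login: Callable[[], str]) -> Any:
    """GET url; on a 401, fetch a fresh token once and retry."""
    response = client.get(url)
    if response.status_code == 401:
        # The token has probably expired: mint a new one and retry exactly once.
        client.headers["Authorization"] = f"Bearer {login()}"
        response = client.get(url)
    # Any remaining error, including a second 401, is raised to the caller.
    response.raise_for_status()
    return response.json()
```

Retrying only once avoids a login loop when the 401 has another cause, such as revoked credentials.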
5. Paginate and Extract
Private APIs almost always paginate. Look at the query string (?page=2, ?offset=40&limit=20) or a next field in the response body, then loop until the data runs out. Once you have the raw JSON, flatten and select the fields you need — the reshaping patterns live in Parsing JSON and XML Responses.
import httpx


def scrape_all_pages(base_url: str) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
        "Accept": "application/json",
    }
    results: list[dict] = []
    page = 1
    # One client for the whole crawl so connections are pooled.
    with httpx.Client(http2=True, timeout=15, headers=headers) as client:
        while True:
            response = client.get(base_url, params={"page": page, "limit": 50})
            response.raise_for_status()
            payload = response.json()
            items = payload.get("items", [])
            if not items:
                break  # an empty page means the data has run out
            results.extend(items)
            # Assumes the API reports has_more; if the field is absent,
            # the loop stops after this page.
            if not payload.get("has_more", False):
                break
            page += 1
    return results
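When the response carries a next field instead of a page counter, the loop follows that URL until it disappears. The sketch below assumes the payload has "items" and "next" keys (names vary by API) and takes the fetch step as a callable; in real use it would wrap client.get(url).json().

```python
from typing import Callable, Iterator


def iter_items_following_next(
    first_url: str,
    fetch_page: Callable[[str], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items from every page, following the "next" URL in each payload."""
    url: str | None = first_url
    seen: set[str] = set()
    pages = 0
    # Stop on a missing "next", a repeated URL, or the page cap.
    while url and url not in seen and pages < max_pages:
        seen.add(url)
        pages += 1
        payload = fetch_page(url)
        yield from payload.get("items", [])
        url = payload.get("next")


# Hypothetical two-page API, keyed by URL.
first = "https://api.example.com/v1/products?cursor=a"
second = "https://api.example.com/v1/products?cursor=b"
fake_pages = {
    first: {"items": [{"id": 1}, {"id": 2}], "next": second},
    second: {"items": [{"id": 3}], "next": None},
}

items = list(iter_items_following_next(first, fake_pages.__getitem__))
print([item["id"] for item in items])  # [1, 2, 3]
```

The seen set and max_pages cap are guards against an API that keeps returning the same next URL.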
Performance and Scaling Considerations
Hitting a JSON API is dramatically cheaper than driving a browser — a single request often replaces megabytes of rendering. That efficiency makes it tempting to hammer the endpoint, so pace yourself. Respect the site by keeping request rates modest and reusing one Client to pool connections. When you need throughput across thousands of pages, move the loop to concurrent requests with the semaphore-gated approach in Asynchronous Scraping with asyncio and HTTPX. Cache responses during development so you are not re-fetching the same page while you iterate on parsing logic.
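For the development cache, a file per URL is often enough. The sketch below is one simple way to do it, assuming JSON-serializable payloads; cached_fetch and the cache directory are hypothetical, and the fetch argument stands in for whatever function performs the real request.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], dict], cache_dir: Path) -> dict:
    """Return the cached payload for url, calling fetch only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so query strings and slashes never end up in a file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    payload = fetch(url)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload
```

The cache never expires, which is what you want while iterating on parsing logic; delete the directory before a real run.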
Common Errors and Fixes
403 Forbidden on a request that works in the browser. You are missing a header the backend checks. Add Referer, Origin, and X-Requested-With, and confirm your User-Agent looks like a real browser. If it still fails, the endpoint may be behind an anti-bot layer — see Advanced Scraping Techniques & Anti-Bot Evasion.
401 Unauthorized after some time. The bearer token expired. Re-run the login call to mint a fresh token before continuing.
json.JSONDecodeError / "Expecting value". The response was not JSON — usually an HTML block page or a redirect to a login form. Print response.text[:500] to see what actually came back, then adjust headers or auth.
httpx.ReadTimeout. The endpoint is slow or you are being throttled. Raise the timeout and slow your request rate; a sudden batch of timeouts often means rate limiting.
Empty items on page 1. The pagination parameter name, or the key you read the items from, is probably wrong. Re-check the exact query string in the Network tab (page vs p vs offset) and the field names in the response body, and match them precisely.
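The JSONDecodeError and ReadTimeout fixes above can be sketched as two small helpers. Both are client-agnostic and the names are hypothetical: decode_json_body takes the response text, and retry_with_backoff takes the request as a callable plus the exception types worth retrying.

```python
import json
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def decode_json_body(body: str) -> object:
    """Parse a response body as JSON, or raise with a readable explanation."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        looks_like_html = body.lstrip().lower().startswith(("<!doctype", "<html"))
        kind = "an HTML page" if looks_like_html else "a non-JSON body"
        # Include the first 500 characters so the block page or login form is visible.
        raise ValueError(f"expected JSON but got {kind}: {body[:500]!r}") from exc


def retry_with_backoff(
    action: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run action, doubling the wait after each failure listed in retry_on."""
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise RuntimeError("attempts must be at least 1")
```

With httpx, the calls might look like decode_json_body(response.text) and retry_with_backoff(lambda: client.get(url), retry_on=(httpx.ReadTimeout,)). The sleep parameter exists so the waits can be replaced in tests.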
Frequently Asked Questions
Is calling a private API legal? This guide is technical, not legal advice. Calling an undocumented endpoint sends the same HTTP requests a browser makes, but you should still review the site's terms of service and applicable laws, and avoid accessing data you are not authorized to see. Stay within what the frontend itself exposes to a logged-in user.
Why bother with the API instead of just parsing HTML? Speed and stability. A JSON endpoint returns clean, structured data in one request, with no browser to run and no CSS selectors that break when the design changes. When an internal API exists, it is almost always the better target.
How do I find the right request among hundreds in the Network tab? Filter to Fetch/XHR, clear the log, then trigger the action you care about so only the relevant requests appear. Sort by response size or search the responses for a value you can see on the page — the in-depth walkthrough on finding hidden endpoints covers the full method.
What if the endpoint needs a signature or hashed parameter? Some APIs sign requests with a token computed in JavaScript. You then have to read the site's JS to reproduce the signing logic, or fall back to driving a real browser that computes it for you. This is the hardest case and often signals the site actively discourages scraping.
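To give a sense of what reproduced signing logic can look like, here is a purely hypothetical scheme: an HMAC-SHA256 over the sorted query parameters plus a timestamp. The ts and sig parameter names, the secret, and the message layout are all assumptions for illustration; a real site's algorithm has to be read from its own JavaScript.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def sign_params(params: dict[str, str], secret: str, timestamp: int) -> dict[str, str]:
    """Return params plus hypothetical "ts" and "sig" fields."""
    signed = {**params, "ts": str(timestamp)}
    # Sort the pairs so the same parameters always produce the same message.
    message = urlencode(sorted(signed.items()))
    digest = hmac.new(secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256)
    signed["sig"] = digest.hexdigest()
    return signed


signed = sign_params({"page": "1", "limit": "50"}, secret="not-a-real-key", timestamp=1700000000)
print(sorted(signed))  # ['limit', 'page', 'sig', 'ts']
```

When porting a real scheme, compare your output against a signature captured in the Network tab for the same inputs before sending any requests.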