Handling Pagination and Infinite Scroll in Python Web Scraping
Navigating multi-page datasets and dynamically loaded feeds is a fundamental challenge in modern data extraction. While The Complete Guide to Python Web Scraping covers foundational concepts, mastering navigation logic requires targeted strategies. This guide details how to programmatically traverse traditional page offsets, cursor-based APIs, and simulated scroll behavior to capture complete datasets.
Identifying Pagination Patterns and Data Sources
Before writing extraction loops, inspect network traffic to determine whether a site uses URL parameters, hidden API endpoints, or JavaScript-driven rendering. Properly configuring your workspace and dependencies, as outlined in Setting Up Your Python Scraping Environment, ensures you have the necessary debugging tools — browser developer consoles and proxy loggers — ready for traffic analysis.
Open your browser's Developer Tools (F12), navigate to the Network tab, and filter by Fetch/XHR while navigating or scrolling. This reveals whether the site uses traditional query strings (?page=2), RESTful path structures (/page/3), or asynchronous JSON payloads. Identifying these patterns early prevents wasted development time and guides your choice between lightweight HTTP clients and full browser automation. Document the request headers, payload structures, and response formats before writing your scraper.
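As a rough aid when documenting what the Network tab shows, the sketch below guesses the pagination style from a single observed request URL. The function name `classify_pagination` and the parameter-name sets are hypothetical and chosen for illustration; real sites use many other names, so treat the result as a hint rather than a verdict.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical parameter names; replace them with what the Network tab shows.
PAGE_PARAMS = {"page", "p", "offset", "start"}
CURSOR_PARAMS = {"cursor", "after", "next_token"}


def classify_pagination(url: str) -> str:
    """Guess the pagination style from one observed request URL."""
    parsed = urlparse(url)
    query_keys = {key.lower() for key in parse_qs(parsed.query)}
    if query_keys & CURSOR_PARAMS:
        return "cursor"
    if query_keys & PAGE_PARAMS:
        return "query-string"
    # RESTful path structure such as /page/3
    if re.search(r"/page/\d+/?$", parsed.path):
        return "path"
    return "unknown"


print(classify_pagination("https://example.com/products?page=2"))
print(classify_pagination("https://example.com/blog/page/3"))
```

An "unknown" result usually means the data arrives through a request body or a JavaScript-built call, which is a cue to inspect the Fetch/XHR payloads more closely.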
Traditional Pagination with HTTP Requests
Static pagination relies on predictable URL structures. The concepts covered in Understanding HTTP Requests and Responses are crucial for constructing robust loops that handle status codes, redirects, and session persistence across multiple page fetches.
When implementing a pagination loop, always validate the response before parsing. If a page returns 200 OK but contains no target elements or a "No results found" message, it likely signals the end of the dataset. Avoid hardcoding page limits; rely on dynamic termination signals: missing "Next" buttons, empty result containers, or HTTP 404 responses.
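One way to keep these termination signals separate from the fetching and parsing code is a small generator. This is a minimal sketch: `fetch_page` is a hypothetical callable that you would supply, taking a page number and returning a `(status_code, items)` tuple, and `iter_items` is an illustrative name.

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical contract: page number -> (HTTP status code, parsed items).
FetchPage = Callable[[int], Tuple[int, List[dict]]]


def iter_items(fetch_page: FetchPage, start: int = 1) -> Iterator[dict]:
    """Yield items page by page until a termination signal appears."""
    page = start
    while True:
        status, items = fetch_page(page)
        if status == 404:
            return  # the site ran out of pages
        if status != 200:
            raise RuntimeError(f"Unexpected status {status} on page {page}")
        if not items:
            return  # 200 OK but an empty result container
        yield from items
        page += 1


def demo_fetch(page: int) -> Tuple[int, List[dict]]:
    """Stand-in for a real fetcher: two pages of data, then nothing."""
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return 200, data.get(page, [])


print([item["id"] for item in iter_items(demo_fetch)])
```

Because the stop conditions live in one place, a check for a missing "Next" button can be added inside `fetch_page` (by returning an empty list) without touching the loop.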
Automating Infinite Scroll with Headless Browsers
Dynamic feeds load content asynchronously as users scroll, bypassing traditional pagination. Tools like Selenium or Playwright execute JavaScript, trigger scroll events, and wait for DOM mutations.
To scrape an infinite scroll feed effectively, monitor the DOM height or the count of target elements after each scroll action. Continue scrolling until the element count stops increasing over multiple iterations. Always pair scroll actions with explicit waits to account for network latency and lazy-loading image placeholders.
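The stop rule described above can be sketched independently of any browser library. In this hypothetical helper, `scroll_once` and `count_items` are callables you would supply (for example, wrappers around a Selenium or Playwright scroll plus an explicit wait, and an element count). The `max_scrolls` cap is an added safety assumption, not something the rule itself requires.

```python
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], None],
    count_items: Callable[[], int],
    max_stalls: int = 3,
    max_scrolls: int = 500,
) -> int:
    """Scroll until the item count fails to grow max_stalls times in a row."""
    seen = count_items()
    stalls = 0
    for _ in range(max_scrolls):
        if stalls >= max_stalls:
            break
        scroll_once()  # expected to scroll and wait for content to load
        current = count_items()
        if current > seen:
            seen, stalls = current, 0  # progress: reset the stall counter
        else:
            stalls += 1
    return seen
```

Requiring several consecutive stalls, rather than stopping at the first one, gives slow responses a chance to arrive before the feed is treated as exhausted.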
Anti-Bot Mitigation and Request Throttling
Rapid sequential requests across dozens of pages often trigger IP bans, CAPTCHAs, or temporary blocks. Implement randomized delays, rotate user agents, and respect robots.txt directives. For additional defensive strategies when targeting heavily protected sites, refer to How to Scrape a Static Website Without Getting Blocked.
Implement exponential backoff for 429 Too Many Requests responses and avoid aggressive concurrency. A randomized pause of 2–5 seconds per page is a reasonable baseline for most sites; cache responses locally to avoid redundant server hits during development.
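A minimal sketch of backoff on 429 responses follows. The helper names `backoff_delay` and `get_with_backoff` are hypothetical, and the choices of random jitter and a numeric `Retry-After` check are assumptions layered on top of plain exponential backoff. The `get` argument is any callable that returns an object with `status_code` and `headers`, such as `session.get` from requests.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def get_with_backoff(
    get: Callable,
    url: str,
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call get(url), retrying on HTTP 429 with growing delays."""
    for attempt in range(max_retries + 1):
        response = get(url)
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break
        # Prefer the server's Retry-After hint when it is a number of seconds.
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = backoff_delay(attempt)
        sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```

With requests, a call might look like `get_with_backoff(session.get, "https://example.com/products?page=3")`. The `sleep` parameter exists only so the delay can be replaced during testing.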
Data Deduplication and Pipeline Integration
Paginated and infinite scroll scrapers frequently encounter overlapping records due to dynamic sorting, real-time updates, or shifting content feeds. Use Python sets, SQLite unique constraints, or database UPSERT operations to track scraped IDs across sessions. Normalize timestamps, strip whitespace, and validate data types before committing records to your final storage layer.
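As an illustration of the UPSERT approach, the sketch below uses the standard-library `sqlite3` module. The `items` table, its columns, and the helper names `normalize` and `save_records` are hypothetical, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3


def normalize(record: dict) -> dict:
    """Strip surrounding whitespace from string fields before storage."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def save_records(conn: sqlite3.Connection, records: list) -> None:
    """Insert records, updating any row whose id was already scraped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id TEXT PRIMARY KEY, name TEXT, price TEXT)"
    )
    conn.executemany(
        "INSERT INTO items (id, name, price) VALUES (:id, :name, :price) "
        "ON CONFLICT(id) DO UPDATE SET "
        "name = excluded.name, price = excluded.price",
        [normalize(r) for r in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
save_records(conn, [{"id": "a1", "name": " Widget ", "price": "9.99"}])
# The same id seen again on a later page overwrites the row instead of duplicating it.
save_records(conn, [{"id": "a1", "name": "Widget", "price": "8.99"}])
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])
```

Using the primary key as the deduplication key means overlap between pages or sessions costs one redundant write rather than a duplicate row.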
Production-Ready Code Examples
1. Requests-Based Pagination Loop
Iterates through numbered pages and stops on empty results or on a 404 or 403 status code.
import random
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# A session reuses the connection and keeps cookies across page fetches.
session = requests.Session()
session.headers.update(HEADERS)

page = 1
all_data = []

while True:
    print(f"Fetching page {page}...")
    response = session.get(BASE_URL, params={"page": page}, timeout=10)

    # Stop on 404 (past the last page) or 403 (access denied).
    if response.status_code in (404, 403):
        print("End of pagination reached or access denied.")
        break
    # Fail loudly on any other error instead of parsing an error page.
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select(".product-card")

    # A 200 response with no product cards signals the end of the dataset.
    if not items:
        print("No more items found. Exiting loop.")
        break

    for item in items:
        name_el = item.select_one(".title")
        price_el = item.select_one(".price")
        all_data.append({
            "name": name_el.get_text(strip=True) if name_el else None,
            "price": price_el.get_text(strip=True) if price_el else None,
        })

    print(f"Extracted {len(items)} items from page {page}.")
    time.sleep(random.uniform(2.0, 5.0))  # randomized pause between pages
    page += 1

print(f"Total items scraped: {len(all_data)}")
2. Selenium Infinite Scroll Simulation
Uses JavaScript execution to scroll to the bottom, waits for new elements, and repeats until no new content appears.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

ITEM_SELECTOR = ".feed-item"
WAIT_TIMEOUT = 5  # seconds to wait for new items after each scroll
MAX_STALLS = 3    # consecutive scrolls without new items before stopping

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/infinite-feed")
    seen_count = len(driver.find_elements(By.CSS_SELECTOR, ITEM_SELECTOR))
    stall_counter = 0

    while stall_counter < MAX_STALLS:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # Explicit wait: block until the item count grows or the timeout expires.
            WebDriverWait(driver, WAIT_TIMEOUT).until(
                lambda d: len(d.find_elements(By.CSS_SELECTOR, ITEM_SELECTOR)) > seen_count
            )
            stall_counter = 0
        except TimeoutException:
            stall_counter += 1
        seen_count = len(driver.find_elements(By.CSS_SELECTOR, ITEM_SELECTOR))

    print("Content loading stalled. Assuming end of feed.")
    items = driver.find_elements(By.CSS_SELECTOR, ITEM_SELECTOR)
    print(f"Total items loaded: {len(items)}")
finally:
    # Always release the browser, even if the scrape raises an error.
    driver.quit()
3. Cursor-Based API Pagination
Extracts the next-page token from JSON responses to fetch subsequent datasets without relying on page numbers.
import time

import requests

API_URL = "https://api.example.com/v1/data"
params = {"limit": 50}
headers = {"Authorization": "Bearer YOUR_TOKEN"}
all_records = []

while True:
    response = requests.get(API_URL, params=params, headers=headers, timeout=10)
    response.raise_for_status()
    data = response.json()

    records = data.get("results", [])
    if not records:
        break
    all_records.extend(records)
    print(f"Fetched {len(records)} records. Total: {len(all_records)}")

    # The API signals the last page by omitting the cursor.
    next_cursor = data.get("next_cursor")
    if not next_cursor:
        print("No next cursor provided. Pagination complete.")
        break
    params["cursor"] = next_cursor
    time.sleep(1)

print(f"Successfully retrieved {len(all_records)} total records.")
Common Mistakes
- Hardcoding maximum page limits instead of dynamically detecting end-of-content signals, leading to incomplete datasets or wasted requests on empty pages.
- Failing to implement explicit waits for dynamically injected DOM elements during infinite scroll, causing the scraper to parse partially rendered HTML.
- Overlooking duplicate records caused by real-time data updates between page requests, which corrupts downstream analytics.
- Sending requests too rapidly without exponential backoff or randomized delays, triggering rate limits and IP bans.
Frequently Asked Questions
How do I know if a website uses traditional pagination or infinite scroll?
Inspect the Network tab while navigating. If new pages trigger full URL changes or predictable query parameters, it's traditional pagination. If content loads via XHR/Fetch requests without URL changes as you scroll, it's infinite scroll or dynamic loading.
Can I scrape infinite scroll sites without Selenium or Playwright?
Often yes. Many infinite scroll sites fetch data from hidden REST or GraphQL APIs. By monitoring network traffic, you can reverse-engineer the endpoints and use requests to fetch paginated JSON directly — faster and less resource-intensive than browser automation.
How do I prevent my scraper from getting stuck in an infinite loop?
Implement strict termination conditions: track consecutive empty pages, set a maximum iteration limit, verify that newly fetched data contains unique identifiers, and monitor for HTTP 403/429 status codes.
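Two of these conditions, the iteration cap and the unique-identifier check, can be combined in a small guard object. This is a hedged sketch: the `LoopGuard` class and its default limits are hypothetical, and it assumes each record carries a stable ID.

```python
class LoopGuard:
    """Stops a pagination loop that no longer produces new records."""

    def __init__(self, max_pages: int = 1000, max_stale_pages: int = 2):
        self.max_pages = max_pages
        self.max_stale_pages = max_stale_pages
        self.seen_ids = set()
        self.pages = 0
        self.stale_pages = 0

    def should_continue(self, page_ids) -> bool:
        """Record one page of IDs and report whether to fetch another."""
        self.pages += 1
        new_ids = set(page_ids) - self.seen_ids
        self.seen_ids |= new_ids
        # A page with no unseen IDs counts as stale, whether empty or repeated.
        self.stale_pages = 0 if new_ids else self.stale_pages + 1
        return (
            self.pages < self.max_pages
            and self.stale_pages < self.max_stale_pages
        )
```

Calling `guard.should_continue(ids_on_this_page)` at the end of each loop iteration, and breaking when it returns `False`, covers the case where a site keeps serving the last page for every higher page number.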