Distributed Crawling with Celery and Redis
A single process can only fetch so much, even with async concurrency maxed out. When a crawl needs to span millions of URLs, survive machine reboots, and spread load across a fleet, you move from one loop to a distributed task queue. Celery is the mature Python job-queue framework, and Redis is the fastest way to run its broker, result backend, deduplication set, and rate limiter in one lightweight service. This guide builds a distributed crawler where a producer fans URL work out to many Celery workers, dedupes against a shared Redis set, throttles per domain, and retries transient failures. It is part of Scaling & Deploying Python Web Scrapers.
When to Use a Distributed Task Queue
Reach for Celery and Redis when a single machine is no longer enough — but not before, because the moving parts add real operational cost. Use this architecture when:
- The URL frontier is huge or unbounded — you discover new links as you crawl and cannot hold the whole queue in one process.
- You want horizontal scale — adding capacity should mean starting another worker, on another machine, pointed at the same broker.
- Work must survive crashes — a queued task should still be there after a worker dies, and be retried automatically.
- Different stages have different costs — cheap HTML fetches and expensive browser renders can run on separate worker pools.
If your crawl fits comfortably on one box, a single async event loop is simpler and faster to reason about — see Asynchronous Scraping with asyncio and HTTPX. And if you mostly need link-following with built-in retries on one machine, Web Scraping with Scrapy already gives you a scheduler and concurrency without a separate broker.
Prerequisites
You need Python 3.10+ and a running Redis server (5.0 or newer). Install the Python packages:
pip install "celery[redis]==5.4.0" httpx==0.27.0 redis==5.0.7
Run Redis locally with Docker while you develop:
docker run --rm -p 6379:6379 redis:7-alpine
Verify the connection before writing any tasks:
redis-cli -h 127.0.0.1 -p 6379 ping
# → PONG
1. Configure the Celery App
A single Celery instance names the app, points at the Redis broker (where tasks are queued) and result backend (where return values are stored), and sets sane defaults. Keep this in one importable module so both the producer and the workers load identical configuration.
# crawler/app.py
from celery import Celery

app = Celery(
    "crawler",
    broker="redis://127.0.0.1:6379/0",
    backend="redis://127.0.0.1:6379/1",
    include=["crawler.tasks"],  # import the task module so workers register fetch_url
)

app.conf.update(
    task_acks_late=True,  # re-queue a task if the worker dies mid-run
    task_reject_on_worker_lost=True,
    worker_prefetch_multiplier=1,  # fair dispatch for long-running fetches
    task_default_retry_delay=5,
    task_time_limit=120,  # hard kill a stuck fetch after 2 minutes
    broker_transport_options={"visibility_timeout": 300},
)
Using database 0 for the broker and 1 for the backend keeps queued work and results in separate Redis logical databases, which makes them easy to inspect and flush independently.
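The hard-coded 127.0.0.1 URLs work on a laptop but not once workers run on other hosts or in containers. As a minimal sketch, the URLs could be built from environment variables instead. The variable names CRAWLER_REDIS_HOST and CRAWLER_REDIS_PORT and the helper redis_url are assumptions made for illustration, not Celery or Redis conventions.

```python
import os


def redis_url(db: int) -> str:
    """Build a Redis URL for one logical database, defaulting to localhost."""
    # Both variable names are hypothetical; pick whatever fits your deployment.
    host = os.environ.get("CRAWLER_REDIS_HOST", "127.0.0.1")
    port = os.environ.get("CRAWLER_REDIS_PORT", "6379")
    return f"redis://{host}:{port}/{db}"


# Same layout as the guide: database 0 for the broker, 1 for the result backend.
BROKER_URL = redis_url(0)
BACKEND_URL = redis_url(1)
```

These two values would then be passed as broker= and backend= when constructing the Celery app, so the same module runs unchanged on every machine.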
2. Design the Fetch Task
A task is a normal Python function decorated with @app.task. Bind it (bind=True) so it has access to self for retries, and always send an explicit User-Agent — anonymous default clients are the first thing anti-bot systems flag.
# crawler/tasks.py
import httpx
from celery import Task

from .app import app

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
}


@app.task(bind=True, max_retries=4, acks_late=True)
def fetch_url(self: Task, url: str) -> dict[str, object]:
    try:
        with httpx.Client(timeout=15, follow_redirects=True) as client:
            response = client.get(url, headers=HEADERS)
            response.raise_for_status()
    except httpx.HTTPStatusError as exc:
        # Back off on rate-limit / server errors, give up on other client errors.
        if exc.response.status_code in (429, 500, 502, 503, 504):
            raise self.retry(exc=exc, countdown=2 ** self.request.retries)
        raise
    except httpx.TransportError as exc:
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
    return {"url": url, "status": response.status_code, "length": len(response.text)}
The 2 ** self.request.retries countdown is exponential backoff: waits of 1, 2, 4, then 8 seconds between attempts. With max_retries=4, Celery gives up after the fourth retry fails, which is five attempts in total.
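To see the schedule without running a worker, the countdown arithmetic can be reproduced in plain Python. This is only a sketch of the formula used in the task; backoff_schedule is a hypothetical helper, not part of Celery.

```python
def backoff_schedule(max_retries: int, base: int = 2) -> list[int]:
    """Countdown in seconds before each retry, mirroring base ** self.request.retries."""
    # self.request.retries is 0 on the first failure, 1 on the second, and so on.
    return [base ** retries for retries in range(max_retries)]


schedule = backoff_schedule(4)
print(schedule)       # [1, 2, 4, 8]
print(sum(schedule))  # 15 seconds of waiting in the worst case
```

The total matters when choosing broker timeouts: the worst-case wait here is short, but a larger max_retries grows it quickly.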
3. Fan Out URL Work from a Producer
The producer enqueues tasks with .delay() (or .apply_async() for more control). Because each call returns immediately, one process can dispatch thousands of URLs in a tight loop; the workers pick them up in parallel across every machine subscribed to the broker.
# crawler/producer.py
from .tasks import fetch_url


def enqueue(urls: list[str]) -> None:
    for url in urls:
        fetch_url.apply_async(args=[url], queue="fetch")


if __name__ == "__main__":
    seed = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 51)]
    enqueue(seed)
    print(f"Queued {len(seed)} URLs")
Start workers on any number of machines, all pointed at the same Redis:
celery -A crawler.app worker --queues fetch --concurrency 8 --loglevel info
Each worker runs 8 tasks at once; four such workers give you 32 concurrent fetches with no code changes.
4. Deduplicate with a Shared Redis Set
In a distributed crawl the same URL is discovered from many pages. A Redis set is the natural shared dedup store: SADD is atomic, so only the first worker to see a URL enqueues it. Redis returns 1 for a newly added member and 0 for a duplicate.
# crawler/dedup.py
import redis

_pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=2)


def is_new_url(url: str) -> bool:
    client = redis.Redis(connection_pool=_pool)
    # SADD returns the number of elements actually added (1 = new, 0 = seen).
    return client.sadd("crawl:seen", url) == 1
Wire it into discovery so only unseen links are enqueued:
# crawler/discover.py
from .dedup import is_new_url
from .tasks import fetch_url


def enqueue_links(links: list[str]) -> int:
    queued = 0
    for link in links:
        if is_new_url(link):
            fetch_url.apply_async(args=[link], queue="fetch")
            queued += 1
    return queued
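Set an expiry (client.expire("crawl:seen", 86400)) if you want the frontier to reset daily rather than growing forever. The TTL applies to the whole set, not to individual URLs, so set it once after the first URL is added; calling it on every insert keeps pushing the expiry back.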
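A set only dedupes strings that match exactly, so two spellings of the same page would both be enqueued. One possible approach is to canonicalize each URL before the SADD check. The sketch below uses only the standard library; normalize_url is a hypothetical helper and its rules (lowercase scheme and host, drop the fragment, default an empty path to /) are assumptions you may need to extend for a given site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of url suitable for use as a dedup key."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()   # host names are case-insensitive
    path = parts.path or "/"        # treat "https://host" and "https://host/" alike
    # Fragments never reach the server, so they are dropped; the query is kept.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(normalize_url("HTTPS://Books.ToScrape.com/index.html#top"))
# https://books.toscrape.com/index.html
```

In discovery, the normalized string would be what gets passed to is_new_url and to apply_async, so the stored key and the fetched URL stay in step.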
5. Rate-Limit Per Domain
Celery's built-in rate_limit throttles a task type globally, but polite crawling usually means per-domain limits. A small fixed-window counter in Redis, keyed by domain and by the current second, gives you that cheaply across all workers with no Lua script, because INCR is atomic.
# crawler/ratelimit.py
import time

import redis

_pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=2)


def allow_request(domain: str, max_per_second: int = 4) -> bool:
    client = redis.Redis(connection_pool=_pool)
    bucket = f"rate:{domain}:{int(time.time())}"
    count = client.incr(bucket)
    if count == 1:
        client.expire(bucket, 2)
    return count <= max_per_second
Inside the task, requeue politely instead of hammering when the budget is spent:
# inside fetch_url, before the request
from urllib.parse import urlparse
from .ratelimit import allow_request

domain = urlparse(url).netloc
if not allow_request(domain):
    # Budget spent for this second: requeue. This retry counts toward max_retries.
    raise self.retry(countdown=1)
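The counting logic can be explored without a Redis server. The following is an in-memory stand-in for the per-second key scheme, written as a sketch: FixedWindowCounter is a hypothetical class, the clock is passed in explicitly so the behaviour is reproducible, and unlike the Redis version it is not shared between processes and never expires old buckets.

```python
from collections import defaultdict


class FixedWindowCounter:
    """In-memory model of the rate:{domain}:{second} counter."""

    def __init__(self, max_per_second: int = 4) -> None:
        self.max_per_second = max_per_second
        self._buckets: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, domain: str, now: float) -> bool:
        key = (domain, int(now))   # one bucket per domain per whole second
        self._buckets[key] += 1    # plays the role of INCR
        return self._buckets[key] <= self.max_per_second


limiter = FixedWindowCounter(max_per_second=4)
same_second = [limiter.allow("books.toscrape.com", 100.2) for _ in range(5)]
next_second = limiter.allow("books.toscrape.com", 101.0)
print(same_second)  # [True, True, True, True, False]
print(next_second)  # True
```

One property of fixed windows worth knowing: four requests at the end of one second and four at the start of the next are all allowed, so short bursts can reach twice the nominal rate.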
Performance and Scaling Considerations
- Match concurrency to the workload. For I/O-bound fetches, run high --concurrency (or the eventlet/gevent pools) so each worker keeps many sockets busy. For CPU-heavy parsing, keep concurrency near the core count and split parsing into its own queue.
- Use dedicated queues. Route cheap fetches to a fetch queue and expensive browser renders to a render queue, then scale their worker pools independently.
- Async inside each worker. A worker task can itself run an asyncio batch to fetch many URLs per task, combining process-level distribution with in-process concurrency from Asynchronous Scraping with asyncio and HTTPX.
- Watch the broker. Redis holds the queue in memory; a runaway producer can exhaust RAM. Cap the frontier, set maxmemory with an eviction policy, and monitor queue depth with redis-cli llen.
- Result backend hygiene. Storing every return value in Redis adds up. Set result_expires or use ignore_result=True for fire-and-forget fetches and persist real data to a database instead; see Storing and Exporting Scraped Data.
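Several of the points above come down to a few settings. As a sketch, they could live in a plain settings module that the app loads with app.config_from_object("crawler.celeryconfig"). The module name crawler/celeryconfig.py, the render_page task, and the specific values are assumptions for illustration; the setting names themselves are standard Celery options.

```python
# crawler/celeryconfig.py (hypothetical module name)

# Send each task type to its own queue so worker pools scale independently.
task_routes = {
    "crawler.tasks.fetch_url": {"queue": "fetch"},
    "crawler.tasks.render_page": {"queue": "render"},  # hypothetical render task
}

# Drop stored results after an hour instead of letting them accumulate in Redis.
result_expires = 3600

# Recycle each worker process after 100 tasks to release leaked memory.
worker_max_tasks_per_child = 100
```

A render pool would then be started separately from the fetch pool, for example with --queues render and a lower --concurrency than the fetch workers use.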
Common Errors and Fixes
kombu.exceptions.OperationalError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
The broker is unreachable. Confirm Redis is running (redis-cli ping returns PONG) and that the broker= URL host, port, and database match. Inside Docker networks, use the service name, not 127.0.0.1.
Tasks queue but never run. The worker is listening on a different queue than the producer targets. If you enqueue with queue="fetch", start the worker with --queues fetch. Without --queues, workers only consume the default celery queue.
Received unregistered task of type 'crawler.tasks.fetch_url'. The worker did not import the task module. Start it with the app that imports your tasks (celery -A crawler.app worker) and ensure crawler/tasks.py is imported when crawler.app loads, or set app.autodiscover_tasks(["crawler"]).
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL). The OS killed a worker, usually out-of-memory from oversized responses or too-high concurrency. Lower --concurrency, stream large bodies, and set worker_max_tasks_per_child=100 so workers recycle and release leaked memory.
Duplicate work despite the dedup set. You are enqueuing before checking is_new_url, or checking in the producer and re-discovering in the worker. Do the SADD check exactly once, at the moment a URL is discovered, before apply_async.
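For the out-of-memory case above, "stream large bodies" can be made concrete with a size cap. The sketch below works on any iterable of byte chunks; read_capped and BodyTooLarge are hypothetical names. In the fetch task the chunks would come from httpx, for instance response.iter_bytes() inside a client.stream("GET", url) block, and the 5 MB figure in the comment is an arbitrary assumption.

```python
from collections.abc import Iterable


class BodyTooLarge(Exception):
    """Raised when a response body exceeds the configured cap."""


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks, aborting before the buffer grows past max_bytes."""
    buffer = bytearray()
    for chunk in chunks:
        if len(buffer) + len(chunk) > max_bytes:
            raise BodyTooLarge(f"body exceeds {max_bytes} bytes")
        buffer.extend(chunk)
    return bytes(buffer)


# With httpx this might look like:
#   with client.stream("GET", url) as response:
#       body = read_capped(response.iter_bytes(), 5_000_000)
print(read_capped([b"<html>", b"</html>"], max_bytes=1024))  # b'<html></html>'
```

Because the check runs before each chunk is appended, an oversized download is abandoned early rather than held in memory in full.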
Frequently Asked Questions
Do I have to use Redis, or can Celery use RabbitMQ? Both are supported. RabbitMQ is a dedicated message broker with richer routing and stronger delivery guarantees; Redis is simpler, faster to stand up, and doubles as your dedup set, cache, and rate limiter. For most scraping workloads Redis is the pragmatic choice. If you want a lighter framework than Celery entirely, weigh the trade-offs in Celery vs RQ for Scraping Task Queues.
How is this different from just running Scrapy? Scrapy distributes concurrency within one process and one machine extremely well. Celery distributes tasks across machines and survives restarts through a durable broker. Many teams run Scrapy spiders as Celery tasks to get both — Scrapy's crawling engine inside each distributed worker.
How do I stop workers from getting IP-banned at scale? Per-domain rate limiting is the first defense, but many workers still share few egress IPs. Route requests through rotating proxies so the target sees many source addresses — see Rotating Proxies and Managing IP Blocks.
Where should scraped results actually be stored? Not in the Celery result backend long-term. Return small metadata from tasks, but write the extracted records to a database or file store from within the task. See Storing and Exporting Scraped Data for durable sinks and schema validation.
Can I schedule recurring crawls with Celery? Yes — Celery Beat is a built-in scheduler that enqueues tasks on a cron-like timetable. For lighter needs, a plain cron job or a scheduled CI workflow can trigger the producer instead; see Scheduling Scrapers with Cron and GitHub Actions.