
Running Python Scrapers on AWS Lambda

AWS Lambda runs your code on demand with no server to manage and no cost while idle — an appealing fit for scrapers that run periodically or in short bursts. But its execution model imposes real limits that shape how you build. This guide covers packaging with a Lambda layer, the headless-browser and timeout ceilings, and scheduling with EventBridge. It builds on Deploying Python Scrapers to the Cloud.

[Figure: AWS Lambda scraper invocation flow. An EventBridge cron schedule triggers the Lambda handler (15 min max), which loads dependencies from a layer at /opt/python (httpx), fetches the target site with an HTTP GET, and stores results in an S3 bucket.]
EventBridge invokes a Lambda function on a schedule; the function uses a dependency layer and writes results to S3.

Quick answer: Lambda is excellent for lightweight HTTP scrapers that finish in seconds and export their results to S3 or a database. Package dependencies as a layer or container image, keep each invocation well under the 15-minute hard limit, and trigger it on a schedule with EventBridge. Full headless browsers are possible but awkward — for heavy browser automation, a container is usually the better home.

Packaging Dependencies with a Lambda Layer

Lambda's Python runtime includes the standard library and the AWS SDK (boto3), but nothing else, so third-party packages such as httpx must ship with the function. A layer is a zip of dependencies that Lambda extracts under /opt, with Python packages importable from /opt/python. It is kept separate from your code so you can update logic without rebuilding the whole bundle. Build it against a platform matching Lambda's environment so compiled wheels are compatible.

mkdir -p layer/python
pip install httpx==0.27.0 -t layer/python \
  --platform manylinux2014_x86_64 --only-binary=:all: --python-version 3.12
cd layer && zip -r ../httpx-layer.zip python && cd ..
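
Before publishing, a quick check of the archive can catch a misplaced top-level directory or an oversized bundle. The following is a minimal sketch, not part of any AWS tooling: check_layer_zip is a hypothetical helper, and the size constant assumes the 250 MB unzipped cap that applies to a function and its layers combined.

```python
import io
import zipfile

# Assumption: 250 MB unzipped cap, shared by the function code and all its layers.
MAX_UNZIPPED_BYTES = 250 * 1024 * 1024


def check_layer_zip(zip_source) -> dict:
    """Report layout and size problems in a Python layer zip.

    zip_source may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        infos = archive.infolist()
        # Packages are only importable from /opt/python if they sit under python/.
        misplaced = [i.filename for i in infos if not i.filename.startswith("python/")]
        # Sum of uncompressed sizes, which is what the unzipped cap measures.
        unzipped = sum(i.file_size for i in infos)
    return {
        "misplaced": misplaced,
        "unzipped_bytes": unzipped,
        "fits": unzipped <= MAX_UNZIPPED_BYTES,
    }
```

Calling check_layer_zip("httpx-layer.zip") on the archive built above should return an empty misplaced list; anything listed there would not be importable from the layer.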

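Publish the layer; the command returns a LayerVersionArn, which you then attach to your function with aws lambda update-function-configuration --layers: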

aws lambda publish-layer-version \
  --layer-name httpx-deps \
  --zip-file fileb://httpx-layer.zip \
  --compatible-runtimes python3.12

For anything with heavy or binary dependencies, a container image (up to 10 GB) is simpler than juggling layers — Lambda runs OCI images directly.

The Handler Function

Lambda calls a handler function with two arguments, event and context. Read the target from the event so one function can scrape different URLs, always send an explicit User-Agent, and write results to durable storage, because nothing on the local filesystem is guaranteed to survive between invocations.

# handler.py
import json
import os
import httpx
import boto3

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
}

def lambda_handler(event: dict, context: object) -> dict:
    url = event.get("url", "https://books.toscrape.com/")
    with httpx.Client(timeout=10, follow_redirects=True) as client:
        response = client.get(url, headers=HEADERS)
    response.raise_for_status()

    bucket = os.environ["RESULT_BUCKET"]
    key = f"pages/{context.aws_request_id}.html"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=response.content)

    return {"statusCode": 200, "body": json.dumps({"url": url, "saved": key})}

Only /tmp is writable (512 MB by default), so treat it as scratch space and persist real output to S3 or a database.
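
To exercise handler logic locally without deploying, one option is to inject the network and storage calls and pass a stand-in context. The sketch below is an illustration under assumptions, not the handler above: FakeContext and make_handler are hypothetical names, and the in-memory dict stands in for S3.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeContext:
    """Hypothetical stand-in for the Lambda context object (local testing only)."""
    aws_request_id: str = "local-test"


def make_handler(fetch: Callable[[str], bytes], store: Callable[[str, bytes], None]):
    """Build a handler whose fetch and store steps are supplied by the caller."""
    def handler(event: dict, context) -> dict:
        url = event.get("url", "https://books.toscrape.com/")
        body = fetch(url)                              # an httpx GET in production
        key = f"pages/{context.aws_request_id}.html"   # same key scheme as handler.py
        store(key, body)                               # an S3 put_object in production
        return {"statusCode": 200, "body": json.dumps({"url": url, "saved": key})}
    return handler


# Local run with in-memory fakes: no AWS account or network access needed.
saved: dict[str, bytes] = {}
handler = make_handler(lambda url: b"<html></html>", saved.__setitem__)
result = handler({"url": "https://example.com/"}, FakeContext())
```

In a deployed function the two callables would wrap the httpx request and the S3 write; keeping them injectable means the event parsing and key naming can be checked in a plain unit test.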

Headless-Browser Limits

Running Selenium or Playwright on Lambda is possible but constrained. Chromium plus its dependencies is large, and Lambda caps deployment packages, memory (up to 10 GB), and /tmp. The practical path is a container image bundling a headless-Chromium build compiled for Lambda, invoked with flags that keep it inside the sandbox:

# browser_handler.py — inside a container image with chromium bundled
from playwright.sync_api import sync_playwright

UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0 Safari/537.36"

def lambda_handler(event: dict, context: object) -> dict:
    url = event.get("url", "https://books.toscrape.com/")
    with sync_playwright() as p:
        browser = p.chromium.launch(args=[
            "--no-sandbox",            # Lambda has no user namespaces
            "--single-process",        # fit within the constrained runtime
            "--disable-dev-shm-usage", # /dev/shm is tiny on Lambda
        ])
        page = browser.new_page(user_agent=UA)
        page.goto(url, wait_until="networkidle")
        title = page.title()
        browser.close()
    return {"statusCode": 200, "body": title}

If you find yourself fighting these limits, that is the signal to move browser work to a container service instead — for the trade-offs, see Deploying Python Scrapers to the Cloud, and compare stealth browser options in Using Playwright for Modern Web Automation.

Timeouts and Working Within Them

Lambda's maximum execution time is 15 minutes, and each function has its own configurable timeout up to that ceiling (the default is only 3 seconds, so raise it deliberately). A crawl that loops over thousands of URLs will hit that wall and lose everything in flight. The fix is to keep each invocation small: one function scrapes one page (or a small batch) and writes its output, and a queue or scheduler drives many invocations in parallel. This turns a long serial crawl into many short, independent, retry-safe runs. It is the same fan-out principle behind Distributed Crawling with Celery and Redis, implemented with Lambda concurrency instead of worker processes.
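
One way to sketch the small-batch approach: split the URL list into batches, and inside each invocation stop early when the remaining time runs low. The context method get_remaining_time_in_millis() is part of the Lambda Python context object; chunked, scrape_batch, scrape_one, and the 30-second safety margin are assumptions chosen for illustration.

```python
from typing import Callable, Iterable


def chunked(urls: list[str], size: int) -> list[list[str]]:
    """Split a URL list into batches, one batch per invocation."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def scrape_batch(urls: Iterable[str], context, scrape_one: Callable[[str], None],
                 safety_ms: int = 30_000) -> dict:
    """Process URLs until the remaining time drops below safety_ms.

    Anything not processed is returned so the caller can re-queue it.
    """
    done: list[str] = []
    leftover: list[str] = []
    for url in urls:
        # Once the budget is exhausted, hand back every remaining URL untouched.
        if leftover or context.get_remaining_time_in_millis() < safety_ms:
            leftover.append(url)
            continue
        scrape_one(url)
        done.append(url)
    return {"done": done, "leftover": leftover}
```

The leftover list lets a queue consumer or the next scheduled run pick up where this invocation stopped, rather than losing in-flight work to a hard timeout.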

Scheduling with EventBridge

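To run a scraper on a timetable, an EventBridge (formerly CloudWatch Events) rule invokes the function on a cron or rate schedule with no server involved. Create a rule, grant it permission to invoke the function, and wire it to the function as a target:

aws events put-rule \
  --name nightly-scrape \
  --schedule-expression "cron(0 3 * * ? *)"   # 03:00 UTC daily

# Without this resource-based permission, the rule cannot invoke the function
aws lambda add-permission \
  --function-name scraper --statement-id nightly-scrape \
  --action lambda:InvokeFunction --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:123456789012:rule/nightly-scrape

aws events put-targets \
  --rule nightly-scrape \
  --targets '[{"Id":"1","Arn":"arn:aws:lambda:us-east-1:123456789012:function:scraper","Input":"{\"url\":\"https://books.toscrape.com/\"}"}]'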

The Input field passes a fixed event payload, so the same function can be scheduled multiple times with different targets. If you prefer scheduling from a repository instead of AWS, a scheduled workflow can invoke Lambda or run the scrape directly — see Scheduling Scrapers with Cron and GitHub Actions.
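
To illustrate scheduling one function against several URLs, the target list can be generated rather than written by hand. This is a minimal sketch: build_targets is a hypothetical helper, the second URL is only an example, and the ARN is the placeholder used above.

```python
import json


def build_targets(function_arn: str, urls: list[str]) -> list[dict]:
    """Build one EventBridge target per URL, all pointing at the same function."""
    return [
        {
            "Id": f"scrape-{index}",            # target IDs must be unique within a rule
            "Arn": function_arn,
            "Input": json.dumps({"url": url}),  # Input is a JSON string, not a dict
        }
        for index, url in enumerate(urls, start=1)
    ]


targets = build_targets(
    "arn:aws:lambda:us-east-1:123456789012:function:scraper",
    ["https://books.toscrape.com/", "https://quotes.toscrape.com/"],
)
```

The resulting list has the shape that boto3's events client expects, so it could be passed as put_targets(Rule="nightly-scrape", Targets=targets). EventBridge limits how many targets a rule can hold, so long URL lists are better served by several rules or a queue.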

Edge Cases and Caveats

  • Cold starts add latency. The first invocation after idle loads the runtime and dependencies; large layers make it slower. Keep packages lean, or use provisioned concurrency for latency-sensitive jobs.
  • Shared, uncontrolled egress IPs. Lambda scrapes from AWS ranges that many sites block. Route through proxies — see Rotating Proxies and Managing IP Blocks. A VPC NAT gateway gives a stable IP but is easy to block once noticed.
  • /tmp is ephemeral and small. Never rely on local files persisting between invocations; write to S3 or a database every run.
  • Package-size limits. Zip deployment packages are capped at 50 MB zipped for direct upload and 250 MB unzipped, counting the function and all its layers together; browser builds usually force the container-image route.
  • Concurrency caps and cost. Massive fan-out can hit account concurrency limits and run up per-millisecond charges. Watch both when scaling out.

Frequently Asked Questions

Can I run Selenium or Playwright on Lambda? Yes, using a container image that bundles a Lambda-compatible headless Chromium and launching with --no-sandbox and --single-process. It works but is finicky and memory-hungry; for heavy browser automation a container service is usually a better fit than Lambda.

How do I avoid hitting the 15-minute timeout? Do not run a long serial crawl in one invocation. Split the work so each invocation handles one page or a small batch and persists its result, then drive many invocations concurrently via a queue or EventBridge. Short, independent runs are also easier to retry.

Where should Lambda scrapers store their data? In durable managed storage: S3 for raw pages, or a hosted database for structured records. Files in /tmp may linger while an execution environment stays warm, but they are never guaranteed to survive to the next invocation, so persist everything externally. See Storing and Exporting Scraped Data.

Is Lambda cheaper than a VM for scraping? For infrequent or bursty jobs, usually yes — you pay only while the function runs. For continuous, high-volume crawling the per-invocation and per-millisecond charges can exceed a right-sized VM. Match the model to your duty cycle, as discussed in Deploying Python Scrapers to the Cloud.
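
The duty-cycle comparison can be made concrete with a rough estimate. The sketch below is illustrative arithmetic only: lambda_monthly_cost is a hypothetical helper, it ignores the free tier, data transfer, and storage, and every price passed in is a placeholder to be replaced with current figures for your region.

```python
def lambda_monthly_cost(invocations: int, avg_seconds: float, memory_gb: float,
                        price_per_gb_second: float,
                        price_per_million_requests: float) -> float:
    """Estimate monthly Lambda spend from duration-based and per-request charges."""
    # Compute charge scales with run time multiplied by configured memory.
    compute = invocations * avg_seconds * memory_gb * price_per_gb_second
    # Request charge is billed per invocation, quoted per million.
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


def cheaper_on_lambda(lambda_cost: float, vm_monthly_cost: float) -> bool:
    """Compare the estimate against a fixed monthly VM price."""
    return lambda_cost < vm_monthly_cost


# Placeholder prices, NOT quoted from AWS: look up real rates before relying on this.
estimate = lambda_monthly_cost(
    invocations=1_000_000, avg_seconds=2.0, memory_gb=0.5,
    price_per_gb_second=0.00002, price_per_million_requests=0.20,
)
```

Rerunning the estimate with your real invocation count and average duration shows where the duty cycle tips from bursty (Lambda-friendly) to continuous (VM-friendly).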