Scheduling Scrapers with Cron and GitHub Actions

Many scrapers do not need to run continuously — they need to run regularly: every night, every hour, every Monday. Two tools cover almost every periodic case: classic cron on a machine you control, and GitHub Actions scheduled workflows that run in the cloud with no server of your own. This guide shows both, plus how to handle secrets and export results. It builds on Deploying Python Scrapers to the Cloud.

[Figure: Scheduled scraper runs with cron. A cron or GitHub Actions schedule (cron: 0 3 * * *, UTC, daily) triggers the scraper at fixed times, shown as runs on Mon, Tue, and Wed at 03:00. Each run installs dependencies, fetches the target with an explicit user agent, and exports results to storage (repo / DB / bucket).]
A cron schedule fires the scraper on a timetable; each run fetches, then exports results to storage.

Quick answer: Use plain cron when you already have an always-on VM and want full control over the environment. Use a GitHub Actions scheduled workflow when you would rather not run a server at all — GitHub provides the runner, you provide a cron expression and a script, and results get committed to the repo or pushed to storage. For most small periodic scrapers, the Actions route is the lower-maintenance choice.

Scheduling with System Cron

On a VM or any always-on Linux box, cron is the simplest scheduler there is. Each line in a crontab is a schedule plus a command. The five fields are minute, hour, day-of-month, month, and day-of-week.

# crontab -e  → run the scraper every day at 03:00, log output
0 3 * * * /home/scraper/.venv/bin/python /home/scraper/app/run.py >> /home/scraper/logs/scrape.log 2>&1
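
To make the five fields concrete, here is a minimal sketch of how a cron expression could be checked against a timestamp. The helper names (expand_field, cron_matches) are hypothetical, the sketch supports only *, single values, lists, ranges, and steps, and it simplifies real cron in one respect noted in the comments. It is an illustration of the field layout, not a replacement for cron.

```python
from datetime import datetime

# (name, lowest, highest) for each of the five crontab fields, in order.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("day_of_month", 1, 31),
    ("month", 1, 12),
    ("day_of_week", 0, 6),  # 0 = Sunday
]


def expand_field(spec: str, low: int, high: int) -> set[int]:
    """Expand one field such as "*", "5", "1-5", "*/15" or "0,30" into a set."""
    values: set[int] = set()
    for part in spec.split(","):
        base, _, step = part.partition("/")
        stride = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = int(base)
            # A bare number with a step is treated here as "from N to the max".
            end = high if step else start
        values.update(range(start, end + 1, stride))
    return values


def cron_matches(expr: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by `expr`."""
    parts = expr.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    allowed = {
        name: expand_field(spec, low, high)
        for spec, (name, low, high) in zip(parts, FIELDS)
    }
    cron_weekday = (when.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
    # Simplification: all five fields must match. Real cron treats
    # day-of-month and day-of-week as "either one" when both are restricted.
    return (
        when.minute in allowed["minute"]
        and when.hour in allowed["hour"]
        and when.day in allowed["day_of_month"]
        and when.month in allowed["month"]
        and cron_weekday in allowed["day_of_week"]
    )


# 2024-01-01 was a Monday.
print(cron_matches("0 3 * * *", datetime(2024, 1, 1, 3, 0)))  # daily 03:00 → True
print(cron_matches("0 3 * * 1", datetime(2024, 1, 2, 3, 0)))  # Mondays only → False
```

Reading the expression this way also explains the crontab line above: minute 0, hour 3, and a wildcard in the remaining three fields select 03:00 on every day.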

The scraper itself is an ordinary script. Read secrets from the environment, send an explicit User-Agent, and exit non-zero on failure so cron's mail or your log makes the problem visible.

# run.py
import os
import sys
import csv
import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/125.0 Safari/537.36",
}

def scrape(url: str) -> list[dict[str, str]]:
    with httpx.Client(timeout=15, follow_redirects=True) as client:
        response = client.get(url, headers=HEADERS)
    response.raise_for_status()
    # Parsing omitted for brevity; return one record per row.
    return [{"url": url, "bytes": str(len(response.text))}]

def main() -> None:
    # "or" also covers an empty TARGET_URL, which GitHub Actions sets when the secret is undefined.
    rows = scrape(os.environ.get("TARGET_URL") or "https://books.toscrape.com/")
    with open("results.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["url", "bytes"])
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    try:
        main()
    except httpx.HTTPError as exc:
        print(f"scrape failed: {exc}", file=sys.stderr)
        sys.exit(1)

Cron gives you control but also responsibility: you patch the box, rotate the logs, and keep the machine alive.
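
Log rotation is one of those responsibilities that can be handled inside the script rather than with a separate tool. The following is a minimal sketch using the standard library's RotatingFileHandler; the function name build_logger, the logger name, and the size and backup defaults are assumptions chosen for illustration, not values from the original setup.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(
    log_path: str,
    max_bytes: int = 1_000_000,  # assumed cap of roughly 1 MB per file
    backups: int = 5,            # assumed number of rotated files to keep
    name: str = "scraper",
) -> logging.Logger:
    """Return a logger that writes to log_path and rotates it by size."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers if called twice
        handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, run.py could call build_logger("/home/scraper/logs/scrape.log") once at startup and log through the returned object. When the file reaches the size cap it is renamed to scrape.log.1, scrape.log.2, and so on, and the oldest backup is dropped. If you adopt this, consider removing the >> redirect from the crontab line or pointing it at a different file, so two writers do not share one log.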

Scheduling with GitHub Actions

A GitHub Actions workflow with an on: schedule trigger runs the scraper on GitHub's infrastructure — no server to maintain. The cron expression uses the same five fields, always in UTC. This workflow checks out the repo, installs dependencies, runs the scraper, and commits the output back.

# .github/workflows/scrape.yml
name: scheduled-scrape
on:
  schedule:
    - cron: "0 3 * * *"      # 03:00 UTC daily
  workflow_dispatch: {}       # allow manual runs from the Actions tab

jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write         # lets the job's GITHUB_TOKEN push the results commit
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install httpx==0.27.0
      - name: Run scraper
        env:
          TARGET_URL: ${{ secrets.TARGET_URL }}
          PROXY_URL: ${{ secrets.PROXY_URL }}
        run: python run.py
      - name: Commit results
        run: |
          git config user.name "scraper-bot"
          git config user.email "bot@users.noreply.github.com"
          git add results.csv
          git commit -m "data: scheduled scrape $(date -u +%FT%TZ)" || echo "no changes"
          git push

Add workflow_dispatch so you can trigger a run by hand from the Actions tab while testing, without waiting for the schedule.
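
Because the workflow's cron expression is always interpreted in UTC, a local target time has to be converted before it goes into the schedule. The sketch below shows one way to do that for a daily job; daily_utc_cron is a hypothetical helper, and the fixed +02:00 offset in the example is an assumption used only to keep the snippet self-contained.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def daily_utc_cron(local_time: time, tz: tzinfo, on: date) -> str:
    """Return a daily cron expression (UTC) for local_time in tz on a given date."""
    local_dt = datetime.combine(on, local_time, tzinfo=tz)
    utc_dt = local_dt.astimezone(timezone.utc)
    # Only minute and hour matter for a daily schedule.
    return f"{utc_dt.minute} {utc_dt.hour} * * *"


# 03:00 at a fixed UTC+02:00 offset is 01:00 UTC.
plus_two = timezone(timedelta(hours=2))
print(daily_utc_cron(time(3, 0), plus_two, date(2024, 7, 1)))  # → 0 1 * * *
```

Passing a zoneinfo.ZoneInfo zone instead of a fixed offset makes the result depend on the date: a zone with daylight saving time yields different UTC hours in summer and winter. A single UTC cron expression therefore cannot follow local wall-clock time all year; either accept the one-hour drift or update the schedule when the clocks change.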

Passing Secrets Safely

Never hardcode proxy credentials or API keys in the workflow file, because it lives in the repository. Store them as encrypted repository secrets (Settings → Secrets and variables → Actions) and map them into a step's env: block with ${{ secrets.NAME }}, which exposes them to that step as environment variables whose values GitHub masks in logs. Your code reads them the same way it would locally. A secret that is not defined expands to an empty string, so treat an empty value as missing.

# fetch.py
import os
import httpx

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

def fetch(url: str) -> httpx.Response:
    # From ${{ secrets.PROXY_URL }}; an unset or empty value means "no proxy".
    proxy = os.environ.get("PROXY_URL") or None
    with httpx.Client(proxy=proxy, timeout=20, headers=HEADERS) as client:
        response = client.get(url)
    response.raise_for_status()
    return response

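GitHub masks exact secret values in logs, but the masking is a string match: a value that is encoded, split, or otherwise transformed before printing is not caught. Avoid logging secrets at all; on a public repository the run log is readable by anyone.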
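
When a proxy URL has to appear in a log line, for example to record which endpoint a run used, one option is to strip the credentials first. This is a minimal sketch with a hypothetical helper name (redact_url) and a made-up example host; it uses only the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Replace any user:password part of a URL with *** before logging it."""
    parts = urlsplit(url)
    if parts.username is None and parts.password is None:
        return url  # nothing sensitive in the authority part
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(
        (parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment)
    )


# proxy.example.com and the credentials are placeholders.
print(redact_url("http://user:s3cret@proxy.example.com:8080"))
# → http://***@proxy.example.com:8080
```

Logging redact_url(proxy) instead of proxy keeps the host and port visible for debugging while the username and password never reach the run log.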

Exporting Results

A scheduled run is only useful if its output persists somewhere. Two common patterns:

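  • Commit to the repo: for small datasets, committing results.csv gives you a free, versioned history of every run (shown in the workflow above). Keep an eye on repo size for large or frequent outputs.
  • Push to external storage: for anything sizable, upload to object storage or a database instead of committing, using an upload step with credentials from secrets. For transient files that only need to outlive the run, the built-in artifact upload below is enough; artifacts expire after a retention period (90 days by default) and are not durable storage.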
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: scrape-results
          path: results.csv

For choosing a durable sink and validating records before you store them, see Storing and Exporting Scraped Data.
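
Whichever sink you choose, it helps if a failed or empty run cannot overwrite the previous good output. The sketch below writes the CSV to a temporary file and swaps it into place only when the write succeeds. The function name write_csv_atomically and the rule of rejecting an empty result set are assumptions for illustration; adjust the check to whatever "valid" means for your data.

```python
import csv
import os
import tempfile


def write_csv_atomically(
    path: str, rows: list[dict[str, str]], fieldnames: list[str]
) -> None:
    """Write rows to path without ever leaving a half-written file behind."""
    if not rows:
        # Assumed policy: zero rows usually means the parse broke,
        # so keep the previous file instead of replacing it.
        raise ValueError("refusing to overwrite results with an empty dataset")
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_path, path)  # atomic swap on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file, then re-raise
        raise
```

In run.py this could replace the open("results.csv", "w") block in main(). Combined with the non-zero exit code, a bad run then leaves yesterday's results.csv untouched, and the commit step sees no changes.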

Edge Cases and Caveats

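  • Schedules are UTC and best-effort. GitHub Actions cron times are UTC, and runs can start late during periods of high load, which are most common at the top of the hour; under heavy load a queued run can even be dropped. Choosing an off-peak minute (for example 17 3 * * * instead of 0 3 * * *) reduces the delay. Never assume second-level precision.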
  • Scheduled workflows get disabled on inactive repos. In a public repository, GitHub automatically disables scheduled workflows after 60 days without repository activity. Once that happens, re-enable the workflow from the Actions tab.
  • Shared runner egress IPs. GitHub runners scrape from cloud IP ranges that many sites block. Route through proxies from Rotating Proxies and Managing IP Blocks for anything defended.
  • Overlapping runs. A slow job can still be running when the next schedule fires. Add a concurrency group in the workflow, or keep runs comfortably shorter than the interval.
  • Cron is not for high volume. Both cron and Actions are for periodic batch jobs. For large, continuous, distributed crawls, drive a queue instead — see Distributed Crawling with Celery and Redis.
  • Free-tier minutes. Standard GitHub-hosted runners are free for public repositories, but private repositories get a limited monthly allowance of minutes; long or frequent scrapes can exhaust it, after which runs cost money.
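
For the overlapping-runs case on a cron box, where there is no concurrency group to lean on, a lock file is a common guard. The following is a minimal sketch for Linux using the standard library's fcntl.flock; the names single_instance and AlreadyRunning and the lock path are hypothetical.

```python
import fcntl
from contextlib import contextmanager
from typing import Iterator


class AlreadyRunning(RuntimeError):
    """Raised when another run still holds the lock."""


@contextmanager
def single_instance(lock_path: str) -> Iterator[None]:
    """Hold an exclusive, non-blocking lock on lock_path for the duration of the block."""
    fh = open(lock_path, "w")
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise AlreadyRunning(f"another run holds {lock_path}") from None
        yield
    finally:
        fh.close()  # closing the file releases the lock
```

A cron-driven run.py could wrap its work as with single_instance("/home/scraper/scrape.lock"): main() and exit quietly on AlreadyRunning. Because the kernel drops the lock when the process ends, a crashed run does not leave a stale lock behind, which is the usual weakness of "create a file, delete it at the end" schemes.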

Frequently Asked Questions

Cron or GitHub Actions — which should I pick? Use cron if you already run an always-on VM and want full control of the environment. Use GitHub Actions if you would rather not maintain a server: GitHub supplies the runner and you supply a cron expression and script. For small periodic scrapers, Actions is usually less work.

Why did my scheduled workflow stop running? The most common cause is repository inactivity: in public repositories, GitHub disables scheduled workflows after 60 days without repository activity. Re-enable the workflow from the Actions tab. Also check that the cron expression is valid and that the workflow file is on the default branch, and remember the times are UTC.

How do I keep proxy passwords out of the workflow file? Store them as encrypted repository secrets and reference them as ${{ secrets.NAME }}, which injects masked environment variables. Read them with os.environ in your code, exactly as covered in Deploying Python Scrapers to the Cloud, and never print them.

Can I commit scraped data back to the repository? Yes, and it is handy for small datasets — you get a versioned history of every run for free. For large or frequent output, push to object storage or a database instead and use upload-artifact for transient files, to avoid bloating the repository.