Storing and Exporting Scraped Data

Extraction is only half of a scraping pipeline; the data still has to land somewhere clean, queryable, and durable. The wrong storage choice shows up later as duplicate records, corrupted exports, or a multi-hour crawl lost to a crash. This guide covers how to pick a storage format, validate records before writing, deduplicate, and persist incrementally so progress survives failures. It is the final stage of the Scaling & Deploying Python Web Scrapers workflow and follows naturally from cleaning data in The Complete Guide to Python Web Scraping.

Pick the sink by volume and use — files for small exports, databases and Parquet at scale.

Choosing a Storage Format

Match the sink to the volume of data and how it will be consumed:

Format	Best for	Trade-offs
CSV	Small, flat datasets; spreadsheet hand-off	No nesting; type information is lost
JSON / JSON Lines	Nested records; API consumption	Larger files; not columnar
SQLite	Local structured storage, dedup, queries	Single-writer; not for heavy concurrency
PostgreSQL	Large, concurrent, relational workloads	Requires a running server
Parquet	Large analytical datasets	Columnar; needs pandas/pyarrow to read

A useful rule of thumb: CSV or JSON Lines for one-off exports, SQLite for local projects that need querying and deduplication, and PostgreSQL or Parquet once data volume or analytics demands outgrow a single file.

Validate Before You Store

Storing unvalidated scraped data poisons everything downstream. Define a schema and reject or flag malformed records at the boundary. Pydantic makes this concise and gives clear errors instead of silent corruption.

from pydantic import BaseModel, ValidationError, field_validator

class Book(BaseModel):
    title: str
    price: float
    in_stock: bool = True

    @field_validator("price", mode="before")
    @classmethod
    def clean_price(cls, v):
        return float(str(v).replace("£", "").strip())

def validate(rows: list[dict]) -> list[Book]:
    clean = []
    for row in rows:
        try:
            clean.append(Book(**row))
        except ValidationError as exc:
            print(f"Skipping invalid record: {exc}")
    return clean

Writing CSV and JSON Lines

For flat data, stream rows to CSV with the standard library. JSON Lines (one JSON object per line) is preferable to a single JSON array for large datasets because it can be appended to and read line by line without loading the whole file.

import csv, json

def to_csv(books: list[Book], path: str) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "in_stock"])
        writer.writeheader()
        for b in books:
            writer.writerow(b.model_dump())

def to_jsonl(books: list[Book], path: str) -> None:
    with open(path, "a", encoding="utf-8") as f:           # append-friendly
        for b in books:
            f.write(json.dumps(b.model_dump()) + "\n")

Always specify encoding="utf-8" — mismatched encodings are a frequent source of corruption. See Fixing Common Unicode Errors in Python Scraping when text comes out garbled.

Incremental Writes and Deduplication with SQLite

For any crawl longer than a few seconds, write as you go rather than accumulating everything in memory and dumping at the end — a crash should never cost the whole run. SQLite is ideal for this locally: it gives you durable, incremental inserts plus deduplication via a unique constraint and INSERT OR IGNORE.

import sqlite3

conn = sqlite3.connect("books.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS books (
        title TEXT,
        price REAL,
        in_stock INTEGER,
        url TEXT UNIQUE          -- natural key prevents duplicates
    )
""")

def save(book: Book, url: str) -> None:
    conn.execute(
        "INSERT OR IGNORE INTO books (title, price, in_stock, url) VALUES (?, ?, ?, ?)",
        (book.title, book.price, int(book.in_stock), url),
    )
    conn.commit()

The UNIQUE column means re-running the crawl will not create duplicates — essential for resumable or scheduled jobs. In a Scrapy project, this same logic belongs in an item pipeline.

Scaling Up: PostgreSQL and Parquet

When data outgrows a single file or needs concurrent writers, move to PostgreSQL — use psycopg's execute_many or COPY for efficient bulk inserts, and an ON CONFLICT DO NOTHING clause for deduplication. For analytical datasets measured in millions of rows, write Parquet with pandas or pyarrow: it is columnar, compressed, and dramatically faster to query than CSV.

import pandas as pd

df = pd.DataFrame(b.model_dump() for b in books)
df.drop_duplicates(subset="title").to_parquet("books.parquet", index=False)

Common Mistakes to Avoid

Buffering everything in memory: accumulating a giant list and writing once means a crash loses all of it. Stream to disk or a database incrementally.
No deduplication key: without a unique constraint, re-runs and overlapping pagination create duplicate rows. Pick a natural key (URL, ID).
Skipping validation: unvalidated records corrupt analytics silently. Validate at the storage boundary.
Wrong format for the scale: CSV for millions of rows is slow and lossy; Parquet or a database is the right tool.
Ignoring encoding: always write UTF-8 explicitly to avoid mojibake in exported text.

Frequently Asked Questions

CSV or JSON for scraped data? Use CSV for flat, tabular data destined for spreadsheets. Use JSON (or JSON Lines) when records are nested or feed an API. JSON Lines is best for large, append-as-you-go exports.

When should I use a database instead of files? Once you need querying, deduplication, concurrent writes, or resumable crawls. SQLite covers local needs with zero setup; PostgreSQL handles large, concurrent, or relational workloads.

How do I avoid duplicate records? Define a unique key (such as the source URL or an ID) and let the database enforce it with a unique constraint plus INSERT OR IGNORE (SQLite) or ON CONFLICT DO NOTHING (PostgreSQL).

What is Parquet and when should I use it? Parquet is a columnar, compressed file format optimized for analytics. Use it for large datasets you will query or load into data tools — it is far faster and smaller than CSV at scale.