Storing and Exporting Scraped Data
Extraction is only half of a scraping pipeline; the data still has to land somewhere clean, queryable, and durable. The wrong storage choice shows up later as duplicate records, corrupted exports, or a multi-hour crawl lost to a crash. This guide covers how to pick a storage format, validate records before writing, deduplicate, and persist incrementally so progress survives failures. It is the final stage of the Scaling & Deploying Python Web Scrapers workflow and follows naturally from cleaning data in The Complete Guide to Python Web Scraping.
Choosing a Storage Format
Match the sink to the volume of data and how it will be consumed:
| Format | Best for | Trade-offs |
|---|---|---|
| CSV | Small, flat datasets; spreadsheet hand-off | No nesting; type information is lost |
| JSON / JSON Lines | Nested records; API consumption | Larger files; not columnar |
| SQLite | Local structured storage, dedup, queries | Single-writer; not for heavy concurrency |
| PostgreSQL | Large, concurrent, relational workloads | Requires a running server |
| Parquet | Large analytical datasets | Columnar; needs pandas/pyarrow to read |
A useful rule of thumb: CSV or JSON Lines for one-off exports, SQLite for local projects that need querying and deduplication, and PostgreSQL or Parquet once data volume or analytics demands outgrow a single file.
Validate Before You Store
Storing unvalidated scraped data poisons everything downstream. Define a schema and reject or flag malformed records at the boundary. Pydantic makes this concise and gives clear errors instead of silent corruption.
from pydantic import BaseModel, ValidationError, field_validator
class Book(BaseModel):
title: str
price: float
in_stock: bool = True
@field_validator("price", mode="before")
@classmethod
def clean_price(cls, v):
return float(str(v).replace("£", "").strip())
def validate(rows: list[dict]) -> list[Book]:
clean = []
for row in rows:
try:
clean.append(Book(**row))
except ValidationError as exc:
print(f"Skipping invalid record: {exc}")
return clean
Writing CSV and JSON Lines
For flat data, stream rows to CSV with the standard library. JSON Lines (one JSON object per line) is preferable to a single JSON array for large datasets because it can be appended to and read line by line without loading the whole file.
import csv, json
def to_csv(books: list[Book], path: str) -> None:
with open(path, "w", newline="", encoding="utf-8") as f:
writer = csv.DictWriter(f, fieldnames=["title", "price", "in_stock"])
writer.writeheader()
for b in books:
writer.writerow(b.model_dump())
def to_jsonl(books: list[Book], path: str) -> None:
with open(path, "a", encoding="utf-8") as f: # append-friendly
for b in books:
f.write(json.dumps(b.model_dump()) + "\n")
Always specify encoding="utf-8" — mismatched encodings are a frequent source of corruption. See Fixing Common Unicode Errors in Python Scraping when text comes out garbled.
Incremental Writes and Deduplication with SQLite
For any crawl longer than a few seconds, write as you go rather than accumulating everything in memory and dumping at the end — a crash should never cost the whole run. SQLite is ideal for this locally: it gives you durable, incremental inserts plus deduplication via a unique constraint and INSERT OR IGNORE.
import sqlite3
conn = sqlite3.connect("books.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS books (
title TEXT,
price REAL,
in_stock INTEGER,
url TEXT UNIQUE -- natural key prevents duplicates
)
""")
def save(book: Book, url: str) -> None:
conn.execute(
"INSERT OR IGNORE INTO books (title, price, in_stock, url) VALUES (?, ?, ?, ?)",
(book.title, book.price, int(book.in_stock), url),
)
conn.commit()
The UNIQUE column means re-running the crawl will not create duplicates — essential for resumable or scheduled jobs. In a Scrapy project, this same logic belongs in an item pipeline.
Scaling Up: PostgreSQL and Parquet
When data outgrows a single file or needs concurrent writers, move to PostgreSQL — use psycopg's execute_many or COPY for efficient bulk inserts, and an ON CONFLICT DO NOTHING clause for deduplication. For analytical datasets measured in millions of rows, write Parquet with pandas or pyarrow: it is columnar, compressed, and dramatically faster to query than CSV.
import pandas as pd
df = pd.DataFrame(b.model_dump() for b in books)
df.drop_duplicates(subset="title").to_parquet("books.parquet", index=False)
Common Mistakes to Avoid
- Buffering everything in memory: accumulating a giant list and writing once means a crash loses all of it. Stream to disk or a database incrementally.
- No deduplication key: without a unique constraint, re-runs and overlapping pagination create duplicate rows. Pick a natural key (URL, ID).
- Skipping validation: unvalidated records corrupt analytics silently. Validate at the storage boundary.
- Wrong format for the scale: CSV for millions of rows is slow and lossy; Parquet or a database is the right tool.
- Ignoring encoding: always write UTF-8 explicitly to avoid mojibake in exported text.
Frequently Asked Questions
CSV or JSON for scraped data? Use CSV for flat, tabular data destined for spreadsheets. Use JSON (or JSON Lines) when records are nested or feed an API. JSON Lines is best for large, append-as-you-go exports.
When should I use a database instead of files? Once you need querying, deduplication, concurrent writes, or resumable crawls. SQLite covers local needs with zero setup; PostgreSQL handles large, concurrent, or relational workloads.
How do I avoid duplicate records?
Define a unique key (such as the source URL or an ID) and let the database enforce it with a unique constraint plus INSERT OR IGNORE (SQLite) or ON CONFLICT DO NOTHING (PostgreSQL).
What is Parquet and when should I use it? Parquet is a columnar, compressed file format optimized for analytics. Use it for large datasets you will query or load into data tools — it is far faster and smaller than CSV at scale.