ArchiveBox – Durable Web Archiving for Data & ML Pipelines
ArchiveBox is an open-source, self-hosted system that ingests URLs and persistently archives their content using a pluggable set of extractors (WARC, HTML, PDF, screenshots, readability text, media, etc.). It is valuable in MLOps, LLM training, compliance, research reproducibility, and knowledge-retention scenarios that require stable snapshots of sources, dataset provenance, and long-term link durability.
Use ArchiveBox when you need verifiable, offline-accessible copies of web resources or to prevent dataset drift due to disappearing / changing web pages.
1. Key Use Cases
| Category | Scenario | Value |
|---|---|---|
| LLM Dataset Curation | Capture blog posts, docs, academic pages at ingestion time | Ensures reproducible training corpora |
| Compliance / Audit | Preserve sources referenced in reports | Immutable history trail |
| Research | Snapshot volatile content (news, API docs) | Prevents link rot |
| ML Feature Lineage | Store external knowledge references | Traceability for regulated features |
| Competitive Intelligence | Track changes over time | Temporal diffing |
| Knowledge Base | Offline searchable mirror | Fast retrieval |
2. Architecture Overview
Ingestion Sources → Queue → Extractor Workers → Archive Output Tree
│
├─ Structured Index (SQLite / JSON)
├─ Fulltext (ripgrep / search plugin)
└─ Optional External Storage (S3 / MinIO / NFS)
Components:
- CLI / Web UI: Manage URLs, view snapshots, search.
- Scheduler / Cron: Automate recurring archiving.
- Extractors: wget, singlefile, readability, pdf, screenshot, media, git, etc.
- Output Tree: Each URL gets a timestamped folder with artifacts + metadata.
- Index DB: SQLite (default) holds canonical URL + snapshot metadata.
3. Installation Methods
3.1 Quick Docker (Ephemeral Demo)
docker run -it --rm -p 8000:8000 \
-v $(pwd)/data:/data \
archivebox/archivebox:latest server 0.0.0.0:8000
Then in another shell:
docker exec -it $(docker ps -q -f ancestor=archivebox/archivebox:latest) archivebox add 'https://example.com'
3.2 Persistent Docker Compose
services:
archivebox:
image: archivebox/archivebox:latest
container_name: archivebox
restart: unless-stopped
environment:
- ALLOWED_HOSTS=*
- MEDIA_MAX_SIZE=750m
volumes:
- ./archivebox:/data
ports:
- "8000:8000"
command: server 0.0.0.0:8000
Bring up:
docker compose up -d
3.3 Python (Virtualenv)
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install archivebox[all]
archivebox init
archivebox add 'https://example.com'
archivebox server 0.0.0.0:8000
3.4 Kubernetes (Basic Stateful Deployment)
Minimal sample – customize resources & persistence.
apiVersion: apps/v1
kind: Deployment
metadata:
name: archivebox
spec:
replicas: 1
selector:
matchLabels: { app: archivebox }
template:
metadata:
labels: { app: archivebox }
spec:
containers:
- name: archivebox
image: archivebox/archivebox:latest
args: ["server","0.0.0.0:8000"]
ports:
- containerPort: 8000
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: archivebox-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: archivebox-pvc
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
4. Directory & Data Layout
Typical structure (inside /data):
index.sqlite3
ArchiveBox.conf
logs/
sources/
archive/
1700000000/ # timestamped snapshot folder
index.json # metadata
warc/
singlefile.html
readability/
screenshot.png
media/
5. Core Commands
| Command | Purpose |
|---|---|
| archivebox init | Initialize the repo (creates config + DB). |
| archivebox add <url\|file\|stdin> | Add URLs from an argument, file, or stdin pipe. |
| archivebox server 0.0.0.0:8000 | Start the web UI / API. |
| archivebox schedule --every=day | Cron-like recurring ingest. |
| archivebox list --json | List stored URLs with metadata. |
| archivebox oneshot <url> | Temporary capture not stored in the main index. |
| archivebox config --set KEY=VALUE | Persist a configuration override. |
Examples
echo 'https://news.ycombinator.com' | archivebox add -
archivebox add --depth=1 'https://example.com'
archivebox list | head
6. Ingestion Sources
| Source Type | Method | Notes |
|---|---|---|
| Single URL | archivebox add https://... | Immediate snapshot. |
| Text File | archivebox add urls.txt | One per line. |
| Browser Bookmarks | Export HTML, then add file. | |
| RSS/Atom Feeds | Preprocess feed items → pipe to add. | |
| API / Script | Call the CLI programmatically. | See the sketch after the feed example below. |
| Web UI Form | Add interactively. | |
Feed Example
curl -s https://example.com/feed.xml | xmlstarlet sel -t -m '//item' -v 'link' -n | archivebox add -
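For the API / Script row in the table above, a minimal sketch using only the Python standard library; the URL list is a placeholder and the command assumes it runs inside an initialized data directory:
import subprocess

# Hypothetical URL list produced by an upstream service (placeholder values).
urls = [
    'https://example.com/post-1',
    'https://example.com/post-2',
]

# Pipe the URLs to archivebox add - over stdin, the same mechanism as the feed example.
subprocess.run(
    ['archivebox', 'add', '-'],
    input='\n'.join(urls).encode('utf-8'),
    check=True,
)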
7. Extractors & Output Artifacts
| Extractor | Artifact | Use Case |
|---|---|---|
| wget | Raw WARC-like tree | Full fidelity offline browse |
| singlefile | Single HTML file | Compact snapshot |
| readability | Clean article text | LLM training / NLP |
| screenshot | PNG image | Visual diffing |
| pdf | PDF rendition | Compliance, shareable |
| media | Downloaded audio/video | Rich media archival |
| git | Repo mirror | Code reference |
| favicon | Icon asset | UI display |
Configure which extractors run via environment variables or ArchiveBox.conf:
export SAVE_WGET=True
export SAVE_PDF=False
export SAVE_SCREENSHOT=True
8. Configuration Model
Settings precedence (highest → lowest):
- CLI --flag
- Environment variable (e.g. TIMEOUT=60)
- ArchiveBox.conf
- Built-in defaults
List effective config:
archivebox config
Set persistent:
archivebox config --set FETCH_TIMEOUT=80
Common keys:
| Variable | Purpose |
|---|---|
| TIMEOUT | Global per-extractor timeout. |
| FETCH_TIMEOUT | Network retrieval cap. |
| SAVE_PDF | Enable PDF generation. |
| OUTPUT_PERMISSIONS | Unix mode override for saved files. |
| MEDIA_MAX_SIZE | Limit large media downloads. |
| CHROME_HEADLESS | Use headless browser features. |
| USE_COLOR | CLI color toggle. |
9. Authentication & Access Control
ArchiveBox by default runs an admin-only lightweight UI; for multi-user scenarios front it with:
- Reverse proxy (Traefik / Nginx) + Basic Auth / SSO.
- OAuth2 proxy (e.g., oauth2-proxy against GitHub, Google, Keycloak).
Do not expose an unauthenticated instance; ingestion endpoints can fetch arbitrary URLs and cause unexpected outbound traffic / SSRF risk.
10. Storage Strategy & External Object Stores
ArchiveBox stores structured artifacts on a POSIX filesystem. For scale-out durability integrate with object storage:
10.1 Offloading Snapshot Folders to MinIO / S3
After each snapshot, sync archive/ subtree to S3-compatible storage.
aws --endpoint-url http://minio:9000 s3 sync archive/ s3://archivebox-snapshots/archive/
Automate post-run via wrapper script or cron.
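A minimal post-run wrapper sketch reusing the aws CLI sync command above (the endpoint, bucket, and example URL are the assumptions from that command):
import subprocess

# Ingest a batch, then mirror the archive/ tree to S3-compatible storage
# (same endpoint and bucket as the aws command above).
subprocess.run(['archivebox', 'add', '--depth=0', 'https://example.com'], check=True)
subprocess.run(
    ['aws', '--endpoint-url', 'http://minio:9000',
     's3', 'sync', 'archive/', 's3://archivebox-snapshots/archive/'],
    check=True,
)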
10.2 Deduplication & Pruning
Periodic removal of bulky artifacts you no longer need:
find archive -name 'screenshot.png' -size +5M -delete
Or disable generation (SAVE_SCREENSHOT=False).
11. Performance & Scaling
| Area | Tactic | Impact |
|---|---|---|
| Parallelism | Run multiple add processes (distinct queues) | Higher throughput |
| Headless Browser | Reuse Chrome instance / disable unneeded extractors | Lower CPU/memory |
| Media Limits | Set MEDIA_MAX_SIZE | Avoid huge downloads |
| Filesystem | Use SSD/NVMe for index + small files | Faster extractor IO |
| Scheduling | Stagger large batches off-peak | Consistent load |
| Caching | Local DNS / HTTP cache (squid) | Reduced latency |
11.1 Horizontal Sharding Strategy
ArchiveBox is not inherently clustered, but you can shard logically:
| Shard Basis | Method | Benefit |
|---|---|---|
| Domain Hash | Distribute URL lists by hash(domain) % N | Balances workload |
| Content Type | Separate media-heavy vs text-only feeds | Optimizes resource sizing |
| Recency Tier | Recent URLs on fast NVMe, older on HDD | Cost efficiency |
Each shard runs its own instance + index. Aggregate global search using an external catalog (e.g., ingest each shard's index.sqlite3 metadata into a central analytical DB).
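A sketch of the domain-hash routing described above; the shard count and output filenames are illustrative, and a stable hash (SHA-256) is used rather than Python's salted built-in hash():
import hashlib
from urllib.parse import urlparse

N_SHARDS = 4  # number of ArchiveBox instances (illustrative)

def shard_for(url):
    # Route a URL to a shard by hashing its domain: hash(domain) % N.
    domain = urlparse(url).netloc.lower()
    digest = hashlib.sha256(domain.encode('utf-8')).hexdigest()
    return int(digest, 16) % N_SHARDS

# Split a master URL list into one file per shard; each file is then fed to
# archivebox add on the matching instance.
with open('urls.txt') as fh:
    urls = [line.strip() for line in fh if line.strip()]

shards = {i: [] for i in range(N_SHARDS)}
for url in urls:
    shards[shard_for(url)].append(url)

for i, batch in shards.items():
    with open(f'urls-shard-{i}.txt', 'w') as out:
        out.write('\n'.join(batch) + '\n')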
11.2 Content Hash Deduplication (Post-Process)
To avoid storing identical pages (mirrors, syndicated content):
find archive -name 'singlefile.html' -exec sha256sum {} + | sort > /tmp/hashes.txt
awk '{if(seen[$1]++){print $2}}' /tmp/hashes.txt | while read dup; do echo "Duplicate: $dup"; done
Integrate with a script to remove duplicates after verifying retention policy alignment.
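A Python variant of that script which groups snapshots by content hash and only reports removal candidates, leaving deletion to the operator:
import hashlib
import pathlib
from collections import defaultdict

ARCHIVE_ROOT = pathlib.Path('archive')
by_hash = defaultdict(list)

# Group snapshot folders by the SHA-256 of their singlefile.html artifact.
for path in ARCHIVE_ROOT.glob('*/singlefile.html'):
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    by_hash[digest].append(path.parent)

# Report duplicates only; actual removal is left to a reviewed retention job.
for digest, snaps in by_hash.items():
    if len(snaps) > 1:
        keep, *dupes = sorted(snaps)
        print(f'{digest[:12]}: keep {keep}, duplicates: {[str(d) for d in dupes]}')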
11.3 Integrity Verification
Generate periodic manifest of critical artifacts:
find archive -maxdepth 3 -type f \( -name 'singlefile.html' -o -name 'index.json' \) | \
xargs sha256sum > manifests/archive-manifest-$(date +%F).sha256
sha256sum -c manifests/$(ls manifests | tail -1)
Store manifest copies offsite to detect tampering or bitrot.
11.4 Retention Classification
Assign tiers:
| Tier | Definition | Action |
|---|---|---|
| Gold | Legal/compliance sources | Keep all artifacts, replicate |
| Silver | ML training canonical set | Keep text + HTML; drop media |
| Bronze | Low-value feed noise | Keep readability only; 90d expiry |
Implement via labeling URL lists and customizing extractor toggles per ingestion run.
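One possible implementation sketch: per-tier URL lists plus SAVE_* toggles passed as environment variables (the tier-to-toggle mapping, the SAVE_MEDIA key, and the filenames are assumptions):
import os
import subprocess

# Tier → extractor toggles (illustrative; Gold keeps everything, Silver drops heavy
# media, Bronze keeps only lightweight text output).
TIER_ENV = {
    'gold':   {},
    'silver': {'SAVE_MEDIA': 'False', 'SAVE_SCREENSHOT': 'False'},
    'bronze': {'SAVE_MEDIA': 'False', 'SAVE_SCREENSHOT': 'False', 'SAVE_PDF': 'False'},
}

def ingest(tier, url_file):
    # Run archivebox add for one labeled URL list with the tier's toggles.
    env = {**os.environ, **TIER_ENV[tier]}
    with open(url_file) as fh:
        subprocess.run(['archivebox', 'add', '-'], stdin=fh, check=True, env=env)

ingest('silver', 'urls-silver.txt')  # hypothetical labeled list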
Benchmark idea:
time archivebox add --depth=0 $(cat top100.txt)
For LLM text corpora you can often disable heavy assets (media, screenshot, pdf) and keep readability + singlefile + wget for balanced fidelity vs size.
12. Backup & Disaster Recovery
| Component | Backup Method |
|---|---|
| index.sqlite3 | Daily snapshot copy (SQLite is a single file; see the sketch below the table) |
| archive/ | rsync / rclone to a remote |
| ArchiveBox.conf | Git commit or config management |
| Logs | Centralize via Fluent Bit / Loki |
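For the index.sqlite3 row, a raw file copy taken while the server is writing can be inconsistent; a sketch using SQLite's online backup API from Python (the destination path is illustrative):
import os
import sqlite3
from datetime import date

os.makedirs('backups', exist_ok=True)

# Take a consistent snapshot of the index even while ArchiveBox is running.
src = sqlite3.connect('index.sqlite3')
dst = sqlite3.connect(f'backups/index-{date.today().isoformat()}.sqlite3')
with dst:
    src.backup(dst)  # SQLite online backup API
dst.close()
src.close()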
Disaster restore minimal sequence:
cp -r backup/archive ./archive
cp backup/index.sqlite3 ./index.sqlite3
archivebox list | head
13. Automation Patterns
Cron (Host)
0 2 * * * cd /opt/archivebox && docker compose run --rm archivebox schedule --every=day >> logs/cron.log 2>&1
GitOps (K8s)
- Commit URL lists into a repo.
- A sidecar container watches the repo and feeds new URLs to the CLI.
Webhook Collector
Expose minimal API endpoint behind auth that enqueues URLs posted by other services (e.g., Slack slash command → webhook → ArchiveBox CLI). Wrap with a small Flask/FastAPI facade if needed.
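A minimal FastAPI facade sketch along those lines; the /archive route, X-Token header check, and token value are placeholders, and it simply shells out to the CLI:
import subprocess
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_TOKEN = 'change-me'  # placeholder; load from a secret store in practice

class ArchiveRequest(BaseModel):
    url: str

@app.post('/archive')
def archive(req: ArchiveRequest, x_token: str = Header('')):
    # Reject callers that do not present the shared token.
    if x_token != API_TOKEN:
        raise HTTPException(status_code=401, detail='invalid token')
    # Enqueue by shelling out to the CLI; fine for low volume, use a real queue otherwise.
    subprocess.run(['archivebox', 'add', req.url], check=True)
    return {'status': 'queued', 'url': req.url}
Serve it with uvicorn behind the reverse-proxy authentication described in section 9.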
14. Monitoring & Observability
| Signal | Method |
|---|---|
| Ingestion Success Rate | Parse logs for Saved vs Failed lines. |
| Queue Lag | Compare new URL timestamp vs capture time. |
| Disk Growth | Track du -sh archive/ chronologically. |
| Error Types | Log classification (timeouts, 403, JS errors). |
| API Health | HTTP 200 on / UI endpoint. |
Prometheus sidecar example (export simple metrics via script):
echo "archivebox_snapshots_total $(find archive -maxdepth 1 -type d | wc -l)" > /var/lib/node_exporter/textfile_collector/archivebox.prom
15. Security Hardening
| Risk | Mitigation |
|---|---|
| SSRF / internal fetch | Restrict outbound network via firewall; disallow metadata IP ranges. |
| Arbitrary file growth | Quotas + monitor disk usage. |
| Sensitive content | Encrypt volume (LUKS) or offload encrypted to S3 (SSE-KMS). |
| Credential leakage | Store secrets in env vaults (K8s Secrets, HashiCorp Vault). |
| Outdated dependencies | Rebuild image weekly with latest extractor tools. |
| Integrity drift | Periodic hashing + diff manifests. |
| Duplicate ingestion | Central registry to prevent re-adding processed URLs. |
Some extractor tools may follow redirects—validate target domains if ingesting from untrusted feeds.
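For the SSRF row, a pre-ingestion filter sketch that drops URLs resolving to private, loopback, or link-local (cloud metadata) ranges before they ever reach archivebox add; the allowed schemes and resolution behavior are assumptions:
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    # Reject URLs whose host resolves to private, loopback, or link-local ranges.
    parsed = urlparse(url)
    if parsed.scheme not in ('http', 'https') or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0].split('%')[0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False  # covers 169.254.169.254-style metadata endpoints
    return True

# Drop unsafe URLs before piping the remainder to archivebox add -.
urls = ['https://example.com', 'http://169.254.169.254/latest/meta-data/']
print([u for u in urls if is_safe_url(u)])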
16. Troubleshooting
| Symptom | Action |
|---|---|
| Missing screenshot | Ensure headless Chrome is available and SAVE_SCREENSHOT=True. |
| Slow captures | Disable heavy extractors, raise timeouts gradually. |
| DB locked errors | Stagger concurrent writes, use fewer parallel add processes. |
| Failed media | Increase MEDIA_MAX_SIZE or confirm codec tools installed. |
| High disk usage | Prune unneeded artifact types, compress logs. |
Useful commands:
archivebox list --status=failed --json
archivebox shell
sqlite3 index.sqlite3 'SELECT url, updated FROM core_snapshot ORDER BY updated DESC LIMIT 5;'
17. Advanced Workflow: Integrating with MinIO & LLM Pipelines
Goal: Archive pages → extract clean text → push text objects to MinIO for downstream embedding/vectorization.
Flow:
- archivebox add ingests URLs.
- Parse readability text artifacts.
- Chunk into documents.
- Upload chunks to MinIO bucket (s3://web-corpus/).
- Trigger vector index build job.
Sample Python script:
import os, json, boto3, pathlib
ARCHIVE_ROOT = pathlib.Path('archive')
s3 = boto3.client('s3', endpoint_url='http://localhost:9000',
aws_access_key_id='minioadmin', aws_secret_access_key='strongpassword123!',
region_name='us-east-1')
bucket = 'web-corpus'
try:
s3.create_bucket(Bucket=bucket)
except Exception:
pass
for snap in ARCHIVE_ROOT.iterdir():
txt_dir = snap / 'readability'
if not txt_dir.exists():
continue
for txt in txt_dir.glob('*.json'): # readability outputs JSON (depending version)
data = json.loads(txt.read_text())
body = data.get('content') or data.get('textContent') or ''
if not body:
continue
key = f"readability/{snap.name}/{txt.stem}.txt"
s3.put_object(Bucket=bucket, Key=key, Body=body.encode('utf-8'))
print('Uploaded', key)
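The script above uploads whole documents; step 3 of the flow (chunking) can be slotted in before put_object with a helper like this (chunk size and overlap are illustrative):
def chunk_text(text, max_chars=2000, overlap=200):
    # Split text into overlapping character windows for embedding jobs.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

# Inside the loop above, upload one object per chunk instead of one per document:
# for i, chunk in enumerate(chunk_text(body)):
#     s3.put_object(Bucket=bucket, Key=f'{key[:-4]}.{i}.txt', Body=chunk.encode('utf-8'))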
18. Capacity Planning
| Dimension | Estimate Basis |
|---|---|
| Average Snapshot Size | 2–15 MB (HTML + assets) |
| Heavy Media Snapshot | 50–500 MB |
| Text-Only Snapshot | 50–400 KB |
| Daily Growth | (#URLs × Avg Size) |
| Index DB Size | ~ few KB per URL + metadata |
Rule of thumb: For 100k URLs/year at 8 MB avg → ~800 GB raw (before compression/offloading).
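The same rule of thumb as a quick calculation:
# 100k URLs per year at an 8 MB average snapshot size.
urls_per_year = 100_000
avg_snapshot_mb = 8
raw_gb = urls_per_year * avg_snapshot_mb / 1000  # ≈ 800 GB before compression/offloading
print(f'Estimated raw growth: {raw_gb:.0f} GB/year')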
19. Minimal API Interaction
ArchiveBox exposes a JSON index (depending on the version). A simple example query:
curl -s http://localhost:8000/json/ | jq '.[0:3] | map({url, timestamp})'
20. Further Reading & References
- Official Repo: https://github.com/ArchiveBox/ArchiveBox
- Docs: https://docs.archivebox.io/
- Extractors Matrix: https://docs.archivebox.io/extractors/
- Web Archiving Standards: https://iipc.github.io/ (International Internet Preservation Consortium)
- Wayback Packaging: Consider dual archiving to Internet Archive for redundancy.
Automate feed ingestion & ship cleaned textual artifacts into your feature or embedding pipeline for durable, reproducible knowledge datasets.
Last Updated: 2025-09-17