ArchiveBox – Durable Web Archiving for Data & ML Pipelines
ArchiveBox is an open-source, self-hosted system that ingests URLs and persistently archives their content using a pluggable set of extractors (WARC, HTML, PDF, screenshots, readability text, media, etc.). It is valuable in MLOps, LLM training, compliance, research reproducibility, and knowledge-retention scenarios that require stable snapshots of sources, dataset provenance, and long-term link durability.
Use ArchiveBox when you need verifiable, offline-accessible copies of web resources or to prevent dataset drift due to disappearing / changing web pages.
1. Key Use Cases
| Category | Scenario | Value |
|---|---|---|
| LLM Dataset Curation | Capture blog posts, docs, academic pages at ingestion time | Ensures reproducible training corpora |
| Compliance / Audit | Preserve sources referenced in reports | Immutable history trail |
| Research | Snapshot volatile content (news, API docs) | Prevents link rot |
| ML Feature Lineage | Store external knowledge references | Traceability for regulated features |
| Competitive Intelligence | Track changes over time | Temporal diffing |
| Knowledge Base | Offline searchable mirror | Fast retrieval |
2. Architecture Overview
Ingestion Sources → Queue → Extractor Workers → Archive Output Tree
│
├─ Structured Index (SQLite / JSON)
├─ Fulltext (ripgrep / search plugin)
└─ Optional External Storage (S3 / MinIO / NFS)
Components:
- CLI / Web UI: Manage URLs, view snapshots, search.
- Scheduler / Cron: Automate recurring archiving.
- Extractors: wget, singlefile, readability, pdf, screenshot, media, git, etc.
- Output Tree: Each URL gets a timestamped folder with artifacts + metadata.
- Index DB: SQLite (default) holds canonical URL + snapshot metadata.
3. Installation Methods
3.1 Quick Docker (Ephemeral Demo)
docker run -it --rm -p 8000:8000 \
-v $(pwd)/data:/data \
archivebox/archivebox:latest server 0.0.0.0:8000
Then in another shell:
docker exec -it $(docker ps -q -f ancestor=archivebox/archivebox:latest) archivebox add 'https://example.com'
3.2 Persistent Docker Compose
services:
archivebox:
image: archivebox/archivebox:latest
container_name: archivebox
restart: unless-stopped
environment:
- ALLOWED_HOSTS=*
- MEDIA_MAX_SIZE=750m
volumes:
- ./archivebox:/data
ports:
- "8000:8000"
command: server 0.0.0.0:8000
Bring up:
docker compose up -d
3.3 Python (Virtualenv)
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install archivebox[all]
archivebox init
archivebox add 'https://example.com'
archivebox server 0.0.0.0:8000
3.4 Kubernetes (Basic Stateful Deployment)
Minimal sample – customize resources & persistence.
apiVersion: apps/v1
kind: Deployment
metadata:
name: archivebox
spec:
replicas: 1
selector:
matchLabels: { app: archivebox }
template:
metadata:
labels: { app: archivebox }
spec:
containers:
- name: archivebox
image: archivebox/archivebox:latest
args: ["server","0.0.0.0:8000"]
ports:
- containerPort: 8000
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: archivebox-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: archivebox-pvc
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
4. Directory & Data Layout
Typical structure (inside /data):
index.sqlite3
ArchiveBox.conf
logs/
sources/
archive/
1700000000/ # timestamped snapshot folder
index.json # metadata
warc/
singlefile.html
readability/
screenshot.png
media/
5. Core Commands
| Command | Purpose |
|---|---|
| archivebox init | Initialize the repo (creates config + DB). |
| archivebox add <url\|file\|stdin> | Add URLs from an argument, file, or stdin pipe. |
| archivebox server 0.0.0.0:8000 | Start the web UI / API. |
| archivebox schedule --every=day | Cron-like recurring ingest. |
| archivebox list --json | List stored URLs with metadata. |
| archivebox oneshot <url> | Temporary capture not stored in the main index. |
| archivebox config --set KEY=VALUE | Persist a configuration override. |
Examples
echo 'https://news.ycombinator.com' | archivebox add -
archivebox add --depth=1 'https://example.com'
archivebox list | head
6. Ingestion Sources
| Source Type | Method | Notes |
|---|---|---|
| Single URL | archivebox add https://... | Immediate snapshot. |
| Text File | archivebox add urls.txt | One per line. |
| Browser Bookmarks | Export HTML, then add file. | |
| RSS/Atom Feeds | Preprocess feed items → pipe to add. | |
| API / Script | Call the CLI programmatically. | See the sketch after the feed example below. |
| Web UI Form | Add interactively. | |
Feed Example
curl -s https://example.com/feed.xml | xmlstarlet sel -t -m '//item' -v 'link' -n | archivebox add -
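For the API / Script row in the table above, a minimal sketch using only the Python standard library; the URL list is a placeholder and the command assumes it runs inside an initialized data directory:
import subprocess

# Hypothetical URL list produced by an upstream service (placeholder values).
urls = [
    'https://example.com/post-1',
    'https://example.com/post-2',
]

# Pipe the URLs to archivebox add - over stdin, the same mechanism as the feed example.
subprocess.run(
    ['archivebox', 'add', '-'],
    input='\n'.join(urls).encode('utf-8'),
    check=True,
)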
7. Extractors & Output Artifacts
| Extractor | Artifact | Use Case |
|---|---|---|
| wget | Raw WARC-like tree | Full fidelity offline browse |
| singlefile | Single HTML file | Compact snapshot |
| readability | Clean article text | LLM training / NLP |
| screenshot | PNG image | Visual diffing |
| pdf | PDF rendition | Compliance, shareable |
| media | Downloaded audio/video | Rich media archival |
| git | Repo mirror | Code reference |
| favicon | Icon asset | UI display |
Configure which extractors run via environment variables or ArchiveBox.conf:
export SAVE_WGET=True
export SAVE_PDF=False
export SAVE_SCREENSHOT=True
8. Configuration Model
Settings precedence (highest → lowest):
- CLI --flag
- Environment variable (e.g. TIMEOUT=60)
- ArchiveBox.conf
- Built-in defaults
List effective config:
archivebox config
Set persistent:
archivebox config --set FETCH_TIMEOUT=80
Common keys:
| Variable | Purpose |
|---|---|
| TIMEOUT | Global per-extractor timeout. |
| FETCH_TIMEOUT | Network retrieval cap. |
| SAVE_PDF | Enable PDF generation. |
| OUTPUT_PERMISSIONS | Unix mode override for saved files. |
| MEDIA_MAX_SIZE | Limit large media downloads. |
| CHROME_HEADLESS | Use headless browser features. |
| USE_COLOR | CLI color toggle. |
9. Authentication & Access Control
ArchiveBox by default runs an admin-only lightweight UI; for multi-user scenarios front it with:
- Reverse proxy (Traefik / Nginx) + Basic Auth / SSO.
- OAuth2 proxy (e.g., oauth2-proxy against GitHub, Google, Keycloak).
Do not expose an unauthenticated instance; ingestion endpoints can fetch arbitrary URLs and cause unexpected outbound traffic / SSRF risk.
10. Storage Strategy & External Object Stores
ArchiveBox stores structured artifacts on a POSIX filesystem. For scale-out durability integrate with object storage:
10.1 Offloading Snapshot Folders to MinIO / S3
After each snapshot, sync archive/ subtree to S3-compatible storage.
aws --endpoint-url http://minio:9000 s3 sync archive/ s3://archivebox-snapshots/archive/
Automate post-run via wrapper script or cron.
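A minimal post-run wrapper sketch reusing the aws CLI sync command above (the endpoint, bucket, and example URL are the assumptions from that command):
import subprocess

# Ingest a batch, then mirror the archive/ tree to S3-compatible storage
# (same endpoint and bucket as the aws command above).
subprocess.run(['archivebox', 'add', '--depth=0', 'https://example.com'], check=True)
subprocess.run(
    ['aws', '--endpoint-url', 'http://minio:9000',
     's3', 'sync', 'archive/', 's3://archivebox-snapshots/archive/'],
    check=True,
)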
10.2 Deduplication & Pruning
Periodic removal of bulky artifacts you no longer need:
find archive -name 'screenshot.png' -size +5M -delete
Or disable generation (SAVE_SCREENSHOT=False).
11. Performance & Scaling
| Area | Tactic | Impact |
|---|---|---|
| Parallelism | Run multiple add processes (distinct queues) | Higher throughput |
| Headless Browser | Reuse Chrome instance / disable unneeded extractors | Lower CPU/memory |
| Media Limits | Set MEDIA_MAX_SIZE | Avoid huge downloads |
| Filesystem | Use SSD/NVMe for index + small files | Faster extractor IO |
| Scheduling | Stagger large batches off-peak | Consistent load |
| Caching | Local DNS / HTTP cache (squid) | Reduced latency |
11.1 Horizontal Sharding Strategy
ArchiveBox is not inherently clustered, but you can shard logically:
| Shard Basis | Method | Benefit |
|---|---|---|
| Domain Hash | Distribute URL lists by hash(domain) % N | Balances workload |
| Content Type | Separate media-heavy vs text-only feeds | Optimizes resource sizing |
| Recency Tier | Recent URLs on fast NVMe, older on HDD | Cost efficiency |
Each shard runs its own instance + index. Aggregate global search using an external catalog (e.g., ingest each shard's index.sqlite3 metadata into a central analytical DB).
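A sketch of the domain-hash routing described above; the shard count and output filenames are illustrative, and a stable hash (SHA-256) is used rather than Python's salted built-in hash():
import hashlib
from urllib.parse import urlparse

N_SHARDS = 4  # number of ArchiveBox instances (illustrative)

def shard_for(url):
    # Route a URL to a shard by hashing its domain: hash(domain) % N.
    domain = urlparse(url).netloc.lower()
    digest = hashlib.sha256(domain.encode('utf-8')).hexdigest()
    return int(digest, 16) % N_SHARDS

# Split a master URL list into one file per shard; each file is then fed to
# archivebox add on the matching instance.
with open('urls.txt') as fh:
    urls = [line.strip() for line in fh if line.strip()]

shards = {i: [] for i in range(N_SHARDS)}
for url in urls:
    shards[shard_for(url)].append(url)

for i, batch in shards.items():
    with open(f'urls-shard-{i}.txt', 'w') as out:
        out.write('\n'.join(batch) + '\n')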
11.2 Content Hash Deduplication (Post-Process)
To avoid storing identical pages (mirrors, syndicated content):
find archive -name 'singlefile.html' -exec sha256sum {} + | sort > /tmp/hashes.txt
awk '{if(seen[$1]++){print $2}}' /tmp/hashes.txt | while read dup; do echo "Duplicate: $dup"; done
Integrate with a script to remove duplicates after verifying retention policy alignment.
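A Python variant of that script which groups snapshots by content hash and only reports removal candidates, leaving deletion to the operator:
import hashlib
import pathlib
from collections import defaultdict

ARCHIVE_ROOT = pathlib.Path('archive')
by_hash = defaultdict(list)

# Group snapshot folders by the SHA-256 of their singlefile.html artifact.
for path in ARCHIVE_ROOT.glob('*/singlefile.html'):
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    by_hash[digest].append(path.parent)

# Report duplicates only; actual removal is left to a reviewed retention job.
for digest, snaps in by_hash.items():
    if len(snaps) > 1:
        keep, *dupes = sorted(snaps)
        print(f'{digest[:12]}: keep {keep}, duplicates: {[str(d) for d in dupes]}')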
11.3 Integrity Verification
Generate periodic manifest of critical artifacts:
find archive -maxdepth 3 -type f \( -name 'singlefile.html' -o -name 'index.json' \) | \
xargs sha256sum > manifests/archive-manifest-$(date +%F).sha256
sha256sum -c manifests/$(ls manifests | tail -1)
Store manifest copies offsite to detect tampering or bitrot.
11.4 Retention Classification
Assign tiers:
| Tier | Definition | Action |
|---|---|---|
| Gold | Legal/compliance sources | Keep all artifacts, replicate |
| Silver | ML training canonical set | Keep text + HTML; drop media |
| Bronze | Low-value feed noise | Keep readability only; 90d expiry |
Implement via labeling URL lists and customizing extractor toggles per ingestion run.
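One possible implementation sketch: per-tier URL lists plus SAVE_* toggles passed as environment variables (the tier-to-toggle mapping, the SAVE_MEDIA key, and the filenames are assumptions):
import os
import subprocess

# Tier → extractor toggles (illustrative; Gold keeps everything, Silver drops heavy
# media, Bronze keeps only lightweight text output).
TIER_ENV = {
    'gold':   {},
    'silver': {'SAVE_MEDIA': 'False', 'SAVE_SCREENSHOT': 'False'},
    'bronze': {'SAVE_MEDIA': 'False', 'SAVE_SCREENSHOT': 'False', 'SAVE_PDF': 'False'},
}

def ingest(tier, url_file):
    # Run archivebox add for one labeled URL list with the tier's toggles.
    env = {**os.environ, **TIER_ENV[tier]}
    with open(url_file) as fh:
        subprocess.run(['archivebox', 'add', '-'], stdin=fh, check=True, env=env)

ingest('silver', 'urls-silver.txt')  # hypothetical labeled list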
Benchmark idea:
time archivebox add --depth=0 $(cat top100.txt)
For LLM text corpora you can often disable heavy assets (media, screenshot, pdf) and keep readability + singlefile + wget for balanced fidelity vs size.
12. Backup & Disaster Recovery
| Component | Backup Method |
|---|---|
| index.sqlite3 | Daily snapshot copy (SQLite is a single file; see the sketch below the table) |
| archive/ | rsync / rclone to a remote |
| ArchiveBox.conf | Git commit or config management |
| Logs | Centralize via Fluent Bit / Loki |
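For the index.sqlite3 row, a raw file copy taken while the server is writing can be inconsistent; a sketch using SQLite's online backup API from Python (the destination path is illustrative):
import os
import sqlite3
from datetime import date

os.makedirs('backups', exist_ok=True)

# Take a consistent snapshot of the index even while ArchiveBox is running.
src = sqlite3.connect('index.sqlite3')
dst = sqlite3.connect(f'backups/index-{date.today().isoformat()}.sqlite3')
with dst:
    src.backup(dst)  # SQLite online backup API
dst.close()
src.close()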
Disaster restore minimal sequence:
cp -r backup/archive ./archive
cp backup/index.sqlite3 ./index.sqlite3
archivebox list | head
13. Automation Patterns
Cron (Host)
0 2 * * * cd /opt/archivebox && docker compose run --rm archivebox schedule --every=day >> logs/cron.log 2>&1
GitOps (K8s)
- Commit URL lists into a repo.
- A sidecar container watches the repo and feeds new URLs to the CLI.
Webhook Collector
Expose minimal API endpoint behind auth that enqueues URLs posted by other services (e.g., Slack slash command → webhook → ArchiveBox CLI). Wrap with a small Flask/FastAPI facade if needed.
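A minimal FastAPI facade sketch along those lines; the /archive route, X-Token header check, and token value are placeholders, and it simply shells out to the CLI:
import subprocess
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_TOKEN = 'change-me'  # placeholder; load from a secret store in practice

class ArchiveRequest(BaseModel):
    url: str

@app.post('/archive')
def archive(req: ArchiveRequest, x_token: str = Header('')):
    # Reject callers that do not present the shared token.
    if x_token != API_TOKEN:
        raise HTTPException(status_code=401, detail='invalid token')
    # Enqueue by shelling out to the CLI; fine for low volume, use a real queue otherwise.
    subprocess.run(['archivebox', 'add', req.url], check=True)
    return {'status': 'queued', 'url': req.url}
Serve it with uvicorn behind the reverse-proxy authentication described in section 9.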
14. Monitoring & Observability
| Signal | Method |
|---|---|
| Ingestion Success Rate | Parse logs for Saved vs Failed lines. |
| Queue Lag | Compare new URL timestamp vs capture time. |
| Disk Growth | Track du -sh archive/ chronologically. |
| Error Types | Log classification (timeouts, 403, JS errors). |
| API Health | HTTP 200 on / UI endpoint. |
Prometheus sidecar example (export simple metrics via script):
echo "archivebox_snapshots_total $(find archive -maxdepth 1 -type d | wc -l)" > /var/lib/node_exporter/textfile_collector/archivebox.prom
15. Security Hardening
| Risk | Mitigation |
|---|---|
| SSRF / internal fetch | Restrict outbound network via firewall; disallow metadata IP ranges. |
| Arbitrary file growth | Quotas + monitor disk usage. |
| Sensitive content | Encrypt volume (LUKS) or offload encrypted to S3 (SSE-KMS). |
| Credential leakage | Store secrets in env vaults (K8s Secrets, HashiCorp Vault). |
| Outdated dependencies | Rebuild image weekly with latest extractor tools. |
| Integrity drift | Periodic hashing + diff manifests. |
| Duplicate ingestion | Central registry to prevent re-adding processed URLs. |
Some extractor tools may follow redirects—validate target domains if ingesting from untrusted feeds.
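For the SSRF row, a pre-ingestion filter sketch that drops URLs resolving to private, loopback, or link-local (cloud metadata) ranges before they ever reach archivebox add; the allowed schemes and resolution behavior are assumptions:
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    # Reject URLs whose host resolves to private, loopback, or link-local ranges.
    parsed = urlparse(url)
    if parsed.scheme not in ('http', 'https') or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0].split('%')[0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False  # covers 169.254.169.254-style metadata endpoints
    return True

# Drop unsafe URLs before piping the remainder to archivebox add -.
urls = ['https://example.com', 'http://169.254.169.254/latest/meta-data/']
print([u for u in urls if is_safe_url(u)])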
16. Troubleshooting
| Symptom | Action |
|---|---|
| Missing screenshot | Ensure headless Chrome is available and SAVE_SCREENSHOT=True. |
| Slow captures | Disable heavy extractors, raise timeouts gradually. |
| DB locked errors | Stagger concurrent writes, use fewer parallel add processes. |
| Failed media | Increase MEDIA_MAX_SIZE or confirm codec tools installed. |
| High disk usage | Prune unneeded artifact types, compress logs. |
Useful commands:
archivebox list --status=failed --json
archivebox shell
sqlite3 index.sqlite3 'SELECT url, updated FROM core_snapshot ORDER BY updated DESC LIMIT 5;'
17. Advanced Workflow: Integrating with MinIO & LLM Pipelines
Goal: Archive pages → extract clean text → push text objects to MinIO for downstream embedding/vectorization.
Flow:
- archivebox add ingests URLs.
- Parse readability text artifacts.
- Chunk into documents.
- Upload chunks to MinIO bucket (s3://web-corpus/).
- Trigger vector index build job.
Sample Python script:
import os, json, boto3, pathlib
ARCHIVE_ROOT = pathlib.Path('archive')
s3 = boto3.client('s3', endpoint_url='http://localhost:9000',
aws_access_key_id='minioadmin', aws_secret_access_key='strongpassword123!',
region_name='us-east-1')
bucket = 'web-corpus'
try:
s3.create_bucket(Bucket=bucket)
except Exception:
pass
for snap in ARCHIVE_ROOT.iterdir():
txt_dir = snap / 'readability'
if not txt_dir.exists():
continue
for txt in txt_dir.glob('*.json'): # readability outputs JSON (depending version)
data = json.loads(txt.read_text())
body = data.get('content') or data.get('textContent') or ''
if not body:
continue
key = f"readability/{snap.name}/{txt.stem}.txt"
s3.put_object(Bucket=bucket, Key=key, Body=body.encode('utf-8'))
print('Uploaded', key)
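The script above uploads whole documents; step 3 of the flow (chunking) can be slotted in before put_object with a helper like this (chunk size and overlap are illustrative):
def chunk_text(text, max_chars=2000, overlap=200):
    # Split text into overlapping character windows for embedding jobs.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

# Inside the loop above, upload one object per chunk instead of one per document:
# for i, chunk in enumerate(chunk_text(body)):
#     s3.put_object(Bucket=bucket, Key=f'{key[:-4]}.{i}.txt', Body=chunk.encode('utf-8'))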
18. Capacity Planning
| Dimension | Estimate Basis |
|---|---|
| Average Snapshot Size | 2–15 MB (HTML + assets) |
| Heavy Media Snapshot | 50–500 MB |
| Text-Only Snapshot | 50–400 KB |
| Daily Growth | (#URLs × Avg Size) |
| Index DB Size | ~ few KB per URL + metadata |
Rule of thumb: For 100k URLs/year at 8 MB avg → ~800 GB raw (before compression/offloading).
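The same rule of thumb as a quick calculation:
# 100k URLs per year at an 8 MB average snapshot size.
urls_per_year = 100_000
avg_snapshot_mb = 8
raw_gb = urls_per_year * avg_snapshot_mb / 1000  # ≈ 800 GB before compression/offloading
print(f'Estimated raw growth: {raw_gb:.0f} GB/year')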
19. Minimal API Interaction
ArchiveBox exposes a JSON index (depending on the version). A simple example query:
curl -s http://localhost:8000/json/ | jq '.[0:3] | map({url, timestamp})'
20. Further Reading & References
- Official Repo: https://github.com/ArchiveBox/ArchiveBox
- Docs: https://docs.archivebox.io/
- Extractors Matrix: https://docs.archivebox.io/extractors/
- Web Archiving Standards: https://iipc.github.io/ (International Internet Preservation Consortium)
- Wayback Packaging: Consider dual archiving to Internet Archive for redundancy.
Automate feed ingestion & ship cleaned textual artifacts into your feature or embedding pipeline for durable, reproducible knowledge datasets.
Last Updated: 2025-09-17