ArchiveBox – Durable Web Archiving for Data & ML Pipelines

ArchiveBox is an open-source, self-hosted system that ingests URLs and persistently archives their content using a pluggable set of extractors (WARC, HTML, PDF, screenshots, readability text, media, etc.). It is valuable in MLOps, LLM training, compliance, research reproducibility, and knowledge-retention scenarios where you need stable snapshots of sources, dataset provenance, and long-term link durability.

When To Choose ArchiveBox

Use ArchiveBox when you need verifiable, offline-accessible copies of web resources or to prevent dataset drift due to disappearing / changing web pages.

1. Key Use Cases

| Category | Scenario | Value |
|---|---|---|
| LLM Dataset Curation | Capture blog posts, docs, academic pages at ingestion time | Ensures reproducible training corpora |
| Compliance / Audit | Preserve sources referenced in reports | Immutable history trail |
| Research | Snapshot volatile content (news, API docs) | Prevents link rot |
| ML Feature Lineage | Store external knowledge references | Traceability for regulated features |
| Competitive Intelligence | Track changes over time | Temporal diffing |
| Knowledge Base | Offline searchable mirror | Fast retrieval |

2. Architecture Overview

Ingestion Sources → Queue → Extractor Workers → Archive Output Tree

├─ Structured Index (SQLite / JSON)
├─ Fulltext (ripgrep / search plugin)
└─ Optional External Storage (S3 / MinIO / NFS)

Components:

  • CLI / Web UI: Manage URLs, view snapshots, search.
  • Scheduler / Cron: Automate recurring archiving.
  • Extractors: wget, singlefile, readability, pdf, screenshot, media, git, etc.
  • Output Tree: Each URL gets a timestamped folder with artifacts + metadata.
  • Index DB: SQLite (default) holds canonical URL + snapshot metadata.

3. Installation Methods

3.1 Quick Docker (Ephemeral Demo)

docker run -it --rm -p 8000:8000 \
-v $(pwd)/data:/data \
archivebox/archivebox:latest server 0.0.0.0:8000

Then in another shell:

docker exec -it $(docker ps -q -f ancestor=archivebox/archivebox:latest) archivebox add 'https://example.com'

3.2 Persistent Docker Compose

docker-compose.yml
services:
  archivebox:
    image: archivebox/archivebox:latest
    container_name: archivebox
    restart: unless-stopped
    environment:
      - ALLOWED_HOSTS=*
      - MEDIA_MAX_SIZE=750m
    volumes:
      - ./archivebox:/data
    ports:
      - "8000:8000"
    command: server 0.0.0.0:8000

Bring up:

docker compose up -d

3.3 Python (Virtualenv)

python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install archivebox[all]
archivebox init
archivebox add 'https://example.com'
archivebox server 0.0.0.0:8000

3.4 Kubernetes (Basic Stateful Deployment)

Minimal sample – customize resources & persistence.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: archivebox
spec:
  replicas: 1
  selector:
    matchLabels: { app: archivebox }
  template:
    metadata:
      labels: { app: archivebox }
    spec:
      containers:
        - name: archivebox
          image: archivebox/archivebox:latest
          args: ["server", "0.0.0.0:8000"]
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: archivebox-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: archivebox-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 50Gi

4. Directory & Data Layout

Typical structure (inside /data):

index.sqlite3
ArchiveBox.conf
logs/
sources/
archive/
  1700000000/          # timestamped snapshot folder
    index.json         # metadata
    warc/
    singlefile.html
    readability/
    screenshot.png
    media/
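
This layout is easy to walk programmatically. A minimal sketch, assuming the script runs from the data directory; index.json field names vary across versions, so the url/title lookups below are best-effort:

import json
import pathlib

ARCHIVE_ROOT = pathlib.Path('archive')

for snapshot_dir in sorted(ARCHIVE_ROOT.iterdir()):
    index_file = snapshot_dir / 'index.json'
    if not index_file.is_file():
        continue
    meta = json.loads(index_file.read_text())
    # Field names differ between versions; fall back gracefully.
    url = meta.get('url', 'unknown-url')
    title = meta.get('title') or ''
    artifacts = [p.name for p in snapshot_dir.iterdir() if p.name != 'index.json']
    print(f"{snapshot_dir.name}  {url}  {title}  artifacts={artifacts}")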

5. Core Commands

| Command | Purpose |
|---|---|
| archivebox init | Initialize repo (creates config + DB). |
| archivebox add <url \| file \| stdin> | Add URLs from arg / file / pipe. |
| archivebox server 0.0.0.0:8000 | Start web UI/API. |
| archivebox schedule --every=1d | Cron-like recurring ingest. |
| archivebox list --json | List stored URLs with metadata. |
| archivebox oneshot <url> | Temporary capture not stored in main index. |
| archivebox config --set KEY=VALUE | Persist configuration override. |

Examples

echo 'https://news.ycombinator.com' | archivebox add -
archivebox add --depth=1 'https://example.com'
archivebox list | head

6. Ingestion Sources

| Source Type | Method | Notes |
|---|---|---|
| Single URL | archivebox add https://... | Immediate snapshot. |
| Text File | archivebox add urls.txt | One URL per line. |
| Browser Bookmarks | Export bookmarks as HTML, then add the file. | |
| RSS/Atom Feeds | Preprocess feed items → pipe to add. | |
| API / Script | Call the CLI programmatically. | See the Python sketch below. |
| Web UI Form | Add interactively. | |

Feed Example

curl -s https://example.com/feed.xml | xmlstarlet sel -t -v '//item/link' | archivebox add -
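
The "API / Script" row above can be as simple as shelling out to the CLI. A minimal Python sketch, assuming it runs inside an initialized data directory (the URL list is illustrative):

import subprocess

urls = [
    'https://example.com/post-1',
    'https://example.com/post-2',
]
# Feed URLs via stdin, equivalent to `echo url | archivebox add -`.
subprocess.run(['archivebox', 'add', '-'], input='\n'.join(urls), text=True, check=True)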

7. Extractors & Output Artifacts

| Extractor | Artifact | Use Case |
|---|---|---|
| wget | Raw WARC-like tree | Full-fidelity offline browsing |
| singlefile | Single HTML file | Compact snapshot |
| readability | Clean article text | LLM training / NLP |
| screenshot | PNG image | Visual diffing |
| pdf | PDF rendition | Compliance, shareable |
| media | Downloaded audio/video | Rich media archival |
| git | Repo mirror | Code reference |
| favicon | Icon asset | UI display |

Configure which run via environment variables or ArchiveBox.conf:

export SAVE_WGET=True
export SAVE_PDF=False
export SAVE_SCREENSHOT=True

8. Configuration Model

Settings precedence (highest → lowest):

  1. CLI --flag
  2. Environment variable (e.g. TIMEOUT=60)
  3. ArchiveBox.conf
  4. Built-in defaults

List effective config:

archivebox config

Set persistent:

archivebox config --set FETCH_TIMEOUT=80

Common keys:

| Variable | Purpose |
|---|---|
| TIMEOUT | Global per-extractor timeout. |
| FETCH_TIMEOUT | Network retrieval cap. |
| SAVE_PDF | Enable PDF generation. |
| OUTPUT_PERMISSIONS | Unix mode override for saved files. |
| MEDIA_MAX_SIZE | Limit large media downloads. |
| CHROME_HEADLESS | Use headless browser features. |
| USE_COLOR | CLI color toggle. |

9. Authentication & Access Control

ArchiveBox by default runs an admin-only lightweight UI; for multi-user scenarios front it with:

  • Reverse proxy (Traefik / Nginx) + Basic Auth / SSO.
  • OAuth2 proxy (e.g., oauth2-proxy against GitHub, Google, Keycloak).

Public Exposure

Do not expose an unauthenticated instance; ingestion endpoints can fetch arbitrary URLs and cause unexpected outbound traffic / SSRF risk.

10. Storage Strategy & External Object Stores

ArchiveBox stores structured artifacts on a POSIX filesystem. For scale-out durability integrate with object storage:

10.1 Offloading Snapshot Folders to MinIO / S3

After each snapshot, sync archive/ subtree to S3-compatible storage.

aws --endpoint-url http://minio:9000 s3 sync archive/ s3://archivebox-snapshots/archive/

Automate post-run via wrapper script or cron.
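
A minimal wrapper sketch in Python, assuming the aws CLI is configured as above and the script runs from the data directory (the URL file and bucket are placeholders):

import subprocess

# 1. Ingest the day's URL list.
subprocess.run(['archivebox', 'add', 'urls.txt'], check=True)

# 2. Mirror new artifacts to S3-compatible storage (same sync command as above).
subprocess.run([
    'aws', '--endpoint-url', 'http://minio:9000',
    's3', 'sync', 'archive/', 's3://archivebox-snapshots/archive/',
], check=True)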

10.2 Deduplication & Pruning

Periodic removal of bulky artifacts you no longer need:

find archive -name 'screenshot.png' -size +5M -delete

Or disable generation (SAVE_SCREENSHOT=False).

11. Performance & Scaling

| Area | Tactic | Impact |
|---|---|---|
| Parallelism | Run multiple add processes (distinct queues) | Higher throughput |
| Headless Browser | Reuse Chrome instance / disable unneeded extractors | Lower CPU/memory |
| Media Limits | Set MEDIA_MAX_SIZE | Avoid huge downloads |
| Filesystem | Use SSD/NVMe for index + small files | Faster extractor IO |
| Scheduling | Stagger large batches off-peak | Consistent load |
| Caching | Local DNS / HTTP cache (squid) | Reduced latency |

11.1 Horizontal Sharding Strategy

ArchiveBox is not inherently clustered, but you can shard logically:

| Shard Basis | Method | Benefit |
|---|---|---|
| Domain Hash | Distribute URL lists by hash(domain) % N | Balances workload |
| Content Type | Separate media-heavy vs text-only feeds | Optimizes resource sizing |
| Recency Tier | Recent URLs on fast NVMe, older on HDD | Cost efficiency |

Each shard runs its own instance + index. Aggregate global search using an external catalog (e.g., ingest each shard's index.sqlite3 metadata into a central analytical DB).
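
A sketch of the Domain Hash row: split a master URL list into per-shard lists with a stable hash of the domain (file names and shard count are illustrative; hashlib is used because Python's built-in hash() is randomized per process):

import hashlib
from urllib.parse import urlparse

N_SHARDS = 4
shards = {i: [] for i in range(N_SHARDS)}

with open('urls.txt') as fh:
    for line in fh:
        url = line.strip()
        if not url:
            continue
        domain = urlparse(url).netloc.lower()
        shard_id = int(hashlib.sha256(domain.encode()).hexdigest(), 16) % N_SHARDS
        shards[shard_id].append(url)

for shard_id, urls in shards.items():
    # Each list is fed to its own instance, e.g.: archivebox add shard-0.txt
    with open(f'shard-{shard_id}.txt', 'w') as out:
        out.write('\n'.join(urls) + '\n')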

11.2 Content Hash Deduplication (Post-Process)

To avoid storing identical pages (mirrors, syndicated content):

find archive -name 'singlefile.html' -exec sha256sum {} + | sort > /tmp/hashes.txt
awk '{if(seen[$1]++){print $2}}' /tmp/hashes.txt | while read dup; do echo "Duplicate: $dup"; done

Integrate with a script to remove duplicates after verifying retention policy alignment.
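
A Python version of the same idea that keeps the first copy of each hash and only reports later ones; deletion stays commented out until the retention policy is confirmed:

import hashlib
import pathlib

seen = {}
for page in sorted(pathlib.Path('archive').rglob('singlefile.html')):
    digest = hashlib.sha256(page.read_bytes()).hexdigest()
    if digest in seen:
        print(f"Duplicate: {page} (same content as {seen[digest]})")
        # page.unlink()  # uncomment once the retention policy allows removal
    else:
        seen[digest] = page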

11.3 Integrity Verification

Generate periodic manifest of critical artifacts:

find archive -maxdepth 3 -type f \( -name 'singlefile.html' -o -name 'index.json' \) -print0 | \
  xargs -0 sha256sum > manifests/archive-manifest-$(date +%F).sha256
sha256sum -c "manifests/$(ls manifests | tail -n 1)"

Store manifest copies offsite to detect tampering or bitrot.

11.4 Retention Classification

Assign tiers:

| Tier | Definition | Action |
|---|---|---|
| Gold | Legal/compliance sources | Keep all artifacts, replicate |
| Silver | ML training canonical set | Keep text + HTML; drop media |
| Bronze | Low-value feed noise | Keep readability only; 90-day expiry |

Implement via labeling URL lists and customizing extractor toggles per ingestion run.
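
A per-tier ingestion sketch, assuming one URL list per tier and using SAVE_* toggles in the style of section 7 (the tier-to-toggle mapping and file names are illustrative):

import os
import subprocess

TIERS = {
    'gold':   ('gold.txt',   {}),  # keep every artifact type
    'silver': ('silver.txt', {'SAVE_MEDIA': 'False'}),
    'bronze': ('bronze.txt', {'SAVE_MEDIA': 'False', 'SAVE_PDF': 'False', 'SAVE_SCREENSHOT': 'False'}),
}

for tier, (url_file, toggles) in TIERS.items():
    print(f"Ingesting {tier} tier from {url_file}")
    subprocess.run(['archivebox', 'add', url_file], env={**os.environ, **toggles}, check=True)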

Benchmark idea:

time archivebox add --depth=0 $(cat top100.txt)

Extractor Selection

For LLM text corpora you can often disable heavy assets (media, screenshot, pdf) and keep readability + singlefile + wget for balanced fidelity vs size.

12. Backup & Disaster Recovery

| Component | Backup Method |
|---|---|
| index.sqlite3 | Daily snapshot copy (SQLite is a single file) |
| archive/ | rsync / rclone to a remote |
| ArchiveBox.conf | Git commit or config management |
| Logs | Centralize via Fluent Bit / Loki |
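
A combined sketch of the table above, assuming local paths and a reachable remote host; the index copy uses Python's sqlite3 backup API so it stays consistent while the server is running (host name and rsync flags are illustrative):

import pathlib
import sqlite3
import subprocess
from datetime import date

pathlib.Path('backups').mkdir(exist_ok=True)

# Consistent copy of index.sqlite3, safe against concurrent writes.
src = sqlite3.connect('index.sqlite3')
dst = sqlite3.connect(f'backups/index-{date.today()}.sqlite3')
src.backup(dst)
dst.close()
src.close()

# Mirror the artifact tree offsite.
subprocess.run(['rsync', '-a', '--delete', 'archive/', 'backup-host:/srv/archivebox/archive/'], check=True)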

Disaster restore minimal sequence:

cp -r backup/archive ./archive
cp backup/index.sqlite3 ./index.sqlite3
archivebox list | head

13. Automation Patterns

Cron (Host)

0 2 * * * cd /opt/archivebox && docker compose run --rm archivebox schedule --every=1d >> logs/cron.log 2>&1

GitOps (K8s)

  • Commit URL lists into a repo.
  • Sidecar container tails repo and feeds new URLs to CLI.

Webhook Collector

Expose minimal API endpoint behind auth that enqueues URLs posted by other services (e.g., Slack slash command → webhook → ArchiveBox CLI). Wrap with a small Flask/FastAPI facade if needed.
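
A minimal Flask facade sketch, assuming a shared-secret header and that the process runs inside the data directory (the token and port are placeholders):

import subprocess

from flask import Flask, abort, request

app = Flask(__name__)
SHARED_TOKEN = 'change-me'  # placeholder; load from a secret store in practice

@app.route('/enqueue', methods=['POST'])
def enqueue():
    if request.headers.get('X-Token') != SHARED_TOKEN:
        abort(401)
    url = (request.get_json(force=True) or {}).get('url', '')
    if not url.startswith(('http://', 'https://')):
        abort(400)
    # Hand off to the CLI; a production setup would queue this instead of blocking.
    subprocess.run(['archivebox', 'add', url], check=False)
    return {'queued': url}

if __name__ == '__main__':
    app.run(port=8888)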

14. Monitoring & Observability

| Signal | Method |
|---|---|
| Ingestion Success Rate | Parse logs for Saved vs Failed lines. |
| Queue Lag | Compare new-URL timestamp vs capture time. |
| Disk Growth | Track du -sh archive/ over time. |
| Error Types | Log classification (timeouts, 403s, JS errors). |
| API Health | HTTP 200 on the / UI endpoint. |

Prometheus sidecar example (export simple metrics via script):

echo "archivebox_snapshots_total $(find archive -maxdepth 1 -type d | wc -l)" > /var/lib/node_exporter/textfile_collector/archivebox.prom

15. Security Hardening

| Risk | Mitigation |
|---|---|
| SSRF / internal fetch | Restrict outbound network via firewall; disallow metadata IP ranges. |
| Arbitrary file growth | Quotas + monitor disk usage. |
| Sensitive content | Encrypt volume (LUKS) or offload encrypted to S3 (SSE-KMS). |
| Credential leakage | Store secrets in env vaults (K8s Secrets, HashiCorp Vault). |
| Outdated dependencies | Rebuild image weekly with latest extractor tools. |
| Integrity drift | Periodic hashing + diff manifests. |
| Duplicate ingestion | Central registry to prevent re-adding processed URLs. |

Review Allowed Hosts

Some extractor tools may follow redirects—validate target domains if ingesting from untrusted feeds.
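
A pre-flight check sketch for untrusted feeds: resolve each URL's host and drop anything pointing at private, loopback, or link-local (cloud metadata) ranges before it reaches archivebox add. This complements, but does not replace, network-level egress rules:

import ipaddress
import socket
from urllib.parse import urlparse

def is_safe(url: str) -> bool:
    host = urlparse(url).hostname or ''
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            return False
    return True

for url in ['https://example.com', 'http://169.254.169.254/latest/meta-data/']:
    print(url, 'OK' if is_safe(url) else 'BLOCKED')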

16. Troubleshooting

| Symptom | Action |
|---|---|
| Missing screenshot | Ensure headless Chrome is available (SAVE_SCREENSHOT=True). |
| Slow captures | Disable heavy extractors, raise timeouts gradually. |
| DB locked errors | Stagger concurrent writes; use fewer parallel add processes. |
| Failed media | Increase MEDIA_MAX_SIZE or confirm codec tools are installed. |
| High disk usage | Prune unneeded artifact types, compress logs. |

Useful commands:

archivebox list --status=failed --json
archivebox shell
sqlite3 index.sqlite3 'SELECT url, updated FROM core_snapshot ORDER BY updated DESC LIMIT 5;'

17. Advanced Workflow: Integrating with MinIO & LLM Pipelines

Goal: Archive pages → extract clean text → push text objects to MinIO for downstream embedding/vectorization.

Flow:

  1. archivebox add ingest URLs.
  2. Parse readability text artifacts.
  3. Chunk into documents.
  4. Upload chunks to MinIO bucket (s3://web-corpus/).
  5. Trigger vector index build job.

Sample Python script:

import json
import pathlib

import boto3

ARCHIVE_ROOT = pathlib.Path('archive')
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:9000',
    aws_access_key_id='minioadmin',
    aws_secret_access_key='strongpassword123!',
    region_name='us-east-1',
)
bucket = 'web-corpus'
try:
    s3.create_bucket(Bucket=bucket)
except Exception:
    pass  # bucket already exists

for snap in ARCHIVE_ROOT.iterdir():
    txt_dir = snap / 'readability'
    if not txt_dir.exists():
        continue
    for txt in txt_dir.glob('*.json'):  # readability output is JSON (depending on the version)
        data = json.loads(txt.read_text())
        body = data.get('content') or data.get('textContent') or ''
        if not body:
            continue
        key = f"readability/{snap.name}/{txt.stem}.txt"
        s3.put_object(Bucket=bucket, Key=key, Body=body.encode('utf-8'))
        print('Uploaded', key)
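
Step 3 of the flow (chunking) is skipped above; a minimal fixed-size chunker sketch that could replace the single put_object call (size and overlap are arbitrary starting points):

def chunk_text(text: str, size: int = 2000, overlap: int = 200):
    """Yield overlapping character windows suitable for embedding."""
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        yield text[start:start + size]

# e.g. upload each chunk as its own object:
# for i, chunk in enumerate(chunk_text(body)):
#     s3.put_object(Bucket=bucket, Key=f"{key}.part{i}", Body=chunk.encode('utf-8'))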

18. Capacity Planning

| Dimension | Estimate Basis |
|---|---|
| Average Snapshot Size | 2–15 MB (HTML + assets) |
| Heavy Media Snapshot | 50–500 MB |
| Text-Only Snapshot | 50–400 KB |
| Daily Growth | #URLs × avg snapshot size |
| Index DB Size | ~a few KB per URL + metadata |

Rule of thumb: For 100k URLs/year at 8 MB avg → ~800 GB raw (before compression/offloading).

19. Minimal API Interaction

ArchiveBox exposes a JSON index (depending on the version). A simple example:

curl -s http://localhost:8000/json/ | jq '.[0:3] | map({url, timestamp})'

20. Next Steps

Automate feed ingestion & ship cleaned textual artifacts into your feature or embedding pipeline for durable, reproducible knowledge datasets.


Last Updated: 2025-09-17