Weights & Biases (W&B)
This tutorial reflects current (2025) W&B capabilities. Some enterprise-only features (SSO, audit logs) are noted where relevant.
1. Overview
Weights & Biases (W&B) is a SaaS / self-hostable platform for experiment tracking, hyperparameter optimization, dataset & artifact versioning, model management, collaboration, and LLMOps. It provides deep integrations with popular ML/DL frameworks and emerging GenAI tooling.
Core feature areas:
| Domain | Features |
|---|---|
| Tracking | Runs, configs, metrics, media (images/audio/video), system stats |
| Artifacts | Versioned datasets, models, intermediate outputs, lineage graph |
| Sweeps | Distributed hyperparameter search (grid, random, Bayesian, early stopping) |
| Reports & Dashboards | Shareable analyses, interactive panels, tables |
| Tables | Structured dataset / evaluation / prediction logging for rich comparison |
| Model Registry | Model versions, aliases (e.g., production, staging), links to artifacts & evals |
| LLMOps | Prompt versioning, chain traces, token usage, evaluation results |
| Alerts & Automations | Notifications, triggers (e.g., metric threshold) |
2. Architecture & Concepts
Your Code (Python Script / Notebook / Trainer)
│
├── wandb.init() → Creates a Run (unique ID, config, settings)
│
├── wandb.log({...}) → Streams metrics & media asynchronously
│
├── wandb.Artifact() → Declares versioned artifact, add files, .log_artifact()
│
└── wandb.finish() → Finalizes run metadata
Cloud Backend (or Self-Hosted) → Stores metadata + binary artifacts (S3/GCS/Azure or W&B managed storage)
UI / API / SDK / CLI → Query runs, lineage, sweeps, tables, registry
Key entities:
- Project: Namespace grouping related runs
- Run: Single execution (training / eval)
- Config: Hyperparameters recorded at the start of a run (typically treated as fixed for that run)
- Metrics: Time-series values logged over steps / epochs
- Artifacts: Versioned file bundles with semantic types (dataset, model, result, custom)
- Aliases: Human-friendly labels (e.g., latest, best, prod) referencing artifact versions
- Tables: Structured logging for predictions/evaluations
- Sweep: Coordinated HPO orchestration
- Model Registry: Higher-level view binding artifacts + metadata + approvals
3. Installation & Setup
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install wandb scikit-learn pandas numpy torch torchvision matplotlib seaborn
Login (stores API key in ~/.netrc):
wandb login
Or programmatic:
import wandb, os
os.environ["WANDB_API_KEY"] = "<YOUR_KEY>"
wandb.login()
Environment variables (commonly used):
| Variable | Purpose |
|---|---|
| WANDB_API_KEY | Authentication token |
| WANDB_PROJECT | Default project name |
| WANDB_ENTITY | Team/org namespace |
| WANDB_MODE=offline | Offline logging (sync later) |
| WANDB_SILENT=true | Reduce console output |
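Most of these can also be set from Python before wandb.init(); a minimal sketch (the project name is just an example):

import os, wandb

os.environ["WANDB_PROJECT"] = "diabetes_rf"   # default project for subsequent runs
os.environ["WANDB_MODE"] = "offline"          # log locally; upload later with wandb sync
run = wandb.init()                            # picks the project up from the environment
run.finish()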
4. Basic Run & Logging
import wandb, time, random
wandb.init(project="diabetes_rf", config={
"model": "RandomForestRegressor",
"n_estimators": 120,
"max_depth": 6,
"seed": 42
})
for epoch in range(10):
loss = random.random() * (10/(epoch+1))
wandb.log({"epoch": epoch, "train/loss": loss})
time.sleep(0.2)
wandb.finish()
Run:
python train.py
Logging Media
wandb.log({"examples": [wandb.Image(img_array, caption="sample")]})
Grouping & Tags
run = wandb.init(project="exp-group-demo", group="baseline", tags=["rf", "v1"], job_type="train")
5. Configuration vs Metrics
- Config: set once (hyperparams) at or before wandb.init: wandb.config.learning_rate = 0.001
- Metrics: dynamic: wandb.log({"accuracy": acc}, step=epoch)
- Avoid logging huge arrays directly; use Tables / artifact files.
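A minimal sketch contrasting the two (the accuracy value is a stand-in for a real evaluation):

import wandb

run = wandb.init(project="config-vs-metrics", config={"learning_rate": 0.001})
wandb.config.weight_decay = 0.01              # config: recorded once, describes the run
for epoch in range(5):
    acc = 0.5 + 0.1 * epoch                   # stand-in for a real evaluation metric
    wandb.log({"accuracy": acc}, step=epoch)  # metrics: time series, one point per step
run.finish()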
6. Artifacts: Dataset & Model Versioning
Create and Log Dataset Artifact
import wandb, pandas as pd
df = pd.DataFrame({"x":[1,2,3], "y":[3,2,1]})
run = wandb.init(project="artifact-demo")
artifact = wandb.Artifact("toy_dataset", type="dataset", description="Small demo dataset")
df.to_csv("data.csv", index=False)
artifact.add_file("data.csv")
run.log_artifact(artifact)
run.finish()
Consume Artifact
run = wandb.init(project="artifact-demo")
artifact = run.use_artifact("toy_dataset:latest")
artifact_dir = artifact.download()
Log Model Artifact with Aliases
model_art = wandb.Artifact("rf_model", type="model", description="Baseline RF")
model_art.add_dir("model_dir") # e.g., saved sklearn object
run.log_artifact(model_art, aliases=["baseline","v1"])
Lineage is automatically recorded: which run produced which artifact and which runs consumed it.
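The lineage graph can also be queried from code; a small sketch using the public API (the entity name is a placeholder):

import wandb

api = wandb.Api()
art = api.artifact("entity/artifact-demo/toy_dataset:latest")
producer = art.logged_by()                    # run that created this artifact version
consumers = art.used_by()                     # runs that consumed it
print(producer.name, [r.name for r in consumers])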
7. Tables (Structured Data Logging)
wandb.Table enables interactive filtering, sorting, joining in the UI.
table = wandb.Table(columns=["id","prediction","label"])
for i,(pred,label) in enumerate(zip(preds, y_test)):
table.add_data(i, float(pred), float(label))
wandb.log({"eval/predictions": table})
Use tables for:
- Prediction inspection
- Error analysis (add columns for residuals)
- Prompt → response pairs for LLM evaluation
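For error analysis, the prediction table can carry a residual column; a sketch assuming the preds and y_test arrays from the example above:

table = wandb.Table(columns=["id", "prediction", "label", "residual"])
for i, (pred, label) in enumerate(zip(preds, y_test)):
    table.add_data(i, float(pred), float(label), float(pred - label))  # residual enables filtering/sorting in the UI
wandb.log({"eval/error_analysis": table})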
8. Sweeps (Hyperparameter Optimization)
Sweep Config (YAML)
program: train.py
method: bayes
metric:
name: val/accuracy
goal: maximize
parameters:
learning_rate:
    min: 0.00001   # written as a decimal; bare 1e-5 is parsed as a string by YAML
    max: 0.01
batch_size:
values: [32, 64, 128]
dropout:
distribution: uniform
min: 0.0
max: 0.5
early_terminate:
  type: hyperband
  min_iter: 3   # one of min_iter / max_iter is typically required to define the bracket schedule
Create & launch agents:
wandb sweep sweep.yaml # outputs SWEEP_ID
wandb agent <ENTITY>/<PROJECT>/<SWEEP_ID>
Inside train.py, reference config:
config = wandb.config
model = build_model(lr=config.learning_rate, dropout=config.dropout)
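Putting the pieces together, a sketch of what a sweep-driven train.py might look like (build_model and train_one_epoch are hypothetical helpers):

import wandb

def main():
    run = wandb.init()                        # the agent injects sweep parameters into wandb.config
    config = wandb.config
    model = build_model(lr=config.learning_rate, dropout=config.dropout)  # hypothetical
    for epoch in range(10):
        val_acc = train_one_epoch(model, batch_size=config.batch_size)    # hypothetical
        wandb.log({"val/accuracy": val_acc})  # key must match metric.name in sweep.yaml
    run.finish()

if __name__ == "__main__":
    main()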
9. Model Registry
W&B Model Registry layers on top of artifacts:
- Assign aliases: production, staging, best
- Attach evaluation tables & metrics panels
- Track promotions & approvals (enterprise features: permissions, comments)
Python alias update:
api = wandb.Api()
artifact = api.artifact("entity/project/rf_model:latest")
artifact.aliases.append("candidate")
artifact.save()
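To surface a logged model artifact in the registry from code, run.link_artifact can be used; a sketch assuming a registered model collection named demo-model already exists:

run = wandb.init(project="artifact-demo", job_type="register")
model_art = run.use_artifact("rf_model:v1")
run.link_artifact(model_art, "entity/model-registry/demo-model", aliases=["staging"])  # path and collection name are assumptions
run.finish()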
Compare models by linking evaluation tables in a Report.
10. Reports & Dashboards
Use the UI to compose Reports combining:
- Markdown narrative
- Run tables / charts
- Filtered panels (e.g., group=baseline)
- Embedded artifacts & media
Programmatic run set retrieval for analysis notebooks:
api = wandb.Api()
runs = api.runs("entity/diabetes_rf", filters={"config.model":"RandomForestRegressor"})
for r in runs:
print(r.name, r.summary.get("rmse"))
11. Framework Integrations
PyTorch
wandb.watch(model, log_graph=False, log="all") # gradients, parameters, histograms
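A small end-to-end sketch combining watch with per-step logging (random data, only to keep it runnable):

import torch, torch.nn as nn, wandb

run = wandb.init(project="pt-demo", config={"lr": 1e-3, "epochs": 3})
model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=run.config.lr)
wandb.watch(model, log="all", log_freq=10)    # gradients + parameter histograms

for epoch in range(run.config.epochs):
    x, y = torch.randn(64, 10), torch.randn(64, 1)  # stand-in batch
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    wandb.log({"train/loss": loss.item(), "epoch": epoch})
run.finish()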
PyTorch Lightning
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
logger = WandbLogger(project="pl-demo")
trainer = Trainer(logger=logger)
Hugging Face Transformers
pip install transformers datasets
from transformers import TrainingArguments
args = TrainingArguments(output_dir="outputs", report_to=["wandb"], run_name="bert-finetune")
Keras
wandb_callback = wandb.keras.WandbCallback(save_model=False)
model.fit(x, y, callbacks=[wandb_callback])
12. LLMOps & Prompt Tracking
Patterns for LLM evaluation:
- Log prompt, model name, parameters, latency
- Log response, tokens, quality metrics (BLEU, ROUGE, custom rubric)
- Use Tables to capture multi-turn context
Example:
import time
run = wandb.init(project="llm-eval")
for sample in dataset:
prompt = sample["prompt"]
t0 = time.time()
response = call_llm(prompt)
latency = time.time() - t0
wandb.log({
"llm/prompt": wandb.Html(f"<pre>{prompt}</pre>"),
"llm/response": wandb.Html(f"<pre>{response}</pre>"),
"llm/latency_s": latency,
"llm/tokens_total": len(response.split())
})
run.finish()
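As recommended above, the same evaluation is often easier to compare as a single Table instead of per-step Html panels; a sketch under the same assumptions (dataset and call_llm as before):

import time, wandb

run = wandb.init(project="llm-eval")
eval_table = wandb.Table(columns=["prompt", "response", "latency_s", "tokens_total"])
for sample in dataset:
    t0 = time.time()
    response = call_llm(sample["prompt"])
    eval_table.add_data(sample["prompt"], response, time.time() - t0, len(response.split()))
run.log({"llm/eval": eval_table})
run.finish()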
Advanced: integrate toolchains (LangChain, LlamaIndex) via their built-in W&B callbacks for trace graphs.
13. CI/CD Integration
Suggested pipeline:
- Train job logs run & model artifact
- Evaluation job consumes candidate artifact + production baseline → logs comparison table
- Policy step (script) reads metrics via API → if they pass the threshold, applies the candidate alias or promotes to production (see the snippet below)
- Deployment uses the alias to fetch a pinned model artifact
Promotion script snippet:
import wandb

api = wandb.Api()
model_art = api.artifact("entity/project/rf_model:latest")
metric = model_art.metadata.get("rmse")
try:
    prod = api.artifact("entity/project/rf_model:production")
    prod_rmse = prod.metadata.get("rmse")
except wandb.errors.CommError:      # no version carries the production alias yet
    prod_rmse = None
if prod_rmse is None or metric < 0.98 * prod_rmse:  # require roughly a 2% RMSE improvement
    model_art.aliases.extend(["candidate", "production"])
    model_art.save()
14. Security & Governance
| Area | Practice |
|---|---|
| API Keys | Store in secrets manager; never hardcode in repo |
| Access Control | Use Teams/Entities + role-based permissions (enterprise: SSO) |
| Data Minimization | Avoid uploading raw PII; hash or anonymize first |
| Offline Mode | Use WANDB_MODE=offline for air-gapped logging & then wandb sync |
| Artifact Retention | Periodically purge large unused versions; use lifecycle policies |
15. Performance & Cost Optimization
- Limit image/video frequency (e.g., every N epochs)
- Aggregate metrics before logging (avoid logging per-batch if not needed)
- Use Tables for structured data instead of thousands of individual metrics
- Compress large artifacts; chunk data logically
- Prefer incremental dataset artifacts (diff strategy) when practical
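For example, per-batch losses can be accumulated and logged once per epoch; a sketch assuming a loader and train_step from your own training loop:

for epoch in range(num_epochs):
    batch_losses = []
    for batch in loader:                        # hypothetical DataLoader
        batch_losses.append(train_step(batch))  # hypothetical step returning a float loss
    wandb.log({"train/loss_epoch": sum(batch_losses) / len(batch_losses), "epoch": epoch})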
16. Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Stuck at "Waiting for W&B process" | Network / firewall | Use offline mode or open required ports; check proxy |
| Duplicate runs | Script re-execution without guard | Ensure if __name__ == '__main__': and wandb.finish() |
| High memory usage | Logging huge objects | Store externally & reference; use artifacts |
| Sweep not starting agents | Wrong SWEEP_ID or entity | Re-run wandb sweep, verify entity/project |
| 403 errors | Invalid / expired API key | wandb login again |
17. FAQ
Q: How do I sync offline runs? Run wandb sync path/to/offline-dir.
Q: Can I remove a metric? You can hide it in the UI; raw history remains for integrity.
Q: Difference between artifact alias and version? Version is immutable (e.g., v3); alias is a movable pointer (production).
Q: How to attach metadata to artifacts? artifact.metadata.update({...}); artifact.save().
Q: How to export run data? Via UI export CSV or programmatically with wandb.Api().run(<path>).history().
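A sketch of the programmatic export mentioned above (replace <RUN_ID> with a real run id; entity/project are placeholders):

import wandb

api = wandb.Api()
run = api.run("entity/diabetes_rf/<RUN_ID>")
history = run.history()                         # pandas DataFrame of logged metrics
history.to_csv("run_history.csv", index=False)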
18. Next Steps
- Add organization-wide template reports
- Integrate with feature store & lineage graph
- Standardize evaluation table schema (e.g., columns: id, input, prediction, label, delta)
- Add LLM structured evaluation metrics (toxicity, factuality)
Last reviewed: 2025-09-17.
Suggestions or improvements? Open a PR to extend this guide.