Weights & Biases (W&B)

This tutorial reflects current (2025) W&B capabilities. Some enterprise-only features (SSO, audit logs) are noted where relevant.

1. Overview

Weights & Biases (W&B) is a SaaS / self-hostable platform for experiment tracking, hyperparameter optimization, dataset & artifact versioning, model management, collaboration, and LLMOps. It provides deep integrations with popular ML/DL frameworks and emerging GenAI tooling.

Core feature areas:

  • Tracking: Runs, configs, metrics, media (images/audio/video), system stats
  • Artifacts: Versioned datasets, models, intermediate outputs, lineage graph
  • Sweeps: Distributed hyperparameter search (grid, random, Bayesian, early stopping)
  • Reports & Dashboards: Shareable analyses, interactive panels, tables
  • Tables: Structured dataset / evaluation / prediction logging for rich comparison
  • Model Registry: Model versions, aliases (e.g., production, staging), links to artifacts & evals
  • LLMOps: Prompt versioning, chain traces, token usage, evaluation results
  • Alerts & Automations: Notifications, triggers (e.g., metric threshold)

2. Architecture & Concepts

 Your Code (Python Script / Notebook / Trainer)

├── wandb.init() → Creates a Run (unique ID, config, settings)

├── wandb.log({...}) → Streams metrics & media asynchronously

├── wandb.Artifact() → Declares a versioned artifact (add files, then log it with run.log_artifact())

└── wandb.finish() → Finalizes run metadata

Cloud Backend (or Self-Hosted) → Stores metadata + binary artifacts (S3/GCS/Azure or W&B managed storage)
UI / API / SDK / CLI → Query runs, lineage, sweeps, tables, registry

Key entities:

  • Project: Namespace grouping related runs
  • Run: Single execution (training / eval)
  • Config: Hyperparameters and run settings (dictionary recorded at run start)
  • Metrics: Time-series values logged over steps / epochs
  • Artifacts: Versioned file bundles with semantic types (dataset, model, result, custom)
  • Aliases: Human-friendly labels (e.g., latest, best, prod) referencing artifact versions
  • Tables: Structured logging for predictions/evaluations
  • Sweep: Coordinated HPO orchestration
  • Model Registry: Higher-level view binding artifacts + metadata + approvals

3. Installation & Setup

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install wandb scikit-learn pandas numpy torch torchvision matplotlib seaborn

Login (stores API key in ~/.netrc):

wandb login

Or programmatic:

import wandb, os
os.environ["WANDB_API_KEY"] = "<YOUR_KEY>"
wandb.login()

Environment variables (commonly used):

  • WANDB_API_KEY: Authentication token
  • WANDB_PROJECT: Default project name
  • WANDB_ENTITY: Team/org namespace
  • WANDB_MODE=offline: Offline logging (sync later)
  • WANDB_SILENT=true: Reduce console output
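
These can also be set from Python before wandb.init picks them up. A minimal sketch (the entity name is a placeholder; in practice the API key should come from a secrets manager, not source code):

import os
import wandb

os.environ["WANDB_PROJECT"] = "diabetes_rf"
os.environ["WANDB_ENTITY"] = "my-team"      # placeholder team/org namespace
os.environ["WANDB_MODE"] = "offline"        # log locally; upload later with `wandb sync`
os.environ["WANDB_SILENT"] = "true"

run = wandb.init()   # project/entity/mode are read from the environment
run.log({"smoke_test": 1})
run.finish()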

4. Basic Run & Logging

train.py
import wandb, time, random

wandb.init(project="diabetes_rf", config={
    "model": "RandomForestRegressor",
    "n_estimators": 120,
    "max_depth": 6,
    "seed": 42
})

for epoch in range(10):
    loss = random.random() * (10 / (epoch + 1))
    wandb.log({"epoch": epoch, "train/loss": loss})
    time.sleep(0.2)

wandb.finish()

Run:

python train.py

Logging Media

wandb.log({"examples": [wandb.Image(img_array, caption="sample")]})

Grouping & Tags

run = wandb.init(project="exp-group-demo", group="baseline", tags=["rf", "v1"], job_type="train")

5. Configuration vs Metrics

  • Config: hyperparameters recorded once, either passed to wandb.init(config={...}) or set just after init (e.g., wandb.config.learning_rate = 0.001)
  • Metrics: values logged repeatedly over time: wandb.log({"accuracy": acc}, step=epoch)
  • Avoid logging huge arrays directly; use Tables or artifact files instead.
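
A minimal sketch contrasting the two (values are illustrative):

import wandb

run = wandb.init(project="diabetes_rf", config={"learning_rate": 0.001, "epochs": 3})

lr = run.config.learning_rate            # config: fixed inputs, read back as attributes

for epoch in range(run.config.epochs):
    acc = 0.5 + 0.1 * epoch              # placeholder metric value
    run.log({"accuracy": acc}, step=epoch)   # metrics: a time series keyed by step

run.finish()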

6. Artifacts: Dataset & Model Versioning

Create and Log Dataset Artifact

import wandb, pandas as pd
df = pd.DataFrame({"x":[1,2,3], "y":[3,2,1]})
run = wandb.init(project="artifact-demo")
artifact = wandb.Artifact("toy_dataset", type="dataset", description="Small demo dataset")
df.to_csv("data.csv", index=False)
artifact.add_file("data.csv")
run.log_artifact(artifact)
run.finish()

Consume Artifact

run = wandb.init(project="artifact-demo")
artifact = run.use_artifact("toy_dataset:latest")
artifact_dir = artifact.download()

Log Model Artifact with Aliases

model_art = wandb.Artifact("rf_model", type="model", description="Baseline RF")
model_art.add_dir("model_dir") # e.g., saved sklearn object
run.log_artifact(model_art, aliases=["baseline","v1"])

Lineage is automatically recorded: which run produced which artifact and which runs consumed it.
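
A hedged sketch of inspecting that lineage through the public API (the artifact path is a placeholder):

import wandb

api = wandb.Api()
artifact = api.artifact("entity/artifact-demo/toy_dataset:latest")

producer = artifact.logged_by()    # the run that created this version (or None)
consumers = artifact.used_by()     # runs that called use_artifact() on it

print("produced by:", producer.name if producer else None)
print("consumed by:", [r.name for r in consumers])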

7. Tables (Structured Data Logging)

wandb.Table enables interactive filtering, sorting, and joining in the UI.

table = wandb.Table(columns=["id", "prediction", "label"])
for i, (pred, label) in enumerate(zip(preds, y_test)):
    table.add_data(i, float(pred), float(label))
wandb.log({"eval/predictions": table})

Use tables for:

  • Prediction inspection
  • Error analysis (add columns for residuals; see the sketch after this list)
  • Prompt → response pairs for LLM evaluation
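
For error analysis, a minimal sketch with a residual column (predictions and labels here are placeholders):

import numpy as np
import wandb

y_test = np.array([3.0, 2.0, 1.0])   # placeholder labels
preds = np.array([2.8, 2.3, 0.7])    # placeholder predictions

run = wandb.init(project="artifact-demo", job_type="eval")
table = wandb.Table(columns=["id", "prediction", "label", "residual"])
for i, (p, y) in enumerate(zip(preds, y_test)):
    table.add_data(i, float(p), float(y), float(p - y))
run.log({"eval/residuals": table})
run.finish()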

8. Sweeps (Hyperparameter Optimization)

Sweep Config (YAML)

sweep.yaml
program: train.py
method: bayes
metric:
  name: val/accuracy
  goal: maximize
parameters:
  learning_rate:
    min: 1.0e-5   # decimal point keeps YAML parsing these as floats
    max: 1.0e-2
  batch_size:
    values: [32, 64, 128]
  dropout:
    distribution: uniform
    min: 0.0
    max: 0.5
early_terminate:
  type: hyperband
  min_iter: 3   # hyperband requires min_iter or max_iter

Create & launch agents:

wandb sweep sweep.yaml   # outputs SWEEP_ID
wandb agent <ENTITY>/<PROJECT>/<SWEEP_ID>

Inside train.py, reference config:

config = wandb.config
model = build_model(lr=config.learning_rate, dropout=config.dropout)
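
A fuller, hedged sketch of a sweep-compatible train.py that logs the metric named in the sweep config (the training loop itself is a placeholder):

import random
import wandb

def main():
    run = wandb.init()        # the agent injects this trial's hyperparameters
    config = run.config

    for epoch in range(5):
        # A real job would build and train a model from config.learning_rate,
        # config.batch_size, and config.dropout; here the metric is simulated.
        val_acc = random.random()
        run.log({"epoch": epoch, "val/accuracy": val_acc})

    run.finish()

if __name__ == "__main__":
    main()

Sweeps can also be created and driven from Python with wandb.sweep(sweep_config, project=...) and wandb.agent(sweep_id, function=main, count=N) instead of the CLI.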

9. Model Registry

W&B Model Registry layers on top of artifacts:

  • Assign aliases: production, staging, best
  • Attach evaluation tables & metrics panels
  • Track promotions & approvals (enterprise features: permissions, comments)

Python alias update:

api = wandb.Api()
artifact = api.artifact("entity/project/rf_model:latest")
artifact.aliases.append("candidate")
artifact.save()
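
A model version can also be linked into a registry collection from the producing run. A hedged sketch (the target path is a placeholder, and its format differs between the legacy Model Registry and the newer W&B Registry):

import wandb

run = wandb.init(project="artifact-demo", job_type="register")
model_art = run.use_artifact("rf_model:latest")   # previously logged model artifact

# Placeholder target path; adjust to your entity and registered-model collection.
run.link_artifact(model_art, "entity/model-registry/diabetes-regressor", aliases=["staging"])
run.finish()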

Compare models by linking evaluation tables in a Report.

10. Reports & Dashboards

Use the UI to compose Reports combining:

  • Markdown narrative
  • Run tables / charts
  • Filtered panels (e.g., group=baseline)
  • Embedded artifacts & media

Programmatic run set retrieval for analysis notebooks:

api = wandb.Api()
runs = api.runs("entity/diabetes_rf", filters={"config.model": "RandomForestRegressor"})
for r in runs:
    print(r.name, r.summary.get("rmse"))

11. Framework Integrations

PyTorch

wandb.watch(model, log_graph=False, log="all")  # gradients, parameters, histograms

PyTorch Lightning

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="pl-demo")
trainer = Trainer(logger=logger)

Hugging Face Transformers

pip install transformers datasets

from transformers import TrainingArguments
args = TrainingArguments(output_dir="outputs", report_to=["wandb"], run_name="bert-finetune")

Keras

wandb_callback = wandb.keras.WandbCallback(save_model=False)
model.fit(x, y, callbacks=[wandb_callback])

12. LLMOps & Prompt Tracking

Patterns for LLM evaluation:

  1. Log prompt, model name, parameters, latency
  2. Log response, tokens, quality metrics (BLEU, ROUGE, custom rubric)
  3. Use Tables to capture multi-turn context

Example:

import time
import wandb

run = wandb.init(project="llm-eval")
for sample in dataset:
    prompt = sample["prompt"]
    t0 = time.time()
    response = call_llm(prompt)
    latency = time.time() - t0
    wandb.log({
        "llm/prompt": wandb.Html(f"<pre>{prompt}</pre>"),
        "llm/response": wandb.Html(f"<pre>{response}</pre>"),
        "llm/latency_s": latency,
        "llm/tokens_total": len(response.split())  # rough whitespace-based token count
    })
run.finish()

Advanced: integrate toolchains (LangChain, LlamaIndex) via their built-in W&B callbacks for trace graphs.

13. CI/CD Integration

Suggested pipeline:

  1. Train job logs run & model artifact
  2. Evaluation job consumes candidate artifact + production baseline → logs comparison table (see the sketch after this list)
  3. Policy step (script) reads metrics via API → if pass threshold, applies candidate alias or promotes to production
  4. Deployment uses alias to fetch pinned model artifact
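
Step 2, for example, might look like the following sketch (artifact names and the rmse() helper are placeholders):

import wandb

def rmse(model_dir, data_dir):
    # Hypothetical helper: load the model and data, return root-mean-squared error.
    return 0.0

run = wandb.init(project="diabetes_rf", job_type="evaluate")

candidate = run.use_artifact("rf_model:latest")
baseline = run.use_artifact("rf_model:production")
data_dir = run.use_artifact("toy_dataset:latest").download()

table = wandb.Table(columns=["variant", "rmse"])
table.add_data("candidate", rmse(candidate.download(), data_dir))
table.add_data("production", rmse(baseline.download(), data_dir))

run.log({"eval/comparison": table})
run.finish()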

Promotion script snippet:

api = wandb.Api()
model_art = api.artifact("entity/project/rf_model:latest")
metric = model_art.metadata.get("rmse")
try:
    prod = api.artifact("entity/project/rf_model:production")
    prod_rmse = prod.metadata.get("rmse")
except Exception:  # no production version/alias exists yet
    prod_rmse = None
if prod_rmse is None or metric < 0.98 * prod_rmse:
    model_art.aliases.extend(["candidate", "production"])
    model_art.save()

14. Security & Governance

  • API keys: Store in a secrets manager; never hardcode in the repo
  • Access control: Use Teams/Entities + role-based permissions (enterprise: SSO)
  • Data minimization: Avoid uploading raw PII; hash or anonymize first
  • Offline mode: Use WANDB_MODE=offline for air-gapped logging, then wandb sync
  • Artifact retention: Periodically purge large unused versions; use lifecycle policies

15. Performance & Cost Optimization

  • Limit image/video logging frequency (e.g., every N epochs; see the sketch after this list)
  • Aggregate metrics before logging (avoid logging per-batch if not needed)
  • Use Tables for structured data instead of thousands of individual metrics
  • Compress large artifacts; chunk data logically
  • Prefer incremental dataset artifacts (diff strategy) when practical
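
A minimal sketch of the first two points, aggregating per-batch losses and throttling media (all values are placeholders):

import random
import numpy as np
import wandb

run = wandb.init(project="diabetes_rf")

for epoch in range(100):
    batch_losses = [random.random() for _ in range(50)]   # placeholder per-batch losses
    # Aggregate before logging instead of logging every batch.
    run.log({"train/loss": sum(batch_losses) / len(batch_losses)}, step=epoch)

    # Log heavy media only every 10th epoch.
    if epoch % 10 == 0:
        img = np.random.rand(64, 64, 3)                   # placeholder image
        run.log({"samples": [wandb.Image(img, caption=f"epoch {epoch}")]}, step=epoch)

run.finish()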

16. Troubleshooting

  • Stuck at "Waiting for W&B process": network/firewall issue. Use offline mode or open the required ports; check proxy settings.
  • Duplicate runs: script re-executed without a guard. Use an if __name__ == '__main__': guard and call wandb.finish().
  • High memory usage: logging huge objects. Store them externally and reference them; use artifacts.
  • Sweep not starting agents: wrong SWEEP_ID or entity. Re-run wandb sweep and verify entity/project.
  • 403 errors: invalid or expired API key. Run wandb login again.

17. FAQ

Q: How do I sync offline runs? Run wandb sync path/to/offline-dir.

Q: Can I remove a metric? You can hide it in the UI; raw history remains for integrity.

Q: Difference between artifact alias and version? Version is immutable (e.g., v3); alias is a movable pointer (production).

Q: How to attach metadata to artifacts? artifact.metadata.update({...}); artifact.save().

Q: How to export run data? Via UI export CSV or programmatically with wandb.Api().run(<path>).history().
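
For the last question, a short sketch exporting sampled run history to CSV (the run path is a placeholder):

import wandb

api = wandb.Api()
run = api.run("entity/diabetes_rf/abc123")   # entity/project/run_id

history = run.history()                      # pandas DataFrame of logged metrics
history.to_csv("run_history.csv", index=False)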

18. Next Steps

  • Add organization-wide template reports
  • Integrate with feature store & lineage graph
  • Standardize evaluation table schema (e.g., columns: id, input, prediction, label, delta)
  • Add LLM structured evaluation metrics (toxicity, factuality)

Last reviewed: 2025-09-17.

Suggestions or improvements? Open a PR to extend this guide.