Weights & Biases (W&B)
This tutorial reflects current (2025) W&B capabilities. Some enterprise-only features (SSO, audit logs) are noted where relevant.
1. Overview
Weights & Biases (W&B) is a SaaS / self-hostable platform for experiment tracking, hyperparameter optimization, dataset & artifact versioning, model management, collaboration, and LLMOps. It provides deep integrations with popular ML/DL frameworks and emerging GenAI tooling.
Core feature areas:
| Domain | Features |
|---|---|
| Tracking | Runs, configs, metrics, media (images/audio/video), system stats |
| Artifacts | Versioned datasets, models, intermediate outputs, lineage graph |
| Sweeps | Distributed hyperparameter search (grid, random, Bayesian, early stopping) |
| Reports & Dashboards | Shareable analyses, interactive panels, tables |
| Tables | Structured dataset / evaluation / prediction logging for rich comparison |
| Model Registry | Model versions, aliases (e.g., production, staging), links to artifacts & evals |
| LLMOps | Prompt versioning, chain traces, token usage, evaluation results |
| Alerts & Automations | Notifications, triggers (e.g., metric threshold) |
2. Architecture & Concepts
Your Code (Python Script / Notebook / Trainer)
│
├── wandb.init() → Creates a Run (unique ID, config, settings)
│
├── wandb.log({...}) → Streams metrics & media asynchronously
│
├── wandb.Artifact() → Declares versioned artifact, add files, .log_artifact()
│
└── wandb.finish() → Finalizes run metadata
Cloud Backend (or Self-Hosted) → Stores metadata + binary artifacts (S3/GCS/Azure or W&B managed storage)
UI / API / SDK / CLI → Query runs, lineage, sweeps, tables, registry
Key entities:
- Project: Namespace grouping related runs
- Run: Single execution (training / eval)
- Config: Hyperparameters recorded at the start of a run (typically treated as fixed for that run)
- Metrics: Time-series values logged over steps / epochs
- Artifacts: Versioned file bundles with semantic types (dataset, model, result, custom)
- Aliases: Human-friendly labels (e.g., latest, best, prod) referencing artifact versions
- Tables: Structured logging for predictions/evaluations
- Sweep: Coordinated HPO orchestration
- Model Registry: Higher-level view binding artifacts + metadata + approvals
3. Installation & Setup
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install wandb scikit-learn pandas numpy torch torchvision matplotlib seaborn
Login (stores API key in ~/.netrc):
wandb login
Or programmatic:
import wandb, os
os.environ["WANDB_API_KEY"] = "<YOUR_KEY>"
wandb.login()
Environment variables (commonly used):
| Variable | Purpose |
|---|---|
| WANDB_API_KEY | Authentication token |
| WANDB_PROJECT | Default project name |
| WANDB_ENTITY | Team/org namespace |
| WANDB_MODE=offline | Offline logging (sync later) |
| WANDB_SILENT=true | Reduce console output |
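Most of these can also be set from Python before wandb.init(); a minimal sketch (the project name is just an example):

import os, wandb

os.environ["WANDB_PROJECT"] = "diabetes_rf"   # default project for subsequent runs
os.environ["WANDB_MODE"] = "offline"          # log locally; upload later with wandb sync
run = wandb.init()                            # picks the project up from the environment
run.finish()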
4. Basic Run & Logging
import wandb, time, random
wandb.init(project="diabetes_rf", config={
"model": "RandomForestRegressor",
"n_estimators": 120,
"max_depth": 6,
"seed": 42
})
for epoch in range(10):
loss = random.random() * (10/(epoch+1))
wandb.log({"epoch": epoch, "train/loss": loss})
time.sleep(0.2)
wandb.finish()
Run:
python train.py
Logging Media
wandb.log({"examples": [wandb.Image(img_array, caption="sample")]})
Grouping & Tags
run = wandb.init(project="exp-group-demo", group="baseline", tags=["rf", "v1"], job_type="train")
5. Configuration vs Metrics
- Config: set once (hyperparams) at or before wandb.init: wandb.config.learning_rate = 0.001
- Metrics: dynamic: wandb.log({"accuracy": acc}, step=epoch)
- Avoid logging huge arrays directly; use Tables / artifact files.
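A minimal sketch contrasting the two (the accuracy value is a stand-in for a real evaluation):

import wandb

run = wandb.init(project="config-vs-metrics", config={"learning_rate": 0.001})
wandb.config.weight_decay = 0.01              # config: recorded once, describes the run
for epoch in range(5):
    acc = 0.5 + 0.1 * epoch                   # stand-in for a real evaluation metric
    wandb.log({"accuracy": acc}, step=epoch)  # metrics: time series, one point per step
run.finish()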
6. Artifacts: Dataset & Model Versioning
Create and Log Dataset Artifact
import wandb, pandas as pd
df = pd.DataFrame({"x":[1,2,3], "y":[3,2,1]})
run = wandb.init(project="artifact-demo")
artifact = wandb.Artifact("toy_dataset", type="dataset", description="Small demo dataset")
df.to_csv("data.csv", index=False)
artifact.add_file("data.csv")
run.log_artifact(artifact)
run.finish()
Consume Artifact
run = wandb.init(project="artifact-demo")
artifact = run.use_artifact("toy_dataset:latest")
artifact_dir = artifact.download()
Log Model Artifact with Aliases
model_art = wandb.Artifact("rf_model", type="model", description="Baseline RF")
model_art.add_dir("model_dir") # e.g., saved sklearn object
run.log_artifact(model_art, aliases=["baseline","v1"])
Lineage is automatically recorded: which run produced which artifact and which runs consumed it.
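The lineage graph can also be queried from code; a small sketch using the public API (the entity name is a placeholder):

import wandb

api = wandb.Api()
art = api.artifact("entity/artifact-demo/toy_dataset:latest")
producer = art.logged_by()                    # run that created this artifact version
consumers = art.used_by()                     # runs that consumed it
print(producer.name, [r.name for r in consumers])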
7. Tables (Structured Data Logging)
wandb.Table enables interactive filtering, sorting, joining in the UI.
table = wandb.Table(columns=["id","prediction","label"])
for i,(pred,label) in enumerate(zip(preds, y_test)):
table.add_data(i, float(pred), float(label))
wandb.log({"eval/predictions": table})
Use tables for:
- Prediction inspection
- Error analysis (add columns for residuals)
- Prompt → response pairs for LLM evaluation
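For error analysis, the prediction table can carry a residual column; a sketch assuming the preds and y_test arrays from the example above:

table = wandb.Table(columns=["id", "prediction", "label", "residual"])
for i, (pred, label) in enumerate(zip(preds, y_test)):
    table.add_data(i, float(pred), float(label), float(pred - label))  # residual enables filtering/sorting in the UI
wandb.log({"eval/error_analysis": table})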
8. Sweeps (Hyperparameter Optimization)
Sweep Config (YAML)
program: train.py
method: bayes
metric:
name: val/accuracy
goal: maximize
parameters:
learning_rate:
    min: 0.00001   # written as a decimal; bare 1e-5 is parsed as a string by YAML
    max: 0.01
batch_size:
values: [32, 64, 128]
dropout:
distribution: uniform
min: 0.0
max: 0.5
early_terminate:
  type: hyperband
  min_iter: 3   # one of min_iter / max_iter is typically required to define the bracket schedule
Create & launch agents:
wandb sweep sweep.yaml # outputs SWEEP_ID
wandb agent <ENTITY>/<PROJECT>/<SWEEP_ID>
Inside train.py, reference config:
config = wandb.config
model = build_model(lr=config.learning_rate, dropout=config.dropout)
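Putting the pieces together, a sketch of what a sweep-driven train.py might look like (build_model and train_one_epoch are hypothetical helpers):

import wandb

def main():
    run = wandb.init()                        # the agent injects sweep parameters into wandb.config
    config = wandb.config
    model = build_model(lr=config.learning_rate, dropout=config.dropout)  # hypothetical
    for epoch in range(10):
        val_acc = train_one_epoch(model, batch_size=config.batch_size)    # hypothetical
        wandb.log({"val/accuracy": val_acc})  # key must match metric.name in sweep.yaml
    run.finish()

if __name__ == "__main__":
    main()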
9. Model Registry
W&B Model Registry layers on top of artifacts:
- Assign aliases: production, staging, best
- Attach evaluation tables & metrics panels
- Track promotions & approvals (enterprise features: permissions, comments)
Python alias update:
api = wandb.Api()
artifact = api.artifact("entity/project/rf_model:latest")
artifact.aliases.append("candidate")
artifact.save()
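To surface a logged model artifact in the registry from code, run.link_artifact can be used; a sketch assuming a registered model collection named demo-model already exists:

run = wandb.init(project="artifact-demo", job_type="register")
model_art = run.use_artifact("rf_model:v1")
run.link_artifact(model_art, "entity/model-registry/demo-model", aliases=["staging"])  # path and collection name are assumptions
run.finish()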
Compare models by linking evaluation tables in a Report.
10. Reports & Dashboards
Use the UI to compose Reports combining:
- Markdown narrative
- Run tables / charts
- Filtered panels (e.g., group=baseline)
- Embedded artifacts & media
Programmatic run set retrieval for analysis notebooks:
api = wandb.Api()
runs = api.runs("entity/diabetes_rf", filters={"config.model":"RandomForestRegressor"})
for r in runs:
print(r.name, r.summary.get("rmse"))
11. Framework Integrations
PyTorch
wandb.watch(model, log_graph=False, log="all") # gradients, parameters, histograms
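A small end-to-end sketch combining watch with per-step logging (random data, only to keep it runnable):

import torch, torch.nn as nn, wandb

run = wandb.init(project="pt-demo", config={"lr": 1e-3, "epochs": 3})
model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=run.config.lr)
wandb.watch(model, log="all", log_freq=10)    # gradients + parameter histograms

for epoch in range(run.config.epochs):
    x, y = torch.randn(64, 10), torch.randn(64, 1)  # stand-in batch
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    wandb.log({"train/loss": loss.item(), "epoch": epoch})
run.finish()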
PyTorch Lightning
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
logger = WandbLogger(project="pl-demo")
trainer = Trainer(logger=logger)
Hugging Face Transformers
pip install transformers datasets
from transformers import TrainingArguments
args = TrainingArguments(output_dir="outputs", report_to=["wandb"], run_name="bert-finetune")
Keras
wandb_callback = wandb.keras.WandbCallback(save_model=False)
model.fit(x, y, callbacks=[wandb_callback])
12. LLMOps & Prompt Tracking
Patterns for LLM evaluation:
- Log prompt, model name, parameters, latency
- Log response, tokens, quality metrics (BLEU, ROUGE, custom rubric)
- Use Tables to capture multi-turn context
Example:
import time
run = wandb.init(project="llm-eval")
for sample in dataset:
prompt = sample["prompt"]
t0 = time.time()
response = call_llm(prompt)
latency = time.time() - t0
wandb.log({
"llm/prompt": wandb.Html(f"<pre>{prompt}</pre>"),
"llm/response": wandb.Html(f"<pre>{response}</pre>"),
"llm/latency_s": latency,
"llm/tokens_total": len(response.split())
})
run.finish()
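As recommended above, the same evaluation is often easier to compare as a single Table instead of per-step Html panels; a sketch under the same assumptions (dataset and call_llm as before):

import time, wandb

run = wandb.init(project="llm-eval")
eval_table = wandb.Table(columns=["prompt", "response", "latency_s", "tokens_total"])
for sample in dataset:
    t0 = time.time()
    response = call_llm(sample["prompt"])
    eval_table.add_data(sample["prompt"], response, time.time() - t0, len(response.split()))
run.log({"llm/eval": eval_table})
run.finish()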
Advanced: integrate toolchains (LangChain, LlamaIndex) via their built-in W&B callbacks for trace graphs.
13. CI/CD Integration
Suggested pipeline:
- Train job logs run & model artifact
- Evaluation job consumes candidate artifact + production baseline → logs comparison table
- Policy step (script) reads metrics via API → if they pass the threshold, applies the candidate alias or promotes to production (see the snippet below)
- Deployment uses the alias to fetch a pinned model artifact
Promotion script snippet:
import wandb

api = wandb.Api()
model_art = api.artifact("entity/project/rf_model:latest")
metric = model_art.metadata.get("rmse")
try:
    prod = api.artifact("entity/project/rf_model:production")
    prod_rmse = prod.metadata.get("rmse")
except wandb.errors.CommError:      # no version carries the production alias yet
    prod_rmse = None
if prod_rmse is None or metric < 0.98 * prod_rmse:  # require roughly a 2% RMSE improvement
    model_art.aliases.extend(["candidate", "production"])
    model_art.save()
14. Security & Governance
| Area | Practice |
|---|---|
| API Keys | Store in secrets manager; never hardcode in repo |
| Access Control | Use Teams/Entities + role-based permissions (enterprise: SSO) |
| Data Minimization | Avoid uploading raw PII; hash or anonymize first |
| Offline Mode | Use WANDB_MODE=offline for air-gapped logging & then wandb sync |
| Artifact Retention | Periodically purge large unused versions; use lifecycle policies |
15. Performance & Cost Optimization
- Limit image/video frequency (e.g., every N epochs)
- Aggregate metrics before logging (avoid logging per-batch if not needed)
- Use Tables for structured data instead of thousands of individual metrics
- Compress large artifacts; chunk data logically
- Prefer incremental dataset artifacts (diff strategy) when practical
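For example, per-batch losses can be accumulated and logged once per epoch; a sketch assuming a loader and train_step from your own training loop:

for epoch in range(num_epochs):
    batch_losses = []
    for batch in loader:                        # hypothetical DataLoader
        batch_losses.append(train_step(batch))  # hypothetical step returning a float loss
    wandb.log({"train/loss_epoch": sum(batch_losses) / len(batch_losses), "epoch": epoch})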
16. Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Stuck at "Waiting for W&B process" | Network / firewall | Use offline mode or open required ports; check proxy |
| Duplicate runs | Script re-execution without guard | Ensure if __name__ == '__main__': and wandb.finish() |
| High memory usage | Logging huge objects | Store externally & reference; use artifacts |
| Sweep not starting agents | Wrong SWEEP_ID or entity | Re-run wandb sweep, verify entity/project |
| 403 errors | Invalid / expired API key | wandb login again |
17. FAQ
Q: How do I sync offline runs? Run wandb sync path/to/offline-dir.
Q: Can I remove a metric? You can hide it in the UI; raw history remains for integrity.
Q: Difference between artifact alias and version? Version is immutable (e.g., v3); alias is a movable pointer (production).
Q: How to attach metadata to artifacts? artifact.metadata.update({...}); artifact.save().
Q: How to export run data? Via UI export CSV or programmatically with wandb.Api().run(<path>).history().
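A sketch of the programmatic export mentioned above (replace <RUN_ID> with a real run id; entity/project are placeholders):

import wandb

api = wandb.Api()
run = api.run("entity/diabetes_rf/<RUN_ID>")
history = run.history()                         # pandas DataFrame of logged metrics
history.to_csv("run_history.csv", index=False)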
18. Next Steps
- Add organization-wide template reports
- Integrate with feature store & lineage graph
- Standardize evaluation table schema (e.g., columns: id, input, prediction, label, delta)
- Add LLM structured evaluation metrics (toxicity, factuality)
Last reviewed: 2025-09-17.
Suggestions or improvements? Open a PR to extend this guide.