MLflow Tutorial
This guide reflects the latest MLflow open-source docs (2025). Always verify the pinned package version you install in production for reproducibility.
1. What is MLflow?
MLflow is an open-source platform to manage the end-to-end machine learning & LLM application lifecycle: tracking experiments, packaging code, managing models, evaluating and serving them. It is framework-agnostic and works with scikit-learn, PyTorch, TensorFlow, XGBoost, LightGBM, Hugging Face, and custom logic.
Core pillars:
| Component | Purpose |
|---|---|
| Tracking | Log & query params, metrics, tags, artifacts, models |
| Models & Flavors | Standard format for saving models with multiple runtime flavors (e.g., python_function (pyfunc), sklearn, xgboost) |
| Model Registry | Model governance: versions, stages (None → Staging → Production → Archived), lineage, annotations |
| Projects | Reproducible packaging of ML code (entry points + conda/env spec) |
| Model Evaluation | Standardized evaluation & comparison of models (incl. LLM/GenAI) |
| Deployment | Serve models locally (REST endpoint), in batch, or on external platforms |
| GenAI Tracing | Track prompts, responses, latencies, costs for LLM apps |
2. Architecture Overview
Client (Python / R / REST / JS) ──▶ Tracking Server (API) ──▶ Backend Store (SQL: SQLite / MySQL / PostgreSQL)
                                            │
                                            └──▶ Artifact Store (Local FS / S3 / GCS / Azure Blob / MinIO / NFS)

Model Registry (DB tables + artifact pointers)
Deployment Targets: local pyfunc, mlflow models serve, Docker, SageMaker, Databricks, Ray Serve, Kubernetes, custom
Key separation:
- Backend store persists runs, params, metrics, tags, model versions (relational DB recommended for multi-user)
- Artifact store holds large binary objects: model artifacts, plots, datasets
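In a shared setup, the tracking server is started with mlflow server pointing at a database-backed store and a remote artifact root, and clients simply point at its URL. A minimal client-side sketch (the server URL is a placeholder):

import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")  # placeholder URL, adjust to your server
mlflow.set_experiment("diabetes_rf")

with mlflow.start_run():
    mlflow.log_metric("smoke_test", 1.0)  # confirms runs land on the remote server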
3. Installation & Environment
Use a virtual environment and pin the version for reproducibility (the command below uses a floor of mlflow>=2.15.0; in production, pin an exact release such as mlflow==2.16.0, adjusted to the most recent stable):
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install "mlflow>=2.15.0" scikit-learn pandas numpy
Optional extras:
pip install xgboost lightgbm matplotlib seaborn tqdm jinja2 boto3 minio
Verify:
python -c "import mlflow, sys; print('MLflow version:', mlflow.__version__)"
4. Quick Start: Minimal Experiment
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    n_estimators = 120
    max_depth = 6
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = mean_squared_error(y_test, preds) ** 0.5  # RMSE; avoids the deprecated squared=False argument

    # Log params & metrics
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_metric("rmse", rmse)

    # Log the model; the input example lets MLflow infer a signature
    mlflow.sklearn.log_model(model, artifact_path="model", input_example=X_test[:5])

print("Run complete. View UI with: mlflow ui --port 5000")
Launch local UI:
mlflow ui --port 5000
Open http://127.0.0.1:5000 and inspect the run.
5. Tracking Concepts
| Concept | Description | Notes |
|---|---|---|
| Experiment | Logical group for runs | Name or ID; auto-created on logging if missing |
| Run | Single execution context | Identified by run UUID |
| Param | Immutable key/value (string) | Changing param requires new run |
| Metric | Time-series numeric value | Last logged value shown; supports step logging |
| Tag | Metadata label (string) | Free-form indexing |
| Artifact | File / dir output | Stored in artifact store |
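Metrics can be logged per step, which is how loss or training curves appear in the UI; tags attach free-form metadata. A small sketch:

with mlflow.start_run(run_name="step_logging_demo"):
    mlflow.set_tag("data_version", "2025Q3_v2")             # free-form tag
    for step, loss in enumerate([0.9, 0.6, 0.45, 0.41]):    # illustrative loss values
        mlflow.log_metric("train_loss", loss, step=step)     # full per-step history is stored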
Programmatic creation:
experiment = mlflow.set_experiment("diabetes_rf")
print(experiment.experiment_id)
Nested & Child Runs
with mlflow.start_run(run_name="parent"):
    mlflow.log_param("parent", True)
    with mlflow.start_run(run_name="child", nested=True):
        mlflow.log_metric("child_score", 0.87)
Autologging
Autologging captures parameters, metrics, models automatically.
mlflow.sklearn.autolog(log_models=True, registered_model_name="DiabetesRF")
Be cautious: explicitly logging the same parameter or metric keys that autologging writes can cause conflicts. Disable with mlflow.autolog(disable=True).
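A typical pattern is to enable autologging once and then train inside a run; fit parameters, training metrics, and the fitted model are captured without explicit log calls (sketch, reusing the earlier data split):

mlflow.sklearn.autolog()

with mlflow.start_run(run_name="autolog_demo"):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)   # params, metrics, and the model are logged automatically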
6. Hyperparameter Search Example
import mlflow, mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid
# ... load data as before ...

grid = ParameterGrid({"n_estimators": [50, 100, 150], "max_depth": [4, 6, 8]})
mlflow.set_experiment("diabetes_rf_grid")

for params in grid:
    with mlflow.start_run():
        model = RandomForestRegressor(**params, random_state=42)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        mlflow.log_params(params)
        mlflow.log_metric("rmse", rmse)
Query best run:
from mlflow import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("diabetes_rf_grid")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.rmse ASC"],
    max_results=1,
)
print(runs[0].info.run_id, runs[0].data.metrics["rmse"])
7. Artifacts (Datasets, Plots, Models)
import tempfile, json, matplotlib.pyplot as plt

with mlflow.start_run():
    tmp = tempfile.mkdtemp()
    config_path = f"{tmp}/config.json"
    with open(config_path, "w") as f:
        json.dump({"seed": 42}, f)
    mlflow.log_artifact(config_path, artifact_path="config")
    plt.figure(); plt.plot([1, 2, 3], [2, 3, 4]); plt.title("Trend"); plt.savefig(f"{tmp}/plot.png")
    mlflow.log_artifact(f"{tmp}/plot.png", artifact_path="figures")
Download artifacts later:
client.download_artifacts(run_id, path="figures/plot.png", dst_path="./downloads")
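Newer releases also expose a module-level helper that accepts a run ID or a full artifact URI; a sketch (check your installed version for exact availability):

from mlflow.artifacts import download_artifacts

local_path = download_artifacts(run_id=run_id, artifact_path="figures/plot.png", dst_path="./downloads")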
8. Model Flavors & pyfunc
Every logged model has one or more flavors describing how to load it. Common flavors: python_function (universal), sklearn, xgboost, lightgbm, pytorch, transformers, onnx.
Load model by URI:
model_uri = f"runs:/{run_id}/model"
loaded = mlflow.pyfunc.load_model(model_uri)
preds = loaded.predict(X_test)
Custom pyfunc Model
import json
import mlflow.pyfunc

class Multiplier(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local file paths
        with open(context.artifacts["factor"]) as f:
            self.factor = int(json.load(f)["factor"])

    def predict(self, context, model_input):
        return model_input * self.factor

with open("factor.json", "w") as f:
    json.dump({"factor": 3}, f)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="multiplier",
        python_model=Multiplier(),
        artifacts={"factor": "factor.json"},
        input_example=[1, 2, 3],
    )
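Loading it back goes through the generic pyfunc interface, which typically hands predict a pandas DataFrame (run_id below is the run that logged the model):

import pandas as pd

mult = mlflow.pyfunc.load_model(f"runs:/{run_id}/multiplier")
print(mult.predict(pd.DataFrame({"x": [1, 2, 3]})))   # each value multiplied by the factor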
9. Model Registry (Governance)
- Log the model with registered_model_name, or register an existing run's model:
result = mlflow.register_model(model_uri, name="DiabetesRF")
print(result.version)
- Transition stage:
client.transition_model_version_stage(
    name="DiabetesRF", version=result.version, stage="Staging", archive_existing_versions=True
)
- Add description & tags:
client.update_model_version(
    name="DiabetesRF", version=result.version, description="Baseline RandomForest"
)
client.set_model_version_tag("DiabetesRF", result.version, "framework", "sklearn")
Fetch the latest version in the Production stage:
client.get_latest_versions("DiabetesRF", stages=["Production"])
10. MLflow Projects (Optional)
MLproject file example:
name: diabetes_rf
conda_env: conda.yaml
entry_points:
  train:
    parameters:
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 6}
    command: "python train.py --n-estimators {n_estimators} --max-depth {max_depth}"
Run:
mlflow run . -e train -P n_estimators=150 -P max_depth=8
11. Model Evaluation
Recent MLflow versions provide a standardized evaluation API; it expects a pyfunc model (or a model URI) plus evaluation data.
from mlflow.models import evaluate

eval_result = evaluate(
    model=f"runs:/{run_id}/model",   # a pyfunc model URI (or a loaded pyfunc model)
    data=X_test,
    targets=y_test,
    model_type="regressor",
    evaluators=["default"],
    feature_names=[f"f{i}" for i in range(X_test.shape[1])],
)
print(eval_result.metrics)
Artifacts like confusion matrices, residual plots (for regression) are logged automatically when supported.
12. Serving & Deployment
Local Serving
mlflow models serve -m runs:/$RUN_ID/model -p 5001 --env-manager local
POST request:
curl -X POST http://127.0.0.1:5001/invocations \
-H 'Content-Type: application/json' \
-d '{"inputs": [[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]]}'
Docker Image
mlflow models build-docker -m runs:/$RUN_ID/model -n diabetes_rf:latest
docker run -p 5002:8080 diabetes_rf:latest
Other Targets
- SageMaker: mlflow sagemaker deploy (or the mlflow deployments CLI in newer releases)
- Azure ML / Kubernetes: export the model & integrate
- Custom: load with pyfunc inside FastAPI / Ray Serve / BentoML (sketch below).
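For the custom route, a minimal FastAPI sketch; the registered model name DiabetesRF comes from earlier sections, while the endpoint, request shape, and module name are illustrative (this is not an MLflow API):

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")   # load once at startup

class PredictRequest(BaseModel):
    inputs: list[list[float]]   # rows of feature values

@app.post("/predict")
def predict(req: PredictRequest):
    preds = model.predict(pd.DataFrame(req.inputs))
    return {"predictions": [float(p) for p in preds]}

# Run with: uvicorn serve_app:app --port 8000   (module name serve_app is hypothetical)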
13. Batch Inference
import pandas as pd
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")
df = pd.read_csv("new_data.csv")
df["prediction"] = model.predict(df.drop("target", axis=1))
df.to_parquet("predictions.parquet")
14. GenAI / LLM Tracking (Newer Capabilities)
Track prompt/response pairs (simplified stand‑in):
with mlflow.start_run(run_name="llm_prompt"):
    mlflow.log_param("model_family", "gpt-like")
    prompt = "Summarize: MLflow manages ML lifecycle."
    # Suppose this response (and its token counts) came back from the LLM call
    response = "MLflow tracks, packages, registers, and deploys models."
    mlflow.log_text(prompt, artifact_file="prompt.txt")
    mlflow.log_text(response, artifact_file="response.txt")
    mlflow.log_metric("prompt_tokens", 8)
    mlflow.log_metric("completion_tokens", 9)
For full GenAI tracing, use the dedicated MLflow genai APIs (refer to latest official docs as they evolve fast).
15. CI/CD Integration
Recommended pattern:
- Training job (GitHub Actions / Jenkins) runs mlflow run or a Python script
- Logs the model & registers a new version
- Automated tests evaluate the candidate vs Production (A/B metrics)
- If it passes the threshold, transition the stage to Staging, then Production
- Trigger the deployment pipeline (Docker build & push, infra update)
Example stage transition gating:
if new_rmse < prod_rmse * 0.98:
    client.transition_model_version_stage(name, version, stage="Production", archive_existing_versions=True)
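Here new_rmse and prod_rmse would typically come from the candidate run and the current Production version; a sketch assuming the DiabetesRF model and the rmse metric logged earlier:

from mlflow import MlflowClient

client = MlflowClient()
name = "DiabetesRF"

candidate = client.get_latest_versions(name, stages=["Staging"])[0]
new_rmse = client.get_run(candidate.run_id).data.metrics["rmse"]

prod = client.get_latest_versions(name, stages=["Production"])[0]
prod_rmse = client.get_run(prod.run_id).data.metrics["rmse"]

if new_rmse < prod_rmse * 0.98:   # require roughly a 2% improvement
    client.transition_model_version_stage(
        name, candidate.version, stage="Production", archive_existing_versions=True
    )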
16. Security & Governance
| Aspect | Recommendation |
|---|---|
| Access Control | Use reverse proxy + auth for tracking server (e.g., nginx + OIDC) |
| Data Privacy | Avoid logging PII/raw sensitive data as artifacts or params |
| Reproducibility | Pin versions (mlflow, libs, dataset hashes) via requirements.txt or conda env |
| Lineage | Use tags: mlflow.set_tag("data_version", "2025Q3_v2") |
| Isolation | Separate dev / staging / prod tracking servers if compliance requires |
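The Reproducibility and Lineage rows translate to a couple of lines per training run (file names and tag keys other than data_version are illustrative):

with mlflow.start_run():
    mlflow.log_artifact("requirements.txt")        # pinned environment; assumes the file exists
    mlflow.set_tag("data_version", "2025Q3_v2")    # dataset lineage
    mlflow.set_tag("git_commit", "abc1234")        # illustrative extra tag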
17. Performance Tips
- Use a real DB (PostgreSQL/MySQL) rather than default SQLite for concurrency
- Store large datasets outside MLflow; log references (URIs) instead
- Prune or archive old runs using search queries (see the sketch after this list)
- Turn off autologging parts you don't need to reduce overhead
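A pruning sketch for the bullet above, assuming the diabetes_rf_grid experiment from earlier; the cutoff and result limit are illustrative:

from mlflow import MlflowClient
import time

client = MlflowClient()
exp = client.get_experiment_by_name("diabetes_rf_grid")
cutoff_ms = int((time.time() - 90 * 24 * 3600) * 1000)   # runs older than ~90 days

old_runs = client.search_runs(
    [exp.experiment_id],
    filter_string=f"attributes.start_time < {cutoff_ms}",
    max_results=500,
)
for r in old_runs:
    client.delete_run(r.info.run_id)   # soft delete; the run moves to the 'deleted' lifecycle stage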
18. Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| sqlite database is locked | Concurrent writes | Migrate to PostgreSQL/MySQL |
| Slow UI load | Too many metrics per run | Log aggregated metrics not all raw steps |
| Model not loading | Missing dependency | Recreate env from conda.yaml or python_env.yaml |
| 404 on model stage | Version not transitioned | Check registry permissions & stage spelling |
19. Frequently Asked Questions
Q: How do I ensure reproducible runs? Log the environment (mlflow.log_artifact("requirements.txt")) plus dataset version tags.
Q: Can I edit a metric after logging? Metrics are append-only; log a corrected value at a higher step.
Q: How big can artifacts be? Depends on the artifact store; for very large artifacts (multiple GB), prefer external storage and log a pointer (URI).
Q: Difference between a run model URI and a registry URI? runs:/<run_id>/artifact_path is an immutable snapshot; models:/Name/Stage resolves to the latest version in that stage.
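In code, both URI forms load through the same pyfunc interface:

snapshot = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")              # frozen to that run
latest_prod = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")   # follows the stage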
20. Next Steps
- Add automated evaluation notebook
- Integrate with feature store (e.g., Feast) for consistent offline/online features
- Extend GenAI tracing as those APIs stabilize
Last validated against MLflow public docs (latest branch) on 2025-09-17.
Have improvements or org-specific patterns? Add them below or open a PR.