MLflow Tutorial
This guide reflects the latest MLflow open-source docs (2025). Always verify the pinned package version you install in production for reproducibility.
1. What is MLflow?
MLflow is an open-source platform to manage the end-to-end machine learning & LLM application lifecycle: tracking experiments, packaging code, managing models, evaluating and serving them. It is framework-agnostic and works with scikit-learn, PyTorch, TensorFlow, XGBoost, LightGBM, Hugging Face, and custom logic.
Core pillars:
| Component | Purpose |
|---|---|
| Tracking | Log & query params, metrics, tags, artifacts, models |
| Models & Flavors | Standard format for saving models with multiple runtime flavors (e.g., python_function (pyfunc), sklearn, xgboost) |
| Model Registry | Model governance: versions, stages (None → Staging → Production → Archived), lineage, annotations |
| Projects | Reproducible packaging of ML code (entry points + conda/env spec) |
| Model Evaluation | Standardized evaluation & comparison of models (incl. LLM/GenAI) |
| Deployment | Serve models locally (REST endpoint), in batch, or on external platforms |
| GenAI Tracing | Track prompts, responses, latencies, costs for LLM apps |
2. Architecture Overview
Client (Python / R / REST / JS) ──▶ Tracking Server (API) ──▶ Backend Store (SQL: SQLite / MySQL / PostgreSQL)
                                            │
                                            └──▶ Artifact Store (Local FS / S3 / GCS / Azure Blob / MinIO / NFS)

Model Registry (DB tables + artifact pointers)
Deployment Targets: local pyfunc, mlflow models serve, Docker, SageMaker, Databricks, Ray Serve, Kubernetes, custom
Key separation:
- Backend store persists runs, params, metrics, tags, model versions (relational DB recommended for multi-user)
- Artifact store holds large binary objects: model artifacts, plots, datasets
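In a shared setup, the tracking server is started with mlflow server pointing at a database-backed store and a remote artifact root, and clients simply point at its URL. A minimal client-side sketch (the server URL is a placeholder):

import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example.com:5000")  # placeholder URL, adjust to your server
mlflow.set_experiment("diabetes_rf")

with mlflow.start_run():
    mlflow.log_metric("smoke_test", 1.0)  # confirms runs land on the remote server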
3. Installation & Environment
Use a virtual environment and pin the version for reproducibility (the command below uses a floor of mlflow>=2.15.0; in production, pin an exact release such as mlflow==2.16.0, adjusted to the most recent stable):
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install "mlflow>=2.15.0" scikit-learn pandas numpy
Optional extras:
pip install xgboost lightgbm matplotlib seaborn tqdm jinja2 boto3 minio
Verify:
python -c "import mlflow, sys; print('MLflow version:', mlflow.__version__)"
4. Quick Start: Minimal Experiment
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    n_estimators = 120
    max_depth = 6
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = mean_squared_error(y_test, preds) ** 0.5  # RMSE; avoids the deprecated squared=False argument

    # Log params & metrics
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_metric("rmse", rmse)

    # Log the model; the input example lets MLflow infer a signature
    mlflow.sklearn.log_model(model, artifact_path="model", input_example=X_test[:5])

print("Run complete. View UI with: mlflow ui --port 5000")
Launch local UI:
mlflow ui --port 5000
Open http://127.0.0.1:5000 and inspect the run.
5. Tracking Concepts
| Concept | Description | Notes |
|---|---|---|
| Experiment | Logical group for runs | Name or ID; auto-created on logging if missing |
| Run | Single execution context | Identified by run UUID |
| Param | Immutable key/value (string) | Changing param requires new run |
| Metric | Time-series numeric value | Last logged value shown; supports step logging |
| Tag | Metadata label (string) | Free-form indexing |
| Artifact | File / dir output | Stored in artifact store |
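Metrics can be logged per step, which is how loss or training curves appear in the UI; tags attach free-form metadata. A small sketch:

with mlflow.start_run(run_name="step_logging_demo"):
    mlflow.set_tag("data_version", "2025Q3_v2")             # free-form tag
    for step, loss in enumerate([0.9, 0.6, 0.45, 0.41]):    # illustrative loss values
        mlflow.log_metric("train_loss", loss, step=step)     # full per-step history is stored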
Programmatic creation:
experiment = mlflow.set_experiment("diabetes_rf")
print(experiment.experiment_id)
Nested & Child Runs
with mlflow.start_run(run_name="parent"):
    mlflow.log_param("parent", True)
    with mlflow.start_run(run_name="child", nested=True):
        mlflow.log_metric("child_score", 0.87)
Autologging
Autologging captures parameters, metrics, models automatically.
mlflow.sklearn.autolog(log_models=True, registered_model_name="DiabetesRF")
Be cautious: explicitly logging the same parameter or metric keys that autologging writes can cause conflicts. Disable with mlflow.autolog(disable=True).
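A typical pattern is to enable autologging once and then train inside a run; fit parameters, training metrics, and the fitted model are captured without explicit log calls (sketch, reusing the earlier data split):

mlflow.sklearn.autolog()

with mlflow.start_run(run_name="autolog_demo"):
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)   # params, metrics, and the model are logged automatically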
6. Hyperparameter Search Example
import mlflow, mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid
# ... load data as before ...

grid = ParameterGrid({"n_estimators": [50, 100, 150], "max_depth": [4, 6, 8]})
mlflow.set_experiment("diabetes_rf_grid")

for params in grid:
    with mlflow.start_run():
        model = RandomForestRegressor(**params, random_state=42)
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        mlflow.log_params(params)
        mlflow.log_metric("rmse", rmse)
Query best run:
from mlflow import MlflowClient

client = MlflowClient()
experiment = client.get_experiment_by_name("diabetes_rf_grid")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.rmse ASC"],
    max_results=1,
)
print(runs[0].info.run_id, runs[0].data.metrics["rmse"])
7. Artifacts (Datasets, Plots, Models)
import tempfile, json, matplotlib.pyplot as plt

with mlflow.start_run():
    tmp = tempfile.mkdtemp()
    config_path = f"{tmp}/config.json"
    with open(config_path, "w") as f:
        json.dump({"seed": 42}, f)
    mlflow.log_artifact(config_path, artifact_path="config")
    plt.figure(); plt.plot([1, 2, 3], [2, 3, 4]); plt.title("Trend"); plt.savefig(f"{tmp}/plot.png")
    mlflow.log_artifact(f"{tmp}/plot.png", artifact_path="figures")
Download artifacts later:
client.download_artifacts(run_id, path="figures/plot.png", dst_path="./downloads")
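Newer releases also expose a module-level helper that accepts a run ID or a full artifact URI; a sketch (check your installed version for exact availability):

from mlflow.artifacts import download_artifacts

local_path = download_artifacts(run_id=run_id, artifact_path="figures/plot.png", dst_path="./downloads")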
8. Model Flavors & pyfunc
Every logged model has one or more flavors describing how to load it. Common flavors: python_function (universal), sklearn, xgboost, lightgbm, pytorch, transformers, onnx.
Load model by URI:
model_uri = f"runs:/{run_id}/model"
loaded = mlflow.pyfunc.load_model(model_uri)
preds = loaded.predict(X_test)
Custom pyfunc Model
import json
import mlflow.pyfunc

class Multiplier(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local file paths
        with open(context.artifacts["factor"]) as f:
            self.factor = int(json.load(f)["factor"])

    def predict(self, context, model_input):
        return model_input * self.factor

with open("factor.json", "w") as f:
    json.dump({"factor": 3}, f)

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="multiplier",
        python_model=Multiplier(),
        artifacts={"factor": "factor.json"},
        input_example=[1, 2, 3],
    )
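Loading it back goes through the generic pyfunc interface, which typically hands predict a pandas DataFrame (run_id below is the run that logged the model):

import pandas as pd

mult = mlflow.pyfunc.load_model(f"runs:/{run_id}/multiplier")
print(mult.predict(pd.DataFrame({"x": [1, 2, 3]})))   # each value multiplied by the factor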
9. Model Registry (Governance)
- Log the model with registered_model_name, or register an existing run's model:
result = mlflow.register_model(model_uri, name="DiabetesRF")
print(result.version)
- Transition stage:
client.transition_model_version_stage(
    name="DiabetesRF", version=result.version, stage="Staging", archive_existing_versions=True
)
- Add description & tags:
client.update_model_version(
    name="DiabetesRF", version=result.version, description="Baseline RandomForest"
)
client.set_model_version_tag("DiabetesRF", result.version, "framework", "sklearn")
Fetch the latest version in the Production stage:
client.get_latest_versions("DiabetesRF", stages=["Production"])
10. MLflow Projects (Optional)
MLproject file example:
name: diabetes_rf
conda_env: conda.yaml
entry_points:
  train:
    parameters:
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 6}
    command: "python train.py --n-estimators {n_estimators} --max-depth {max_depth}"
Run:
mlflow run . -e train -P n_estimators=150 -P max_depth=8
11. Model Evaluation
Recent MLflow versions provide a standardized evaluation API; it expects a pyfunc model (or a model URI) plus evaluation data.
from mlflow.models import evaluate

eval_result = evaluate(
    model=f"runs:/{run_id}/model",   # a pyfunc model URI (or a loaded pyfunc model)
    data=X_test,
    targets=y_test,
    model_type="regressor",
    evaluators=["default"],
    feature_names=[f"f{i}" for i in range(X_test.shape[1])],
)
print(eval_result.metrics)
Artifacts like confusion matrices, residual plots (for regression) are logged automatically when supported.
12. Serving & Deployment
Local Serving
mlflow models serve -m runs:/$RUN_ID/model -p 5001 --env-manager local
POST request:
curl -X POST http://127.0.0.1:5001/invocations \
-H 'Content-Type: application/json' \
-d '{"inputs": [[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]]}'
Docker Image
mlflow models build-docker -m runs:/$RUN_ID/model -n diabetes_rf:latest
docker run -p 5002:8080 diabetes_rf:latest
Other Targets
- SageMaker: mlflow sagemaker deploy (or the mlflow deployments CLI in newer releases)
- Azure ML / Kubernetes: export the model & integrate
- Custom: load with pyfunc inside FastAPI / Ray Serve / BentoML (sketch below).
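For the custom route, a minimal FastAPI sketch; the registered model name DiabetesRF comes from earlier sections, while the endpoint, request shape, and module name are illustrative (this is not an MLflow API):

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")   # load once at startup

class PredictRequest(BaseModel):
    inputs: list[list[float]]   # rows of feature values

@app.post("/predict")
def predict(req: PredictRequest):
    preds = model.predict(pd.DataFrame(req.inputs))
    return {"predictions": [float(p) for p in preds]}

# Run with: uvicorn serve_app:app --port 8000   (module name serve_app is hypothetical)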
13. Batch Inference
import pandas as pd
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")
df = pd.read_csv("new_data.csv")
df["prediction"] = model.predict(df.drop("target", axis=1))
df.to_parquet("predictions.parquet")
14. GenAI / LLM Tracking (Newer Capabilities)
Track prompt/response pairs (simplified stand‑in):
with mlflow.start_run(run_name="llm_prompt"):
    mlflow.log_param("model_family", "gpt-like")
    prompt = "Summarize: MLflow manages ML lifecycle."
    # Suppose this response (and its token counts) came back from the LLM call
    response = "MLflow tracks, packages, registers, and deploys models."
    mlflow.log_text(prompt, artifact_file="prompt.txt")
    mlflow.log_text(response, artifact_file="response.txt")
    mlflow.log_metric("prompt_tokens", 8)
    mlflow.log_metric("completion_tokens", 9)
For full GenAI tracing, use the dedicated MLflow genai APIs (refer to latest official docs as they evolve fast).
15. CI/CD Integration
Recommended pattern:
- Training job (GitHub Actions / Jenkins) runs mlflow run or a Python script
- Logs the model & registers a new version
- Automated tests evaluate the candidate vs Production (A/B metrics)
- If it passes the threshold, transition the stage to Staging, then Production
- Trigger the deployment pipeline (Docker build & push, infra update)
Example stage transition gating:
if new_rmse < prod_rmse * 0.98:
    client.transition_model_version_stage(name, version, stage="Production", archive_existing_versions=True)
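Here new_rmse and prod_rmse would typically come from the candidate run and the current Production version; a sketch assuming the DiabetesRF model and the rmse metric logged earlier:

from mlflow import MlflowClient

client = MlflowClient()
name = "DiabetesRF"

candidate = client.get_latest_versions(name, stages=["Staging"])[0]
new_rmse = client.get_run(candidate.run_id).data.metrics["rmse"]

prod = client.get_latest_versions(name, stages=["Production"])[0]
prod_rmse = client.get_run(prod.run_id).data.metrics["rmse"]

if new_rmse < prod_rmse * 0.98:   # require roughly a 2% improvement
    client.transition_model_version_stage(
        name, candidate.version, stage="Production", archive_existing_versions=True
    )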
16. Security & Governance
| Aspect | Recommendation |
|---|---|
| Access Control | Use reverse proxy + auth for tracking server (e.g., nginx + OIDC) |
| Data Privacy | Avoid logging PII/raw sensitive data as artifacts or params |
| Reproducibility | Pin versions (mlflow, libs, dataset hashes) via requirements.txt or conda env |
| Lineage | Use tags: mlflow.set_tag("data_version", "2025Q3_v2") |
| Isolation | Separate dev / staging / prod tracking servers if compliance requires |
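The Reproducibility and Lineage rows translate to a couple of lines per training run (file names and tag keys other than data_version are illustrative):

with mlflow.start_run():
    mlflow.log_artifact("requirements.txt")        # pinned environment; assumes the file exists
    mlflow.set_tag("data_version", "2025Q3_v2")    # dataset lineage
    mlflow.set_tag("git_commit", "abc1234")        # illustrative extra tag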
17. Performance Tips
- Use a real DB (PostgreSQL/MySQL) rather than default SQLite for concurrency
- Store large datasets outside MLflow; log references (URIs) instead
- Prune or archive old runs using search queries (see the sketch after this list)
- Turn off autologging parts you don't need to reduce overhead
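A pruning sketch for the bullet above, assuming the diabetes_rf_grid experiment from earlier; the cutoff and result limit are illustrative:

from mlflow import MlflowClient
import time

client = MlflowClient()
exp = client.get_experiment_by_name("diabetes_rf_grid")
cutoff_ms = int((time.time() - 90 * 24 * 3600) * 1000)   # runs older than ~90 days

old_runs = client.search_runs(
    [exp.experiment_id],
    filter_string=f"attributes.start_time < {cutoff_ms}",
    max_results=500,
)
for r in old_runs:
    client.delete_run(r.info.run_id)   # soft delete; the run moves to the 'deleted' lifecycle stage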
18. Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| sqlite database is locked | Concurrent writes | Migrate to PostgreSQL/MySQL |
| Slow UI load | Too many metrics per run | Log aggregated metrics not all raw steps |
| Model not loading | Missing dependency | Recreate env from conda.yaml or python_env.yaml |
| 404 on model stage | Version not transitioned | Check registry permissions & stage spelling |
19. Frequently Asked Questions
Q: How do I ensure reproducible runs? Log the environment (mlflow.log_artifact("requirements.txt")) plus dataset version tags.
Q: Can I edit a metric after logging? Metrics are append-only; log a corrected value at a higher step.
Q: How big can artifacts be? Depends on the artifact store; for very large artifacts (multiple GB), prefer external storage and log a pointer (URI).
Q: Difference between a run model URI and a registry URI? runs:/<run_id>/artifact_path is an immutable snapshot; models:/Name/Stage resolves to the latest version in that stage.
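In code, both URI forms load through the same pyfunc interface:

snapshot = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")              # frozen to that run
latest_prod = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")   # follows the stage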
20. Next Steps
- Add automated evaluation notebook
- Integrate with feature store (e.g., Feast) for consistent offline/online features
- Extend GenAI tracing as those APIs stabilize
Last validated against MLflow public docs (latest branch) on 2025-09-17.
Have improvements or org-specific patterns? Add them below or open a PR.