
MLflow Tutorial

This guide reflects the latest MLflow open-source docs (2025). Always verify the pinned package version you install in production for reproducibility.

1. What is MLflow?

MLflow is an open-source platform to manage the end-to-end machine learning & LLM application lifecycle: tracking experiments, packaging code, managing models, evaluating and serving them. It is framework-agnostic and works with scikit-learn, PyTorch, TensorFlow, XGBoost, LightGBM, Hugging Face, and custom logic.

Core pillars:

| Component | Purpose |
| --- | --- |
| Tracking | Log & query params, metrics, tags, artifacts, models |
| Models & Flavors | Standard format for saving models with multiple runtime flavors (e.g., python_function/pyfunc, sklearn) |
| Model Registry | Model governance: versions, stages (None → Staging → Production → Archived), lineage, annotations |
| Projects | Reproducible packaging of ML code (entry points + conda/env spec) |
| Model Evaluation | Standardized evaluation & comparison of models (incl. LLM/GenAI) |
| Deployment | Serve models locally, via REST, in batch, or on external platforms |
| GenAI Tracing | Track prompts, responses, latencies, costs for LLM apps |

2. Architecture Overview

Client (Python / R / REST / JS)
        │
        ▼
Tracking Server (API)
        ├──▶ Backend Store (SQL: SQLite / MySQL / PostgreSQL)
        └──▶ Artifact Store (Local FS / S3 / GCS / Azure Blob / MinIO / NFS)

Model Registry (DB tables + artifact pointers)
Deployment Targets: local pyfunc, mlflow models serve, Docker, SageMaker, Databricks, Ray Serve, Kubernetes, custom

Key separation (see the configuration sketch after this list):

  • Backend store persists runs, params, metrics, tags, model versions (relational DB recommended for multi-user)
  • Artifact store holds large binary objects: model artifacts, plots, datasets
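
To make this separation concrete, here is a minimal sketch of a self-hosted setup; the server command is shown as a comment, and the database URL, bucket, and hostname are placeholders to adjust to your environment:

import mlflow

# Hypothetical server start (run once on the tracking host):
#   mlflow server \
#     --backend-store-uri postgresql://mlflow:password@db:5432/mlflow \
#     --default-artifact-root s3://my-bucket/mlflow-artifacts \
#     --host 0.0.0.0 --port 5000

# Clients then log to that server instead of the local ./mlruns directory
mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URL
mlflow.set_experiment("diabetes_rf")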

3. Installation & Environment

Use a virtual environment and pin the MLflow version (the command below sets a floor of mlflow>=2.15.0; for production reproducibility, pin an exact release such as mlflow==2.16.0, adjusted to the most recent stable version):

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install "mlflow>=2.15.0" scikit-learn pandas numpy

Optional extras:

pip install xgboost lightgbm matplotlib seaborn tqdm jinja2[extras] boto3 minio

Verify:

python -c "import mlflow, sys; print('MLflow version:', mlflow.__version__)"

4. Quick Start: Minimal Experiment

quickstart.py
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error  # the squared=False argument of mean_squared_error is removed in recent scikit-learn

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf_baseline"):
    n_estimators = 120
    max_depth = 6
    model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    rmse = root_mean_squared_error(y_test, preds)

    # Log params & metrics
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_metric("rmse", rmse)

    # Log model with an input example (the input schema/signature is inferred from it)
    mlflow.sklearn.log_model(model, artifact_path="model", input_example=X_test[:5])

print("Run complete. View UI with: mlflow ui --port 5000")

Launch local UI:

mlflow ui --port 5000

Open http://127.0.0.1:5000 and inspect the run.

5. Tracking Concepts

| Concept | Description | Notes |
| --- | --- | --- |
| Experiment | Logical group for runs | Name or ID; auto-created on logging if missing |
| Run | Single execution context | Identified by run UUID |
| Param | Immutable key/value (string) | Changing a param requires a new run |
| Metric | Time-series numeric value | Last logged value shown; supports step logging |
| Tag | Metadata label (string) | Free-form indexing |
| Artifact | File / dir output | Stored in artifact store |
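
For example, a metric can be logged per step (e.g., per epoch) and a tag attached for later filtering; this small sketch assumes an active experiment and uses illustrative values:

with mlflow.start_run(run_name="step_logging_demo"):
    mlflow.set_tag("data_version", "2025Q3_v2")            # free-form tag for later filtering
    for step, loss in enumerate([0.9, 0.7, 0.55, 0.5]):
        mlflow.log_metric("train_loss", loss, step=step)   # time-series metric with explicit step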

Programmatic creation:

experiment = mlflow.set_experiment("diabetes_rf")  # returns an Experiment object
print(experiment.experiment_id)

Nested & Child Runs

with mlflow.start_run(run_name="parent"):
mlflow.log_param("parent", True)
with mlflow.start_run(run_name="child", nested=True):
mlflow.log_metric("child_score", 0.87)

Autologging

Autologging captures parameters, metrics, models automatically.

mlflow.sklearn.autolog(log_models=True, registered_model_name="DiabetesRF")

Be cautious: mixing explicit logging with autologging can produce conflicting or duplicated keys, so prefer one approach per run. Disable with mlflow.autolog(disable=True).
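
A minimal autologging sketch follows; it assumes the diabetes split (X_train, y_train) from the quickstart is already in scope:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor

mlflow.sklearn.autolog()  # params, training metrics, and the fitted model are captured automatically

with mlflow.start_run(run_name="autolog_demo"):
    RandomForestRegressor(n_estimators=80, random_state=42).fit(X_train, y_train)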

6. Hyperparameter Search Example

tune.py
import mlflow, mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import ParameterGrid
# ... load data as before (X_train, X_test, y_train, y_test) ...

grid = ParameterGrid({"n_estimators": [50, 100, 150], "max_depth": [4, 6, 8]})
mlflow.set_experiment("diabetes_rf_grid")
for params in grid:
    with mlflow.start_run():
        model = RandomForestRegressor(**params, random_state=42)
        model.fit(X_train, y_train)
        rmse = root_mean_squared_error(y_test, model.predict(X_test))
        mlflow.log_params(params)
        mlflow.log_metric("rmse", rmse)

Query best run:

from mlflow import MlflowClient

client = MlflowClient()
exp = client.get_experiment_by_name("diabetes_rf_grid")
runs = client.search_runs(experiment_ids=[exp.experiment_id],
                          order_by=["metrics.rmse ASC"], max_results=1)
print(runs[0].info.run_id, runs[0].data.metrics["rmse"])

7. Artifacts (Datasets, Plots, Models)

import tempfile, json, matplotlib.pyplot as plt

with mlflow.start_run():
    tmp = tempfile.mkdtemp()
    config_path = f"{tmp}/config.json"
    with open(config_path, "w") as f:
        json.dump({"seed": 42}, f)
    mlflow.log_artifact(config_path, artifact_path="config")
    plt.figure(); plt.plot([1, 2, 3], [2, 3, 4]); plt.title("Trend"); plt.savefig(f"{tmp}/plot.png")
    mlflow.log_artifact(f"{tmp}/plot.png", artifact_path="figures")

Download artifacts later:

client.download_artifacts(run_id, path="figures/plot.png", dst_path="./downloads")

8. Model Flavors & pyfunc

Every logged model has one or more flavors describing how to load it. Common flavors: python_function (universal), sklearn, xgboost, lightgbm, pytorch, transformers, onnx.

Load model by URI:

model_uri = f"runs:/{run_id}/model"
loaded = mlflow.pyfunc.load_model(model_uri)
preds = loaded.predict(X_test)
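
When the native object is needed (for example, to inspect feature importances), the same artifact can also be loaded through its framework flavor instead of pyfunc:

import mlflow.sklearn

sk_model = mlflow.sklearn.load_model(model_uri)  # returns the raw RandomForestRegressor
print(sk_model.feature_importances_[:3])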

Custom pyfunc Model

import mlflow.pyfunc

class Multiplier(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local paths; none are supplied here, so 3 is used
        self.factor = int(context.artifacts.get("factor", 3))

    def predict(self, context, model_input):
        return model_input * self.factor

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="multiplier",
        python_model=Multiplier(),
        artifacts={},
        input_example=[1, 2, 3],
    )

9. Model Registry (Governance)

  1. Log the model with registered_model_name, or register an existing run's model:

result = mlflow.register_model(model_uri, name="DiabetesRF")
print(result.version)

  2. Transition its stage:

client.transition_model_version_stage(
    name="DiabetesRF", version=result.version, stage="Staging", archive_existing_versions=True)

  3. Add a description & tags:

client.update_model_version(
    name="DiabetesRF", version=result.version, description="Baseline RandomForest")
client.set_model_version_tag("DiabetesRF", result.version, "framework", "sklearn")

Fetch the current Production version:

client.get_latest_versions("DiabetesRF", stages=["Production"])
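
Note that recent MLflow releases (2.9+) recommend model version aliases over stages; a short sketch, with "champion" as an arbitrary alias name:

client.set_registered_model_alias("DiabetesRF", alias="champion", version=result.version)
champion_model = mlflow.pyfunc.load_model("models:/DiabetesRF@champion")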

10. MLflow Projects (Optional)

MLproject file example:

MLproject
name: diabetes_rf
conda_env: conda.yaml
entry_points:
  train:
    parameters:
      n_estimators: {type: int, default: 100}
      max_depth: {type: int, default: 6}
    command: "python train.py --n-estimators {n_estimators} --max-depth {max_depth}"

Run:

mlflow run . -e train -P n_estimators=150 -P max_depth=8
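
The train entry point above expects a train.py that accepts those flags; a minimal sketch (the script and its argument names are illustrative, mirroring the command string):

train.py
import argparse

import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--n-estimators", type=int, default=100)
parser.add_argument("--max-depth", type=int, default=6)
args = parser.parse_args()

X_train, X_test, y_train, y_test = train_test_split(
    *load_diabetes(return_X_y=True), test_size=0.2, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(
        n_estimators=args.n_estimators, max_depth=args.max_depth, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_params(vars(args))
    mlflow.sklearn.log_model(model, artifact_path="model")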

11. Model Evaluation

Recent MLflow versions provide a standardized evaluation API.

from mlflow.models import evaluate

eval_result = evaluate(
    model=model_uri,  # URI of a logged model, e.g. f"runs:/{run_id}/model" (see section 8)
    model_type="regressor",
    data=X_test,
    targets=y_test,
    evaluators=["default"],
    feature_names=[f"f{i}" for i in range(X_test.shape[1])],
)
print(eval_result.metrics)

Artifacts such as confusion matrices (classification) and residual plots (regression) are logged automatically when supported.

12. Serving & Deployment

Local Serving

mlflow models serve -m runs:/$RUN_ID/model -p 5001 --env-manager local

POST request:

curl -X POST http://127.0.0.1:5001/invocations \
-H 'Content-Type: application/json' \
-d '{"inputs": [[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]]}'

Docker Image

mlflow models build-docker -m runs:/$RUN_ID/model -n diabetes_rf:latest
docker run -p 5002:8080 diabetes_rf:latest

Other Targets

  • SageMaker: mlflow deployments create -t sagemaker (newer CLI; supersedes the older mlflow sagemaker deploy)
  • Azure ML / Kubernetes: export model & integrate
  • Custom: load with pyfunc inside FastAPI / Ray Serve / BentoML.

13. Batch Inference

model = mlflow.pyfunc.load_model("models:/DiabetesRF/Production")
import pandas as pd
df = pd.read_csv("new_data.csv")
df["prediction"] = model.predict(df.drop("target", axis=1))
df.to_parquet("predictions.parquet")

14. GenAI / LLM Tracking (Newer Capabilities)

Track prompt/response pairs (simplified stand‑in):

with mlflow.start_run(run_name="llm_prompt"):
mlflow.log_param("model_family", "gpt-like")
prompt = "Summarize: MLflow manages ML lifecycle."
# Suppose response & tokens
response = "MLflow tracks, packages, registers, and deploys models."
mlflow.log_text(prompt, artifact_file="prompt.txt")
mlflow.log_text(response, artifact_file="response.txt")
mlflow.log_metric("prompt_tokens", 8)
mlflow.log_metric("completion_tokens", 9)

For full GenAI tracing, use the dedicated MLflow tracing/GenAI APIs (refer to the latest official docs, as these evolve quickly).
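
For instance, recent MLflow releases (2.14+) expose a tracing decorator; a hedged sketch, assuming your installed version includes it and using a stand-in generation function:

import mlflow

@mlflow.trace  # records inputs, outputs, and latency as a trace span
def summarize(prompt: str) -> str:
    # Stand-in for a real LLM call
    return "MLflow tracks, packages, registers, and deploys models."

summarize("Summarize: MLflow manages ML lifecycle.")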

15. CI/CD Integration

Recommended pattern:

  1. Training job (GitHub Actions / Jenkins) runs mlflow run or python script
  2. Logs model & registers new version
  3. Automated tests evaluate candidate vs Production (A/B metrics)
  4. If passes threshold, transition stage to Staging then Production
  5. Trigger deployment pipeline (Docker build & push, infra update)

Example stage transition gating:

if new_rmse < prod_rmse * 0.98:
    client.transition_model_version_stage(
        name, version, stage="Production", archive_existing_versions=True)
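
For context, new_rmse and prod_rmse in the gate above might be fetched as sketched below; the model name, metric key, and experiment name are carried over from this tutorial and would differ in your pipeline:

from mlflow import MlflowClient

client = MlflowClient()

# Candidate: best run from the latest sweep
candidate = client.search_runs(
    experiment_ids=[client.get_experiment_by_name("diabetes_rf_grid").experiment_id],
    order_by=["metrics.rmse ASC"], max_results=1)[0]
new_rmse = candidate.data.metrics["rmse"]

# Current Production version: read the metric from its source run
prod_version = client.get_latest_versions("DiabetesRF", stages=["Production"])[0]
prod_rmse = client.get_run(prod_version.run_id).data.metrics["rmse"]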

16. Security & Governance

| Aspect | Recommendation |
| --- | --- |
| Access Control | Use a reverse proxy + auth for the tracking server (e.g., nginx + OIDC) |
| Data Privacy | Avoid logging PII/raw sensitive data as artifacts or params |
| Reproducibility | Pin versions (mlflow, libs, dataset hashes) via requirements.txt or a conda env |
| Lineage | Use tags: mlflow.set_tag("data_version", "2025Q3_v2") |
| Isolation | Separate dev / staging / prod tracking servers if compliance requires |

17. Performance Tips

  • Use a real DB (PostgreSQL/MySQL) rather than default SQLite for concurrency
  • Store large datasets outside MLflow; log references (URIs) instead
  • Prune old runs or archive them using search queries (see the sketch after this list)
  • Turn off autologging parts you don't need to reduce overhead
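
The pruning tip can be scripted; a sketch that soft-deletes runs older than a cutoff (the experiment name and 90-day retention window are assumptions):

import time
from mlflow import MlflowClient

client = MlflowClient()
cutoff_ms = int((time.time() - 90 * 24 * 3600) * 1000)  # keep roughly the last 90 days

exp = client.get_experiment_by_name("diabetes_rf_grid")
old_runs = client.search_runs(
    experiment_ids=[exp.experiment_id],
    filter_string=f"attributes.start_time < {cutoff_ms}",
)
for run in old_runs:
    client.delete_run(run.info.run_id)  # soft delete; run `mlflow gc` to reclaim storage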

18. Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| sqlite database is locked | Concurrent writes | Migrate to PostgreSQL/MySQL |
| Slow UI load | Too many metrics per run | Log aggregated metrics, not every raw step |
| Model not loading | Missing dependency | Recreate env from conda.yaml or python_env.yaml |
| 404 on model stage | Version not transitioned | Check registry permissions & stage spelling |

19. Frequently Asked Questions

Q: How do I ensure reproducible runs? Log the environment (mlflow.log_artifact("requirements.txt")) plus dataset version tags.

Q: Can I edit a metric after logging? Metrics are append-only; log a corrected metric with a higher step.

Q: How big can artifacts be? It depends on the artifact store; for very large artifacts (multiple GB), prefer external storage and log a pointer (URI) instead.

Q: Difference between run model URI & registry URI? runs:/<run_id>/artifact_path is immutable snapshot; models:/Name/Stage resolves dynamic latest version in that stage.

20. Next Steps

  • Add automated evaluation notebook
  • Integrate with feature store (e.g., Feast) for consistent offline/online features
  • Adopt GenAI tracing as the APIs stabilize and your use cases require it

Last validated against MLflow public docs (latest branch) on 2025-09-17.

Have improvements or org-specific patterns? Add them below or open a PR.