How to add MLflow to your Kedro workflow

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow supports machine learning frameworks such as TensorFlow, PyTorch, and scikit-learn.

Adding MLflow to a Kedro project enables you to track and manage your machine learning experiments and models. For example, you can log metrics, parameters, and artifacts from your Kedro pipeline runs to MLflow, then compare and reproduce the results. When collaborating with others on a Kedro project, MLflow’s model registry and deployment tools help you to share and deploy machine learning models.

Prerequisites

You will need the following:

  • A working Kedro project in a virtual environment. The examples in this document assume the spaceflights-pandas-viz starter. If you’re unfamiliar with the Spaceflights project, check out our tutorial.

  • The MLflow client installed into the same virtual environment. For the purposes of this tutorial, you can use MLflow in its simplest configuration.

To set yourself up, create a new Kedro project:

$ kedro new --starter=spaceflights-pandas-viz --name spaceflights-mlflow
$ cd spaceflights-mlflow
$ python -m venv .venv && source .venv/bin/activate
(.venv) $ pip install -r requirements.txt

Then install MLflow and launch its UI locally from the root of your project directory as follows:

(.venv) $ pip install mlflow
(.venv) $ mlflow ui --backend-store-uri ./mlflow_runs

This will make MLflow record metadata and artifacts for each run to a local directory called mlflow_runs.
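
You can also point the MLflow Python client at this same local store, which is what the kedro-mlflow plugin will do for you in the next section. A quick sketch, assuming you run it from the project root:

from pathlib import Path

import mlflow

# Point the client at the same local store served by `mlflow ui`.
mlflow.set_tracking_uri(Path("mlflow_runs").resolve().as_uri())
print(mlflow.get_tracking_uri())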

Note

If you want to use a more sophisticated setup, have a look at the official MLflow tracking server 5 minute overview and the MLflow tracking server documentation.

Simple use cases

Although MLflow is designed primarily for machine learning (ML) and AI pipelines, you can track your regular Kedro runs as experiments in MLflow even if they do not use ML.

This section explains how you can use the kedro-mlflow plugin to track your Kedro pipelines in MLflow in a straightforward way.

Easy tracking of Kedro runs in MLflow using kedro-mlflow

To start using kedro-mlflow, install it first:

pip install kedro-mlflow

In recent versions of Kedro, installing the plugin automatically registers the kedro-mlflow Hooks for you.

Next, create an mlflow.yml configuration file in your conf/local directory that configures where the MLflow runs are stored, consistent with how you launched the mlflow ui command:

server:
  mlflow_tracking_uri: mlflow_runs

From this point, when you execute kedro run you will see the logs coming from kedro-mlflow:

[06/04/24 09:52:53] INFO     Kedro project spaceflights-mlflow                                     session.py:324
                    INFO     Registering new custom resolver: 'km.random_name'                  mlflow_hook.py:65
                    INFO     The 'tracking_uri' key in mlflow.yml is relative          kedro_mlflow_config.py:260
                             ('server.mlflow_(tracking|registry)_uri = mlflow_runs').
                             It is converted to a valid uri:
                             'file:///Users/juan_cano/Projects/QuantumBlackLabs/kedro-
                             mlflow-playground/spaceflights-mlflow/mlflow_runs'

If you open your tracking server UI, you will see something like this:

Complete MLflow tracking with kedro-mlflow

Notice that kedro-mlflow used a subset of the run_params as tags for the MLflow run, and logged the Kedro parameters as MLflow parameters.
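
You can also inspect those parameters and tags programmatically instead of through the UI. A minimal sketch using the MLflow Python API (MLflow 2.x), pointed at the same local store as above:

from pathlib import Path

import mlflow

mlflow.set_tracking_uri(Path("mlflow_runs").resolve().as_uri())

# search_runs returns a pandas DataFrame with one row per run,
# including the params.* and tags.* columns logged by kedro-mlflow.
runs = mlflow.search_runs(search_all_experiments=True)
print(runs.filter(regex=r"^(run_id|params\.|tags\.)").head())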

Check out the official kedro-mlflow tutorial for more detailed steps.

Artifact tracking in MLflow using kedro-mlflow

kedro-mlflow provides some out-of-the-box artifact tracking capabilities that connect your Kedro project with your MLflow deployment, such as MlflowArtifactDataset, which can be used to wrap any of your existing Kedro datasets.

Wrapping a dataset this way also lets you use the preview capabilities of the MLflow UI.

Warning

This only works for datasets that are outputs of a node; it has no effect on datasets that are free inputs (that is, datasets that are only loaded).

For example, if you modify a matplotlib.MatplotlibWriter dataset like this:

 # conf/base/catalog.yml

 dummy_confusion_matrix:
-  type: matplotlib.MatplotlibWriter
-  filepath: data/08_reporting/dummy_confusion_matrix.png
-  versioned: true
+  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
+  dataset:
+    type: matplotlib.MatplotlibWriter
+    filepath: data/08_reporting/dummy_confusion_matrix.png

Then the image would be logged as part of the artifacts of the run and you would be able to preview it in the MLflow web UI:

MLflow image preview thanks to the artifact tracking capabilities of kedro-mlflow
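
If you prefer to check the logged artifacts from code rather than the web UI, you can list them with MlflowClient. A minimal sketch, using a placeholder run ID that you would copy from the MLflow UI or from the run logs:

from pathlib import Path

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri=Path("mlflow_runs").resolve().as_uri())

run_id = "<your-run-id>"  # placeholder: copy a real run ID from the MLflow UI
for artifact in client.list_artifacts(run_id):
    print(artifact.path, artifact.file_size)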

Warning

If you get a Failed while saving data to data set MlflowMatplotlibWriter error, it’s probably because you had already executed kedro run while the dataset was marked as versioned: true. The solution is to clean up the old data/08_reporting/dummy_confusion_matrix.png directory.

Check out the official kedro-mlflow documentation on versioning Kedro datasets for more information.

Model registry in MLflow using kedro-mlflow

If your Kedro pipeline trains a machine learning model, you can track those models in MLflow so that you can manage and deploy them. The kedro-mlflow plugin introduces a special artifact, MlflowModelTrackingDataset, that you can use to load and save your models as MLflow artifacts.

For example, if you have a dataset corresponding to a scikit-learn model, you can modify it as follows:

 regressor:
-  type: pickle.PickleDataset
-  filepath: data/06_models/regressor.pickle
-  versioned: true
+  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
+  flavor: mlflow.sklearn

The kedro-mlflow Hook will log the model as part of the run in the standard MLflow Model format.
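
Once a run has finished, you can load that model back outside of Kedro with the standard MLflow API. A minimal sketch, assuming the model was logged under kedro-mlflow’s default artifact path (model) and using a placeholder run ID:

from pathlib import Path

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri(Path("mlflow_runs").resolve().as_uri())

run_id = "<your-run-id>"  # placeholder: copy a real run ID from the MLflow UI
# Assumes the model was logged under the default artifact path, "model".
regressor = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
print(regressor)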

If you also want to register it (that is, store it in the MLflow Model Registry), you can add a registered_model_name parameter:

regressor:
  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
  flavor: mlflow.sklearn
  save_args:
    registered_model_name: spaceflights-regressor

Then you will see it listed as a Registered Model:

MLflow Model Registry listing one model registered with kedro-mlflow
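
Registered models can then be loaded by name and version instead of by run ID. A minimal sketch, assuming version 1 of the spaceflights-regressor model already exists in the registry:

from pathlib import Path

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri(Path("mlflow_runs").resolve().as_uri())

# models:/<name>/<version> resolves the model through the MLflow Model Registry.
regressor = mlflow.sklearn.load_model("models:/spaceflights-regressor/1")
print(regressor)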

To load a model from a specific run, you can specify the run_id. For that, you can make use of runtime parameters:

# Add the intermediate datasets to run only the inference
X_test:
  type: pandas.ParquetDataset
  filepath: data/05_model_input/X_test.pq

y_test:
  type: pandas.CSVDataset  # https://github.com/pandas-dev/pandas/issues/54638
  filepath: data/05_model_input/y_test.csv

regressor:
  type: kedro_mlflow.io.models.MlflowModelTrackingDataset
  flavor: mlflow.sklearn
  run_id: ${runtime_params:mlflow_run_id,null}
  save_args:
    registered_model_name: spaceflights-regressor

Then specify the MLflow run ID on the command line as follows:

$ kedro run --to-outputs=X_test,y_test
...
$ kedro run --from-nodes=evaluate_model_node --params mlflow_run_id=4cba84...

Note

MLflow runs are immutable for reproducibility purposes, so you cannot save a model to an existing run.

Advanced use cases

Track additional metadata of Kedro runs in MLflow using Hooks

kedro-mlflow covers the most common tracking needs out of the box, but you might still need to track additional metadata for each run.

One possible way of doing this is to use the before_pipeline_run() Hook to log the run_params passed to it. An implementation would look as follows:

# src/spaceflights_mlflow/hooks.py

import typing as t
import logging

import mlflow
from kedro.framework.hooks import hook_impl

logger = logging.getLogger(__name__)


class ExtraMLflowHooks:
    @hook_impl
    def before_pipeline_run(self, run_params: dict[str, t.Any]):
        logger.info("Logging extra metadata to MLflow")
        mlflow.set_tags({
            "pipeline": run_params["pipeline_name"] or "__default__",
            "custom_version": "0.1.0",
        })

And then enable your custom hook in settings.py:

# src/spaceflights_mlflow/settings.py
...
from .hooks import ExtraMLflowHooks

HOOKS = (ExtraMLflowHooks(),)
...

After enabling this custom Hook, you can execute kedro run, and see something like this in the logs:

...
[06/04/24 10:44:25] INFO     Logging extra metadata to MLflow                                         hooks.py:13
...

If you open your tracking server UI, you will see something like this:

Simple MLflow tracking

Tracking Kedro in MLflow using the Python API

If you are running Kedro programmatically using the Python API, you can log your runs using the MLflow “fluent” API.

For example, taking the lifecycle management example as a starting point:

from pathlib import Path

import mlflow
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())

mlflow.set_experiment("Kedro Spaceflights test")

with KedroSession.create() as session:
    with mlflow.start_run():
        mlflow.set_tag("session_id", session.session_id)
        session.run()

If you want more flexibility or to log extra parameters, you might need to run the Kedro pipelines manually.
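
Often, though, you can get what you need by wrapping session.run() with additional fluent API calls. A minimal sketch extending the example above; the extra parameter and metric names are arbitrary illustrations, not Kedro or MLflow conventions:

from pathlib import Path

import mlflow
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(Path.cwd())

mlflow.set_experiment("Kedro Spaceflights test")

with KedroSession.create() as session:
    with mlflow.start_run(run_name="python-api-run"):
        mlflow.set_tag("session_id", session.session_id)
        # Log any extra parameters you need before triggering the run.
        mlflow.log_params({"pipeline_name": "__default__", "triggered_by": "python_api"})
        session.run(pipeline_name="__default__")
        # Record a simple metric once the run finishes without raising.
        mlflow.log_metric("pipeline_succeeded", 1)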