Deploying Kedro on Databricks

Databricks is a managed Spark platform that is commonly used to run large-scale data processing workloads in production. Kedro integrates naturally with Databricks. There is no single “correct” way to work with it, and the best setup depends on where your code lives and how you prefer to develop.

This guide explains the three supported ways to run Kedro on Databricks, what actually happens in each setup, and when you should choose one over another:

| Option | Where your code runs | Where Spark runs | Best for |
| --- | --- | --- | --- |
| Run within Databricks (Git folders) | Databricks workspace | Databricks cluster | Notebook-first workflows, analysts and platform teams |
| Local + remote Databricks (Databricks Connect) | Your local machine or Docker container | Databricks cluster | Local-first development, tight IDE integration, or cloud-agnostic execution |
| Production with kedro-databricks | CI/CD pipeline | Databricks Jobs | Repeatable production deployments |

Prerequisites

Before starting, make sure you have:

  • Python 3.9+ installed locally.
  • Kedro 1.0+ (installed with pip or uv).
  • kedro-datasets 9.1.1+ (contains SparkDatasetV2 and other dataset implementations).
  • A Databricks workspace with access to a cluster or serverless compute and permission to create Unity Catalog Volumes (or access to existing ones).
  • A Databricks personal access token (required for Databricks Connect and production deployments).

Note

Databricks Free tier does not support DBFS. Use Unity Catalog tables instead.

To follow any of the approaches below, you first need a Spark-enabled Kedro project. Create one using:

uvx kedro new --name=spaceflights-databricks --tools=pyspark --example=y

This starter is designed specifically for Databricks: it replaces pandas-based datasets with SparkDatasetV2 in the DataCatalog and implements data transformations using Spark.

Once the project is created, choose one of the workflows below depending on where you want your code to live and how you prefer to develop.

Run Kedro within Databricks (Git folders)

This option is suitable if you primarily work within the Databricks workspace, using notebooks and Databricks Jobs. Databricks provides Git folders, which allow you to clone a Git repository directly into the workspace and work with it interactively.

Typical workflow

  1. Push your Kedro project to a Git repository (GitHub, GitLab, Azure DevOps, Bitbucket, and more).

  2. Clone the repository into Databricks using Git folders.

  3. Open the cloned repository in Databricks and update your Kedro Data Catalog (conf/base/catalog.yml):

     • For all spark.SparkDatasetV2 datasets, update file paths to point to Databricks Volumes, for example:

         preprocessed_companies:
        -  filepath: data/02_intermediate/preprocessed_companies.csv
        +  filepath: /Volumes/<catalog_name>/<schema_name>/<volume_name>/data/02_intermediate/preprocessed_companies.csv
           type: spark.SparkDatasetV2

     • Make sure the volume exists in Unity Catalog before running the pipeline. You can find instructions on how to create a volume in the Databricks docs.
     • Non-Spark datasets (for example, pandas-based datasets) can read from and write to the cloned Git folder without changing their file paths.

  4. Open the notebooks/ folder in the cloned repository and create a new notebook.

  5. Attach the notebook to a Databricks cluster (for example, a serverless cluster).

  6. Run Kedro from the notebook. First, install the project dependencies:

%pip install -r ../requirements.txt

Then load Kedro's IPython extension and initialise the project:

%load_ext kedro.ipython
%reload_kedro

This makes the catalog, context, pipelines, and session objects available in the notebook. For details, see the Kedro documentation on notebook line magics. You can now run the pipeline:

session.run()

If you launched the notebook from outside the Kedro project directory, pass the project root explicitly:

%reload_kedro /Workspace/Users/<databricks_user_name>/<cloned_repo_name>
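The catalog update in the workflow above is usually done by hand in conf/base/catalog.yml. As a minimal sketch of the same rule, the following plain-Python helper rewrites project-relative filepaths of spark.SparkDatasetV2 entries so that they sit under a Volume root. The function names and the catalog, schema, and volume names are hypothetical, and the catalog is represented as a plain dictionary rather than loaded through Kedro.

```python
from copy import deepcopy
from posixpath import join as posix_join


def volume_root(catalog_name: str, schema_name: str, volume_name: str) -> str:
    """Build the root path of a Unity Catalog Volume."""
    return f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"


def rewrite_spark_filepaths(catalog: dict, root: str) -> dict:
    """Return a copy of `catalog` with relative Spark filepaths moved under `root`."""
    updated = deepcopy(catalog)
    for entry in updated.values():
        filepath = entry.get("filepath", "")
        is_spark = entry.get("type") == "spark.SparkDatasetV2"
        # Skip absolute paths and URLs such as s3://...; they are already remote.
        is_relative = bool(filepath) and not filepath.startswith("/") and "://" not in filepath
        if is_spark and is_relative:
            entry["filepath"] = posix_join(root, filepath)
    return updated


# Hypothetical names; replace them with your own catalog, schema, and volume.
root = volume_root("my_catalog", "my_schema", "my_volume")
catalog = {
    "preprocessed_companies": {
        "type": "spark.SparkDatasetV2",
        "filepath": "data/02_intermediate/preprocessed_companies.csv",
    },
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
    },
}
updated = rewrite_spark_filepaths(catalog, root)
print(updated["preprocessed_companies"]["filepath"])
```

Only the Spark entry changes; the pandas entry keeps its relative path, which matches the note that non-Spark datasets can keep reading from the cloned Git folder.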

Scheduling

In this setup, the Kedro pipeline runs from a Databricks notebook, so scheduling is handled by Databricks Jobs rather than by Kedro.

To schedule execution:

  • Create a Databricks Job that runs the notebook containing the session.run() call, and attach a schedule or trigger to it.
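A scheduled notebook does not have to rely on the line magics. One possible alternative, sketched here under the assumption that the project dependencies are already installed in the notebook or on the cluster, is to create the session programmatically with bootstrap_project and KedroSession. The wrapper function run_kedro_pipeline is a hypothetical name used for illustration.

```python
from pathlib import Path


def run_kedro_pipeline(project_root: str) -> None:
    """Bootstrap the Kedro project at `project_root` and run its default pipeline."""
    # Imported inside the function so the cell can be defined
    # before the project dependencies are installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    root = Path(project_root)
    bootstrap_project(root)  # reads the project settings from pyproject.toml
    with KedroSession.create(project_path=root) as session:
        session.run()


# In the Job notebook, call it with the path of the cloned Git folder, for example:
# run_kedro_pipeline("/Workspace/Users/<databricks_user_name>/<cloned_repo_name>")
```

Passing the project root explicitly avoids depending on the working directory the Job happens to start in.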

Local development, remote Databricks cluster (Databricks Connect)

This option is recommended for local-first development, where you run code locally but execute Spark workloads remotely on a Databricks cluster through Databricks Connect. For more advanced use cases, you can wrap the project in a Docker container and run it on any Docker-compatible runtime.

How it works

  • Kedro runs locally
  • Spark execution happens on a remote Databricks cluster
  • No project code needs to be copied into Databricks

Setup steps

Install databricks-connect:

pip install databricks-connect

Note

Ensure that the installed databricks-connect version matches your Databricks Runtime version. See the Databricks Connect documentation for the version requirements.
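A quick local check can catch a version mismatch before the first run. The sketch below assumes that "matches" means the same major and minor version, which is a simplification; the Databricks Connect documentation remains the authority on the exact compatibility rule. The helper names are hypothetical.

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '16.4.2' or '16.4.x-scala2.12'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_match(connect_version: str, runtime_version: str) -> bool:
    """Assumption: versions are compatible when major and minor are equal."""
    return major_minor(connect_version) == major_minor(runtime_version)


def installed_connect_version():
    """Return the installed databricks-connect version, or None if it is absent."""
    try:
        return metadata.version("databricks-connect")
    except metadata.PackageNotFoundError:
        return None


print(installed_connect_version())
```

Compare the printed value with the runtime version shown in your cluster configuration, for example with versions_match(installed, "<runtime_version>").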

Databricks Connect requires two environment variables:

export DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"
export DATABRICKS_TOKEN="<your-personal-access-token>"
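Because a missing variable typically surfaces only as a connection error once Spark is first used, it can help to check the environment up front. The following is a minimal sketch; the function name missing_databricks_vars is hypothetical, and the check covers only the two variables named above.

```python
import os

REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_databricks_vars(env=None) -> list:
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_databricks_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Databricks Connect environment variables are set.")
```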

Configure the Kedro Data Catalog: Spark workloads execute remotely on Databricks and do not have access to your local filesystem. As a result, all SparkDatasetV2 entries in the Data Catalog must use paths pointing to Databricks Volumes or other remote storage.

Non-Spark datasets (for example, pandas-based datasets) can be stored anywhere - locally or in cloud storage - as long as you specify their path in the Data Catalog.

If a Spark transformation running on Databricks produces a SparkDatasetV2 on a Databricks Volume, you can continue processing it locally. To do this, provide the required credentials in the Data Catalog so your local environment can access the Volume.
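The rule that every SparkDatasetV2 entry must point to remote storage can be checked before a run. The sketch below works on a plain dictionary with the same shape as catalog.yml and reports Spark entries whose filepath does not look remote. The list of prefixes is an assumption made for illustration and should be adjusted to the storage you actually use.

```python
# Assumed prefixes for remote storage; extend or trim for your setup.
REMOTE_PREFIXES = ("/Volumes/", "s3://", "s3a://", "abfss://", "gs://", "dbfs:/")


def local_spark_entries(catalog: dict) -> list:
    """Return names of SparkDatasetV2 entries whose filepath does not look remote."""
    offenders = []
    for name, entry in catalog.items():
        if entry.get("type") != "spark.SparkDatasetV2":
            continue  # non-Spark datasets may live anywhere
        filepath = str(entry.get("filepath", ""))
        if not filepath.startswith(REMOTE_PREFIXES):
            offenders.append(name)
    return offenders


example_catalog = {
    "model_input_table": {
        "type": "spark.SparkDatasetV2",
        "filepath": "/Volumes/<catalog_name>/<schema_name>/<volume_name>/data/03_primary/model_input_table",
    },
    "preprocessed_shuttles": {
        "type": "spark.SparkDatasetV2",
        "filepath": "data/02_intermediate/preprocessed_shuttles.csv",
    },
    "companies": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/companies.csv"},
}
print(local_spark_entries(example_catalog))
```

Here only preprocessed_shuttles is reported, since it is a Spark dataset that still uses a project-relative path.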

Once configured, run Kedro locally as usual:

kedro run

All operations on SparkDatasetV2 within nodes run remotely on the Databricks cluster, while all other code runs locally in your Python environment. Spark execution logs are streamed to your local terminal.

You can view your intermediate SparkDatasetV2 datasets in the Databricks Catalog UI.

Production-grade deployments through kedro-databricks

For production deployments, we recommend using the community-maintained kedro-databricks plugin.

This option is suitable when you need:

  • Repeatable deployments
  • CI/CD integration
  • Environment-specific configuration
  • Job-based execution without notebooks

What the plugin does

  • Packages a Kedro project
  • Converts it into a Databricks Asset Bundle
  • Deploys it as a Databricks Job

Note

This is a community-maintained plugin. Databricks permissions, workspace layouts, and runtime versions vary between organisations, so some configuration steps may require updates.

For full setup instructions, see the plugin documentation.

Visualise Kedro pipelines in Databricks notebooks

Kedro-Viz is a visualisation tool for exploring Kedro pipelines and (optionally) run metadata. In Databricks, you can use it in two ways:

  • Visualise a pipeline directly in the notebook with NotebookVisualizer (lightweight and suitable for inspecting a single pipeline)
  • Launch the full Kedro-Viz web app (opens in a new browser tab, best for full project exploration)

Prerequisites

Make sure Kedro and Kedro-Viz are installed in the same scope (notebook-scoped or cluster-scoped). For notebook-scoped installs:

%pip install -r ../requirements.txt
%pip install kedro-viz

If Kedro is already installed as a cluster library, add Kedro-Viz as a cluster library too.

Load the Kedro IPython extension:

%load_ext kedro.ipython
%reload_kedro

If you launched the notebook from outside the Kedro project directory, pass the project root explicitly:

%reload_kedro /Workspace/Users/<databricks_user_name>/<cloned_repo_name>

Visualise a pipeline directly in the notebook with NotebookVisualizer

If you want to inspect a single pipeline without opening the full web application, use NotebookVisualizer:

from kedro_viz.integrations.notebook import NotebookVisualizer

NotebookVisualizer(pipelines["data_science"]).show()

Launch the full Kedro-Viz web app

Kedro-Viz can be launched in a new browser tab with the %run_viz line magic:

%run_viz

Note

You may encounter issues running this command on the Databricks Free tier. If that happens, we recommend using the NotebookVisualizer instead.

This command presents a link to the Kedro-Viz web application.

[Figure: Kedro-Viz link rendered in a Databricks notebook]

Clicking the link opens a new browser tab running Kedro-Viz for your project.

[Figure: Kedro-Viz UI displayed from a Databricks notebook link]