Use an IDE and Databricks Asset Bundles to deploy a Kedro project

Note

The dbx package has been deprecated by Databricks, and the dbx workflow documentation has been moved to a new page.

This guide demonstrates a workflow for developing a Kedro project on Databricks using Databricks Asset Bundles. You will learn how to develop your project in a local environment, then use kedro-databricks and Databricks Asset Bundles to package your code and run your pipelines on Databricks. To learn more about Databricks Asset Bundles and how to customise them, read What are Databricks Asset Bundles.

Benefits of local development

By working in your local environment, you can take advantage of features within an IDE that are not available on Databricks notebooks:

  • Auto-completion and suggestions for code, improving your development speed and accuracy.

  • Linters like Ruff can be integrated to catch potential issues in your code.

  • Static type checkers like Mypy can check types in your code, helping to identify potential type-related issues early in the development process.

To set up these features, look for instructions specific to your IDE (for instance, VS Code).

Note

If you prefer to develop projects in notebooks rather than in an IDE, you should follow our guide on how to develop a Kedro project within a Databricks workspace instead.

What this page covers

The main steps in this tutorial are as follows:

  • Set up your project: note your Databricks username and host, install Kedro and the Databricks CLI, authenticate the CLI, and create a new Kedro project.

  • Create the Databricks Asset Bundles using kedro-databricks.

  • Deploy the Databricks job using Databricks Asset Bundles.

  • Run the deployed job.

Prerequisites

To follow this guide you need:

  • An active Databricks deployment (workspace) in which you can create a personal access token.

  • conda installed on your local machine, to create the virtual environment used in this guide.

Set up your project

Note your Databricks username and host

Note your Databricks username and host, as you will need them for the remainder of this guide.

Find your Databricks username in the top right of the workspace UI and the host in the browser’s URL bar, up to the first slash (e.g., https://adb-123456789123456.1.azuredatabricks.net/):

Find Databricks host and username

Note

Your Databricks host must include the protocol (https://).

Install Kedro and the Databricks CLI in a new virtual environment

In your local development environment, create a virtual environment for this tutorial using Conda:

conda create --name databricks-iris python=3.10

Once it is created, activate it:

conda activate databricks-iris
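
With the environment active, install Kedro into it. The Databricks CLI must also be available on your PATH. The command below is a minimal suggestion, not a definitive requirement: it installs Kedro with pip together with the legacy Python-based Databricks CLI; the newer Databricks CLI is distributed as a standalone binary, so follow Databricks' own installation instructions if you prefer that version.

pip install kedro databricks-cli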

Authenticate the Databricks CLI

Now, you must authenticate the Databricks CLI with your Databricks instance.

Refer to the Databricks documentation for a complete guide on how to authenticate your CLI. The key steps are:

  1. Create a personal access token for your user on your Databricks instance.

  2. Run databricks configure --token.

  3. Enter your token and Databricks host when prompted.

  4. Run databricks fs ls dbfs:/ at the command line to verify your authentication.
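
If you prefer non-interactive authentication, for example in CI, the Databricks CLI can also read credentials from environment variables instead of a configuration profile. A sketch, assuming a personal access token (the values shown are placeholders):

export DATABRICKS_HOST="https://adb-123456789123456.1.azuredatabricks.net"
export DATABRICKS_TOKEN="<your-personal-access-token>"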

Create a new Kedro Project

Create a Kedro project with the databricks-iris starter using the following command in your local environment:

kedro new --starter=databricks-iris

Name your new project iris-databricks for consistency with the rest of this guide. This command creates a new Kedro project using the databricks-iris starter template.

Note

If you are not using the databricks-iris starter to create a Kedro project, and you are working with a version of Kedro earlier than 0.19.0, then you should disable file-based logging to prevent Kedro from attempting to write to the read-only file system.
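
For reference, disabling file-based logging amounts to removing the file handlers from your project's logging configuration so that only the console handler remains. Below is a minimal sketch, assuming a conf/base/logging.yml laid out like the default template of those older Kedro versions; your generated file may differ:

version: 1
disable_existing_loggers: False
formatters:
    simple:
        format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
handlers:
    console:
        class: logging.StreamHandler
        level: INFO
        formatter: simple
        stream: ext://sys.stdout
root:
    handlers: [console]  # file handlers removed, so nothing is written to the read-only file system
    level: INFO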

Create the Databricks Asset Bundles using kedro-databricks

kedro-databricks is a wrapper around the databricks CLI and is the simplest way to get started, because it generates the bundle configuration for you instead of requiring you to write it by hand.

  1. Install kedro-databricks:

pip install kedro-databricks

  2. Initialise the Databricks configuration:

kedro databricks init

This generates a databricks.yml file in the conf folder, which sets the default cluster type. You can override these configurations if needed; an illustrative sketch of the kind of overrides this file can hold is shown after these steps.

  3. Create Databricks Asset Bundles:

kedro databricks bundle

This command reads the configuration from conf/databricks.yml (if it exists) and generates the Databricks job configuration inside a resource folder.
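
For orientation, the overrides in conf/databricks.yml follow the shape of the Databricks Jobs API (job clusters, tasks and so on). The snippet below is purely illustrative; the file generated by kedro databricks init is the authoritative reference for the exact layout and key names:

job_clusters:
    - job_cluster_key: default
      new_cluster:
          spark_version: 14.3.x-scala2.12   # illustrative runtime version
          node_type_id: Standard_DS3_v2     # pick a node type available in your workspace
          num_workers: 1
tasks:
    - task_key: default
      job_cluster_key: default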

Running a Databricks Job Using an Existing Cluster

By default, Databricks creates a new job cluster for each job. However, there are instances where you might prefer to use an existing cluster, such as:

  1. Lack of permissions to create a new cluster.

  2. The need for a quick start with an all-purpose cluster.

While it is generally not recommended to use all-purpose compute for running jobs, it can be convenient to configure a Databricks job this way for testing purposes.

To begin, you need to determine the cluster_id. Navigate to the Compute tab and select the View JSON option.

Find cluster ID through UI

You will see the cluster configuration in JSON format; copy the cluster_id.

cluster_id in the JSON view
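
If you prefer the command line, you can also list clusters together with their IDs with the Databricks CLI (the output format varies by CLI version):

databricks clusters list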

Next, update conf/databricks.yml so that the task uses the existing cluster:

    tasks:
        - task_key: default
-          job_cluster_key: default
+          existing_cluster_id: 0502-***********

Then regenerate the bundle definition with the --overwrite option:

kedro databricks bundle --overwrite

Deploy Databricks Job using Databricks Asset Bundles

Once you have all the resources generated, deploy the Databricks Asset Bundles to Databricks:

kedro databricks deploy

You should see output similar to:

Uploading databrick_iris-0.1-py3-none-any.whl...
Uploading bundle files to /Workspace/Users/xxxxxxx.com/.bundle/databrick_iris/local/files...
Deploying resources...
Updating deployment state...
Deployment complete!
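
If the deployment fails, or you simply want to check the generated configuration, the Databricks CLI can validate the bundle. Run this from the project root:

databricks bundle validate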

How to run the deployed job?

There are two options to run Databricks Jobs:

Run Databricks Job with databricks CLI

databricks bundle run

This will show all the jobs that you have created. Select the one you want to run.

? Resource to run:
  Job: [dev] databricks-iris (databricks-iris)

You should see output similar to this:

databricks bundle run
Run URL: https://<host>/?*********#job/**************/run/**********

Copy that URL into your browser or go to the Job Runs UI to see the run status.
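
If you already know the resource key of the job (the key used in the generated resources files), you can skip the interactive prompt by passing it to the run command; the key below is a placeholder:

databricks bundle run <job_resource_key>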

Run Databricks Job with Databricks UI

Alternatively, you can go to the Workflows tab and select the desired job to run it directly:

Run deployed Databricks Job with Databricks UI