Deployment to a Databricks cluster

This tutorial uses the PySpark Iris Kedro Starter to illustrate how to bootstrap a Kedro project using Spark and deploy it to a Databricks cluster on AWS.

Note

If you are using Databricks Repos to run a Kedro project then you should disable file-based logging. This prevents Kedro from attempting to write to the read-only file system.

Prerequisites

  • An account on a Databricks workspace deployed on AWS, with permission to create a cluster

  • A GitHub account

  • conda installed on your local machine, to create the Python virtual environment used below

Running a Kedro project from a Databricks notebook

As noted in this post describing CI/CD automation on Databricks, “Users may find themselves struggling to keep up with the numerous notebooks containing the ETL, data science experimentation, dashboards etc.”

Therefore, we do not recommend that you rely on the notebooks for running and/or deploying your Kedro pipelines unless it is unavoidable. The workflow described in this section may be useful for experimentation and initial data analysis stages, but it is not designed for productionisation.

1. Project setup

First, let’s create a new virtual environment and, within it, a new Kedro project:

# create fresh virtual env
# NOTE: minor Python version of the environment
# must match the version on the Databricks cluster
conda create --name iris_databricks python=3.7 -y
conda activate iris_databricks

# install Kedro and create a new project
pip install "kedro~=0.18.4"
# name your project Iris Databricks when prompted for it
kedro new --starter pyspark-iris

2. Install dependencies and run locally

Now that the project has been created, move into the project root directory, install the project dependencies, and start a local test run. This run uses Spark's local execution mode, which means all Spark jobs are executed in a single JVM on your machine rather than on a cluster. The pyspark-iris Kedro starter used to generate the project already contains all the necessary configuration for this to work; you only need the pyspark Python package installed, which the pip install -r src/requirements.txt command below does for you.

# change the directory to the project root
cd iris-databricks/
# compile and install the project dependencies, this may take a few minutes
pip install -r src/requirements.txt
# start a local run
kedro run

You should get a similar output:

...
[08/09/22 11:23:30] INFO     Model has accuracy of 0.933 on test data.                                        nodes.py:74
                    INFO     Saving data to 'metrics' (MetricsDataSet)...                             data_catalog.py:382
                    INFO     Completed 3 out of 3 tasks                                           sequential_runner.py:85
                    INFO     Pipeline execution completed successfully.                                      runner.py:89
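For reference, the Spark configuration mentioned above is wired into the project through a hook that starts the SparkSession when the Kedro context is created. The sketch below shows the general shape of such a hook, assuming Spark settings live in conf/base/spark.yml; the names and details are illustrative and may differ from the code the starter actually generates.

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        # read Spark settings from the project configuration (e.g. conf/base/spark.yml)
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # run Spark in local mode unless the configuration says otherwise
        (
            SparkSession.builder.master("local[*]")
            .appName(context.project_path.name)
            .config(conf=spark_conf)
            .getOrCreate()
        )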

3. Create a Databricks cluster

If you already have an active cluster with runtime version 7.1, you can skip this step. The Databricks documentation explains how to find the clusters in your workspace.

Follow the Databricks official guide to create a new cluster. For the purpose of this tutorial (and to minimise costs) we recommend the following settings:

  • Runtime: 7.1 (Scala 2.12, Spark 3.0.0)

  • Enable autoscaling: off

  • Terminate after 120 minutes of inactivity: on

  • Worker type: m4.large

  • Driver Type: Same as worker

  • Workers: 2

  • Advanced options -> Instances -> # Volumes: 1

While your cluster is being provisioned, you can continue to the next step.

As a result you should have:

  • A Kedro project, which runs with the local version of PySpark library

  • A running Databricks cluster

4. Create GitHub personal access token

To synchronise the project between the local development environment and Databricks, we will use a private GitHub repository, which you will create in the next step. For authentication, we will need a GitHub personal access token, so go ahead and create this token in your GitHub developer settings.

Note

Make sure that the repo scope is enabled for your token.

5. Create a GitHub repository

Now you should create a new repository in GitHub using the official guide. You can keep the repository private and you don’t need to commit to it just yet.
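Alternatively, if you have the GitHub CLI (gh) installed and authenticated, you can create the private repository from your terminal; the repository name below is a placeholder:

# requires the GitHub CLI: https://cli.github.com/
gh repo create <repo-name> --private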

To connect to the newly created repository, you can use one of two options: HTTPS, authenticating with your GitHub username and the personal access token, or SSH. Both are shown in the next step.

6. Push Kedro project to the GitHub repository

We will use the Git command line interface to push the newly created Kedro project to GitHub. First, you need to initialise Git in your project root directory:

# change the directory to the project root
cd iris-databricks/
# initialise git
git init

Then, create the first commit:

# add all files to git staging area
git add .
# create the first commit
git commit -m "first commit"
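
Depending on your Git version and configuration, git init may have created a master branch rather than main. If so, rename it now so that the push command below works as written:

# rename the current branch to main to match the push below
git branch -M main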

Finally, push the commit to GitHub:

# configure a new remote
# for HTTPS run:
git remote add origin https://github.com/<username>/<repo-name>.git
# or for SSH run:
git remote add origin git@github.com:<username>/<repo-name>.git

# verify the new remote URL
git remote -v

# push the first commit
git push --set-upstream origin main

7. Configure the Databricks cluster

The project has now been pushed to your private GitHub repository, and in order to pull it from Databricks, we need to configure the personal access token you generated in Step 4.

Log into your Databricks workspace and then:

  1. Open Clusters tab

  2. Click on your cluster name

  3. Press Edit

  4. Go to the Advanced Options and then Spark

Then, in the Environment Variables section, add your GITHUB_USER and GITHUB_TOKEN as shown below.
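
The Environment Variables field accepts one KEY=value pair per line; both values below are placeholders for your own GitHub username and the token you created in Step 4:

GITHUB_USER=<your-github-username>
GITHUB_TOKEN=<your-personal-access-token>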

Note

For security purposes, we strongly recommend against hard-coding any secrets into your notebooks.

Then press the Confirm button. Your cluster will restart to apply the changes; this will take a few minutes.

8. Run your Kedro project from the Databricks notebook

Congratulations, you are now ready to run your Kedro project from Databricks!

Create your Databricks notebook and remember to attach it to the cluster you have just configured.

In your newly-created notebook, put each of the below code snippets into a separate cell, then run all cells:

  • Clone your project from GitHub

%sh rm -rf ~/projects/iris-databricks && git clone --single-branch --branch main https://${GITHUB_USER}:${GITHUB_TOKEN}@github.com/${GITHUB_USER}/<your-repo-name>.git ~/projects/iris-databricks
  • Install Kedro (with the spark.SparkDataSet extra) at the latest version compatible with 0.18.4

%pip install "kedro[spark.SparkDataSet]~=0.18.4"
  • Copy input data into DBFS

import logging
from pathlib import Path

# suppress excessive logging from py4j
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)

# copy project data into DBFS
project_root = Path.home() / "projects" / "iris-databricks"
data_dir = project_root / "data"
dbutils.fs.cp(
    f"file://{data_dir.as_posix()}", f"dbfs://{data_dir.as_posix()}", recurse=True
)

# make sure the data has been copied
dbutils.fs.ls((data_dir / "01_raw").as_posix())

You should get a similar output:

Out[11]: [FileInfo(path='dbfs:/root/projects/iris-databricks/data/01_raw/.gitkeep', name='.gitkeep', size=0),
 FileInfo(path='dbfs:/root/projects/iris-databricks/data/01_raw/iris.csv', name='iris.csv', size=3858)]
  • Run Kedro project

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

bootstrap_project(project_root)

with KedroSession.create(project_path=project_root) as session:
    session.run()

You should get a similar output:

...
[08/09/22 11:23:30] INFO     Model has accuracy of 0.933 on test data.                                        nodes.py:74
                    INFO     Saving data to 'metrics' (MetricsDataSet)...                             data_catalog.py:382
                    INFO     Completed 3 out of 3 tasks                                           sequential_runner.py:85
                    INFO     Pipeline execution completed successfully.                                      runner.py:89

Your complete notebook should now contain these four cells.

9. Using the Kedro IPython Extension

You can interact with Kedro in Databricks through the Kedro IPython extension, kedro.ipython.

The Kedro IPython extension launches a Kedro session and makes available the useful Kedro variables catalog, context, pipelines and session. It also provides the %reload_kedro line magic that reloads these variables (for example, if you need to update catalog following changes to your Data Catalog).

The IPython extension can be used in a Databricks notebook in a similar way to how it is used in Jupyter notebooks.

If you encounter a ContextualVersionConflictError, it is likely caused by Databricks using an old version of pip, and there is one additional step to take in the Databricks notebook before the IPython extension will work. After you load the IPython extension with the command below:

In [1]: %load_ext kedro.ipython

You must explicitly upgrade your pip version:

%pip install -U pip

After this, you can reload Kedro by running the line magic command %reload_kedro <project_root>.
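
As a quick check that the extension has loaded your project, you can inspect the injected variables directly in a notebook cell. The dataset name in the last line is assumed from the pyspark-iris starter catalog and may differ in your project:

catalog.list()        # datasets registered in the Data Catalog
list(pipelines)       # names of the registered pipelines, e.g. "__default__"

# load a dataset by name (the name below is assumed from the pyspark-iris starter)
iris = catalog.load("example_iris_data")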

10. Running Kedro-Viz on Databricks

For Kedro-Viz to run with your Kedro project, you need to ensure that both packages are installed in the same scope (notebook-scoped vs. cluster library). For example, if you %pip install kedro from inside your notebook, then you should also %pip install kedro-viz from inside your notebook, as shown below. If your cluster already has Kedro installed as a library, then you should also add Kedro-Viz as a cluster library.
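
Since this tutorial installed Kedro with %pip from inside the notebook, install Kedro-Viz the same way before launching it:

%pip install kedro-viz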

Kedro-Viz can then be launched in a new browser tab with the %run_viz line magic:

In [2]: %run_viz