Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro because workflows in Airflow are modelled and organised as DAGs.
How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster
The kedro-airflow-k8s plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with kedro-docker to prepare a Docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only.
How to run a Kedro pipeline on Apache Airflow with Astronomer
Astronomer is a managed Airflow platform that lets users spin up and run an Airflow cluster in production. It also provides a set of tools to help users get started with Airflow locally.
The tutorial discusses how to run the example Iris classification pipeline on a local Airflow cluster with Astronomer. You may also consider using our astro-airflow-iris starter, which provides a template containing the boilerplate code that the tutorial describes:
kedro new --starter=astro-airflow-iris
The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an Airflow task, while the whole pipeline is converted into a DAG for orchestration purposes. This approach mirrors the principles of running Kedro in a distributed environment.
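To make the node-to-task mapping more concrete, the sketch below derives task dependencies from node inputs and outputs, which is roughly the information a DAG needs. It uses only the Python standard library (graphlib requires Python 3.9 or later). The node names and the NODES table are illustrative assumptions modelled on the dataset names used in this tutorial; they are not produced by Kedro or Airflow.

```python
from graphlib import TopologicalSorter

# Hypothetical description of the Iris pipeline: node name -> (inputs, outputs).
# Dataset names follow the tutorial's catalog; the node names are assumed.
NODES = {
    "split": (
        ["example_iris_data"],
        ["example_train_x", "example_train_y", "example_test_x", "example_test_y"],
    ),
    "train": (["example_train_x", "example_train_y"], ["example_model"]),
    "predict": (["example_model", "example_test_x"], ["example_predictions"]),
    "report": (["example_predictions", "example_test_y"], []),
}


def task_dependencies(nodes):
    """Map each node (task) to the set of tasks that produce its inputs."""
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    return {
        name: {producer[i] for i in ins if i in producer}
        for name, (ins, _) in nodes.items()
    }


def execution_order(nodes):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(task_dependencies(nodes)).static_order())


if __name__ == "__main__":
    for task, upstream in task_dependencies(NODES).items():
        print(f"{task} <- {sorted(upstream)}")
    print(execution_order(NODES))
```

Each key in the dependency mapping corresponds to one task, and each upstream set corresponds to the edges an orchestrator would have to respect.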
To follow this tutorial, ensure you have the following:
Tutorial project setup
Initialise an Airflow project with Astro. Let’s call it kedro-airflow-iris:
mkdir kedro-airflow-iris
cd kedro-airflow-iris
astro dev init
Create a new Kedro project using the pandas-iris starter. You can use the default values in the project creation process:
kedro new --starter=pandas-iris
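Copy all files and directories under new-kedro-project, the default project name created in the previous step, to the root directory so that Kedro and the Astro CLI share the same project root. The -r flag and the trailing /. make sure that subdirectories and hidden files such as .flake8 are copied too:
cp -r new-kedro-project/. .
rm -r new-kedro-project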
After this step, your project should have the following structure:
.
├── Dockerfile
├── README.md
├── airflow_settings.yaml
├── conf
├── dags
├── data
├── docs
├── include
├── notebooks
├── packages.txt
├── plugins
├── pyproject.toml
├── requirements.txt
├── .flake8
└── src
Install kedro-airflow~=0.4. We will use this plugin to convert the Kedro pipeline into an Airflow DAG.
pip install kedro-airflow~=0.4
Run pip install -r src/requirements.txt to install all dependencies.
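Once the setup steps are done, a quick check of the project root can catch a partial copy early. The following is a minimal sketch under the assumption that the expected entries are exactly those in the project structure listed earlier in this section; the helper name missing_entries is hypothetical.

```python
from pathlib import Path

# Top-level entries taken from the project structure shown in this section.
EXPECTED = [
    "Dockerfile", "README.md", "airflow_settings.yaml", "conf", "dags", "data",
    "docs", "include", "notebooks", "packages.txt", "plugins", "pyproject.toml",
    "requirements.txt", ".flake8", "src",
]


def missing_entries(root="."):
    """Return the expected top-level entries that are absent from `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    print("Layout looks complete" if not missing else f"Missing: {missing}")
```

Run it from the project root; an empty result means every listed file and directory is present.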
Step 1. Create new configuration environment to prepare a compatible DataCatalog
Create a conf/airflow directory in your Kedro project
Create a catalog.yml file in this directory with the following content:
example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv
example_train_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_x.pkl
example_train_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_train_y.pkl
example_test_x:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_x.pkl
example_test_y:
  type: pickle.PickleDataSet
  filepath: data/05_model_input/example_test_y.pkl
example_model:
  type: pickle.PickleDataSet
  filepath: data/06_models/example_model.pkl
example_predictions:
  type: pickle.PickleDataSet
  filepath: data/07_model_output/example_predictions.pkl
This ensures that all datasets are persisted so that every Airflow task can read them without sharing memory. The example here assumes that all Airflow tasks share one disk; in a distributed environment you would need to use non-local filepaths.
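The reason for persisting every dataset can be illustrated without Kedro or Airflow. In the sketch below, two functions stand in for separate tasks that share no memory and communicate only through a pickle file on a shared disk. The function names and the payload are placeholders chosen for illustration, and a temporary directory plays the role of the shared disk.

```python
import pickle
import tempfile
from pathlib import Path


def producer_task(path):
    """Stand-in for an upstream task: compute a result and persist it to disk."""
    model = {"weights": [0.1, 0.2, 0.3]}  # placeholder payload, not a real model
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def consumer_task(path):
    """Stand-in for a downstream task: load what the upstream task saved."""
    with path.open("rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared_disk:
        model_path = Path(shared_disk) / "data" / "06_models" / "example_model.pkl"
        producer_task(model_path)
        print(consumer_task(model_path))
```

If the producer kept its result in memory only, the consumer would have nothing to load, which is the situation the catalog entries are meant to avoid.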
Step 2. Package the Kedro pipeline as an Astronomer-compliant Docker image
Step 2.1: Package the Kedro pipeline as a Python package so you can install it into the container later on:
kedro package
This step should produce a wheel file called new_kedro_project-0.1-py3-none-any.whl located at dist/.
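The wheel filename encodes what was built, following the standard wheel naming scheme of distribution, version, Python tag, ABI tag, and platform tag. The sketch below splits the common five-part form into those fields; the helper name parse_wheel_filename is hypothetical, and filenames with an optional build tag are deliberately rejected to keep the example short.

```python
def parse_wheel_filename(filename):
    """Split a five-part wheel filename into its named components.

    Raises ValueError for anything that is not `a-b-c-d-e.whl`.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected number of fields in: {filename}")
    keys = ("distribution", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


if __name__ == "__main__":
    print(parse_wheel_filename("new_kedro_project-0.1-py3-none-any.whl"))
```

For the wheel in this tutorial, the py3, none, and any tags indicate a pure-Python package that is not tied to a specific interpreter ABI or platform.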
Step 2.2: Add the src/ directory to .dockerignore, as it’s not necessary to bundle the entire code base with the container once we have the packaged wheel file.
echo "src/" >> .dockerignore
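Note that the echo command appends a new line every time it runs, so repeating the step leaves duplicate entries. If that matters, the following sketch shows one way to add the entry only when it is missing. The helper name ensure_ignored is hypothetical and the code is an illustration, not part of Kedro or the Astro CLI.

```python
from pathlib import Path


def ensure_ignored(entry, ignore_file=".dockerignore"):
    """Append `entry` to the ignore file unless an identical line already exists.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(ignore_file)
    lines = path.read_text().splitlines() if path.exists() else []
    if entry in (line.strip() for line in lines):
        return False
    lines.append(entry)
    path.write_text("\n".join(lines) + "\n")
    return True
```

Calling ensure_ignored("src/") from the project root has the same effect as the echo command on the first run and does nothing on later runs.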
Step 2.3: Modify the Dockerfile to have the following content:
FROM quay.io/astronomer/ap-airflow:2.0.0-buster-onbuild
RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
Step 3. Convert the Kedro pipeline into an Airflow DAG with kedro airflow
kedro airflow create --target-dir=dags/ --env=airflow
Step 4. Launch the local Airflow cluster with Astronomer
astro dev start
If you visit the Airflow UI, you should now see the Kedro pipeline as an Airflow DAG.