Apache Airflow¶
Apache Airflow is a popular open-source workflow management platform. It is a suitable engine to orchestrate and execute a pipeline authored with Kedro, because workflows in Airflow are modelled and organised as DAGs.
Introduction and strategy¶
The general strategy to deploy a Kedro pipeline on Apache Airflow is to run every Kedro node as an Airflow task while the whole pipeline is converted to an Airflow DAG. This approach mirrors the principles of running Kedro in a distributed environment.
Each node will be executed within a new Kedro session, which implies that MemoryDatasets cannot serve as storage for the intermediate results of nodes. Instead, all datasets must be registered in the DataCatalog and stored in persistent storage. This approach enables nodes to access the results from preceding nodes.
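The effect of running each node in its own session can be illustrated with a minimal sketch that does not use Kedro at all. The function names below are hypothetical, the local dictionary stands in for a MemoryDataset, and the JSON file stands in for a dataset registered in the DataCatalog:

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def producer_in_memory() -> dict:
    """Simulate a task that keeps its output only in process-local memory."""
    store = {"model_input": [1, 2, 3]}  # plays the role of a MemoryDataset
    del store  # the store disappears when the task (and its session) ends
    return {}  # nothing crosses the task boundary


def consumer_in_memory(shared: dict) -> Optional[list]:
    """Simulate the next task, started in a fresh session."""
    return shared.get("model_input")  # the upstream result is not available


def producer_persistent(path: Path) -> None:
    """Simulate a task that saves its output to persistent storage."""
    path.write_text(json.dumps([1, 2, 3]))


def consumer_persistent(path: Path) -> list:
    """Simulate the next task loading the stored output."""
    return json.loads(path.read_text())


# In-memory hand-over: the downstream task receives nothing.
lost = consumer_in_memory(producer_in_memory())

# Persistent hand-over: the downstream task reads what the upstream task wrote.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "model_input.json"
    producer_persistent(dataset_path)
    persisted = consumer_persistent(dataset_path)
```

The second hand-over works only because both tasks can reach the same storage location, which is the role persistent datasets play in this guide.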
This guide describes how to run a Kedro pipeline on different Airflow platforms. Jump to any of the sections below to learn how to run a Kedro pipeline on:
- Apache Airflow with Astronomer
- Amazon AWS Managed Workflows for Apache Airflow (MWAA)
- Apache Airflow using a Kubernetes cluster
How to run a Kedro pipeline on Apache Airflow with Astronomer¶
The following tutorial shows how to deploy an example Spaceflights Kedro project on Apache Airflow with the Astro CLI. This command-line tool, created by Astronomer, streamlines the creation of local Airflow projects. You will deploy it locally first, then transition to Astro Cloud.
Astronomer is a managed Airflow platform that lets teams spin up and run Airflow clusters in production. It also ships with tooling to help teams get started with Airflow on their local machines.
Prerequisites¶
To follow this tutorial, ensure you have the following:
- The Astro CLI installed
- A container service like Docker Desktop (v18.09 or higher)
- kedro>=0.19 installed
- kedro-airflow>=0.8 installed. We will use this plugin to convert the Kedro pipeline into an Airflow DAG.
Create, prepare and package example Kedro project¶
In this section, you will create a new Kedro project equipped with an example pipeline designed to solve a typical data science task: predicting spaceflights prices. You will customise the project for Airflow by registering datasets that were stored in memory and by simplifying logging with custom settings. After making these changes, package the project for installation in an Airflow Docker container and generate an Airflow DAG that mirrors the Kedro pipeline.
- To create a new Kedro project, select the example=yes option to include example code. To enable custom logging, add tools=log. Proceed with the default project name, and add any other tools as needed:
kedro new --example=yes --name=new-kedro-project --tools=log
- Navigate to your project's directory, create a new conf/airflow directory for Airflow-specific configurations, and copy the catalog.yml file from conf/base to conf/airflow. This setup allows you to customise the DataCatalog for use with Airflow:
cd new-kedro-project
mkdir conf/airflow
cp conf/base/catalog.yml conf/airflow/catalog.yml
- Open conf/airflow/catalog.yml to view the datasets used in the project. Additional intermediate datasets (X_train, X_test, y_train, y_test) live in memory. You can locate these in the pipeline description under /src/new_kedro_project/pipelines/data_science/pipeline.py. To ensure the datasets persist across tasks in Airflow, include them in the DataCatalog. Instead of repeating similar code for each dataset, use Dataset Factories, a syntax that lets you define a catch-all pattern to replace the default MemoryDataset creation. Add the following to the end of the file:
"{base_dataset}":
  type: pandas.CSVDataset
  filepath: data/02_intermediate/{base_dataset}.csv
In the example here we assume that all Airflow tasks share one disk, but for distributed environments you would need to use non-local file paths.
Starting with kedro-airflow release 0.9.0, you can adopt a different strategy instead of following steps 2-3: group nodes that use intermediate MemoryDatasets into larger tasks. This approach allows intermediate data manipulation to occur within a single task, eliminating the need to transfer data between nodes. Enable this behaviour by running kedro airflow create with the --group-by memory flag in Step 6.
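To see what the catch-all pattern from step 3 achieves, the following minimal sketch fills the {base_dataset} placeholder for one dataset name. It is an illustration only: resolve_factory_entry is a hypothetical helper, it handles a single-placeholder pattern, and Kedro's own resolution logic is more general.

```python
def resolve_factory_entry(dataset_name: str, pattern: str, template: dict) -> dict:
    """Fill the placeholder of a single-placeholder catch-all pattern."""
    placeholder = pattern.strip("{}")  # "{base_dataset}" -> "base_dataset"
    resolved = {}
    for key, value in template.items():
        if isinstance(value, str):
            # Substitute the dataset name wherever the placeholder appears.
            resolved[key] = value.format(**{placeholder: dataset_name})
        else:
            resolved[key] = value
    return resolved


# The catalog entry added in step 3, expressed as a Python dictionary.
template = {
    "type": "pandas.CSVDataset",
    "filepath": "data/02_intermediate/{base_dataset}.csv",
}

entry = resolve_factory_entry("X_train", "{base_dataset}", template)
```

With this pattern in place, a dataset such as X_train, which has no explicit catalog entry, is written to its own CSV file rather than kept in memory.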
- Open conf/logging.yml and change the root: handlers section to [console] at the end of the file. By default, Kedro uses the Rich library to enhance log output with sophisticated formatting. Some deployment systems, including Airflow, do not work well with Rich, so update the logging configuration to a simpler console version. For more information on logging in Kedro, see the Kedro docs.
⚠️ Note: This step is optional for Airflow from version 2.11.0 onward, as the compatibility issue has been confirmed fixed at least from this version.
root:
  handlers: [console]
- Package the Kedro pipeline as a Python package so you can install it into the Airflow container later on:
kedro package
This step should produce a wheel file called new_kedro_project-0.1-py3-none-any.whl located at dist/.
- Convert the Kedro pipeline into an Airflow DAG with kedro airflow:
kedro airflow create --target-dir=dags/ --env=airflow
This step should produce a .py file called new_kedro_project_airflow_dag.py located at dags/.
Optionally, you can use the --group-by flag to group nodes:
- --group-by memory: Groups nodes connected through MemoryDatasets
- --group-by namespace: Groups nodes by their namespace
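The way a generated DAG mirrors the Kedro pipeline can be sketched without Airflow or Kedro: a task depends on every task that produces one of its inputs. The node table below is a hypothetical, simplified stand-in for the Spaceflights pipeline, and task_dependencies is an illustrative helper rather than the plugin's code.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified node table: name -> (inputs, outputs)
nodes = {
    "preprocess_companies_node": ({"companies"}, {"preprocessed_companies"}),
    "preprocess_shuttles_node": ({"shuttles"}, {"preprocessed_shuttles"}),
    "create_model_input_table_node": (
        {"preprocessed_companies", "preprocessed_shuttles", "reviews"},
        {"model_input_table"},
    ),
    "split_data_node": (
        {"model_input_table"},
        {"X_train", "X_test", "y_train", "y_test"},
    ),
}


def task_dependencies(nodes: dict) -> dict:
    """Map each node to the set of nodes that produce its inputs."""
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    return {
        name: {producer[ds] for ds in inputs if ds in producer}
        for name, (inputs, _) in nodes.items()
    }


deps = task_dependencies(nodes)

# A valid execution order: every task appears after its upstream tasks.
order = list(TopologicalSorter(deps).static_order())
```

Raw inputs such as companies have no producing node, so they create no dependency; they must simply be available in storage when the task starts.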
Deployment process with Astro CLI¶
In this section, you will set up a new blank Airflow project using Astro and copy the files prepared in the previous section from the Kedro project. Next, customise the Dockerfile to enhance logging and to install the Kedro package. After that, run the Airflow cluster and explore the results.
- To complete this section, you have to install both the Astro CLI and Docker Desktop.
- Initialise an Airflow project with Astro in a new folder outside of your Kedro project. Let's call it kedro-airflow-spaceflights:
cd ..
mkdir kedro-airflow-spaceflights
cd kedro-airflow-spaceflights
astro dev init
- The folder kedro-airflow-spaceflights will be executed within the Airflow container. To run the Kedro project there, you need to copy several items from the previous section into it:
  - the /data folder from Step 1, containing sample input datasets for our pipeline. This folder will also store the output results.
  - the /conf folder from Steps 2-4, which includes our DataCatalog, parameters, and customised logging files. These files will be used by Kedro during its execution in the Airflow container.
  - the .whl file from Step 5, which you will need to install in the Airflow Docker container to execute our project node by node.
  - the Airflow DAG from Step 6 for deployment in the Airflow cluster.
cd ..
cp -r new-kedro-project/data kedro-airflow-spaceflights/data
cp -r new-kedro-project/conf kedro-airflow-spaceflights/conf
mkdir -p kedro-airflow-spaceflights/dist/
cp new-kedro-project/dist/new_kedro_project-0.1-py3-none-any.whl kedro-airflow-spaceflights/dist/
cp new-kedro-project/dags/new_kedro_project_airflow_dag.py kedro-airflow-spaceflights/dags/
You can copy the entire new-kedro-project directory into kedro-airflow-spaceflights if the project requires frequent updates, DAG recreation, and repackaging. Working with Kedro and Astro projects in a single folder saves you from copying files for each development iteration. Be aware that the projects will then share files such as requirements.txt, README.md, and .gitignore.
- Add the following lines to the Dockerfile located in the kedro-airflow-spaceflights folder. The first sets the environment variable KEDRO_LOGGING_CONFIG to point to conf/logging.yml, enabling custom logging in Kedro (from Kedro 0.19.6 onward, this step is unnecessary because Kedro uses the conf/logging.yml file by default). The second installs the .whl file of the prepared Kedro project into the Airflow container:
# This line is not needed from Kedro 0.19.6
ENV KEDRO_LOGGING_CONFIG="conf/logging.yml"
RUN pip install --user dist/new_kedro_project-0.1-py3-none-any.whl
- Navigate to the kedro-airflow-spaceflights folder and launch the local Airflow cluster with Astronomer:
cd kedro-airflow-spaceflights
astro dev start
- Visit the Airflow web server UI at its default address, http://localhost:8080, using the default login credentials: username and password both set to admin. There, you'll find a list of all DAGs. Navigate to the new-kedro-project DAG and press the Trigger DAG play button to start it. You can then observe the steps of your project as they run:

- The Kedro project was run inside an Airflow Docker container, and the results are stored there as well. To copy these results to your host, first identify the relevant Docker containers by listing them:
docker ps
Select the container acting as the scheduler and note its ID. Then, use the following command to copy the results, substituting d36ef786892a with the actual container ID:
docker cp d36ef786892a:/usr/local/airflow/data/ ./data/
- To stop the Astro Airflow environment, you can use the command:
astro dev stop
Deployment to Astro Cloud¶
You can deploy and run your project on Astro Cloud, the cloud infrastructure provided by Astronomer, by following these steps:
- Log in to your account on the Astronomer portal and create a new deployment if you don't already have one:

- Use the Astro CLI to log in to your Astro Cloud account:
astro auth
You will be redirected to enter your login credentials in your browser. Successful login indicates that your terminal is now linked with your Astro Cloud account:

- To deploy your local project to the cloud, navigate to the kedro-airflow-spaceflights folder and run:
astro deploy
- At the end of the deployment process, you receive a link that lets you manage your project in the cloud:

How to run a Kedro pipeline on Amazon AWS Managed Workflows for Apache Airflow (MWAA)¶
Kedro project preparation¶
MWAA, or Managed Workflows for Apache Airflow, is an AWS service that makes it easier to set up, operate, and scale Apache Airflow in the cloud. Deploying a Kedro pipeline to MWAA shares several steps with Astronomer. There are key differences: you need to store your project data in an AWS S3 bucket and update the DataCatalog. You must also plan how you upload your Kedro configuration, install your Kedro package, and set the required environment variables.
1. Complete steps 1-4 from the Create, prepare and package example Kedro project section.
2. Your project's data should not sit in the working directory of the Airflow container. Instead, create an S3 bucket and upload your data folder from the new-kedro-project folder to your S3 bucket.
3. Update the DataCatalog to reference data in your S3 bucket by adjusting the filepath and adding a credentials entry for each dataset in new-kedro-project/conf/airflow/catalog.yml. Add the S3 prefix to the filepath, as shown below:
companies:
  type: pandas.CSVDataset
  filepath: s3://<your_S3_bucket>/data/01_raw/companies.csv
  credentials: dev_s3
4. Update new-kedro-project/conf/local/credentials.yml with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY and copy it to the new-kedro-project/conf/airflow/ folder:
dev_s3:
  client_kwargs:
    aws_access_key_id: *********************
    aws_secret_access_key: ******************************************
5. Add s3fs to your project's requirements.txt in new-kedro-project to enable communication with AWS S3. Some libraries trigger dependency conflicts in the Airflow environment, so keep the requirements list lean and avoid adding kedro-viz and pytest.
s3fs
6. Follow steps 5-6 from the Create, prepare and package example Kedro project section to package your Kedro project and generate an Airflow DAG.
7. Update the DAG file new_kedro_project_airflow_dag.py located in the dags/ folder by adding conf_source="plugins/conf-new_kedro_project.tar.gz" to the arguments of KedroSession.create() in the Kedro operator execution function. This change is necessary because your Kedro configuration archive will be stored in the plugins/ folder, not the root directory:
def execute(self, context):
    configure_project(self.package_name)
    with KedroSession.create(
        project_path=self.project_path,
        env=self.env,
        conf_source="plugins/conf-new_kedro_project.tar.gz",
    ) as session:
        session.run(self.pipeline_name, node_names=[self.node_name])
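The catalog change described in step 3 applies the same edit to every dataset, so it can be scripted. The sketch below works on a catalog that has already been loaded into a Python dictionary (for example with a YAML library, which is not shown); to_s3_catalog and the bucket name my-bucket are hypothetical.

```python
def to_s3_catalog(catalog: dict, bucket: str, credentials: str = "dev_s3") -> dict:
    """Return a copy of the catalog with local filepaths moved to an S3 bucket."""
    updated = {}
    for name, entry in catalog.items():
        new_entry = dict(entry)  # copy so the original catalog is left untouched
        filepath = new_entry.get("filepath")
        if filepath and not filepath.startswith("s3://"):
            new_entry["filepath"] = f"s3://{bucket}/{filepath}"
            new_entry["credentials"] = credentials
        updated[name] = new_entry
    return updated


# One entry of the local catalog, as it would look after loading the YAML file.
local_catalog = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
    }
}

s3_catalog = to_s3_catalog(local_catalog, bucket="my-bucket")
```

Because string values are rewritten as they are, a dataset factory entry such as data/02_intermediate/{base_dataset}.csv keeps its placeholder after the prefix is added.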
Deployment on MWAA¶
- Archive your three files: new_kedro_project-0.1-py3-none-any.whl and conf-new_kedro_project.tar.gz located in new-kedro-project/dist, and logging.yml located in new-kedro-project/conf/ into a file called plugins.zip and upload it to s3://your_S3_bucket.
zip -j plugins.zip dist/new_kedro_project-0.1-py3-none-any.whl dist/conf-new_kedro_project.tar.gz conf/logging.yml
This archive will be later unpacked to the /plugins folder in the working directory of the Airflow container.
- Create a new requirements.txt file, add the path where your Kedro project will be unpacked in the Airflow container, and upload requirements.txt to s3://your_S3_bucket:
./plugins/new_kedro_project-0.1-py3-none-any.whl
Libraries from requirements.txt will be installed during container initialisation.
- Upload new_kedro_project_airflow_dag.py from the new-kedro-project/dags to s3://your_S3_bucket/dags.
- Create an empty startup.sh file for container startup commands. Set an environment variable for custom Kedro logging:
export KEDRO_LOGGING_CONFIG="plugins/logging.yml"
- Set up a new AWS MWAA environment using the following settings:
S3 Bucket: s3://your_S3_bucket
DAGs folder: s3://your_S3_bucket/dags
Plugins file - optional: s3://your_S3_bucket/plugins.zip
Requirements file - optional: s3://your_S3_bucket/requirements.txt
Startup script file - optional: s3://your_S3_bucket/startup.sh
On the next page, set the Public network (Internet accessible) option in the Web server access section if you want to access your Airflow UI from the internet. Continue with the default options on the remaining pages.
- Once the environment is created, use the Open Airflow UI button to access the standard Airflow interface, where you can manage your DAG.
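If you prefer to script the archive step, the following sketch reproduces the effect of zip -j with Python's standard zipfile module: each file is stored at the top level of the archive, without its directory. The helper name build_plugins_zip is hypothetical, and the demonstration uses placeholder files in a temporary directory rather than a real Kedro project.

```python
import tempfile
import zipfile
from pathlib import Path


def build_plugins_zip(files: list, archive: Path) -> list:
    """Store files at the top level of the archive and return the member names."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            # arcname drops the directory components, like the -j flag of zip.
            zf.write(path, arcname=Path(path).name)
    with zipfile.ZipFile(archive) as zf:
        return zf.namelist()


# Demonstration with placeholder files that mimic the project layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sources = [
        root / "dist" / "new_kedro_project-0.1-py3-none-any.whl",
        root / "dist" / "conf-new_kedro_project.tar.gz",
        root / "conf" / "logging.yml",
    ]
    for src in sources:
        src.parent.mkdir(parents=True, exist_ok=True)
        src.write_bytes(b"placeholder")
    members = build_plugins_zip(sources, root / "plugins.zip")
```

Storing the files without directories matters here because the DAG and startup.sh refer to them as plugins/conf-new_kedro_project.tar.gz and plugins/logging.yml.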
How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster¶
If you want to execute your DAG in an isolated environment on Airflow using a Kubernetes cluster, you can use a combination of the kedro-airflow and kedro-docker plugins.
- Package your Kedro project as a Docker container. Use the kedro docker init and kedro docker build commands to containerise your Kedro project.
- Push the Docker image to a container registry. Upload the built Docker image to a cloud container registry, such as AWS ECR, Google Container Registry, or Docker Hub.
- Generate an Airflow DAG. Run the following command to generate an Airflow DAG:
kedro airflow create
This will create a DAG file that includes the KedroOperator() by default.
Optionally, use --group-by memory or --group-by namespace to group nodes into fewer tasks.
- Update the DAG to use KubernetesPodOperator. To execute each Kedro node (or group of nodes) in an isolated Kubernetes pod, replace KedroOperator() with KubernetesPodOperator(), as shown in the example below:
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id=node_name,
    name=node_name,
    namespace=NAMESPACE,
    image=DOCKER_IMAGE,
    cmds=["kedro"],
    arguments=["run", f"--nodes={node_name}"],
    get_logs=True,
    is_delete_operator_pod=True,  # Cleanup after execution
    in_cluster=False,  # Set to True if Airflow runs inside the Kubernetes cluster
    do_xcom_push=False,
    image_pull_policy="Always",
    # Uncomment the following lines if Airflow is running outside Kubernetes
    # cluster_context="k3d-your-cluster",  # Specify the Kubernetes context from your kubeconfig
    # config_file="~/.kube/config",  # Path to your kubeconfig file
)
Run multiple nodes in a single container¶
By default, this approach runs each node in an isolated Docker container. To reduce computational overhead, you can run multiple nodes together within the same container.
The --group-by option in kedro airflow create provides an automated way to group nodes:
- --group-by memory: Groups nodes connected through MemoryDatasets
- --group-by namespace: Groups nodes by their namespace
If you opt for manual grouping or need to customise the generated DAG, update it to adjust task dependencies and execution order.
For example, in the spaceflights-pandas tutorial, if you want to execute the first two nodes together, your DAG may look like this:
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(...) as dag:
    tasks = {
        "preprocess-companies-and-shuttles": KubernetesPodOperator(
            task_id="preprocess-companies-and-shuttles",
            name="preprocess-companies-and-shuttles",
            namespace=NAMESPACE,
            image=DOCKER_IMAGE,
            cmds=["kedro"],
            arguments=["run", "--nodes=preprocess-companies-node,preprocess-shuttles-node"],
            ...
        ),
        "create-model-input-table-node": KubernetesPodOperator(...),
        ...
    }

    tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
    tasks["create-model-input-table-node"] >> tasks["split-data-node"]
    ...
In this example, we modified the original DAG generated by the kedro airflow create command by replacing KedroOperator() with KubernetesPodOperator(). We also merged the first two tasks into a single task named preprocess-companies-and-shuttles. This task executes the Docker image running two Kedro nodes: preprocess-companies-node and preprocess-shuttles-node.
Furthermore, we adjusted the task order at the end of the DAG. Instead of having separate dependencies for the first two tasks, we consolidated them into a single line:
tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
This ordering ensures that the create-model-input-table-node task runs after preprocess-companies-and-shuttles has completed.
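The two manual changes described above, building the --nodes argument for a merged task and rewiring the dependencies, can be sketched in plain Python. Both helpers are hypothetical and operate on a simple mapping from each task to its upstream tasks; they do not touch Airflow objects.

```python
def kedro_run_arguments(node_names: list) -> list:
    """Build the arguments passed to the kedro command for a group of nodes."""
    return ["run", "--nodes=" + ",".join(node_names)]


def merge_tasks(dependencies: dict, group: list, merged_name: str) -> dict:
    """Replace the tasks in group with one task and rewire upstream links."""
    members = set(group)
    merged = {}
    for task, upstream in dependencies.items():
        target = merged_name if task in members else task
        renamed = {merged_name if u in members else u for u in upstream}
        renamed.discard(target)  # a merged task must not depend on itself
        merged.setdefault(target, set()).update(renamed)
    return merged


# Task -> upstream tasks, following the task names used in the example DAG.
dependencies = {
    "preprocess-companies-node": set(),
    "preprocess-shuttles-node": set(),
    "create-model-input-table-node": {
        "preprocess-companies-node",
        "preprocess-shuttles-node",
    },
    "split-data-node": {"create-model-input-table-node"},
}

group = ["preprocess-companies-node", "preprocess-shuttles-node"]
arguments = kedro_run_arguments(group)
merged = merge_tasks(dependencies, group, "preprocess-companies-and-shuttles")
```

Each remaining entry in merged corresponds to one `>>` line in the DAG, with the upstream task on the left.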
Grouping nodes in Airflow tasks¶
By default, kedro airflow create converts each Kedro node into a separate Airflow task. You may need to group multiple nodes into a single task, either to reduce scheduling overhead or to handle datasets that cannot be shared across distributed workers.
The --group-by option provides two grouping strategies:
Grouping by memory¶
When running Kedro nodes using Airflow, MemoryDatasets are not shared across operators, which can cause the DAG run to fail. Nodes connected through MemoryDatasets can be grouped together using the --group-by memory flag:
kedro airflow create --group-by memory
This combines nodes that share MemoryDatasets into single Airflow tasks, preserving the logical separation in Kedro while avoiding data persistence issues.
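The idea behind memory grouping can be sketched as finding connected components: two nodes end up in the same task when a dataset that is not persisted flows from one to the other. The code below is an illustration under that assumption, not the plugin's implementation; the node table and the set of persisted datasets are hypothetical simplifications of the Spaceflights project.

```python
def group_by_memory(nodes: dict, persisted: set) -> list:
    """Group nodes that are linked by datasets missing from the persisted set."""
    parent = {name: name for name in nodes}

    def find(name: str) -> str:
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    for name, (inputs, _) in nodes.items():
        for dataset in inputs:
            if dataset in producer and dataset not in persisted:
                # An in-memory link: put consumer and producer in one group.
                parent[find(name)] = find(producer[dataset])

    groups = {}
    for name in nodes:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical node table: name -> (inputs, outputs)
nodes = {
    "create_model_input_table_node": ({"preprocessed_companies"}, {"model_input_table"}),
    "split_data_node": ({"model_input_table"}, {"X_train", "X_test", "y_train", "y_test"}),
    "train_model_node": ({"X_train", "y_train"}, {"regressor"}),
    "evaluate_model_node": ({"regressor", "X_test", "y_test"}, set()),
}

# Datasets assumed to be registered in the DataCatalog with persistent storage.
persisted = {"preprocessed_companies", "model_input_table", "regressor"}

groups = group_by_memory(nodes, persisted)
```

In this example the split, train and evaluate nodes form one group because X_train, X_test, y_train and y_test stay in memory, while the node that writes model_input_table remains a task of its own.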
Grouping by namespace¶
If your Kedro pipeline uses namespaces to organise nodes, you can group all nodes within the same namespace into a single Airflow task:
kedro airflow create --group-by namespace
This is useful when:
- You have logically grouped nodes using namespaces and want to execute them together
- You want to reduce the number of Airflow tasks while maintaining the namespace structure
- Your namespaced nodes share intermediate data that doesn't need to be persisted between tasks
Nodes without a namespace will each be converted to individual Airflow tasks.
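Namespace grouping can be pictured with a short sketch. It assumes that a namespaced node is named namespace.node_name and groups by the first segment of that name; the node names are hypothetical, and the plugin may treat nested namespaces differently.

```python
def group_by_namespace(node_names: list) -> dict:
    """Group nodes by the namespace prefix in their name."""
    groups = {}
    for name in node_names:
        namespace, separator, _ = name.partition(".")
        # A node without a namespace keeps its own name and its own task.
        key = namespace if separator else name
        groups.setdefault(key, []).append(name)
    return groups


node_names = [
    "data_processing.preprocess_companies_node",
    "data_processing.preprocess_shuttles_node",
    "data_science.split_data_node",
    "data_science.train_model_node",
    "report_node",  # hypothetical node without a namespace
]

tasks = group_by_namespace(node_names)
```

Each key of tasks corresponds to one Airflow task: two tasks for the namespaces and one for the node that has no namespace.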
For more information about node grouping strategies in Kedro, see the node grouping guide.