
AWS Batch

AWS Batch runs containerised batch jobs at scale. Each job runs in an isolated Docker container. The sections below show how to deploy a Kedro project so each pipeline-level namespace runs as one Batch job, with datasets stored on Amazon S3.

AWS Batch fits Kedro pipelines that need more time or memory than AWS Lambda allows, but do not require PySpark or distributed Spark. For lightweight stages with managed orchestration, use AWS Step Functions. For Spark workloads, use Amazon EMR Serverless.

This guide targets Kedro 1.x (kedro>=1.0) and uses the Spaceflights starter as a worked example. Read the deployment strategy first if you are deploying your own Kedro project and need guidance on namespace grouping, S3 storage, the custom Batch runner, and job definition settings.

Strategy

Read this section before you deploy your own project. It starts with an overview of the approach, then gives practical advice for adapting it to your pipelines.

Overview

This guide deploys a Kedro pipeline on AWS Batch using a custom AWSBatchRunner that runs on your local machine (or CI agent), one container image shared by every job, and Amazon S3 for dataset storage.

The approach in brief:

  1. Group nodes by pipeline-level namespace. Each group becomes one Batch job. Inside the container, the job runs every node in that namespace with your packaged project CLI.
  2. Store shared datasets on S3. Batch containers are isolated, so datasets that cross namespace boundaries cannot use MemoryDataset.
  3. Package the project into a container image (wheel plus conf/). Every namespace group uses the same image.
  4. Submit jobs from a custom runner. AWSBatchRunner groups the pipeline with Pipeline.group_nodes_by("namespace"), submits one Batch job per group with dependsOn links, and polls until each job finishes.

Why use a container image?

A container image lets you package Kedro, project dependencies, and configuration into one artefact. You can build and test that artefact locally before pushing it to Amazon Elastic Container Registry. Each Batch job overrides the container command to run one namespace group.

flowchart LR
  subgraph driver [Your machine]
    Runner[AWSBatchRunner]
  end
  subgraph aws [AWS]
    Q[Batch job queue]
    J1[Job data_processing]
    J2[Job data_science]
    J3[Job reporting]
    S3[(S3 bucket)]
  end
  Runner --> Q
  Q --> J1
  J1 --> S3
  J1 --> J2
  J1 --> J3
  J2 --> S3
  J3 --> S3

For the Spaceflights starter, this pattern creates three Batch jobs (data_processing, data_science, and reporting) instead of one per node.

Use pipeline-level namespaces (defined on the Pipeline object), not node-level namespaces. Node-level namespaces are for Kedro-Viz layout and do not group execution. See the section on grouping nodes with namespaces in Kedro for further explanation.

Choose how to group nodes

Namespace grouping suits most production pipelines where related nodes share dependencies and finish within Batch job timeout and memory limits.

Grouping | Pros | Cons | When to use
One Batch job per namespace (recommended) | Fewer jobs. Related nodes run together | Whole namespace must fit job timeout and memory | Most production Spaceflights-style pipelines
One Batch job per node | Full isolation. Easy to debug a single node | More jobs, longer total runtime, more ECR pulls | Small pipelines or prototyping
Offload Spark stages to EMR Serverless | Distributed Spark fits better | Extra infrastructure to set up and operate | PySpark pipelines or nodes that need Spark

When a namespace exceeds Batch limits

If a namespace outgrows your job definition timeout or memory, split it further or increase job definition resources. Run Spark stages on Amazon EMR Serverless instead of Batch.

Plan execution order and storage

AWSBatchRunner submits namespace groups when their upstream dependencies have finished. Groups with no upstream dependencies can run in parallel, up to max_workers in conf/aws_batch/parameters.yml. The runner uses the dependencies list on each GroupedNodes object from Pipeline.group_nodes_by("namespace").

For Spaceflights, data_processing runs first to produce intermediate datasets on S3. data_science and reporting can run in parallel at the next level because both depend on data_processing outputs (model_input_table and preprocessed_shuttles) but not on each other. The same dependency rules apply in distributed Kedro runs and in grouping nodes for deployment.
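
The ordering described above can be sketched with plain Python. This is an illustration, not the runner's code: the `execution_levels` helper and the hand-written dependency dictionary are assumptions made for the example, and it groups jobs into waves, whereas AWSBatchRunner submits a group as soon as its own dependencies finish.

```python
from __future__ import annotations


def execution_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Return groups in waves: each wave depends only on earlier waves."""
    remaining = {name: set(upstream) for name, upstream in deps.items()}
    done: set[str] = set()
    levels: list[list[str]] = []
    while remaining:
        # A group is ready once all of its upstream groups have finished.
        ready = sorted(name for name, upstream in remaining.items() if upstream <= done)
        if not ready:
            raise ValueError(f"Unresolved groups: {sorted(remaining)}")
        levels.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return levels


# Dependencies as described for Spaceflights (hand-written, not read from Kedro).
spaceflights_deps = {
    "data_processing": set(),
    "data_science": {"data_processing"},
    "reporting": {"data_processing"},
}
levels = execution_levels(spaceflights_deps)
print(levels)  # [['data_processing'], ['data_science', 'reporting']]
```

Groups in the same wave are the ones that can run in parallel, subject to max_workers.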

Give every dataset that crosses a namespace boundary an S3-backed entry in conf/aws_batch/catalog.yml. A dataset without an entry falls back to MemoryDataset, which fails because each Batch job runs in its own container and cannot read another job's memory.

Configure before you deploy

  • Assign pipeline-level namespaces with explicit inputs / outputs and prefix_datasets_with_namespace=False
  • Run each namespace locally (kedro run --namespaces=<name>) to estimate duration and memory, then set job definition timeout and memory for the heaviest namespace
  • Add s3fs and boto3 when using S3-backed datasets and the Batch API from your driver machine
  • Trim image dependencies if the container grows too large for your compute environment

Deploying without pipeline-level namespaces

If your project has no pipeline-level namespaces, AWSBatchRunner still works. Pipeline.group_nodes_by("namespace") treats each node without a namespace as its own group, so you get one Batch job per node. The runner passes --nodes <node_name> for those groups instead of --namespaces <name>.

Working example

Prerequisites

These apply to the step-by-step guide below. This guide builds and deploys from your machine with Kedro, Docker, and the AWS CLI. You use the AWS Management Console to inspect the job queue and job runs after submission, but you cannot complete the guide with the console alone.

You need | Used for
A Kedro project (requires-python = ">=3.10" in pyproject.toml) and Python >=3.10 locally | Packaging the project, local test runs, and running AWSBatchRunner on your driver machine
Docker (Podman also works if you have a docker-compatible CLI) | Building the Batch container image
AWS CLI configured for your target region | Creating S3 and Batch resources, uploading data, pushing the image to ECR, and submitting jobs
An AWS account with permissions for Batch, S3, ECR, IAM, and EC2 (for the Batch compute environment) | Creating and running the deployed resources

The steps that follow deploy the Spaceflights starter end to end. Create the project with:

kedro new -s spaceflights-pandas -n spaceflights_batch

If you are new to the project layout, complete the Spaceflights tutorial. If you use your own Kedro project, replace the placeholders below and follow the same steps.

Placeholders used in this guide

Replace these before building and submitting jobs:

Placeholder | Example
<your-bucket> | kedro-batch-test-123456789012
<your-aws-account-id> | 123456789012
<your-aws-region> | us-east-1
<PACKAGE_NAME> | spaceflights_batch
<PACKAGE_CLI> | spaceflights-batch
<ecr-image-uri> | 123456789012.dkr.ecr.us-east-1.amazonaws.com/spaceflights-batch:latest
<batch-job-role-arn> | IAM role ARN for Batch jobs (S3 access)
<batch-job-queue> | spaceflights_queue
<batch-job-definition> | kedro_run

What you will do

  1. Prepare your Kedro project
  2. Set up AWS
  3. Configure Kedro for AWS Batch
  4. Create the custom AWS Batch runner
  5. Customise the project CLI
  6. Package, build, and push the container image
  7. Submit the pipeline from your machine
  8. Verify the jobs succeeded

Step 1: Prepare your Kedro project

From the project root, install dependencies and run the pipeline locally:

pip install -e .
kedro run

Keep conf/base/catalog.yml on local file paths for local development. You add S3 paths in Step 3 after you create the bucket in Step 2.

Each node should have a meaningful name in its Node(...) definition. Batch job names and log streams use namespace or node names.
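
Because namespace and node names end up in Batch job names, it helps to keep them within what Batch accepts: up to 128 letters, numbers, hyphens, and underscores. The sketch below shows one way to derive a safe name. The `batch_job_name` helper is hypothetical and stricter than the runner in Step 4, which only replaces dots.

```python
import re


def batch_job_name(run_id: str, group_name: str) -> str:
    """Build a Batch-safe job name from a run id and a namespace or node name."""
    raw = f"kedro-{run_id}-{group_name}"
    # Replace anything outside letters, digits, hyphens, and underscores.
    safe = re.sub(r"[^A-Za-z0-9_-]", "-", raw)
    # Batch limits job names to 128 characters.
    return safe[:128]


print(batch_job_name("2025-01-01T12.00.00.000Z", "data_processing"))
# kedro-2025-01-01T12-00-00-000Z-data_processing
```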

Assign pipeline-level namespaces

This step is recommended for fewer Batch jobs and lower orchestration overhead. Read the deployment strategy for grouping trade-offs and namespace requirements. If your pipeline has no pipeline-level namespaces, skip to Step 2: Set up AWS.

Assign a pipeline-level namespace to each sub-pipeline you want to run as one Batch job. In Spaceflights, update create_pipeline() in each module under src/<PACKAGE_NAME>/pipelines/:

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            # ... nodes unchanged ...
        ],
        namespace="data_processing",
        prefix_datasets_with_namespace=False,
        inputs={"companies", "shuttles", "reviews"},
        outputs={"model_input_table"},
    )

Set prefix_datasets_with_namespace=False so dataset names in conf/base/catalog.yml and conf/aws_batch/catalog.yml keep their original names. Declare explicit inputs and outputs for each namespace.

Sub-pipeline | namespace | inputs | outputs
data_science | data_science | {"model_input_table"} | {"regressor", "X_train", "X_test", "y_train", "y_test"}
reporting | reporting | {"preprocessed_shuttles"} | {"shuttle_passenger_capacity_plot_exp", "shuttle_passenger_capacity_plot_go", "dummy_confusion_matrix"}

See the section on grouping nodes with namespaces in Kedro using the full Spaceflights example for further explanation.

Update conf/base/catalog.yml for reporting

If the starter lists matplotlib.MatplotlibWriter for dummy_confusion_matrix, change it to matplotlib.MatplotlibDataset in conf/base/catalog.yml before running locally.

Verify locally after adding namespaces:

kedro run --namespaces=data_processing
kedro run

Step 2: Set up AWS

Create the AWS resources your Kedro project will use before you point configuration at them. Complete this setup for each AWS account and region. Follow the linked AWS guides for console and CLI steps. This section lists what you need and Kedro-specific settings.

Resource | AWS documentation | What you need for Kedro
S3 bucket | Creating a bucket | Raw data and pipeline outputs (s3://<your-bucket>/...)
ECR repository | Create a private repository | One private repo for the Batch container image (for example spaceflights-batch)
IAM job role | IAM roles for Batch | S3 read/write for datasets; attach to the job definition as jobRoleArn
Compute environment | Compute environments | Managed EC2 environment (for example spaceflights_env)
Job queue | Job queues | Links jobs to the compute environment (for example spaceflights_queue)
Job definition | Job definitions | Points at <ecr-image-uri> after you push in Step 6; leave command empty (overridden per job)

Avoid overly broad IAM policies

For production, scope the policy to your bucket ARN.
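
As a rough illustration of a bucket-scoped policy, the sketch below builds an IAM policy document for the job role. The `s3_job_role_policy` helper is hypothetical, and the action list is an assumption about what the Spaceflights datasets need (list, read, write). Adjust it to your own catalog.

```python
import json


def s3_job_role_policy(bucket: str) -> dict:
    """Return an IAM policy document limited to a single bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reads and writes apply to objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


policy = s3_job_role_policy("kedro-batch-test-123456789012")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console or saved to a file for the AWS CLI.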

Upload raw data to S3

Upload input data before you configure the catalog. Follow the AWS guide for uploading objects to S3:

export AWS_REGION=<your-aws-region>
export S3_BUCKET=<your-bucket>

aws s3 mb "s3://${S3_BUCKET}" --region "${AWS_REGION}"
aws s3 sync data/01_raw/ "s3://${S3_BUCKET}/01_raw/"

shuttles.xlsx may be missing locally

The starter gitignores data/01_raw/shuttles.xlsx. Copy it from the Spaceflights starter repository on GitHub if kedro new did not place it in your project.

Create the Amazon ECR repository

Create the repository now. You push the image in Step 6:

aws ecr create-repository --repository-name spaceflights-batch --region <your-aws-region>

Job definition settings (Kedro-specific)

When creating the job definition (for example kedro_run), set:

Setting | Recommended value
Image | <ecr-image-uri> (after Step 6 push)
vCPUs | 2
Memory | 4096 MiB (increase for heavy modelling nodes)
Job role | <batch-job-role-arn> with S3 access
Timeout | 3600 seconds or higher
Command | leave empty (the runner overrides per job)
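
If you prefer to script the job definition instead of using the console, the settings above map onto a RegisterJobDefinition request roughly as follows. This is a sketch under assumptions: the `job_definition_request` helper and the role ARN are hypothetical, and the actual boto3 call is left as a comment because it needs AWS credentials.

```python
from __future__ import annotations


def job_definition_request(name: str, image_uri: str, job_role_arn: str) -> dict:
    """Build keyword arguments for Batch's RegisterJobDefinition call."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image_uri,
            "jobRoleArn": job_role_arn,
            # Batch expects resource values as strings; memory is in MiB.
            "resourceRequirements": [
                {"type": "VCPU", "value": "2"},
                {"type": "MEMORY", "value": "4096"},
            ],
            # No "command" key: the runner overrides it per job.
        },
        "timeout": {"attemptDurationSeconds": 3600},
    }


request = job_definition_request(
    name="kedro_run",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/spaceflights-batch:latest",
    job_role_arn="arn:aws:iam::123456789012:role/example-batch-job-role",  # hypothetical ARN
)
print(request["containerProperties"]["resourceRequirements"])

# To register it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("batch").register_job_definition(**request)
```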

With minvCpus set to 0, the compute environment does not launch instances until jobs are submitted, so creating it does not incur immediate cost.


Step 3: Configure Kedro for AWS Batch

Create an aws_batch config environment

Add conf/aws_batch/ with globals.yml, catalog.yml, and parameters.yml. Set s3_bucket in conf/aws_batch/globals.yml to the same value as S3_BUCKET from Step 2. Learn how to use catalog globals in Kedro configuration.

conf/aws_batch/globals.yml:

s3_bucket: <your-bucket>

conf/aws_batch/parameters.yml (read by the customised CLI in Step 5 and passed to the runner from Step 4):

aws_batch:
  job_queue: <batch-job-queue>
  job_definition: <batch-job-definition>
  max_workers: 4
  package_cli: <PACKAGE_CLI>
  conf_source: /app/conf

Catalog environment merge is destructive

By default, Kedro merges configuration environments at the top level. If conf/aws_batch/catalog.yml overrides a dataset using filepath alone, it replaces the entire dataset entry from conf/base/ and drops keys such as type. Either include the full dataset definition (including type) in conf/aws_batch/catalog.yml, or set "merge_strategy": {"catalog": "soft"} in CONFIG_LOADER_ARGS in settings.py so environment files can override individual fields.
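
The difference between the two merge behaviours can be seen with plain dictionaries. The helpers below are illustrative only and are not Kedro's implementation. The bucket name is a placeholder, and the final line shows the settings.py entry that selects the soft strategy for the catalog.

```python
from __future__ import annotations

base = {"companies": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/companies.csv"}}
override = {"companies": {"filepath": "s3://my-bucket/01_raw/companies.csv"}}


def destructive_merge(base: dict, env: dict) -> dict:
    """Top-level merge: an env entry replaces the whole base entry."""
    return {**base, **env}


def soft_merge(base: dict, env: dict) -> dict:
    """Recursive merge: env values override individual keys only."""
    merged = dict(base)
    for key, value in env.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = soft_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


print(destructive_merge(base, override)["companies"])  # "type" is gone
print(soft_merge(base, override)["companies"])         # "type" is kept

# The corresponding Kedro setting in settings.py:
CONFIG_LOADER_ARGS = {"merge_strategy": {"catalog": "soft"}}
```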

Every shared dataset needs S3 storage

The aws_batch catalog must list every dataset used by the deployed pipeline, including intermediate outputs such as X_train and reporting artefacts. Any dataset left out falls back to MemoryDataset and fails once a downstream Batch job, running in a separate container, tries to load it.

Add dependencies

Add s3fs and boto3 to pyproject.toml:

"s3fs>=2024.6.0",
"boto3>=1.34.0",

The Spaceflights starter uses Parquet for intermediate tables. Use matplotlib.MatplotlibDataset (not the legacy MatplotlibWriter type).

conf/aws_batch/catalog.yml:

companies:
  type: pandas.CSVDataset
  filepath: s3://${globals:s3_bucket}/01_raw/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: s3://${globals:s3_bucket}/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataset
  filepath: s3://${globals:s3_bucket}/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

preprocessed_companies:
  type: pandas.ParquetDataset
  filepath: s3://${globals:s3_bucket}/02_intermediate/preprocessed_companies.parquet

preprocessed_shuttles:
  type: pandas.ParquetDataset
  filepath: s3://${globals:s3_bucket}/02_intermediate/preprocessed_shuttles.parquet

model_input_table:
  type: pandas.ParquetDataset
  filepath: s3://${globals:s3_bucket}/03_primary/model_input_table.parquet

X_train:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/X_train.pickle

X_test:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/X_test.pickle

y_train:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/y_train.pickle

y_test:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/y_test.pickle

regressor:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/06_models/regressor.pickle
  versioned: true

shuttle_passenger_capacity_plot_exp:
  type: plotly.PlotlyDataset
  filepath: s3://${globals:s3_bucket}/08_reporting/shuttle_passenger_capacity_plot_exp.json
  versioned: true
  plotly_args:
    type: bar
    fig:
      x: shuttle_type
      y: passenger_capacity
      orientation: h
    layout:
      xaxis_title: Shuttles
      yaxis_title: Average passenger capacity
      title: Shuttle Passenger capacity

shuttle_passenger_capacity_plot_go:
  type: plotly.JSONDataset
  filepath: s3://${globals:s3_bucket}/08_reporting/shuttle_passenger_capacity_plot_go.json
  versioned: true

dummy_confusion_matrix:
  type: matplotlib.MatplotlibDataset
  filepath: s3://${globals:s3_bucket}/08_reporting/dummy_confusion_matrix.png
  versioned: true

Verify the AWS Batch environment locally

kedro run --env aws_batch

Confirm outputs appear under your S3 bucket paths.


Step 4: Create the custom AWS Batch runner

Create src/<PACKAGE_NAME>/runner/batch_runner.py with an AWSBatchRunner class that extends AbstractRunner and submits Batch jobs for each namespace group (or each node when no namespace is defined).

src/<PACKAGE_NAME>/runner/batch_runner.py:

"""``AWSBatchRunner`` submits Kedro namespace groups as AWS Batch jobs."""

from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from time import sleep
from typing import Any

import boto3
from pluggy import PluginManager

from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline
from kedro.pipeline.node import GroupedNodes
from kedro.runner import AbstractRunner


def _track_batch_job(job_id: str, client: Any) -> None:
    """Poll Batch until the job succeeds or raises on failure."""
    while True:
        sleep(1.0)
        jobs = client.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise ValueError(f"Job ID {job_id} not found.")

        job = jobs[0]
        status = job["status"]

        if status == "FAILED":
            reason = job.get("statusReason", "unknown")
            raise RuntimeError(f"Job {job_id} failed: {reason}")

        if status == "SUCCEEDED":
            return


class AWSBatchRunner(AbstractRunner):
    """Submit Kedro namespace groups to AWS Batch with dependency ordering."""

    def __init__(
        self,
        job_queue: str,
        job_definition: str,
        max_workers: int | None = None,
        package_cli: str | None = None,
        conf_source: str | None = None,
        is_async: bool = False,
    ):
        super().__init__(is_async=is_async)
        self._job_queue = job_queue
        self._job_definition = job_definition
        self._max_workers = max_workers
        self._package_cli = package_cli or "kedro"
        self._conf_source = conf_source
        self._client = boto3.client("batch")

    def _get_required_workers_count(self, groups: list[GroupedNodes]) -> int:
        required = len(groups)
        if self._max_workers is not None:
            return min(required, self._max_workers)
        return required

    def _get_executor(self, max_workers: int):
        return ThreadPoolExecutor(max_workers=max_workers)

    def _run(
        self,
        pipeline: Pipeline,
        catalog: CatalogProtocol,
        hook_manager: PluginManager | None = None,
        run_id: str | None = None,
    ) -> None:
        groups = pipeline.group_nodes_by("namespace")
        group_map = {group.name: group for group in groups}
        group_deps = {group.name: set(group.dependencies) for group in groups}

        todo_groups = set(group_map.keys())
        group_to_job: dict[str, str] = {}
        done_groups: set[str] = set()
        futures: set = set()
        max_workers = self._get_required_workers_count(groups)

        self._logger.info("Max workers: %d", max_workers)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while True:
                done = {fut for fut in futures if fut.done()}
                futures -= done
                for future in done:
                    try:
                        group_name = future.result()
                    except Exception:
                        self._suggest_resume_scenario(
                            pipeline, set(), catalog
                        )
                        raise
                    done_groups.add(group_name)
                    self._logger.info(
                        "Completed %d out of %d jobs",
                        len(done_groups),
                        len(groups),
                    )

                ready = {
                    name
                    for name in todo_groups
                    if group_deps[name] <= done_groups
                }
                todo_groups -= ready
                for name in ready:
                    future = pool.submit(
                        self._submit_job,
                        group_map[name],
                        group_to_job,
                        group_deps[name],
                        run_id,
                    )
                    futures.add(future)

                if not futures:
                    if todo_groups:
                        raise RuntimeError(
                            f"Unresolved groups: {sorted(todo_groups)}"
                        )
                    break

    def _submit_job(
        self,
        group: GroupedNodes,
        group_to_job: dict[str, str],
        group_dependencies: set[str],
        run_id: str | None,
    ) -> str:
        self._logger.info("Submitting the job for group: %s", group.name)

        run_suffix = run_id or "local"
        job_name = f"kedro-{run_suffix}-{group.name}".replace(".", "-")[:128]
        depends_on = [
            {"jobId": group_to_job[dep]}
            for dep in group_dependencies
            if dep in group_to_job
        ]

        command = [
            self._package_cli,
            "run",
            "--env",
            "aws_batch",
        ]
        if group.type == "namespace":
            command.extend(["--namespaces", group.name])
        else:
            command.extend(["--nodes", group.nodes[0]])
        if self._conf_source:
            command.extend(["--conf-source", self._conf_source])

        response = self._client.submit_job(
            jobName=job_name,
            jobQueue=self._job_queue,
            jobDefinition=self._job_definition,
            dependsOn=depends_on,
            containerOverrides={"command": command},
        )

        job_id = response["jobId"]
        group_to_job[group.name] = job_id
        _track_batch_job(job_id, self._client)
        return group.name

Export the runner from src/<PACKAGE_NAME>/runner/__init__.py:

from .batch_runner import AWSBatchRunner

__all__ = ["AWSBatchRunner"]
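
The polling loop can be exercised without AWS by passing in a stand-in client. The sketch below is an assumption-laden illustration: `FakeBatchClient` and `wait_for_job` are hypothetical names, and `wait_for_job` mirrors the shape of `_track_batch_job` with a configurable interval so the check finishes immediately.

```python
from __future__ import annotations

from itertools import chain, repeat
from time import sleep
from typing import Any


class FakeBatchClient:
    """Stand-in for boto3's Batch client that replays a list of statuses."""

    def __init__(self, statuses: list[str]):
        # Repeat the final status forever so extra polls stay valid.
        self._statuses = chain(statuses, repeat(statuses[-1]))
        self.calls = 0

    def describe_jobs(self, jobs: list[str]) -> dict[str, Any]:
        self.calls += 1
        return {"jobs": [{"jobId": jobs[0], "status": next(self._statuses)}]}


def wait_for_job(job_id: str, client: Any, interval: float = 1.0) -> None:
    """Same loop shape as _track_batch_job, with a configurable interval."""
    while True:
        sleep(interval)
        job = client.describe_jobs(jobs=[job_id])["jobs"][0]
        if job["status"] == "FAILED":
            reason = job.get("statusReason", "unknown")
            raise RuntimeError(f"Job {job_id} failed: {reason}")
        if job["status"] == "SUCCEEDED":
            return


client = FakeBatchClient(["SUBMITTED", "RUNNABLE", "RUNNING", "SUCCEEDED"])
wait_for_job("job-123", client, interval=0.0)
print(client.calls)  # 4
```

A longer interval than one second may be preferable for long jobs, since every poll is a DescribeJobs API call.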

Step 5: Customise the project CLI

Add the custom runner and CLI on your driver machine before you build the container image. Push the image in Step 6 and update the job definition Image field before you submit jobs in Step 7.

Kedro's built-in kedro run passes is_async to the runner constructor and nothing else. AWSBatchRunner needs job_queue, job_definition, and other settings from conf/aws_batch/parameters.yml.

Add src/<PACKAGE_NAME>/cli.py using the project CLI template. Override the run command so the runner is constructed with Batch parameters from the active Kedro session:

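from typing import Any

from kedro.utils import load_obj


def _instantiate_runner(runner: str | None, is_async: bool, params: dict[str, Any]):
    # Resolve short names such as "SequentialRunner" against kedro.runner.
    runner_class = load_obj(runner or "SequentialRunner", "kedro.runner")
    runner_kwargs: dict[str, Any] = {"is_async": is_async}
    # Only the Batch runner needs the driver settings from parameters.yml.
    if runner and runner.endswith("AWSBatchRunner"):
        batch_kwargs = params.get("aws_batch") or {}
        runner_kwargs.update(batch_kwargs)
    return runner_class(**runner_kwargs)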

Inside run(), create the session, load the context, build the runner from context.params, and pass it to session.run(). When runner is omitted (for example inside a Batch container job), Kedro defaults to SequentialRunner and does not pass Batch driver settings to the constructor:

with KedroSession.create(
    env=env, conf_source=conf_source, runtime_params=params
) as session:
    context = session.load_context()
    runner_instance = _instantiate_runner(runner, is_async, dict(context.params))
    return session.run(
        runner=runner_instance,
        # ... other run arguments ...
    )

Learn how to customise Kedro commands in common use cases.


Step 6: Package, build, and push the container image

Package the project, then build and push the Batch image. Repeat this step when you change pipeline code, dependencies, or conf/aws_batch/.

Run this in your project root:

kedro package

This creates a .whl in dist/. Learn how to package a Kedro project.

Create a Dockerfile in your project root:

FROM python:3.12-slim

WORKDIR /app

COPY dist/*.whl /tmp/
RUN pip install --no-cache-dir /tmp/*.whl && rm -f /tmp/*.whl

COPY conf/ /app/conf/

Apple Silicon (ARM) builders

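Batch EC2 compute environments typically use x86_64 instances unless you select Arm-based (Graviton) instance types. On Apple Silicon or other ARM machines, build with --platform linux/amd64 (Docker or Podman).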

Build the image. Tag it with your ECR URI at build time:

export ECR_IMAGE=<ecr-image-uri>

docker build --platform linux/amd64 -t ${ECR_IMAGE} .

If you build with a local tag first, run docker tag spaceflights-batch:latest <ecr-image-uri> right before pushing.

How config reaches the job

Now that you have built the image, here is how your Step 3 configuration reaches each Batch container at runtime:

  1. The wheel carries pipeline code. kedro package builds a .whl file that contains your pipeline code and declares its dependencies, which pip installs into the image. It does not include conf/.
  2. The Dockerfile carries conf/. COPY conf/ places your conf/aws_batch/ settings at /app/conf inside the container.
  3. The Batch job command selects the environment. AWSBatchRunner passes --env aws_batch --conf-source /app/conf --namespaces <name> (or --nodes <name> for nodes without a namespace) through containerOverrides.
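
The command from the third point can be reproduced in isolation to see what each container receives. This sketch mirrors the logic in the runner from Step 4. The standalone `container_command` function is a hypothetical helper written for illustration.

```python
from __future__ import annotations


def container_command(
    package_cli: str,
    group_type: str,
    group_name: str,
    conf_source: str | None = None,
) -> list[str]:
    """Mirror the command that AWSBatchRunner passes via containerOverrides."""
    command = [package_cli, "run", "--env", "aws_batch"]
    # Namespace groups run as a unit; nodes without a namespace run individually.
    flag = "--namespaces" if group_type == "namespace" else "--nodes"
    command.extend([flag, group_name])
    if conf_source:
        command.extend(["--conf-source", conf_source])
    return command


print(" ".join(container_command("spaceflights-batch", "namespace", "data_science", "/app/conf")))
# spaceflights-batch run --env aws_batch --namespaces data_science --conf-source /app/conf
```

Running the printed command inside the image locally (for example with docker run) is a quick way to check the configuration before submitting to Batch.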

Push the image to Amazon ECR

Follow the AWS guide for pushing a Docker image to an Amazon ECR repository:

aws ecr get-login-password --region <your-aws-region> | \
  docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com
docker push ${ECR_IMAGE}

Trim dependencies for production images

The Spaceflights starter includes Jupyter and Kedro-Viz, which increase image size. For production Batch images, consider a slimmer requirements subset or a multi-stage Dockerfile that omits development dependencies.

Re-push after catalog or pipeline changes

When you change conf/aws_batch/ or rebuild the wheel, repeat Step 6 and push a new image tag. Update the job definition Image field to <ecr-image-uri> if you created the definition before pushing.


Step 7: Submit the pipeline from your machine

Complete Step 6 first so your job definition points at the image in ECR.

From your project root (with AWS credentials configured), use your packaged project CLI and the custom runner:

spaceflights-batch run \
  --env aws_batch \
  --runner spaceflights_batch.runner.AWSBatchRunner

The AWSBatchRunner on your machine submits Batch jobs. Each job runs inside the container image and executes one namespace group (or a single node when no namespace is defined). For the Spaceflights __default__ pipeline, expect three jobs.

Track jobs in the AWS Batch console or with:

aws batch list-jobs --job-queue <batch-job-queue> --job-status RUNNING

Logs are available in CloudWatch Logs under /aws/batch/job.


Step 8: Verify the jobs succeeded

  1. Check Batch job states. All jobs should reach SUCCEEDED. Learn how to check AWS Batch job status:

aws batch list-jobs --job-queue <batch-job-queue> --job-status SUCCEEDED

  2. Check S3 outputs. List paths from your conf/aws_batch/catalog.yml. Learn how to list objects in Amazon S3:

aws s3 ls "s3://<your-bucket>/" --recursive

You should see objects under 02_intermediate/, 03_primary/, 04_feature/, 06_models/, and 08_reporting/.
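
For a scripted check, the listing can be compared against the expected prefixes. The sketch below is illustrative: `missing_prefixes` is a hypothetical helper and the sample keys are made up. In practice you would collect the keys from the aws s3 ls output or an S3 client.

```python
from __future__ import annotations

EXPECTED_PREFIXES = (
    "02_intermediate/",
    "03_primary/",
    "04_feature/",
    "06_models/",
    "08_reporting/",
)


def missing_prefixes(keys: list[str], expected: tuple[str, ...] = EXPECTED_PREFIXES) -> list[str]:
    """Return expected prefixes that have no object key beneath them."""
    return [prefix for prefix in expected if not any(key.startswith(prefix) for key in keys)]


# Hypothetical listing with the model output absent.
sample_keys = [
    "01_raw/companies.csv",
    "02_intermediate/preprocessed_shuttles.parquet",
    "03_primary/model_input_table.parquet",
    "04_feature/X_train.pickle",
    "08_reporting/dummy_confusion_matrix.png/2025-01-01T12.00.00.000Z/dummy_confusion_matrix.png",
]
print(missing_prefixes(sample_keys))  # ['06_models/']
```

An empty list means every expected folder has at least one object.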

If jobs failed, see Troubleshooting.


Troubleshooting

Symptom | Cause | Fix
Cannot install ... s3fs dependency conflict | kedro-viz pins an older s3fs range | Use s3fs>=2021.4 locally; omit Kedro-Viz from the Batch image for production
AccessDenied on S3 inside Batch jobs | Job role lacks bucket permissions | Attach an IAM policy scoped to <your-bucket> on the job role
MemoryDataset errors between jobs | Dataset missing from conf/aws_batch/catalog.yml | Add S3-backed entries for all shared datasets
Batch job times out mid-namespace | Namespace contains too much work for one job timeout | Split namespaces further, increase job definition timeout/memory, or run heavy stages on Amazon EMR Serverless
Dataset 'MatplotlibWriter' not found | Outdated dataset type in conf/base/catalog.yml | Use matplotlib.MatplotlibDataset in base and aws_batch catalogs
S3 errors during kedro run --env aws_batch locally | Missing AWS CRT support in botocore | Run pip install 'botocore[crt]' and retry
Jobs stuck in RUNNABLE | Compute environment not scaled or no capacity | Check compute environment status; increase maxvCpus or add instance types
Essential container exited immediately | Wrong --conf-source or missing conf/ in image | Verify COPY conf/ /app/conf/ in the Dockerfile and conf_source: /app/conf in parameters
ModuleNotFoundError for runner kwargs | Built-in kedro run used instead of custom cli.py | Use <PACKAGE_CLI> run after adding the customised cli.py from Step 5
Job fails with out-of-memory | Default memory too low for sklearn/matplotlib nodes | Increase memory in the job definition (for example 4096 or 8192 MiB)

Limitations

  • Each Batch job runs one namespace group (or one node when no namespace is defined). A namespace must finish within the job definition timeout and memory limits.
  • Not for Spark: This pattern is for non-distributed Python stages. Run PySpark workloads on Amazon EMR Serverless instead.
  • Driver machine: The machine where you run AWSBatchRunner must stay online until all jobs complete. For managed orchestration of lightweight stages, consider AWS Step Functions.
  • Image lifecycle: When you change pipeline code, dependencies, or conf/aws_batch/, repeat Step 6, push to ECR, and update the job definition image URI before the next run.
  • Pipelines with dozens of nodes without namespaces increase Batch job count and ECR pulls. Add pipeline-level namespaces for coarser grouping.
  • Batch job dependencies use AWS Batch dependsOn. This matches Kedro's DAG but does not replace Kedro's own ThreadRunner parallelism on a single machine.
