AWS Batch¶

AWS Batch runs containerised batch jobs at scale. Each job runs in an isolated Docker container. The sections below show how to deploy a Kedro project so each pipeline-level namespace runs as one Batch job, with datasets stored on Amazon S3.

AWS Batch fits Kedro pipelines that need more time or memory than AWS Lambda allows, but do not require PySpark or distributed Spark. For lightweight stages with managed orchestration, use AWS Step Functions. For Spark workloads, use Amazon EMR Serverless.

This guide targets Kedro 1.x (kedro>=1.0) and uses the Spaceflights starter as a worked example. Read the deployment strategy first if you are deploying your own Kedro project and need guidance on namespace grouping, S3 storage, the custom Batch runner, and job definition settings.

Strategy¶

Read this section before you deploy your own project. It starts with an overview of the approach, then gives practical advice for adapting it to your pipelines.

Overview¶

This guide deploys a Kedro pipeline on AWS Batch using a custom AWSBatchRunner that runs on your local machine (or CI agent), a container image shared by every job, and Amazon S3 for dataset storage.

The approach in brief:

Group nodes by pipeline-level namespace. Each group becomes one Batch job. Inside the container, the job runs every node in that namespace with your packaged project CLI.
Store shared datasets on S3. Batch containers are isolated, so datasets that cross namespace boundaries cannot use MemoryDataset.
Package the project into a container image (wheel plus conf/). Every namespace group uses the same image.
Submit jobs from a custom runner. AWSBatchRunner groups the pipeline with Pipeline.group_nodes_by("namespace"), submits one Batch job per group with dependsOn links, and polls until each job finishes.

Why use a container image?¶

A container image lets you package Kedro, project dependencies, and configuration into one artefact. You can build and test that artefact locally before pushing it to Amazon Elastic Container Registry. Each Batch job overrides the container command to run one namespace group.

flowchart LR
  subgraph driver [Your machine]
    Runner[AWSBatchRunner]
  end
  subgraph aws [AWS]
    Q[Batch job queue]
    J1[Job data_processing]
    J2[Job data_science]
    J3[Job reporting]
    S3[(S3 bucket)]
  end
  Runner --> Q
  Q --> J1
  J1 --> S3
  J1 --> J2
  J1 --> J3
  J2 --> S3
  J3 --> S3

For the Spaceflights starter, this pattern creates three Batch jobs (data_processing, data_science, and reporting) instead of one per node.

Use pipeline-level namespaces (defined on the Pipeline object), not node-level namespaces. Node-level namespaces are for Kedro-Viz layout and do not group execution. See the section on grouping nodes with namespaces in Kedro for further explanation.

Choose how to group nodes¶

Namespace grouping suits most production pipelines where related nodes share dependencies and finish within Batch job timeout and memory limits.

Grouping	Pros	Cons	When to use
One Batch job per namespace (recommended)	Fewer jobs. Related nodes run together	Whole namespace must fit job timeout and memory	Most production Spaceflights-style pipelines
One Batch job per node	Full isolation. Easy to debug a single node	More jobs, longer total runtime, more ECR pulls	Small pipelines or prototyping
Offload Spark stages to EMR Serverless	Distributed Spark fits better	Extra infrastructure to set up and operate	PySpark pipelines or nodes that need Spark

When a namespace exceeds Batch limits

If a namespace outgrows your job definition timeout or memory, split it further or increase job definition resources. Run Spark stages on Amazon EMR Serverless instead of Batch.

Plan execution order and storage¶

AWSBatchRunner submits namespace groups when their upstream dependencies have finished. Groups with no upstream dependencies can run in parallel, up to max_workers in conf/aws_batch/parameters.yml. The runner uses the dependencies list on each GroupedNodes object from Pipeline.group_nodes_by("namespace").

For Spaceflights, data_processing runs first to produce intermediate datasets on S3. data_science and reporting can run in parallel at the next level because both depend on data_processing outputs (model_input_table and preprocessed_shuttles) but not on each other. The same dependency rules apply in distributed Kedro runs and in grouping nodes for deployment.
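
The ordering described above can be sketched with the standard library alone. This is a minimal illustration, not the runner's code: the GROUP_DEPS map is a hypothetical stand-in for the dependencies that Pipeline.group_nodes_by("namespace") reports, and execution_levels is a helper name made up for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: group name -> upstream groups it waits for.
GROUP_DEPS = {
    "data_processing": set(),
    "data_science": {"data_processing"},
    "reporting": {"data_processing"},
}


def execution_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Return groups in waves; groups in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels


LEVELS = execution_levels(GROUP_DEPS)
print(LEVELS)
```

Each inner list is one wave of Batch jobs. The number of jobs that actually run at once within a wave is still capped by max_workers.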

Give every dataset shared across namespace groups an S3 filepath in conf/aws_batch/catalog.yml. A dataset left out of the catalog falls back to MemoryDataset, which exists only inside the container that produced it, so the downstream job cannot load it and fails.

Configure before you deploy¶

Assign pipeline-level namespaces with explicit inputs / outputs and prefix_datasets_with_namespace=False
Run each namespace locally (kedro run --namespaces=<name>) to estimate duration and memory, then set job definition timeout and memory for the heaviest namespace
Add s3fs and boto3 when using S3-backed datasets and the Batch API from your driver machine
Trim image dependencies if the container grows too large for your compute environment
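
One rough way to collect the duration and memory figures mentioned in the list above is to time the local run and read the peak memory of the child process. This is a sketch under assumptions: it relies on the Unix-only resource module, measure is a hypothetical helper, and the command shown is a stand-in so the example runs outside a Kedro project.

```python
import resource
import subprocess
import sys
import time


def measure(command: list[str]) -> tuple[float, int]:
    """Run a command; return (wall-clock seconds, peak child RSS)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    # Largest resident set size among child processes waited for so far.
    # Reported in kibibytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss


# Stand-in command so the sketch runs anywhere. Inside a Kedro project, use
# ["kedro", "run", "--namespaces=data_processing"] instead.
SECONDS, PEAK_RSS = measure([sys.executable, "-c", "pass"])
print(f"{SECONDS:.2f} s, peak RSS {PEAK_RSS}")
```

Treat the result as a lower bound when sizing the job definition: the Batch instance type, S3 transfer times, and data volume in production will differ from your machine.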

Deploying without pipeline-level namespaces

If your project has no pipeline-level namespaces, AWSBatchRunner still works. Pipeline.group_nodes_by("namespace") treats each node without a namespace as its own group, so you get one Batch job per node. The runner passes --nodes <node_name> for those groups instead of --namespaces <name>.

Working example¶

Prerequisites¶

These apply to the step-by-step guide below. This guide builds and deploys from your machine with Kedro, Docker, and the AWS CLI. You use the AWS Management Console to inspect the job queue and job runs after submission, but you cannot complete the guide with the console alone.

You need	Used for
A Kedro project (`requires-python = ">=3.10"` in `pyproject.toml`) and Python >=3.10 locally	Packaging the project, local test runs, and running `AWSBatchRunner` on your driver machine
Docker (Podman also works if you have a `docker`-compatible CLI)	Building the Batch container image
AWS CLI configured for your target region	Creating S3 and Batch resources, uploading data, pushing the image to ECR, and submitting jobs
An AWS account with permissions for Batch, S3, ECR, IAM, and EC2 (for the Batch compute environment)	Creating and running the deployed resources

The steps that follow deploy the Spaceflights starter end to end. Create the project with:

kedro new -s spaceflights-pandas -n spaceflights_batch

If you are new to the project layout, complete the Spaceflights tutorial. If you use your own Kedro project, replace the placeholders below and follow the same steps.

Placeholders used in this guide¶

Replace these before building and submitting jobs:

Placeholder	Example
`<your-bucket>`	`kedro-batch-test-123456789012`
`<your-aws-account-id>`	`123456789012`
`<your-aws-region>`	`us-east-1`
`<PACKAGE_NAME>`	`spaceflights_batch`
`<PACKAGE_CLI>`	`spaceflights-batch`
`<ecr-image-uri>`	`123456789012.dkr.ecr.us-east-1.amazonaws.com/spaceflights-batch:latest`
`<batch-job-role-arn>`	IAM role ARN for Batch jobs (S3 access)
`<batch-job-queue>`	`spaceflights_queue`
`<batch-job-definition>`	`kedro_run`
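
The image URI in the table follows a fixed pattern built from the account ID, region, repository, and tag. The helper below is a small sketch of that pattern; ecr_image_uri is a hypothetical name, and the format assumes the standard commercial AWS partition (amazonaws.com).

```python
def ecr_image_uri(
    account_id: str, region: str, repository: str, tag: str = "latest"
) -> str:
    """Compose a private ECR image URI from its parts."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


URI = ecr_image_uri("123456789012", "us-east-1", "spaceflights-batch")
print(URI)
```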

What you will do¶

Prepare your Kedro project
Set up AWS
Configure Kedro for AWS Batch
Create the custom AWS Batch runner
Customise the project CLI
Package, build, and push the container image
Submit the pipeline from your machine
Verify the jobs succeeded

Step 1: Prepare your Kedro project¶

From the project root, install dependencies and run the pipeline locally:

pip install -e .
kedro run

Keep conf/base/catalog.yml on local file paths for local development. You add S3 paths in Step 3 after you create the bucket in Step 2.

Each node should have a meaningful name in its Node(...) definition. Batch job names and log streams use namespace or node names.
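
Batch job names may contain only letters, numbers, hyphens, and underscores, up to 128 characters. The sketch below shows one way to derive a valid job name from a run ID and a group name. batch_job_name is a hypothetical helper and is stricter than the runner in Step 4, which only replaces dots.

```python
import re


def batch_job_name(run_id: str, group_name: str, max_length: int = 128) -> str:
    """Build a job name that fits AWS Batch naming rules."""
    raw = f"kedro-{run_id}-{group_name}"
    # Replace anything outside letters, digits, hyphens, and underscores.
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "-", raw)
    return cleaned[:max_length]


NAME = batch_job_name("2025-01-01T12.00.00.000Z", "data_processing")
print(NAME)
```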

Assign pipeline-level namespaces¶

This step is recommended for fewer Batch jobs and lower orchestration overhead. Read the deployment strategy for grouping trade-offs and namespace requirements. If your pipeline has no pipeline-level namespaces, skip to Step 2: Set up AWS.

Assign a pipeline-level namespace to each sub-pipeline you want to run as one Batch job. In Spaceflights, update create_pipeline() in each module under src/<PACKAGE_NAME>/pipelines/:

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            # ... nodes unchanged ...
        ],
        namespace="data_processing",
        prefix_datasets_with_namespace=False,
        inputs={"companies", "shuttles", "reviews"},
        outputs={"model_input_table"},
    )

Set prefix_datasets_with_namespace=False so dataset names in conf/base/catalog.yml and conf/aws_batch/catalog.yml keep their original names. Declare explicit inputs and outputs for each namespace.

Sub-pipeline	`namespace`	`inputs`	`outputs`
`data_science`	`data_science`	`{"model_input_table"}`	`{"regressor", "X_train", "X_test", "y_train", "y_test"}`
`reporting`	`reporting`	`{"preprocessed_shuttles"}`	`{"shuttle_passenger_capacity_plot_exp", "shuttle_passenger_capacity_plot_go", "dummy_confusion_matrix"}`

See the section on grouping nodes with namespaces in Kedro using the full Spaceflights example for further explanation.

Update conf/base/catalog.yml for reporting

If the starter lists matplotlib.MatplotlibWriter for dummy_confusion_matrix, change it to matplotlib.MatplotlibDataset in conf/base/catalog.yml before running locally.

Verify locally after adding namespaces:

kedro run --namespaces=data_processing
kedro run

Step 2: Set up AWS¶

Create the AWS resources your Kedro project will use before you point configuration at them. Complete this setup for each AWS account and region. Follow the linked AWS guides for console and CLI steps. This section lists what you need and Kedro-specific settings.

Resource	AWS documentation	What you need for Kedro
S3 bucket	Creating a bucket	Raw data and pipeline outputs (`s3://<your-bucket>/...`)
ECR repository	Create a private repository	One private repo for the Batch container image (for example `spaceflights-batch`)
IAM job role	IAM roles for Batch	S3 read/write for datasets; attach to the job definition as `jobRoleArn`
Compute environment	Compute environments	Managed EC2 environment (for example `spaceflights_env`)
Job queue	Job queues	Links jobs to the compute environment (for example `spaceflights_queue`)
Job definition	Job definitions	Points at `<ecr-image-uri>` after you push in Step 6; leave `command` empty (overridden per job)

Avoid overly broad IAM policies

For production, scope the policy to your bucket ARN.
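
As an illustration of a bucket-scoped policy, the sketch below builds a policy document for one bucket. It is an assumption-laden starting point, not a complete policy: bucket_scoped_policy is a hypothetical helper, and your datasets may need further actions (for example for deletes or multipart uploads).

```python
import json


def bucket_scoped_policy(bucket: str) -> dict:
    """Return an IAM policy document limited to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,
            },
            {
                # Object reads and writes apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


POLICY = bucket_scoped_policy("kedro-batch-test-123456789012")
print(json.dumps(POLICY, indent=2))
```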

Upload raw data to S3¶

Upload input data before you configure the catalog. Follow the AWS guide for uploading objects to S3:

export AWS_REGION=<your-aws-region>
export S3_BUCKET=<your-bucket>

aws s3 mb "s3://${S3_BUCKET}" --region "${AWS_REGION}"
aws s3 sync data/01_raw/ "s3://${S3_BUCKET}/01_raw/"

shuttles.xlsx may be missing locally

The starter gitignores data/01_raw/shuttles.xlsx. Copy it from the Spaceflights starter repository on GitHub if kedro new did not place it in your project.

Create the Amazon ECR repository¶

Create the repository now. You push the image in Step 6:

aws ecr create-repository --repository-name spaceflights-batch --region <your-aws-region>

Job definition settings (Kedro-specific)¶

When creating the job definition (for example kedro_run), set:

Setting	Recommended value
Image	`<ecr-image-uri>` (after Step 6 push)
vCPUs	`2`
Memory	`4096` MiB (increase for heavy modelling nodes)
Job role	`<batch-job-role-arn>` with S3 access
Timeout	`3600` seconds or higher
Command	leave empty (the runner overrides per job)
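
If you prefer to script the job definition, the settings in the table map onto a register_job_definition request roughly as follows. This is a sketch: job_definition_request is a hypothetical helper, the role ARN is a made-up example, and the request leaves out command because the runner overrides it per job.

```python
def job_definition_request(
    name: str,
    image_uri: str,
    job_role_arn: str,
    vcpus: int = 2,
    memory_mib: int = 4096,
    timeout_seconds: int = 3600,
) -> dict:
    """Build keyword arguments for the Batch register_job_definition call."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image_uri,
            "jobRoleArn": job_role_arn,
            # Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


REQUEST = job_definition_request(
    "kedro_run",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/spaceflights-batch:latest",
    "arn:aws:iam::123456789012:role/example-batch-job-role",  # hypothetical ARN
)
# With AWS credentials configured, the request could be sent with:
#   boto3.client("batch").register_job_definition(**REQUEST)
print(REQUEST["containerProperties"]["resourceRequirements"])
```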

A managed compute environment with minvCpus set to 0 does not launch instances until jobs are submitted, so creating it does not incur immediate compute cost.

Step 3: Configure Kedro for AWS Batch¶

Create an `aws_batch` config environment¶

Add conf/aws_batch/ with globals.yml, catalog.yml, and parameters.yml. Set s3_bucket in conf/aws_batch/globals.yml to the same value as S3_BUCKET from Step 2. Learn how to use catalog globals in Kedro configuration.

conf/aws_batch/globals.yml:

s3_bucket: <your-bucket>

conf/aws_batch/parameters.yml (read by the custom CLI in Step 5 and passed to the runner from Step 4):

aws_batch:
  job_queue: <batch-job-queue>
  job_definition: <batch-job-definition>
  max_workers: 4
  package_cli: <PACKAGE_CLI>
  conf_source: /app/conf

Catalog environment merge is destructive

By default, Kedro merges configuration environments at the top level. If conf/aws_batch/catalog.yml overrides a dataset using filepath alone, it replaces the entire dataset entry from conf/base/ and drops keys such as type. Either include the full dataset definition (including type) in conf/aws_batch/catalog.yml, or set CONFIG_LOADER_ARGS = {"merge_strategy": {"catalog": "soft"}} in settings.py so environment files can override individual fields.
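
The difference between the two merge behaviours can be illustrated with plain dictionaries. This is a conceptual sketch, not Kedro's loader code: destructive_merge and soft_merge are hypothetical helpers that mimic a top-level replace and a key-by-key merge.

```python
from copy import deepcopy

BASE = {
    "companies": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
    }
}
OVERRIDE = {"companies": {"filepath": "s3://<your-bucket>/01_raw/companies.csv"}}


def destructive_merge(base: dict, override: dict) -> dict:
    """Top-level keys in override replace the whole base entry."""
    return {**deepcopy(base), **deepcopy(override)}


def soft_merge(base: dict, override: dict) -> dict:
    """Nested dictionaries are merged key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = soft_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


DESTRUCTIVE = destructive_merge(BASE, OVERRIDE)
SOFT = soft_merge(BASE, OVERRIDE)
print(DESTRUCTIVE)  # the "type" key is gone
print(SOFT)  # "type" survives and "filepath" points at S3
```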

Every shared dataset needs S3 storage

The aws_batch catalog must list every dataset used by the deployed pipeline, including intermediate outputs such as X_train and reporting artefacts. Any dataset you leave out falls back to MemoryDataset, which does not survive beyond the container that wrote it, so a later job that needs the data fails.

Add dependencies¶

Add s3fs and boto3 to pyproject.toml:

"s3fs>=2024.6.0",
"boto3>=1.34.0",

The Spaceflights starter uses Parquet for intermediate tables. Use matplotlib.MatplotlibDataset (not the legacy MatplotlibWriter type).

View conf/aws_batch/catalog.yml

companies:
  type: pandas.CSVDataset
  filepath: s3://${globals:s3_bucket}/01_raw/companies.csv

reviews:
  type: pandas.CSVDataset
  filepath: s3://${globals:s3_bucket}/01_raw/reviews.csv

shuttles:
  type: pandas.ExcelDataset
  filepath: s3://${globals:s3_bucket}/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl

preprocessed_companies:
  type: pandas.ParquetDataset
  filepath: s3://${globals:s3_bucket}/02_intermediate/preprocessed_companies.parquet

preprocessed_shuttles:
  type: pandas.ParquetDataset
  filepath: s3://${globals:s3_bucket}/02_intermediate/preprocessed_shuttles.parquet

model_input_table:
  type: pandas.ParquetDataset
  filepath: s3://${globals:s3_bucket}/03_primary/model_input_table.parquet

X_train:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/X_train.pickle

X_test:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/X_test.pickle

y_train:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/y_train.pickle

y_test:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/04_feature/y_test.pickle

regressor:
  type: pickle.PickleDataset
  filepath: s3://${globals:s3_bucket}/06_models/regressor.pickle
  versioned: true

shuttle_passenger_capacity_plot_exp:
  type: plotly.PlotlyDataset
  filepath: s3://${globals:s3_bucket}/08_reporting/shuttle_passenger_capacity_plot_exp.json
  versioned: true
  plotly_args:
    type: bar
    fig:
      x: shuttle_type
      y: passenger_capacity
      orientation: h
    layout:
      xaxis_title: Shuttles
      yaxis_title: Average passenger capacity
      title: Shuttle Passenger capacity

shuttle_passenger_capacity_plot_go:
  type: plotly.JSONDataset
  filepath: s3://${globals:s3_bucket}/08_reporting/shuttle_passenger_capacity_plot_go.json
  versioned: true

dummy_confusion_matrix:
  type: matplotlib.MatplotlibDataset
  filepath: s3://${globals:s3_bucket}/08_reporting/dummy_confusion_matrix.png
  versioned: true
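
Before building the image, a quick check of the catalog can catch entries that are missing or still point at local paths. The sketch below works on an already-parsed catalog dictionary; non_s3_entries, missing_entries, and the sample data are hypothetical and only illustrate the idea.

```python
from __future__ import annotations


def non_s3_entries(catalog: dict[str, dict]) -> list[str]:
    """Names of catalog entries whose filepath is absent or not on S3."""
    return sorted(
        name
        for name, entry in catalog.items()
        if not str(entry.get("filepath", "")).startswith("s3://")
    )


def missing_entries(dataset_names: set[str], catalog: dict[str, dict]) -> list[str]:
    """Dataset names the pipeline uses that the catalog does not define."""
    return sorted(dataset_names - set(catalog))


# Sample data: one entry on S3, one left on a local path, one not listed.
CATALOG = {
    "model_input_table": {
        "type": "pandas.ParquetDataset",
        "filepath": "s3://${globals:s3_bucket}/03_primary/model_input_table.parquet",
    },
    "X_train": {
        "type": "pickle.PickleDataset",
        "filepath": "data/04_feature/X_train.pickle",
    },
}
SHARED = {"model_input_table", "X_train", "preprocessed_shuttles"}

NOT_ON_S3 = non_s3_entries(CATALOG)
MISSING = missing_entries(SHARED, CATALOG)
print(NOT_ON_S3, MISSING)
```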

Verify the AWS Batch environment locally¶

kedro run --env aws_batch

Confirm outputs appear under your S3 bucket paths.

Step 4: Create the custom AWS Batch runner¶

Create src/<PACKAGE_NAME>/runner/batch_runner.py with an AWSBatchRunner class that extends AbstractRunner and submits Batch jobs for each namespace group (or each node when no namespace is defined).

View batch_runner.py

"""``AWSBatchRunner`` submits Kedro namespace groups as AWS Batch jobs."""

from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from time import sleep
from typing import Any

import boto3
from pluggy import PluginManager

from kedro.io import CatalogProtocol
from kedro.pipeline import Pipeline
from kedro.pipeline.node import GroupedNodes
from kedro.runner import AbstractRunner


def _track_batch_job(job_id: str, client: Any) -> None:
    """Poll Batch until the job succeeds or raises on failure."""
    while True:
        sleep(1.0)
        jobs = client.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise ValueError(f"Job ID {job_id} not found.")

        job = jobs[0]
        status = job["status"]

        if status == "FAILED":
            reason = job.get("statusReason", "unknown")
            raise RuntimeError(f"Job {job_id} failed: {reason}")

        if status == "SUCCEEDED":
            return


class AWSBatchRunner(AbstractRunner):
    """Submit Kedro namespace groups to AWS Batch with dependency ordering."""

    def __init__(
        self,
        job_queue: str,
        job_definition: str,
        max_workers: int | None = None,
        package_cli: str | None = None,
        conf_source: str | None = None,
        is_async: bool = False,
    ):
        super().__init__(is_async=is_async)
        self._job_queue = job_queue
        self._job_definition = job_definition
        self._max_workers = max_workers
        self._package_cli = package_cli or "kedro"
        self._conf_source = conf_source
        self._client = boto3.client("batch")

    def _get_required_workers_count(self, groups: list[GroupedNodes]) -> int:
        required = len(groups)
        if self._max_workers is not None:
            return min(required, self._max_workers)
        return required

    def _get_executor(self, max_workers: int):
        return ThreadPoolExecutor(max_workers=max_workers)

    def _run(
        self,
        pipeline: Pipeline,
        catalog: CatalogProtocol,
        hook_manager: PluginManager | None = None,
        run_id: str | None = None,
    ) -> None:
        groups = pipeline.group_nodes_by("namespace")
        group_map = {group.name: group for group in groups}
        group_deps = {group.name: set(group.dependencies) for group in groups}

        todo_groups = set(group_map.keys())
        group_to_job: dict[str, str] = {}
        done_groups: set[str] = set()
        futures: set = set()
        max_workers = self._get_required_workers_count(groups)

        self._logger.info("Max workers: %d", max_workers)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            while True:
                done = {fut for fut in futures if fut.done()}
                futures -= done
                for future in done:
                    try:
                        group_name = future.result()
                    except Exception:
                        self._suggest_resume_scenario(
                            pipeline, set(), catalog
                        )
                        raise
                    done_groups.add(group_name)
                    self._logger.info(
                        "Completed %d out of %d jobs",
                        len(done_groups),
                        len(groups),
                    )

                ready = {
                    name
                    for name in todo_groups
                    if group_deps[name] <= done_groups
                }
                todo_groups -= ready
                for name in ready:
                    future = pool.submit(
                        self._submit_job,
                        group_map[name],
                        group_to_job,
                        group_deps[name],
                        run_id,
                    )
                    futures.add(future)

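                if not futures:
                    if todo_groups:
                        raise RuntimeError(
                            f"Unresolved groups: {sorted(todo_groups)}"
                        )
                    break

                # Pause between scheduling passes so the loop does not
                # spin the CPU while jobs are still running.
                sleep(1.0)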

    def _submit_job(
        self,
        group: GroupedNodes,
        group_to_job: dict[str, str],
        group_dependencies: set[str],
        run_id: str | None,
    ) -> str:
        self._logger.info("Submitting the job for group: %s", group.name)

        run_suffix = run_id or "local"
        job_name = f"kedro-{run_suffix}-{group.name}".replace(".", "-")[:128]
        depends_on = [
            {"jobId": group_to_job[dep]}
            for dep in group_dependencies
            if dep in group_to_job
        ]

        command = [
            self._package_cli,
            "run",
            "--env",
            "aws_batch",
        ]
        if group.type == "namespace":
            command.extend(["--namespaces", group.name])
        else:
            command.extend(["--nodes", group.nodes[0]])
        if self._conf_source:
            command.extend(["--conf-source", self._conf_source])

        response = self._client.submit_job(
            jobName=job_name,
            jobQueue=self._job_queue,
            jobDefinition=self._job_definition,
            dependsOn=depends_on,
            containerOverrides={"command": command},
        )

        job_id = response["jobId"]
        group_to_job[group.name] = job_id
        _track_batch_job(job_id, self._client)
        return group.name
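
The polling logic can be exercised without AWS by passing in a stub client. The sketch below is a variant of the tracking function with an injectable sleep, written only for illustration: wait_for_job and FakeBatchClient are hypothetical names, and the stub returns just the fields the function reads.

```python
import time
from typing import Any, Callable


def wait_for_job(
    job_id: str,
    client: Any,
    sleep: Callable[[float], Any] = time.sleep,
    interval: float = 1.0,
) -> None:
    """Poll describe_jobs until the job succeeds; raise if it fails."""
    while True:
        jobs = client.describe_jobs(jobs=[job_id])["jobs"]
        if not jobs:
            raise ValueError(f"Job ID {job_id} not found.")
        job = jobs[0]
        if job["status"] == "SUCCEEDED":
            return
        if job["status"] == "FAILED":
            reason = job.get("statusReason", "unknown")
            raise RuntimeError(f"Job {job_id} failed: {reason}")
        sleep(interval)


class FakeBatchClient:
    """Stub that replays a fixed sequence of job statuses."""

    def __init__(self, statuses: list[str]):
        self._statuses = iter(statuses)

    def describe_jobs(self, jobs: list[str]) -> dict:
        return {"jobs": [{"jobId": jobs[0], "status": next(self._statuses)}]}


# Record the requested delays instead of sleeping.
POLL_DELAYS: list[float] = []
wait_for_job(
    "job-1",
    FakeBatchClient(["SUBMITTED", "RUNNING", "SUCCEEDED"]),
    sleep=POLL_DELAYS.append,
)
print(POLL_DELAYS)
```

Injecting sleep keeps the test instant and makes it easy to try a longer interval if one-second polling hits API rate limits.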

Export the runner from src/<PACKAGE_NAME>/runner/__init__.py:

from .batch_runner import AWSBatchRunner

__all__ = ["AWSBatchRunner"]

Step 5: Customise the project CLI¶

Add the custom runner and CLI on your driver machine before you build the container image. Push the image in Step 6 and update the job definition Image field before you submit jobs in Step 7.

Kedro's built-in kedro run passes is_async to the runner constructor and nothing else. AWSBatchRunner needs job_queue, job_definition, and other settings from conf/aws_batch/parameters.yml.

Add src/<PACKAGE_NAME>/cli.py using the project CLI template. Override the run command so the runner is constructed with Batch parameters from the active Kedro session:

from typing import Any

from kedro.utils import load_obj


def _instantiate_runner(runner: str | None, is_async: bool, params: dict[str, Any]):
    runner_class = load_obj(runner or "SequentialRunner", "kedro.runner")
    runner_kwargs: dict[str, Any] = {"is_async": is_async}
    # Only the Batch runner takes the extra settings under the aws_batch key.
    if runner and runner.endswith("AWSBatchRunner"):
        batch_kwargs = params.get("aws_batch") or {}
        runner_kwargs.update(batch_kwargs)
    return runner_class(**runner_kwargs)

Inside run(), create the session, load the context, build the runner from context.params, and pass it to session.run(). When runner is omitted (for example inside a Batch container job), Kedro defaults to SequentialRunner and does not pass Batch driver settings to the constructor:

with KedroSession.create(
    env=env, conf_source=conf_source, runtime_params=params
) as session:
    context = session.load_context()
    runner_instance = _instantiate_runner(runner, is_async, dict(context.params))
    return session.run(
        runner=runner_instance,
        # ... other run arguments ...
    )
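
To see how the settings under aws_batch reach the constructor, the sketch below repeats the keyword-argument logic with a stand-in class. StubBatchRunner and runner_kwargs are hypothetical names used so the example runs without Kedro or AWS. The parameter values mirror conf/aws_batch/parameters.yml with the example placeholders filled in.

```python
from __future__ import annotations

from typing import Any


class StubBatchRunner:
    """Stand-in with the same constructor arguments as AWSBatchRunner."""

    def __init__(
        self,
        job_queue: str,
        job_definition: str,
        max_workers: int | None = None,
        package_cli: str | None = None,
        conf_source: str | None = None,
        is_async: bool = False,
    ):
        self.job_queue = job_queue
        self.job_definition = job_definition
        self.max_workers = max_workers
        self.package_cli = package_cli
        self.conf_source = conf_source
        self.is_async = is_async


def runner_kwargs(
    runner: str | None, is_async: bool, params: dict[str, Any]
) -> dict[str, Any]:
    """Collect constructor arguments; Batch settings apply only on the driver."""
    kwargs: dict[str, Any] = {"is_async": is_async}
    if runner and runner.endswith("AWSBatchRunner"):
        kwargs.update(params.get("aws_batch") or {})
    return kwargs


PARAMS = {
    "aws_batch": {
        "job_queue": "spaceflights_queue",
        "job_definition": "kedro_run",
        "max_workers": 4,
        "package_cli": "spaceflights-batch",
        "conf_source": "/app/conf",
    }
}

DRIVER_KWARGS = runner_kwargs("spaceflights_batch.runner.AWSBatchRunner", False, PARAMS)
CONTAINER_KWARGS = runner_kwargs(None, False, PARAMS)
RUNNER = StubBatchRunner(**DRIVER_KWARGS)
print(CONTAINER_KWARGS, RUNNER.job_queue)
```

Because the key names in parameters.yml are passed straight through as keyword arguments, a misspelt key would surface as a TypeError when the runner is constructed.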

Learn how to customise Kedro commands in common use cases.

Step 6: Package, build, and push the container image¶

Package the project, then build and push the Batch image. Repeat this step when you change pipeline code, dependencies, or conf/aws_batch/.

Run this in your project root:

kedro package

This creates a .whl in dist/. Learn how to package a Kedro project.

Create a Dockerfile in your project root:

FROM python:3.12-slim

WORKDIR /app

COPY dist/*.whl /tmp/
RUN pip install --no-cache-dir /tmp/*.whl && rm -f /tmp/*.whl

COPY conf/ /app/conf/

Apple Silicon (ARM) builders

Unless you select Arm-based (Graviton) instance types, Batch EC2 compute environments launch x86_64 instances. Build with --platform linux/amd64 (Docker or Podman).

Build the image. Tag it with your ECR URI at build time:

export ECR_IMAGE=<ecr-image-uri>

docker build --platform linux/amd64 -t ${ECR_IMAGE} .

If you build with a local tag first, run docker tag spaceflights-batch:latest <ecr-image-uri> right before pushing.

How config reaches the job¶

Now that you have built the image, here is how your Step 3 configuration reaches each Batch container at runtime:

The wheel carries pipeline code. kedro package builds a .whl file from your pipeline code and records its dependencies as metadata; pip installs those dependencies when the image is built. The wheel does not include conf/.
The Dockerfile carries conf/. COPY conf/ places your conf/aws_batch/ settings at /app/conf inside the container.
The Batch job command selects the environment. AWSBatchRunner passes --env aws_batch --conf-source /app/conf --namespaces <name> (or --nodes <name> for nodes without a namespace) through containerOverrides.
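
The command each container receives can be sketched as a small function. It mirrors the description above rather than replacing the runner code in Step 4; container_command is a hypothetical helper name.

```python
from __future__ import annotations


def container_command(
    package_cli: str,
    group_name: str,
    group_type: str,
    conf_source: str | None = None,
    env: str = "aws_batch",
) -> list[str]:
    """Build the command passed through containerOverrides for one group."""
    command = [package_cli, "run", "--env", env]
    # Namespace groups run as a unit; ungrouped nodes run one at a time.
    flag = "--namespaces" if group_type == "namespace" else "--nodes"
    command += [flag, group_name]
    if conf_source:
        command += ["--conf-source", conf_source]
    return command


COMMAND = container_command(
    "spaceflights-batch", "data_science", "namespace", conf_source="/app/conf"
)
print(" ".join(COMMAND))
```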

Push the image to Amazon ECR¶

Follow the AWS guide for pushing a Docker image to an Amazon ECR repository:

aws ecr get-login-password --region <your-aws-region> | \
  docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.<your-aws-region>.amazonaws.com
docker push ${ECR_IMAGE}

Trim dependencies for production images

The Spaceflights starter includes Jupyter and Kedro-Viz, which increase image size. For production Batch images, consider a slimmer requirements subset or a multi-stage Dockerfile that omits development dependencies.

Re-push after catalog or pipeline changes

When you change conf/aws_batch/ or rebuild the wheel, repeat Step 6 and push a new image tag. Update the job definition Image field to <ecr-image-uri> if you created the definition before pushing.

Step 7: Submit the pipeline from your machine¶

Complete Step 6 first so your job definition points at the image in ECR.

From your project root (with AWS credentials configured), use your packaged project CLI and the custom runner:

spaceflights-batch run \
  --env aws_batch \
  --runner spaceflights_batch.runner.AWSBatchRunner

The AWSBatchRunner on your machine submits Batch jobs. Each job runs inside the container image and executes one namespace group (or a single node when no namespace is defined). For Spaceflights __default__, expect three jobs.

Track jobs in the AWS Batch console or with:

aws batch list-jobs --job-queue <batch-job-queue> --job-status RUNNING

Logs are available in CloudWatch Logs under /aws/batch/job.

Step 8: Verify the jobs succeeded¶

Check Batch job states. All jobs should reach SUCCEEDED. Learn how to check AWS Batch job status:

aws batch list-jobs --job-queue <batch-job-queue> --job-status SUCCEEDED

Check S3 outputs. List paths from your conf/aws_batch/catalog.yml. Learn how to list objects in Amazon S3:

aws s3 ls "s3://<your-bucket>/" --recursive

You should see objects under 02_intermediate/, 03_primary/, 04_feature/, 06_models/, and 08_reporting/.
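
To automate this check, you could compare the listed object keys with the expected prefixes. The sketch below takes a plain list of keys so it runs without AWS access; missing_prefixes and the sample keys are hypothetical.

```python
from __future__ import annotations

EXPECTED_PREFIXES = (
    "02_intermediate/",
    "03_primary/",
    "04_feature/",
    "06_models/",
    "08_reporting/",
)


def missing_prefixes(
    keys: list[str], expected: tuple[str, ...] = EXPECTED_PREFIXES
) -> list[str]:
    """Return the expected prefixes that have no object under them."""
    return [
        prefix
        for prefix in expected
        if not any(key.startswith(prefix) for key in keys)
    ]


# Sample keys. With credentials configured, real keys could be collected from
# boto3.client("s3").get_paginator("list_objects_v2").paginate(Bucket=...).
SAMPLE_KEYS = [
    "01_raw/companies.csv",
    "02_intermediate/preprocessed_shuttles.parquet",
    "03_primary/model_input_table.parquet",
    "04_feature/X_train.pickle",
]

MISSING_PREFIXES = missing_prefixes(SAMPLE_KEYS)
print(MISSING_PREFIXES)
```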

If jobs failed, see Troubleshooting.

Troubleshooting¶

Symptom	Cause	Fix
`Cannot install ... s3fs` dependency conflict	`kedro-viz` pins an older `s3fs` range	Use `s3fs>=2021.4` locally; omit Kedro-Viz from the Batch image for production
`AccessDenied` on S3 inside Batch jobs	Job role lacks bucket permissions	Attach an IAM policy scoped to `<your-bucket>` on the job role
`MemoryDataset` errors between jobs	Dataset missing from `conf/aws_batch/catalog.yml`	Add S3-backed entries for all shared datasets
Batch job times out mid-namespace	Namespace contains too much work for one job timeout	Split namespaces further, or increase job definition timeout/memory, or run heavy stages on Amazon EMR Serverless
`Dataset 'MatplotlibWriter' not found`	Outdated dataset type in `conf/base/catalog.yml`	Use `matplotlib.MatplotlibDataset` in base and aws_batch catalogs
S3 errors during `kedro run --env aws_batch` locally	Missing AWS CRT support in `botocore`	Run `pip install 'botocore[crt]'` and retry
Jobs stuck in `RUNNABLE`	Compute environment not scaled or no capacity	Check compute environment status; increase `maxvCpus` or instance types
`Essential container exited` immediately	Wrong `--conf-source` or missing `conf/` in image	Verify `COPY conf/ /app/conf/` in the Dockerfile and `conf_source: /app/conf` in parameters
`TypeError` for missing runner arguments (such as `job_queue`)	Built-in `kedro run` used instead of custom `cli.py`	Use `<PACKAGE_CLI> run` after adding the customised `cli.py` from Step 5
Job fails with out-of-memory	Default memory too low for sklearn/matplotlib nodes	Increase memory in the job definition (for example `4096` or `8192` MiB)

Limitations¶

Each Batch job runs one namespace group (or one node when no namespace is defined). A namespace must finish within the job definition timeout and memory limits.
Not for Spark: This pattern is for non-distributed Python stages. Run PySpark workloads on Amazon EMR Serverless instead.
Driver machine: The machine where you run AWSBatchRunner must stay online until all jobs complete. For managed orchestration of lightweight stages, consider AWS Step Functions.
Image lifecycle: When you change pipeline code, dependencies, or conf/aws_batch/, repeat Step 6, push to ECR, and update the job definition image URI before the next run.
Pipelines with dozens of nodes without namespaces increase Batch job count and ECR pulls. Add pipeline-level namespaces for coarser grouping.
Batch job dependencies use AWS Batch dependsOn. This matches Kedro's DAG but does not replace Kedro's own ThreadRunner parallelism on a single machine.
