Amazon EMR Serverless¶
Amazon EMR Serverless runs Apache Spark jobs without you managing clusters. EMR allocates resources per job and releases them when the job finishes. The sections below show how to deploy a Kedro project as a Spark job on EMR Serverless, with datasets stored on Amazon S3.
EMR Serverless suits pipelines with PySpark or other distributed Spark work. Single-node stages (pandas, scikit-learn, reporting) may fit better on AWS Step Functions or AWS Batch, or as separate jobs you orchestrate around Spark stages.
This guide targets Kedro 1.x (kedro>=1.0) and uses the Spaceflights PySpark starter as a worked example. Read the deployment strategy if you are deploying your own project and need guidance on EMR release choice, job scope, S3 storage and catalog configuration, and custom images.
Strategy¶
Read this section before you deploy your own project. It starts with an overview of the approach, then gives practical advice for adapting it to your pipelines.
Overview¶
This guide deploys a Kedro pipeline as an EMR Serverless Spark job backed by a custom container image and Amazon S3.
The approach in brief:
- Package the project with
kedro package. You get a wheel and aconf-<package_name>.tar.gzarchive. The wheel does not includeconf/. - Build a custom Docker image that installs Python >=3.10, your Kedro wheel, and the config archive. EMR base images ship with Python below Kedro's minimum, so a custom image is required.
- Push the image to ECR and create an EMR Serverless application that references it.
- Submit a Spark job with an S3-hosted entrypoint script. Pass Kedro CLI flags (
--env,--conf-source,--pipelines,--runner) asentryPointArguments.
flowchart LR
subgraph local [Your machine]
Pack[kedro package]
Build[docker build]
end
subgraph aws [AWS]
ECR[ECR]
App[EMR Serverless app]
Job[Spark job run]
S3[(S3 bucket)]
end
Pack --> Build
Build --> ECR
ECR --> App
App --> Job
Job --> S3
EMR Serverless ignores [CMD] and [ENTRYPOINT] in the Dockerfile. The job always starts from the S3 entryPoint script you pass at submission time.
Why use a custom image?¶
EMR Serverless supports installing Python and dependencies at job-submit time, but that approach is error-prone and hard to debug. A custom image for EMR Serverless packages Python, your Kedro wheel, and configuration into one immutable container you can test locally before pushing to Amazon Elastic Container Registry.
On managed EMR clusters, teams typically use bootstrap actions or custom AMIs instead. See the Kedro EMR blog post for an older virtual-environment approach on managed EMR.
AWS also documents a virtual-environment and venv-pack workflow for EMR Serverless. That path is easy to break when Python is symlinked (venv-pack does not bundle the interpreter). Kedro's older experiments with pyenv on top of it hit symlink errors such as Too many levels of symbolic links. A custom image avoids those packaging steps and keeps Python, dependencies, and your Kedro project in one testable artefact.
Choose EMR 7.x or 6.x¶
Use EMR 7.x for all new deployments. EMR base images ship with Python below Kedro's minimum (>=3.10), so you build a custom Docker image that installs Python >=3.10 and your Kedro project.
| EMR 7.x (recommended) | EMR 6.x (legacy) | |
|---|---|---|
| EMR release (example) | emr-7.13.0 |
emr-6.10.0 |
| Base image | public.ecr.aws/emr-serverless/spark/emr-7.13.0:latest |
public.ecr.aws/emr-serverless/spark/emr-6.10.0:latest |
| Base OS | Amazon Linux 2023 | Amazon Linux 2 |
| Python install | dnf (for example python3.12) |
Pre-built standalone Python (recommended) or pyenv |
| Spark Python path (example) | /usr/bin/python3.12 |
/opt/python/bin/python |
| Dockerfile | Dockerfile |
Dockerfile.emr6 |
| Wheel install | full dependencies OK | --no-deps + slim runtime |
Legacy: EMR 6.x
EMR 6.x defaults to Python ~3.7 and is in end of support or end of life. We do not recommend it for new deployments. See Legacy: EMR 6.x if you cannot use EMR 7.x.
Keep the EMR release label aligned everywhere: Dockerfile FROM line, EMR Serverless application release, validate-image -r flag, and job submission must all match (for example emr-7.13.0).
Choose how to run pipelines on EMR¶
Unlike Step Functions, EMR Serverless does not automatically wire pipeline dependency order. You choose how much work each job runs and how you orchestrate multiple jobs.
| Approach | Pros | Cons | When to use |
|---|---|---|---|
One job, full pipeline (__default__) |
Simplest submission | Longer job; harder to retry one stage | Small PySpark pipelines |
One job per pipeline (--pipelines=<name>) |
Isolated retries and scaling | Multiple submissions; you orchestrate order | Medium pipelines with mixed Spark and pandas stages |
| Offload orchestration to Step Functions or Airflow | Handles dependency order across jobs | Extra infrastructure to set up and operate | Multi-pipeline production workflows |
When pipelines depend on each other, submit jobs in dependency order: run upstream pipelines before downstream ones that read their S3 outputs. For automatic ordering across stages that fit within Lambda limits, combine AWS Step Functions (or AWS Batch) with EMR Serverless for Spark-heavy work. The same dependency rules apply in distributed Kedro runs and in grouping nodes for deployment.
Plan storage and configuration¶
List every dataset your pipeline reads or writes on EMR in conf/emr/catalog.yml (or your chosen environment) with full S3 paths:
- Use
s3://for pandas datasets (requiress3fsinpyproject.toml). - Use
s3a://forspark.SparkDatasetV2paths so Spark uses the Hadoop S3 filesystem bundled with EMR.
By default, Kedro merges configuration environments at the top level. If conf/emr/catalog.yml overrides a dataset using filepath alone, it replaces the entire dataset entry from conf/base/ and drops keys such as type. Either include the full dataset definition (including type) in conf/emr/catalog.yml, or set merge_strategy: {catalog: soft} in settings.py so environment files can override individual fields.
At job submission, pass --conf-source /home/hadoop/conf. The Dockerfile copies the packaged config archive there because the wheel does not include conf/. The Spark driver's working directory is often /tmp/spark-.../, so a relative config path will fail without --conf-source.
Configure before you deploy¶
- Create
conf/emr/(or your environment name) with full S3 catalog entries for every dataset the job touches - Update project dependencies in Step 3 before you package the image
- Keep
conf/base/catalog.ymlon local paths for local development; use the EMR environment for cloud runs - Run
kedro run --env emrlocally before building the image to catch catalog and dependency errors early - Decide
--pipelinesscope for each job (full pipeline vs one pipeline per submission) - For slim EMR 6.x images, use the
KEDRO_PIPELINES_TO_FINDpattern so Kedro does not import pipelines missing from the image - On Apple Silicon, build with
--platform linux/amd64because EMR Serverless applications usex86_64
Working example¶
Prerequisites¶
These apply to the step-by-step guide below. This guide builds and deploys from your machine with Kedro, Docker, and the AWS CLI. You use the AWS Management Console to inspect the EMR application and job runs after submission, but you cannot complete the guide with the console alone.
| You need | Used for |
|---|---|
A Kedro project (requires-python = ">=3.10" in pyproject.toml) and Python >=3.10 locally |
Packaging the project, local test runs, and matching the Python version in the custom image |
Docker (Podman also works if you have a docker-compatible CLI) |
Building the custom EMR container image |
| AWS CLI configured for your target region | Creating S3 and ECR resources, uploading data, pushing the image, and submitting jobs |
| An AWS account with permissions for EMR Serverless, S3, ECR, and IAM | Creating and running the deployed resources |
The steps that follow deploy the Spaceflights PySpark starter end to end. Create the project with:
kedro new -s spaceflights-pyspark -n spaceflights_emr
If you are new to the project layout, complete the Spaceflights tutorial. Kedro requires Python >=3.10. This guide uses Python 3.12 on EMR 7.13.0; install the same Python version locally. If you use your own Kedro project, replace the placeholders below and follow the same steps.
Placeholders used in this guide¶
Replace these before building and submitting jobs:
| Placeholder | Example |
|---|---|
<PACKAGE_WHEEL_NAME> |
spaceflights_emr-0.1-py3-none-any.whl |
<PACKAGE_CONF_ARCHIVE> |
conf-spaceflights_emr.tar.gz |
<PACKAGE_NAME> |
spaceflights_emr |
<your-bucket> |
my-kedro-emr-bucket |
<your-aws-region> |
us-east-1 |
<ecr-image-uri> |
123456789012.dkr.ecr.us-east-1.amazonaws.com/kedro-emr-7:latest |
<emr-conf> |
emr |
<kedro-pipeline-name> |
data_processing |
<s3-path-to-entrypoint-script> |
s3://my-bucket/scripts/entrypoint.py |
<application-id> |
from EMR Serverless console or CLI |
<execution-role-arn> |
IAM job runtime role ARN |
What you will do¶
- Prepare your Kedro project
- Set up AWS
- Configure Kedro for EMR
- Package the Kedro project
- Build the custom Docker image
- Validate and push the image to ECR
- Create the EMR Serverless application
- Create and upload the entrypoint script
- Submit your first job
- Verify the job succeeded
If your organisation requires EMR 6.x, follow the same steps for EMR 7.x, then apply the differences in Legacy: EMR 6.x.
Step 1: Prepare your Kedro project¶
From the project root, install dependencies and run the pipeline locally:
pip install -e .
pip install "kedro-datasets[spark-local]"
kedro run
The PySpark starter does not include a local Spark runtime by default. Install kedro-datasets[spark-local] so kedro run works on your machine before you submit to EMR. For more detail, read Get started with the PySpark starter.
Keep conf/base/catalog.yml on local file paths for local development. You add S3 paths in a separate environment in Step 3. For release choice, pipeline grouping, and storage planning, see Strategy.
Step 2: Set up AWS¶
Complete this setup for each AWS account and region (or per EMR release line if you run both 7.x and 6.x). Follow the linked AWS guides for console and CLI steps. This section lists what you need and Kedro-specific settings.
| Resource | AWS documentation | What you need for Kedro |
|---|---|---|
| S3 bucket | Creating a bucket | Store raw and output data (s3://<your-bucket>/data/...) and the entrypoint script (s3://<your-bucket>/scripts/entrypoint.py) |
| ECR repository | Create a private repository | One private repo per EMR release line (for example kedro-emr-7 and kedro-emr-6) |
| Job runtime role | Getting started with EMR Serverless | IAM role with access to your S3 bucket; note the ARN as <execution-role-arn> |
| ECR repository policy | Custom images for EMR Serverless | Lets EMR Serverless pull your custom image |
ECR policy gotchas
When applying the repository policy with aws ecr set-repository-policy, use the AWS template but:
- Use
"ArnLike"(not"StringLike") in the condition. - Include
ecr:DescribeImagesalongside the other actions. - Set
aws:SourceArntoarn:aws:emr-serverless:<region>:<account-id>:/applications/*. - Omit the
"Resource"field. The policy is already scoped to the repository.
Apply the ECR repository policy before starting jobs, or image pull will fail.
Upload raw data to S3¶
Upload input data before you submit the job. Follow the AWS guide for uploading objects to S3:
aws s3 sync data/01_raw/ s3://<your-bucket>/data/01_raw/
shuttles.xlsx may be missing locally
The starter gitignores data/01_raw/shuttles.xlsx. Copy it from the Spaceflights PySpark starter repository on GitHub if kedro new did not place it in your project.
Step 3: Configure Kedro for EMR¶
Create an EMR config environment¶
Add conf/emr/ (or another environment name) with S3 paths for datasets your pipeline reads and writes. Pass --env emr when you submit the job.
Catalog environment merge is destructive
By default, Kedro merges configuration environments at the top level. If conf/emr/catalog.yml overrides a dataset using filepath alone, it replaces the entire dataset entry from conf/base/ and drops keys such as type. Either include the full dataset definition (including type) in conf/emr/catalog.yml, or set merge_strategy: {catalog: soft} in settings.py so environment files can override individual fields.
Example conf/emr/catalog.yml (full entries):
companies:
type: pandas.CSVDataset
filepath: s3://<your-bucket>/data/01_raw/companies.csv
preprocessed_companies:
type: spark.SparkDatasetV2
filepath: s3a://<your-bucket>/data/02_intermediate/preprocessed_companies.csv
file_format: csv
save_args:
header: True
mode: overwrite
- Use
s3://for pandas datasets (withs3fs). - Use
s3a://forspark.SparkDatasetV2paths so Spark uses the Hadoop S3 filesystem bundled with EMR.
Add project dependencies¶
Update pyproject.toml before you package the image:
dependencies = [
# ...
"s3fs>=2024.0",
"hdfs",
]
Dependencies for EMR
- Add
s3fswhen pandas datasets uses3://paths. - Add
hdfswhen you use Spark datasets (SparkDataset/SparkDatasetV2import it at load time). - Do not put spaces inside
kedro-datasets[...]extra names (for examplekedro-datasets[pandas-csvdataset,spark-sparkdataset], notpandas-csvdataset, spark-sparkdataset). Some dependency resolvers treat spaced names as invalid.
Verify the EMR environment locally¶
Run the pipeline locally with the EMR config before you package and build the image:
kedro run --env emr
If S3 access fails locally, install botocore[crt] or check your AWS credentials. Fix catalog errors here before moving on.
Step 4: Package the Kedro project¶
Run this in your project root. Repeat whenever you change pipeline code or dependencies:
kedro package
This creates a .whl file and conf-<package_name>.tar.gz in dist/ (for example conf-spaceflights_emr.tar.gz). For more detail, read Package a Kedro project.
Step 5: Build the custom Docker image¶
Create a Dockerfile in your project root. This example targets EMR 7.13.0 with Python 3.12:
FROM public.ecr.aws/emr-serverless/spark/emr-7.13.0:latest AS base
USER root
RUN dnf install -y python3.12 python3.12-pip
ENV SPARK_HOME=/usr/lib/spark
ENV KEDRO_PACKAGE=<PACKAGE_WHEEL_NAME>
ENV KEDRO_CONF=<PACKAGE_CONF_ARCHIVE>
COPY dist/$KEDRO_PACKAGE /tmp/dist/$KEDRO_PACKAGE
RUN python3.12 -m pip install --upgrade pip && \
python3.12 -m pip install /tmp/dist/$KEDRO_PACKAGE && \
rm -f /tmp/dist/$KEDRO_PACKAGE
# Custom Python cannot import EMR's bundled PySpark without this path. spark-submit
# overrides ENV PYTHONPATH, so a .pth file in site-packages is the reliable fix.
RUN PY4J_ZIP=$(ls ${SPARK_HOME}/python/lib/py4j-*-src.zip) && \
SITE_PACKAGES=$(python3.12 -c "import site; print(site.getsitepackages()[0])") && \
printf '%s\n%s\n' "${SPARK_HOME}/python" "${PY4J_ZIP}" > "${SITE_PACKAGES}/pyspark.pth"
ADD --chown=hadoop:hadoop dist/$KEDRO_CONF /home/hadoop/
RUN chmod -R a+rX /home/hadoop/conf
USER hadoop:hadoop
Apple Silicon (ARM) builders
EMR Serverless applications use x86_64. Build with --platform linux/amd64 (Docker or Podman) or the job may fail with Custom image architecture doesn't match application architecture.
Build the image. Tag it with your ECR URI at build time so the image you push is the one you built:
export ECR_IMAGE=<ecr-image-uri>
kedro package
docker build --platform linux/amd64 -t ${ECR_IMAGE} .
If you build with a local tag (for example kedro-emr-7), run docker tag kedro-emr-7:latest <ecr-image-uri> right before pushing. Pushing an older ECR-tagged image by mistake is a common source of stale-configuration failures.
How config reaches the cluster¶
Now that you have built the image, here is how your Step 3 configuration reaches the Spark driver at runtime:
- The wheel carries pipeline code.
kedro packagebundles your pipeline code and dependencies into a.whlfile. It does not includeconf/. - The
Dockerfilecarriesconf/.ADD dist/$KEDRO_CONF /home/hadoop/unpacks yourconf/<emr-conf>/settings at/home/hadoop/confinside the container image. - Job submission selects the
emrenvironment.entryPointArgumentspass--env <emr-conf>and--conf-source /home/hadoop/confto your packaged project's__main__module, so the job loadsconf/<emr-conf>/catalog.ymlat runtime regardless of the Spark driver's working directory (often/tmp/spark-.../).
Step 6: Validate and push the image to ECR¶
Verify locally before pushing¶
Inspect the ECR-tagged image (avoid relying on a localhost/... tag alone):
docker run --rm --user hadoop --entrypoint head ${ECR_IMAGE} \
-n 6 /home/hadoop/conf/<emr-conf>/catalog.yml
docker run --rm --user hadoop --entrypoint python3.12 ${ECR_IMAGE} \
-c "import s3fs; print(s3fs.__version__)"
docker run --rm --user hadoop --entrypoint python3.12 ${ECR_IMAGE} \
-c "import pyspark; print(pyspark.__version__)"
Push to ECR¶
Follow the AWS guide for pushing a Docker image to an Amazon ECR repository. Authenticate and push ${ECR_IMAGE}, the URI you used at build time:
aws ecr get-login-password --region <your-aws-region> | \
docker login --username AWS --password-stdin <ecr-registry>
docker push ${ECR_IMAGE}
Optional: validate with the EMR Image CLI¶
EMR Serverless provides a utility to check Dockerfile structure before deployment:
pip install docker>=7.0.0
pip install https://github.com/awslabs/amazon-emr-serverless-image-cli/releases/download/v0.0.1/amazon_emr_serverless_image_cli-0.0.1-py3-none-any.whl
amazon-emr-serverless-image validate-image -i ${ECR_IMAGE} -r emr-7.13.0 -t spark
Note
The Image CLI release manifest may not list every EMR 7.x release. A successful EMR job run remains the definitive test.
Step 7: Create the EMR Serverless application¶
Create one application per EMR release line (for example a separate app for emr-7.13.0 and emr-6.10.0). The release label and custom image URI must match your Dockerfile FROM line.
Follow the AWS guides to create and start the application:
When creating the application, set these Kedro-specific values:
| Setting | Value |
|---|---|
| Type | Spark |
| Release | emr-7.13.0 (must match your Dockerfile) |
| Architecture | x86_64 |
| Custom image | Enable custom image; use your <ecr-image-uri> for driver and executors |
Note the Application ID as <application-id>. The application must be in STARTED state before you submit jobs.
Example CLI workflow. Follow the AWS guides for creating an EMR Serverless application and starting an application for full options:
export AWS_REGION=<your-aws-region>
export ECR_IMAGE=<ecr-image-uri>
APP_ID=$(aws emr-serverless create-application \
--name kedro-emr-7-test \
--release-label emr-7.13.0 \
--type SPARK \
--architecture X86_64 \
--image-configuration "{\"imageUri\":\"${ECR_IMAGE}\"}" \
--region "${AWS_REGION}" \
--query 'applicationId' --output text)
echo "Application ID: ${APP_ID}"
aws emr-serverless start-application \
--application-id "${APP_ID}" \
--region "${AWS_REGION}"
Wait until state is STARTED:
aws emr-serverless get-application \
--application-id <application-id> \
--region <your-aws-region> \
--query 'application.state' --output text
Restart after image updates
When you rebuild and push a new image to ECR, stop and start the application before submitting another job. A running application may keep warm workers referencing the previous image digest. Read about pre-initialised capacity, stopping an application, and starting an application:
aws emr-serverless stop-application \
--application-id <application-id> \
--region <your-aws-region>
aws emr-serverless start-application \
--application-id <application-id> \
--region <your-aws-region>
Step 8: Create and upload the entrypoint script¶
EMR Serverless ignores [CMD] and [ENTRYPOINT] in the Dockerfile. Package the project into a wheel, install it in the image, and invoke it from an S3-hosted script. We recommend calling your packaged project's __main__ module directly rather than shelling out to kedro run with subprocess.
Create entrypoint.py:
import sys
from <PACKAGE_NAME>.__main__ import main
main(sys.argv[1:])
Upload it to S3. Follow the AWS guide for uploading objects to S3:
aws s3 cp entrypoint.py s3://<your-bucket>/scripts/entrypoint.py
The entryPointArguments you pass at job submission are forwarded to main() as CLI arguments (for example --env, --conf-source, --pipelines, --runner).
Optional: invoke with KedroSession¶
You can run Kedro with KedroSession inside your Spark entrypoint.py instead of calling your packaged project's __main__ module. You still package the project and install the wheel in the custom image; the session API is an alternative way to invoke the pipeline.
| CLI flag | KedroSession API |
Method |
|---|---|---|
--env |
env |
KedroSession.create() |
--params |
runtime_params |
KedroSession.create() |
--conf-source |
conf_source |
KedroSession.create() |
--pipelines |
pipeline_names (a list of strings) |
session.run() |
--nodes |
node_names |
session.run() |
--runner |
runner (an AbstractRunner instance, for example ThreadRunner()) |
session.run() |
--tags |
tags |
session.run() |
--from-nodes / --to-nodes |
from_nodes / to_nodes |
session.run() |
--load-versions |
load_versions |
session.run() |
In a packaged EMR deployment, call configure_project() (not bootstrap_project()) before creating the session:
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession
from kedro.runner import ThreadRunner
configure_project("<PACKAGE_NAME>")
with KedroSession.create(env="<emr-conf>", conf_source="/home/hadoop/conf") as session:
session.run(
pipeline_names=["<kedro-pipeline-name>"],
runner=ThreadRunner(),
)
Upload this script to S3 and reference it as the Spark entryPoint.
Use the __main__ entrypoint when you want the same interface as python -m <package_name> or kedro run at the command line. Use KedroSession when you prefer explicit Python control over run options.
Step 9: Submit your first job¶
Follow the AWS guide for submitting Spark jobs to EMR Serverless. Use your <application-id>, <execution-role-arn>, and an S3 entryPoint pointing at entrypoint.py.
The Kedro-specific job driver settings below pass CLI arguments to your packaged project and set the custom Python interpreter for Spark:
aws emr-serverless start-job-run \
--application-id <application-id> \
--execution-role-arn <execution-role-arn> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "<s3-path-to-entrypoint-script>",
"entryPointArguments": ["--env", "<emr-conf>", "--conf-source", "/home/hadoop/conf", "--runner", "ThreadRunner", "--pipelines", "<kedro-pipeline-name>"],
"sparkSubmitParameters": "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/usr/bin/python3.12 --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/usr/bin/python3.12 --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3.12"
}
}'
| Job driver field | Purpose |
|---|---|
entryPoint |
S3 URI to entrypoint.py |
entryPointArguments |
Kedro CLI flags (--env, --conf-source, --pipelines, --runner) |
sparkSubmitParameters |
Tells Spark which Python interpreter to use on driver and executors |
Kedro version and --pipelines
The --pipelines flag runs one or more named pipelines in a single job. It requires Kedro 1.2.0+.
On Kedro before 1.2.0, use --pipeline (singular) to run one pipeline. To run __default__, omit both flags.
Read the AWS guide for viewing EMR Serverless job runs for logging and status checks.
Step 10: Verify the job succeeded¶
- Check job state. Confirm the job reached SUCCESS. Read the AWS guide for viewing EMR Serverless job runs:
aws emr-serverless get-job-run \
--application-id <application-id> \
--job-run-id <job-run-id> \
--region <your-aws-region> \
--query 'jobRun.state' --output text
- Check S3 outputs. List the output paths from your
conf/emr/catalog.yml. Follow the AWS guide for listing objects in S3:
aws s3 ls s3://<your-bucket>/data/02_intermediate/
aws s3 ls s3://<your-bucket>/data/03_primary/
If the job failed, see Troubleshooting.
Legacy: EMR 6.x¶
Warning
This section is for teams that cannot migrate to EMR 7.x. EMR 6.x is built on Amazon Linux 2 and ships with Python ~3.7 by default. Custom images require EMR 6.9.0 or later.
Follow Steps 1–4 as for EMR 7.x, then apply the differences below for Steps 5–9.
Additional Kedro configuration for slim images¶
EMR 6.x images typically install the wheel with --no-deps and the runtime packages the job needs. That means optional starter dependencies (for example matplotlib, scikit-learn) are not in the image.
Kedro's default find_pipelines() imports every pipeline package at startup, even when you pass --pipelines data_processing. To avoid import errors:
- Use this
pipeline_registry.pypattern:
import os
from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline
def register_pipelines() -> dict[str, Pipeline]:
pipelines_to_find = os.getenv("KEDRO_PIPELINES_TO_FIND")
if pipelines_to_find:
names = [name.strip() for name in pipelines_to_find.split(",") if name.strip()]
pipelines = find_pipelines(pipelines_to_find=names, raise_errors=True)
else:
pipelines = find_pipelines(raise_errors=True)
pipelines["__default__"] = sum(pipelines.values())
return pipelines
When KEDRO_PIPELINES_TO_FIND is unset (local development), all pipelines are registered as usual.
-
Move datasets for pipelines you are not running into
conf/local/catalog.yml. Kedro mergesconf/base/withconf/emr/when you pass--env emr; entries left inconf/base/are still loaded and validated. -
Set
KEDRO_PIPELINES_TO_FINDinsparkSubmitParametersat job submission (see below).
Step 5 (EMR 6.x): build the custom Docker image¶
Save as Dockerfile.emr6 and pass -f Dockerfile.emr6 at build time.
Uses emr-6.10.0, Python 3.10.16, and Amazon Linux 2.
Prefer pre-built Python on EMR 6.x
Compiling Python with pyenv inside a linux/amd64 image on Apple Silicon can take 1–2+ hours. The Dockerfile below uses a python-build-standalone release instead. See Alternative: pyenv on EMR 6.x if you must follow the AWS pyenv samples.
Match Python to EMR 6.x Spark
EMR 6.10 ships Spark 3.3, which is not compatible with Python 3.12 or pandas 2.x+ for createDataFrame(pandas_df). Use Python 3.10 and pin pandas>=1.5,<2.0 in the EMR 6.x image. Prefer EMR 7.x if you need Python 3.12 and modern pandas.
FROM public.ecr.aws/emr-serverless/spark/emr-6.10.0:latest AS base
USER root
ARG PYTHON_STANDALONE_TAG=20241219
ARG PYTHON_VERSION=3.10.16
RUN yum install -y tar curl gcc gcc-c++ && \
curl -fsSL -o /tmp/python.tar.gz \
"https://github.com/astral-sh/python-build-standalone/releases/download/${PYTHON_STANDALONE_TAG}/cpython-${PYTHON_VERSION}+${PYTHON_STANDALONE_TAG}-x86_64-unknown-linux-gnu-install_only.tar.gz" && \
mkdir -p /opt/python && \
tar -xzf /tmp/python.tar.gz -C /opt/python --strip-components=1 && \
rm -f /tmp/python.tar.gz
ENV PATH=/opt/python/bin:$PATH
ENV SPARK_HOME=/usr/lib/spark
ENV PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip
ENV KEDRO_PACKAGE=<PACKAGE_WHEEL_NAME>
ENV KEDRO_CONF=<PACKAGE_CONF_ARCHIVE>
COPY dist/$KEDRO_PACKAGE /tmp/dist/$KEDRO_PACKAGE
RUN python -m pip install --upgrade pip && \
python -m pip install --no-deps /tmp/dist/$KEDRO_PACKAGE && \
python -m pip install --prefer-binary \
"kedro~=1.4.0" \
"kedro-datasets[pandas-csvdataset,pandas-exceldataset,spark-sparkdataset]>=9.1" \
"s3fs>=2024.0" \
"hdfs" \
"pandas>=1.5,<2.0" \
"numpy<2" && \
rm -f /tmp/dist/$KEDRO_PACKAGE
ADD --chown=hadoop:hadoop dist/$KEDRO_CONF /home/hadoop/
RUN chmod -R a+rX /home/hadoop/conf
USER hadoop:hadoop
Why not install the full project dependency set on EMR 6.x?
The EMR 7.x Dockerfile can install the packaged wheel with all dependencies from pyproject.toml. On EMR 6.x, install the wheel with --no-deps and add the runtime packages the job needs with --prefer-binary, because:
- PySpark is provided by EMR. Do not
pip install pysparkin the image. - Amazon Linux 2 (GCC 7.3) cannot compile some packages (for example scikit-learn 1.8+) from source when wheels are unavailable.
- Spark 3.3 on EMR 6.x requires Python 3.10 and
pandas>=1.5,<2.0, which may differ from your localpyproject.toml. - Dev-only packages (Jupyter, Kedro-Viz, and so on) are not needed on the cluster.
hdfsis listed in the runtimepip installbecause the wheel is installed with--no-depsand Spark datasets import it at load time.
Alternatively, maintain a project-owned requirements-emr.txt subset and pip install -r requirements-emr.txt after the --no-deps wheel install.
Set PYTHONPATH so the standalone Python interpreter can import PySpark from /usr/lib/spark/python (verify the py4j zip name with ls /usr/lib/spark/python/lib/py4j-*-src.zip in the base image).
Build:
export ECR_IMAGE=<ecr-image-uri> # use a separate repo, e.g. kedro-emr-6
kedro package
docker build --platform linux/amd64 -f Dockerfile.emr6 -t ${ECR_IMAGE} .
Step 6 (EMR 6.x): validate and push¶
docker run --rm --user hadoop --entrypoint head ${ECR_IMAGE} \
-n 8 /home/hadoop/conf/<emr-conf>/catalog.yml
docker run --rm --user hadoop --entrypoint /opt/python/bin/python ${ECR_IMAGE} \
-c "import sys, pandas, pyspark, s3fs; print(sys.version.split()[0], pandas.__version__, pyspark.__version__)"
Expect Python 3.10.x, pandas 1.5.x, pyspark 3.3.x.
Follow the AWS guide for pushing a Docker image to an Amazon ECR repository:
aws ecr get-login-password --region <your-aws-region> | \
docker login --username AWS --password-stdin <ecr-registry>
docker push ${ECR_IMAGE}
Optional Image CLI validation:
amazon-emr-serverless-image validate-image -i ${ECR_IMAGE} -r emr-6.10.0 -t spark
Step 7 (EMR 6.x): create the application¶
Same as Step 7, but set Release to emr-6.10.0 and use your kedro-emr-6 image URI. Follow the AWS guides for creating an EMR Serverless application and starting an application.
export ECR_IMAGE=<ecr-image-uri>
APP_ID=$(aws emr-serverless create-application \
--name kedro-emr-6-test \
--release-label emr-6.10.0 \
--type SPARK \
--architecture X86_64 \
--image-configuration "{\"imageUri\":\"${ECR_IMAGE}\"}" \
--region <your-aws-region> \
--query applicationId --output text)
aws emr-serverless start-application --application-id ${APP_ID} --region <your-aws-region>
Step 8 is unchanged (same entrypoint script).
Step 9 (EMR 6.x): submit a job¶
Follow the AWS guide for submitting Spark jobs to EMR Serverless. Use /opt/python/bin/python and pass extra environment variables for pipeline filtering and PySpark imports:
aws emr-serverless start-job-run \
--application-id <application-id> \
--execution-role-arn <execution-role-arn> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "<s3-path-to-entrypoint-script>",
"entryPointArguments": ["--env", "<emr-conf>", "--conf-source", "/home/hadoop/conf", "--runner", "ThreadRunner", "--pipelines", "<kedro-pipeline-name>"],
"sparkSubmitParameters": "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=/opt/python/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=/opt/python/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=/opt/python/bin/python --conf spark.emr-serverless.driverEnv.KEDRO_PIPELINES_TO_FIND=data_processing --conf spark.emr-serverless.driverEnv.PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip --conf spark.executorEnv.PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip"
}
}'
If you change PYTHON_VERSION in the Dockerfile, keep the Spark Python paths in sparkSubmitParameters aligned with the interpreter inside the image.
EMR 7.x vs 6.x comparison¶
| EMR 7.x | EMR 6.x | |
|---|---|---|
| Release | emr-7.13.0 |
emr-6.10.0 |
| Python | 3.12 with dnf |
3.10.16 standalone |
| Dockerfile | Dockerfile |
Dockerfile.emr6 |
| Wheel install | full dependencies OK | --no-deps + slim runtime |
| Spark Python | /usr/bin/python3.12 + pyspark.pth in site-packages |
/opt/python/bin/python + PYTHONPATH |
| Extra environment variables | none (PySpark path in Dockerfile) | KEDRO_PIPELINES_TO_FIND, PYTHONPATH |
Alternative: pyenv on EMR 6.x¶
Use this path if you need to follow the AWS custom Python version samples. On Apple Silicon, build on an x86_64 host or in CI if possible.
FROM public.ecr.aws/emr-serverless/spark/emr-6.10.0:latest AS base
USER root
RUN yum install -y gcc gcc-c++ make patch zlib-devel bzip2 bzip2-devel readline-devel \
sqlite sqlite-devel openssl11-devel tk-devel libffi-devel xz-devel tar git curl
ENV PYENV_ROOT=/usr/.pyenv
ENV PATH=$PYENV_ROOT/shims:$PYENV_ROOT/bin:$PATH
ENV PYTHON_VERSION=3.10.16
RUN curl https://pyenv.run | bash && \
pyenv install -v ${PYTHON_VERSION} && \
pyenv global ${PYTHON_VERSION}
ENV SPARK_HOME=/usr/lib/spark
ENV PYTHONPATH=/usr/lib/spark/python:/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip
ENV KEDRO_PACKAGE=<PACKAGE_WHEEL_NAME>
ENV KEDRO_CONF=<PACKAGE_CONF_ARCHIVE>
COPY dist/$KEDRO_PACKAGE /tmp/dist/$KEDRO_PACKAGE
RUN python -m pip install --upgrade pip && \
python -m pip install --no-deps /tmp/dist/$KEDRO_PACKAGE && \
python -m pip install --prefer-binary \
"kedro~=1.4.0" \
"kedro-datasets[pandas-csvdataset,pandas-exceldataset,spark-sparkdataset]>=9.1" \
"s3fs>=2024.0" \
"hdfs" \
"pandas>=1.5,<2.0" \
"numpy<2" && \
rm -f /tmp/dist/$KEDRO_PACKAGE
ADD --chown=hadoop:hadoop dist/$KEDRO_CONF /home/hadoop/
RUN chmod -R a+rX /home/hadoop/conf
USER hadoop:hadoop
For pyenv, set sparkSubmitParameters to /usr/.pyenv/versions/3.10.16/bin/python and include the same KEDRO_PIPELINES_TO_FIND and PYTHONPATH settings as the recommended EMR 6.x job submission example above.
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
Custom image architecture doesn't match application architecture |
Image built for ARM on Apple Silicon | Rebuild with --platform linux/amd64 |
Permission denied on /home/hadoop/conf/... |
Config copied as root without read permissions for hadoop |
Use ADD --chown=hadoop:hadoop and chmod -R a+rX /home/hadoop/conf in the Dockerfile |
'type' is missing from dataset catalog configuration |
conf/<emr-conf>/catalog.yml overrides a dataset using filepath alone, or an old ECR image is still running |
Add full dataset entries (including type) or enable soft catalog merge; tag and push the new image, then restart the application |
s3fs not installed |
Pandas datasets use s3:// paths |
See Step 3 |
ModuleNotFoundError: No module named 'hdfs' |
SparkDataset / SparkDatasetV2 imports hdfs at load time |
See Step 3 |
ModuleNotFoundError: No module named 'spark' or pyspark on EMR 7.x |
Custom Python in the image cannot see EMR's bundled PySpark; ENV PYTHONPATH is overridden by spark-submit |
Add the pyspark.pth step to the EMR 7.x Dockerfile in Step 5; verify with import pyspark in Step 6 |
No such option '--pipelines' |
Packaged Kedro is older than 1.2.0 | Upgrade to Kedro 1.2.0+, or use --pipeline (singular) for one pipeline, or omit the flag to run __default__ |
BatchGetImage / ECR access denied |
ECR repository policy missing or incorrect | Apply the AWS policy template with ArnLike and ecr:DescribeImages; omit "Resource" when applying with aws ecr set-repository-policy |
| Job uses stale configuration after a rebuild | EMR application still holds warm workers on the old image digest | Stop and start the application after each ECR push |
Installing Python-3.12.8... hangs during EMR 6.x image build |
pyenv is compiling Python from source under linux/amd64 emulation (common on Apple Silicon) |
Use the pre-built Python Dockerfile, build on an x86_64 host/CI, or run pyenv install -v and expect a long wait |
NumPy/scikit-learn build fails with Unknown compiler(s): ... g++ on EMR 6.x |
C++ compiler missing from the image | Add gcc-c++ to the yum install line; install the wheel with --no-deps and use --prefer-binary for runtime packages |
scikit-learn requires GCC >= 8.0 during EMR 6.x image build |
pip is compiling scikit-learn from source on Amazon Linux 2 (GCC 7.3) | Install the wheel with --no-deps and add the runtime packages you need with --prefer-binary; omit pyspark (provided by EMR) |
No matching distribution found for pyspark with --only-binary :all: |
PySpark is not available under a wheels-only install constraint | Do not add pyspark to the Dockerfile; use the Spark/pyspark bundled in the EMR image |
ModuleNotFoundError: No module named 'matplotlib' (or plotly, sklearn) when running a subset of pipelines |
Kedro imports all pipeline modules at startup; slim EMR images omit optional starter dependencies | Use the pipeline_registry.py pattern and set KEDRO_PIPELINES_TO_FIND in sparkSubmitParameters, or add the missing packages to the Dockerfile runtime install |
DatasetError: No module named 'plotly' (or other dataset backend) with a slim image |
conf/base/catalog.yml still lists datasets merged into --env emr |
Move non-EMR datasets to conf/local/ (or another environment), or install the matching kedro-datasets extras in the image |
DatasetError: ... Failed while saving data for spark.SparkDatasetV2 on EMR 6.x with custom Python |
Standalone Python cannot import EMR's bundled PySpark, or Python/pandas versions are incompatible with Spark 3.3 | Set SPARK_HOME and PYTHONPATH in the Dockerfile; use Python 3.10 and pandas>=1.5,<2.0 on EMR 6.x |
Spark save fails with schema inference errors after createDataFrame(pandas_df) |
Null/NaN values in string columns can break Spark schema inference on EMR 6.x | Fill or cast nulls in object columns before converting to Spark (for example df.fillna({"col": ""}) for string fields) |
To confirm which config is inside the image EMR will run:
docker run --rm --user hadoop --entrypoint head <ecr-image-uri> \
-n 8 /home/hadoop/conf/<emr-conf>/catalog.yml
Limitations¶
- EMR Serverless runs Spark jobs. It is not a general-purpose Python batch platform for non-Spark Kedro pipelines.
- EMR Serverless does not orchestrate multi-pipeline dependency order. Submit jobs in dependency order yourself, or use AWS Step Functions or Airflow.
- Each deployment uses a custom container image. After you change code or dependencies, rebuild the image, push to ECR, and restart the application before the next job run.
- Jobs are subject to EMR Serverless service quotas and Spark limits on memory, vCPU, and runtime.
- Mixed Spark and pandas in one job is supported, but Python and dependency versions must stay compatible with the EMR release (see Legacy: EMR 6.x for EMR 6.x constraints).
Further reading¶
Kedro¶
- Learn how to run Kedro in a distributed environment
- Learn how to group nodes for deployment
- Learn how to package a Kedro project
- Learn how to run a packaged Kedro project
- Learn how to manage Kedro sessions and lifecycle