
Welcome to Kedro’s documentation!
Introduction
Get started
Tutorial
Kedro project setup
Data Catalog
- The Data Catalog
- Using the Data Catalog within Kedro configuration
- Specifying the location of the dataset
- Data Catalog *_args parameters
- Using the Data Catalog with the YAML API (see the sketch after this section)
- Creating a Data Catalog YAML configuration file via CLI
- Adding parameters
- Feeding in credentials
- Loading multiple datasets that have similar configuration
- Transcoding datasets
- Transforming datasets
- Versioning datasets and ML models
- Using the Data Catalog with the Code API
- Kedro IO
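
To make the YAML and Code API entries above concrete, here is a minimal sketch of a single catalog entry. The dataset name bikes and its filepath are hypothetical; DataCatalog.from_config consumes the same dictionary structure that conf/base/catalog.yml declares in YAML.

```python
from kedro.io import DataCatalog

# The same structure that conf/base/catalog.yml declares in YAML.
# "bikes" and its filepath are hypothetical names for illustration.
config = {
    "bikes": {
        "type": "pandas.CSVDataSet",          # resolved to kedro.extras.datasets.pandas.CSVDataSet
        "filepath": "data/01_raw/bikes.csv",  # local path; could equally be s3://, gcs://, ...
    }
}

catalog = DataCatalog.from_config(config)
bikes = catalog.load("bikes")  # reads the CSV into a pandas DataFrame (the file must exist)
```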
Nodes and pipelines
Extend Kedro
- Common use cases
- Hooks
- Custom datasets
- Scenario
- Project setup
- The anatomy of a dataset
- Implement the _load method with fsspec
- Implement the _save method with fsspec
- Implement the _describe method
- The complete example (see the sketch after the Extend Kedro list)
- Integration with PartitionedDataSet
- Versioning
- Thread-safety
- How to handle credentials and different filesystems
- How to contribute a custom dataset implementation
- Kedro plugins
- Create a Kedro starter
- Dataset transformers (deprecated)
- Decorators (deprecated)
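
The custom-dataset entries above follow one pattern: subclass AbstractDataSet and route _load, _save and _describe through fsspec so a single class works across local and remote filesystems. A minimal sketch, assuming a hypothetical plain-text dataset (the class name TextDataSet is illustrative; the helpers live in kedro.io.core):

```python
from pathlib import PurePosixPath
from typing import Any, Dict

import fsspec

from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path


class TextDataSet(AbstractDataSet):
    """Hypothetical dataset that loads and saves plain text via fsspec."""

    def __init__(self, filepath: str):
        # Split e.g. "s3://bucket/file.txt" into protocol ("s3") and path.
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> str:
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path, mode="r") as f:
            return f.read()

    def _save(self, data: str) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="w") as f:
            f.write(data)

    def _describe(self) -> Dict[str, Any]:
        return dict(filepath=self._filepath, protocol=self._protocol)
```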
Logging
Development
Deployment
- Deployment guide
- Single-machine deployment
- Distributed deployment
- Deployment with Argo Workflows
- Deployment with Prefect
- Deployment with Kubeflow Pipelines
- Deployment with AWS Batch
- Deployment to a Databricks cluster
- How to integrate Amazon SageMaker into your Kedro pipeline
- How to deploy your Kedro pipeline with AWS Step Functions
- How to deploy your Kedro pipeline on Apache Airflow with Astronomer
Tools integration
- Build a Kedro pipeline with PySpark
- Centralise Spark configuration in conf/base/spark.yml
- Initialise a SparkSession in custom project context class
- Use Kedro’s built-in Spark datasets to load and save raw data
- Spark and Delta Lake interaction
- Use MemoryDataSet for intermediary DataFrame
- Use MemoryDataSet with copy_mode="assign" for non-DataFrame Spark objects (see the sketch after this list)
- Tips for maximising concurrency using ThreadRunner
- Use Kedro with IPython and Jupyter Notebooks/Lab
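
As a hint at the copy_mode="assign" entry above: Spark objects generally cannot be deep-copied, so a MemoryDataSet holding one must hand back the object itself rather than a copy. A minimal sketch with a hypothetical dataset name:

```python
from kedro.io import DataCatalog, MemoryDataSet

# "assign" stores and returns the same object instead of copying it,
# which is what non-copyable Spark objects need between nodes.
# The dataset name "spark_model_input" is hypothetical.
catalog = DataCatalog({"spark_model_input": MemoryDataSet(copy_mode="assign")})
```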
FAQs
- Frequently asked questions
- What is Kedro?
- Who maintains Kedro?
- What are the primary advantages of Kedro?
- How does Kedro compare to other projects?
- What is the data engineering convention?
- How do I upgrade Kedro?
- How can I use a development version of Kedro?
- How can I find out more about Kedro?
- How can I cite Kedro?
- How can I get my question answered?
- Kedro architecture overview
- Kedro Principles
Resources
API documentation
Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
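
For orientation before diving into the sections above, a minimal, self-contained sketch of the data-abstraction and pipeline-assembly ideas in code: a node, a pipeline assembled from it, a Data Catalog feeding it, and a runner executing it. All names are hypothetical.

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def count_words(text: str) -> int:
    return len(text.split())


# Assemble a one-node pipeline: "text" in, "n_words" out.
pipeline = Pipeline([node(count_words, inputs="text", outputs="n_words")])

# The catalog supplies the input dataset.
catalog = DataCatalog({"text": MemoryDataSet("hello kedro world")})

# Run it; outputs not registered in the catalog are returned in a dict.
print(SequentialRunner().run(pipeline, catalog))  # {'n_words': 3}
```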