The Data Catalog

This section introduces catalog.yml, the project-shareable Data Catalog. The file is located in conf/base and is a registry of all data sources available for use by a project; it manages loading and saving of data.

All supported data connectors are available in kedro.extras.datasets.

Use the Data Catalog within Kedro configuration

Kedro uses configuration to make your code reproducible when it has to reference datasets in different locations and/or in different environments.

You can copy this file and reference additional locations for the same datasets. For instance, you can use the catalog.yml file in conf/base/ to register the locations of datasets that would run in production, while copying and updating a second version of catalog.yml in conf/local/ to register the locations of sample datasets that you are using for prototyping your data pipeline(s).

Built-in functionality for conf/local/ to overwrite conf/base/ is described in the documentation about configuration. This means that a dataset called cars could exist in the catalog.yml files in conf/base/ and conf/local/. In code, in src, you would only refer to a dataset named cars, and Kedro would detect which definition of cars to use when running your pipeline; the definition in conf/local/catalog.yml would take precedence in this case.
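This precedence can be sketched with plain dictionaries (the catalog entries and file paths below are hypothetical; Kedro's configuration loader performs a comparable merge across environments):

```python
# A minimal sketch of how conf/local/ entries shadow conf/base/ ones.
base_catalog = {
    "cars": {"type": "pandas.CSVDataSet", "filepath": "s3://prod-bucket/cars.csv"},
    "bikes": {"type": "pandas.CSVDataSet", "filepath": "s3://prod-bucket/bikes.csv"},
}
local_catalog = {
    "cars": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/sample_cars.csv"},
}

# Entries merged later (local) take precedence over earlier ones (base).
merged_catalog = {**base_catalog, **local_catalog}
```

Here the local definition of cars wins, while bikes, defined only in base, is kept unchanged.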

The Data Catalog also works with the credentials.yml file in conf/local/, allowing you to specify usernames and passwords required to load certain datasets.

You can define a Data Catalog in two ways - through YAML configuration, or programmatically using an API. Both methods allow you to specify:

  • Dataset name

  • Dataset type

  • Location of the dataset using fsspec, detailed in the next section

  • Credentials needed to access the dataset

  • Load and save arguments

  • Whether you want a dataset or ML model to be versioned when you run your data pipeline

Specify the location of the dataset

Kedro relies on fsspec to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in filepath:, you should provide a URL using the general form protocol://path/to/data. If no protocol is provided, the local file system is assumed (same as file://).
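The protocol handling can be illustrated with a small helper. This is a simplified sketch only; fsspec performs the real parsing and supports many more cases:

```python
def split_protocol(filepath: str) -> tuple:
    """Split a 'protocol://path' URL into (protocol, path).

    In the absence of a protocol, the local file system is assumed,
    mirroring the behaviour described above.
    """
    if "://" in filepath:
        protocol, _, path = filepath.partition("://")
        return protocol, path
    return "file", filepath
```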

The following protocols are available:

  • Local or Network File System: file:// - the local file system is the default in the absence of any protocol; it also permits relative paths.

  • Hadoop File System (HDFS): hdfs://user@server:port/path/to/data - Hadoop Distributed File System, for resilient, replicated files within a cluster.

  • Amazon S3: s3://my-bucket-name/path/to/data - Amazon S3 remote binary store, often used with Amazon EC2, using the library s3fs.

  • S3 Compatible Storage: s3://my-bucket-name/path/to/data - e.g. Minio, using the s3fs library.

  • Google Cloud Storage: gcs:// - Google Cloud Storage, typically used with Google Compute resource using gcsfs (in development).

  • Azure Blob Storage / Azure Data Lake Storage Gen2: abfs:// - Azure Blob Storage, typically used when working on an Azure environment.

  • HTTP(S): http:// or https:// - for reading data directly from HTTP web servers.

fsspec also provides other file systems, such as SSH, FTP and WebHDFS. See the fsspec documentation for more information.

Data Catalog *_args parameters

The Data Catalog accepts two different groups of *_args parameters that serve different purposes:

  • fs_args

  • load_args and save_args

fs_args is used to configure the interaction with a filesystem. All top-level parameters of fs_args (except open_args_load and open_args_save) will be passed to the underlying filesystem class.

Example 1: Provide the project value to the underlying filesystem class (GCSFileSystem) to interact with Google Cloud Storage (GCS)

  test_dataset:
    type: ...
    fs_args:
      project: test_project

The open_args_load and open_args_save parameters are passed to the filesystem’s open method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively.

Example 2: Load data from a local binary file using utf-8 encoding

  test_dataset:
    type: ...
    fs_args:
      open_args_load:
        mode: "rb"
        encoding: "utf-8"

load_args and save_args configure how a third-party library (e.g. pandas for CSVDataSet) loads/saves data from/to a file.

Example 3: Save data to a CSV file without row names (index) using utf-8 encoding

  test_dataset:
    type: pandas.CSVDataSet
    save_args:
      index: False
      encoding: "utf-8"
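How the three groups are routed can be sketched as follows (the function and key names below are hypothetical stand-ins, not Kedro internals):

```python
def route_args(fs_args: dict, load_args: dict) -> dict:
    """Illustrate where each *_args group ends up during a load."""
    fs_args = dict(fs_args)  # avoid mutating the caller's dict
    open_args_load = fs_args.pop("open_args_load", {})
    fs_args.pop("open_args_save", {})
    return {
        "filesystem_kwargs": fs_args,    # -> filesystem class constructor
        "open_kwargs": open_args_load,   # -> the filesystem's open() method
        "library_kwargs": load_args,     # -> e.g. pandas.read_csv()
    }

routed = route_args(
    fs_args={"project": "test_project", "open_args_load": {"mode": "rb"}},
    load_args={"sep": ","},
)
```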

Use the Data Catalog with the YAML API

The YAML API allows you to configure your datasets in a YAML configuration file, conf/base/catalog.yml or conf/local/catalog.yml.

Here are some examples of data configuration in a catalog.yml:

Example 1: Loads / saves a CSV file from / to a local file system

  bikes:
    type: pandas.CSVDataSet
    filepath: data/01_raw/bikes.csv

Example 2: Loads and saves a CSV on a local file system, using specified load and save arguments

  cars:
    type: pandas.CSVDataSet
    filepath: data/01_raw/company/cars.csv
    load_args:
      sep: ','
    save_args:
      index: False
      date_format: '%Y-%m-%d %H:%M'
      decimal: .

Example 3: Loads and saves a compressed CSV on a local file system

  boats:
    type: pandas.CSVDataSet
    filepath: data/01_raw/company/boats.csv.gz
    load_args:
      sep: ','
      compression: 'gzip'
    fs_args:
      open_args_load:
        mode: 'rb'

Example 4: Loads a CSV file from a specific S3 bucket, using credentials and load arguments

  motorbikes:
    type: pandas.CSVDataSet
    filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
    credentials: dev_s3
    load_args:
      sep: ','
      skiprows: 5
      skipfooter: 1
      na_values: ['#NA', NA]

Example 5: Loads / saves a pickle file from / to a local file system

  airplanes:
    type: pickle.PickleDataSet
    filepath: data/06_models/airplanes.pkl
    backend: pickle

Example 6: Loads an Excel file from Google Cloud Storage

  motorbikes:
    type: pandas.ExcelDataSet
    filepath: gcs://your_bucket/data/02_intermediate/company/motorbikes.xlsx
    fs_args:
      project: my-project
    credentials: my_gcp_credentials
    load_args:
      sheet_name: Sheet1

Example 7: Saves an image created with Matplotlib on Google Cloud Storage

  results_plot:
    type: matplotlib.MatplotlibWriter
    filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg
    fs_args:
      project: my-project
    credentials: my_gcp_credentials

Example 8: Loads / saves an HDF file on local file system storage, using specified load and save arguments

  skateboards:
    type: pandas.HDFDataSet
    filepath: data/02_intermediate/skateboards.hdf
    key: name
    load_args:
      columns: [brand, length]
    save_args:
      mode: w  # Overwrite even when the file already exists
      dropna: True

Example 9: Loads / saves a parquet file on local file system storage, using specified load and save arguments

  trucks:
    type: pandas.ParquetDataSet
    filepath: data/02_intermediate/trucks.parquet
    load_args:
      columns: [name, gear, disp, wt]
      categories: list
      index: name
    save_args:
      compression: GZIP
      file_scheme: hive
      has_nulls: False
      partition_on: [name]

Example 10: Loads / saves a Spark table on S3, using specified load and save arguments

  weather:
    type: spark.SparkDataSet
    filepath: s3a://your_bucket/data/01_raw/weather*
    credentials: dev_s3
    file_format: csv
    load_args:
      header: True
      inferSchema: True
    save_args:
      sep: '|'
      header: True

Example 11: Loads / saves a SQL table using credentials, a database connection, and specified load and save arguments

  scooters:
    type: pandas.SQLTableDataSet
    credentials: scooters_credentials
    table_name: scooters
    load_args:
      index_col: [name]
      columns: [name, gear]
    save_args:
      if_exists: replace

Example 12: Loads a SQL table using credentials and a database connection, and applies a SQL query to the table

  scooters_query:
    type: pandas.SQLQueryDataSet
    credentials: scooters_credentials
    sql: select * from cars where gear=4
    load_args:
      index_col: [name]

When you use pandas.SQLTableDataSet or pandas.SQLQueryDataSet, you must provide a database connection string. In the above example, we pass it using the scooters_credentials key from the credentials (see the details in the Feeding in credentials section below). scooters_credentials must have a top-level key con containing a SQLAlchemy compatible connection string. As an alternative to credentials, you could explicitly put con into load_args and save_args (pandas.SQLTableDataSet only).
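The lookup described above can be sketched as follows (resolve_con is a hypothetical helper for illustration, not part of the Kedro API):

```python
def resolve_con(credentials=None, load_args=None):
    """Return the connection string from credentials or, failing that, load_args."""
    for source in (credentials or {}, load_args or {}):
        if "con" in source:
            return source["con"]
    raise KeyError("a top-level 'con' connection string is required")
```

Both resolve_con(credentials={"con": "sqlite:///kedro.db"}) and resolve_con(load_args={"con": "sqlite:///kedro.db"}) would therefore yield the same connection string.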

Example 13: Loads data from an API endpoint, example US corn yield data from USDA

  us_corn_yield_data:
    type: api.APIDataSet
    url: https://quickstats.nass.usda.gov
    credentials: usda_credentials
    params:
      key: SOME_TOKEN
      format: JSON
      commodity_desc: CORN
      statisticcat_des: YIELD
      agg_level_desc: STATE
      year: 2000

Note that usda_credentials will be passed as the auth argument of the requests library. Specify the username and password as a list in your credentials.yml file as follows:

  usda_credentials:
    - username
    - password

Example 14: Loads data from Minio (S3 API Compatible Storage)

  test:
    type: pandas.CSVDataSet
    filepath: s3://your_bucket/test.csv # assume `test.csv` is uploaded to the Minio server.
    credentials: dev_minio

In credentials.yml, define the key, secret and the endpoint_url as follows:

  dev_minio:
    key: token
    secret: key
    client_kwargs:
      endpoint_url: 'http://localhost:9000'


The easiest way to set up MinIO is to run a Docker image. After running the following command, you can access the MinIO server at http://localhost:9000 and create a bucket and add files as if they were on S3.

docker run -p 9000:9000 -e "MINIO_ACCESS_KEY=token" -e "MINIO_SECRET_KEY=key" minio/minio server /data

Example 15: Loads a model saved as a pickle from Azure Blob Storage

  ml_models:
    type: pickle.PickleDataSet
    filepath: "abfs://models/ml_models.pickle"
    versioned: True
    credentials: dev_abs

In the credentials.yml file, define the account_name and account_key:

  dev_abs:
    account_name: accountname
    account_key: key

Example 16: Loads a CSV file stored in a remote location through SSH


This example requires Paramiko to be installed (pip install paramiko).

  cool_dataset:
    type: pandas.CSVDataSet
    filepath: "sftp:///path/to/remote_cluster/cool_data.csv"
    credentials: cluster_credentials

All parameters required to establish the SFTP connection can be defined through fs_args or in the credentials.yml file as follows:

  cluster_credentials:
    username: my_username
    host: host_address
    port: 22
    password: password

The list of all available parameters is given in the Paramiko documentation.

Create a Data Catalog YAML configuration file via CLI

You can use the kedro catalog create command to create a Data Catalog YAML configuration.

This creates a <conf_root>/<env>/catalog/<pipeline_name>.yml configuration file with MemoryDataSet datasets for each dataset in a registered pipeline if it is missing from the DataCatalog.

# <conf_root>/<env>/catalog/<pipeline_name>.yml
rockets:
  type: MemoryDataSet
scooters:
  type: MemoryDataSet

Adding parameters

You can configure parameters for your project and reference them in your nodes. To do this, use the add_feed_dict() method (API documentation). You can use this method to add any other entry or metadata you wish to the DataCatalog.

Feeding in credentials

Before instantiating the DataCatalog, Kedro will first attempt to read the credentials from the project configuration. The resulting dictionary is then passed into DataCatalog.from_config() as the credentials argument.

Let’s assume that the project contains the file conf/local/credentials.yml with the following contents:

dev_s3:
  client_kwargs:
    aws_access_key_id: key
    aws_secret_access_key: secret

scooters_credentials:
  con: sqlite:///kedro.db

my_gcp_credentials:
  id_token: key

In the example above, the catalog.yml file contains references to credentials keys dev_s3 and scooters_credentials. This means that when it instantiates the motorbikes dataset, for example, the DataCatalog will attempt to read top-level key dev_s3 from the received credentials dictionary, and then will pass its values into the dataset __init__ as a credentials argument. This is essentially equivalent to calling this:

CSVDataSet(
    filepath="s3://your_bucket/data/02_intermediate/company/motorbikes.csv",
    load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
    credentials=dict(key="token", secret="key"),
)

Load multiple datasets with similar configuration

Different datasets might use the same file format, load and save arguments, and be stored in the same folder. YAML has a built-in syntax for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the catalog.yml file.

You can see this in the following example:

_csv: &csv
  type: spark.SparkDataSet
  file_format: csv
  load_args:
    sep: ','
    na_values: ['#NA', NA]
    header: True
    inferSchema: False

cars:
  <<: *csv
  filepath: s3a://data/01_raw/cars.csv

trucks:
  <<: *csv
  filepath: s3a://data/01_raw/trucks.csv

bikes:
  <<: *csv
  filepath: s3a://data/01_raw/bikes.csv
  load_args:
    header: False

The syntax &csv names the following block csv and the syntax <<: *csv inserts the contents of the block named csv. Locally declared keys entirely override inserted ones as seen in bikes.


It’s important that the name of the template entry starts with a _ so Kedro knows not to try and instantiate it as a dataset.
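What the anchor merge amounts to can be sketched with plain dictionaries (the values below are loosely based on the examples above):

```python
# The '<<: *csv' merge inserts the defaults; locally declared keys
# entirely override inserted ones, as with the bikes load_args above.
csv_defaults = {
    "type": "spark.SparkDataSet",
    "file_format": "csv",
    "load_args": {"header": True, "inferSchema": False},
}

bikes = {
    **csv_defaults,
    "filepath": "s3a://data/01_raw/bikes.csv",
    "load_args": {"header": False},  # replaces the whole load_args dict
}
```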

You can also nest reusable YAML syntax:

_csv: &csv
  type: spark.SparkDataSet
  file_format: csv
  load_args: &csv_load_args
    header: True
    inferSchema: False

airplanes:
  <<: *csv
  filepath: s3a://data/01_raw/airplanes.csv
  load_args:
    <<: *csv_load_args
    sep: ;

In this example, the default csv configuration is inserted into airplanes and then the load_args block is overridden. Normally, that would replace the whole dictionary. In order to extend load_args, the defaults for that block are then re-inserted.

Transcode datasets

You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified filepath, using different DataSet implementations.

A typical example of transcoding

For instance, parquet files can not only be loaded via the ParquetDataSet using pandas, but also directly by SparkDataSet. This conversion is typical when coordinating a Spark to pandas workflow.

To enable transcoding, define two DataCatalog entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your conf/base/catalog.yml:

  my_dataframe@spark:
    type: spark.SparkDataSet
    filepath: data/02_intermediate/data.parquet
    file_format: parquet

  my_dataframe@pandas:
    type: pandas.ParquetDataSet
    filepath: data/02_intermediate/data.parquet

These entries are used in the pipeline like this:

pipeline(
    [
        node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"),
        node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"),
    ]
)

How does transcoding work?

In this example, Kedro understands that my_dataframe is the same dataset in its spark.SparkDataSet and pandas.ParquetDataSet formats and helps resolve the node execution order.

In the pipeline, Kedro uses the spark.SparkDataSet implementation for saving and pandas.ParquetDataSet for loading, so the first node should output a pyspark.sql.DataFrame, while the second node would receive a pandas.DataFrame.
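The naming convention itself can be sketched as follows (split_transcoded_name is an illustrative helper; Kedro's own resolution logic is internal):

```python
def split_transcoded_name(name: str) -> tuple:
    """Split 'dataset@transcoding' into its base name and transcoding suffix."""
    base, _, suffix = name.partition("@")
    return base, suffix

# Both entries share a base name, so they refer to the same underlying file.
spark_base, _ = split_transcoded_name("my_dataframe@spark")
pandas_base, _ = split_transcoded_name("my_dataframe@pandas")
```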

Version datasets and ML models

Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models.

Consider the following versioned dataset defined in the catalog.yml:

  cars.csv:
    type: pandas.CSVDataSet
    filepath: data/01_raw/company/cars.csv
    versioned: True

The DataCatalog will create a versioned CSVDataSet called cars.csv. The actual CSV file location will look like data/01_raw/company/cars.csv/<version>/cars.csv, where <version> corresponds to a global save version string formatted as YYYY-MM-DDThh.mm.ss.sssZ.
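The resulting layout can be sketched as follows (versioned_path is an illustrative helper; the exact timestamp Kedro generates may differ in precision):

```python
from datetime import datetime, timezone

def versioned_path(filepath: str) -> str:
    """Build a 'filepath/<version>/filename' path as described above."""
    version = datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    filename = filepath.rsplit("/", 1)[-1]
    return f"{filepath}/{version}/{filename}"

path = versioned_path("data/01_raw/company/cars.csv")
```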

You can run the pipeline with a particular versioned dataset using the --load-version flag as follows:

kedro run --load-version="cars.csv:YYYY-MM-DDThh.mm.ss.sssZ"

where --load-version takes a dataset name and a version timestamp separated by :.

This section shows just the very basics of versioning, which is described further in the documentation about Kedro IO.

Use the Data Catalog with the Code API

The code API allows you to:

  • configure data sources in code

  • operate the IO module within notebooks

Configure a Data Catalog

You can construct a DataCatalog object programmatically, for example in a notebook or script. In the following, we use a number of pre-built data loaders documented in the API reference documentation.

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import (
    CSVDataSet,
    SQLTableDataSet,
    SQLQueryDataSet,
    ParquetDataSet,
)

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)

When using SQLTableDataSet or SQLQueryDataSet, you must provide a con key containing a SQLAlchemy-compatible database connection string. In the example above, we pass it as part of the credentials argument. As an alternative to credentials, you could put con into load_args and save_args (SQLTableDataSet only).

Load datasets

You can access each dataset by its name.

cars = io.load("cars")  # data is now loaded as a DataFrame in 'cars'
gear = cars["gear"].values

Behind the scenes

The following steps happened behind the scenes when load was called:

  • The value cars was located in the Data Catalog

  • The corresponding AbstractDataSet object was retrieved

  • The load method of this dataset was called

  • This load method delegated the loading to the underlying pandas read_csv function
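These steps can be sketched with a toy stand-in for the catalog and dataset classes (the names below are illustrative, not Kedro's real implementation):

```python
class ToyCSVDataSet:
    """Stands in for a CSVDataSet; a real load() delegates to pandas.read_csv."""

    def __init__(self, filepath: str):
        self.filepath = filepath

    def load(self):
        return f"loaded {self.filepath}"

toy_catalog = {"cars": ToyCSVDataSet("data/01_raw/company/cars.csv")}

def toy_load(name: str):
    dataset = toy_catalog[name]  # locate the entry and retrieve the dataset
    return dataset.load()        # call its load method, which delegates onward
```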

View the available data sources

If you forget what data was assigned, you can always review the DataCatalog:

io.list()

Save data

You can save data using an API similar to that used to load data.


This use is not recommended unless you are prototyping in notebooks.

Save data to memory

from kedro.io import MemoryDataSet

memory = MemoryDataSet(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")

Save data to a SQL database for querying

We might now want to put the data in a SQLite database to run queries on it. Let’s use that to rank scooters by their mpg.

import os

# This cleans up the database in case it exists at this point
try:
    os.remove("kedro.db")
except FileNotFoundError:
    pass

io.save("cars_table", cars)
ranked = io.load("scooters_query")[["brand", "mpg"]]

Save data in Parquet

Finally, we can save the processed data in Parquet format.

io.save("ranked", ranked)


Saving None to a dataset is not allowed!