The Data Catalog¶
This section introduces catalog.yml, the project-shareable Data Catalog. The file is located in conf/base and is a registry of all data sources available for use by a project; it manages loading and saving of data. All supported data connectors are available in kedro-datasets.
Use the Data Catalog within Kedro configuration¶
Kedro uses configuration to make your code reproducible when it has to reference datasets in different locations and/or in different environments.
You can copy this file and reference additional locations for the same datasets. For instance, you can use the catalog.yml file in conf/base/ to register the locations of datasets that would run in production, while copying and updating a second version of catalog.yml in conf/local/ to register the locations of sample datasets that you are using for prototyping your data pipeline(s).
Built-in functionality for conf/local/ to override conf/base/ is described in the documentation about configuration. This means that a dataset called cars could exist in the catalog.yml files in both conf/base/ and conf/local/. In code, in src, you would only refer to a dataset named cars, and Kedro would detect which definition of cars to use to run your pipeline - the cars definition from conf/local/catalog.yml would take precedence in this case.
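As a rough illustration of this precedence rule, the following sketch builds a catalog from two hypothetical dictionaries standing in for the parsed conf/base and conf/local catalog files; the filepaths and bucket name are placeholders:
from kedro.io import DataCatalog

# Hypothetical parsed contents of conf/base/catalog.yml and conf/local/catalog.yml
base_catalog = {
    "cars": {"type": "pandas.CSVDataSet", "filepath": "s3://prod-bucket/01_raw/cars.csv"}
}
local_catalog = {
    "cars": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/cars_sample.csv"}
}

# conf/local entries take precedence over conf/base when the configuration is merged
merged = {**base_catalog, **local_catalog}
catalog = DataCatalog.from_config(merged)
catalog.load("cars")  # would read the local sample file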
The Data Catalog also works with the credentials.yml file in conf/local/, allowing you to specify usernames and passwords required to load certain datasets.
You can define a Data Catalog in two ways - through YAML configuration, or programmatically using an API. Both methods allow you to specify:
Dataset name
Dataset type
Location of the dataset using fsspec, detailed in the next section
Credentials needed to access the dataset
Load and save arguments
Whether you want a dataset or ML model to be versioned when you run your data pipeline
Specify the location of the dataset¶
Kedro relies on fsspec to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. When specifying a storage location in filepath:, you should provide a URL using the general form protocol://path/to/data. If no protocol is provided, the local file system is assumed (same as file://).
The following protocol prefixes are available:
Local or Network File System: file:// - the local file system is the default in the absence of any protocol; it also permits relative paths.
Hadoop File System (HDFS): hdfs://user@server:port/path/to/data - Hadoop Distributed File System, for resilient, replicated files within a cluster.
Amazon S3: s3://my-bucket-name/path/to/data - Amazon S3 remote binary store, often used with Amazon EC2, using the library s3fs.
S3 Compatible Storage: s3://my-bucket-name/path/_to/data - e.g. MinIO, using the s3fs library.
Google Cloud Storage: gcs:// - Google Cloud Storage, typically used with Google Compute resources, using gcsfs (in development).
Azure Blob Storage / Azure Data Lake Storage Gen2: abfs:// - Azure Blob Storage, typically used when working in an Azure environment.
HTTP(S): http:// or https:// for reading data directly from HTTP web servers.
fsspec also provides other file systems, such as SSH, FTP and WebHDFS. See the fsspec documentation for more information.
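As an illustrative sketch, the same dataset class can point at a local path or at a remote object store purely through the filepath protocol (the bucket name and credentials below are placeholders):
from kedro_datasets.pandas import CSVDataSet

# No protocol: the local file system is assumed, and relative paths are allowed
local_bikes = CSVDataSet(filepath="data/01_raw/bikes.csv")

# s3:// protocol: fsspec delegates to s3fs, which receives the credentials
remote_bikes = CSVDataSet(
    filepath="s3://your_bucket/data/01_raw/bikes.csv",
    credentials={"key": "token", "secret": "key"},
)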
Data Catalog *_args parameters¶
The Data Catalog accepts two different groups of *_args parameters that serve different purposes:
fs_args
load_args and save_args
fs_args is used to configure the interaction with a filesystem. All the top-level parameters of fs_args (except open_args_load and open_args_save) will be passed to the underlying filesystem class.
Example 1: Provide the project value to the underlying filesystem class (GCSFileSystem) to interact with Google Cloud Storage (GCS)¶
test_dataset:
  type: ...
  fs_args:
    project: test_project
The open_args_load and open_args_save parameters are passed to the filesystem’s open method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively.
Example 2: Load data from a local binary file using utf-8 encoding¶
test_dataset:
  type: ...
  fs_args:
    open_args_load:
      mode: "rb"
      encoding: "utf-8"
load_args and save_args configure how a third-party library (e.g. pandas for CSVDataSet) loads/saves data from/to a file.
Example 3: Save data to a CSV file without row names (index) using utf-8 encoding¶
test_dataset:
  type: pandas.CSVDataSet
  ...
  save_args:
    index: False
    encoding: "utf-8"
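For comparison, a roughly equivalent dataset defined through the Code API could look like the following sketch (the filepath is a placeholder, not taken from the examples above):
from kedro_datasets.pandas import CSVDataSet

# save_args are forwarded to pandas.DataFrame.to_csv during save operations
test_dataset = CSVDataSet(
    filepath="data/01_raw/test.csv",
    save_args={"index": False, "encoding": "utf-8"},
)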
Use the Data Catalog with the YAML API¶
The YAML API allows you to configure your datasets in a YAML configuration file, conf/base/catalog.yml or conf/local/catalog.yml.
Here are some examples of data configuration in a catalog.yml:
Example 1: Loads / saves a CSV file from / to a local file system¶
bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv
Example 2: Loads and saves a CSV on a local file system, using specified load and save arguments¶
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  load_args:
    sep: ','
  save_args:
    index: False
    date_format: '%Y-%m-%d %H:%M'
    decimal: .
Example 3: Loads and saves a compressed CSV on a local file system¶
boats:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/boats.csv.gz
  load_args:
    sep: ','
    compression: 'gzip'
  fs_args:
    open_args_load:
      mode: 'rb'
Example 4: Loads a CSV file from a specific S3 bucket, using credentials and load arguments¶
motorbikes:
  type: pandas.CSVDataSet
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3
  load_args:
    sep: ','
    skiprows: 5
    skipfooter: 1
    na_values: ['#NA', NA]
Example 5: Loads / saves a pickle file from / to a local file system¶
airplanes:
  type: pickle.PickleDataSet
  filepath: data/06_models/airplanes.pkl
  backend: pickle
Example 6: Loads an Excel file from Google Cloud Storage¶
rockets:
  type: pandas.ExcelDataSet
  filepath: gcs://your_bucket/data/02_intermediate/company/motorbikes.xlsx
  fs_args:
    project: my-project
  credentials: my_gcp_credentials
  save_args:
    sheet_name: Sheet1
Example 7: Loads a multi-sheet Excel file from a local file system¶
trains:
  type: pandas.ExcelDataSet
  filepath: data/02_intermediate/company/trains.xlsx
  load_args:
    sheet_name: [Sheet1, Sheet2, Sheet3]
Example 8: Saves an image created with Matplotlib on Google Cloud Storage¶
results_plot:
  type: matplotlib.MatplotlibWriter
  filepath: gcs://your_bucket/data/08_results/plots/output_1.jpeg
  fs_args:
    project: my-project
  credentials: my_gcp_credentials
Example 9: Loads / saves an HDF file on local file system storage, using specified load and save arguments¶
skateboards:
  type: pandas.HDFDataSet
  filepath: data/02_intermediate/skateboards.hdf
  key: name
  load_args:
    columns: [brand, length]
  save_args:
    mode: w  # Overwrite even when the file already exists
    dropna: True
Example 10: Loads / saves a parquet file on local file system storage, using specified load and save arguments¶
trucks:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/trucks.parquet
  load_args:
    columns: [name, gear, disp, wt]
    categories: list
    index: name
  save_args:
    compression: GZIP
    file_scheme: hive
    has_nulls: False
    partition_on: [name]
Example 11: Loads / saves a Spark table on S3, using specified load and save arguments¶
weather:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/weather*
  credentials: dev_s3
  file_format: csv
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True
Example 12: Loads / saves a SQL table using credentials, a database connection, and specified load and save arguments¶
scooters:
  type: pandas.SQLTableDataSet
  credentials: scooters_credentials
  table_name: scooters
  load_args:
    index_col: [name]
    columns: [name, gear]
  save_args:
    if_exists: replace
Example 13: Loads an SQL table with credentials and a database connection, and applies a SQL query to the table¶
scooters_query:
  type: pandas.SQLQueryDataSet
  credentials: scooters_credentials
  sql: select * from cars where gear=4
  load_args:
    index_col: [name]
When you use pandas.SQLTableDataSet or pandas.SQLQueryDataSet, you must provide a database connection string. In the above example, we pass it using the scooters_credentials key from the credentials (see the details in the Feeding in credentials section below). scooters_credentials must have a top-level key con containing a SQLAlchemy-compatible connection string. As an alternative to credentials, you could explicitly put con into load_args and save_args (pandas.SQLTableDataSet only).
Example 14: Loads data from an API endpoint, example US corn yield data from USDA¶
us_corn_yield_data:
  type: api.APIDataSet
  url: https://quickstats.nass.usda.gov
  credentials: usda_credentials
  params:
    key: SOME_TOKEN
    format: JSON
    commodity_desc: CORN
    statisticcat_des: YIELD
    agg_level_desc: STATE
    year: 2000
Note that usda_credentials will be passed as the auth argument in the requests library. Specify the username and password as a list in your credentials.yml file as follows:
usda_credentials:
- username
- password
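Conceptually, the resulting request then resembles the following sketch, with the two-element credentials list becoming HTTP basic auth (parameter values are taken from the example above, and the remaining query parameters are omitted for brevity):
import requests

# The username/password pair from credentials.yml is passed as the auth argument
response = requests.get(
    "https://quickstats.nass.usda.gov",
    params={"key": "SOME_TOKEN", "format": "JSON", "commodity_desc": "CORN"},
    auth=("username", "password"),
)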
Example 15: Loads data from Minio (S3 API Compatible Storage)¶
test:
  type: pandas.CSVDataSet
  filepath: s3://your_bucket/test.csv  # assume `test.csv` is uploaded to the MinIO server.
  credentials: dev_minio
In credentials.yml, define the key, secret and the endpoint_url as follows:
dev_minio:
  key: token
  secret: key
  client_kwargs:
    endpoint_url: 'http://localhost:9000'
Note
The easiest way to set up MinIO is to run a Docker image. After running the following command, you can access the MinIO server at http://localhost:9000 and create a bucket and add files as if they were on S3.
docker run -p 9000:9000 -e "MINIO_ACCESS_KEY=token" -e "MINIO_SECRET_KEY=key" minio/minio server /data
Example 16: Loads a model saved as a pickle from Azure Blob Storage¶
ml_model:
  type: pickle.PickleDataSet
  filepath: "abfs://models/ml_models.pickle"
  versioned: True
  credentials: dev_abs
In the credentials.yml file, define the account_name and account_key:
dev_abs:
  account_name: accountname
  account_key: key
Example 17: Loads a CSV file stored in a remote location through SSH¶
Note
This example requires Paramiko to be installed (pip install paramiko).
cool_dataset:
  type: pandas.CSVDataSet
  filepath: "sftp:///path/to/remote_cluster/cool_data.csv"
  credentials: cluster_credentials
All parameters required to establish the SFTP connection can be defined through fs_args or in the credentials.yml file as follows:
cluster_credentials:
  username: my_username
  host: host_address
  port: 22
  password: password
The list of all available parameters is given in the Paramiko documentation.
Create a Data Catalog YAML configuration file via CLI¶
You can use the kedro catalog create command to create a Data Catalog YAML configuration.
This creates a <conf_root>/<env>/catalog/<pipeline_name>.yml configuration file with MemoryDataSet datasets for each dataset in a registered pipeline if it is missing from the DataCatalog.
# <conf_root>/<env>/catalog/<pipeline_name>.yml
rockets:
  type: MemoryDataSet
scooters:
  type: MemoryDataSet
Adding parameters¶
You can configure parameters for your project and reference them in your nodes. To do this, use the add_feed_dict() method (API documentation). You can use this method to add any other entry or metadata you wish to the DataCatalog.
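A minimal sketch of add_feed_dict() in use (the parameter name is hypothetical):
from kedro.io import DataCatalog

catalog = DataCatalog()
# Entries added this way are wrapped in MemoryDataSet; the "params:" prefix lets nodes reference them as inputs
catalog.add_feed_dict({"params:learning_rate": 0.01})
catalog.load("params:learning_rate")  # returns 0.01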
Feeding in credentials¶
Before instantiating the DataCatalog, Kedro will first attempt to read the credentials from the project configuration. The resulting dictionary is then passed into DataCatalog.from_config() as the credentials argument.
Let’s assume that the project contains the file conf/local/credentials.yml with the following contents:
dev_s3:
  client_kwargs:
    aws_access_key_id: key
    aws_secret_access_key: secret
scooters_credentials:
  con: sqlite:///kedro.db
my_gcp_credentials:
  id_token: key
In the example above, the catalog.yml file contains references to the credentials keys dev_s3 and scooters_credentials. This means that when it instantiates the motorbikes dataset, for example, the DataCatalog will attempt to read the top-level key dev_s3 from the received credentials dictionary, and then will pass its values into the dataset __init__ as a credentials argument. This is essentially equivalent to calling this:
CSVDataSet(
    filepath="s3://your_bucket/data/02_intermediate/company/motorbikes.csv",
    load_args=dict(sep=",", skiprows=5, skipfooter=1, na_values=["#NA", "NA"]),
    credentials=dict(client_kwargs=dict(aws_access_key_id="key", aws_secret_access_key="secret")),
)
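If you instantiate the catalog yourself, for example in a notebook, the same wiring can be reproduced with DataCatalog.from_config(); a sketch using the motorbikes entry and the dev_s3 credentials shown above:
from kedro.io import DataCatalog

catalog_config = {
    "motorbikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://your_bucket/data/02_intermediate/company/motorbikes.csv",
        "credentials": "dev_s3",
    }
}
credentials = {
    "dev_s3": {
        "client_kwargs": {
            "aws_access_key_id": "key",
            "aws_secret_access_key": "secret",
        }
    }
}

# The credentials key named in catalog_config is resolved against the credentials dictionary
catalog = DataCatalog.from_config(catalog_config, credentials=credentials)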
Load multiple datasets with similar configuration¶
Different datasets might use the same file format, share load and save arguments, and be stored in the same folder. YAML has a built-in syntax for factorising parts of a YAML file, which means that you can decide what is generalisable across your datasets, so that you need not spend time copying and pasting dataset configurations in the catalog.yml file.
You can see this in the following example:
_csv: &csv
  type: spark.SparkDataSet
  file_format: csv
  load_args:
    sep: ','
    na_values: ['#NA', NA]
    header: True
    inferSchema: False

cars:
  <<: *csv
  filepath: s3a://data/01_raw/cars.csv

trucks:
  <<: *csv
  filepath: s3a://data/01_raw/trucks.csv

bikes:
  <<: *csv
  filepath: s3a://data/01_raw/bikes.csv
  load_args:
    header: False
The syntax &csv names the following block csv, and the syntax <<: *csv inserts the contents of the block named csv. Locally declared keys entirely override inserted ones, as seen in bikes.
Note
It’s important that the name of the template entry starts with a _ so Kedro knows not to try to instantiate it as a dataset.
You can also nest reusable YAML syntax:
_csv: &csv
  type: spark.SparkDataSet
  file_format: csv
  load_args: &csv_load_args
    header: True
    inferSchema: False

airplanes:
  <<: *csv
  filepath: s3a://data/01_raw/airplanes.csv
  load_args:
    <<: *csv_load_args
    sep: ;
In this example, the default csv configuration is inserted into airplanes and then the load_args block is overridden. Normally, that would replace the whole dictionary. In order to extend load_args, the defaults for that block are then re-inserted.
Transcode datasets¶
You might come across a situation where you would like to read the same file using two different dataset implementations. Use transcoding when you want to load and save the same file, via its specified filepath, using different DataSet implementations.
A typical example of transcoding¶
For instance, Parquet files can not only be loaded via the ParquetDataSet using pandas, but also directly by SparkDataSet. This conversion is typical when coordinating a Spark to pandas workflow.
To enable transcoding, define two DataCatalog entries for the same dataset in a common format (Parquet, JSON, CSV, etc.) in your conf/base/catalog.yml:
my_dataframe@spark:
  type: spark.SparkDataSet
  filepath: data/02_intermediate/data.parquet
  file_format: parquet

my_dataframe@pandas:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/data.parquet
These entries are used in the pipeline like this:
pipeline(
    [
        node(func=my_func1, inputs="spark_input", outputs="my_dataframe@spark"),
        node(func=my_func2, inputs="my_dataframe@pandas", outputs="pipeline_output"),
    ]
)
How does transcoding work?¶
In this example, Kedro understands that my_dataframe is the same dataset in its spark.SparkDataSet and pandas.ParquetDataSet formats and helps resolve the node execution order.
In the pipeline, Kedro uses the spark.SparkDataSet implementation for saving and pandas.ParquetDataSet for loading, so the first node should output a pyspark.sql.DataFrame, while the second node would receive a pandas.DataFrame.
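In other words, the node functions are expected to produce and consume the corresponding in-memory types; a sketch with hypothetical function bodies, assuming PySpark is installed:
import pandas as pd
from pyspark.sql import DataFrame as SparkDataFrame


def my_func1(data: SparkDataFrame) -> SparkDataFrame:
    # The returned Spark DataFrame is saved to Parquet via spark.SparkDataSet
    return data.filter("gear = 4")


def my_func2(df: pd.DataFrame) -> pd.DataFrame:
    # The same Parquet file is loaded back as a pandas DataFrame via pandas.ParquetDataSet
    return df.describe()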
Version datasets and ML models¶
Making a simple addition to your Data Catalog allows you to perform versioning of datasets and machine learning models.
Consider the following versioned dataset defined in the catalog.yml:
cars:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/cars.csv
  versioned: True
The DataCatalog will create a versioned CSVDataSet called cars. The actual CSV file location will look like data/01_raw/company/cars.csv/<version>/cars.csv, where <version> corresponds to a global save version string formatted as YYYY-MM-DDThh.mm.ss.sssZ.
You can run the pipeline with a particular versioned dataset with the --load-version flag as follows:
kedro run --load-version=cars:YYYY-MM-DDThh.mm.ss.sssZ
where --load-version takes a dataset name and a version timestamp separated by :.
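If you are working with the Code API, a specific version can also be pinned when constructing the dataset; a sketch, assuming the placeholder timestamp is replaced with a real save version:
from kedro.io import Version
from kedro_datasets.pandas import CSVDataSet

cars = CSVDataSet(
    filepath="data/01_raw/company/cars.csv",
    version=Version(load="YYYY-MM-DDThh.mm.ss.sssZ", save=None),
)
cars.load()  # reads data/01_raw/company/cars.csv/YYYY-MM-DDThh.mm.ss.sssZ/cars.csv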
This section shows just the very basics of versioning, which is described further in the documentation about Kedro IO.
Use the Data Catalog with the Code API¶
The Code API allows you to:
configure data sources in code
operate the IO module within notebooks
Configure a Data Catalog¶
In a file like catalog.py, you can construct a DataCatalog object programmatically. In the following, we are using several pre-built data loaders documented in the API reference documentation.
from kedro.io import DataCatalog
from kedro_datasets.pandas import (
    CSVDataSet,
    SQLTableDataSet,
    SQLQueryDataSet,
    ParquetDataSet,
)

io = DataCatalog(
    {
        "bikes": CSVDataSet(filepath="../data/01_raw/bikes.csv"),
        "cars": CSVDataSet(filepath="../data/01_raw/cars.csv", load_args=dict(sep=",")),
        "cars_table": SQLTableDataSet(
            table_name="cars", credentials=dict(con="sqlite:///kedro.db")
        ),
        "scooters_query": SQLQueryDataSet(
            sql="select * from cars where gear=4",
            credentials=dict(con="sqlite:///kedro.db"),
        ),
        "ranked": ParquetDataSet(filepath="ranked.parquet"),
    }
)
When using SQLTableDataSet or SQLQueryDataSet you must provide a con key containing a SQLAlchemy-compatible database connection string. In the example above, we pass it as part of the credentials argument. An alternative to credentials is to put con into load_args and save_args (SQLTableDataSet only).
Load datasets¶
You can access each dataset by its name.
cars = io.load("cars") # data is now loaded as a DataFrame in 'cars'
gear = cars["gear"].values
Behind the scenes¶
The following steps happened behind the scenes when load was called:
The value cars was located in the Data Catalog
The corresponding AbstractDataSet object was retrieved
The load method of this dataset was called
This load method delegated the loading to the underlying pandas read_csv function
View the available data sources¶
If you forget what data was assigned, you can always review the DataCatalog.
io.list()
Save data¶
You can save data using an API similar to that used to load data.
Warning
This use is not recommended unless you are prototyping in notebooks.
Save data to memory¶
from kedro.io import MemoryDataSet
memory = MemoryDataSet(data=None)
io.add("cars_cache", memory)
io.save("cars_cache", "Memory can store anything.")
io.load("car_cache")
Save data to a SQL database for querying¶
We might now want to put the data in a SQLite database to run queries on it. Let’s use that to rank scooters by their mpg.
import os

# This cleans up the database in case it exists at this point
try:
    os.remove("kedro.db")
except FileNotFoundError:
    pass
io.save("cars_table", cars)
ranked = io.load("scooters_query")[["brand", "mpg"]]
Save data in Parquet¶
Finally, we can save the processed data in Parquet format.
io.save("ranked", ranked)
Warning
Saving None to a dataset is not allowed!