kedro.io.DataCatalog

class kedro.io.DataCatalog(datasets=None, feed_dict=None, dataset_patterns=None, load_versions=None, save_version=None, default_pattern=None, config_resolver=None)[source]

DataCatalog stores instances of AbstractDataset implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of datasets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying datasets.

Attributes

config_resolver

Return type:

CatalogConfigResolver

Methods

add(dataset_name, dataset[, replace])

Adds a new AbstractDataset object to the DataCatalog.

add_all(datasets[, replace])

Adds a group of new datasets to the DataCatalog.

add_feed_dict(feed_dict[, replace])

Add datasets to the DataCatalog using the data provided through the feed_dict.

confirm(name)

Confirm a dataset by its name.

exists(name)

Checks whether a registered dataset exists by calling its exists() method.

from_config(catalog[, credentials, ...])

Create a DataCatalog instance from configuration.

list([regex_search])

List of all dataset names registered in the catalog.

load(name[, version])

Loads a registered dataset.

release(name)

Release any cached data associated with a dataset.

save(name, data)

Save data to a registered dataset.

shallow_copy([extra_dataset_patterns])

Returns a shallow copy of the current object.

__init__(datasets=None, feed_dict=None, dataset_patterns=None, load_versions=None, save_version=None, default_pattern=None, config_resolver=None)[source]

DataCatalog stores instances of AbstractDataset implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of datasets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying datasets.

Parameters:
  • datasets (dict[str, AbstractDataset] | None) – A dictionary of dataset names and dataset instances.

  • feed_dict (dict[str, Any] | None) – A feed dict with data to be added in memory.

  • dataset_patterns (Patterns | None) – A dictionary of dataset factory patterns and corresponding dataset configuration. When fetched from catalog configuration these patterns are sorted by: 1. decreasing specificity (number of characters outside the curly brackets); 2. decreasing number of placeholders (number of curly bracket pairs); 3. alphabetically. A pattern of specificity 0 is a catch-all pattern and will overwrite the default pattern provided through the runners if it comes before “default” in the alphabet. Such an overwriting pattern will emit a warning. The “{default}” name will not emit a warning.

  • load_versions (dict[str, str] | None) – A mapping between dataset names and versions to load. Has no effect on datasets without enabled versioning.

  • save_version (str | None) – Version string to be used for save operations by all datasets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) sort such that the latest version comes last in lexicographical order.

  • default_pattern (Patterns | None) – A dictionary of the default catch-all pattern that overrides the default pattern provided through the runners.

  • config_resolver (CatalogConfigResolver | None) – An instance of CatalogConfigResolver to resolve dataset patterns and configurations.
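The three sorting rules for dataset_patterns can be illustrated with a small, self-contained sketch. The sort_patterns helper below is hypothetical, not kedro's implementation; it only reproduces the ordering rules stated above:

```python
import re

def sort_patterns(patterns):
    """Order patterns by: 1. decreasing specificity (characters outside
    curly brackets), 2. decreasing number of placeholders, 3. alphabetically."""
    def key(pattern):
        placeholders = re.findall(r"\{.*?\}", pattern)
        # Specificity = characters NOT inside a curly-bracket placeholder
        specificity = len(pattern) - sum(len(p) for p in placeholders)
        return (-specificity, -len(placeholders), pattern)
    return sorted(patterns, key=key)

patterns = ["{default}", "{namespace}.{name}@csv", "{name}@csv", "intermediate_{name}"]
print(sort_patterns(patterns))
# ['intermediate_{name}', '{namespace}.{name}@csv', '{name}@csv', '{default}']
```

Note that "{default}" has specificity 0, so it sorts last and acts as the catch-all pattern.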

Example:

from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset

cars = CSVDataset(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
catalog = DataCatalog(datasets={'cars': cars})
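The save_version requirement, that versions sort lexicographically to the latest, is typically met with a zero-padded UTC timestamp; kedro's generated versions are of broadly this shape. The make_save_version helper below is a hypothetical sketch, not kedro's own generator:

```python
from datetime import datetime, timezone

def make_save_version(ts=None):
    # Zero-padded fields make lexicographic order match chronological
    # order, so the "latest" version is simply the greatest string.
    # Dots instead of colons keep the string filesystem-safe.
    ts = ts or datetime.now(timezone.utc)
    return ts.strftime("%Y-%m-%dT%H.%M.%S.%f")[:-3] + "Z"

v1 = make_save_version(datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc))
v2 = make_save_version(datetime(2023, 11, 2, 8, 0, tzinfo=timezone.utc))
assert v1 < v2  # lexicographic order == chronological order
print(v1, v2)
```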
add(dataset_name, dataset, replace=False)[source]

Adds a new AbstractDataset object to the DataCatalog.

Parameters:
  • dataset_name (str) – A unique dataset name which has not been registered yet.

  • dataset (AbstractDataset) – A dataset object to be associated with the given dataset name.

  • replace (bool) – Specifies whether replacing an existing dataset with the same name is allowed.

Raises:

DatasetAlreadyExistsError – When a dataset with the same name has already been registered.

Example:

from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset

catalog = DataCatalog(datasets={
                  'cars': CSVDataset(filepath="cars.csv")
                 })

catalog.add("boats", CSVDataset(filepath="boats.csv"))
Return type:

None

add_all(datasets, replace=False)[source]

Adds a group of new datasets to the DataCatalog.

Parameters:
  • datasets (dict[str, AbstractDataset]) – A dictionary of dataset names and dataset instances.

  • replace (bool) – Specifies whether replacing an existing dataset with the same name is allowed.

Raises:

DatasetAlreadyExistsError – When a dataset with the same name has already been registered.

Example:

from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset, ParquetDataset

catalog = DataCatalog(datasets={
                  "cars": CSVDataset(filepath="cars.csv")
                 })
additional = {
    "planes": ParquetDataset(filepath="planes.parq"),
    "boats": CSVDataset(filepath="boats.csv")
}

catalog.add_all(additional)

assert catalog.list() == ["cars", "planes", "boats"]
Return type:

None

add_feed_dict(feed_dict, replace=False)[source]

Add datasets to the DataCatalog using the data provided through the feed_dict.

feed_dict is a dictionary where the keys represent dataset names and the values can either be raw data or Kedro datasets - instances of classes that inherit from AbstractDataset. If raw data is provided, it will be automatically wrapped in a MemoryDataset before being added to the DataCatalog.

Parameters:
  • feed_dict (dict[str, Any]) – A dictionary with data to be added to the DataCatalog. Keys are dataset names and values can be raw data or instances of classes that inherit from AbstractDataset.

  • replace (bool) – Specifies whether to replace an existing dataset with the same name in the DataCatalog.

Example:

from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset
import pandas as pd

df = pd.DataFrame({"col1": [1, 2],
                   "col2": [4, 5],
                   "col3": [5, 6]})

catalog = DataCatalog()
catalog.add_feed_dict({
    "data_df": df
}, replace=True)

assert catalog.load("data_df").equals(df)

csv_dataset = CSVDataset(filepath="test.csv")
csv_dataset.save(df)
catalog.add_feed_dict({"data_csv_dataset": csv_dataset})

assert catalog.load("data_csv_dataset").equals(df)
Return type:

None

property config_resolver: CatalogConfigResolver
Return type:

CatalogConfigResolver

confirm(name)[source]

Confirm a dataset by its name.

Parameters:

name (str) – Name of the dataset.

Raises:

DatasetError – When the dataset does not have a confirm method.

Return type:

None
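The delegation behaviour of confirm can be sketched with stand-in classes; none of the names below are kedro's actual implementation, and the free-standing confirm function only mirrors the documented behaviour:

```python
class DatasetError(Exception):
    """Stand-in for kedro.io.DatasetError (illustration only)."""

class ConfirmableDataset:
    """Toy dataset that supports confirmation."""
    def __init__(self):
        self.confirmed = False
    def confirm(self):
        self.confirmed = True

class PlainDataset:
    """Toy dataset without a confirm method."""

def confirm(datasets, name):
    # Look up the registered dataset and delegate to its own confirm()
    # method; raise if the dataset does not define one.
    dataset = datasets[name]
    if not hasattr(dataset, "confirm"):
        raise DatasetError(f"Dataset '{name}' does not have 'confirm' method")
    dataset.confirm()

datasets = {"partitioned": ConfirmableDataset(), "plain": PlainDataset()}
confirm(datasets, "partitioned")
assert datasets["partitioned"].confirmed
```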

exists(name)[source]

Checks whether a registered dataset exists by calling its exists() method. Emits a warning and returns False if exists() is not implemented.

Parameters:

name (str) – A dataset to be checked.

Return type:

bool

Returns:

Whether the dataset output exists.
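The documented fallback, warn and return False when exists() is not implemented, can be sketched as follows. NoExistsDataset and the free-standing exists function are illustrative stand-ins, not kedro's code:

```python
import warnings

class NoExistsDataset:
    """Stand-in dataset whose exists() is not implemented."""
    def exists(self):
        raise NotImplementedError

def exists(dataset):
    # Delegate to the dataset's own exists() method; warn and assume the
    # output does not exist when the method is not implemented.
    try:
        return dataset.exists()
    except NotImplementedError:
        warnings.warn("exists() not implemented; assuming output does not exist")
        return False

print(exists(NoExistsDataset()))  # False, with a warning
```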

classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None)[source]

Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.

Parameters:
  • catalog (dict[str, dict[str, Any]] | None) – A dictionary whose keys are the dataset names and the values are dictionaries with the constructor arguments for classes implementing AbstractDataset. The dataset class to be loaded is specified with the key type and its fully qualified class name. All kedro.io datasets can be specified by their class name only, i.e. their module name can be omitted.

  • credentials (dict[str, dict[str, Any]] | None) – A dictionary containing credentials for different datasets. Use the credentials key in an AbstractDataset to refer to the appropriate credentials as shown in the example below.

  • load_versions (dict[str, str] | None) – A mapping between dataset names and versions to load. Has no effect on datasets without enabled versioning.

  • save_version (str | None) – Version string to be used for save operations by all datasets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) sort such that the latest version comes last in lexicographical order.

Return type:

DataCatalog

Returns:

An instantiated DataCatalog containing all specified datasets, created and ready to use.

Raises:
  • DatasetError – When the method fails to create any of the datasets from their config.

  • DatasetNotFoundError – When load_versions refers to a dataset that doesn’t exist in the catalog.

Example:

from kedro.io import DataCatalog

config = {
    "cars": {
        "type": "pandas.CSVDataset",
        "filepath": "cars.csv",
        "save_args": {
            "index": False
        }
    },
    "boats": {
        "type": "pandas.CSVDataset",
        "filepath": "s3://aws-bucket-name/boats.csv",
        "credentials": "boats_credentials",
        "save_args": {
            "index": False
        }
    }
}

credentials = {
    "boats_credentials": {
        "client_kwargs": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>"
        }
     }
}

catalog = DataCatalog.from_config(config, credentials)

df = catalog.load("cars")
catalog.save("boats", df)
list(regex_search=None)[source]

List of all dataset names registered in the catalog. This can be filtered by providing an optional regular expression, in which case only matching names are returned.

Parameters:

regex_search (str | None) – An optional regular expression which can be provided to limit the datasets returned by a particular pattern.

Return type:

list[str]

Returns:

A list of dataset names available which match the regex_search criteria (if provided). All dataset names are returned by default.

Raises:

SyntaxError – When an invalid regex filter is provided.

Example:

from kedro.io import DataCatalog

catalog = DataCatalog()
# get datasets where the substring 'raw' is present
raw_data = catalog.list(regex_search='raw')
# get datasets which start with 'prm' or 'feat'
feat_eng_data = catalog.list(regex_search='^(prm|feat)')
# get datasets which end with 'time_series'
models = catalog.list(regex_search='.+time_series$')
load(name, version=None)[source]

Loads a registered dataset.

Parameters:
  • name (str) – A dataset to be loaded.

  • version (str | None) – Optional argument for concrete data version to be loaded. Works only with versioned datasets.

Return type:

Any

Returns:

The loaded data as configured.

Raises:

DatasetNotFoundError – When a dataset with the given name has not yet been registered.

Example:

from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset

cars = CSVDataset(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
catalog = DataCatalog(datasets={'cars': cars})

df = catalog.load("cars")
release(name)[source]

Release any cached data associated with a dataset.

Parameters:

name (str) – A dataset to be released.

Raises:

DatasetNotFoundError – When a dataset with the given name has not yet been registered.

Return type:

None
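What "cached data" means here can be sketched with a toy caching dataset (illustrative only; kedro's CachedDataset and MemoryDataset are the real in-memory cases): releasing drops the cached copy, so the next load re-reads from the source.

```python
class ToyCachedDataset:
    """Toy stand-in that caches its first load (not kedro's CachedDataset)."""
    def __init__(self, loader):
        self._loader = loader
        self._cache = None

    def load(self):
        if self._cache is None:        # load once, then serve from cache
            self._cache = self._loader()
        return self._cache

    def release(self):
        self._cache = None             # drop cached data; next load re-reads

calls = []
ds = ToyCachedDataset(lambda: calls.append(1) or len(calls))
ds.load()
ds.load()
assert len(calls) == 1   # second load came from the cache
ds.release()
ds.load()
assert len(calls) == 2   # after release, the source was read again
```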

save(name, data)[source]

Save data to a registered dataset.

Parameters:
  • name (str) – A dataset to be saved to.

  • data (Any) – A data object to be saved as configured in the registered dataset.

Raises:

DatasetNotFoundError – When a dataset with the given name has not yet been registered.

Example:

import pandas as pd

from kedro.io import DataCatalog
from kedro_datasets.pandas import CSVDataset

cars = CSVDataset(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
catalog = DataCatalog(datasets={'cars': cars})

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [4, 5],
                   'col3': [5, 6]})
catalog.save("cars", df)
Return type:

None

shallow_copy(extra_dataset_patterns=None)[source]

Returns a shallow copy of the current object.

Return type:

DataCatalog

Returns:

Copy of the current object.
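Shallow-copy semantics here mean the copy gets its own registry while sharing the underlying dataset instances. A sketch with a plain dict standing in for the catalog (not kedro's implementation; the extra_dataset_patterns argument is ignored in this toy):

```python
import copy

class ToyDataset:
    """Stand-in dataset object (illustration only)."""

original = {"cars": ToyDataset()}       # toy "catalog": name -> dataset
clone = copy.copy(original)             # shallow copy: new dict, same values

clone["boats"] = ToyDataset()             # registering in the copy...
assert "boats" not in original            # ...does not affect the original
assert clone["cars"] is original["cars"]  # but existing datasets are shared
```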