kedro.io.DataCatalog¶
- class kedro.io.DataCatalog(data_sets=None, feed_dict=None, layers=None)[source]¶
DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.
Methods
- add(data_set_name, data_set[, replace]): Adds a new AbstractDataSet object to the DataCatalog.
- add_all(data_sets[, replace]): Adds a group of new data sets to the DataCatalog.
- add_feed_dict(feed_dict[, replace]): Adds instances of MemoryDataSet, containing the data provided through feed_dict.
- confirm(name): Confirm a dataset by its name.
- exists(name): Checks whether a registered data set exists by calling its exists() method.
- from_config(catalog[, credentials, ...]): Create a DataCatalog instance from configuration.
- list([regex_search]): List of all DataSet names registered in the catalog.
- load(name[, version]): Loads a registered data set.
- release(name): Release any cached data associated with a data set.
- save(name, data): Save data to a registered data set.
- shallow_copy(): Returns a shallow copy of the current object.
- __init__(data_sets=None, feed_dict=None, layers=None)[source]¶
DataCatalog stores instances of AbstractDataSet implementations to provide load and save capabilities from anywhere in the program. To use a DataCatalog, you need to instantiate it with a dictionary of data sets. Then it will act as a single point of reference for your calls, relaying load and save functions to the underlying data sets.
- Parameters
data_sets (Optional[Dict[str, AbstractDataSet]]) – A dictionary of data set names and data set instances.
feed_dict (Optional[Dict[str, Any]]) – A feed dict with data to be added in memory.
layers (Optional[Dict[str, Set[str]]]) – A dictionary of data set layers. It maps a layer name to a set of data set names, according to the data engineering convention. For more details, see https://docs.kedro.org/en/stable/resources/glossary.html#layers-data-engineering-convention
Example:
```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

cars = CSVDataSet(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
io = DataCatalog(data_sets={'cars': cars})
```
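The layers argument can be passed alongside data_sets. A minimal sketch (reusing the cars data set from the example above, with a hypothetical "raw" layer name) might look like this:

```python
# Hypothetical layer mapping: each layer name maps to the set of
# data set names that belong to it.
io = DataCatalog(
    data_sets={"cars": cars},
    layers={"raw": {"cars"}},
)
```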
- add(data_set_name, data_set, replace=False)[source]¶
Adds a new AbstractDataSet object to the DataCatalog.
- Parameters
data_set_name (str) – A unique data set name which has not been registered yet.
data_set (AbstractDataSet) – A data set object to be associated with the given data set name.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
- Raises
DataSetAlreadyExistsError – When a data set with the same name has already been registered.
Example:
```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

io = DataCatalog(data_sets={
    'cars': CSVDataSet(filepath="cars.csv")
})
io.add("boats", CSVDataSet(filepath="boats.csv"))
```
- Return type
None
- add_all(data_sets, replace=False)[source]¶
Adds a group of new data sets to the DataCatalog.
- Parameters
data_sets (Dict[str, AbstractDataSet]) – A dictionary of DataSet names and data set instances.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
- Raises
DataSetAlreadyExistsError – When a data set with the same name has already been registered.
Example:
```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet, ParquetDataSet

io = DataCatalog(data_sets={
    "cars": CSVDataSet(filepath="cars.csv")
})
additional = {
    "planes": ParquetDataSet("planes.parq"),
    "boats": CSVDataSet(filepath="boats.csv")
}
io.add_all(additional)

assert io.list() == ["cars", "planes", "boats"]
```
- Return type
None
- add_feed_dict(feed_dict, replace=False)[source]¶
Adds instances of MemoryDataSet, containing the data provided through feed_dict.
- Parameters
feed_dict (Dict[str, Any]) – A feed dict with data to be added in memory.
replace (bool) – Specifies whether replacing an existing DataSet with the same name is allowed.
Example:
```python
import pandas as pd

from kedro.io import DataCatalog

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [4, 5],
                   'col3': [5, 6]})

io = DataCatalog()
io.add_feed_dict({
    'data': df
}, replace=True)

assert io.load("data").equals(df)
```
- Return type
None
- confirm(name)[source]¶
Confirm a dataset by its name.
- Parameters
name (str) – Name of the dataset.
- Raises
DataSetError – When the dataset does not have a confirm method.
- Return type
None
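Example (a minimal sketch; the IncrementalDataSet registered as "history" and its data/history folder are assumptions for illustration):

```python
from kedro.io import DataCatalog, IncrementalDataSet

# Hypothetical incremental data set: confirm() saves its checkpoint so that
# partitions processed in this run are skipped on the next load.
history = IncrementalDataSet(
    path="data/history",
    dataset="pandas.CSVDataSet",
)
io = DataCatalog(data_sets={"history": history})

partitions = io.load("history")  # unprocessed partitions only
# ... process the partitions ...
io.confirm("history")            # relays to history.confirm()
```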
- exists(name)[source]¶
Checks whether a registered data set exists by calling its exists() method. Raises a warning and returns False if exists() is not implemented.
- Parameters
name (str) – A data set to be checked.
- Return type
bool
- Returns
Whether the data set output exists.
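Example (a minimal sketch, reusing the cars data set from the earlier examples):

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

io = DataCatalog(data_sets={"cars": CSVDataSet(filepath="cars.csv")})

# True only if cars.csv is present on the filesystem; the check is
# relayed to the underlying data set's exists() method.
print(io.exists("cars"))
```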
- classmethod from_config(catalog, credentials=None, load_versions=None, save_version=None)[source]¶
Create a DataCatalog instance from configuration. This is a factory method used to provide developers with a way to instantiate DataCatalog with configuration parsed from configuration files.
- Parameters
catalog (Optional[Dict[str, Dict[str, Any]]]) – A dictionary whose keys are the data set names and the values are dictionaries with the constructor arguments for classes implementing AbstractDataSet. The data set class to be loaded is specified with the key type and its fully qualified class name. All kedro.io data sets can be specified by their class name only, i.e. their module name can be omitted.
credentials (Optional[Dict[str, Dict[str, Any]]]) – A dictionary containing credentials for different data sets. Use the credentials key in an AbstractDataSet to refer to the appropriate credentials as shown in the example below.
load_versions (Optional[Dict[str, str]]) – A mapping between dataset names and versions to load. Has no effect on data sets without enabled versioning.
save_version (Optional[str]) – Version string to be used for save operations by all data sets with enabled versioning. It must: a) be a case-insensitive string that conforms with operating system filename limitations, b) always return the latest version when sorted in lexicographical order.
- Return type
DataCatalog
- Returns
An instantiated DataCatalog containing all specified data sets, created and ready to use.
- Raises
DataSetError – When the method fails to create any of the data sets from their config.
DataSetNotFoundError – When load_versions refers to a dataset that doesn’t exist in the catalog.
Example:
```python
from kedro.io import DataCatalog

config = {
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": "cars.csv",
        "save_args": {"index": False}
    },
    "boats": {
        "type": "pandas.CSVDataSet",
        "filepath": "s3://aws-bucket-name/boats.csv",
        "credentials": "boats_credentials",
        "save_args": {"index": False}
    }
}

credentials = {
    "boats_credentials": {
        "client_kwargs": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>"
        }
    }
}

catalog = DataCatalog.from_config(config, credentials)

df = catalog.load("cars")
catalog.save("boats", df)
```
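The load_versions and save_version arguments only affect versioned entries. A minimal sketch (the versioned cars entry and the timestamps are assumptions for illustration):

```python
from kedro.io import DataCatalog

config = {
    "cars": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/cars.csv",
        "versioned": True,  # enables load/save versioning for this entry
    }
}

catalog = DataCatalog.from_config(
    config,
    # Pin the exact version to load instead of the latest available one.
    load_versions={"cars": "2023-01-01T00.00.00.000Z"},
    # Use a single version string for every save performed by this catalog.
    save_version="2023-01-02T00.00.00.000Z",
)
```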
- list(regex_search=None)[source]¶
List of all DataSet names registered in the catalog. This can be filtered by providing an optional regular expression which will only return matching keys.
- Parameters
regex_search (Optional[str]) – An optional regular expression which can be provided to limit the data sets returned by a particular pattern.
- Return type
List[str]
- Returns
A list of DataSet names available which match the regex_search criteria (if provided). All data set names are returned by default.
- Raises
SyntaxError – When an invalid regex filter is provided.
Example:
```python
from kedro.io import DataCatalog

io = DataCatalog()

# get data sets where the substring 'raw' is present
raw_data = io.list(regex_search='raw')
# get data sets which start with 'prm' or 'feat'
feat_eng_data = io.list(regex_search='^(prm|feat)')
# get data sets which end with 'time_series'
models = io.list(regex_search='.+time_series$')
```
- load(name, version=None)[source]¶
Loads a registered data set.
- Parameters
name (str) – A data set to be loaded.
version (Optional[str]) – Optional argument for concrete data version to be loaded. Works only with versioned datasets.
- Return type
Any
- Returns
The loaded data as configured.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
Example:
```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

cars = CSVDataSet(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
io = DataCatalog(data_sets={'cars': cars})

df = io.load("cars")
```
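For versioned data sets, the version argument pins which snapshot is read. A minimal sketch (the versioned cars entry and the timestamp are assumptions for illustration):

```python
from kedro.io import DataCatalog, Version
from kedro.extras.datasets.pandas import CSVDataSet

# Version(None, None) makes the data set versioned and defaults to the
# latest load version and a generated save version.
cars = CSVDataSet(filepath="data/cars.csv", version=Version(None, None))
io = DataCatalog(data_sets={"cars": cars})

# Load one specific version instead of the latest one.
df = io.load("cars", version="2023-01-01T00.00.00.000Z")
```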
- release(name)[source]¶
Release any cached data associated with a data set.
- Parameters
name (str) – A data set to be released.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
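Example (a minimal sketch; wrapping the cars CSV in a CachedDataSet is an assumption made to show the effect of releasing cached data):

```python
from kedro.io import CachedDataSet, DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

# Cached wrapper: the first load reads from disk, later loads hit memory.
cars = CachedDataSet(CSVDataSet(filepath="cars.csv"))
io = DataCatalog(data_sets={"cars": cars})

df = io.load("cars")   # reads cars.csv and caches the DataFrame
io.release("cars")     # drops the cached copy; the next load re-reads the file
```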
- save(name, data)[source]¶
Save data to a registered data set.
- Parameters
name (str) – A data set to be saved to.
data (Any) – A data object to be saved as configured in the registered data set.
- Raises
DataSetNotFoundError – When a data set with the given name has not yet been registered.
Example:
```python
import pandas as pd

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

cars = CSVDataSet(filepath="cars.csv",
                  load_args=None,
                  save_args={"index": False})
io = DataCatalog(data_sets={'cars': cars})

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [4, 5],
                   'col3': [5, 6]})
io.save("cars", df)
```
- Return type
None