Kedro Data Catalog¶
KedroDataCatalog
retains the core functionality of DataCatalog
, with a few API enhancements. For a comprehensive understanding, we recommend reviewing the existing DataCatalog
documentation before exploring the additional functionality of KedroDataCatalog
.
This page highlights the new features and provides usage examples:
How to make KedroDataCatalog
the default catalog for Kedro run
¶
To set KedroDataCatalog
as the default catalog for the kedro run
command and other CLI commands, update your settings.py
as follows:
from kedro.io import KedroDataCatalog
DATA_CATALOG_CLASS = KedroDataCatalog
Once this change is made, you can run your Kedro project as usual.
For more information on settings.py
, refer to the Project settings documentation.
How to access datasets in the catalog¶
You can retrieve a dataset from the catalog using either the dictionary-like syntax or the get
method:
reviews_ds = catalog["reviews"]
reviews_ds = catalog.get("reviews", default=default_ds)
How to add datasets to the catalog¶
The new API allows you to add datasets as well as raw data directly to the catalog:
from kedro_datasets.pandas import CSVDataset
bikes_ds = CSVDataset(filepath="../data/01_raw/bikes.csv")
catalog["bikes"] = bikes_ds # Adding a dataset
catalog["cars"] = ["Ferrari", "Audi"] # Adding raw data
When you add raw data, it is automatically wrapped in a MemoryDataset
under the hood.
How to iterate trough datasets in the catalog¶
KedroDataCatalog
supports iteration over dataset names (keys), datasets (values), and both (items). Iteration defaults to dataset names, similar to standard Python dictionaries:
for ds_name in catalog: # __iter__ defaults to keys
pass
for ds_name in catalog.keys(): # Iterate over dataset names
pass
for ds in catalog.values(): # Iterate over datasets
pass
for ds_name, ds in catalog.items(): # Iterate over (name, dataset) tuples
pass
How to get the number of datasets in the catalog¶
You can get the number of datasets in the catalog using the len()
function:
ds_count = len(catalog)
How to print the full catalog and individual datasets¶
To print the catalog or an individual dataset programmatically, use the print()
function or in an interactive environment like IPython or JupyterLab, simply enter the variable:
In [1]: catalog
Out[1]: {'shuttles': kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'}), 'preprocessed_companies': kedro_datasets.pandas.parquet_dataset.ParquetDataset(filepath=PurePosixPath('/data/02_intermediate/preprocessed_companies.pq'), protocol='file', load_args={}, save_args={}), 'params:model_options.test_size': kedro.io.memory_dataset.MemoryDataset(data='<float>'), 'params:model_options.features': kedro.io.memory_dataset.MemoryDataset(data='<list>'))}
In [2]: catalog["shuttles"]
Out[2]: kedro_datasets.pandas.excel_dataset.ExcelDataset(filepath=PurePosixPath('/data/01_raw/shuttles.xlsx'), protocol='file', load_args={'engine': 'openpyxl'}, save_args={'index': False}, writer_args={'engine': 'openpyxl'})
How to access dataset patterns¶
The pattern resolution logic in KedroDataCatalog
is handled by the config_resolver
, which can be accessed as a property of the catalog:
config_resolver = catalog.config_resolver
ds_config = catalog.config_resolver.resolve_pattern(ds_name) # Resolving a dataset pattern
patterns = catalog.config_resolver.list_patterns() # Listing all available patterns
Note
KedroDataCatalog
does not support all dictionary-specific methods, such as pop()
, popitem()
, or deletion by key (del
).
For a full list of supported methods, refer to the KedroDataCatalog source code.