Kedro as a data registry

In some projects you may want to share a Jupyter Notebook with others, which means you need to avoid hard-coded file paths for data access.

One solution is to set up a lightweight Kedro project that uses the Kedro DataCatalog as a registry for the data, without using any of the other features of Kedro.

The Kedro starter with alias standalone-datacatalog (formerly known as mini-kedro) provides this kind of minimal functionality.
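To make the registry idea concrete, the following plain-Python sketch shows name-based data access. It is not Kedro's implementation and uses no Kedro API: the `registry` dictionary, the `load` helper, and the `example.csv` file are hypothetical stand-ins for what `catalog.yml` and the DataCatalog provide in a real project.

```python
import csv
import tempfile
from pathlib import Path

# Create a throwaway CSV file so the sketch is self-contained.
data_dir = Path(tempfile.mkdtemp())
with open(data_dir / "example.csv", "w", newline="") as handle:
    csv.writer(handle).writerows(
        [["sepal_length", "species"], ["5.1", "setosa"]]
    )

# Hypothetical registry: dataset names mapped to storage details.
# In a Kedro project, conf/base/catalog.yml plays this role.
registry = {
    "example_dataset_1": {"filepath": data_dir / "example.csv"},
}


def load(name):
    """Look up a dataset by name and return its rows as dictionaries."""
    entry = registry[name]  # raises KeyError for an unregistered name
    with open(entry["filepath"], newline="") as handle:
        return list(csv.DictReader(handle))


# Notebook code refers only to the dataset name, never to the path.
rows = load("example_dataset_1")
print(rows[0]["species"])
```

Because the notebook cell mentions only `example_dataset_1`, the file can move without the notebook changing; only the registry entry is updated.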

Usage

Use the standalone-datacatalog starter to create a new project:

kedro new --starter=standalone-datacatalog

The starter comprises a minimal setup to use the traditional Iris dataset with Kedro’s DataCatalog.

The starter contains:

  • A conf directory, which contains an example DataCatalog configuration (catalog.yml):

# conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace

  • A data directory, which contains an example dataset identical to the one used by the pandas-iris starter

  • An example Jupyter Notebook, which shows how to instantiate the DataCatalog and interact with the example dataset:

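df = catalog.load("example_dataset_1")
# save() takes the dataset name and the data to write; it returns None.
# df_2 stands for a Spark DataFrame that you have already created.
catalog.save("example_dataset_2", df_2)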