Kedro as a data registry
In some projects, you may want to share a Jupyter Notebook with others, which means the notebook should not rely on hard-coded file paths for data access.
One solution is to set up a lightweight Kedro project that uses the Kedro DataCatalog as a registry for the data, without using any of the other features of Kedro.
The Kedro starter with alias standalone-datacatalog (formerly known as mini-kedro) provides this kind of minimal functionality.
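To make the registry idea concrete, here is a minimal sketch in plain Python. It is not Kedro's DataCatalog and does not use any Kedro API: the MiniCatalog class, the "measurements" dataset name and the temporary directory are all hypothetical and exist only for illustration. The sketch shows the pattern the starter relies on: file paths live in one configuration mapping, and notebook code refers to datasets by name.

```python
import csv
import tempfile
from pathlib import Path


class MiniCatalog:
    """Toy name-to-file registry; a sketch of the idea, not Kedro's DataCatalog."""

    def __init__(self, entries):
        # entries maps a dataset name to a config dict with a "filepath" key
        self._entries = entries

    def _path(self, name):
        if name not in self._entries:
            raise KeyError(f"Dataset '{name}' is not registered")
        return Path(self._entries[name]["filepath"])

    def load(self, name):
        # Read the CSV registered under `name` as a list of row dicts
        with self._path(name).open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, name, rows):
        # Write row dicts to the CSV registered under `name`
        # (assumes `rows` holds at least one row, used for the header)
        path = self._path(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


# The configuration is the only place a path appears.
data_dir = Path(tempfile.mkdtemp())
catalog = MiniCatalog(
    {"measurements": {"filepath": str(data_dir / "measurements.csv")}}
)

# Notebook-style code refers to the dataset by name only.
catalog.save(
    "measurements",
    [{"id": "1", "length": "5.1"}, {"id": "2", "length": "4.9"}],
)
rows = catalog.load("measurements")
```

If the data moves, only the configuration mapping changes and the notebook cells that call load and save stay the same. Kedro's DataCatalog applies the same principle, with the mapping kept in catalog.yml.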
Usage
Use the standalone-datacatalog starter to create a new project:
kedro new --starter=standalone-datacatalog
The starter comprises a minimal setup to use the traditional Iris dataset with Kedro’s DataCatalog.
The starter contains:
- A conf directory, which contains an example DataCatalog configuration (catalog.yml):

# conf/base/catalog.yml
example_dataset_1:
  type: pandas.CSVDataSet
  filepath: folder/filepath.csv

example_dataset_2:
  type: spark.SparkDataSet
  filepath: s3a://your_bucket/data/01_raw/example_dataset_2*
  credentials: dev_s3
  file_format: csv
  save_args:
    if_exists: replace

- A data directory, which contains an example dataset identical to the one used by the pandas-iris starter
- An example Jupyter Notebook, which shows how to instantiate the DataCatalog and interact with the example dataset:
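df = catalog.load("example_dataset_1")
# save() takes the dataset name and the data to write; it returns None.
# df_2 stands for a Spark DataFrame prepared elsewhere in the notebook.
catalog.save("example_dataset_2", df_2)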