Set up the data
Don’t forget to check the tutorial FAQ if you run into problems, or ask the community for help if you need it!
In this section, we discuss the data setup phase, which is the second part of the standard development workflow. The steps are as follows:
Add datasets to your data folder, according to data engineering convention
Register the datasets with the Kedro Data Catalog in conf/base/catalog.yml, which is the registry of all data sources available for use by the project. This ensures that your code is reproducible when it references datasets in different locations and/or environments.
You can find further information about the Data Catalog in specific documentation covering advanced usage.
If you are using the tutorial created by the spaceflights starter, you can omit the copy/paste steps below, but it is worth reviewing the files described.
The spaceflights tutorial makes use of three fictional datasets of companies shuttling customers to the Moon and back. The data comes in two different formats:
companies.csv contains data about space travel companies, such as their location, fleet count and rating
reviews.csv is a set of reviews from customers for categories, such as comfort and price
shuttles.xlsx is a set of attributes for spacecraft across the fleet, such as their engine type and passenger capacity
Download and save the files to the data/01_raw folder of your project.
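Before moving on, it can help to confirm that the three files are where the catalog will look for them. The following is a minimal sketch, assuming it is run from the project root; the helper name `missing_raw_files` is hypothetical and not part of Kedro.

```python
from pathlib import Path

# File names taken from the tutorial text above.
EXPECTED_RAW_FILES = ("companies.csv", "reviews.csv", "shuttles.xlsx")


def missing_raw_files(project_root):
    """Return the expected raw files that are absent from data/01_raw."""
    raw_dir = Path(project_root) / "data" / "01_raw"
    return [name for name in EXPECTED_RAW_FILES if not (raw_dir / name).is_file()]


# Assumes the current working directory is the project root.
print("Missing files:", missing_raw_files("."))
```

An empty list means all three files are present.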
Register the datasets
You now need to register the datasets so they can be loaded by Kedro. All Kedro projects have a conf/base/catalog.yml file, and you register each dataset by adding a named entry into the .yml file that includes the following:
File location (path)
Parameters for the given dataset
Type of data
First, for the spaceflights data, register the two csv datasets by adding this snippet to the end of the conf/base/catalog.yml file and saving it:
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

reviews:
  type: pandas.CSVDataSet
  filepath: data/01_raw/reviews.csv
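Each entry above is a top-level name mapped to a type and a filepath. As an illustration only, the sketch below writes the two entries as the plain Python dictionary a YAML parser would produce and checks that each one has both keys. The function `incomplete_entries` and the `REQUIRED_KEYS` tuple are hypothetical helpers, not a Kedro API.

```python
# The two CSV entries from catalog.yml, written as the dictionary
# a YAML parser would produce from the snippet above.
catalog_entries = {
    "companies": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/companies.csv",
    },
    "reviews": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/reviews.csv",
    },
}

# Assumption: the minimum keys for the file-based datasets in this tutorial.
REQUIRED_KEYS = ("type", "filepath")


def incomplete_entries(entries):
    """Map each dataset name to the required keys it lacks (hypothetical helper)."""
    problems = {}
    for name, config in entries.items():
        missing = [key for key in REQUIRED_KEYS if key not in config]
        if missing:
            problems[name] = missing
    return problems


print(incomplete_entries(catalog_entries))  # {} when every entry is complete
```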
Now register the xlsx dataset by adding this snippet to the end of the conf/base/catalog.yml file, and saving it:
shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  load_args:
    engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
This registration has an additional line: load_args, whose contents are passed to the Excel file read method (pd.read_excel) as keyword arguments. Although not specified here, the equivalent for output is save_args, and its value would be passed to the corresponding write method.
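The forwarding of load_args can be pictured with a small stand-in. This is a sketch of the general pattern, not Kedro's actual ExcelDataSet code: `SketchDataset` and `fake_reader` are hypothetical names, and `fake_reader` only echoes what it receives where a real dataset would call pd.read_excel.

```python
class SketchDataset:
    """Hypothetical stand-in for a dataset that forwards load_args to a reader."""

    def __init__(self, filepath, reader, load_args=None):
        self._filepath = filepath
        self._reader = reader
        # Copy so later changes to the caller's dict do not leak in.
        self._load_args = dict(load_args or {})

    def load(self):
        # Each key in load_args becomes a keyword argument of the reader.
        return self._reader(self._filepath, **self._load_args)


def fake_reader(path, **kwargs):
    """Echo the arguments instead of reading a file (illustration only)."""
    return {"path": path, "kwargs": kwargs}


shuttles = SketchDataset(
    "data/01_raw/shuttles.xlsx",
    reader=fake_reader,
    load_args={"engine": "openpyxl"},
)
print(shuttles.load())
```

With the catalog entry above, the effect is comparable to calling the reader with engine="openpyxl" as a keyword argument.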
Test that Kedro can load the CSV data
Open a kedro ipython session in your terminal from the project root directory.
Then type the following into the IPython prompt:
companies = catalog.load("companies")
companies.head()
The first command creates a variable (companies), which is of type pandas.DataFrame, and loads the dataset (also named companies, as per the top-level key in catalog.yml) from the underlying filepath data/01_raw/companies.csv. The second command, companies.head(), displays the first five rows of the DataFrame.
INFO     Loading data from 'companies' (CSVDataSet)
Out:
      id  company_rating       company_location  total_fleet_count  iata_approved
0  35029            100%                   Niue                4.0              f
1  30292             67%               Anguilla                6.0              f
2  19032             67%     Russian Federation                4.0              f
3   8238             91%               Barbados               15.0              t
4  30342             NaN  Sao Tome and Principe                2.0              t
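Conceptually, catalog.load("companies") looks up the entry by name and reads the file at its filepath. The stdlib-only sketch below mimics that lookup for a CSV file. `TinyCatalog` is a hypothetical class, it returns a list of dictionaries rather than a pandas.DataFrame, and it writes its own two-row sample file so that it runs without the tutorial data.

```python
import csv
import os
import tempfile


class TinyCatalog:
    """Hypothetical name-to-filepath registry (not the Kedro DataCatalog)."""

    def __init__(self, entries):
        self._entries = entries

    def load(self, name):
        if name not in self._entries:
            raise KeyError(f"Dataset '{name}' is not registered")
        with open(self._entries[name]["filepath"], newline="") as handle:
            return list(csv.DictReader(handle))


# Write a small sample file; values copied from the output shown above.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "companies.csv")
with open(sample_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "company_location"])
    writer.writerow(["35029", "Niue"])
    writer.writerow(["30292", "Anguilla"])

catalog = TinyCatalog({"companies": {"filepath": sample_path}})
companies = catalog.load("companies")
print(companies[0])
```

Asking this sketch for a name that was never registered raises a KeyError, which shows why every dataset needs a named entry before it can be loaded.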
Test that Kedro can load the xlsx data
To test that everything works as expected, load the dataset within IPython and display its first five rows:
shuttles = catalog.load("shuttles")
shuttles.head()
You should see output such as the following:
INFO     Loading data from 'shuttles' (ExcelDataSet)
Out:
      id       shuttle_location  shuttle_type  engine_type  ...  d_check_complete  moon_clearance_complete     price  company_id
0  63561                   Niue       Type V5      Quantum  ...                 f                        f  $1,325.0       35029
1  36260               Anguilla       Type V5      Quantum  ...                 t                        f  $1,780.0       30292
2  57015     Russian Federation       Type V5      Quantum  ...                 f                        f  $1,715.0       19032
3  14035               Barbados       Type V5       Plasma  ...                 f                        f  $4,770.0        8238
4  10036  Sao Tome and Principe       Type V2       Plasma  ...                 f                        f  $2,820.0       30342
When you have finished, close the ipython session.
Kedro supports numerous datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline.
You can find further information about how to add support for custom datasets in specific documentation covering advanced usage.
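As a rough idea of what a custom dataset involves, the sketch below implements load, save and describe behaviour for a plain-text file. It is written without Kedro imports so that it runs on its own. `TextFileDataset` is a hypothetical name, and the assumption that a real implementation would subclass Kedro's abstract dataset class and provide `_load`, `_save` and `_describe` methods should be checked against the custom datasets documentation for your Kedro version.

```python
import os
import tempfile


class TextFileDataset:
    """Hypothetical dataset that reads and writes a UTF-8 text file."""

    def __init__(self, filepath):
        self._filepath = filepath

    def _load(self):
        with open(self._filepath, encoding="utf-8") as handle:
            return handle.read()

    def _save(self, data):
        with open(self._filepath, "w", encoding="utf-8") as handle:
            handle.write(data)

    def _describe(self):
        # Summarise the configuration, for example for logging.
        return {"filepath": self._filepath}


# Round trip through a temporary file so the sketch leaves the project untouched.
notes_path = os.path.join(tempfile.mkdtemp(), "notes.txt")
notes = TextFileDataset(notes_path)
notes._save("moon and back")
print(notes._load())
```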