Set up the data¶
Note
Don’t forget to check the tutorial FAQ if you run into problems, or ask the community for help if you need it!
In this section, we discuss the data setup phase, which is the second part of the standard development workflow. The steps are as follows:
Add datasets to your
data
folder, according to data engineering conventionRegister the datasets with the Kedro Data Catalog in
conf/base/catalog.yml
, which is the registry of all data sources available for use by the project. This ensures that your code is reproducible when it references datasets in different locations and/or environments.
You can find further information about the Data Catalog in specific documentation covering advanced usage.
Note
If you are using the tutorial created by the spaceflights starter, you can omit the copy/paste steps below, but it is worth reviewing the files described.
Download datasets¶
The spaceflights tutorial makes use of three fictional datasets of companies shuttling customers to the Moon and back. The data comes in two different formats: .csv
and .xlsx
:
companies.csv
contains data about space travel companies, such as their location, fleet count and ratingreviews.csv
is a set of reviews from customers for categories, such as comfort and priceshuttles.xlsx
is a set of attributes for spacecraft across the fleet, such as their engine type and passenger capacity
Download and save the files to the data/01_raw
folder of your project:
Register the datasets¶
You now need to register the datasets so they can be loaded by Kedro. All Kedro projects have a conf/base/catalog.yml
file, and you register each dataset by adding a named entry into the .yml
file that includes the following:
File location (path)
Parameters for the given dataset
Type of data
Versioning
Register csv
data¶
First, for the spaceflights data, register the two csv
datasets by adding this snippet to the end of the conf/base/catalog.yml
file and saving it:
companies:
type: pandas.CSVDataSet
filepath: data/01_raw/companies.csv
reviews:
type: pandas.CSVDataSet
filepath: data/01_raw/reviews.csv
Register xlsx
data¶
Now register the xlsx
dataset by adding this snippet to the end of the conf/base/catalog.yml
file, and saving it:
shuttles:
type: pandas.ExcelDataSet
filepath: data/01_raw/shuttles.xlsx
load_args:
engine: openpyxl # Use modern Excel engine (the default since Kedro 0.18.0)
This registration has an additional line: load_args
, which is passed to the excel file read method (pd.read_excel
) as a keyword argument. Although not specified here, the equivalent output is save_args
and the value would be passed to pd.DataFrame.to_excel
method.
Test that Kedro can load the csv
data¶
Open a kedro ipython
session in your terminal from the project root directory:
kedro ipython
Then type the following into the IPython prompt:
companies = catalog.load("companies")
companies.head()
The first command creates a variable (
companies
), which is of typepandas.DataFrame
and loads the dataset (also namedcompanies
as per top-level key incatalog.yml
) from the underlying filepathdata/01_raw/companies.csv
.The
head
method frompandas
displays the first five rows of the DataFrame.
INFO Loading data from 'companies' (CSVDataSet)
Out[1]:
id company_rating company_location total_fleet_count iata_approved
0 35029 100% Niue 4.0 f
1 30292 67% Anguilla 6.0 f
2 19032 67% Russian Federation 4.0 f
3 8238 91% Barbados 15.0 t
4 30342 NaN Sao Tome and Principe 2.0 t
Test that Kedro can load the xlsx
data¶
To test that everything works as expected, load the dataset within IPython and display its first five rows:
shuttles = catalog.load("shuttles")
shuttles.head()
You should see output such as the following:
INFO Loading data from 'shuttles' (ExcelDataSet)
Out[1]:
id shuttle_location shuttle_type engine_type ... d_check_complete moon_clearance_complete price company_id
0 63561 Niue Type V5 Quantum ... f f $1,325.0 35029
1 36260 Anguilla Type V5 Quantum ... t f $1,780.0 30292
2 57015 Russian Federation Type V5 Quantum ... f f $1,715.0 19032
3 14035 Barbados Type V5 Plasma ... f f $4,770.0 8238
4 10036 Sao Tome and Principe Type V2 Plasma ... f f $2,820.0 30342
When you have finished, close ipython
session with exit()
.
Further information¶
Custom data¶
Kedro supports numerous datasets out of the box, but you can also add support for any proprietary data format or filesystem in your pipeline.
You can find further information about how to add support for custom datasets in specific documentation covering advanced usage.