Kedro spaceflights tutorial¶
In this tutorial, we construct nodes and pipelines for a price-prediction model to illustrate the steps of a typical Kedro workflow.
In the text, we assume you have started with an empty Kedro project and we show the steps necessary to convert it into a working project. The tutorial guides you to copy and paste example code into the Kedro project. It takes approximately one hour to complete.
Note
You may prefer to get up and running more swiftly. We also provide the example as a Kedro starter so you can follow along without copy/pasting.
Scenario¶
It is 2160, and the space tourism industry is booming. Globally, thousands of space shuttle companies take tourists to the Moon and back. You have been able to source data that lists the amenities offered in each space shuttle, customer reviews, and company information.
Project: You want to construct a model that predicts the price for each trip to the Moon and the corresponding return flight.
Get help¶
If you hit an issue with the tutorial, the Kedro community can help!
Things you can do:
check the spaceflights tutorial FAQ to see if we have answered the question already
use the spaceflights starter to create a new, separate project which contains working example code, and compare that project with your own
use Kedro-Viz to visualise your project to better understand how the datasets, nodes and pipelines fit together
use the #questions channel in our Slack workspace (which replaces our Discord server) to ask the community for help
search the archive of Discord discussions
Terminology¶
We will explain any Kedro-specific terminology we use in the tutorial as we introduce it. We use additional terminology that may not be familiar to some readers, such as the concepts below.
Project root directory¶
Also known as the “root directory”, this is the parent folder for the entire project. It is the top-level folder that contains all other files and directories associated with the project.
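As a minimal sketch of the idea, the function below walks upwards from any folder until it finds a marker file and treats that folder as the project root. The helper name find_project_root is hypothetical, and the use of pyproject.toml as the marker is an assumption for illustration; adjust it to whichever file sits at the top level of your own project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker file name is an assumption, not a Kedro rule.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # no marker file found anywhere above `start`


if __name__ == "__main__":
    print(find_project_root(Path.cwd()))
```

Run from any subfolder of a project, this prints the folder that holds the marker file, or None if there is no such folder.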
Dependencies¶
These are Python packages or libraries that an individual project depends upon to complete a task. For example, the Spaceflights tutorial project depends on the scikit-learn library.
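One way to see which dependencies are present in your environment is to query the installed package metadata. The sketch below uses the standard library's importlib.metadata; the helper name installed_versions and the list of packages checked are illustrative assumptions, not part of the tutorial.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Illustrative list only; a real project declares its own dependencies.
    for name, ver in installed_versions(["kedro", "scikit-learn"]).items():
        print(f"{name}: {ver or 'not installed'}")
```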
Standard development workflow¶
When you build a Kedro project, you will typically follow a standard development workflow:
Set up the project template
    Create a new project and install project dependencies
    Configure credentials and any other sensitive/personal content, and logging
Set up the data
    Add data to the data folder
    Reference all datasets for the project
Create the pipeline
    Construct nodes to make up the pipeline
    Choose how to run the pipeline: sequentially or in parallel
Package the project
    Build the project documentation
    Package the project for distribution
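To make the "Create the pipeline" step more concrete, here is a plain-Python sketch of nodes connected by named datasets and run sequentially. It does not use Kedro's API: the Node tuple, the run_sequentially function, the two example functions and the sample prices are all hypothetical and exist only to illustrate how nodes pass data to one another.

```python
from typing import Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node: a function plus dataset names."""
    func: Callable
    inputs: List[str]
    output: str


def run_sequentially(nodes: List[Node], catalog: Dict[str, object]) -> Dict[str, object]:
    """Run one node at a time, picking any node whose inputs are already available."""
    data = dict(catalog)
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("Some node inputs can never be produced; check dataset names")
        node = ready[0]
        data[node.output] = node.func(*(data[name] for name in node.inputs))
        pending.remove(node)
    return data


def clean_prices(raw_prices):
    # Strip currency formatting such as "$1,000" and convert to float.
    return [float(p.replace("$", "").replace(",", "")) for p in raw_prices]


def average_price(prices):
    return sum(prices) / len(prices)


pipeline = [
    Node(average_price, ["prices"], "mean_price"),  # listed first, but runs second
    Node(clean_prices, ["raw_prices"], "prices"),
]

result = run_sequentially(pipeline, {"raw_prices": ["$1,000", "$3,000"]})
print(result["mean_price"])  # 2000.0
```

The order in which the nodes are listed does not matter in this sketch: each node runs once the datasets it needs exist, so the cleaning step runs before the averaging step.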
Optional: source control with git¶
You don’t need to complete this section for the tutorial, but you may want to familiarise yourself with the use of git for source control.
Note
For further information about this topic, check out this post about version control for data scientists
If you want to learn more about a typical git workflow, we suggest you look into Gitflow.
Navigate to the project root directory and create a git repository on your machine (a local repository) for the project:
git init
git remote add origin https://github.com/<your-repo>
Submit your changes to GitHub¶
If you work on a project as part of a team, you will share the git repository via GitHub, which stores a shared copy of the repository. You should periodically commit your changes to your local repository and push them to the GitHub repository.
Within your team, we suggest that you each develop your code on a branch and create pull requests to submit it to the develop or main branches:
# create a new feature branch called 'feature/project-template'
git checkout -b feature/project-template
# stage all the files you have changed
git add .
# commit changes to git with an instructive message
git commit -m 'Create project template'
# push changes to remote branch
git push origin feature/project-template
It isn’t necessary to branch, but if everyone in a team works on the same branch (e.g. main), you might have to resolve merge conflicts more often. Here is an example of working directly on main:
# stage all files
git add .
# commit changes to git with an instructive message
git commit -m 'Create project template'
# push changes to remote main
git push origin main