FAQs

This is a growing set of technical FAQs. The product FAQs on the Kedro website explain how Kedro can answer the typical use cases and requirements of data scientists, data engineers, machine learning engineers and product owners.

Installing Kedro

  • How do I install a development version of Kedro?

  • How can I check the version of Kedro installed? To check the version installed, type kedro -V in your terminal window.

  • Do I need Git installed to use Kedro? Yes, users are expected to have Git installed when working with Kedro. This is a prerequisite for the kedro new flow. If Git is not installed, use the following workaround: kedro new -s https://github.com/kedro-org/kedro-starters/archive/0.18.6.zip --directory=pandas-iris
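As a small illustration of the two checks above, the following Python sketch reports the installed Kedro version and whether Git can be found on the PATH. The helper names installed_version and git_available are hypothetical and are not part of Kedro. The sketch uses only the standard library (importlib.metadata and shutil.which).

```python
import shutil
from importlib import metadata
from typing import Optional


def installed_version(package: str = "kedro") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def git_available() -> bool:
    """Return True if a `git` executable can be found on the PATH."""
    return shutil.which("git") is not None


if __name__ == "__main__":
    print("Kedro version:", installed_version() or "not installed")
    print("Git available:", git_available())
```

If git_available() returns False, the zip-archive form of kedro new shown above is the workaround to use.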

Kedro documentation

Working with Notebooks

Kedro project development

Configuration

Advanced topics

Nodes and pipelines

What is the data engineering convention?

Bruce Philp and Guilherme Braccialli are the brains behind a layered data-engineering convention as a model for managing data. You can find an in-depth walkthrough of their convention in a blog post on Medium.

Refer to the table below for a high-level guide to each layer’s purpose.

Note: The data layers don’t have to exist locally in the data folder within your project, but we recommend that you structure your S3 buckets or other data stores in a similar way.

| Folder in data | Description |
| --- | --- |
| Raw | The start of the pipeline, containing the sourced data model(s) that should never be changed. It forms your single source of truth to work from. These data models are typically untyped (e.g. CSV), but this varies from case to case. |
| Intermediate | Optional data model(s), introduced to type your raw data model(s), e.g. converting string-based values into their typed representation. |
| Primary | Domain-specific data model(s) containing cleansed, transformed and wrangled data from either raw or intermediate. This layer is the input to your feature engineering. |
| Feature | Analytics-specific data model(s) containing a set of features defined against the primary data, grouped by feature area of analysis and stored against a common dimension. |
| Model input | Analytics-specific data model(s) containing all feature data against a common dimension and, in the case of live projects, against an analytics run date so that you can track historical changes to the features over time. |
| Models | Stored, serialised pre-trained machine learning models. |
| Model output | Analytics-specific data model(s) containing the results generated by the model from the model input data. |
| Reporting | Reporting data model(s) that combine a set of primary, feature, model input and model output data to drive the dashboard and the views constructed. This layer encapsulates any blending or joining of data so that it does not need to be defined elsewhere, improves performance, and allows the presentation layer to be replaced without redefining the data models. |
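To make the layering concrete, here is a minimal Python sketch that derives one folder path per layer under a common root, which could be the local data folder or a bucket prefix. It assumes the numbered folder names (01_raw to 08_reporting) used by Kedro's default project template. The function name layer_folders and the example bucket name are hypothetical.

```python
# Layer names in pipeline order, taken from the table above.
LAYERS = [
    "raw",
    "intermediate",
    "primary",
    "feature",
    "model_input",
    "models",
    "model_output",
    "reporting",
]


def layer_folders(root: str = "data") -> dict:
    """Map each layer name to a numbered folder path under `root`.

    The two-digit prefix (an assumption based on the default project
    template) keeps the folders sorted in pipeline order.
    """
    base = root.rstrip("/")
    return {
        name: f"{base}/{index:02d}_{name}"
        for index, name in enumerate(LAYERS, start=1)
    }


if __name__ == "__main__":
    # Local layout, then the same layout under a hypothetical S3 prefix.
    for path in layer_folders().values():
        print(path)
    for path in layer_folders("s3://my-bucket/project/data").values():
        print(path)
```

Because the same function serves both the local folder and a remote prefix, the two layouts mirror each other, as the note above recommends.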