kedro_datasets.dask.ParquetDataset¶

class kedro_datasets.dask.ParquetDataset(filepath, load_args=None, save_args=None, credentials=None, fs_args=None, metadata=None)[source]¶

ParquetDataset loads and saves data to parquet file(s). It uses Dask remote data services to handle the corresponding load and save operations: https://docs.dask.org/en/latest/how-to/connect-to-remote-data.html

Example usage for the YAML API:

cars:
  type: dask.ParquetDataset
  filepath: s3://bucket_name/path/to/folder
  save_args:
    compression: GZIP
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: YOUR_SECRET

Example usage for the Python API:

 from kedro.extras.datasets.dask import ParquetDataset
 import pandas as pd
 import dask.dataframe as dd

 data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
...                      'col3': [[5, 6], [7, 8]]})
 ddf = dd.from_pandas(data, npartitions=2)

 dataset = ParquetDataset(
...     filepath="s3://bucket_name/path/to/folder",
...     credentials={
...         'client_kwargs':{
...             'aws_access_key_id': 'YOUR_KEY',
...             'aws_secret_access_key': 'YOUR SECRET',
...         }
...     },
...     save_args={"compression": "GZIP"}
... )
 dataset.save(ddf)
 reloaded = dataset.load()

 assert ddf.compute().equals(reloaded.compute())

The output schema can also be explicitly specified using Triad. This is processed to map specific columns to PyArrow field types or schema. For instance:

parquet_dataset:
  type: dask.ParquetDataset
  filepath: "s3://bucket_name/path/to/folder"
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: "YOUR SECRET"
  save_args:
    compression: GZIP
    schema:
      col1: [int32]
      col2: [int32]
      col3: [[int32]]

Attributes

`DEFAULT_LOAD_ARGS`
`DEFAULT_SAVE_ARGS`
`fs_args`	Property of optional file system parameters.

Methods

`exists`()	Checks whether a data set's output already exists by calling the provided _exists() method.
`from_config`(name, config[, load_version, ...])	Create a data set instance using the configuration provided.
`load`()	Loads data by delegation to the provided load method.
`release`()	Release any cached data.
`save`(data)	Saves data by delegation to the provided save method.

DEFAULT_LOAD_ARGS: Dict[str, Any] = {}¶

DEFAULT_SAVE_ARGS: Dict[str, Any] = {'write_index': False}¶

__init__(filepath, load_args=None, save_args=None, credentials=None, fs_args=None, metadata=None)[source]¶

Creates a new instance of ParquetDataset pointing to concrete parquet files.

Parameters:

filepath (str) – Filepath in POSIX format to a parquet file parquet collection or the directory of a multipart parquet.
load_args (Optional[Dict[str, Any]]) – Additional loading options dask.dataframe.read_parquet: https://docs.dask.org/en/latest/generated/dask.dataframe.read_parquet.html
save_args (Optional[Dict[str, Any]]) – Additional saving options for dask.dataframe.to_parquet: https://docs.dask.org/en/latest/generated/dask.dataframe.to_parquet.html
credentials (Optional[Dict[str, Any]]) – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {“token”: None}.
fs_args (Optional[Dict[str, Any]]) – Optional parameters to the backend file system driver: https://docs.dask.org/en/latest/how-to/connect-to-remote-data.html#optional-parameters
metadata (Optional[Dict[str, Any]]) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.

exists()¶

Checks whether a data set’s output already exists by calling the provided _exists() method.

Return type:: bool
Returns:: Flag indicating whether the output already exists.
Raises:: DatasetError – when underlying exists method raises error.

classmethod from_config(name, config, load_version=None, save_version=None)¶

Create a data set instance using the configuration provided.

Parameters:

name – Data set name.
config – Data set config dictionary.
load_version – Version string to be used for load operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.
save_version – Version string to be used for save operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.

Returns:

An instance of an AbstractDataset subclass.

Raises:

DatasetError – When the function fails to create the data set from its config.

property fs_args: Dict[str, Any]¶

Property of optional file system parameters.

Return type:: Dict[str, Any]
Returns:: A dictionary of backend file system parameters, including credentials.

load()¶

Loads data by delegation to the provided load method.

Return type:: TypeVar(_DO)
Returns:: Data returned by the provided load method.
Raises:: DatasetError – When underlying load method raises error.

release()¶

Release any cached data.

Raises:: DatasetError – when underlying release method raises error.
Return type:: None

save(data)¶

Saves data by delegation to the provided save method.

Parameters:

data (TypeVar(_DI)) – the value to be saved by provided save method.

Raises:

DatasetError – when underlying save method raises error.
FileNotFoundError – when save method got file instead of dir, on Windows.
NotADirectoryError – when save method got file instead of dir, on Unix.

Return type:

None