kedro_datasets.dask.ParquetDataset

class kedro_datasets.dask.ParquetDataset(*, filepath, load_args=None, save_args=None, credentials=None, fs_args=None, metadata=None)

ParquetDataset loads and saves data to parquet file(s). It uses Dask remote data services to handle the corresponding load and save operations: https://docs.dask.org/en/latest/how-to/connect-to-remote-data.html

Example usage for the YAML API:

cars:
  type: dask.ParquetDataset
  filepath: s3://bucket_name/path/to/folder
  save_args:
    compression: GZIP
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: YOUR_SECRET

Example usage for the Python API:

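import tempfile
from pathlib import Path

import dask.dataframe as dd
import numpy as np
import pandas as pd
from kedro_datasets.dask import ParquetDataset

# tmp_path stands in for any writable directory (a pytest fixture in the original doctest).
tmp_path = Path(tempfile.mkdtemp())

data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [6, 7]})
ddf = dd.from_pandas(data, npartitions=2)

# filepath is documented as a str, so the Path is converted explicitly.
dataset = ParquetDataset(
    filepath=str(tmp_path / "path/to/folder"), save_args={"compression": "GZIP"}
)
dataset.save(ddf)
reloaded = dataset.load()

# Both frames are lazy Dask collections; compute() materialises them for comparison.
assert np.array_equal(ddf.compute(), reloaded.compute())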

The output schema can also be specified explicitly using Triad's schema grammar. The specification is processed to map the named columns to PyArrow field types, or to a full PyArrow schema. For instance:

parquet_dataset:
  type: dask.ParquetDataset
  filepath: "s3://bucket_name/path/to/folder"
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: YOUR_SECRET
  save_args:
    compression: GZIP
    schema:
      col1: [int32]
      col2: [int32]
      col3: [[int32]]

Attributes

`DEFAULT_LOAD_ARGS`
`DEFAULT_SAVE_ARGS`
`fs_args`	Property of optional file system parameters.

Methods

`exists`()	Checks whether a dataset's output already exists by calling the provided _exists() method.
`from_config`(name, config[, load_version, ...])	Create a dataset instance using the configuration provided.
`load`()	Loads data by delegation to the provided load method.
`release`()	Release any cached data.
`save`(data)	Saves data by delegation to the provided save method.

DEFAULT_LOAD_ARGS: dict[str, Any] = {}

DEFAULT_SAVE_ARGS: dict[str, Any] = {'write_index': False}
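
The class-level defaults are presumably combined with any `save_args` passed to the constructor. The following is a minimal, library-free sketch of that kind of merge, assuming user-supplied values override the defaults. `merge_args` is a hypothetical helper and not part of kedro_datasets.

```python
from copy import deepcopy
from typing import Any, Optional

# Mirrors the documented class attribute; the merge logic below is an assumption.
DEFAULT_SAVE_ARGS: dict[str, Any] = {"write_index": False}


def merge_args(
    defaults: dict[str, Any], overrides: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Return a copy of ``defaults`` updated with ``overrides`` (hypothetical helper)."""
    merged = deepcopy(defaults)  # never mutate the shared class-level dict
    merged.update(overrides or {})
    return merged


save_args = merge_args(DEFAULT_SAVE_ARGS, {"compression": "GZIP"})
```

Under this assumption, passing `{"write_index": True}` would replace the default, while omitting `save_args` would leave `write_index` set to `False`.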

__init__(*, filepath, load_args=None, save_args=None, credentials=None, fs_args=None, metadata=None)

Creates a new instance of ParquetDataset pointing to concrete parquet files.

Parameters:

filepath (str) – Filepath in POSIX format to a parquet file, a parquet collection, or the directory of a multipart parquet.
load_args (Optional[dict[str, Any]]) – Additional loading options for dask.dataframe.read_parquet: https://docs.dask.org/en/latest/generated/dask.dataframe.read_parquet.html
save_args (Optional[dict[str, Any]]) – Additional saving options for dask.dataframe.to_parquet: https://docs.dask.org/en/latest/generated/dask.dataframe.to_parquet.html
credentials (Optional[dict[str, Any]]) – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.
fs_args (Optional[dict[str, Any]]) – Optional parameters to the backend file system driver: https://docs.dask.org/en/latest/how-to/connect-to-remote-data.html#optional-parameters
metadata (Optional[dict[str, Any]]) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.

exists()

Checks whether a dataset’s output already exists by calling the provided _exists() method.

Return type: bool
Returns: Flag indicating whether the output already exists.
Raises: DatasetError – when the underlying exists method raises an error.

classmethod from_config(name, config, load_version=None, save_version=None)

Create a dataset instance using the configuration provided.

Parameters:

name (str) – Dataset name.
config (dict[str, Any]) – Dataset config dictionary.
load_version (Optional[str]) – Version string to be used for load operation if the dataset is versioned. Has no effect on the dataset if versioning was not enabled.
save_version (Optional[str]) – Version string to be used for save operation if the dataset is versioned. Has no effect on the dataset if versioning was not enabled.

Return type:

AbstractDataset

Returns:

An instance of an AbstractDataset subclass.

Raises:

DatasetError – When the function fails to create the dataset from its config.

property fs_args: dict[str, Any]

Property of optional file system parameters.

Return type: dict[str, Any]
Returns: A dictionary of backend file system parameters, including credentials.
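
As a rough illustration of the return value described above, the sketch below combines file system options with credentials into a single dictionary. It is a plain-Python model rather than the library's code: `combined_fs_args` is a hypothetical function, the `anon` option is only a placeholder, and letting credentials win on key clashes is an assumption.

```python
from copy import deepcopy
from typing import Any, Optional


def combined_fs_args(
    fs_args: Optional[dict[str, Any]], credentials: Optional[dict[str, Any]]
) -> dict[str, Any]:
    """Merge file system options and credentials into one dict (hypothetical helper)."""
    merged = deepcopy(fs_args) if fs_args else {}
    # Assumption: credentials are layered on top of the other options.
    merged.update(deepcopy(credentials) if credentials else {})
    return merged


options = combined_fs_args(
    {"anon": False},  # placeholder file system option
    {
        "client_kwargs": {
            "aws_access_key_id": "YOUR_KEY",
            "aws_secret_access_key": "YOUR_SECRET",
        }
    },
)
```

A dictionary of this shape is the kind of value that Dask accepts as `storage_options` when reading from or writing to remote storage.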

load()

Loads data by delegation to the provided load method.

Return type: DataFrame
Returns: Data returned by the provided load method.
Raises: DatasetError – when the underlying load method raises an error.

release()

Release any cached data.

Raises: DatasetError – when the underlying release method raises an error.
Return type: None

save(data)

Saves data by delegation to the provided save method.

Parameters:

data (DataFrame) – The value to be saved by the provided save method.

Raises:

DatasetError – when the underlying save method raises an error.
FileNotFoundError – when the save method is given a file instead of a directory, on Windows.
NotADirectoryError – when the save method is given a file instead of a directory, on Unix.

Return type:

None
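
The two platform-specific exceptions listed for `save` can be reproduced without Dask. The sketch below uses only the standard library and hypothetical file names. It tries to create a part file underneath a path that is a regular file rather than a directory, which is the situation described above; whether the dataset reaches the file system in exactly this way is an assumption.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.parquet"
    target.write_bytes(b"")  # the save target is a regular file, not a folder

    try:
        # Writing a part file "inside" a regular file cannot succeed.
        (target / "part.0.parquet").write_bytes(b"")
        raised = None
    except (FileNotFoundError, NotADirectoryError) as err:
        # Which of the two is raised depends on the operating system.
        raised = type(err).__name__
```

Both exceptions derive from `OSError`, so catching `OSError` handles either platform.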