ParquetDataset¶
ParquetDataset loads and saves data to parquet file(s). It uses Dask remote data services to handle the corresponding load and save operations.
kedro_datasets.dask.ParquetDataset ¶
ParquetDataset(
*,
filepath,
load_args=None,
save_args=None,
credentials=None,
fs_args=None,
metadata=None
)
Bases: AbstractDataset[DataFrame, DataFrame]
For details on the Dask remote data services used for load and save operations, see:
https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html
Examples:
Using the YAML API:
cars:
  type: dask.ParquetDataset
  filepath: s3://bucket_name/path/to/folder
  save_args:
    compression: GZIP
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: YOUR_SECRET
Using the Python API:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> import numpy as np
>>> from kedro_datasets.dask import ParquetDataset
>>>
>>> data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [6, 7]})
>>> ddf = dd.from_pandas(data, npartitions=2)
>>>
>>> dataset = ParquetDataset(
...     filepath=tmp_path / "path/to/folder", save_args={"compression": "GZIP"}
... )
>>> dataset.save(ddf)
>>> reloaded = dataset.load()
>>> assert np.array_equal(ddf.compute(), reloaded.compute())
The output schema can also be specified explicitly using Triad's grammar for schema definition. The dataset processes it to map the listed columns to PyArrow field types, or to a full PyArrow schema. For instance:
parquet_dataset:
  type: dask.ParquetDataset
  filepath: "s3://bucket_name/path/to/folder"
  credentials:
    client_kwargs:
      aws_access_key_id: YOUR_KEY
      aws_secret_access_key: "YOUR SECRET"
  save_args:
    compression: GZIP
    schema:
      col1: [int32]
      col2: [int32]
      col3: [[int32]]
Parameters:
- filepath (str) – Filepath in POSIX format to a parquet file, a parquet collection, or the directory of a multipart parquet.
- load_args (dict[str, Any] | None, default: None) – Additional loading options for dask.dataframe.read_parquet: https://docs.dask.org/en/stable/generated/dask.dataframe.read_parquet.html
- save_args (dict[str, Any] | None, default: None) – Additional saving options for dask.dataframe.to_parquet: https://docs.dask.org/en/stable/generated/dask.dataframe.to_parquet.html
- credentials (dict[str, Any] | None, default: None) – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.
- fs_args (dict[str, Any] | None, default: None) – Optional parameters to the backend file system driver: https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html#optional-parameters
- metadata (dict[str, Any] | None, default: None) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
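The parameters above end up as arguments to Dask's parquet functions. The following is a minimal sketch of how a load call might be assembled, not the dataset's actual implementation. It assumes that the file system parameters are passed through Dask's `storage_options` argument and that `load_args` are expanded as extra keyword arguments. The helper name `read_parquet_call` is hypothetical, and the `anon` value is only an illustrative s3fs option.

```python
from typing import Any, Optional


def read_parquet_call(
    filepath: str,
    load_args: Optional[dict[str, Any]] = None,
    storage_options: Optional[dict[str, Any]] = None,
) -> tuple[str, dict[str, Any]]:
    """Assemble the path and keyword arguments for dask.dataframe.read_parquet.

    Assumption: file system parameters travel as ``storage_options`` and
    ``load_args`` are expanded as additional keyword arguments.
    """
    # Copy so the caller's dictionaries are never mutated.
    kwargs: dict[str, Any] = {"storage_options": dict(storage_options or {})}
    kwargs.update(load_args or {})
    return filepath, kwargs


path, kwargs = read_parquet_call(
    "s3://bucket_name/path/to/folder",
    load_args={"columns": ["col1", "col2"]},  # read only these columns
    storage_options={"anon": False},  # illustrative s3fs option
)
# With Dask installed, the call would then be: dd.read_parquet(path, **kwargs)
```

Saving would plausibly follow the same pattern with `save_args` and `dask.dataframe.to_parquet`.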
Source code in kedro_datasets/dask/parquet_dataset.py
fs_args property ¶
fs_args
Property of optional file system parameters.
Returns:
- dict[str, Any] – A dictionary of backend file system parameters, including credentials.
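Since the returned dictionary includes the credentials, the property can be pictured as a merge of the two constructor arguments. The sketch below is an assumption about how such a merge could work, not the library's code: the helper name `merge_fs_args` is hypothetical, and letting credentials override same-named `fs_args` keys is a guess at the precedence.

```python
from copy import deepcopy
from typing import Any, Optional


def merge_fs_args(
    fs_args: Optional[dict[str, Any]],
    credentials: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Return file system parameters with credentials merged in.

    Assumption: credentials take precedence over fs_args on key clashes.
    """
    merged = deepcopy(fs_args or {})  # never mutate the caller's dict
    merged.update(credentials or {})
    return merged


base = {"anon": False}  # illustrative s3fs option
creds = {
    "client_kwargs": {
        "aws_access_key_id": "YOUR_KEY",
        "aws_secret_access_key": "YOUR_SECRET",
    }
}
merged = merge_fs_args(base, creds)
```

Copying before updating keeps the stored `fs_args` free of secrets, so only the merged view carries them.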
_describe ¶
_describe()
_exists ¶
_exists()
_process_schema ¶
_process_schema()
This method processes the schema given in catalog.yml or through the Python API, if one is provided. It assumes that the schema is specified using Triad's grammar for schema definition.
When the value of the schema variable is a string, it is assumed to be the full schema specification for the data.
Alternatively, if the schema is specified as a dictionary, only the columns listed are strictly mapped to a field type. Any other columns present in the data have their types inferred from the data.
This method converts the Triad-parsed schema into a pyarrow schema. The output directly supports Dask's specifications for providing a schema when saving to a parquet file.
Note that if a pa.Schema object is passed directly in the schema argument, no processing is done. Additionally, the behavior when passing a pa.Schema object is assumed to be consistent with how Dask treats it: it should fully define the schema for all fields.
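The three cases described above (string, dictionary, and pa.Schema) can be modelled as a simple dispatch on the type of the schema value. The sketch below is a simplified illustration only: it does not call Triad or PyArrow, the function name `plan_schema` and its return format are hypothetical, and it assumes Triad's comma-separated `name:type` expression form for string-typed dictionary entries.

```python
from typing import Any


def plan_schema(schema: Any, data_columns: list[str]) -> dict[str, Any]:
    """Describe how a schema value would be handled, without parsing it."""
    if isinstance(schema, str):
        # A string is taken as the full schema specification.
        return {"mode": "full", "expression": schema, "inferred": []}
    if isinstance(schema, dict):
        # Only the listed columns are pinned; the rest are inferred from data.
        expression = ",".join(
            f"{col}:{dtype}"
            for col, dtype in schema.items()
            if isinstance(dtype, str)
        )
        inferred = [col for col in data_columns if col not in schema]
        return {"mode": "partial", "expression": expression, "inferred": inferred}
    # Anything else (e.g. a pa.Schema object) is passed through untouched.
    return {"mode": "passthrough", "expression": None, "inferred": []}


columns = ["col1", "col2", "col3"]
partial = plan_schema({"col1": "int32", "col3": "[int32]"}, columns)
full = plan_schema("col1:int32,col2:int32,col3:[int32]", columns)
```

Here `partial["inferred"]` lists `col2`, the one column whose type is left to be inferred, while `full` leaves nothing to inference.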
load ¶
load()
save ¶
save(data)