CSVDataset
CSVDataset loads and saves Hugging Face datasets in CSV format using the datasets library.
kedro_datasets.huggingface.CSVDataset
CSVDataset(
    *,
    path,
    version=None,
    data_files=None,
    load_args=None,
    save_args=None,
    credentials=None,
    fs_args=None,
    metadata=None
)
Bases: FilesystemDataset
CSVDataset loads/saves Hugging Face Dataset and
DatasetDict objects to/from CSV files.
Saving IterableDataset or IterableDatasetDict objects is not
supported and will raise a DatasetError. Materialize the iterable
dataset into a Dataset or DatasetDict before saving.
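One possible way to materialize an iterable dataset before saving is sketched below. It assumes that iterating the dataset yields one dict per row and that every row has the same keys. The helper names rows_to_columns and materialize are hypothetical and not part of kedro_datasets or datasets; only Dataset.from_dict is a library call.

```python
from collections.abc import Iterable, Mapping
from typing import Any


def rows_to_columns(rows: Iterable[Mapping[str, Any]]) -> dict[str, list[Any]]:
    """Collect an iterable of row mappings into a column-oriented dict."""
    columns: dict[str, list[Any]] = {}
    for row in rows:
        for name, value in row.items():
            # Assumes every row carries the same keys, so columns stay aligned.
            columns.setdefault(name, []).append(value)
    return columns


def materialize(iterable_dataset):
    """Build an in-memory datasets.Dataset from an iterable of row dicts.

    Hypothetical helper: reads the whole dataset into memory.
    """
    # Imported lazily so rows_to_columns stays usable without `datasets`.
    from datasets import Dataset

    return Dataset.from_dict(rows_to_columns(iterable_dataset))
```

The result of materialize(...) is a regular Dataset that can then be passed to save. Because every row is held in memory, this approach only suits datasets that fit in RAM.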
Note that the datasets library loads a single file as a datasets.DatasetDict
with a single key called "train". To get a datasets.Dataset instead, specify
split in load_args. See the examples for more info.
Examples:
Using the YAML API to load a single file. The file is loaded as a datasets.DatasetDict with a single key "train":
reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews.csv
Using the Python API to load a datasets.DatasetDict from a single file:
>>> from datasets import Dataset
>>> from kedro_datasets.huggingface.csv_dataset import (
...     CSVDataset,
... )
>>>
>>> # tmp_path is assumed to be a pathlib.Path to a temporary directory
>>> # (for example, pytest's tmp_path fixture).
>>> data = Dataset.from_dict({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
>>> dataset = CSVDataset(path=tmp_path / "data.csv")
>>> dataset.save(data)
>>> loaded = dataset.load()
>>> assert "train" in loaded
Using the YAML API to load a datasets.Dataset from a single file:
reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews.csv
  load_args:
    split: train
Using the Python API to load a datasets.Dataset from a single file:
>>> from datasets import Dataset
>>> from kedro_datasets.huggingface.csv_dataset import (
...     CSVDataset,
... )
>>>
>>> data = Dataset.from_dict({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
>>> dataset = CSVDataset(
...     path=tmp_path / "data.csv",
...     load_args={"split": "train"},
... )
>>> dataset.save(data)
>>> loaded = dataset.load()
>>> assert type(loaded.shape) is tuple  # No "train" key.
Using the YAML API to load a datasets.DatasetDict from a directory of files:
reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews
  data_files:
    labels: labels.csv
    data: data.csv
Using the Python API to load a datasets.DatasetDict from a directory of files:
>>> from datasets import Dataset, DatasetDict
>>> from kedro_datasets.huggingface.csv_dataset import (
...     CSVDataset,
... )
>>>
>>> dataset_dict = DatasetDict({
...     "labels": Dataset.from_dict({"col1": [1, 2], "col2": ["a", "b"]}),
...     "data": Dataset.from_dict({"col1": [3, 4], "col2": ["c", "d"]}),
... })
>>> dataset = CSVDataset(
...     path=tmp_path,
...     data_files={
...         "labels": "labels.csv",
...         "data": "data.csv",
...     },
... )
>>> dataset.save(dataset_dict)
>>> loaded = dataset.load()
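As a rough illustration of the directory layout these examples imply, the sketch below maps each DatasetDict key to a file path. It assumes that every data_files entry is resolved relative to path, which the examples suggest but the text above does not state. expected_files is a hypothetical helper and not part of kedro_datasets.

```python
from pathlib import Path


def expected_files(path, data_files):
    """Map each DatasetDict key to the CSV file it is assumed to live in.

    Assumption: entries in data_files are relative to the dataset's path.
    """
    base = Path(path)
    return {split: base / filename for split, filename in data_files.items()}


layout = expected_files(
    "data/01_raw/reviews",
    {"labels": "labels.csv", "data": "data.csv"},
)
for split, file_path in layout.items():
    # as_posix() keeps forward slashes regardless of the operating system.
    print(f"{split}: {file_path.as_posix()}")
```

Under that assumption, the "labels" split would be read from and written to labels.csv inside the directory given by path, and likewise for "data".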
Source code in kedro_datasets/huggingface/_base.py