JSONDataset
JSONDataset loads and saves Hugging Face datasets in JSON format using the datasets library.
kedro_datasets.huggingface.JSONDataset
JSONDataset(
    *,
    path,
    version=None,
    data_files=None,
    load_args=None,
    save_args=None,
    credentials=None,
    fs_args=None,
    metadata=None,
)
Bases: FilesystemDataset
JSONDataset loads/saves Hugging Face Dataset and
DatasetDict objects to/from JSON files.
Saving IterableDataset or IterableDatasetDict objects is not
supported and will raise a DatasetError. Materialize the iterable
dataset into a Dataset or DatasetDict before saving.
Note that datasets loads a single file as a datasets.DatasetDict
with a single key called "train". You can get around this by specifying
split in the load_args. See examples for more info.
Examples:
Using the YAML API to load a single file, which is loaded as a datasets.DatasetDict with a single key "train":
reviews:
  type: huggingface.JSONDataset
  path: data/01_raw/reviews.json
Using the Python API to load a datasets.DatasetDict from a single file:
>>> from datasets import Dataset
>>> from kedro_datasets.huggingface.json_dataset import (
...     JSONDataset,
... )
>>>
>>> data = Dataset.from_dict({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
>>> dataset = JSONDataset(path=tmp_path / "data.json")
>>> dataset.save(data)
>>> loaded = dataset.load()
>>> assert "train" in loaded
Using the YAML API to load a datasets.Dataset from a single file:
reviews:
  type: huggingface.JSONDataset
  path: data/01_raw/reviews.json
  load_args:
    split: train
Using the Python API to load a datasets.Dataset from a single file:
>>> from datasets import Dataset
>>> from kedro_datasets.huggingface.json_dataset import (
...     JSONDataset,
... )
>>>
>>> data = Dataset.from_dict({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
>>> dataset = JSONDataset(
...     path=tmp_path / "data.json",
...     load_args={"split": "train"},
... )
>>> dataset.save(data)
>>> loaded = dataset.load()
>>> assert type(loaded.shape) is tuple # No "train" key.
Using the YAML API to load a datasets.DatasetDict from a directory of files:
reviews:
  type: huggingface.JSONDataset
  path: data/01_raw/reviews
  data_files:
    labels: labels.json
    data: data.json
Using the Python API to load a datasets.DatasetDict from a directory of files:
>>> from datasets import Dataset, DatasetDict
>>> from kedro_datasets.huggingface.json_dataset import (
...     JSONDataset,
... )
>>>
>>> dataset_dict = DatasetDict({
...     "labels": Dataset.from_dict({"col1": [1, 2], "col2": ["a", "b"]}),
...     "data": Dataset.from_dict({"col1": [3, 4], "col2": ["c", "d"]}),
... })
>>> dataset = JSONDataset(
...     path=tmp_path,
...     data_files={
...         "labels": "labels.json",
...         "data": "data.json",
...     },
... )
>>> dataset.save(dataset_dict)
>>> loaded = dataset.load()
Source code in kedro_datasets/huggingface/_base.py