CSVDataset

CSVDataset loads and saves Hugging Face datasets in CSV format using the datasets library.

kedro_datasets.huggingface.CSVDataset

CSVDataset(
    *,
    path,
    version=None,
    data_files=None,
    load_args=None,
    save_args=None,
    credentials=None,
    fs_args=None,
    metadata=None
)

Bases: FilesystemDataset

CSVDataset loads/saves Hugging Face Dataset and DatasetDict objects to/from CSV files.

Saving IterableDataset or IterableDatasetDict objects is not supported and will raise a DatasetError. Materialize the iterable dataset into a Dataset or DatasetDict before saving.

Note that the datasets library loads a single file as a datasets.DatasetDict with a single key, "train". To load a datasets.Dataset directly instead, specify split in load_args (for example, split: train). See the examples below.

Examples:

Using the YAML API to load a single file. The file is loaded as a datasets.DatasetDict with a single key, "train":

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews.csv

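Using the Python API to load a datasets.DatasetDict from a single file (here and in the examples below, tmp_path is assumed to be a pathlib.Path pointing at a writable temporary directory, such as pytest's tmp_path fixture):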

>>> from datasets import Dataset
>>> from kedro_datasets.huggingface.csv_dataset import (
...     CSVDataset,
... )
>>>
>>> data = Dataset.from_dict({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
>>> dataset = CSVDataset(path=tmp_path / "data.csv")
>>> dataset.save(data)
>>> loaded = dataset.load()
>>> assert "train" in loaded

Using the YAML API to load a datasets.Dataset from a single file:

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews.csv
  load_args:
    split: train

Using the Python API to load a datasets.Dataset from a single file:

>>> from datasets import Dataset
>>> from kedro_datasets.huggingface.csv_dataset import (
...     CSVDataset,
... )
>>>
>>> data = Dataset.from_dict({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
>>> dataset = CSVDataset(
...     path=tmp_path / "data.csv",
...     load_args={"split": "train"},
... )
>>> dataset.save(data)
>>> loaded = dataset.load()
>>> # A Dataset has a (rows, columns) tuple as its shape; there is no "train" key.
>>> assert isinstance(loaded.shape, tuple)

Using the YAML API to load a datasets.DatasetDict from a directory of files:

reviews:
  type: huggingface.CSVDataset
  path: data/01_raw/reviews
  data_files:
    labels: labels.csv
    data: data.csv

Using the Python API to load a datasets.DatasetDict from a directory of files:

>>> from datasets import Dataset, DatasetDict
>>> from kedro_datasets.huggingface.csv_dataset import (
...     CSVDataset,
... )
>>>
>>> dataset_dict = DatasetDict({
...     "labels": Dataset.from_dict({"col1": [1, 2], "col2": ["a", "b"]}),
...     "data": Dataset.from_dict({"col1": [3, 4], "col2": ["c", "d"]}),
... })
>>> dataset = CSVDataset(
...     path=tmp_path,
...     data_files={
...         "labels": "labels.csv",
...         "data": "data.csv",
...     },
... )
>>> dataset.save(dataset_dict)
>>> loaded = dataset.load()

Source code in kedro_datasets/huggingface/_base.py

def __init__(  # noqa: PLR0913
    self,
    *,
    path: str | os.PathLike,
    version: Version | None = None,
    data_files: dict[str, str] | None = None,
    load_args: dict[str, Any] | None = None,
    save_args: dict[str, Any] | None = None,
    credentials: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new instance of ``FilesystemDataset``.

    Args:
        path: Path to a file or directory for persisting Hugging Face
            datasets. Supports local paths, ``os.PathLike`` objects,
            and remote URIs (e.g. ``s3://bucket/data``).
        version: Optional versioning configuration
            (see :class:`~kedro.io.core.Version`).
        data_files: Mapping of split name to filename for loading and
            saving a ``DatasetDict`` from a directory
            (e.g. ``{"train": "train.csv"}``). The keys must match
            the split names of the ``DatasetDict`` being saved, and
            the filenames must use the correct extension for the
            format (e.g. ``.csv`` for ``CSVDataset``).
        load_args: Additional keyword arguments passed to the
            underlying load function. This cannot include ``data_files``;
            use the top-level ``data_files`` argument instead.
        save_args: Additional keyword arguments passed to the
            underlying save function. This cannot include ``data_files``;
            use the top-level ``data_files`` argument instead.
        credentials: Credentials for the underlying filesystem
            (e.g. ``key``/``secret`` for S3). Passed to the ``fsspec``
            filesystem initialiser and merged into the
            ``storage_options`` parameter of the underlying
            ``datasets`` implementation.
        fs_args: Extra arguments passed to the ``fsspec`` filesystem
            initialiser. Also merged into the ``storage_options``
            parameter of the underlying ``datasets`` implementation.
        metadata: Any arbitrary metadata. This is ignored by Kedro
            but may be consumed by users or external plugins.
    """
    _fs_args = deepcopy(fs_args) or {}
    _credentials = deepcopy(credentials) or {}

    protocol, resolved_path = get_protocol_and_path(path, version)
    self._protocol = protocol

    if protocol == "file":
        _fs_args.setdefault("auto_mkdir", True)

    self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)

    self._load_args = deepcopy(load_args or {})
    self._save_args = deepcopy(save_args or {})

    if "data_files" in self._load_args or "data_files" in self._save_args:
        msg = (
            f"{type(self).__name__} got ``data_files`` in ``load_args`` "
            "or ``save_args``. Pass it as a top-level argument instead."
        )
        raise DatasetError(msg)

    self._data_files = deepcopy(data_files)
    self.metadata = metadata

    self._storage_options = {**_credentials, **_fs_args} or None

    super().__init__(
        filepath=PurePosixPath(resolved_path),
        version=version,
        exists_function=self._fs.exists,
        glob_function=self._fs.glob,
    )
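
The way the constructor combines credentials and fs_args can be easier to follow in isolation. The following is a minimal sketch of that merge logic only, using the standard library; resolve_storage_options is a hypothetical helper name introduced for illustration and is not part of kedro_datasets.

```python
from __future__ import annotations

from copy import deepcopy
from typing import Any


def resolve_storage_options(
    protocol: str,
    credentials: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
) -> dict[str, Any] | None:
    """Hypothetical helper mirroring the merge performed in ``__init__``."""
    # Copy the inputs so the caller's dictionaries are never mutated.
    _fs_args = deepcopy(fs_args) or {}
    _credentials = deepcopy(credentials) or {}

    # Local paths get ``auto_mkdir`` unless the caller already set it.
    if protocol == "file":
        _fs_args.setdefault("auto_mkdir", True)

    # An empty merged dict is falsy, so ``or None`` turns it into None.
    return {**_credentials, **_fs_args} or None


local = resolve_storage_options("file")
remote_default = resolve_storage_options("s3")
remote = resolve_storage_options(
    "s3",
    credentials={"key": "K", "secret": "S"},  # placeholder values
    fs_args={"anon": False},
)
```

In this sketch a local path always yields {"auto_mkdir": True}, while a remote protocol with no credentials or fs_args yields None. Because the constructor also unpacks both dictionaries into the single fsspec.filesystem call, a key that appears in both credentials and fs_args would raise a TypeError there, before the merged dictionary is built.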

BUILDER class-attribute

BUILDER = 'csv'

EXTENSION class-attribute

EXTENSION = '.csv'
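
The constructor docstring requires data_files keys to match the split names being saved and the filenames to use the format's extension. A possible pre-save check along those lines is sketched below. It is an illustration only: find_data_files_problems is a hypothetical helper, not a function in kedro_datasets, and the library's own validation may differ.

```python
from __future__ import annotations

from pathlib import PurePosixPath

EXTENSION = ".csv"  # Same value as the EXTENSION class attribute of CSVDataset.


def find_data_files_problems(
    data_files: dict[str, str], split_names: list[str]
) -> list[str]:
    """Hypothetical helper: list mismatches between data_files and splits."""
    problems = []

    # Every split needs a filename, and every filename needs a split.
    missing = sorted(set(split_names) - set(data_files))
    extra = sorted(set(data_files) - set(split_names))
    if missing:
        problems.append(f"splits without a filename: {missing}")
    if extra:
        problems.append(f"filenames without a split: {extra}")

    # Each filename must end with the extension of the format.
    for split, filename in data_files.items():
        if PurePosixPath(filename).suffix != EXTENSION:
            problems.append(
                f"{split!r} maps to {filename!r}, expected a {EXTENSION} file"
            )
    return problems


good = find_data_files_problems(
    {"labels": "labels.csv", "data": "data.csv"}, ["labels", "data"]
)
bad = find_data_files_problems({"labels": "labels.tsv"}, ["labels", "data"])
```

Here good is an empty list, while bad reports both the missing "data" entry and the wrong extension on "labels.tsv". Returning a list of problems rather than raising on the first one is a design choice of this sketch, made so that all mismatches can be reported at once.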