CSVDataset

CSVDataset loads and saves data to/from CSV files using Polars.

kedro_datasets.polars.CSVDataset

CSVDataset(
    *,
    filepath,
    load_args=None,
    save_args=None,
    version=None,
    credentials=None,
    fs_args=None,
    metadata=None
)

Bases: AbstractVersionedDataset[DataFrame, DataFrame]

CSVDataset loads/saves data from/to a CSV file using an underlying filesystem (e.g. local, S3, GCS). It uses Polars to read and write the CSV file.

Examples:

Using the YAML API:

cars:
  type: polars.CSVDataset
  filepath: data/01_raw/company/cars.csv
  load_args:
    separator: ","
    try_parse_dates: False
  save_args:
    include_header: False
    null_value: "somenullstring"

motorbikes:
  type: polars.CSVDataset
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes.csv
  credentials: dev_s3

Using the Python API:

>>> import sys
>>>
>>> import polars as pl
>>> import pytest
>>> from kedro_datasets.polars import CSVDataset
>>>
>>> if sys.platform.startswith("win"):
...     pytest.skip("this doctest fails on Windows CI runner")
...
>>> data = pl.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>>
>>> dataset = CSVDataset(filepath=tmp_path / "test.csv")
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> assert data.equals(reloaded)

Parameters:

  • filepath (str | PathLike) –

    Filepath as a string or path-like object in POSIX format pointing to a CSV file, optionally prefixed with a protocol such as s3://. If no prefix is provided, the file protocol (local filesystem) is used. The prefix can be any protocol supported by fsspec. Note: http(s) doesn't support versioning.

  • load_args (dict[str, Any] | None, default: None ) –

    Polars options for loading CSV files. Here you can find all available arguments: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html#polars.read_csv All defaults are preserved, but we explicitly use rechunk=True for seaborn compatibility.

  • save_args (dict[str, Any] | None, default: None ) –

    Polars options for saving CSV files. Here you can find all available arguments: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_csv.html All defaults are preserved.

  • version (Version | None, default: None ) –

    If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, save version will be autogenerated.

  • credentials (dict[str, Any] | None, default: None ) –

    Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {"token": None}.

  • fs_args (dict[str, Any] | None, default: None ) –

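    Extra arguments to pass into the underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem). The optional open_args_load and open_args_save keys are separated out before the remaining arguments reach the filesystem constructor. Defaults are preserved, apart from open_args_save, whose mode is set to w and encoding to utf-8.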

  • metadata (dict[str, Any] | None, default: None ) –

    Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
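
As a rough illustration of how credentials and fs_args are combined, the sketch below mirrors the constructor's merging logic using only the standard library. The helper name split_fs_args is hypothetical and not part of kedro_datasets; it only shows which keys end up in the filesystem options and which are set aside for opening files.

```python
from copy import deepcopy
from typing import Any, Optional


def split_fs_args(
    credentials: Optional[dict[str, Any]],
    fs_args: Optional[dict[str, Any]],
    protocol: str = "file",
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Hypothetical helper mirroring how the constructor separates its options."""
    # Copy first so the caller's dictionaries are never mutated.
    _fs_args = deepcopy(fs_args) or {}
    open_args_load = _fs_args.pop("open_args_load", {})
    open_args_save = _fs_args.pop("open_args_save", {})
    _credentials = deepcopy(credentials) or {}

    # Local paths get their parent directories created automatically.
    if protocol == "file":
        _fs_args.setdefault("auto_mkdir", True)

    # Credentials and the remaining fs_args are merged; fs_args wins on clashes.
    storage_options = {**_credentials, **_fs_args}
    return storage_options, open_args_load, open_args_save


storage_options, open_load, open_save = split_fs_args(
    credentials={"token": None},
    fs_args={"project": "my-project", "open_args_save": {"mode": "w"}},
    protocol="gcs",
)
print(storage_options)
print(open_load, open_save)
```

Because fs_args is unpacked after credentials, a key present in both takes its value from fs_args.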

Source code in kedro_datasets/polars/csv_dataset.py
def __init__(  # noqa: PLR0913
    self,
    *,
    filepath: str | os.PathLike,
    load_args: dict[str, Any] | None = None,
    save_args: dict[str, Any] | None = None,
    version: Version | None = None,
    credentials: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new instance of ``CSVDataset`` pointing to a concrete CSV file
    on a specific filesystem.

    Args:
        filepath: Filepath as a string or path-like object in POSIX format to a CSV file,
            optionally prefixed with a protocol such as `s3://`.
            If prefix is not provided, `file` protocol (local filesystem)
            will be used.
            The prefix should be any protocol supported by ``fsspec``.
            Note: `http(s)` doesn't support versioning.
        load_args: Polars options for loading CSV files.
            Here you can find all available arguments:
            https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_csv.html#polars.read_csv
            All defaults are preserved, but we explicitly use `rechunk=True` for `seaborn`
            compatibility.
        save_args: Polars options for saving CSV files.
            Here you can find all available arguments:
            https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_csv.html
            All defaults are preserved.
        version: If specified, should be an instance of
            ``kedro.io.core.Version``. If its ``load`` attribute is
            None, the latest version will be loaded. If its ``save``
            attribute is None, save version will be autogenerated.
        credentials: Credentials required to get access to the underlying filesystem.
            E.g. for ``GCSFileSystem`` it should look like `{"token": None}`.
        fs_args: Extra arguments to pass into underlying filesystem class constructor
            (e.g. `{"project": "my-project"}` for ``GCSFileSystem``).
            Defaults are preserved, apart from `open_args_save`, whose `mode` is set to `w` and `encoding` to `utf-8`.

        metadata: Any arbitrary metadata.
            This is ignored by Kedro, but may be consumed by users or external plugins.
    """
    _fs_args = deepcopy(fs_args) or {}
    _fs_open_args_load = _fs_args.pop("open_args_load", {})
    _fs_open_args_save = _fs_args.pop("open_args_save", {})
    _credentials = deepcopy(credentials) or {}

    protocol, path = get_protocol_and_path(filepath, version)
    if protocol == "file":
        _fs_args.setdefault("auto_mkdir", True)

    self._protocol = protocol
    self._storage_options = {**_credentials, **_fs_args}
    self._fs = fsspec.filesystem(self._protocol, **self._storage_options)

    self.metadata = metadata

    super().__init__(
        filepath=PurePosixPath(path),
        version=version,
        exists_function=self._fs.exists,
        glob_function=self._fs.glob,
    )

    # Handle default load and save and fs arguments
    self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
    self._save_args = {**self.DEFAULT_SAVE_ARGS, **(save_args or {})}
    self._fs_open_args_load = {
        **self.DEFAULT_FS_ARGS.get("open_args_load", {}),
        **(_fs_open_args_load or {}),
    }
    self._fs_open_args_save = {
        **self.DEFAULT_FS_ARGS.get("open_args_save", {}),
        **(_fs_open_args_save or {}),
    }

    if "storage_options" in self._save_args or "storage_options" in self._load_args:
        logger.warning(
            "Dropping 'storage_options' for %s, "
            "please specify them under 'fs_args' or 'credentials'.",
            self._filepath,
        )
        self._save_args.pop("storage_options", None)
        self._load_args.pop("storage_options", None)
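
The final step of the constructor, which discards any storage_options passed through load_args or save_args, can be sketched in isolation as follows. The function name drop_storage_options and the sample arguments are hypothetical; only the warning text and the pop calls are taken from the source above.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def drop_storage_options(
    load_args: dict[str, Any], save_args: dict[str, Any], filepath: str
) -> None:
    """Hypothetical helper: remove 'storage_options' in place and warn."""
    if "storage_options" in save_args or "storage_options" in load_args:
        logger.warning(
            "Dropping 'storage_options' for %s, "
            "please specify them under 'fs_args' or 'credentials'.",
            filepath,
        )
        # pop(..., None) is safe even when only one of the two dicts has the key.
        save_args.pop("storage_options", None)
        load_args.pop("storage_options", None)


load_args = {"rechunk": True, "storage_options": {"anon": True}}
save_args: dict[str, Any] = {}
drop_storage_options(load_args, save_args, "data/01_raw/company/cars.csv")
print(load_args, save_args)
```

In practice this means filesystem options belong under fs_args or credentials, where they are forwarded to the filesystem and to the remote read.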

DEFAULT_FS_ARGS class-attribute instance-attribute

DEFAULT_FS_ARGS = {
    "open_args_save": {"mode": "w", "encoding": "utf-8"}
}

DEFAULT_LOAD_ARGS class-attribute instance-attribute

DEFAULT_LOAD_ARGS = {'rechunk': True}

DEFAULT_SAVE_ARGS class-attribute instance-attribute

DEFAULT_SAVE_ARGS = {}
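
The way these class-level defaults interact with user-supplied arguments can be sketched with plain dictionaries. The helper names merge_load_args and merge_open_args_save are hypothetical; the merge expressions follow the constructor source, where user values are unpacked last and therefore override the defaults.

```python
from typing import Any, Optional

DEFAULT_LOAD_ARGS: dict[str, Any] = {"rechunk": True}
DEFAULT_FS_ARGS: dict[str, Any] = {
    "open_args_save": {"mode": "w", "encoding": "utf-8"}
}


def merge_load_args(load_args: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: defaults first, user overrides second."""
    return {**DEFAULT_LOAD_ARGS, **(load_args or {})}


def merge_open_args_save(open_args_save: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: same pattern for the save-time open arguments."""
    return {**DEFAULT_FS_ARGS.get("open_args_save", {}), **(open_args_save or {})}


merged_default = merge_load_args(None)
merged_custom = merge_load_args({"separator": ";", "rechunk": False})
merged_open = merge_open_args_save({"encoding": "latin-1"})
print(merged_default, merged_custom, merged_open)
```

A new dictionary is built on every merge, so the class-level defaults themselves are never modified.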

_fs instance-attribute

_fs = filesystem(_protocol, **(_storage_options))

_fs_open_args_load instance-attribute

_fs_open_args_load = {
    **DEFAULT_FS_ARGS.get("open_args_load", {}),
    **(_fs_open_args_load or {}),
}

_fs_open_args_save instance-attribute

_fs_open_args_save = {
    **DEFAULT_FS_ARGS.get("open_args_save", {}),
    **(_fs_open_args_save or {}),
}

_load_args instance-attribute

_load_args = {
    **DEFAULT_LOAD_ARGS,
    **(load_args or {}),
}

_protocol instance-attribute

_protocol = protocol

_save_args instance-attribute

_save_args = {
    **DEFAULT_SAVE_ARGS,
    **(save_args or {}),
}

_storage_options instance-attribute

_storage_options = {**_credentials, **_fs_args}

metadata instance-attribute

metadata = metadata

_describe

_describe()
Source code in kedro_datasets/polars/csv_dataset.py
def _describe(self) -> dict[str, Any]:
    return {
        "filepath": self._filepath,
        "protocol": self._protocol,
        "load_args": self._load_args,
        "save_args": self._save_args,
        "version": self._version,
    }

_exists

_exists()
Source code in kedro_datasets/polars/csv_dataset.py
def _exists(self) -> bool:
    try:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
    except DatasetError:
        return False

    return self._fs.exists(load_path)

_invalidate_cache

_invalidate_cache()

Invalidate underlying filesystem caches.

Source code in kedro_datasets/polars/csv_dataset.py
def _invalidate_cache(self) -> None:
    """Invalidate underlying filesystem caches."""
    filepath = get_filepath_str(self._filepath, self._protocol)
    self._fs.invalidate_cache(filepath)

_release

_release()
Source code in kedro_datasets/polars/csv_dataset.py
def _release(self) -> None:
    super()._release()
    self._invalidate_cache()

load

load()
Source code in kedro_datasets/polars/csv_dataset.py
def load(self) -> pl.DataFrame:
    load_path = str(self._get_load_path())
    if self._protocol == "file":
        # file:// protocol seems to misbehave on Windows
        # (<urlopen error file not on local host>),
        # so we don't join that back to the filepath;
        # storage_options also don't work with local paths
        return pl.read_csv(load_path, **self._load_args)

    load_path = f"{self._protocol}{PROTOCOL_DELIMITER}{load_path}"
    return pl.read_csv(
        load_path, storage_options=self._storage_options, **self._load_args
    )
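
The branching in load can be summarised as a small path-building rule: local files are read from the bare path, while remote files get their protocol prefix restored before the read. The sketch below isolates that rule. The function name build_read_target is hypothetical, and PROTOCOL_DELIMITER is assumed here to be "://", matching the constant the source imports.

```python
# Assumed value of the PROTOCOL_DELIMITER constant referenced in the source.
PROTOCOL_DELIMITER = "://"


def build_read_target(protocol: str, load_path: str) -> str:
    """Hypothetical helper returning the string that would be handed to pl.read_csv."""
    if protocol == "file":
        # Local paths are used as-is, without a file:// prefix.
        return load_path
    # Remote paths get the protocol joined back on, e.g. s3://bucket/key.
    return f"{protocol}{PROTOCOL_DELIMITER}{load_path}"


local_target = build_read_target("file", "data/01_raw/company/cars.csv")
remote_target = build_read_target(
    "s3", "your_bucket/data/02_intermediate/company/motorbikes.csv"
)
print(local_target)
print(remote_target)
```

Only the remote branch forwards storage_options to the read call; the local branch omits them, as noted in the source comment.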

preview

preview(nrows=5)

Generate a preview of the dataset with a specified number of rows.

Parameters:

  • nrows (int, default: 5 ) –

    The number of rows to include in the preview. Defaults to 5; a value that is not an int also falls back to 5.

Returns:

  • TablePreview

    A dictionary containing the data in pandas "split" orientation, with index, columns, and data keys.

Source code in kedro_datasets/polars/csv_dataset.py
def preview(self, nrows: int = 5) -> TablePreview:
    """
    Generate a preview of the dataset with a specified number of rows.

    Args:
        nrows: The number of rows to include in the preview. Defaults to 5.

    Returns:
        dict: A dictionary containing the data in a split format.
    """
    # Create a copy so it doesn't contaminate the original dataset
    dataset_copy = self._copy()
    data = dataset_copy.load().limit(nrows if type(nrows) is int else 5)
    data_dict = data.to_pandas().to_dict(orient="split")
    return TablePreview(data_dict)
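
To make the "split" format concrete, the sketch below builds the same shape of dictionary by hand from column data, without Polars or pandas. The function to_split_preview is hypothetical and only approximates what to_dict(orient="split") yields for a default integer index; the sample data comes from the Python API example above.

```python
from typing import Any


def to_split_preview(
    columns: dict[str, list[Any]], nrows: int = 5
) -> dict[str, list[Any]]:
    """Hypothetical stand-in for load().limit(nrows).to_pandas().to_dict(orient="split")."""
    # Same guard as the source: anything that is not exactly an int falls back to 5.
    limit = nrows if type(nrows) is int else 5
    names = list(columns)
    total_rows = len(next(iter(columns.values()), []))
    n = min(limit, total_rows)
    return {
        "index": list(range(n)),
        "columns": names,
        # One inner list per row, with values in column order.
        "data": [[columns[name][i] for name in names] for i in range(n)],
    }


preview = to_split_preview({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}, nrows=1)
print(preview)
```

The split orientation keeps column names separate from row values, which suits tabular rendering in a UI.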

save

save(data)
Source code in kedro_datasets/polars/csv_dataset.py
def save(self, data: pl.DataFrame) -> None:
    save_path = get_filepath_str(self._get_save_path(), self._protocol)

    with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
        data.write_csv(file=fs_file, **self._save_args)

    self._invalidate_cache()