GenericDataset

GenericDataset loads and saves data to a file using an underlying filesystem (e.g. local, S3, GCS). The underlying functionality is provided by geopandas, so it supports all allowed geopandas (pandas) options for loading and saving files.

kedro_datasets.geopandas.GenericDataset

GenericDataset(
    *,
    filepath,
    file_format="file",
    load_args=None,
    save_args=None,
    version=None,
    credentials=None,
    fs_args=None,
    metadata=None
)

Bases: AbstractVersionedDataset[GeoDataFrame, GeoDataFrame | dict[str, GeoDataFrame]]

Examples:

Using the Python API:

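>>> import tempfile
>>> from pathlib import Path
>>>
>>> import geopandas as gpd
>>> from kedro_datasets.geopandas import GenericDataset
>>> from shapely.geometry import Point
>>>
>>> # tmp_path is normally supplied by pytest; a temporary directory
>>> # stands in for it so the example can run on its own.
>>> tmp_path = Path(tempfile.mkdtemp())
>>>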
>>> data = gpd.GeoDataFrame(
...     {"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]},
...     geometry=[Point(1, 1), Point(2, 4)],
... )
>>>
>>> dataset = GenericDataset(filepath=tmp_path / "test.geojson")
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> assert data.equals(reloaded)

Parameters:

  • filepath (str) –

    Filepath in POSIX format to a file, prefixed with a protocol such as s3://. If no prefix is provided, the file protocol (local filesystem) is used. The prefix can be any protocol supported by fsspec. Note: http(s) doesn't support versioning.

  • file_format (str, default: 'file' ) –

    String used to match the appropriate load/save method on a best-effort basis. For example, if 'parquet' is passed in, geopandas.read_parquet and geopandas.GeoDataFrame.to_parquet will be identified. An error will be raised unless at least one matching read_{file_format} or to_{file_format} method is identified. Defaults to 'file'.

  • load_args (dict[str, Any] | None, default: None ) –

    GeoPandas options for loading files. Here you can find all available arguments: https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html

  • save_args (dict[str, Any] | None, default: None ) –

    GeoPandas options for saving files. Here you can find all available arguments: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html

  • version (Version | None, default: None ) –

    If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, the save version will be autogenerated.

  • credentials (dict[str, Any] | None, default: None ) –

    Credentials required to access the underlying filesystem. E.g. for GCSFileSystem it would look like {'token': None}.

  • fs_args (dict[str, Any] | None, default: None ) –

    Extra arguments to pass into the underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem), as well as arguments to pass to the filesystem's open method through the nested keys open_args_load and open_args_save. All defaults are preserved, except mode, which is set to wb when saving. Here you can find all available arguments for open: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open

  • metadata (dict[str, Any] | None, default: None ) –

    Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
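
The way fs_args is divided between the filesystem constructor and the open calls can be sketched in plain Python. This is a minimal sketch that mirrors the constructor logic without importing fsspec; the helper name split_fs_args is hypothetical and is not part of kedro_datasets.

```python
import copy
from typing import Any, Optional


def split_fs_args(
    fs_args: Optional[dict[str, Any]], protocol: str
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Hypothetical helper: separate constructor arguments from open() arguments."""
    # Work on a deep copy so the caller's dictionary is never mutated.
    remaining = copy.deepcopy(fs_args) or {}
    open_args_load = remaining.pop("open_args_load", {})
    open_args_save = remaining.pop("open_args_save", {})
    # On the local filesystem, missing parent directories are created by default.
    if protocol == "file":
        remaining.setdefault("auto_mkdir", True)
    return remaining, open_args_load, open_args_save


user_fs_args = {
    "project": "my-project",
    "open_args_load": {"mode": "rb"},
}
constructor_args, open_load, open_save = split_fs_args(user_fs_args, "gcs")
print(constructor_args)  # {'project': 'my-project'}
print(open_load)         # {'mode': 'rb'}
print(open_save)         # {}
```

Everything left in the first dictionary is what would be forwarded to the filesystem constructor; the two nested dictionaries are held back for the open calls made during load and save.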

Source code in kedro_datasets/geopandas/generic_dataset.py
def __init__(  # noqa: PLR0913
    self,
    *,
    filepath: str,
    file_format: str = "file",
    load_args: dict[str, Any] | None = None,
    save_args: dict[str, Any] | None = None,
    version: Version | None = None,
    credentials: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new instance of ``GenericDataset`` pointing to a concrete file
    on a specific filesystem fsspec.

    Args:
        filepath: Filepath in POSIX format to a file prefixed with a protocol like
            `s3://`. If prefix is not provided `file` protocol (local filesystem) will be used.
            The prefix should be any protocol supported by ``fsspec``.
            Note: `http(s)` doesn't support versioning.
        file_format: String which is used to match the appropriate load/save method on a best
            effort basis. For example, if 'parquet' is passed in, `geopandas.read_parquet` and
            `geopandas.GeoDataFrame.to_parquet` will be identified. An error will be raised unless
            at least one matching `read_{file_format}` or `to_{file_format}` method is
            identified. Defaults to 'file'.
        load_args: GeoPandas options for loading files.
            Here you can find all available arguments:
            https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html
        save_args: GeoPandas options for saving files.
            Here you can find all available arguments:
            https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.to_file.html
        version: If specified, should be an instance of
            ``kedro.io.core.Version``. If its ``load`` attribute is
            None, the latest version will be loaded. If its ``save``
            attribute is None, the save version will be autogenerated.
        credentials: Credentials required to access the underlying filesystem.
            E.g. for ``GCSFileSystem`` it would look like `{'token': None}`.
        fs_args: Extra arguments to pass into underlying filesystem class constructor
            (e.g. `{"project": "my-project"}` for ``GCSFileSystem``), as well as
            to pass to the filesystem's `open` method through nested keys
            `open_args_load` and `open_args_save`.
            Here you can find all available arguments for `open`:
            https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open
            All defaults are preserved, except `mode`, which is set to `wb` when saving.
        metadata: Any arbitrary metadata.
            This is ignored by Kedro, but may be consumed by users or external plugins.
    """

    self._file_format = file_format.lower()

    _fs_args = copy.deepcopy(fs_args) or {}
    _fs_open_args_load = _fs_args.pop("open_args_load", {})
    _fs_open_args_save = _fs_args.pop("open_args_save", {})
    _credentials = copy.deepcopy(credentials) or {}
    protocol, path = get_protocol_and_path(filepath, version)
    self._protocol = protocol
    if protocol == "file":
        _fs_args.setdefault("auto_mkdir", True)

    self._fs = fsspec.filesystem(self._protocol, **_credentials, **_fs_args)

    self.metadata = metadata

    super().__init__(
        filepath=PurePosixPath(path),
        version=version,
        exists_function=self._fs.exists,
        glob_function=self._fs.glob,
    )

    # Handle default load and save and fs arguments
    self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}
    self._save_args = {**self.DEFAULT_SAVE_ARGS, **(save_args or {})}
    self._fs_open_args_load = {
        **self.DEFAULT_FS_ARGS.get("open_args_load", {}),
        **(_fs_open_args_load or {}),
    }
    self._fs_open_args_save = {
        **self.DEFAULT_FS_ARGS.get("open_args_save", {}),
        **(_fs_open_args_save or {}),
    }

DEFAULT_FS_ARGS class-attribute instance-attribute

DEFAULT_FS_ARGS = {'open_args_save': {'mode': 'wb'}}

DEFAULT_LOAD_ARGS class-attribute instance-attribute

DEFAULT_LOAD_ARGS = {}

DEFAULT_SAVE_ARGS class-attribute instance-attribute

DEFAULT_SAVE_ARGS = {}
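
The class-level defaults above are combined with user-supplied values by dictionary unpacking, so a user value replaces a default with the same key. The following is a minimal sketch of that precedence rule; merge_open_args is a hypothetical helper written for illustration, and only DEFAULT_FS_ARGS is taken from this page.

```python
from typing import Any, Optional

# Class-level default as listed above.
DEFAULT_FS_ARGS: dict[str, Any] = {"open_args_save": {"mode": "wb"}}


def merge_open_args(key: str, user_args: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: layer user-supplied open() arguments over the defaults."""
    # The later unpacking wins, so user values override defaults with the same key.
    return {**DEFAULT_FS_ARGS.get(key, {}), **(user_args or {})}


print(merge_open_args("open_args_save", None))            # {'mode': 'wb'}
print(merge_open_args("open_args_save", {"mode": "ab"}))  # {'mode': 'ab'}
print(merge_open_args("open_args_load", None))            # {}
```

Because DEFAULT_LOAD_ARGS and DEFAULT_SAVE_ARGS are empty, the same merge leaves load_args and save_args exactly as the user passed them.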

_file_format instance-attribute

_file_format = file_format.lower()

_fs instance-attribute

_fs = fsspec.filesystem(_protocol, **_credentials, **_fs_args)

_fs_open_args_load instance-attribute

_fs_open_args_load = {
    **DEFAULT_FS_ARGS.get("open_args_load", {}),
    **(_fs_open_args_load or {}),
}

_fs_open_args_save instance-attribute

_fs_open_args_save = {
    **DEFAULT_FS_ARGS.get("open_args_save", {}),
    **(_fs_open_args_save or {}),
}

_load_args instance-attribute

_load_args = {
    **DEFAULT_LOAD_ARGS,
    **(load_args or {}),
}

_protocol instance-attribute

_protocol = protocol

_save_args instance-attribute

_save_args = {
    **DEFAULT_SAVE_ARGS,
    **(save_args or {}),
}

metadata instance-attribute

metadata = metadata

_describe

_describe()
Source code in kedro_datasets/geopandas/generic_dataset.py
def _describe(self) -> dict[str, Any]:
    return {
        "filepath": self._filepath,
        "file_format": self._file_format,
        "protocol": self._protocol,
        "load_args": self._load_args,
        "save_args": self._save_args,
        "version": self._version,
    }

_ensure_file_system_target

_ensure_file_system_target()
Source code in kedro_datasets/geopandas/generic_dataset.py
def _ensure_file_system_target(self) -> None:
    # Fail fast if provided a known non-filesystem target
    if self._file_format in NON_FILE_SYSTEM_TARGETS:
        raise DatasetError(
            f"Cannot load or save a dataset of file_format '{self._file_format}' as it "
            f"does not support a filepath target/source."
        )
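
The fail-fast guard can be tried in isolation with the sketch below. Note the assumptions: the contents of NON_FILE_SYSTEM_TARGETS are not shown on this page, so "postgis" is used here purely as an illustrative entry, and DatasetError is a local stand-in for the Kedro exception so that the snippet runs without Kedro installed.

```python
class DatasetError(Exception):
    """Stand-in for Kedro's DatasetError so the sketch runs without Kedro."""


# Assumption: illustrative contents only; the real constant is defined
# elsewhere in generic_dataset.py and is not reproduced on this page.
NON_FILE_SYSTEM_TARGETS = ["postgis"]


def ensure_file_system_target(file_format: str) -> None:
    """Raise early for formats that cannot be read from or written to a filepath."""
    if file_format in NON_FILE_SYSTEM_TARGETS:
        raise DatasetError(
            f"Cannot load or save a dataset of file_format '{file_format}' as it "
            f"does not support a filepath target/source."
        )


errors = []
for candidate in ("file", "parquet", "PostGIS"):
    try:
        # The constructor lowercases file_format, so the check is case-insensitive.
        ensure_file_system_target(candidate.lower())
    except DatasetError as err:
        errors.append(str(err))

print(len(errors))  # 1
```

Only the last candidate is rejected; the other two pass the guard and would go on to method lookup in load or save.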

_exists

_exists()
Source code in kedro_datasets/geopandas/generic_dataset.py
def _exists(self) -> bool:
    try:
        load_path = get_filepath_str(self._get_load_path(), self._protocol)
    except DatasetError:
        return False
    return self._fs.exists(load_path)

_release

_release()
Source code in kedro_datasets/geopandas/generic_dataset.py
def _release(self) -> None:
    self.invalidate_cache()

invalidate_cache

invalidate_cache()

Invalidate underlying filesystem cache.

Source code in kedro_datasets/geopandas/generic_dataset.py
def invalidate_cache(self) -> None:
    """Invalidate underlying filesystem cache."""
    filepath = get_filepath_str(self._filepath, self._protocol)
    self._fs.invalidate_cache(filepath)

load

load()
Source code in kedro_datasets/geopandas/generic_dataset.py
def load(self) -> gpd.GeoDataFrame | dict[str, gpd.GeoDataFrame]:
    self._ensure_file_system_target()

    load_path = get_filepath_str(self._get_load_path(), self._protocol)
    load_method = getattr(gpd, f"read_{self._file_format}", None)
    if load_method:
        with self._fs.open(load_path, **self._fs_open_args_load) as fs_file:
            return load_method(fs_file, **self._load_args)
    raise DatasetError(
        f"Unable to retrieve 'geopandas.read_{self._file_format}' method, please ensure that your "
        "'file_format' parameter has been defined correctly as per the GeoPandas API "
        "https://geopandas.org/en/stable/docs/reference/io.html"
    )
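
The name-based lookup used by load can be illustrated without geopandas. In this minimal sketch, fake_gpd is a hypothetical stand-in namespace that exposes only read_file, and DatasetError is a local stand-in for the Kedro exception. It shows why a file_format such as 'geojson' fails: the lookup needs a function literally named read_geojson, which geopandas does not provide, whereas the default 'file' resolves to read_file.

```python
import io
import json
from types import SimpleNamespace


class DatasetError(Exception):
    """Stand-in for Kedro's DatasetError so the sketch runs without Kedro."""


def read_file(handle, **kwargs):
    """Stand-in reader: parses JSON from an open binary handle."""
    return json.load(handle, **kwargs)


# Hypothetical stand-in for the geopandas module, exposing a single reader.
fake_gpd = SimpleNamespace(read_file=read_file)


def load(file_format: str, handle, load_args=None):
    # Look the reader up by name; None means no such function exists.
    load_method = getattr(fake_gpd, f"read_{file_format}", None)
    if load_method:
        return load_method(handle, **(load_args or {}))
    raise DatasetError(f"Unable to retrieve 'read_{file_format}' method")


payload = b'{"type": "FeatureCollection", "features": []}'
result = load("file", io.BytesIO(payload))
print(result["type"])  # FeatureCollection

try:
    load("geojson", io.BytesIO(payload))
except DatasetError as err:
    message = str(err)
    print(message)  # Unable to retrieve 'read_geojson' method
```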

save

save(data)
Source code in kedro_datasets/geopandas/generic_dataset.py
def save(self, data: gpd.GeoDataFrame) -> None:
    self._ensure_file_system_target()

    save_path = get_filepath_str(self._get_save_path(), self._protocol)
    save_method = getattr(data, f"to_{self._file_format}", None)
    if save_method:
        with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
            # KEY ASSUMPTION - first argument is path/buffer/io
            save_method(fs_file, **self._save_args)
            self.invalidate_cache()
    else:
        raise DatasetError(
            f"Unable to retrieve 'geopandas.DataFrame.to_{self._file_format}' method, please "
            "ensure that your 'file_format' parameter has been defined correctly as "
            "per the GeoPandas API "
            "https://geopandas.org/en/stable/docs/reference/io.html"
        )
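
The key assumption noted in save, that the writer's first argument accepts an open file handle, can be demonstrated with a stand-in object. This is a sketch under stated assumptions: FakeFrame is a hypothetical class standing in for a GeoDataFrame, and the built-in open replaces the fsspec filesystem's open call.

```python
import json
import tempfile
from pathlib import Path


class FakeFrame:
    """Hypothetical stand-in for a GeoDataFrame with a single writer method."""

    def __init__(self, records):
        self.records = records

    def to_file(self, handle, **kwargs):
        # Receives an already-open binary handle rather than a path.
        handle.write(json.dumps(self.records, **kwargs).encode("utf-8"))


def save(data, file_format: str, path: Path, save_args=None) -> None:
    # Resolve the writer by name before opening the target, as save() does.
    save_method = getattr(data, f"to_{file_format}", None)
    if save_method is None:
        raise AttributeError(f"no 'to_{file_format}' method on {type(data).__name__}")
    # 'wb' matches the default mode in DEFAULT_FS_ARGS['open_args_save'].
    with open(path, "wb") as handle:
        save_method(handle, **(save_args or {}))


target = Path(tempfile.mkdtemp()) / "points.json"
save(FakeFrame([{"col1": 1}]), "file", target, {"indent": 2})
print(json.loads(target.read_text()))  # [{'col1': 1}]
```

A writer that expects something other than a path or buffer as its first argument would not fit this pattern, which is presumably why the source flags it as an assumption.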