FileDataset

FileDataset is used to load and save data to files using the Ibis framework.

kedro_datasets.ibis.FileDataset

FileDataset(
    filepath,
    file_format="parquet",
    *,
    table_name=None,
    connection=None,
    credentials=None,
    load_args=None,
    save_args=None,
    fs_args=None,
    version=None,
    metadata=None
)

Bases: ConnectionMixin, AbstractVersionedDataset[Table, Table]

FileDataset loads data from, and saves data to, a file in the specified format.

Examples:

Using the YAML API:

cars:
  type: ibis.FileDataset
  filepath: data/01_raw/company/cars.csv
  file_format: csv
  table_name: cars
  connection:
    backend: duckdb
    database: company.db
  load_args:
    sep: ","
    nullstr: "#NA"
  save_args:
    sep: ","
    nullstr: "#NA"

motorbikes:
  type: ibis.FileDataset
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes/
  file_format: delta
  table_name: motorbikes
  connection:
    backend: polars

Using the Python API:

>>> import ibis
>>> from kedro_datasets.ibis import FileDataset
>>>
>>> data = ibis.memtable({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>>
>>> dataset = FileDataset(
...     filepath=tmp_path / "test.csv",
...     file_format="csv",
...     table_name="test",
...     connection={"backend": "duckdb", "database": tmp_path / "file.db"},
... )
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> assert data.execute().equals(reloaded.execute())

FileDataset connects to the Ibis backend object constructed from the connection configuration. The backend key provided in the config can be any of the supported backends. The remaining dictionary entries will be passed as arguments to the underlying connect() method (e.g. ibis.duckdb.connect()).
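
The resolution described above can be sketched without installing Ibis. In this minimal sketch, `fake_ibis` and `FakeBackendModule` are hypothetical stand-ins for the `ibis` module and a backend entry point such as `ibis.duckdb`; only the pattern of popping `backend` and forwarding the rest to `connect()` follows the `_connect()` listing in this page.

```python
from copy import deepcopy
from types import SimpleNamespace


class FakeBackendModule:
    """Hypothetical stand-in for a backend entry point such as ``ibis.duckdb``."""

    def __init__(self, name):
        self.name = name

    def connect(self, **kwargs):
        # A real backend would return a connection; this one echoes the call.
        return {"backend": self.name, "connect_kwargs": kwargs}


# Hypothetical stand-in for the ``ibis`` module namespace.
fake_ibis = SimpleNamespace(duckdb=FakeBackendModule("duckdb"))


def connect_from_config(connection_config):
    # Copy first so popping "backend" does not mutate the caller's dict.
    config = deepcopy(connection_config)
    backend = getattr(fake_ibis, config.pop("backend"))
    # Every remaining entry becomes a keyword argument to connect().
    return backend.connect(**config)


catalog_entry = {"backend": "duckdb", "database": "company.db"}
result = connect_from_config(catalog_entry)
print(result)
```

With the real library, the same catalog entry would presumably resolve to a call along the lines of `ibis.duckdb.connect(database="company.db")`.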

The read method corresponding to the given file_format (e.g. read_csv()) is used to load the file with the backend. Note that only the data is loaded; no link to the underlying file exists past FileDataset.load().
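
The name-based lookup of the read method can be illustrated as follows. `FakeBackend` and `load_table` are hypothetical names introduced for this sketch and are not part of Ibis or kedro-datasets; the sketch only mirrors the `getattr(..., f"read_{file_format}")` pattern from the `load()` listing.

```python
class FakeBackend:
    """Hypothetical backend exposing one read method per file format."""

    def read_csv(self, path, table_name=None, **kwargs):
        return ("csv", path, table_name, kwargs)

    def read_parquet(self, path, table_name=None, **kwargs):
        return ("parquet", path, table_name, kwargs)


def load_table(backend, path, file_format, table_name=None, load_args=None):
    # Look up read_<file_format> by name. In this sketch, a format without a
    # matching method surfaces as an AttributeError.
    reader = getattr(backend, f"read_{file_format}")
    return reader(path, table_name=table_name, **(load_args or {}))


backend = FakeBackend()
loaded = load_table(
    backend,
    "data/01_raw/company/cars.csv",
    "csv",
    table_name="cars",
    load_args={"sep": ","},
)

try:
    load_table(backend, "cars.xlsx", "xlsx")
    unsupported_raised = False
except AttributeError:
    unsupported_raised = True
```

Because the method is resolved by name, the set of usable `file_format` values depends on which `read_*` methods the chosen backend provides.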

Parameters:

  • filepath (str) –

    Path to a file to register as a table. Most useful for loading data into your data warehouse (for testing). On save, the backend exports data to the specified path.

  • file_format (str, default: 'parquet' ) –

    The format of the file, such as 'csv', 'parquet', or 'delta' (the chosen backend must provide matching read and write methods). Defaults to 'parquet'.

  • table_name (str | None, default: None ) –

    The name to use for the created table (on load).

  • connection (dict[str, Any] | None, default: None ) –

    Configuration for connecting to an Ibis backend. If not provided, an in-memory DuckDB connection is used.

  • credentials (dict[str, Any] | None, default: None ) –

    Credentials or additional configuration used to connect (e.g. user, password, token, account). If given, these values override the base connection configuration.

  • load_args (dict[str, Any] | None, default: None ) –

    Additional arguments passed to the Ibis backend's read_{file_format} method.

  • save_args (dict[str, Any] | None, default: None ) –

    Additional arguments passed to the Ibis backend's to_{file_format} method.

  • fs_args (dict[str, Any] | None, default: None ) –

    Extra arguments to pass into underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem). Used only to discover versions and check existence of a remote filepath; reading and writing is left to the Ibis backend. Note that fsspec credentials go here, not in credentials (which configures the Ibis connection).

  • version (Version | None, default: None ) –

    If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, save version will be autogenerated.

  • metadata (dict[str, Any] | None, default: None ) –

    Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
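
How `connection` and `credentials` combine can be shown with plain dictionaries. This is a minimal sketch that follows the merge in the `__init__` listing; `resolve_connection_config` is a hypothetical helper name, and the host, user, and password values are made up for illustration.

```python
from copy import deepcopy

DEFAULT_CONNECTION_CONFIG = {"backend": "duckdb", "database": ":memory:"}


def resolve_connection_config(connection=None, credentials=None):
    # Fall back to the in-memory DuckDB default when no connection is given.
    base = connection or DEFAULT_CONNECTION_CONFIG
    extra = deepcopy(credentials) or {}
    # Later entries win, so credentials override the base configuration.
    return {**base, **extra}


# Illustrative values only; none of these refer to a real service.
merged = resolve_connection_config(
    connection={"backend": "postgres", "host": "db.example.com", "user": "reader"},
    credentials={"user": "svc_account", "password": "not-a-real-password"},
)
default = resolve_connection_config()
```

Here `merged["user"]` ends up as `"svc_account"` because the credentials are unpacked last, while `default` is just the in-memory DuckDB configuration.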

Source code in kedro_datasets/ibis/file_dataset.py
def __init__(  # noqa: PLR0913
    self,
    filepath: str,
    file_format: str = "parquet",
    *,
    table_name: str | None = None,
    connection: dict[str, Any] | None = None,
    credentials: dict[str, Any] | None = None,
    load_args: dict[str, Any] | None = None,
    save_args: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    version: Version | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new ``FileDataset`` pointing to the given filepath.

    ``FileDataset`` connects to the Ibis backend object constructed
    from the connection configuration. The `backend` key provided in
    the config can be any of the
    [supported backends](https://ibis-project.org/install). The
    remaining dictionary entries will be passed as arguments to the
    underlying ``connect()`` method (e.g.
    [ibis.duckdb.connect()](https://ibis-project.org/backends/duckdb#ibis.duckdb.connect)).

    The read method corresponding to the given ``file_format`` (e.g.
    [read_csv()](https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_csv))
    is used to load
    the file with the backend. Note that only the data is loaded; no
    link to the underlying file exists past ``FileDataset.load()``.

    Args:
        filepath: Path to a file to register as a table. Most useful
            for loading data into your data warehouse (for testing).
            On save, the backend exports data to the specified path.
        file_format: String specifying the file format for the file.
            Defaults to writing execution results to a Parquet file.
        table_name: The name to use for the created table (on load).
        connection: Configuration for connecting to an Ibis backend.
            If not provided, connect to DuckDB in in-memory mode.
        credentials: Credentials or additional configuration used to
            connect (e.g. user, password, token, account). If given,
            these values override the base connection configuration.
        load_args: Additional arguments passed to the Ibis backend's
            `read_{file_format}` method.
        save_args: Additional arguments passed to the Ibis backend's
            `to_{file_format}` method.
        fs_args: Extra arguments to pass into underlying filesystem class
            constructor (e.g. ``{"project": "my-project"}`` for
            ``GCSFileSystem``). Used only to discover versions and check
            existence of a remote ``filepath``; reading and writing is left
            to the Ibis backend. Note that fsspec credentials go here, not
            in ``credentials`` (which configures the Ibis connection).
        version: If specified, should be an instance of
            ``kedro.io.core.Version``. If its ``load`` attribute is
            None, the latest version will be loaded. If its ``save``
            attribute is None, save version will be autogenerated.
        metadata: Any arbitrary metadata. This is ignored by Kedro,
            but may be consumed by users or external plugins.
    """
    self._file_format = file_format
    self._table_name = table_name
    _connection_config = connection or self.DEFAULT_CONNECTION_CONFIG
    _credentials = deepcopy(credentials) or {}
    self._connection_config = {**_connection_config, **_credentials}
    self._fs_args = deepcopy(fs_args or {})
    self.metadata = metadata

    self._fs_prefix, path = split_filepath(filepath)
    self._fs: fsspec.AbstractFileSystem | None = None

    super().__init__(
        filepath=PurePosixPath(path),
        version=version,
        exists_function=self._fs_exists if self._fs_prefix else None,
        glob_function=self._fs_glob if self._fs_prefix else None,
    )

    # Set load and save arguments, overwriting defaults if provided.
    self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
    if load_args is not None:
        self._load_args.update(load_args)

    self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
    if save_args is not None:
        self._save_args.update(save_args)

DEFAULT_CONNECTION_CONFIG class-attribute

DEFAULT_CONNECTION_CONFIG = {
    "backend": "duckdb",
    "database": ":memory:",
}

DEFAULT_LOAD_ARGS class-attribute

DEFAULT_LOAD_ARGS = {}

DEFAULT_SAVE_ARGS class-attribute

DEFAULT_SAVE_ARGS = {}

_CONNECTION_GROUP class-attribute

_CONNECTION_GROUP = 'ibis'

_connection_config instance-attribute

_connection_config = {
    **_connection_config,
    **_credentials,
}

_file_format instance-attribute

_file_format = file_format

_fs instance-attribute

_fs = None

_fs_args instance-attribute

_fs_args = deepcopy(fs_args or {})

_load_args instance-attribute

_load_args = deepcopy(self.DEFAULT_LOAD_ARGS)

_save_args instance-attribute

_save_args = deepcopy(self.DEFAULT_SAVE_ARGS)

_table_name instance-attribute

_table_name = table_name

connection property

connection

The Backend instance for the connection configuration.

metadata instance-attribute

metadata = metadata

_connect

_connect()
Source code in kedro_datasets/ibis/file_dataset.py
def _connect(self) -> BaseBackend:
    import ibis  # noqa: PLC0415

    config = deepcopy(self._connection_config)
    backend = getattr(ibis, config.pop("backend"))
    return backend.connect(**config)

_describe

_describe()
Source code in kedro_datasets/ibis/file_dataset.py
def _describe(self) -> dict[str, Any]:
    return {
        "filepath": self._fs_prefix + str(self._filepath),
        "file_format": self._file_format,
        "table_name": self._table_name,
        "backend": self._connection_config["backend"],
        "load_args": self._load_args,
        "save_args": self._save_args,
        "version": self._version,
    }

_exists

_exists()
Source code in kedro_datasets/ibis/file_dataset.py
def _exists(self) -> bool:
    try:
        load_path = self._get_load_path()
    except DatasetError:
        return False

    return self._exists_function(str(load_path))

_filesystem

_filesystem()
Source code in kedro_datasets/ibis/file_dataset.py
def _filesystem(self) -> fsspec.AbstractFileSystem:
    # Build lazily so backend-only users don't need adlfs, s3fs, etc.
    if self._fs is None:
        protocol = self._fs_prefix.removesuffix("://")
        self._fs = fsspec.filesystem(protocol, **self._fs_args)
    return self._fs
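
The lazy construction above can be tried in isolation. In this sketch, `LazyFilesystemHolder` and `fake_filesystem` are hypothetical; `fake_filesystem` stands in for `fsspec.filesystem` so the example runs without fsspec or any cloud filesystem package installed.

```python
class LazyFilesystemHolder:
    """Hypothetical holder that builds its filesystem object on first use."""

    def __init__(self, fs_prefix, factory, **fs_args):
        self._fs_prefix = fs_prefix
        self._factory = factory  # stand-in for fsspec.filesystem
        self._fs_args = fs_args
        self._fs = None

    def filesystem(self):
        if self._fs is None:
            # Strip the separator to get the protocol, e.g. "gs://" -> "gs".
            protocol = self._fs_prefix.removesuffix("://")
            self._fs = self._factory(protocol, **self._fs_args)
        return self._fs


calls = []


def fake_filesystem(protocol, **kwargs):
    # Record each construction so the caching behaviour is observable.
    calls.append((protocol, kwargs))
    return object()


holder = LazyFilesystemHolder("gs://", fake_filesystem, project="my-project")
calls_before_use = list(calls)
first = holder.filesystem()
second = holder.filesystem()
```

Nothing is constructed until `filesystem()` is first called, and the second call returns the cached object rather than building a new one.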

_fs_exists

_fs_exists(path)
Source code in kedro_datasets/ibis/file_dataset.py
def _fs_exists(self, path: str) -> bool:
    return self._filesystem().exists(path)

_fs_glob

_fs_glob(pattern)
Source code in kedro_datasets/ibis/file_dataset.py
def _fs_glob(self, pattern: str) -> list[str]:
    return self._filesystem().glob(pattern)

load

load()
Source code in kedro_datasets/ibis/file_dataset.py
def load(self) -> ir.Table:
    load_path = self._fs_prefix + str(self._get_load_path())
    reader = getattr(self.connection, f"read_{self._file_format}")
    return reader(load_path, table_name=self._table_name, **self._load_args)

save

save(data)
Source code in kedro_datasets/ibis/file_dataset.py
def save(self, data: ir.Table) -> None:
    save_path = self._fs_prefix + str(self._get_save_path())
    if not self._fs_prefix:  # only local paths need their parent created
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    writer = getattr(self.connection, f"to_{self._file_format}")
    writer(data, save_path, **self._save_args)
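
The path handling in `save()` treats local and remote targets differently. The sketch below isolates that step; `prepare_save_path` is a hypothetical helper, and the actual write through the backend's `to_{file_format}` method is left out.

```python
import tempfile
from pathlib import Path


def prepare_save_path(fs_prefix, path):
    save_path = fs_prefix + path
    if not fs_prefix:
        # Only local paths need their parent directory created; for remote
        # paths the write is left entirely to the backend.
        Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    return save_path


with tempfile.TemporaryDirectory() as tmp:
    local_target = str(Path(tmp) / "nested" / "dir" / "cars.parquet")
    local_path = prepare_save_path("", local_target)
    local_parent_exists = Path(local_path).parent.is_dir()

remote_path = prepare_save_path("s3://", "your_bucket/data/motorbikes")
```

For the local target the missing `nested/dir` directories are created before any write would happen, whereas the remote path is only reassembled as a string with its protocol prefix.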