
FileDataset

FileDataset is used to load and save data to files using the Ibis framework.

kedro_datasets.ibis.FileDataset

FileDataset(
    filepath,
    file_format="parquet",
    *,
    table_name=None,
    connection=None,
    credentials=None,
    load_args=None,
    save_args=None,
    version=None,
    metadata=None
)

Bases: ConnectionMixin, AbstractVersionedDataset[Table, Table]

FileDataset loads/saves data from/to a specified file format.

Examples:

Using the YAML API:

cars:
  type: ibis.FileDataset
  filepath: data/01_raw/company/cars.csv
  file_format: csv
  table_name: cars
  connection:
    backend: duckdb
    database: company.db
  load_args:
    sep: ","
    nullstr: "#NA"
  save_args:
    sep: ","
    nullstr: "#NA"

motorbikes:
  type: ibis.FileDataset
  filepath: s3://your_bucket/data/02_intermediate/company/motorbikes/
  file_format: delta
  table_name: motorbikes
  connection:
    backend: polars

Using the Python API:

>>> import ibis
>>> from kedro_datasets.ibis import FileDataset
>>>
>>> data = ibis.memtable({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>>
>>> dataset = FileDataset(
...     filepath=tmp_path / "test.csv",
...     file_format="csv",
...     table_name="test",
...     connection={"backend": "duckdb", "database": tmp_path / "file.db"},
... )
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> assert data.execute().equals(reloaded.execute())

FileDataset connects to the Ibis backend object constructed from the connection configuration. The backend key provided in the config can be any of the supported backends. The remaining dictionary entries will be passed as arguments to the underlying connect() method (e.g. ibis.duckdb.connect()).

The read method corresponding to the given file_format (e.g. read_csv()) is used to load the file with the backend. Note that only the data is loaded; no link to the underlying file exists past FileDataset.load().
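
The two lookups described above (backend by name, then reader by file format) can be sketched without Ibis installed. This is only an illustration: `FakeBackend` and `fake_ibis` are hypothetical stand-ins for an Ibis backend and the `ibis` module, and the helper functions `connect` and `load` are not part of the `FileDataset` API.

```python
from copy import deepcopy
from types import SimpleNamespace


class FakeBackend:
    """Hypothetical stand-in for an Ibis backend; not part of Ibis."""

    def __init__(self, **kwargs):
        # Record whatever was passed to connect() so it can be inspected.
        self.connect_kwargs = kwargs

    def read_csv(self, path, table_name=None, **kwargs):
        # A real backend would return a table expression; this returns a dict.
        return {"path": path, "table_name": table_name, "options": kwargs}


# Stand-in for the `ibis` module: one attribute per backend name,
# each exposing a connect() callable.
fake_ibis = SimpleNamespace(duckdb=SimpleNamespace(connect=FakeBackend))


def connect(config):
    # Pop "backend" to pick the backend; pass everything else to connect().
    config = deepcopy(config)
    backend = getattr(fake_ibis, config.pop("backend"))
    return backend.connect(**config)


def load(connection, filepath, file_format, table_name=None, **load_args):
    # Resolve the reader from the file format, e.g. "csv" -> read_csv.
    reader = getattr(connection, f"read_{file_format}")
    return reader(filepath, table_name=table_name, **load_args)


con = connect({"backend": "duckdb", "database": "company.db"})
table = load(con, "cars.csv", "csv", table_name="cars", sep=",")
```

With a real backend, a `file_format` for which the backend has no matching `read_` method would make the `getattr` lookup raise `AttributeError`.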

Parameters:

  • filepath (str) –

    Path to a file to register as a table. Most useful for loading data into your data warehouse (for testing). On save, the backend exports data to the specified path.

  • file_format (str, default: 'parquet' ) –

    File format of the file. It selects the backend's read_{file_format} and to_{file_format} methods. Defaults to 'parquet'.

  • table_name (str | None, default: None ) –

    The name to use for the created table (on load).

  • connection (dict[str, Any] | None, default: None ) –

    Configuration for connecting to an Ibis backend. If not provided, the dataset connects to an in-memory DuckDB database.

  • credentials (dict[str, Any] | None, default: None ) –

    Credentials or additional configuration used to connect (e.g. user, password, token, account). If given, these values override the base connection configuration.

  • load_args (dict[str, Any] | None, default: None ) –

    Additional arguments passed to the Ibis backend's read_{file_format} method.

  • save_args (dict[str, Any] | None, default: None ) –

    Additional arguments passed to the Ibis backend's to_{file_format} method.

  • version (Version | None, default: None ) –

    If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, save version will be autogenerated.

  • metadata (dict[str, Any] | None, default: None ) –

    Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
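
How connection and credentials combine can be sketched as a plain dictionary merge in which credentials win on key collisions. This is a minimal sketch assuming the merge order shown in the constructor source; `resolve_connection_config` is a hypothetical helper, and the postgres values are made up for illustration.

```python
from copy import deepcopy

# Mirrors the class-level default used when no connection is given.
DEFAULT_CONNECTION_CONFIG = {"backend": "duckdb", "database": ":memory:"}


def resolve_connection_config(connection=None, credentials=None):
    """Hypothetical helper: merge connection and credentials."""
    base = connection or DEFAULT_CONNECTION_CONFIG
    overrides = deepcopy(credentials) or {}
    # Later entries win, so credentials override the base configuration.
    return {**base, **overrides}


default_config = resolve_connection_config()

merged_config = resolve_connection_config(
    connection={"backend": "postgres", "host": "localhost", "user": "reader"},
    credentials={"user": "analyst", "password": "example-password"},
)
```

Because the merge builds a new dictionary, neither the default configuration nor the dictionaries passed in are modified.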

Source code in kedro_datasets/ibis/file_dataset.py
def __init__(  # noqa: PLR0913
    self,
    filepath: str,
    file_format: str = "parquet",
    *,
    table_name: str | None = None,
    connection: dict[str, Any] | None = None,
    credentials: dict[str, Any] | None = None,
    load_args: dict[str, Any] | None = None,
    save_args: dict[str, Any] | None = None,
    version: Version | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new ``FileDataset`` pointing to the given filepath.

    ``FileDataset`` connects to the Ibis backend object constructed
    from the connection configuration. The `backend` key provided in
    the config can be any of the
    [supported backends](https://ibis-project.org/install). The
    remaining dictionary entries will be passed as arguments to the
    underlying ``connect()`` method (e.g.
    [ibis.duckdb.connect()](https://ibis-project.org/backends/duckdb#ibis.duckdb.connect)).

    The read method corresponding to the given ``file_format`` (e.g.
    [read_csv()](https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_csv))
    is used to load
    the file with the backend. Note that only the data is loaded; no
    link to the underlying file exists past ``FileDataset.load()``.

    Args:
        filepath: Path to a file to register as a table. Most useful
            for loading data into your data warehouse (for testing).
            On save, the backend exports data to the specified path.
        file_format: String specifying the file format for the file.
            Defaults to writing execution results to a Parquet file.
        table_name: The name to use for the created table (on load).
        connection: Configuration for connecting to an Ibis backend.
            If not provided, connect to DuckDB in in-memory mode.
        credentials: Credentials or additional configuration used to
            connect (e.g. user, password, token, account). If given,
            these values override the base connection configuration.
        load_args: Additional arguments passed to the Ibis backend's
            `read_{file_format}` method.
        save_args: Additional arguments passed to the Ibis backend's
            `to_{file_format}` method.
        version: If specified, should be an instance of
            ``kedro.io.core.Version``. If its ``load`` attribute is
            None, the latest version will be loaded. If its ``save``
            attribute is None, save version will be autogenerated.
        metadata: Any arbitrary metadata. This is ignored by Kedro,
            but may be consumed by users or external plugins.
    """
    self._file_format = file_format
    self._table_name = table_name
    _connection_config = connection or self.DEFAULT_CONNECTION_CONFIG
    _credentials = deepcopy(credentials) or {}
    self._connection_config = {**_connection_config, **_credentials}
    self.metadata = metadata

    super().__init__(
        filepath=PurePosixPath(filepath),
        version=version,
        exists_function=lambda filepath: Path(filepath).exists(),
    )

    # Set load and save arguments, overwriting defaults if provided.
    self._load_args = deepcopy(self.DEFAULT_LOAD_ARGS)
    if load_args is not None:
        self._load_args.update(load_args)

    self._save_args = deepcopy(self.DEFAULT_SAVE_ARGS)
    if save_args is not None:
        self._save_args.update(save_args)

DEFAULT_CONNECTION_CONFIG class-attribute

DEFAULT_CONNECTION_CONFIG = {
    "backend": "duckdb",
    "database": ":memory:",
}

DEFAULT_LOAD_ARGS class-attribute

DEFAULT_LOAD_ARGS = {}

DEFAULT_SAVE_ARGS class-attribute

DEFAULT_SAVE_ARGS = {}

_CONNECTION_GROUP class-attribute

_CONNECTION_GROUP = 'ibis'

_connection_config instance-attribute

_connection_config = {
    **_connection_config,
    **_credentials,
}

_file_format instance-attribute

_file_format = file_format

_load_args instance-attribute

_load_args = deepcopy(DEFAULT_LOAD_ARGS)

_save_args instance-attribute

_save_args = deepcopy(DEFAULT_SAVE_ARGS)

_table_name instance-attribute

_table_name = table_name

connection property

connection

The Backend instance for the connection configuration.

metadata instance-attribute

metadata = metadata

_connect

_connect()
Source code in kedro_datasets/ibis/file_dataset.py
def _connect(self) -> BaseBackend:
    import ibis  # noqa: PLC0415

    config = deepcopy(self._connection_config)
    backend = getattr(ibis, config.pop("backend"))
    return backend.connect(**config)

_describe

_describe()
Source code in kedro_datasets/ibis/file_dataset.py
def _describe(self) -> dict[str, Any]:
    return {
        "filepath": self._filepath,
        "file_format": self._file_format,
        "table_name": self._table_name,
        "backend": self._connection_config["backend"],
        "load_args": self._load_args,
        "save_args": self._save_args,
        "version": self._version,
    }

_exists

_exists()
Source code in kedro_datasets/ibis/file_dataset.py
def _exists(self) -> bool:
    try:
        load_path = self._get_load_path()
    except DatasetError:
        return False

    return Path(load_path).exists()

load

load()
Source code in kedro_datasets/ibis/file_dataset.py
def load(self) -> ir.Table:
    load_path = self._get_load_path()
    reader = getattr(self.connection, f"read_{self._file_format}")
    return reader(load_path, table_name=self._table_name, **self._load_args)

save

save(data)
Source code in kedro_datasets/ibis/file_dataset.py
def save(self, data: ir.Table) -> None:
    save_path = self._get_save_path()
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    writer = getattr(self.connection, f"to_{self._file_format}")
    writer(data, save_path, **self._save_args)
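
The save path above (create parent directories, resolve the writer from the file format, then write) can be sketched with the standard library alone. This is an illustration under stated assumptions: `FakeBackend` is a hypothetical stand-in whose `to_csv` writes plain rows, whereas a real Ibis backend would execute a table expression and export the result.

```python
import csv
import tempfile
from pathlib import Path


class FakeBackend:
    """Hypothetical stand-in for an Ibis backend; not part of Ibis."""

    def to_csv(self, rows, path, **kwargs):
        # Write rows with the requested separator (defaults to a comma).
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter=kwargs.get("sep", ","))
            writer.writerows(rows)


def save(connection, rows, save_path, file_format, **save_args):
    # Create missing parent directories before writing, as save() does.
    Path(save_path).parent.mkdir(parents=True, exist_ok=True)
    # Resolve the writer from the file format, e.g. "csv" -> to_csv.
    writer = getattr(connection, f"to_{file_format}")
    writer(rows, save_path, **save_args)


with tempfile.TemporaryDirectory() as tmp:
    # The intermediate directories do not exist yet; save() creates them.
    target = Path(tmp) / "02_intermediate" / "company" / "cars.csv"
    save(FakeBackend(), [["make", "year"], ["Saab", "1999"]], target, "csv", sep=";")
    written_lines = target.read_text().splitlines()
```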