kedro_datasets.spark.DeltaTableDataset

class kedro_datasets.spark.DeltaTableDataset(*, filepath, metadata=None)

DeltaTableDataset loads data into DeltaTable objects.

Example usage for the YAML API:

weather@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/data.parquet
  file_format: "delta"

weather@delta:
  type: spark.DeltaTableDataset
  filepath: data/02_intermediate/data.parquet
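
The two entries above share the base name weather and differ only in the suffix after @, which is how Kedro's transcoding convention lets one stored location be read through two dataset types. The following is a minimal sketch of that naming relationship in plain Python. The helper split_transcoded_name is hypothetical and is not part of Kedro; the dictionary simply mirrors the YAML.

```python
from typing import Dict, Optional, Tuple


def split_transcoded_name(entry_name: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper (not a Kedro API): split "base@suffix" into its parts."""
    base, separator, suffix = entry_name.partition("@")
    return base, (suffix if separator else None)


# The same two catalog entries as in the YAML example, written as a dict.
catalog_entries = {
    "weather@spark": {
        "type": "spark.SparkDataset",
        "filepath": "data/02_intermediate/data.parquet",
        "file_format": "delta",
    },
    "weather@delta": {
        "type": "spark.DeltaTableDataset",
        "filepath": "data/02_intermediate/data.parquet",
    },
}

# Group the dataset types by base name to show that both entries describe "weather".
types_by_base: Dict[str, Dict[Optional[str], str]] = {}
for entry_name, entry_config in catalog_entries.items():
    base, suffix = split_transcoded_name(entry_name)
    types_by_base.setdefault(base, {})[suffix] = entry_config["type"]

# Both entries must point at the same location for the pairing to make sense.
shared_filepaths = {entry["filepath"] for entry in catalog_entries.values()}
```

In a pipeline, a node would typically write weather@spark and a later node would read weather@delta to obtain a DeltaTable over the same files.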

Example usage for the Python API:

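from kedro_datasets.spark import DeltaTableDataset, SparkDataset
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Schema for the example rows: a nullable name and a nullable age.
schema = StructType(
    [StructField("name", StringType(), True), StructField("age", IntegerType(), True)]
)

data = [("Alex", 31), ("Bob", 12), ("Clarke", 65), ("Dave", 29)]

spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)

# tmp_path is assumed to be a pathlib.Path to a scratch directory
# (for example, pytest's tmp_path fixture). filepath is documented as a str.
filepath = (tmp_path / "test_data").as_posix()

# Write the DataFrame in Delta format, then load the same location as a DeltaTable.
dataset = SparkDataset(filepath=filepath, file_format="delta")
dataset.save(spark_df)
deltatable_dataset = DeltaTableDataset(filepath=filepath)
delta_table = deltatable_dataset.load()

# DeltaTable.update needs a `set` mapping of column names to SQL expressions.
delta_table.update(set={"age": "age + 1"})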

Methods

`exists`()	Checks whether a data set's output already exists by calling the provided _exists() method.
`from_config`(name, config[, load_version, ...])	Create a data set instance using the configuration provided.
`load`()	Loads data by delegation to the provided load method.
`release`()	Release any cached data.
`save`(data)	Saves data by delegation to the provided save method.

__init__(*, filepath, metadata=None)

Creates a new instance of DeltaTableDataset.

Parameters:

filepath (str) – Filepath in POSIX format to a Spark dataframe. When using Databricks and working with data written to mount points, specify filepaths for (versioned) SparkDatasets starting with /dbfs/mnt.
metadata (Optional[dict[str, Any]]) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
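
As an illustration of the filepath guidance above, the sketch below builds a POSIX path under /dbfs/mnt with the standard library. The helper dbfs_mount_path, the mount name lake, and the metadata keys are all hypothetical and chosen only for the example.

```python
from pathlib import PurePosixPath


def dbfs_mount_path(mount_name: str, *parts: str) -> str:
    """Hypothetical helper: join path parts under /dbfs/mnt/<mount_name> in POSIX form."""
    return str(PurePosixPath("/dbfs/mnt", mount_name, *parts))


# "lake" is an assumed mount name; substitute the mount point used in your workspace.
filepath = dbfs_mount_path("lake", "02_intermediate", "weather")

# Keyword arguments matching the documented __init__ signature.
# The metadata content is arbitrary: Kedro ignores it, plugins may read it.
init_kwargs = {
    "filepath": filepath,
    "metadata": {"owner": "data-team"},
}
```

These keyword arguments could then be passed as DeltaTableDataset(**init_kwargs) in an environment where kedro-datasets and PySpark are installed.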

exists()

Checks whether a data set’s output already exists by calling the provided _exists() method.

Return type: bool
Returns: Flag indicating whether the output already exists.
Raises: DatasetError – When the underlying exists method raises an error.
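
One common use of exists() is to guard a load so that a missing table is handled explicitly. The sketch below assumes only the documented exists() and load() methods. The function load_if_exists and the stub class are hypothetical; the stub stands in for a DeltaTableDataset so the example runs without Spark.

```python
from typing import Any, Optional


def load_if_exists(dataset: Any) -> Optional[Any]:
    """Return dataset.load() when the output exists, otherwise None."""
    if dataset.exists():
        return dataset.load()
    return None


class _StubDataset:
    """Hypothetical stand-in exposing the same exists()/load() methods."""

    def __init__(self, present: bool) -> None:
        self._present = present

    def exists(self) -> bool:
        return self._present

    def load(self) -> str:
        return "delta-table-placeholder"


found = load_if_exists(_StubDataset(present=True))
missing = load_if_exists(_StubDataset(present=False))
```

With a real dataset, a DatasetError raised by exists() would propagate out of load_if_exists unchanged.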

classmethod from_config(name, config, load_version=None, save_version=None)

Create a data set instance using the configuration provided.

Parameters:

name (str) – Data set name.
config (dict[str, Any]) – Data set config dictionary.
load_version (str | None) – Version string to be used for load operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.
save_version (str | None) – Version string to be used for save operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.

Return type:

AbstractDataset

Returns:

An instance of an AbstractDataset subclass.

Raises:

DatasetError – When the function fails to create the data set from its config.
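
A minimal sketch of how from_config might be called, assuming the config dictionary uses the same keys as the YAML entry on this page and that the type key is resolved to the dataset class. The function build_weather_delta is hypothetical; Kedro is imported inside it so the snippet can be read and imported without Kedro installed.

```python
from typing import Any, Dict

# Same content as the weather@delta YAML entry, expressed as a Python dict.
config: Dict[str, Any] = {
    "type": "spark.DeltaTableDataset",
    "filepath": "data/02_intermediate/data.parquet",
}


def build_weather_delta(dataset_config: Dict[str, Any]) -> Any:
    """Create the dataset from its config; needs kedro and kedro-datasets installed."""
    # Lazy import: only required when the function is actually called.
    from kedro.io import AbstractDataset

    # load_version and save_version keep their None defaults.
    return AbstractDataset.from_config("weather@delta", dataset_config)
```

Calling build_weather_delta(config) should return a DeltaTableDataset instance, or raise DatasetError if the config cannot be turned into a dataset.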

load()

Loads data by delegation to the provided load method.

Return type: TypeVar(_DO)
Returns: Data returned by the provided load method.
Raises: DatasetError – When the underlying load method raises an error.
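
For this dataset the loaded object is a DeltaTable, so changes are made through the DeltaTable API rather than through save. The sketch below shows one way to wrap an update call. The function increment_age and the column names are assumptions taken from the example data on this page, and the recording stub replaces a real DeltaTable so the snippet runs without Spark.

```python
from typing import Any, Dict, List, Optional, Tuple


def increment_age(delta_table: Any, condition: Optional[str] = None) -> None:
    """Add one to the age column, optionally only for rows matching a SQL condition."""
    delta_table.update(condition=condition, set={"age": "age + 1"})


class _RecordingTable:
    """Hypothetical stub with the same update(condition, set) keywords as DeltaTable."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Optional[str], Dict[str, str]]] = []

    def update(self, condition: Optional[str] = None, set: Optional[Dict[str, str]] = None) -> None:
        # Record the arguments instead of touching any storage.
        self.calls.append((condition, set or {}))


table = _RecordingTable()
increment_age(table, condition="name = 'Bob'")
```

With a real pipeline, the argument would be the object returned by deltatable_dataset.load().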

release()

Release any cached data.

Raises: DatasetError – When the underlying release method raises an error.
Return type: None

save(data)

Saves data by delegation to the provided save method.

Parameters:

data (TypeVar(_DI)) – The value to be saved by the provided save method.

Raises:

DatasetError – When the underlying save method raises an error.
FileNotFoundError – When the save method receives a file instead of a directory, on Windows.
NotADirectoryError – When the save method receives a file instead of a directory, on Unix.

Return type:

None