DeltaTableDataset
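DeltaTableDataset loads Delta tables using Apache Spark. It does not write data itself; the examples below write the table through SparkDataset with file_format set to "delta".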
kedro_datasets.spark.DeltaTableDataset
DeltaTableDataset(*, filepath, metadata=None)
Bases: AbstractDataset[None, DeltaTable]
DeltaTableDataset loads data into DeltaTable objects.
Examples:
Using the YAML API:
weather@spark:
  type: spark.SparkDataset
  filepath: data/02_intermediate/data.parquet
  file_format: "delta"

weather@delta:
  type: spark.DeltaTableDataset
  filepath: data/02_intermediate/data.parquet
Using the Python API:
>>> from delta import DeltaTable
>>> from kedro_datasets.spark import DeltaTableDataset, SparkDataset
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import StructField, StringType, IntegerType, StructType
>>>
>>> schema = StructType(
... [StructField("name", StringType(), True), StructField("age", IntegerType(), True)]
... )
>>> data = [("Alex", 31), ("Bob", 12), ("Clarke", 65), ("Dave", 29)]
>>> spark_df = SparkSession.builder.getOrCreate().createDataFrame(data, schema)
>>>
>>> # tmp_path is a pathlib.Path to a scratch directory (for example, pytest's tmp_path fixture)
>>> filepath = (tmp_path / "test_data").as_posix()
>>> dataset = SparkDataset(filepath=filepath, file_format="delta")
>>> dataset.save(spark_df)
>>> deltatable_dataset = DeltaTableDataset(filepath=filepath)
>>> delta_table = deltatable_dataset.load()
>>> assert isinstance(delta_table, DeltaTable)
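Because load() returns a DeltaTable rather than a DataFrame, a node can apply in-place Delta operations to it. The following is a minimal sketch of one such operation, an upsert keyed on the name column from the example above. The function name upsert_by_name and the aliases "target" and "source" are illustrative assumptions, not part of kedro-datasets.

```python
def upsert_by_name(delta_table, updates_df):
    """Merge updates_df into delta_table, matching rows on the `name` column.

    Hypothetical helper for illustration only.

    delta_table: a DeltaTable, e.g. the object returned by
        DeltaTableDataset.load().
    updates_df: a Spark DataFrame with the same columns as the table.
    """
    (
        delta_table.alias("target")
        .merge(updates_df.alias("source"), "target.name = source.name")
        .whenMatchedUpdateAll()     # overwrite rows whose name already exists
        .whenNotMatchedInsertAll()  # insert rows whose name is new
        .execute()
    )
    return delta_table
```

With the objects from the example above, this could be called as upsert_by_name(delta_table, spark_df), where spark_df holds the rows to merge.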
Parameters:
- filepath (str) – Filepath in POSIX format to a Spark dataframe. When using Databricks and working with data written to mount points, specify filepaths for (versioned) SparkDatasets starting with /dbfs/mnt.
- metadata (dict[str, Any] | None, default: None) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
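The /dbfs/mnt convention for filepath can be checked before a catalog entry is built. The sketch below uses only the standard library. The helper is_dbfs_mount_path is a hypothetical name introduced here for illustration and is not provided by kedro-datasets.

```python
from pathlib import PurePosixPath

# Prefix named in the filepath parameter description.
DBFS_MOUNT_ROOT = PurePosixPath("/dbfs/mnt")


def is_dbfs_mount_path(filepath: str) -> bool:
    """Return True if the POSIX filepath is /dbfs/mnt or sits below it.

    Hypothetical helper for illustration only. It compares whole path
    components, so "/dbfs/mntx/..." does not count as a match.
    """
    return PurePosixPath(filepath).is_relative_to(DBFS_MOUNT_ROOT)
```

Comparing path components rather than using a plain string prefix avoids accepting look-alike directories such as /dbfs/mntx.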
Source code in kedro_datasets/spark/deltatable_dataset.py
_describe
_describe()
_exists
_exists()
load
load()
save
save(data)