kedro_datasets.pandas.SQLTableDataset¶
- class kedro_datasets.pandas.SQLTableDataset(*, table_name, credentials, load_args=None, save_args=None, metadata=None)[source]¶
SQLTableDataset loads data from a SQL table and saves a pandas dataframe to a table. It uses pandas.DataFrame internally, so it supports all allowed pandas options on read_sql_table and to_sql methods. Since pandas uses SQLAlchemy behind the scenes, when instantiating SQLTableDataset one needs to pass a compatible connection string either in credentials (see the example code snippet below) or in load_args and save_args. Connection string formats supported by SQLAlchemy can be found here: https://docs.sqlalchemy.org/core/engines.html#database-urls

SQLTableDataset modifies the save parameters and stores the data with no index. This is designed to make load and save methods symmetric.

Example usage for the YAML API:
shuttles_table_dataset:
  type: pandas.SQLTableDataset
  credentials: db_credentials
  table_name: shuttles
  load_args:
    schema: dwschema
  save_args:
    schema: dwschema
    if_exists: replace
Sample database credentials entry in credentials.yml:

db_credentials:
  con: postgresql://scott:tiger@localhost/test
  pool_size: 10 # additional parameters
Example usage for the Python API:
from kedro_datasets.pandas import SQLTableDataset
import pandas as pd

data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
table_name = "table_a"
credentials = {"con": f"sqlite:///{tmp_path / 'test.db'}"}
dataset = SQLTableDataset(table_name=table_name, credentials=credentials)
dataset.save(data)
reloaded = dataset.load()
assert data.equals(reloaded)
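For comparison, here is a minimal Python sketch that mirrors the YAML and credentials entries above; it assumes a reachable Postgres instance with those illustrative credentials, and pool_size is simply an additional parameter forwarded to the SQLAlchemy engine alongside con.

from kedro_datasets.pandas import SQLTableDataset

# Hypothetical Postgres-backed dataset mirroring the YAML example above.
# "pool_size" is an extra engine parameter provided alongside "con".
credentials = {
    "con": "postgresql://scott:tiger@localhost/test",
    "pool_size": 10,
}
shuttles_table_dataset = SQLTableDataset(
    table_name="shuttles",
    credentials=credentials,
    load_args={"schema": "dwschema"},
    save_args={"schema": "dwschema", "if_exists": "replace"},
)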
Attributes

engine – The Engine object for the dataset's connection string.

Methods

create_connection(connection_str[, ...]) – Given a connection string, create singleton connection to be used across all instances of SQLTableDataset that need to connect to the same source.
exists() – Checks whether a dataset's output already exists by calling the provided _exists() method.
from_config(name, config[, load_version, ...]) – Create a dataset instance using the configuration provided.
load() – Loads data by delegation to the provided load method.
preview([nrows]) – Generate a preview of the dataset with a specified number of rows.
release() – Release any cached data.
save(data) – Saves data by delegation to the provided save method.
to_config() – Converts the dataset instance into a dictionary-based configuration for serialization.
- __init__(*, table_name, credentials, load_args=None, save_args=None, metadata=None)[source]¶
Creates a new SQLTableDataset.
- Parameters:
table_name (str) – The table name to load or save data to. It overwrites name in save_args and table_name parameters in load_args.
credentials (dict[str, Any]) – A dictionary with a SQLAlchemy connection string. Users are supposed to provide the connection string ‘con’ through credentials. To find all supported connection string formats, see here: https://docs.sqlalchemy.org/core/engines.html#database-urls Additional parameters for the sqlalchemy engine can be provided alongside the ‘con’ parameter.
load_args (Optional[dict[str, Any]]) – Provided to underlying pandas read_sql_table function along with the connection string. To find all supported arguments, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html To find all supported connection string formats, see here: https://docs.sqlalchemy.org/core/engines.html#database-urls
save_args (Optional[dict[str, Any]]) – Provided to underlying pandas to_sql function along with the connection string. To find all supported arguments, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html To find all supported connection string formats, see here: https://docs.sqlalchemy.org/core/engines.html#database-urls It has index=False in the default parameters.
metadata (Optional[dict[str, Any]]) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.
- Raises:
DatasetError – When either table_name or con is empty.
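As a quick illustration of the Raises behaviour above, the sketch below constructs the dataset with an empty connection string; the import path kedro.io.core.DatasetError is an assumption about the installed Kedro version.

from kedro.io.core import DatasetError  # assumption: location of the error class
from kedro_datasets.pandas import SQLTableDataset

try:
    SQLTableDataset(table_name="shuttles", credentials={"con": ""})
except DatasetError as err:
    print(err)  # reports that the connection string ('con') must not be empty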
- classmethod create_connection(connection_str, connection_args=None)[source]¶
Given a connection string, create singleton connection to be used across all instances of SQLTableDataset that need to connect to the same source.
- Return type:
None
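A small sketch of the singleton behaviour, assuming the shared engine is what the engine property below exposes; the SQLite path is purely illustrative.

from kedro_datasets.pandas import SQLTableDataset

con = "sqlite:///example.db"  # hypothetical connection string
SQLTableDataset.create_connection(con)  # create the shared engine up front

# Datasets pointing at the same source reuse that singleton engine.
ds_a = SQLTableDataset(table_name="table_a", credentials={"con": con})
ds_b = SQLTableDataset(table_name="table_b", credentials={"con": con})
assert ds_a.engine is ds_b.engine  # assumption: both resolve to the shared engine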
- property engine¶
The Engine object for the dataset's connection string.
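Since the property returns a plain SQLAlchemy Engine, it can also be used for ad-hoc queries outside of load and save; a minimal sketch reusing the dataset from the Python API example above.

from sqlalchemy import text

# Run an ad-hoc query through the dataset's underlying engine.
with dataset.engine.connect() as conn:
    row_count = conn.execute(text("SELECT COUNT(*) FROM table_a")).scalar()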
- exists()[source]¶
Checks whether a dataset’s output already exists by calling the provided _exists() method.
- Return type:
bool
- Returns:
Flag indicating whether the output already exists.
- Raises:
DatasetError – when underlying exists method raises error.
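A typical use is to guard a first-time save; a short sketch reusing dataset and data from the Python API example above.

# Only write the table if it does not exist yet, then read it back.
if not dataset.exists():
    dataset.save(data)
reloaded = dataset.load()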
- classmethod from_config(name, config, load_version=None, save_version=None)[source]¶
Create a dataset instance using the configuration provided.
- Parameters:
name (str) – Data set name.
load_version (Optional[str]) – Version string to be used for load operation if the dataset is versioned. Has no effect on the dataset if versioning was not enabled.
save_version (Optional[str]) – Version string to be used for save operation if the dataset is versioned. Has no effect on the dataset if versioning was not enabled.
- Return type:
- Returns:
An instance of an AbstractDataset subclass.
- Raises:
DatasetError – When the function fails to create the dataset from its config.
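A hedged sketch of building the dataset from a catalog-style dictionary; the keys mirror the YAML example above with credentials resolved inline, and the 'type' entry is assumed to be required so the dataset class can be resolved.

from kedro_datasets.pandas import SQLTableDataset

config = {
    "type": "pandas.SQLTableDataset",
    "table_name": "shuttles",
    "credentials": {"con": "sqlite:///example.db"},  # hypothetical connection string
    "save_args": {"if_exists": "replace"},
}
dataset = SQLTableDataset.from_config("shuttles_table_dataset", config)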
- load()[source]¶
Loads data by delegation to the provided load method.
- Return type:
DataFrame
- Returns:
Data returned by the provided load method.
- Raises:
DatasetError – When underlying load method raises error.
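Because load delegates to pandas read_sql_table, load_args can narrow what is read; a sketch reusing credentials from the Python API example above and assuming read_sql_table's columns argument.

# Load only a subset of columns from the table.
dataset = SQLTableDataset(
    table_name="table_a",
    credentials=credentials,
    load_args={"columns": ["col1", "col2"]},  # forwarded to read_sql_table
)
subset = dataset.load()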
- release()[source]¶
Release any cached data.
- Raises:
DatasetError – when underlying release method raises error.
- Return type:
None
- save(data)[source]¶
Saves data by delegation to the provided save method.
- Parameters:
data (DataFrame) – the value to be saved by provided save method.
- Raises:
DatasetError – when underlying save method raises error.
FileNotFoundError – when the save method receives a file path instead of a directory, on Windows.
NotADirectoryError – when the save method receives a file path instead of a directory, on Unix.
- Return type:
None
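Because save delegates to pandas to_sql, behaviour for an already existing table follows to_sql's if_exists option (which defaults to failing); a sketch reusing credentials and data from the Python API example above.

# Append rows instead of failing when the table already exists.
dataset = SQLTableDataset(
    table_name="table_a",
    credentials=credentials,
    save_args={"if_exists": "append"},
)
dataset.save(data)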
- to_config()[source]¶
Converts the dataset instance into a dictionary-based configuration for serialization. Ensures that any subclass-specific details are handled, with additional logic for versioning and caching implemented for CachedDataset.
Adds a key for the dataset’s type using its module and class name and includes the initialization arguments.
For CachedDataset it extracts the underlying dataset’s configuration, handles the versioned flag and removes unnecessary metadata. It also ensures the embedded dataset’s configuration is appropriately flattened or transformed.
If the dataset has a version key, it sets the versioned flag in the configuration.
Removes the metadata key from the configuration if present.
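A minimal round-trip sketch reusing the dataset from the Python API example above; the exact keys in the returned dictionary depend on the installed version.

# Serialise the dataset back into a catalog-style configuration dictionary.
config = dataset.to_config()
# Expected to contain the dataset type plus the __init__ arguments, roughly:
# {"type": "<module>.SQLTableDataset", "table_name": "table_a", ...}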