kedro_datasets.pandas.SQLTableDataset¶

class kedro_datasets.pandas.SQLTableDataset(*, table_name, credentials, load_args=None, save_args=None, metadata=None)[source]¶

SQLTableDataset loads data from a SQL table and saves a pandas dataframe to a table. It uses pandas.DataFrame internally, so it supports all allowed pandas options on read_sql_table and to_sql methods. Since Pandas uses SQLAlchemy behind the scenes, when instantiating SQLTableDataset one needs to pass a compatible connection string either in credentials (see the example code snippet below) or in load_args and save_args. Connection string formats supported by SQLAlchemy can be found here: https://docs.sqlalchemy.org/core/engines.html#database-urls

SQLTableDataset modifies the save parameters and stores the data with no index. This is designed to make load and save methods symmetric.

Example usage for the YAML API:

shuttles_table_dataset:
  type: pandas.SQLTableDataset
  credentials: db_credentials
  table_name: shuttles
  load_args:
    schema: dwschema
  save_args:
    schema: dwschema
    if_exists: replace

Sample database credentials entry in credentials.yml:

db_credentials:
  con: postgresql://scott:tiger@localhost/test
  pool_size: 10 # additional parameters

Example usage for the Python API:

from kedro_datasets.pandas import SQLTableDataset
import pandas as pd

data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
table_name = "table_a"
credentials = {"con": f"sqlite:///{tmp_path / 'test.db'}"}
dataset = SQLTableDataset(table_name=table_name, credentials=credentials)

dataset.save(data)
reloaded = dataset.load()

assert data.equals(reloaded)

Attributes

`DEFAULT_LOAD_ARGS`
`DEFAULT_SAVE_ARGS`
`engine`	The `Engine` object for the dataset's connection string.
`engines`

Methods

`create_connection`(connection_str[, ...])	Given a connection string, create singleton connection to be used across all instances of `SQLTableDataset` that need to connect to the same source.
`exists`()	Checks whether a dataset's output already exists by calling the provided _exists() method.
`from_config`(name, config[, load_version, ...])	Create a dataset instance using the configuration provided.
`load`()	Loads data by delegation to the provided load method.
`preview`([nrows])	Generate a preview of the dataset with a specified number of rows.
`release`()	Release any cached data.
`save`(data)	Saves data by delegation to the provided save method.

DEFAULT_LOAD_ARGS: dict[str, Any] = {}¶

DEFAULT_SAVE_ARGS: dict[str, Any] = {'index': False}¶

__init__(*, table_name, credentials, load_args=None, save_args=None, metadata=None)[source]¶

Creates a new SQLTableDataset.

Parameters:

table_name (str) – The table name to load or save data to. It overwrites name in save_args and table_name parameters in load_args.
credentials (dict[str, Any]) – A dictionary with a SQLAlchemy connection string. Users are supposed to provide the connection string ‘con’ through credentials. To find all supported connection string formats, see here: https://docs.sqlalchemy.org/core/engines.html#database-urls Additional parameters for the sqlalchemy engine can be provided alongside the ‘con’ parameter.
load_args (Optional[dict[str, Any]]) – Provided to underlying pandas read_sql_table function along with the connection string. To find all supported arguments, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html To find all supported connection string formats, see here: https://docs.sqlalchemy.org/core/engines.html#database-urls
save_args (Optional[dict[str, Any]]) – Provided to underlying pandas to_sql function along with the connection string. To find all supported arguments, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html To find all supported connection string formats, see here: https://docs.sqlalchemy.org/core/engines.html#database-urls It has index=False in the default parameters.
metadata (Optional[dict[str, Any]]) – Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.

Raises:

DatasetError – When either table_name or con is empty.

classmethod create_connection(connection_str, connection_args=None)[source]¶

Given a connection string, create singleton connection to be used across all instances of SQLTableDataset that need to connect to the same source.

Return type:: None

property engine¶: The Engine object for the dataset’s connection string.

engines: dict[str, Any] = {}¶

exists()¶

Checks whether a dataset’s output already exists by calling the provided _exists() method.

Return type:: bool
Returns:: Flag indicating whether the output already exists.
Raises:: DatasetError – when underlying exists method raises error.

classmethod from_config(name, config, load_version=None, save_version=None)¶

Create a dataset instance using the configuration provided.

Parameters:

name (str) – Data set name.
config (dict[str, Any]) – Data set config dictionary.
load_version (Optional[str]) – Version string to be used for load operation if the dataset is versioned. Has no effect on the dataset if versioning was not enabled.
save_version (Optional[str]) – Version string to be used for save operation if the dataset is versioned. Has no effect on the dataset if versioning was not enabled.

Return type:

AbstractDataset

Returns:

An instance of an AbstractDataset subclass.

Raises:

DatasetError – When the function fails to create the dataset from its config.

load()[source]¶

Loads data by delegation to the provided load method.

Return type:: DataFrame
Returns:: Data returned by the provided load method.
Raises:: DatasetError – When underlying load method raises error.

preview(nrows=5)[source]¶

Generate a preview of the dataset with a specified number of rows.

Parameters:: nrows (int) – The number of rows to include in the preview. Defaults to 5.
Returns:: A dictionary containing the data in a split format.
Return type:: dict

release()¶

Release any cached data.

Raises:: DatasetError – when underlying release method raises error.
Return type:: None

save(data)[source]¶

Saves data by delegation to the provided save method.

Parameters:

data (DataFrame) – the value to be saved by provided save method.

Raises:

DatasetError – when underlying save method raises error.
FileNotFoundError – when save method got file instead of dir, on Windows.
NotADirectoryError – when save method got file instead of dir, on Unix.

Return type:

None