kedro.extras.datasets.pandas.SQLQueryDataSet¶
- class kedro.extras.datasets.pandas.SQLQueryDataSet(sql=None, credentials=None, load_args=None, fs_args=None, filepath=None, execution_options=None)[source]¶
SQLQueryDataSet
loads data from a provided SQL query. It usespandas.DataFrame
internally, so it supports all allowed pandas options onread_sql_query
. Since Pandas uses SQLAlchemy behind the scenes, when instantiatingSQLQueryDataSet
one needs to pass a compatible connection string either incredentials
(see the example code snippet below) or inload_args
. Connection string formats supported by SQLAlchemy can be found here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urlsIt does not support save method so it is a read only data set. To save data to a SQL server use
SQLTableDataSet
.Example usage for the YAML API:
shuttle_id_dataset: type: pandas.SQLQueryDataSet sql: "select shuttle, shuttle_id from spaceflights.shuttles;" credentials: db_credentials
Advanced example using the
stream_results
andchunksize
options to reduce memory usage:shuttle_id_dataset: type: pandas.SQLQueryDataSet sql: "select shuttle, shuttle_id from spaceflights.shuttles;" credentials: db_credentials execution_options: stream_results: true load_args: chunksize: 1000
Sample database credentials entry in
credentials.yml
:db_credentials: con: postgresql://scott:tiger@localhost/test
Example usage for the Python API:
from kedro.extras.datasets.pandas import SQLQueryDataSet import pandas as pd data = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]}) sql = "SELECT * FROM table_a" credentials = { "con": "postgresql://scott:tiger@localhost/test" } data_set = SQLQueryDataSet(sql=sql, credentials=credentials) sql_data = data_set.load()
Attributes
Methods
create_connection
(connection_str)Given a connection string, create singleton connection to be used across all instances of SQLQueryDataSet that need to connect to the same source.
exists
()Checks whether a data set's output already exists by calling the provided _exists() method.
from_config
(name, config[, load_version, ...])Create a data set instance using the configuration provided.
load
()Loads data by delegation to the provided load method.
release
()Release any cached data.
save
(data)Saves data by delegation to the provided save method.
- __init__(sql=None, credentials=None, load_args=None, fs_args=None, filepath=None, execution_options=None)[source]¶
Creates a new
SQLQueryDataSet
.- Parameters
sql (
Optional
[str
]) – The sql query statement.credentials (
Optional
[Dict
[str
,Any
]]) – A dictionary with aSQLAlchemy
connection string. Users are supposed to provide the connection string ‘con’ through credentials. It overwrites con parameter inload_args
andsave_args
in case it is provided. To find all supported connection string formats, see here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urlsload_args (
Optional
[Dict
[str
,Any
]]) – Provided to underlying pandasread_sql_query
function along with the connection string. To find all supported arguments, see here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html To find all supported connection string formats, see here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urlsfs_args (
Optional
[Dict
[str
,Any
]]) – Extra arguments to pass into underlying filesystem class constructor (e.g. {“project”: “my-project”} forGCSFileSystem
), as well as to pass to the filesystem’s open method through nested keys open_args_load and open_args_save. Here you can find all available arguments for open: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open All defaults are preserved, except mode, which is set to r when loading.filepath (
Optional
[str
]) – A path to a file with a sql query statement.execution_options (
Optional
[Dict
[str
,Any
]]) – A dictionary with non-SQL advanced options for the connection to be applied to the underlying engine. To find all supported execution options, see here: https://docs.sqlalchemy.org/en/12/core/connections.html#sqlalchemy.engine.Connection.execution_options Note that this is not a standard argument supported by pandas API, but could be useful for handling large datasets.
- Raises
DataSetError – When either
sql
orcon
parameters is empty.
- classmethod create_connection(connection_str)[source]¶
Given a connection string, create singleton connection to be used across all instances of SQLQueryDataSet that need to connect to the same source.
- Return type
None
- engines: Dict[str, Any] = {}¶
- exists()¶
Checks whether a data set’s output already exists by calling the provided _exists() method.
- Return type
bool
- Returns
Flag indicating whether the output already exists.
- Raises
DataSetError – when underlying exists method raises error.
- classmethod from_config(name, config, load_version=None, save_version=None)¶
Create a data set instance using the configuration provided.
- Parameters
name (
str
) – Data set name.config (
Dict
[str
,Any
]) – Data set config dictionary.load_version (
Optional
[str
]) – Version string to be used forload
operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.save_version (
Optional
[str
]) – Version string to be used forsave
operation if the data set is versioned. Has no effect on the data set if versioning was not enabled.
- Return type
- Returns
An instance of an
AbstractDataSet
subclass.- Raises
DataSetError – When the function fails to create the data set from its config.
- load()¶
Loads data by delegation to the provided load method.
- Return type
TypeVar
(_DO
)- Returns
Data returned by the provided load method.
- Raises
DataSetError – When underlying load method raises error.
- release()¶
Release any cached data.
- Raises
DataSetError – when underlying release method raises error.
- Return type
None
- save(data)¶
Saves data by delegation to the provided save method.
- Parameters
data (
TypeVar
(_DI
)) – the value to be saved by provided save method.- Raises
DataSetError – when underlying save method raises error.
FileNotFoundError – when save method got file instead of dir, on Windows.
NotADirectoryError – when save method got file instead of dir, on Unix.
- Return type
None