GBQQueryDataset¶

GBQQueryDataset loads data from a provided SQL query in Google BigQuery using pandas-gbq. It is read-only.

kedro_datasets.pandas.GBQQueryDataset ¶

GBQQueryDataset(
    sql=None,
    project=None,
    credentials=None,
    load_args=None,
    fs_args=None,
    filepath=None,
    metadata=None,
)

Bases: AbstractDataset[None, DataFrame]

GBQQueryDataset loads data from a provided SQL query in Google BigQuery. It calls pandas_gbq.read_gbq to run the query and read the result into a DataFrame, so it supports every option that read_gbq accepts.

Example usage for the YAML API:¶

vehicles:
    type: pandas.GBQQueryDataset
    sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
    project: my-project
    credentials: gbq-creds
    load_args:
        reauth: True

Example usage for the Python API:¶

from kedro_datasets.pandas import GBQQueryDataset

sql = "SELECT * FROM dataset_1.table_a"

dataset = GBQQueryDataset(sql, project="my-project")

sql_data = dataset.load()

Parameters:

sql (str | None, default: None ) –

The sql query statement.
project (str | None, default: None ) –

Google BigQuery Account project ID. Optional when available from the environment. https://cloud.google.com/resource-manager/docs/creating-managing-projects
credentials (dict[str, Any] | Credentials | None, default: None ) –

Credentials for accessing Google APIs. One of: a credentials object derived from google.auth.credentials.Credentials, a service account JSON as a dictionary, or a path to a service account key JSON file. https://googleapis.dev/python/google-auth/latest/
load_args (dict[str, Any] | None, default: None ) –

Options passed to read_gbq when loading the query result into a DataFrame. All available arguments are listed at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html All defaults are preserved.
fs_args (dict[str, Any] | None, default: None ) –

Extra arguments to pass into underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem) used for reading the SQL query from filepath.
filepath (str | None, default: None ) –

A path to a file containing the SQL query statement.
metadata (dict[str, Any] | None, default: None ) –

Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.

Raises:

DatasetError –

Raised when the sql and filepath parameters are both empty or both provided, and whenever the save() method is invoked.
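
The either/or rule for sql and filepath can be sketched in isolation. This is a minimal stand-in rather than the library code: `DatasetError` below is a local placeholder for Kedro's exception, and `check_query_source` is a hypothetical helper name introduced only for illustration.

```python
class DatasetError(Exception):
    """Local placeholder for Kedro's DatasetError (assumption for this sketch)."""


def check_query_source(sql=None, filepath=None):
    """Hypothetical helper mirroring the constructor's argument check."""
    if sql and filepath:
        raise DatasetError("'sql' and 'filepath' cannot both be provided.")
    if not (sql or filepath):
        raise DatasetError("'sql' and 'filepath' cannot both be empty.")
    # Exactly one source is set; report which one will be used.
    return "sql" if sql else "filepath"


print(check_query_source(sql="SELECT 1"))
print(check_query_source(filepath="queries/vehicles.sql"))
```

Because the check relies on truthiness, an empty string is treated the same as None.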

Source code in kedro_datasets/pandas/gbq_dataset.py

def __init__(  # noqa: PLR0913
    self,
    sql: str | None = None,
    project: str | None = None,
    credentials: dict[str, Any] | Credentials | None = None,
    load_args: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    filepath: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new instance of ``GBQQueryDataset``.

    Args:
        sql: The sql query statement.
        project: Google BigQuery Account project ID.
            Optional when available from the environment.
            https://cloud.google.com/resource-manager/docs/creating-managing-projects
        credentials: Credentials for accessing Google APIs.
            Either a credential that bases on ``google.auth.credentials.Credentials`` OR
            a service account json as a dictionary OR
            a path to a service account key json file.
            https://googleapis.dev/python/google-auth/latest/
        load_args: Pandas options for loading BigQuery table into DataFrame.
            Here you can find all available arguments:
            https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html
            All defaults are preserved.
        fs_args: Extra arguments to pass into underlying filesystem class constructor
            (e.g. `{"project": "my-project"}` for ``GCSFileSystem``) used for reading the
            SQL query from filepath.
        filepath: A path to a file with a sql query statement.
        metadata: Any arbitrary metadata.
            This is ignored by Kedro, but may be consumed by users or external plugins.

    Raises:
        DatasetError: When ``sql`` and ``filepath`` parameters are either both empty
            or both provided, as well as when the `save()` method is invoked.
    """
    if sql and filepath:
        raise DatasetError(
            "'sql' and 'filepath' arguments cannot both be provided. "
            "Please only provide one."
        )

    if not (sql or filepath):
        raise DatasetError(
            "'sql' and 'filepath' arguments cannot both be empty. "
            "Please provide a sql query or path to a sql query file."
        )

    # Handle default load arguments
    self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}

    self._project_id = project

    if (not isinstance(credentials, Credentials)) and (credentials is not None):
        self._credentials = _get_credentials(credentials)
    else:
        self._credentials = credentials

    # load sql query from arg or from file
    if sql:
        self._load_args["query_or_table"] = sql
        self._filepath = None
    else:
        # filesystem for loading sql file
        _fs_args = copy.deepcopy(fs_args) or {}
        _fs_credentials = _fs_args.pop("credentials", {})
        protocol, path = get_protocol_and_path(str(filepath))

        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol, **_fs_credentials, **_fs_args)
        self._filepath = path

    self.metadata = metadata
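
The way the constructor separates a "credentials" entry from the rest of fs_args can be shown without any filesystem. The sketch below is an illustration under stated assumptions: `split_fs_args` is a hypothetical helper name, and the dictionary values are made up for the example.

```python
import copy


def split_fs_args(fs_args=None):
    """Hypothetical helper: separate filesystem credentials from other options."""
    # deepcopy(None) is None, so `or {}` supplies an empty dict in that case.
    remaining = copy.deepcopy(fs_args) or {}
    # "credentials" is removed from the copy, not from the caller's dict.
    fs_credentials = remaining.pop("credentials", {})
    return fs_credentials, remaining


# Illustrative values only.
user_fs_args = {"project": "my-project", "credentials": {"token": "anon"}}
creds, rest = split_fs_args(user_fs_args)
print(creds)
print(rest)
print("credentials" in user_fs_args)
```

Working on a deep copy means the dictionary passed in by the caller is left untouched, while both parts are later unpacked as keyword arguments into the filesystem constructor.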

DEFAULT_LOAD_ARGS `class-attribute` `instance-attribute` ¶

DEFAULT_LOAD_ARGS = {}

_credentials `instance-attribute` ¶

_credentials = _get_credentials(credentials)

_filepath `instance-attribute` ¶

_filepath = None

_fs `instance-attribute` ¶

_fs = filesystem(_protocol, **_fs_credentials, **_fs_args)

_load_args `instance-attribute` ¶

_load_args = {**DEFAULT_LOAD_ARGS, **(load_args or {})}
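
The merge of class defaults with user-supplied load_args follows the usual dictionary-unpacking pattern, where later keys win. A small sketch, assuming a non-empty default purely for illustration (the class itself ships an empty DEFAULT_LOAD_ARGS) and using a hypothetical `merge_load_args` helper:

```python
# Hypothetical defaults for the sketch; the real class attribute is {}.
DEFAULT_LOAD_ARGS = {"dialect": "standard"}


def merge_load_args(load_args=None):
    """Return defaults overlaid with user options; user keys take precedence."""
    return {**DEFAULT_LOAD_ARGS, **(load_args or {})}


merged = merge_load_args({"dialect": "legacy", "reauth": True})
print(merged)
print(merge_load_args())
```

The unpacking builds a new dictionary each time, so the class-level defaults are never modified by an instance.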

_project_id `instance-attribute` ¶

_project_id = project

_protocol `instance-attribute` ¶

_protocol = protocol

metadata `instance-attribute` ¶

metadata = metadata

_describe ¶

_describe()

Source code in kedro_datasets/pandas/gbq_dataset.py

def _describe(self) -> dict[str, Any]:
    load_args = copy.deepcopy(self._load_args)
    desc = {}
    desc["sql"] = str(load_args.pop("query_or_table", None))
    desc["filepath"] = str(self._filepath)
    desc["load_args"] = str(load_args)

    return desc
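
To see what this description looks like for the two ways of supplying a query, the same logic can be run as a free function. This is a sketch only: `describe` is a hypothetical stand-in that takes the two private attributes as plain arguments.

```python
import copy


def describe(load_args, filepath):
    """Stand-in for _describe, with the instance state passed in explicitly."""
    load_args = copy.deepcopy(load_args)
    return {
        "sql": str(load_args.pop("query_or_table", None)),
        "filepath": str(filepath),
        "load_args": str(load_args),
    }


# Query given inline: it is stored under "query_or_table" at construction.
inline = describe({"query_or_table": "SELECT 1", "reauth": True}, None)
# Query given by file: nothing is stored under "query_or_table" yet.
from_file = describe({"reauth": True}, "queries/vehicles.sql")
print(inline)
print(from_file)
```

Since every value is passed through str(), a missing query or path appears as the string "None" rather than the None object.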

load ¶

load()

Source code in kedro_datasets/pandas/gbq_dataset.py

def load(self) -> pd.DataFrame:
    load_args = copy.deepcopy(self._load_args)

    if self._filepath:
        load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
        with self._fs.open(load_path, mode="r") as fs_file:
            load_args["query_or_table"] = fs_file.read()

    return pd_gbq.read_gbq(
        project_id=self._project_id,
        credentials=self._credentials,
        **load_args,
    )
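
The flow of arguments in load() can be traced with a stub in place of BigQuery. The sketch below makes several assumptions: `fake_read_gbq` is a stub that only echoes its keyword arguments, `load` is a free-function stand-in for the method, and an in-memory text buffer replaces the fsspec file.

```python
import copy
import io


def fake_read_gbq(**kwargs):
    """Stub standing in for pandas_gbq.read_gbq; returns what it was called with."""
    return kwargs


def load(load_args, project_id, credentials, query_file=None):
    """Stand-in for load(): the query file, if any, is read at load time."""
    load_args = copy.deepcopy(load_args)
    if query_file is not None:
        load_args["query_or_table"] = query_file.read()
    return fake_read_gbq(project_id=project_id, credentials=credentials, **load_args)


call = load({"reauth": True}, "my-project", None, io.StringIO("SELECT 1"))
print(call)
```

One consequence visible in the sketch: a `project_id` or `credentials` key inside load_args would collide with the explicit keyword and raise a TypeError, so those two values belong in the dedicated constructor parameters.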

save ¶

save(data)

Source code in kedro_datasets/pandas/gbq_dataset.py

def save(self, data: None) -> NoReturn:
    raise DatasetError("'save' is not supported on GBQQueryDataset")
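
Callers that treat datasets generically may want to handle the read-only case explicitly. A minimal sketch of that pattern, assuming a local placeholder `DatasetError` and a hypothetical `ReadOnlyQuery` class in place of the real dataset:

```python
from typing import NoReturn


class DatasetError(Exception):
    """Local placeholder for Kedro's DatasetError (assumption for this sketch)."""


class ReadOnlyQuery:
    """Hypothetical stand-in for a load-only dataset."""

    def save(self, data: None) -> NoReturn:
        raise DatasetError("'save' is not supported on ReadOnlyQuery")


def try_save(dataset, data):
    """Attempt a save and return the error message if the dataset refuses."""
    try:
        dataset.save(data)
    except DatasetError as exc:
        return str(exc)


message = try_save(ReadOnlyQuery(), None)
print(message)
```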
