GBQQueryDataset

GBQQueryDataset loads data from a provided SQL query in Google BigQuery using pandas-gbq. It is read-only.

kedro_datasets.pandas.GBQQueryDataset

GBQQueryDataset(
    sql=None,
    project=None,
    credentials=None,
    load_args=None,
    fs_args=None,
    filepath=None,
    metadata=None,
)

Bases: AbstractDataset[None, DataFrame]

GBQQueryDataset loads data from Google BigQuery using a provided SQL query. It calls pandas_gbq.read_gbq to run the query and read the result into a DataFrame, so it supports every option that read_gbq accepts.

Example usage for the YAML API:
vehicles:
    type: pandas.GBQQueryDataset
    sql: "select shuttle, shuttle_id from spaceflights.shuttles;"
    project: my-project
    credentials: gbq-creds
    load_args:
        reauth: True
Example usage for the Python API:
from kedro_datasets.pandas import GBQQueryDataset

sql = "SELECT * FROM dataset_1.table_a"

dataset = GBQQueryDataset(sql, project="my-project")

sql_data = dataset.load()

Parameters:

  • sql (str | None, default: None ) –

    The SQL query statement. Mutually exclusive with filepath.

  • project (str | None, default: None ) –

    Google BigQuery Account project ID. Optional when available from the environment. https://cloud.google.com/resource-manager/docs/creating-managing-projects

  • credentials (dict[str, Any] | Credentials | None, default: None ) –

    Credentials for accessing Google APIs. One of: a credentials object derived from google.auth.credentials.Credentials, a service account JSON as a dictionary, or a path to a service account key JSON file. https://googleapis.dev/python/google-auth/latest/

  • load_args (dict[str, Any] | None, default: None ) –

    Pandas options for loading BigQuery table into DataFrame. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html All defaults are preserved.

  • fs_args (dict[str, Any] | None, default: None ) –

    Extra arguments to pass into underlying filesystem class constructor (e.g. {"project": "my-project"} for GCSFileSystem) used for reading the SQL query from filepath.

  • filepath (str | None, default: None ) –

    A path to a file containing the SQL query statement. Mutually exclusive with sql.

  • metadata (dict[str, Any] | None, default: None ) –

    Any arbitrary metadata. This is ignored by Kedro, but may be consumed by users or external plugins.

Raises:

  • DatasetError

    When sql and filepath parameters are either both empty or both provided, as well as when the save() method is invoked.
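
The "exactly one of sql or filepath" rule can be illustrated in isolation. The following is a minimal standalone sketch, not the library's code: `check_query_source` is a hypothetical helper, and `ValueError` stands in for `DatasetError` so the snippet runs without Kedro installed.

```python
from __future__ import annotations


def check_query_source(sql: str | None = None, filepath: str | None = None) -> str:
    """Return which query source is in use, or raise if zero or two were given.

    Hypothetical helper; ValueError stands in for Kedro's DatasetError.
    """
    if sql and filepath:
        raise ValueError("'sql' and 'filepath' cannot both be provided.")
    if not (sql or filepath):
        raise ValueError("'sql' and 'filepath' cannot both be empty.")
    return "sql" if sql else "filepath"


print(check_query_source(sql="SELECT 1"))
print(check_query_source(filepath="queries/shuttles.sql"))
```

Because the check relies on truthiness, an empty string counts as "not provided", just like None.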

Source code in kedro_datasets/pandas/gbq_dataset.py
def __init__(  # noqa: PLR0913
    self,
    sql: str | None = None,
    project: str | None = None,
    credentials: dict[str, Any] | Credentials | None = None,
    load_args: dict[str, Any] | None = None,
    fs_args: dict[str, Any] | None = None,
    filepath: str | None = None,
    metadata: dict[str, Any] | None = None,
) -> None:
    """Creates a new instance of ``GBQQueryDataset``.

    Args:
        sql: The sql query statement.
        project: Google BigQuery Account project ID.
            Optional when available from the environment.
            https://cloud.google.com/resource-manager/docs/creating-managing-projects
        credentials: Credentials for accessing Google APIs.
            Either a credential that bases on ``google.auth.credentials.Credentials`` OR
            a service account json as a dictionary OR
            a path to a service account key json file.
            https://googleapis.dev/python/google-auth/latest/
        load_args: Pandas options for loading BigQuery table into DataFrame.
            Here you can find all available arguments:
            https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html
            All defaults are preserved.
        fs_args: Extra arguments to pass into underlying filesystem class constructor
            (e.g. `{"project": "my-project"}` for ``GCSFileSystem``) used for reading the
            SQL query from filepath.
        filepath: A path to a file with a sql query statement.
        metadata: Any arbitrary metadata.
            This is ignored by Kedro, but may be consumed by users or external plugins.

    Raises:
        DatasetError: When ``sql`` and ``filepath`` parameters are either both empty
            or both provided, as well as when the `save()` method is invoked.
    """
    if sql and filepath:
        raise DatasetError(
            "'sql' and 'filepath' arguments cannot both be provided. "
            "Please only provide one."
        )

    if not (sql or filepath):
        raise DatasetError(
            "'sql' and 'filepath' arguments cannot both be empty. "
            "Please provide a sql query or path to a sql query file."
        )

    # Handle default load arguments
    self._load_args = {**self.DEFAULT_LOAD_ARGS, **(load_args or {})}

    self._project_id = project

    if (not isinstance(credentials, Credentials)) and (credentials is not None):
        self._credentials = _get_credentials(credentials)
    else:
        self._credentials = credentials

    # load sql query from arg or from file
    if sql:
        self._load_args["query_or_table"] = sql
        self._filepath = None
    else:
        # filesystem for loading sql file
        _fs_args = copy.deepcopy(fs_args) or {}
        _fs_credentials = _fs_args.pop("credentials", {})
        protocol, path = get_protocol_and_path(str(filepath))

        self._protocol = protocol
        self._fs = fsspec.filesystem(self._protocol, **_fs_credentials, **_fs_args)
        self._filepath = path

    self.metadata = metadata

DEFAULT_LOAD_ARGS class-attribute instance-attribute

DEFAULT_LOAD_ARGS = {}

_credentials instance-attribute

_credentials = _get_credentials(credentials)

_filepath instance-attribute

_filepath = None

_fs instance-attribute

_fs = filesystem(_protocol, **_fs_credentials, **_fs_args)

_load_args instance-attribute

_load_args = {**DEFAULT_LOAD_ARGS, **(load_args or {})}
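
The merge that produces `_load_args` is plain dict unpacking (`{**DEFAULT_LOAD_ARGS, **(load_args or {})}` in `__init__`). The sketch below reproduces that pattern on its own; `merge_load_args` is a hypothetical helper, and the non-empty default is an assumption made only to show the override order, since the class default is empty.

```python
from __future__ import annotations

from typing import Any


def merge_load_args(
    defaults: dict[str, Any], load_args: dict[str, Any] | None
) -> dict[str, Any]:
    """Mirror the dict-unpacking merge: later keys win, inputs stay untouched."""
    return {**defaults, **(load_args or {})}


# The class default is an empty dict, so user options pass through unchanged.
no_defaults = merge_load_args({}, {"reauth": True})

# With a hypothetical non-empty default, the user-supplied value overrides it.
overridden = merge_load_args({"dialect": "standard"}, {"dialect": "legacy"})

# Omitting load_args (None) yields a copy of the defaults.
untouched = merge_load_args({"dialect": "standard"}, None)

print(no_defaults, overridden, untouched)
```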

_project_id instance-attribute

_project_id = project

_protocol instance-attribute

_protocol = protocol

metadata instance-attribute

metadata = metadata

_describe

_describe()
Source code in kedro_datasets/pandas/gbq_dataset.py
def _describe(self) -> dict[str, Any]:
    load_args = copy.deepcopy(self._load_args)
    desc = {}
    desc["sql"] = str(load_args.pop("query_or_table", None))
    desc["filepath"] = str(self._filepath)
    desc["load_args"] = str(load_args)

    return desc
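
To make the shape of the description concrete, here is a standalone replica of the logic above. It is a sketch for illustration only: `describe` is a hypothetical free function that takes the two pieces of instance state as arguments instead of reading them from `self`.

```python
import copy
from typing import Any


def describe(load_args: dict[str, Any], filepath: Any) -> dict[str, str]:
    """Replicate the description logic without a dataset instance."""
    load_args = copy.deepcopy(load_args)  # keep the caller's dict intact
    return {
        "sql": str(load_args.pop("query_or_table", None)),
        "filepath": str(filepath),
        "load_args": str(load_args),
    }


# Dataset configured with an inline query: no file path is stored.
inline = describe({"query_or_table": "SELECT 1", "reauth": True}, None)

# Dataset configured with a file: the query text is not stored in load_args.
from_file = describe({"reauth": True}, "queries/shuttles.sql")

print(inline)
print(from_file)
```

Every value is passed through `str()`, so an unset field shows up as the string "None" rather than the None object. For a file-based dataset the `sql` entry is "None" because the query text is only read inside `load()`.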

load

load()
Source code in kedro_datasets/pandas/gbq_dataset.py
def load(self) -> pd.DataFrame:
    load_args = copy.deepcopy(self._load_args)

    if self._filepath:
        load_path = get_filepath_str(PurePosixPath(self._filepath), self._protocol)
        with self._fs.open(load_path, mode="r") as fs_file:
            load_args["query_or_table"] = fs_file.read()

    return pd_gbq.read_gbq(
        project_id=self._project_id,
        credentials=self._credentials,
        **load_args,
    )
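
The control flow of `load()` can be followed without touching BigQuery. The sketch below is an assumption-laden stand-in: `fake_read_gbq` is a hypothetical replacement for `pandas_gbq.read_gbq` that returns its arguments, the free function `load` takes state as parameters, and a local temporary file replaces the fsspec filesystem.

```python
import copy
import tempfile
from pathlib import Path


def fake_read_gbq(query_or_table, project_id=None, credentials=None, **kwargs):
    """Stand-in for pandas_gbq.read_gbq: echoes its arguments, runs no query."""
    return {"query": query_or_table, "project_id": project_id, **kwargs}


def load(load_args, filepath=None, project_id=None, reader=fake_read_gbq):
    """Follow the same steps as the method above, using local file access."""
    load_args = copy.deepcopy(load_args)  # stored arguments are never mutated
    if filepath:
        # The query text is read on every call, so edits to the file take effect.
        load_args["query_or_table"] = Path(filepath).read_text()
    return reader(project_id=project_id, credentials=None, **load_args)


with tempfile.TemporaryDirectory() as tmp:
    sql_file = Path(tmp) / "shuttles.sql"
    sql_file.write_text("SELECT shuttle FROM spaceflights.shuttles")
    result = load({"reauth": True}, filepath=sql_file, project_id="my-project")

print(result["query"])
```

Injecting the reader as a parameter is a choice made for this sketch so the flow can be exercised offline; the real method calls `pd_gbq.read_gbq` directly.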

save

save(data)
Source code in kedro_datasets/pandas/gbq_dataset.py
def save(self, data: None) -> NoReturn:
    raise DatasetError("'save' is not supported on GBQQueryDataset")