BigQuery¶
Overview¶
Google BigQuery is a cloud data warehouse solution. Data is stored in tables, and users can query using SQL. BigQuery uses datasets as top level containers for tables, and datasets are themselves contained within Google Cloud projects.
Quickstart¶
To instantiate the GoogleBigQuery class,
you can pass the constructor a string containing either the name of the
Google service account credentials file or a JSON string encoding those credentials.
Alternatively, you can set the environment variable GOOGLE_APPLICATION_CREDENTIALS
to be either of those strings and call the constructor without that argument.
from parsons import GoogleBigQuery
# May either be the file name or a JSON encoding of the credentials.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google_credentials_file.json'
bigquery = GoogleBigQuery()
# Project in which we're working
project = 'parsons-test'
bigquery = GoogleBigQuery(
app_creds='google_credentials_file.json',
project=project
)
dataset = 'parsons_dataset'
table = 'parsons_table'
# Table name should be project.dataset.table, or dataset.table, if
# working with the default project
table_name = f"`{project}.{dataset}.{table}`"
# Must be pre-existing bucket. Create via GoogleCloudStorage() or
# at https://console.cloud.google.com/storage/create-bucket. May be
# omitted if the name of the bucket is specified in environment
# variable GCS_TEMP_BUCKET.
tmp_gcs_bucket = 'parsons_bucket'
# Create dataset if it doesn't already exist
bigquery.client.create_dataset(dataset=dataset, exists_ok=True)
parsons_table = Table([{'name':'Bob', 'party':'D'},
{'name':'Jane', 'party':'D'},
{'name':'Sue', 'party':'R'},
{'name':'Bill', 'party':'I'}])
# Copy table in to create new BigQuery table
bigquery.copy(
table_obj=parsons_table,
table_name=table_name,
tmp_gcs_bucket=tmp_gcs_bucket
)
# Select from project.dataset.table
bigquery.query(f'select name from {table_name} where party = "D"')
# Query with parameters
bigquery.query(
f"select name from {table_name} where party = %s",
parameters=["D"]
)
# Delete the table when we're done
bigquery.client.delete_table(table=table_name)
API¶
- class parsons.google.google_bigquery.GoogleBigQuery(app_creds: str | dict | Credentials | None = None, project: str | None = None, location: str | None = None, client_options: dict | None = None, tmp_gcs_bucket: str | None = None)[source]¶
Class for querying BigQuery table and returning the data as Parsons tables.
This class requires application credentials in the form of a json. It can be passed in the following ways:
Set an environmental variable named
GOOGLE_APPLICATION_CREDENTIALSwith the local path to the credentials json.Example:
GOOGLE_APPLICATION_CREDENTALS='path/to/creds.json'Pass in the path to the credentials using the
app_credsargument.Pass in a json string using the
app_credsargument.
- Parameters:
app_creds (str | dict | Credentials | None) – str, optional A credentials json string or a path to a json file. Not required if
GOOGLE_APPLICATION_CREDENTIALSenv variable set.project (str | None) – str, optional The project which the client is acting on behalf of. If not passed then will use the default inferred environment.
location (str | None) – str, optional Default geographic location for tables
client_options (dict | None) – dict, optional A dictionary containing any requested client options. Defaults to the required scopes for making API calls against External tables stored in Google Drive. Can be set to None if these permissions are not desired
tmp_gcs_bucket (str | None) – str, optional Name of the GCS bucket that will be used for storing data during bulk transfers. Required if you intend to perform bulk data transfers (eg. the copy_from_gcs method), and env variable
GCS_TEMP_BUCKETis not populated.
- property client¶
Get the Google BigQuery client to use for making queries.
- Returns:
google.cloud.bigquery.client.Client
- connection()[source]¶
Generate a BigQuery connection. The connection is set up as a python “context manager”, so it will be closed automatically when the connection goes out of scope. Note that the BigQuery API uses jobs to run database operations and, as such, simply has a no-op for a “commit” function.
If you would like to manage transactions, please use multi-statement queries as [outlined here](https://cloud.google.com/bigquery/docs/transactions) or utilize the query_with_transaction method on this class.
When using the connection, make sure to put it in a
withblock (necessary for any context manager):with bq.connection() as conn:- Yields:
Google BigQuery
connectionobject
- query(sql: str, parameters: list | dict | None = None, return_values: bool = True, job_config: QueryJobConfig | None = None) Table | None[source]¶
Run a BigQuery query and return the results as a Parsons table.
To include python variables in your query, it is recommended to pass them as parameters, following the BigQuery style where parameters are prefixed with @`s. Using the ``parameters` argument ensures that values are escaped properly, and avoids SQL injection attacks.
Parameter Examples
name = "Beatrice O'Brady" sql = 'SELECT * FROM my_table WHERE name = %s' rs.query(sql, parameters=[name])
name = "Beatrice O'Brady" sql = "SELECT * FROM my_table WHERE name = %(name)s" rs.query(sql, parameters={'name': name})
- Parameters:
sql (str) – str A valid BigTable statement
parameters (list | dict | None) – dict A dictionary of query parameters for BigQuery.
job_config (QueryJobConfig | None) – QueryJobConfig or None An optional QueryJobConfig object for custom behavior. See https://cloud.google.com/python/docs/reference/bigquery/latest#google.cloud.bigquery.job.QueryJobConfig
return_values (bool)
- Returns:
- Table
See Table for output options.
- Return type:
Table | None
- query_with_connection(sql, connection, parameters=None, commit=True, return_values: bool = True, job_config: QueryJobConfig | None = None)[source]¶
Execute a query against the BigQuery database, with an existing connection. Useful for batching queries together. Will return
Noneif the query returns zero rows.- Parameters:
sql – str A valid SQL statement
connection – obj A connection object obtained from
redshift.connection()parameters – list A list of python variables to be converted into SQL values in your query
commit – boolean Must be true. BigQuery
job_config (QueryJobConfig | None) – QueryJobConfig or None An optional QueryJobConfig object for custom behavior. See https://cloud.google.com/python/docs/reference/bigquery/latest#google.cloud.bigquery.job.QueryJobConfig
return_values (bool)
- Returns:
- Table
See Table for output options.
- get_job(job_id: str, **job_kwargs) LoadJob | CopyJob | ExtractJob | QueryJob | UnknownJob[source]¶
Fetch a job
- Parameters:
job_id (str) – str ID of job to fetch
**job_kwargs – kwargs Other arguments to pass to the underlying get_job call on the BigQuery client.
- Return type:
LoadJob | CopyJob | ExtractJob | QueryJob | UnknownJob
- copy_from_gcs(gcs_blob_uri: str, table_name: str, if_exists: Literal['append', 'drop', 'truncate', 'fail'] = 'fail', max_errors: int = 0, data_type: Literal['csv', 'json'] = 'csv', csv_delimiter: str = ',', ignoreheader: int = 1, nullas: str | None = None, allow_quoted_newlines: bool = True, allow_jagged_rows: bool = True, quote: str | None = None, schema: list[dict] | None = None, job_config: LoadJobConfig | None = None, force_unzip_blobs: bool = False, compression_type: Literal['zip', 'gzip'] = 'gzip', new_file_extension: str = 'csv', template_table: str | None = None, max_timeout: int = 21600, source_column_match: str | None = None, **load_kwargs)[source]¶
Copy a csv saved in Google Cloud Storage into Google BigQuery.
- Parameters:
gcs_blob_uri (str) – str The GoogleCloudStorage URI referencing the file to be copied.
table_name (str) – str The table name to load the data into. Will be used to generate load schema if no custom schema or template table are supplied and the if_exists is set to “truncate” or “append”.
if_exists (Literal['append', 'drop', 'truncate', 'fail']) – str If the table already exists, either
fail,append,droportruncatethe table. This maps to write_disposition in the LoadJobConfig class.max_errors (int) – int The maximum number of rows that can error and be skipped before the job fails. This maps to max_bad_records in the LoadJobConfig class.
data_type (Literal['csv', 'json']) – str Denotes whether target file is a JSON or CSV
csv_delimiter (str) – str Character used to separate values in the target file
ignoreheader (int) – int Treats the specified number_rows as a file header and doesn’t load them
nullas (str | None) – str Loads fields that match null_string as NULL, where null_string can be any string
allow_quoted_newlines (bool) – bool If True, detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary
allow_jagged_rows (bool) – bool Allow missing trailing optional columns (CSV only).
quote (str | None) – str The value that is used to quote data sections in a CSV file. BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state.
list BigQuery expects a list of dictionaries in the following format
schema = [ {"name": "column_name", "type": STRING}, {"name": "another_column_name", "type": INT} ]
job_config (LoadJobConfig | None) – object A LoadJobConfig object to provide to the underlying call to load_table_from_uri on the BigQuery client. The function will create its own if not provided. Note if there are any conflicts between the job_config and other parameters, the job_config values are preferred.
force_unzip_blobs (bool) – bool If True, target blobs will be unzipped before being loaded to BigQuery.
compression_type (Literal['zip', 'gzip']) – str Accepts zip or gzip values to differentially unzip a compressed blob in cloud storage.
new_file_extension (str) – str Provides a file extension if a blob is decompressed and rewritten to cloud storage.
template_table (str | None) – str Table name to be used as the load schema. Load operation wil use the same columns and data types as the template table.
max_timeout (int) – int The maximum number of seconds to wait for a request before the job fails.
**load_kwargs – kwargs Other arguments to pass to the underlying load_table_from_uri call on the BigQuery client.
source_column_match (str | None)
- copy_large_compressed_file_from_gcs(gcs_blob_uri: str, table_name: str, if_exists: Literal['append', 'drop', 'truncate', 'fail'] = 'fail', max_errors: int = 0, data_type: Literal['csv', 'json'] = 'csv', csv_delimiter: str = ',', ignoreheader: int = 1, nullas: str | None = None, allow_quoted_newlines: bool = True, allow_jagged_rows: bool = True, quote: str | None = None, schema: list[dict] | None = None, job_config: LoadJobConfig | None = None, compression_type: Literal['zip', 'gzip'] = 'gzip', new_file_extension: str = 'csv', template_table: str | None = None, max_timeout: int = 21600, **load_kwargs)[source]¶
Copy a compressed CSV file that exceeds the maximum size in Google Cloud Storage into Google BigQuery.
- Parameters:
gcs_blob_uri (str) – str The GoogleCloudStorage URI referencing the file to be copied.
table_name (str) – str The table name to load the data into. Will be used to generate load schema if no custom schema or template table are supplied and the if_exists is set to “truncate” or “append”.
if_exists (Literal['append', 'drop', 'truncate', 'fail']) – str If the table already exists, either
fail,append,droportruncatethe table. This maps to write_disposition in the LoadJobConfig class.max_errors (int) – int The maximum number of rows that can error and be skipped before the job fails. This maps to max_bad_records in the LoadJobConfig class.
data_type (Literal['csv', 'json']) – str Denotes whether target file is a JSON or CSV
csv_delimiter (str) – str Character used to separate values in the target file
ignoreheader (int) – int Treats the specified number_rows as a file header and doesn’t load them
nullas (str | None) – str Loads fields that match null_string as NULL, where null_string can be any string
allow_quoted_newlines (bool) – bool If True, detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary
allow_jagged_rows (bool) – bool Allow missing trailing optional columns (CSV only).
quote (str | None) – str The value that is used to quote data sections in a CSV file. BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state.
list BigQuery expects a list of dictionaries in the following format
schema = [ {"name": "column_name", "type": STRING}, {"name": "another_column_name", "type": INT} ]
job_config (LoadJobConfig | None) – object A LoadJobConfig object to provide to the underlying call to load_table_from_uri on the BigQuery client. The function will create its own if not provided. Note if there are any conflicts between the job_config and other parameters, the job_config values are preferred.
compression_type (Literal['zip', 'gzip']) – str Accepts zip or gzip values to differentially unzip a compressed blob in cloud storage.
new_file_extension (str) – str Provides a file extension if a blob is decompressed and rewritten to cloud storage.
template_table (str | None) – str Table name to be used as the load schema. Load operation wil use the same columns and data types as the template table.
max_timeout (int) – int The maximum number of seconds to wait for a request before the job fails.
**load_kwargs – kwargs Other arguments to pass to the underlying load_table_from_uri call on the BigQuery client.
- copy_s3(table_name, bucket, key, if_exists: Literal['append', 'drop', 'truncate', 'fail'] = 'fail', max_errors: int = 0, data_type: Literal['csv', 'json'] = 'csv', csv_delimiter: str = ',', ignoreheader: int = 1, nullas: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, gcs_client: GoogleCloudStorage | None = None, tmp_gcs_bucket: str | None = None, template_table: str | None = None, job_config: LoadJobConfig | None = None, max_timeout: int = 21600, **load_kwargs)[source]¶
Copy a file from s3 to BigQuery.
- Parameters:
table_name – str The table name and schema (
tmc.cool_table) to point the file.bucket – str The s3 bucket where the file or manifest is located.
key – str The key of the file or manifest in the s3 bucket.
if_exists (Literal['append', 'drop', 'truncate', 'fail']) – str If the table already exists, either
fail,append,droportruncatethe table.max_errors (int) – int The maximum number of rows that can error and be skipped before the job fails.
data_type (Literal['csv', 'json']) – str The data type of the file. Only
csvsupported currently.csv_delimiter (str) – str The delimiter of the
csv. Only relevant if data_type iscsv.ignoreheader (int) – int The number of header rows to skip. Ignored if data_type is
json.nullas (str | None) – str Loads fields that match string as NULL
aws_access_key_id (str | None) – An AWS access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
aws_secret_access_key (str | None) – An AWS secret access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
gcs_client (GoogleCloudStorage | None) – object The GoogleCloudStorage Connector to use for loading data into Google Cloud Storage.
tmp_gcs_bucket (str | None) – str The name of the Google Cloud Storage bucket to use to stage the data to load into BigQuery. Required if GCS_TEMP_BUCKET is not specified or set on the class instance.
template_table (str | None) – str Table name to be used as the load schema. Load operation wil use the same columns and data types as the template table.
job_config (LoadJobConfig | None) – object A LoadJobConfig object to provide to the underlying call to load_table_from_uri on the BigQuery client. The function will create its own if not provided. Note if there are any conflicts between the job_config and other parameters, the job_config values are preferred.
max_timeout (int) – int The maximum number of seconds to wait for a request before the job fails.
- Returns:
- Table or
None See Table for output options.
- Table or
- copy_direct(tbl: Table, table_name: str, if_exists: Literal['append', 'drop', 'truncate', 'fail'] = 'fail', max_errors: int = 0, job_config: LoadJobConfig | None = None, template_table: str | None = None, ignoreheader: int = 1, nullas: str | None = None, allow_quoted_newlines: bool = True, allow_jagged_rows: bool = True, quote: str | None = None, schema: list[dict] | None = None, max_timeout: int = 21600, convert_dict_list_columns_to_json: bool = True, **load_kwargs)[source]¶
Copy a Table into Google BigQuery directly. This will work well for smaller data. For larger data, use the
copy()method which stages the upload through CloudStorage.- Parameters:
tbl (Table) – obj The Parsons Table to copy into BigQuery.
table_name (str) – str The table name to load the data into. Will be used to generate load schema if no custom schema or template table are supplied and if_exists is set to “truncate” or “append”.
if_exists (Literal['append', 'drop', 'truncate', 'fail']) – str If the table already exists, either
fail,append,droportruncatethe table.max_errors (int) – int The maximum number of rows that can error and be skipped before the job fails.
job_config (LoadJobConfig | None) – object A LoadJobConfig object to provide to the underlying call to load_table_from_uri on the BigQuery client. The function will create its own if not provided.
template_table (str | None) – str Table name to be used as the load schema. Load operation wil use the same columns and data types as the template table.
max_timeout (int) – int The maximum number of seconds to wait for a request before the job fails.
convert_dict_list_columns_to_json (bool) – bool If set to True, will convert any dict or list columns (which cannot by default be successfully loaded to BigQuery to JSON strings)
**load_kwargs – kwargs Arguments to pass to the underlying load_table_from_uri call on the BigQuery client.
ignoreheader (int)
nullas (str | None)
allow_quoted_newlines (bool)
allow_jagged_rows (bool)
quote (str | None)
- copy(tbl: Table, table_name: str, if_exists: Literal['append', 'drop', 'truncate', 'fail'] = 'fail', max_errors: int = 0, tmp_gcs_bucket: str | None = None, temp_blob_name: str | None = None, gcs_client: GoogleCloudStorage | None = None, job_config: LoadJobConfig | None = None, template_table: str | None = None, ignoreheader: int = 1, nullas: str | None = None, allow_quoted_newlines: bool = True, allow_jagged_rows: bool = True, quote: str | None = None, schema: list[dict] | None = None, max_timeout: int = 21600, convert_dict_list_columns_to_json: bool = True, keep_gcs_file: bool = False, **load_kwargs)[source]¶
Copy a Table into Google BigQuery via Google Cloud Storage.
- Parameters:
tbl (Table) – obj The Parsons Table to copy into BigQuery.
table_name (str) – str The table name to load the data into. Will be used to generate load schema if no custom schema or template table are supplied and if_exists is set to “truncate” or “append”.
if_exists (Literal['append', 'drop', 'truncate', 'fail']) – str If the table already exists, either
fail,append,droportruncatethe table.max_errors (int) – int The maximum number of rows that can error and be skipped before the job fails.
tmp_gcs_bucket (str | None) – str The name of the Google Cloud Storage bucket to use to stage the data to load into BigQuery. Required if GCS_TEMP_BUCKET is not specified or set on the class instance.
gcs_client (GoogleCloudStorage | None) – object The GoogleCloudStorage Connector to use for loading data into Google Cloud Storage.
job_config (LoadJobConfig | None) – object A LoadJobConfig object to provide to the underlying call to load_table_from_uri on the BigQuery client. The function will create its own if not provided.
template_table (str | None) – str Table name to be used as the load schema. Load operation wil use the same columns and data types as the template table.
max_timeout (int) – int The maximum number of seconds to wait for a request before the job fails.
convert_dict_list_columns_to_json (bool) – bool If set to True, will convert any dict or list columns (which cannot by default be successfully loaded to BigQuery to JSON strings)
**load_kwargs – kwargs Arguments to pass to the underlying load_table_from_uri call on the BigQuery client.
temp_blob_name (str | None)
ignoreheader (int)
nullas (str | None)
allow_quoted_newlines (bool)
allow_jagged_rows (bool)
quote (str | None)
keep_gcs_file (bool)
- duplicate_table(source_table, destination_table, if_exists='fail', drop_source_table=False)[source]¶
Create a copy of an existing table (or subset of rows) in a new table.
- Parameters:
source_table – str Name of existing schema and table (e.g.
myschema.oldtable)destination_table – str Name of destination schema and table (e.g.
myschema.newtable)if_exists – str If the table already exists, either
fail,replace, orignorethe operation.drop_source_table – boolean Drop the source table
- upsert(table_obj, target_table, primary_key, distinct_check=True, cleanup_temp_table=True, from_s3=False, **copy_args)[source]¶
Preform an upsert on an existing table. An upsert is a function in which rows in a table are updated and inserted at the same time.
- Parameters:
table_obj – obj A Parsons table object
target_table – str The schema and table name to upsert
primary_key – str or list The primary key column(s) of the target table
distinct_check – boolean Check if the primary key column is distinct. Raise error if not.
cleanup_temp_table – boolean A temp table is dropped by default on cleanup. You can set to False for debugging.
from_s3 – boolean Instead of specifying a table_obj (set the first argument to None), set this to True and include
copy_s3()arguments to upsert a pre-existing s3 file into the target_table**copy_args – kwargs See
copy()for options.
- delete_table(table_name)[source]¶
Delete a BigQuery table.
- Parameters:
table_name – str The name of the table to delete.
- table_exists(table_name: str) bool[source]¶
Check whether or not the Google BigQuery table exists in the specified dataset.
- get_tables(schema, table_name: str | None = None)[source]¶
List the tables in a schema including metadata.
- get_columns(schema: str, table_name: str)[source]¶
Gets the column names (and other column metadata) for a table. If you need just the column names run
get_columns_list(), as it is faster.
- get_row_count(schema: str, table_name: str) int[source]¶
Gets the row count for a BigQuery materialization.
Caution: This method uses SELECT COUNT(*) which can be expensive for large tables, especially those with many columns. This is because BigQuery scans all table data to perform the count, even though only the row count is returned.
- extract(dataset: str, table_name: str, gcs_bucket: str, gcs_blob_name: str, project: str | None = None, gzip: bool = False, location: str = 'US', destination_file_format: str = 'CSV', field_delimiter: str = ',', compression: str | None = None, job_config: ExtractJobConfig = None, wait_for_job_to_complete: bool = True, **export_kwargs) ExtractJob[source]¶
Extracts a BigQuery table to a Google Cloud Storage bucket.
- Parameters:
dataset (str) – str The BigQuery dataset containing the table.
table_name (str) – str The name of the table to extract.
gcs_bucket (str) – str The GCS bucket where the table will be exported.
gcs_blob_name (str) – str The name of the blob in the GCS bucket.
project (str | None) – Optional[str] The Google Cloud project ID. If not provided, the default project of the client is used.
gzip (bool) – bool Not implemented
location (str)
destination_file_format (str)
field_delimiter (str)
compression (str | None)
job_config (ExtractJobConfig)
wait_for_job_to_complete (bool)
- Return type:
- copy_between_projects(source_project, source_dataset, source_table, destination_project, destination_dataset, destination_table, if_dataset_not_exists='fail', if_table_exists='fail')[source]¶
- Copy a table from one project to another. Fails if the source or target project
does not exist.
- If the target dataset does not exist, fhe flag if_dataset_not_exists controls behavior.
It defaults to ‘fail’; set it to ‘create’ if it’s ok to create it.
- If the target table exists, the flag if_table_exists controls behavior.
It defaults to ‘fail’; set it to ‘overwrite’ if it’s ok to overwrite an existing table.
- Parameters:
source_project – str Name of source project
source_dataset – str Name of source dataset
source_table – str Name of source table
destination_project – str Name of destination project
destination_dataset – str Name of destination dataset
destination_table – str Name of destination table
if_dataset_not_exists – str Action if dataset doesn’t exist {‘fail’,’create’}
if_table_exists – str Action if table exists {‘fail’, ‘overwrite’}