Cloud Storage

Overview

Google Cloud Storage is a cloud file storage system. It stores arbitrary files, referred to as blobs, in buckets. You may use this connector to upload Parsons tables as blobs, download blobs to files, and list available blobs.

To use the GoogleCloudStorage class, you will need Google service account credentials. If you are the administrator of your Google Cloud account, you can generate them at Service accounts - IAM & Admin. Once signed in, select your project, then your project’s email, then Keys, then Add key, and finally Create new key.

Quickstart

To instantiate the GoogleCloudStorage class, you can pass the constructor a string containing either the name of your Google service account credentials file or a JSON string encoding those credentials. Alternatively, you can set the environment variable GOOGLE_APPLICATION_CREDENTIALS to either of those strings and call the constructor without that argument.

Set the credentials as an environment variable
import os

from parsons import GoogleCloudStorage

# May be the file name or a JSON encoding of the credentials.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'google_credentials_file.json'

gcs = GoogleCloudStorage()
Pass the credentials in as an argument
credentials_filename = 'google_credentials_file.json'
project = 'parsons-test'    # Project in which we're working
gcs = GoogleCloudStorage(app_creds=credentials_filename, project=project)
Create buckets, upload blobs to them, and list or retrieve the available blobs
gcs.create_bucket('parsons_bucket')
gcs.list_buckets()

# parsons_table is an existing Parsons Table; the keyword is bucket_name, not bucket
gcs.upload_table(table=parsons_table, bucket_name='parsons_bucket', blob_name='parsons_blob')
gcs.get_blob(bucket_name='parsons_bucket', blob_name='parsons_blob')

API

class parsons.google.google_cloud_storage.GoogleCloudStorage(app_creds: str | dict | Credentials | None = None, project=None)[source]

Google Cloud Storage connector utility

This class requires application credentials, either as JSON or as a Google OAuth2 Credentials object. They can be supplied in the following ways:

  • Set an environmental variable named GOOGLE_APPLICATION_CREDENTIALS with the local path to the credentials json.

    Example: GOOGLE_APPLICATION_CREDENTIALS='path/to/creds.json'

  • Pass in the path to the credentials using the app_creds argument.

  • Pass in a json string using the app_creds argument.

  • Generate the Google credentials object directly and pass it in using the app_creds argument.

For example, to pass in credentials from a parent shell that is authenticated with gcloud auth:

from google.auth import default

app_creds, _ = default()
gcs = GoogleCloudStorage(app_creds=app_creds)
Parameters:
  • app_creds (str | dict | Credentials | None) – A credentials JSON string, a path to a JSON credentials file, a dict, or a google.oauth2.credentials.Credentials object. Not required if the GOOGLE_APPLICATION_CREDENTIALS environment variable is set.

  • project – str The project which the client is acting on behalf of. If not passed, the project is inferred from the environment.

Returns:

GoogleCloudStorage Class

client

Access all methods of google.cloud package

list_buckets()[source]

Returns a list of buckets

Returns:

List of buckets

bucket_exists(bucket_name)[source]

Verify that a bucket exists

Parameters:

bucket_name – str The name of the bucket

Returns:

boolean

get_bucket(bucket_name)[source]

Returns a bucket object

Parameters:

bucket_name – str The name of bucket

Returns:

Google Cloud Storage bucket

create_bucket(bucket_name)[source]

Create a bucket.

Parameters:

bucket_name – str A globally unique name for the bucket.
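
Because bucket names are global, a script that runs more than once usually needs to check for the bucket before creating it. A minimal sketch of that pattern, assuming gcs is an already constructed GoogleCloudStorage instance; the helper name ensure_bucket is hypothetical and not part of Parsons:

```python
def ensure_bucket(gcs, bucket_name):
    """Create bucket_name only if it does not already exist.

    gcs is assumed to be a GoogleCloudStorage instance. Only the documented
    bucket_exists() and create_bucket() methods are used.
    Returns True if the bucket was created, False if it was already present.
    """
    if gcs.bucket_exists(bucket_name):
        return False
    gcs.create_bucket(bucket_name)
    return True
```

Calling ensure_bucket(gcs, 'parsons_bucket') twice in a row should create the bucket on the first call and do nothing on the second.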

delete_bucket(bucket_name, delete_blobs=False)[source]

Delete a bucket. Will fail if not empty unless delete_blobs argument is set to True.

Parameters:
  • bucket_name – str The name of the bucket

  • delete_blobs – boolean Delete blobs in the bucket, if it is not empty

list_blobs(bucket_name, max_results=None, prefix=None, match_glob=None, include_file_details=False)[source]

List all of the blobs in a bucket

Parameters:
  • bucket_name – str The name of the bucket

  • max_results – int Maximum number of blobs to return

  • prefix – str A prefix to filter files

  • match_glob – str Filters files based on a glob string. Note that match_glob runs on the full blob URI, so include a preceding wildcard to account for nested files (*/ for one level, **/ for n levels)

  • include_file_details – bool If True, returns a list of Blob objects with accessible metadata. For documentation of attributes associated with Blob objects see https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob

Returns:

A list of blob names (or Blob objects if include_file_details is True)
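
One use of include_file_details is to read metadata without downloading anything. The sketch below totals the size of the blobs under a prefix. It assumes gcs is a GoogleCloudStorage instance and relies on the size attribute of the google.cloud.storage Blob class; the helper name total_blob_bytes is hypothetical:

```python
def total_blob_bytes(gcs, bucket_name, prefix=None):
    """Sum the sizes, in bytes, of the blobs in a bucket.

    gcs is assumed to be a GoogleCloudStorage instance. With
    include_file_details=True, list_blobs() returns Blob objects rather
    than names, so each item exposes a size attribute.
    """
    blobs = gcs.list_blobs(bucket_name, prefix=prefix, include_file_details=True)
    # size can be None when metadata has not been loaded; count that as 0.
    return sum(blob.size or 0 for blob in blobs)
```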

blob_exists(bucket_name, blob_name)[source]

Verify that a blob exists in the specified bucket

Parameters:
  • bucket_name – str The bucket name

  • blob_name – str The name of the blob

Returns:

boolean

get_blob(bucket_name, blob_name)[source]

Get a blob object

Parameters:
  • bucket_name – str A bucket name

  • blob_name – str A blob name

Returns:

A Google Storage blob object

put_blob(bucket_name, blob_name, local_path, **kwargs)[source]

Puts a blob (aka file) in a bucket

Parameters:
  • bucket_name – The name of the bucket to store the blob

  • blob_name – The name of blob to be stored in the bucket

  • local_path – str The local path of the file to upload

download_blob(bucket_name, blob_name, local_path=None)[source]

Gets a blob from a bucket

Parameters:
  • bucket_name – str The name of the bucket

  • blob_name – str The name of the blob

  • local_path – str The local path where the file will be downloaded. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

Returns:

str

The path of the downloaded file
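
When a blob may or may not be present, blob_exists can guard the download. A minimal sketch, assuming gcs is a GoogleCloudStorage instance; the helper name download_if_present is hypothetical:

```python
def download_if_present(gcs, bucket_name, blob_name, local_path=None):
    """Download a blob and return its local path, or None if it is missing.

    gcs is assumed to be a GoogleCloudStorage instance. When local_path is
    None, download_blob() writes to a temporary file, as documented above.
    """
    if not gcs.blob_exists(bucket_name, blob_name):
        return None
    return gcs.download_blob(bucket_name, blob_name, local_path=local_path)
```

The check and the download are two separate requests, so a blob deleted in between would still cause download_blob to fail.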

delete_blob(bucket_name, blob_name)[source]

Delete a blob

Parameters:
  • bucket_name – str The bucket name

  • blob_name – str The blob name

upload_table(table, bucket_name, blob_name, data_type: Literal['csv', 'json'] = 'csv', default_acl=None, timeout: int = 60)[source]

Load the data from a Parsons table into a blob.

Parameters:
  • table – obj A Table

  • bucket_name – str The name of the bucket to upload the data into.

  • blob_name – str The name of the blob to upload the data into.

  • data_type (Literal['csv', 'json']) – str The file format to use when writing the data. One of: csv or json

  • default_acl – ACL desired for newly uploaded table

  • timeout (int)

Returns:

String representation of file URI in GCS
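
Since data_type accepts only csv or json, it can be convenient to derive it from the blob name. The sketch below does that and returns the URI reported by upload_table. It assumes gcs is a GoogleCloudStorage instance and table is a Parsons Table; the helper name and the extension rule are illustrative choices, not Parsons behavior:

```python
def upload_with_inferred_type(gcs, table, bucket_name, blob_name):
    """Upload a table as JSON if blob_name ends in .json, otherwise as CSV.

    gcs is assumed to be a GoogleCloudStorage instance and table a Parsons
    Table. Returns whatever upload_table() returns (the GCS URI string).
    """
    data_type = "json" if blob_name.lower().endswith(".json") else "csv"
    return gcs.upload_table(table, bucket_name, blob_name, data_type=data_type)
```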

get_url(bucket_name, blob_name, expires_in=60)[source]

Generates a presigned url for a blob.

Parameters:
  • bucket_name – str The name of the bucket

  • blob_name – str The name of the blob

  • expires_in – int Minutes until the url expires

Returns:

str

A link to download the object
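
Because expires_in is expressed in minutes, longer lifetimes need converting. A small sketch, assuming gcs is a GoogleCloudStorage instance; the helper name url_valid_for_hours is hypothetical:

```python
def url_valid_for_hours(gcs, bucket_name, blob_name, hours):
    """Return a presigned URL that expires after the given number of hours.

    gcs is assumed to be a GoogleCloudStorage instance. get_url() takes its
    expiry in minutes, so the value is converted before the call.
    """
    minutes = int(hours * 60)
    if minutes <= 0:
        raise ValueError("hours must convert to at least one whole minute")
    return gcs.get_url(bucket_name, blob_name, expires_in=minutes)
```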

copy_bucket_to_gcs(gcs_sink_bucket: str, source: str, source_bucket: str, destination_path: str = '', source_path: str = '', aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, delete_objects_unique_in_sink: bool = False)[source]

Creates a one-time transfer job into Google Cloud Storage from either Amazon S3 or another Google Cloud Storage bucket, depending on source. Copies all blobs within the bucket unless a key or prefix is passed.

Parameters:
  • gcs_sink_bucket (str) – Destination for the data transfer (located in GCS)

  • source (str) – File storage vendor [gcs or s3]

  • source_bucket (str) – Source bucket name

  • source_path (str) – Path in the source system pointing to the relevant keys / files to sync. Must end in a ‘/’

  • aws_access_key_id (str) – Access key to authenticate storage transfer

  • aws_secret_access_key (str) – Secret key to authenticate storage transfer

  • delete_objects_unique_in_sink (bool) – Whether objects that exist only in the sink (and not in the source) should be deleted. Default is False.

  • destination_path (str)
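
The requirement that source_path end in a ‘/’ is easy to miss. The sketch below normalizes a prefix before starting an S3 transfer. It assumes gcs is a GoogleCloudStorage instance; both helper names are hypothetical, and stripping the leading slash is an illustrative choice rather than something the connector requires:

```python
def as_source_path(path):
    """Return path without leading slashes and with one trailing '/'.

    An empty path (or one made only of slashes) becomes '', which matches
    the default source_path and means "the whole bucket".
    """
    stripped = path.strip("/")
    return f"{stripped}/" if stripped else ""


def copy_s3_prefix(gcs, sink_bucket, s3_bucket, prefix, access_key_id, secret_access_key):
    """Start a transfer of one S3 prefix into a GCS bucket.

    gcs is assumed to be a GoogleCloudStorage instance; only documented
    copy_bucket_to_gcs() parameters are passed.
    """
    return gcs.copy_bucket_to_gcs(
        gcs_sink_bucket=sink_bucket,
        source="s3",
        source_bucket=s3_bucket,
        source_path=as_source_path(prefix),
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
    )
```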

format_uri(bucket: str, name: str)[source]

Represent a GCS URI as a string

Parameters:
  • bucket (str) – str GCS bucket name

  • name (str) – str Filename in bucket

Returns:

String representation of URI

split_uri(gcs_uri: str)[source]

Split a GCS URI into a bucket and blob name

Parameters:

gcs_uri (str) – str GCS URI

Returns:

Tuple of strings with bucket_name and blob_name
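
To make the relationship between format_uri and split_uri concrete, here is a standalone sketch of the gs://bucket/name convention. These two functions are illustrative stand-ins written for this example, not the Parsons implementation, and their names are hypothetical:

```python
def format_gcs_uri(bucket, name):
    """Build a gs:// URI from a bucket name and a blob name."""
    return f"gs://{bucket}/{name}"


def split_gcs_uri(gcs_uri):
    """Split a gs:// URI into (bucket_name, blob_name).

    Everything after the first '/' following the bucket is the blob name,
    so nested "folders" stay part of blob_name.
    """
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"not a GCS URI: {gcs_uri!r}")
    bucket, _, blob_name = gcs_uri[len(scheme):].partition("/")
    return bucket, blob_name
```

Splitting a URI produced by the formatter gives back the original bucket and blob name, which is the property the two connector methods are meant to share.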

unzip_blob(bucket_name: str, blob_name: str, compression_type: Literal['zip', 'gzip'] = 'gzip', new_filename: str | None = None, new_file_extension: str | None = None) → str[source]

Downloads and decompresses a blob. The decompressed blob is re-uploaded with the same filename if no new_filename parameter is provided.

Parameters:
  • bucket_name (str) – str GCS bucket name

  • blob_name (str) – str Blob name in GCS bucket

  • compression_type (Literal['zip', 'gzip']) – str Either zip or gzip

  • new_filename (str | None) – str If provided, replaces the existing blob name when the decompressed file is uploaded

  • new_file_extension (str | None) – str If provided, replaces the file extension when the decompressed file is uploaded

Returns:

String representation of decompressed GCS URI

Return type:

str
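
The default compression_type is gzip, so zip archives need the argument set explicitly. A minimal sketch that picks the value from the blob name, assuming gcs is a GoogleCloudStorage instance; the helper name and the extension mapping are hypothetical:

```python
def unzip_by_extension(gcs, bucket_name, blob_name):
    """Decompress a blob, choosing compression_type from its extension.

    gcs is assumed to be a GoogleCloudStorage instance. Returns whatever
    unzip_blob() returns (the URI of the decompressed blob).
    """
    lowered = blob_name.lower()
    if lowered.endswith(".zip"):
        compression_type = "zip"
    elif lowered.endswith(".gz"):
        compression_type = "gzip"
    else:
        raise ValueError(f"cannot infer compression type from {blob_name!r}")
    return gcs.unzip_blob(bucket_name, blob_name, compression_type=compression_type)
```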