Redshift¶
Overview¶
The Redshift
class allows you to interact with an
Amazon Redshift relational database.
The connector utilizes the psycopg2
Python package under the hood. The core methods focus on input,
output and querying of the database.
In addition to the core API integration provided by the
Redshift class,
Parsons also includes utility functions for managing schemas and tables.
See Table and View API and
Schema API for more information.
S3 Credentials
Redshift only allows data to be copied to the database via S3.
As such, the the copy()
and copy_s3()
methods require S3 credentials and write access on an S3 Bucket,
which will be used for storing data en route to Redshift.
For more information about AWS Redshift authorization, see the
API documentation.
Whitelisting
Remember to ensure that the IP address from which you are connecting has been whitelisted.
Quickstart¶
Redshift API credentials can either be passed as environmental variables
(REDSHIFT_USERNAME, REDSHIFT_PASSWORD, REDSHIFT_HOST, REDSHIFT_DB, and REDSHIFT_PORT)
or as keyword arguments. Methods that use COPY require an
access key ID and a secret access key,
which can also be passed as environmental variables (aws_access_key_id and aws_secret_access_key)
or keyword arguments.
from parsons import Redshift
rs = Redshift()
from parsons import Redshift
rs = Redshift(
username='my_username',
password='my_password',
host='my_host',
db='my_db',
port='5439',
)
table = rs.query('select * from tmc_scratch.test_data')
table = rs.copy(tbl, 'tmc_scratch.test_table', if_exists='drop')
All of the standard COPY options can be passed as kwargs.
See the copy()
method for all options.
Core API¶
- class parsons.databases.redshift.redshift.Redshift(username=None, password=None, host=None, db=None, port=None, timeout=10, s3_temp_bucket=None, aws_access_key_id=None, aws_secret_access_key=None, iam_role=None, use_env_token=True)[source]¶
A Redshift class to connect to database.
- Parameters:
username – str Required if env variable
REDSHIFT_USERNAMEnot populatedpassword – str Required if env variable
REDSHIFT_PASSWORDnot populatedhost – str Required if env variable
REDSHIFT_HOSTnot populateddb – str Required if env variable
REDSHIFT_DBnot populatedport – int Required if env variable
REDSHIFT_PORTnot populated. Port 5439 is typical.timeout – int Seconds to timeout if connection not established
s3_temp_bucket – str Name of the S3 bucket that will be used for storing data during bulk transfers. Required if you intend to perform bulk data transfers (eg. the copy_s3 method), and env variable
S3_TEMP_BUCKETis not populated.aws_access_key_id – str The default AWS access key id for copying data from S3 into Redshift when running copy/upsert/etc methods. This will default to environment variable AWS_ACCESS_KEY_ID.
aws_secret_access_key – str The default AWS secret access key for copying data from S3 into Redshift when running copy/upsert/etc methods. This will default to environment variable AWS_SECRET_ACCESS_KEY.
iam_role – str AWS IAM Role ARN string – an optional, different way for credentials to be provided in the Redshift copy command that does not require an access key.
use_env_token – bool Controls use of the
AWS_SESSION_TOKENenvironment variable for S3. Defaults toTrue. Set toFalsein order to ignore theAWS_SESSION_TOKENenvironment variable even if theaws_session_tokenargument was not passed in.
- connection()[source]¶
Generate a Redshift connection. The connection is set up as a python “context manager”, so it will be closed automatically (and all queries committed) when the connection goes out of scope.
When using the connection, make sure to put it in a
withblock (necessary for any context manager):with rs.connection() as conn:- Yields:
Psycopg2
connectionobject
- query(sql: str, parameters: list | None = None) Table | None[source]¶
Execute a query against the Redshift database. Will return
Noneif the query returns zero rows.To include python variables in your query, it is recommended to pass them as parameters, following the psycopg style. Using the
parametersargument ensures that values are escaped properly, and avoids SQL injection attacks.Parameter Examples
# Note that the name contains a quote, which could break your query if not escaped # properly. name = "Beatrice O'Brady" sql = "SELECT * FROM my_table WHERE name = %s" rs.query(sql, parameters=[name])
names = ["Allen Smith", "Beatrice O'Brady", "Cathy Thompson"] placeholders = ', '.join('%s' for item in names) sql = f"SELECT * FROM my_table WHERE name IN ({placeholders})" rs.query(sql, parameters=names)
- query_with_connection(sql, connection, parameters=None, commit=True)[source]¶
Execute a query against the Redshift database, with an existing connection. Useful for batching queries together. Will return
Noneif the query returns zero rows.- Parameters:
sql – str A valid SQL statement
connection – obj A connection object obtained from
redshift.connection()parameters – list A list of python variables to be converted into SQL values in your query
commit – boolean Whether to commit the transaction immediately. If
Falsethe transaction will be committed when the connection goes out of scope and is closed (or you can commit manually withconnection.commit()).
- Returns:
- Table
See Table for output options.
- copy_s3(table_name, bucket, key, manifest=False, data_type='csv', csv_delimiter=',', compression=None, if_exists: Literal['fail', 'append', 'drop', 'truncate'] = 'fail', max_errors=0, distkey=None, sortkey=None, padding=None, varchar_max=None, statupdate=True, compupdate=True, ignoreheader=1, acceptanydate=True, dateformat='auto', timeformat='auto', emptyasnull=True, blanksasnull=True, nullas=None, acceptinvchars=True, truncatecolumns=False, columntypes=None, specifycols=None, aws_access_key_id=None, aws_secret_access_key=None, bucket_region=None, strict_length=True, template_table=None, encoding='utf-8', line_delimited=False)[source]¶
Copy a file from s3 to Redshift.
- Parameters:
table_name – str The table name and schema (
tmc.cool_table) to point the file.bucket – str The s3 bucket where the file or manifest is located.
key – str The key of the file or manifest in the s3 bucket.
manifest – str If using a manifest
data_type – str The data type of the file. Only
csvsupported currently.csv_delimiter – str The delimiter of the
csv. Only relevant if data_type iscsv.compression – str If specified (
gzip), will attempt to decompress the file.if_exists (Literal['fail', 'append', 'drop', 'truncate']) – str If the table already exists, either
fail,append,droportruncatethe table.max_errors – int The maximum number of rows that can error and be skipped before the job fails.
distkey – str The column name of the distkey
sortkey – str The column name of the sortkey
padding – float A percentage padding to add to varchar columns if creating a new table. This is helpful to add a buffer for future copies in which the data might be wider.
varchar_max – list A list of columns in which to set the width of the varchar column to 65,535 characters.
statupate – boolean Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command.
compupdate – boolean Controls whether compression encodings are automatically applied during a COPY.
ignore_header – int The number of header rows to skip. Ignored if data_type is
json.acceptanydate – boolean Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error.
emptyasnull – boolean Indicates that Amazon Redshift should load empty char and varchar fields as
NULL.blanksasnull – boolean Loads blank varchar fields, which consist of only white space characters, as
NULL.nullas – str Loads fields that match string as NULL
acceptinvchars – boolean Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters.
dateformat – str Set the date format. Defaults to
auto.timeformat – str Set the time format. Defaults to
auto.truncatecolumns – boolean If the table already exists, truncates data in columns to the appropriate number of characters so that it fits the column specification. Applies only to columns with a VARCHAR or CHAR data type, and rows 4 MB or less in size.
columntypes – dict Optional map of column name to redshift column type, overriding the usual type inference. You only specify the columns you want to override, eg.
columntypes={'phone': 'varchar(12)', 'age': 'int'}).specifycols –
boolean Adds a column list to the Redshift COPY command, allowing for the source table in an append to have the columnns out of order, and to have fewer columns with any leftover target table columns filled in with the DEFAULT value.
This will fail if all of the source table’s columns do not match a column in the target table. This will also fail if the target table has an IDENTITY column and that column name is among the source table’s columns.
aws_access_key_id – An AWS access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
aws_secret_access_key – An AWS secret access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
bucket_region – str The AWS region that the bucket is located in. This should be provided if the Redshift cluster is located in a different region from the temp bucket.
strict_length – bool If the database table needs to be created, strict_length determines whether the created table’s column sizes will be sized to exactly fit the current data, or if their size will be rounded up to account for future values being larger then the current dataset. defaults to
True; this argument is ignored ifpaddingis specifiedtemplate_table – str Instead of specifying columns, columntypes, and/or inference, if there is a pre-existing table that has the same columns/types, then use the template_table table name as the schema for the new table.
- Returns:
- Table or
None See Table for output options.
- Table or
- copy(tbl: Table, table_name: str, if_exists: Literal['fail', 'append', 'drop', 'truncate'] = 'fail', max_errors: int = 0, distkey: str | None = None, sortkey: str | None = None, padding: float | None = None, statupdate: bool | None = None, compupdate: bool | None = None, acceptanydate: bool = True, emptyasnull: bool = True, blanksasnull: bool = True, nullas: str | None = None, acceptinvchars: bool = True, dateformat: str = 'auto', timeformat: str = 'auto', varchar_max: list[str] | None = None, truncatecolumns: bool = False, columntypes: dict | None = None, specifycols: bool | None = None, alter_table: bool = False, alter_table_cascade: bool = False, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, iam_role: str | None = None, cleanup_s3_file: bool = True, template_table: str | None = None, temp_bucket_region: str | None = None, strict_length: bool = True, csv_encoding: str = 'utf-8')[source]¶
Copy a Table to Redshift.
- Parameters:
tbl (Table) – obj A Parsons Table.
table_name (str) – str The destination table name (ex.
my_schema.my_table).if_exists (Literal['fail', 'append', 'drop', 'truncate']) – str If the table already exists, either
fail,append,droportruncatethe table.max_errors (int) – int The maximum number of rows that can error and be skipped before the job fails.
distkey (str | None) – str The column name of the distkey
sortkey (str | None) – str The column name of the sortkey
padding (float | None) – float A percentage padding to add to varchar columns if creating a new table. This is helpful to add a buffer for future copies in which the data might be wider.
statupate –
boolean Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command. If
Trueexplicitly setsstatupateto on, ifFalseexplicitly setsstatupateto off. IfNonestats update only if the table is initially empty. Defaults toNone. See Redshift docs for more details.Note
If STATUPDATE is used, the current user must be either the table owner or a superuser.
compupdate (bool | None) – boolean Controls whether compression encodings are automatically applied during a COPY. If
Trueexplicitly setscompupdateto on, ifFalseexplicitly setscompupdateto off. IfNonethe COPY command only chooses compression if the table is initially empty. Defaults toNone. See Redshift docs for more details.acceptanydate (bool) – boolean Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error.
emptyasnull (bool) – boolean Indicates that Amazon Redshift should load empty char and varchar fields as
NULL.blanksasnull (bool) – boolean Loads blank varchar fields, which consist of only white space characters, as
NULL.nullas (str | None) – str Loads fields that match string as NULL
acceptinvchars (bool) – boolean Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters.
dateformat (str) – str Set the date format. Defaults to
auto.timeformat (str) – str Set the time format. Defaults to
auto.varchar_max (list[str] | None) – list A list of columns in which to set the width of the varchar column to 65,535 characters.
truncatecolumns (bool) – boolean If the table already exists, truncates data in columns to the appropriate number of characters so that it fits the column specification. Applies only to columns with a VARCHAR or CHAR data type, and rows 4 MB or less in size.
columntypes (dict | None) – dict Optional map of column name to redshift column type, overriding the usual type inference. You only specify the columns you want to override, eg.
columntypes={'phone': 'varchar(12)', 'age': 'int'}).specifycols (bool | None) –
boolean Adds a column list to the Redshift COPY command, allowing for the source table in an append to have the columnns out of order, and to have fewer columns with any leftover target table columns filled in with the DEFAULT value.
This will fail if all of the source table’s columns do not match a column in the target table. This will also fail if the target table has an IDENTITY column and that column name is among the source table’s columns.
alter_table (bool) – boolean Will check if the target table varchar widths are wide enough to copy in the table data. If not, will attempt to alter the table to make it wide enough. This will not work with tables that have dependent views. To drop them, set
alter_table_cascadeto True.alter_table_cascade (bool) – boolean Will drop dependent objects when attempting to alter the table. If
alter_tableisFalse, this will be ignored.aws_access_key_id (str | None) – An AWS access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
aws_secret_access_key (str | None) – An AWS secret access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
iam_role (str | None) – str An AWS IAM Role ARN string; an alternative credential for the COPY command from Redshift to S3. The IAM role must have been assigned to the Redshift instance and have access to the S3 bucket.
cleanup_s3_file (bool) – boolean The s3 upload is removed by default on cleanup. You can set to False for debugging.
template_table (str | None) – str Instead of specifying columns, columntypes, and/or inference, if there is a pre-existing table that has the same columns/types, then use the template_table table name as the schema for the new table. Unless you set specifycols=False explicitly, a template_table will set it to True
temp_bucket_region (str | None) – str The AWS region that the temp bucket (specified by the TEMP_S3_BUCKET environment variable) is located in. This should be provided if the Redshift cluster is located in a different region from the temp bucket.
strict_length (bool) – bool Whether or not to tightly fit the length of the table columns to the length of the data in
tbl; ifpaddingis specified, this argument is ignored.csv_ecoding – str String encoding to use when writing the temporary CSV file that is uploaded to S3. Defaults to ‘utf-8’.
statupdate (bool | None)
csv_encoding (str)
- Returns:
- Table or
None See Table for output options.
- Table or
- unload(sql, bucket, key_prefix, manifest=True, header=True, delimiter='|', compression: Literal['gzip', 'bzip2', 'None'] = 'gzip', add_quotes=True, null_as=None, escape=True, allow_overwrite=True, parallel=True, max_file_size='6.2 GB', extension=None, aws_region=None, format=None, aws_access_key_id=None, aws_secret_access_key=None)[source]¶
Unload Redshift data to S3 Bucket. This is a more efficient method than running a query to export data as it can export in parallel and directly into an S3 bucket. Consider using this for exports of 10MM or more rows.
- sql: str
The SQL string to execute to generate the data to unload.
- bucket: str
The destination S3 bucket
- key_prefix: str
The prefix of the key names that will be written
- manifest: boolean
Creates a manifest file that explicitly lists details for the data files that are created by the UNLOAD process.
- header: boolean
Adds a header line containing column names at the top of each output file.
- delimiter: str
Specificies the character used to separate fields. Defaults to ‘|’.
- compression: str
One of
gzip,bzip2orNone. Unloads data to one or more compressed files per slice. Each resulting file is appended with a.gzor.bz2extension.- add_quotes: boolean
Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself.
- null_as: str
Specifies a string that represents a null value in unload files. If this option is not specified, null values are unloaded as zero-length strings for delimited output.
- escape: boolean
For CHAR and VARCHAR columns in delimited unload files, an escape character () is placed before every linefeed, carriage return, escape characters and delimiters.
- allow_overwrite: boolean
If
True, will overwrite existing files, including the manifest file. IfFalsewill fail.- parallel: boolean
By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used.
- max_file_size: str
The maximum size of files UNLOAD creates in Amazon S3. Specify a decimal value between 5 MB and 6.2 GB.
- extension: str
This extension will be added to the end of file names loaded to S3
- region: str
The AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that is not in the same AWS Region as the Amazon Redshift cluster.
- format: str
The format of the unload file (CSV, PARQUET, JSON) - Optional.
- aws_access_key_id:
An AWS access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
- aws_secret_access_key:
An AWS secret access key granted to the bucket where the file is located. Not required if keys are stored as environmental variables.
- Parameters:
compression (Literal['gzip', 'bzip2', 'None'])
- drop_and_unload(rs_table, bucket, key, cascade=True, manifest=True, header=True, delimiter='|', compression: Literal['gzip', 'bzip2', 'None'] = 'gzip', add_quotes=True, escape=True, allow_overwrite=True, parallel=True, max_file_size='6.2 GB', aws_region=None)[source]¶
Unload data to s3, and then drop Redshift table
- Parameters:
rs_table – str Redshift table.
bucket – str S3 bucket
key – str S3 key prefix ahead of table name
cascade – bool whether to drop cascade
manifest – bool
header – bool
delimiter – str
compression (Literal['gzip', 'bzip2', 'None']) – str
add_quotes – bool
escape – bool
allow_overwrite – bool
parallel – bool
max_file_size – str
aws_region – str
- generate_manifest(buckets, aws_access_key_id=None, aws_secret_access_key=None, mandatory=True, prefix=None, manifest_bucket=None, manifest_key=None, path=None)[source]¶
Given a list of S3 buckets, generate a manifest file (JSON format). A manifest file allows you to copy multiple files into a single table at once. Once the manifest is generated, you can pass it with the
copy_s3()method.AWS keys are not required if
AWS_ACCESS_KEY_IDandAWS_SECRET_ACCESS_KEYenvironmental variables set.- Parameters:
buckets – list or str A list of buckets or single bucket from which to generate manifest
aws_access_key_id – str AWS access key id to access S3 bucket
aws_secret_access_key – str AWS secret access key to access S3 bucket
mandatory – bool The mandatory flag indicates whether the Redshift COPY should terminate if the file does not exist.
prefix – str Optional filter for key prefixes
manifest_bucket – str Optional bucket to write manifest file.
manifest_key – str Optional key name for S3 bucket to write file
- Returns:
dictof manifest
- upsert(table_obj, target_table, primary_key, vacuum=True, distinct_check=True, cleanup_temp_table=True, alter_table=True, alter_table_cascade=False, from_s3=False, distkey=None, sortkey=None, **copy_args)[source]¶
Preform an upsert on an existing table.
An upsert is a function in which rows in a table are updated and inserted at the same time.
- Parameters:
table_obj – obj A Parsons table object
target_table – str The schema and table name to upsert
primary_key – str or list The primary key column(s) of the target table
vacuum – bool Re-sorts rows and reclaims space in the specified table. You must be a table owner or super user to effectively vacuum a table, however the method will not fail if you lack these priviledges.
distinct_check – bool Check if the primary key column is distinct. Raise error if not.
cleanup_temp_table – bool A temp table is dropped by default on cleanup. You can set to False for debugging.
alter_table – bool Set to False to avoid automatic varchar column resizing to accomodate new data
alter_table_cascade – bool Will drop dependent objects when attempting to alter the table. If
alter_tableisFalse, this will be ignored.from_s3 – bool Instead of specifying a table_obj (set the first argument to None), set this to True and include
copy_s3()arguments to upsert a pre-existing s3 file into the target_tabledistkey – str The column name of the distkey. If not provided, will default to
primary_key.sortkey – str or list The column name(s) of the sortkey. If not provided, will default to
primary_key.**copy_args – kwargs See
copy()for options.
- alter_varchar_column_widths(tbl, table_name, drop_dependencies=False)[source]¶
Alter the width of a varchar columns in a Redshift table to match the widths of a Parsons table. The columns are matched by column name and not their index.
- Parameters:
tbl – obj A Parsons table
table_name – The target table name (e.g.
my_schema.my_table)
- alter_table_column_type(table_name, column_name, data_type, varchar_width=None)[source]¶
Alter a column type of an existing table.
- table_name: str
The table name (ex.
my_schema.my_table).- column_name: str
The target column name
- data_type: str
A valid Redshift data type to alter the table to.
- varchar_width:
The new width of the column if of type varchar.
- static combine_schema_and_table_name(schema, table)¶
Creates a full table name by combining a schema and table.
- Parameters:
schema – str The schema name
table – str The table name
- Returns:
- str
The combined full table name
- create_schema_with_permissions(schema, group=None)¶
Creates a Redshift schema (if it doesn’t already exist), and grants usage permissions to a Redshift group (if specified).
- Parameters:
schema – str The schema name
group – str The Redshift group name
type – str The type of permissions to grant. Supports select, all, etc. (For full list, see the Redshift GRANT docs)
- create_table(table_object, table_name)¶
Create a table based on table object data.
- detect_data_type(value, cmp_type=None)¶
Detect the higher of value’s type cmp_type.
check if it’s a string
check if it’s a number
check if it’s a float
check if it’s an int
- Parameters:
value – str The value to inspect.
cmp_type – str The string representation of a type to compare with the type of
value.
- Returns:
- str
String representation of the higher of the two types.
- duplicate_table(source_table, destination_table, where_clause='', if_exists: Literal['fail', 'append', 'drop', 'truncate'] = 'fail', drop_source_table=False)¶
Create a copy of an existing table (or subset of rows) in a new table. It will inherit encoding, sortkey and distkey.
- Parameters:
source_table – str Name of existing schema and table (e.g.
myschema.oldtable)destination_table – str Name of destination schema and table (e.g.
myschema.newtable)where_clause – str An optional where clause (e.g.
where org = 1).if_exists (Literal['fail', 'append', 'drop', 'truncate']) – str If the table already exists, either
fail,append,drop, ortruncatethe table.drop_source_table – boolean Drop the source table
- format_column(col: str, index: int | str = '', replace_chars: dict | None = None, col_prefix: str = '_') str¶
Format the column to meet database contraints.
Formats the columns as follows: 1. Coverts to lowercase (if case insensitive) 2. Strips leading and trailing whitespace 3. Replaces invalid characters 4. Renames if in reserved words
- Parameters:
col (str) – The column to format.
index (int | str) – The index of the column. Used if the column is empty.
replace_chars (dict | None) – A dictionary of invalid characters and their replacements. If
Noneuses {” “: “_”}col_prefix (str) – The prefix to use when the column is empty or starts with an invalid character.
- Returns:
The formatted column.
- Return type:
- format_columns(cols, **kwargs)¶
Format the columns to meet database contraints.
This method relies on
format_columnto handle most changes. It only handles duplicated columns. Options toformat_columncan be passed through kwargs.- Parameters:
cols – list The columns to format.
**kwargs – dicts Keyword arguments to pass to
format_column.
- Returns:
- list
The formatted columns.
- generate_alchemy_url()¶
Generate a SQL Alchemy engine https://docs.sqlalchemy.org/en/14/core/engines.html#
- generate_engine()¶
Generate a SQL Alchemy engine.
- get_bigger_int(int1, int2)¶
Return the bigger of the two ints.
- Parameters:
int1 – str The string representation if an int type.
int2 – str The string representation if an int type.
- Returns:
- str
A string representation of the higher of the two int types.
- get_columns(schema, table_name)¶
Gets the column names (and some other column info) for a table.
If you just need the column names, run
get_columns_list()as it is faster.for col in rs.get_columns('some_schema', 'some_table'): print(col)
- Parameters:
schema – str The schema name
table_name – str The table name
- Returns:
A dict mapping column name to a dict with extra info. The keys of the dict are ordered just like the columns in the table. The extra info is a dict with format
{ 'data_type': str, 'max_length': int or None, 'max_precision': int or None, 'max_scale': int or None, 'is_nullable': bool }
- get_columns_list(schema, table_name)¶
Gets the just the column names for a table.
- Parameters:
schema – str The schema name
table_name – str The table name
- Returns:
A list of column names.
- get_max_value(table_name, value_column)¶
Return the max value from a table.
- Parameters:
table_name – str Schema and table name
value_column – str The column containing the values
- get_object_type(object_name)¶
Get object type.
One of view, table, index, sequence, or TOAST table.
- Parameters:
object_name – str The schema.obj for which to get the object type.
- Returns:
str of the object type.
- get_queries()¶
Return the Current queries running and queueing, along with resource consumption.
Warning
Must be a Redshift superuser to run this method.
- Returns:
- Table
See Table for output options.
- get_row_count(table_name)¶
Return the row count of a table.
SQL Code
SELECT COUNT(*) FROM myschema.mytable
- Parameters:
table_name – str The schema and name (e.g.
myschema.mytable) of the table.- Returns:
int
- get_table_definition(table)¶
Get the table definition (i.e. the create statement).
- Parameters:
table – str The schema.table for which to get the table definition.
- Returns:
str
- get_table_definitions(schema=None, table=None)¶
Get the table definition (i.e. the create statement) for multiple tables.
This works similar to get_table_def except it runs a single query to get the ddl for multiple tables. It supports SQL wildcards for schema and table. Only returns the ddl for _tables_ that match schema and table if they exist.
- Parameters:
schema – str The schema to filter by.
table – str The table to filter by.
- Returns:
list of dicts with matching tables.
- get_table_object(table_name)¶
Get a SQL Alchemy table object.
- get_table_stats(schema=None, table_name=None)¶
List the tables statistics includes row count and size.
Warning
This method is only accessible by Redshift superusers.
- Parameters:
schema – str Filter by a schema
table_name – str Filter by a table name
- Returns:
- Table
See Table for output options.
- get_tables(schema=None, table_name=None)¶
List the tables in a schema including metadata.
- Parameters:
schema – str Filter by a schema
table_name – str Filter by a table name
- Returns:
- Table
See Table for output options.
- get_view_definition(view)¶
Get the view definition (i.e. the create statement).
- Parameters:
view – str The schema.view for which to get the view definition.
- Returns:
str
- get_view_definitions(schema=None, view=None)¶
Get the view definition (i.e. the create statement) for multiple views.
This works similar to get_view_def except it runs a single query to get the ddl for multiple views. It supports SQL wildcards for schema and view. Only returns the ddl for _views_ that match schema and view if they exist.
- Parameters:
schema – str The schema to filter by.
view – str The view to filter by.
- Returns:
list of dicts with matching views.
- get_views(schema=None, view=None)¶
List views.
- Parameters:
schema – str Filter by a schema
view – str Filter by a table name
- Returns:
- Table
See Table for output options.
- grant_schema_permissions(schema, group, permissions_type='select')¶
Grants a Redshift group permissions to all tables within an existing schema.
- Parameters:
schema – str The schema name
group – str The Redshift group name
type –
str The type of permissions to grant. Supports select, all, etc. (For full list, see the Redshift GRANT docs)
- is_table(object_name)¶
Return true if the object is a table.
- Parameters:
object_name – str The schema.obj to test if it’s a table.
- Returns:
bool
- is_valid_sql_num(val)¶
Check whether val is a valid sql number.
- Parameters:
val – any The values to check.
- Returns:
- bool
Whether or not the value is a valid sql number.
- is_view(object_name)¶
Return true if the object is a view.
- Parameters:
object_name – str The schema.obj to test if it’s a view.
- Returns:
bool
- move_table(source_table, new_table, drop_source_table=False)¶
Move an existing table in the database.It will inherit encoding, sortkey and distkey. Once run, the source table rows will be empty. This is more efficiant than running
"create newtable as select * from oldtable".For more information see: ALTER TABLE APPEND
- Parameters:
source_table – str Name of existing schema and table (e.g.
my_schema.old_table)new_table – str New name of schema and table (e.g.
my_schema.newtable)drop_source_table – boolean Drop the source table.
- populate_table_from_query(query, destination_table, if_exists: Literal['fail', 'append', 'drop', 'truncate'] = 'fail', distkey=None, sortkey=None)¶
Populate a Redshift table with the results of a SQL query, creating the table if it doesn’t yet exist.
- Parameters:
query – str The SQL query
destination_table – str Name of destination schema and table (e.g.
mys_chema.new_table)if_exists (Literal['fail', 'append', 'drop', 'truncate']) – str If the table already exists, either
fail,append,drop, ortruncatethe table.distkey – str The column to use as the distkey for the table.
sortkey – str The column to use as the sortkey for the table.
- rename_table(table_name, new_table_name)¶
Rename an existing table.
Note
You cannot move schemas when renaming a table. Instead, utilize the
duplicate_table()method.- Parameters:
table_name – str Name of existing schema and table (e.g.
myschema.oldtable)new_table_name – str New name for table with the schema omitted (e.g.
newtable).
- static split_full_table_name(full_table_name)¶
Split a full table name into its schema and table. If a schema isn’t present, return public for the schema. Similarly, Redshift defaults to the public schema, when one isn’t provided.
Eg:
(schema, table) = Redshift.split_full_table_name("some_schema.some_table")- Parameters:
full_table_name – str The table name, as “schema.table”
- Returns:
- tuple
A tuple containing (schema, table)
- static split_table_name(full_table_name: str) tuple[str, str] | None¶
Parse the schema and table name.
- table_exists(table_name: str, view: bool = True) bool¶
Check if a table or view exists in the database.
- union_tables(new_table_name, tables, union_all=True, view=False)¶
Union a series of table into a new table.
- Parameters:
new_table_name – str The new table and schema (e.g.
myschema.newtable)tables – list A list of tables to union
union_all – boolean If
Falsewill deduplicate rows. IfTruewill include duplicate rows.view – boolean Create a view rather than a static table
Table and View API¶
Table and view utilities are a series of helper methods, all built off of commonly used SQL queries run against the Redshift database.
- class parsons.databases.redshift.rs_table_utilities.RedshiftTableUtilities[source]¶
- table_exists(table_name: str, view: bool = True) bool[source]¶
Check if a table or view exists in the database.
- get_row_count(table_name)[source]¶
Return the row count of a table.
SQL Code
SELECT COUNT(*) FROM myschema.mytable
- Parameters:
table_name – str The schema and name (e.g.
myschema.mytable) of the table.- Returns:
int
- rename_table(table_name, new_table_name)[source]¶
Rename an existing table.
Note
You cannot move schemas when renaming a table. Instead, utilize the
duplicate_table()method.- Parameters:
table_name – str Name of existing schema and table (e.g.
myschema.oldtable)new_table_name – str New name for table with the schema omitted (e.g.
newtable).
- move_table(source_table, new_table, drop_source_table=False)[source]¶
Move an existing table in the database.It will inherit encoding, sortkey and distkey. Once run, the source table rows will be empty. This is more efficiant than running
"create newtable as select * from oldtable".For more information see: ALTER TABLE APPEND
- Parameters:
source_table – str Name of existing schema and table (e.g.
my_schema.old_table)new_table – str New name of schema and table (e.g.
my_schema.newtable)drop_source_table – boolean Drop the source table.
- populate_table_from_query(query, destination_table, if_exists: Literal['fail', 'append', 'drop', 'truncate'] = 'fail', distkey=None, sortkey=None)[source]¶
Populate a Redshift table with the results of a SQL query, creating the table if it doesn’t yet exist.
- Parameters:
query – str The SQL query
destination_table – str Name of destination schema and table (e.g.
mys_chema.new_table)if_exists (Literal['fail', 'append', 'drop', 'truncate']) – str If the table already exists, either
fail,append,drop, ortruncatethe table.distkey – str The column to use as the distkey for the table.
sortkey – str The column to use as the sortkey for the table.
- duplicate_table(source_table, destination_table, where_clause='', if_exists: Literal['fail', 'append', 'drop', 'truncate'] = 'fail', drop_source_table=False)[source]¶
Create a copy of an existing table (or subset of rows) in a new table. It will inherit encoding, sortkey and distkey.
- Parameters:
source_table – str Name of existing schema and table (e.g.
myschema.oldtable)destination_table – str Name of destination schema and table (e.g.
myschema.newtable)where_clause – str An optional where clause (e.g.
where org = 1).if_exists (Literal['fail', 'append', 'drop', 'truncate']) – str If the table already exists, either
fail,append,drop, ortruncatethe table.drop_source_table – boolean Drop the source table
- union_tables(new_table_name, tables, union_all=True, view=False)[source]¶
Union a series of table into a new table.
- Parameters:
new_table_name – str The new table and schema (e.g.
myschema.newtable)tables – list A list of tables to union
union_all – boolean If
Falsewill deduplicate rows. IfTruewill include duplicate rows.view – boolean Create a view rather than a static table
- get_tables(schema=None, table_name=None)[source]¶
List the tables in a schema including metadata.
- Parameters:
schema – str Filter by a schema
table_name – str Filter by a table name
- Returns:
- Table
See Table for output options.
- get_table_stats(schema=None, table_name=None)[source]¶
List the tables statistics includes row count and size.
Warning
This method is only accessible by Redshift superusers.
- Parameters:
schema – str Filter by a schema
table_name – str Filter by a table name
- Returns:
- Table
See Table for output options.
- get_columns(schema, table_name)[source]¶
Gets the column names (and some other column info) for a table.
If you just need the column names, run
get_columns_list()as it is faster.for col in rs.get_columns('some_schema', 'some_table'): print(col)
- Parameters:
schema – str The schema name
table_name – str The table name
- Returns:
A dict mapping column name to a dict with extra info. The keys of the dict are ordered just like the columns in the table. The extra info is a dict with format
{ 'data_type': str, 'max_length': int or None, 'max_precision': int or None, 'max_scale': int or None, 'is_nullable': bool }
- get_columns_list(schema, table_name)[source]¶
Gets the just the column names for a table.
- Parameters:
schema – str The schema name
table_name – str The table name
- Returns:
A list of column names.
- get_views(schema=None, view=None)[source]¶
List views.
- Parameters:
schema – str Filter by a schema
view – str Filter by a table name
- Returns:
- Table
See Table for output options.
- get_queries()[source]¶
Return the Current queries running and queueing, along with resource consumption.
Warning
Must be a Redshift superuser to run this method.
- Returns:
- Table
See Table for output options.
- get_max_value(table_name, value_column)[source]¶
Return the max value from a table.
- Parameters:
table_name – str Schema and table name
value_column – str The column containing the values
- get_object_type(object_name)[source]¶
Get object type.
One of view, table, index, sequence, or TOAST table.
- Parameters:
object_name – str The schema.obj for which to get the object type.
- Returns:
str of the object type.
- is_view(object_name)[source]¶
Return true if the object is a view.
- Parameters:
object_name – str The schema.obj to test if it’s a view.
- Returns:
bool
- is_table(object_name)[source]¶
Return true if the object is a table.
- Parameters:
object_name – str The schema.obj to test if it’s a table.
- Returns:
bool
- get_table_definition(table)[source]¶
Get the table definition (i.e. the create statement).
- Parameters:
table – str The schema.table for which to get the table definition.
- Returns:
str
- get_table_definitions(schema=None, table=None)[source]¶
Get the table definition (i.e. the create statement) for multiple tables.
This works similar to get_table_def except it runs a single query to get the ddl for multiple tables. It supports SQL wildcards for schema and table. Only returns the ddl for _tables_ that match schema and table if they exist.
- Parameters:
schema – str The schema to filter by.
table – str The table to filter by.
- Returns:
list of dicts with matching tables.
- get_view_definition(view)[source]¶
Get the view definition (i.e. the create statement).
- Parameters:
view – str The schema.view for which to get the view definition.
- Returns:
str
- get_view_definitions(schema=None, view=None)[source]¶
Get the view definition (i.e. the create statement) for multiple views.
This works similar to get_view_def except it runs a single query to get the ddl for multiple views. It supports SQL wildcards for schema and view. Only returns the ddl for _views_ that match schema and view if they exist.
- Parameters:
schema – str The schema to filter by.
view – str The view to filter by.
- Returns:
list of dicts with matching views.
- static split_full_table_name(full_table_name)[source]¶
Split a full table name into its schema and table. If a schema isn’t present, return public for the schema. Similarly, Redshift defaults to the public schema, when one isn’t provided.
Eg:
(schema, table) = Redshift.split_full_table_name("some_schema.some_table")- Parameters:
full_table_name – str The table name, as “schema.table”
- Returns:
- tuple
A tuple containing (schema, table)
Schema API¶
Schema utilities are a series of helper methods, all built off of commonly used SQL queries run against the Redshift database.
- class parsons.databases.redshift.rs_schema.RedshiftSchema[source]¶
- create_schema_with_permissions(schema, group=None)[source]¶
Creates a Redshift schema (if it doesn’t already exist), and grants usage permissions to a Redshift group (if specified).
- Parameters:
schema – str The schema name
group – str The Redshift group name
type –
str The type of permissions to grant. Supports select, all, etc. (For full list, see the Redshift GRANT docs)
- grant_schema_permissions(schema, group, permissions_type='select')[source]¶
Grants a Redshift group permissions to all tables within an existing schema.
- Parameters:
schema – str The schema name
group – str The Redshift group name
type –
str The type of permissions to grant. Supports select, all, etc. (For full list, see the Redshift GRANT docs)