Parsons Table

Overview

Most methods and functions in Parsons return a Table, a two-dimensional list-like object similar to a Pandas DataFrame. You can call the following methods on a Table object to output it to a variety of formats or storage types. A full list of Table methods can be found in the API section.

From Parsons Table

Method          Destination Type             Description
to_csv()        CSV file                     Write a table to a local CSV file
to_avro()       Avro file                    Write a table to a local Avro file
to_s3_csv()     AWS S3 bucket                Write a table to a CSV stored in S3
to_gcs_csv()    Google Cloud Storage bucket  Write a table to a CSV stored in Google Cloud Storage
to_sftp_csv()   SFTP server                  Write a table to a CSV stored on an SFTP server
to_redshift()   Redshift database            Write a table to a Redshift database
to_postgres()   Postgres database            Write a table to a Postgres database
to_civis()      Civis Redshift database      Write a table to a Civis Platform database
to_petl()       petl table object            Convert a table to a petl table object
to_json()       JSON file                    Write a table to a local JSON file
to_html()       HTML file                    Write a table to a local HTML file
to_dataframe()  Pandas DataFrame [1]         Return a Pandas DataFrame
append_csv()    CSV file                     Append a table to an existing CSV file
append_avro()   Avro file                    Append a table to an existing Avro file
to_zip_csv()    ZIP file                     Write a table to a CSV in a ZIP archive
to_dicts()      List of dicts                Return the table as a list of dicts

To Parsons Table

Create a Parsons Table object using the following methods.

Method             Source Type                                Description
from_csv()         File-like object, local path, URL, or FTP  Load a CSV into a Table
from_avro()        Avro file                                  Load a table from a local Avro file
from_json()        File-like object, local path, URL, or FTP  Load a JSON object into a Table
from_columns()     List object                                Load lists organized as columns into a Table
from_redshift()    Redshift table                             Load a Redshift query into a Table
from_postgres()    Postgres table                             Load a Postgres query into a Table
from_dataframe()   Pandas DataFrame [2]                       Load a Parsons table from a Pandas DataFrame
from_s3_csv()      S3 CSV                                     Load a Parsons table from a CSV file on S3
from_csv_string()  CSV string                                 Load a CSV string into a Table

You can also use the Table constructor to create a Table from a python list or petl table:

# From a list of dicts
tbl = Table([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

# From a list of lists, the first list holding the field names
tbl = Table([['a', 'b'], [1, 2], [3, 4]])

# From a petl table
tbl = Table(petl_tbl)

Parsons Table Attributes

Tables have a number of convenience attributes.

Attribute   Description
.num_rows   The number of rows in the table
.columns    A list of the table's column names
.data       The table's rows, as a list of tuples (without field names)
.first      The first value in the table. Useful for database queries that return a single value.

Parsons Table Transformations

Parsons tables have many methods that allow you to easily transform tables. Below is a selection of commonly used methods. The full list can be found in the API section.

Column Transformations

Method              Description
head()              Get the first n rows of a table
tail()              Get the last n rows of a table
add_column()        Add a column
remove_column()     Remove a column
rename_column()     Rename a column
rename_columns()    Rename multiple columns
move_column()       Move a column within a table
cut()               Return a table with a subset of columns
fill_column()       Fill a column with a fixed value
fillna_column()     Fill all null values in a column with a fixed value
get_column_types()  Get the Python types of the values in a given column
convert_column()    Transform the values of a column via arbitrary functions
coalesce_columns()  Coalesce values from one or more source columns
map_columns()       Standardize column names based on multiple possible values

Row Transformations

Method              Description
select_rows()       Return a table with a subset of rows based on filters
stack()             Stack a number of tables on top of one another
chunk()             Divide a table into smaller tables based on row count
remove_null_rows()  Remove rows with null values in specified columns
deduplicate()       Remove duplicate rows based on optional key(s), and optionally sort

Extraction and Reshaping

Method                           Description
unpack_dict()                    Unpack dictionary values from one column into top-level columns
unpack_list()                    Unpack list values from one column into top-level columns
long_table()                     Create a new long table from a column with nested data
unpack_nested_columns_as_rows()  Unpack list or dict values from one column into separate rows

Parsons Table Indexing

To access rows and columns of data within a Parsons table, you can index into it. To access a column, pass in the column name as a string (e.g. tbl['a']); to access a row, pass in the row index as an integer (e.g. tbl[1]).

tbl = Table([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])

# Return a column as a list
tbl['a']
>> [1, 3]

# Return a row as a dict
tbl[1]
>> {'a': 3, 'b': 4}

A note on indexing and iterating over a table’s data: If you need to iterate over the data, make sure to use the python iterator syntax, so any data transformations can be applied efficiently. An example:

# Some data transformations
table.add_column('newcol', 'some value')

# Efficient way to grab all the data (applying the data transformations only once)
rows_list = [row for row in table]

Warning

If you must index directly into a table’s data, you can do so, but note that the data transformations will be applied on each access. The following code will be very inefficient on a large table:

# Inefficient way to grab all the data
rows_list = []
for i in range(0, table.num_rows):
  # Data transformations will be applied each time through this loop!
  rows_list.append(table[i])

PETL

The Parsons Table relies heavily on the petl Python package. You can always access the underlying petl table via parsons.Table.table, which allows you to perform any petl-supported ETL operations. Additionally, you can use the helper method use_petl() to conveniently perform the same operations on a Parsons Table. For example:

import petl
...

tbl = Table()
tbl.table = petl.skipcomments(tbl.table, '#')

or

tbl = Table()
tbl.use_petl('skipcomments', '#', update_table=True)

Lazy Loading

The Parsons Table makes use of “lazy” loading and “lazy” transformations. What this means is that it tries not to load and process your data until absolutely necessary.

An example:

# Specify where to load the data
tbl = Table.from_csv('name_data.csv')

# Specify data transformations
tbl.add_column('full_name', lambda row: row['first_name'] + ' ' + row['last_name'])
tbl.remove_column(['first_name', 'last_name'])

# Save the table elsewhere
# IMPORTANT - The CSV won't actually be loaded and transformed until this step,
# since this is the first time it's actually needed.
tbl.to_redshift('main.name_table')

This “lazy” loading can be very convenient and performant. However, it can make issues hard to debug. For example, if your data transformations are time-consuming, you won’t notice the performance hit until you try to use the data, potentially much later in your code. There may also be cases where you can get faster execution by caching a table, especially when a single table is used as the base for several subsequent calculations.

For these cases, Parsons provides two utility methods to materialize a Table and all of its transformations.

Method                 Description
materialize()          Load all of the Table’s data into memory and apply any transformations
materialize_to_file()  Load all of the Table’s data, apply any transformations, and save it to a local temp file

Examples

Basic Pipelines

# S3 to Civis
s3 = S3()
csv = s3.get_file('tmc-bucket', 'my_ids.csv')
Table.from_csv(csv).to_civis('TMC', 'ids.my_ids')

# VAN activist codes to a DataFrame
van = VAN(db='MyVoters')
van.activist_codes().to_dataframe()

# VAN events to an S3 bucket
van = VAN(db='MyVoters')
van.events().to_s3_csv('my-van-bucket', 'myevents.csv')

To & From API

class parsons.etl.tofrom.ToFrom[source]
to_dataframe(index=None, exclude=None, columns=None, coerce_float=False)[source]

Outputs table as a Pandas Dataframe

Parameters:
  • index – str, list Field of array to use as the index, alternately a specific set of input labels to use

  • exclude – list Columns or fields to exclude

  • columns – list Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns)

Returns:

dataframe

Pandas DataFrame object

to_html(local_path=None, encoding=None, errors='strict', index_header=False, caption=None, tr_style=None, td_styles=None, truncate=None)[source]

Outputs table to html.

Warning

If a file already exists at the given location, it will be overwritten.

Parameters:
  • local_path – str The path to write the html locally. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • encoding – str The encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • index_header – boolean Prepend index to column names; Defaults to False.

  • caption – str A caption to include with the html table.

  • tr_style – str or callable Style to be applied to the table row.

  • td_styles – str, dict or callable Styles to be applied to the table cells.

  • truncate – int Length of cell data.

Returns:

str

The path of the new file

to_avro(target, schema=None, sample=9, codec: Literal['null', 'deflate', 'bzip2', 'snappy', 'zstandard', 'lz4', 'xz'] = 'deflate', compression_level=None, **avro_args)[source]

Outputs table to an Avro file.

In order to use this method, you must have the fastavro library installed. If using limited dependencies, you can install it with pip install parsons[avro].

Write the table into a new Avro file according to the schema passed.

This method assumes that each column has values of the same type for all rows of the source table.

Avro is a data serialization framework that is generally faster and safer than text formats like JSON, XML or CSV.

Parameters:
  • target – str the file path for creating the avro file. Note that if a file already exists at the given location, it will be overwritten.

  • schema – dict defines the rows field structure of the file. Check fastavro [documentation](https://fastavro.readthedocs.io/en/latest/) and Avro schema [reference](https://avro.apache.org/docs/1.8.2/spec.html#schemas) for details.

  • sample – int, optional defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed. Default is 9.

  • codec – str, optional The compression codec used to shrink data in the file. One of ‘null’, ‘deflate’ (default), ‘bzip2’, ‘snappy’, ‘zstandard’, ‘lz4’ or ‘xz’ (if installed)

  • compression_level – int, optional sets the level of compression to use with the specified codec (if the codec supports it)

  • **avro_args – kwargs Extra options forwarded directly to fastavro. [Check the fastavro documentation](https://fastavro.readthedocs.io/en/latest/) for reference.

Example usage for writing files:

>>> # set up a table to demonstrate with
>>> table2 = [['name', 'friends', 'age'],
...           ['Bob', 42, 33],
...           ['Jim', 13, 69],
...           ['Joe', 86, 17],
...           ['Ted', 23, 51]]
...
>>> schema2 = {
...     'doc': 'Some people records.',
...     'name': 'People',
...     'namespace': 'test',
...     'type': 'record',
...     'fields': [
...         {'name': 'name', 'type': 'string'},
...         {'name': 'friends', 'type': 'int'},
...         {'name': 'age', 'type': 'int'},
...     ]
... }
...
>>> # now demonstrate writing with to_avro()
>>> from parsons import Table

>>> Table(table2).to_avro('example.file2.avro', schema=schema2)
...
>>> # read back what was saved above
>>> tbl2 = Table.from_avro('example.file2.avro')
>>> tbl2
+-------+---------+-----+
| name  | friends | age |
+=======+=========+=====+
| 'Bob' |      42 |  33 |
+-------+---------+-----+
| 'Jim' |      13 |  69 |
+-------+---------+-----+
| 'Joe' |      86 |  17 |
+-------+---------+-----+
| 'Ted' |      23 |  51 |
+-------+---------+-----+
append_avro(target, schema=None, sample=9, **avro_args)[source]

Append table to an existing Avro file.

Write the table into an existing Avro file according to the schema passed.

This method assumes that each column has values of the same type for all rows of the source table.

Parameters:
  • target – str The file path of the existing Avro file.

  • schema – dict defines the rows field structure of the file. Check fastavro [documentation](https://fastavro.readthedocs.io/en/latest/) and Avro schema [reference](https://avro.apache.org/docs/1.8.2/spec.html#schemas) for details.

  • sample – int, optional defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed. Default is 9.

  • **avro_args – kwargs Extra options forwarded directly to fastavro. Check the fastavro [documentation](https://fastavro.readthedocs.io/en/latest/) for reference.

to_csv(local_path=None, temp_file_compression=None, encoding=None, errors='strict', write_header=True, csv_name=None, **csvargs)[source]

Outputs table to a CSV. Additional keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Warning

If a file already exists at the given location, it will be overwritten.

Parameters:
  • local_path – str The path to write the csv locally. If it ends in “.gz” or “.zip”, the file will be compressed. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • temp_file_compression – str If a temp file is requested (ie. no local_path is specified), the compression type for that file. Currently “None”, “gzip” or “zip” are supported. If a local_path is specified, this argument is ignored.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • csv_name – str If zip compression (either specified or inferred), the name of csv file within the archive.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

str

The path of the new file

append_csv(local_path, encoding=None, errors='strict', **csvargs)[source]

Appends table to an existing CSV.

Additional keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Parameters:
  • local_path – str The local path of an existing CSV file. If it ends in “.gz”, the file will be compressed.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • **csvargs – kwargs csv_writer optional arguments

Returns:

str

The path of the file

to_zip_csv(archive_path=None, csv_name=None, encoding=None, errors='strict', write_header=True, if_exists: Literal['replace', 'append'] = 'replace', **csvargs)[source]

Outputs table to a CSV in a zip archive. Additional keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument. Use this method if you would like to write multiple csv files to the same archive.

Warning

If a file already exists in the archive, it will be overwritten.

Parameters:
  • archive_path – str The path to the zip archive. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • csv_name – str The name of the csv file to be stored in the archive. If None, the archive name will be used.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • if_exists – str If archive already exists, one of ‘replace’ or ‘append’

  • **csvargs – kwargs csv_writer optional arguments

Returns:

str

The path of the archive

to_json(local_path=None, temp_file_compression=None, line_delimited=False)[source]

Outputs table to a JSON file

Warning

If a file already exists at the given location, it will be overwritten.

Parameters:
  • local_path – str The path to write the JSON locally. If it ends in “.gz”, it will be compressed first. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • temp_file_compression – str If a temp file is requested (ie. no local_path is specified), the compression type for that file. Currently “None” and “gzip” are supported. If a local_path is specified, this argument is ignored.

  • line_delimited – bool Whether the file will be line-delimited JSON (with a row on each line), or a proper JSON file.

Returns:

str

The path of the new file

to_dicts()[source]

Output table as a list of dicts.

Returns:

list

to_sftp_csv(remote_path, host, username, password, port=22, encoding=None, compression=None, errors='strict', write_header=True, rsa_private_key_file=None, **csvargs)[source]

Writes the table to a CSV file on a remote SFTP server

Parameters:
  • remote_path – str The remote path of the file. If it ends in ‘.gz’, the file will be compressed.

  • host – str The remote host

  • username – str The username to access the SFTP server

  • password – str The password to access the SFTP server

  • port – int The port number of the SFTP server

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • rsa_private_key_file – str Absolute path to a private RSA key used to authenticate the SFTP connection

  • **csvargs – kwargs csv_writer optional arguments

to_s3_csv(bucket, key, aws_access_key_id=None, aws_secret_access_key=None, compression=None, encoding=None, errors='strict', write_header=True, acl='bucket-owner-full-control', public_url=False, public_url_expires=3600, use_env_token=True, **csvargs)[source]

Writes the table to an s3 object as a CSV

Parameters:
  • bucket – str The s3 bucket to upload to

  • key – str The s3 key to name the file. If it ends in ‘.gz’ or ‘.zip’, the file will be compressed.

  • aws_access_key_id – str Required if not included as environmental variable

  • aws_secret_access_key – str Required if not included as environmental variable

  • compression – str The compression type for the s3 object. Currently “None”, “zip” and “gzip” are supported. If specified, will override the key suffix.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • public_url – boolean Create a public link to the file

  • public_url_expires – int The time, in seconds, until the url expires if public_url is set to True. Defaults to 3600.

  • acl – str The S3 permissions on the file

  • use_env_token – boolean Controls use of the AWS_SESSION_TOKEN environment variable for S3. Defaults to True. Set to False in order to ignore the AWS_SESSION_TOKEN env variable even if the aws_session_token argument was not passed in.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

The public url if public_url is set to True; otherwise None.

to_gcs_csv(bucket_name, blob_name, gcs_client=None, app_creds=None, project=None, compression=None, encoding=None, errors='strict', write_header=True, public_url=False, public_url_expires=60, **csvargs)[source]

Writes the table to a Google Cloud Storage blob as a CSV.

Parameters:
  • bucket_name – str The bucket to upload to

  • blob_name – str The blob to name the file. If it ends in ‘.gz’ or ‘.zip’, the file will be compressed.

  • app_creds – str A credentials json string or a path to a json file. Not required if GOOGLE_APPLICATION_CREDENTIALS env variable set.

  • project – str The project which the client is acting on behalf of. If not passed then will use the default inferred environment.

  • compression – str The compression type for the csv. Currently “None”, “zip” and “gzip” are supported. If specified, will override the key suffix.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • public_url – boolean Create a public link to the file

  • public_url_expires – int The time, in minutes, until the url expires if public_url is set to True. Defaults to 60.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

The public url if public_url is set to True; otherwise None.

to_redshift(table_name, username=None, password=None, host=None, db=None, port=None, **copy_args)[source]

Write a table to a Redshift database. Note, this requires you to pass AWS S3 credentials or store them as environmental variables.

Parameters:
  • table_name – str The table name and schema (my_schema.my_table) to point the file.

  • username – str Required if env variable REDSHIFT_USERNAME not populated

  • password – str Required if env variable REDSHIFT_PASSWORD not populated

  • host – str Required if env variable REDSHIFT_HOST not populated

  • db – str Required if env variable REDSHIFT_DB not populated

  • port – int Required if env variable REDSHIFT_PORT not populated. Port 5439 is typical.

  • **copy_args – kwargs See copy() for options.

to_postgres(table_name, username=None, password=None, host=None, db=None, port=None, **copy_args)[source]

Write a table to a Postgres database.

Parameters:
  • table_name – str The table name and schema (my_schema.my_table) to point the file.

  • username – str Required if env variable PGUSER not populated

  • password – str Required if env variable PGPASSWORD not populated

  • host – str Required if env variable PGHOST not populated

  • db – str Required if env variable PGDATABASE not populated

  • port – int Required if env variable PGPORT not populated.

  • **copy_args – kwargs See copy() for options.

to_bigquery(table_name: str, app_creds: str | None = None, project: str | None = None, **kwargs)[source]

Write a table to BigQuery

Parameters:
  • table_name – str Table name to write to in BigQuery; this should be in schema.table format

  • app_creds – str A credentials json string or a path to a json file. Not required if GOOGLE_APPLICATION_CREDENTIALS env variable set.

  • project – str The project which the client is acting on behalf of. If not passed then will use the default inferred environment.

  • **kwargs – kwargs Additional keyword arguments passed into the .copy() function (if_exists, max_errors, etc.)

to_civis(table, api_key=None, db=None, max_errors=None, existing_table_rows: Literal['fail', 'truncate', 'append', 'drop'] = 'fail', diststyle: Literal['even', 'all', 'key'] | None = None, distkey=None, sortkey1=None, sortkey2=None, wait=True, **civisargs)[source]

Write the table to a Civis Redshift cluster. Additional keyword arguments can be passed to civis.io.dataframe_to_civis()

Args
table: str

The schema and table you want to upload to. E.g., ‘scratch.table’. Schemas or tablenames with periods must be double quoted, e.g. ‘scratch.”my.table”’.

api_key: str

Your Civis API key. If not given, the CIVIS_API_KEY environment variable will be used.

db: str or int

The Civis Database. Can be database name or ID

max_errors: int

The maximum number of rows with errors to remove from the import before failing.

diststyle: str

The distribution style for the table. One of ‘even’, ‘all’ or ‘key’.

existing_table_rows: str

The behaviour if a table with the requested name already exists. One of ‘fail’, ‘truncate’, ‘append’ or ‘drop’. Defaults to ‘fail’.

distkey: str

The column to use as the distkey for the table.

sortkey1: str

The column to use as the sortkey for the table.

sortkey2: str

The second column in a compound sortkey for the table.

wait: boolean

Wait for write job to complete before exiting method.

classmethod from_avro(local_path, limit=None, skips=0, **avro_args)[source]

Create a parsons table from an Avro file.

Parameters:
  • local_path – str The path to the Avro file.

  • limit – int, optional The maximum number of rows to extract. Default is None (all rows).

  • skips – int, optional The number of rows to skip from the start. Default is 0.

  • **avro_args – kwargs Additional arguments passed to fastavro.reader.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_csv(local_path, **csvargs)[source]

Create a parsons table object from a CSV file

Parameters:
  • local_path – obj A csv formatted local path, url or ftp. If this is a file path that ends in “.gz”, the file will be decompressed first.

  • **csvargs – kwargs csv_reader optional arguments

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_csv_string(str, **csvargs)[source]

Create a parsons table object from a string representing a CSV.

Parameters:
  • str – str The string object to convert to a table

  • **csvargs – kwargs csv_reader optional arguments

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_columns(cols, header=None)[source]

Create a parsons table from a list of lists organized as columns

Parameters:
  • cols – list A list of lists organized as columns

  • header – list List of column names. If not specified, will use dummy column names

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_json(local_path, header=None, line_delimited=False)[source]

Create a parsons table from a json file

Parameters:
  • local_path – str A JSON formatted local path, url or ftp. If this is a file path that ends in “.gz”, the file will be decompressed first.

  • header – list List of columns to use for the destination table. If omitted, columns will be inferred from the initial data in the file.

  • line_delimited – bool Whether the file is line-delimited JSON (with a row on each line), or a proper JSON file.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_redshift(sql, username=None, password=None, host=None, db=None, port=None)[source]

Create a parsons table from a Redshift query.

To pull an entire Redshift table, use a query like SELECT * FROM tablename.

Parameters:
  • sql – str A valid SQL statement

  • username – str Required if env variable REDSHIFT_USERNAME not populated

  • password – str Required if env variable REDSHIFT_PASSWORD not populated

  • host – str Required if env variable REDSHIFT_HOST not populated

  • db – str Required if env variable REDSHIFT_DB not populated

  • port – int Required if env variable REDSHIFT_PORT not populated. Port 5439 is typical.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_postgres(sql, username=None, password=None, host=None, db=None, port=None)[source]
Create a parsons table from a Postgres query.

To pull an entire Postgres table, use a query like SELECT * FROM tablename.
Parameters:
  • sql – str A valid SQL statement

  • username – str Required if env variable PGUSER not populated

  • password – str Required if env variable PGPASSWORD not populated

  • host – str Required if env variable PGHOST not populated

  • db – str Required if env variable PGDATABASE not populated

  • port – int Required if env variable PGPORT not populated.

classmethod from_s3_csv(bucket, key, from_manifest=False, aws_access_key_id=None, aws_secret_access_key=None, **csvargs)[source]

Create a parsons table from a key in an S3 bucket.

Parameters:
  • bucket – str The S3 bucket.

  • key – str The S3 key

  • from_manifest – bool If True, treats key as a manifest file and loads all urls into a parsons.Table. Defaults to False.

  • aws_access_key_id – str Required if not included as environmental variable.

  • aws_secret_access_key – str Required if not included as environmental variable.

  • **csvargs – kwargs csv_reader optional arguments

Returns:

parsons.Table object

classmethod from_bigquery(sql: str, app_creds: str | None = None, project: str | None = None)[source]

Create a parsons table from a BigQuery statement.

To pull an entire BigQuery table, use a query like SELECT * FROM {{ table }}.

Parameters:
  • sql – str A valid SQL statement

  • app_creds – str A credentials json string or a path to a json file. Not required if GOOGLE_APPLICATION_CREDENTIALS env variable set.

  • project – str The project which the client is acting on behalf of. If not passed then will use the default inferred environment.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_dataframe(dataframe, include_index=False)[source]

Create a parsons table from a Pandas dataframe.

Parameters:
  • dataframe – dataframe A valid Pandas dataframe object

  • include_index – boolean Include index column

Transformation API

The following methods allow you to manipulate the Parsons table data.

class parsons.etl.etl.ETL[source]
head(n: int = 5)[source]

Return the first n rows of the table

Parameters:

n – int The number of rows to return. Defaults to 5.

Returns:

Parsons Table

tail(n: int = 5)[source]

Return the last n rows of the table

Parameters:

n – int, optional The number of rows to return. Defaults to 5.

Returns:

parsons.Table

add_column(column, value=None, index=None, if_exists: str = 'fail')[source]

Add a column to your table

Parameters:
  • column – str Name of column to add

  • value – optional A fixed or calculated value

  • index – int, optional The position of the new column in the table

  • if_exists – str (options: ‘fail’, ‘replace’) If set to ‘replace’, this function will call fill_column if the column already exists, rather than raising a ValueError. Defaults to ‘fail’.

Returns:

parsons.Table

Also updates self

remove_column(*columns)[source]

Remove a column from your table

Parameters:

*columns – str Column names

Returns:

parsons.Table

Also updates self

rename_column(column_name, new_column_name)[source]

Rename a column

Parameters:
  • column_name – str The current column name

  • new_column_name – str The new column name

Returns:

parsons.Table

Also updates self

rename_columns(column_map)[source]

Rename multiple columns

Parameters:

column_map

dict A dictionary of columns and new names. The key is the old name and the value is the new name.

Example dictionary: {‘old_name’: ‘new_name’, ‘old_name2’: ‘new_name2’}

Returns:

parsons.Table

Also updates self

fill_column(column_name, fill_value)[source]

Fill a column in a table

Parameters:
  • column_name – str The column to fill

  • fill_value – A fixed or calculated value

Returns:

parsons.Table

Also updates self

fillna_column(column_name, fill_value)[source]

Fill None values in a column in a table

Parameters:
  • column_name – str The column to fill

  • fill_value – A fixed or calculated value

Returns:

parsons.Table

Also updates self

move_column(column, index)[source]

Move a column

Parameters:
  • column – str The column name to move

  • index – The new index for the column

Returns:

parsons.Table

Also updates self

convert_column(*column, **kwargs)[source]

Transform values under one or more fields via arbitrary functions, method invocations or dictionary translations. This leverages the petl convert() method; see the petl documentation for example usage.

Parameters:
  • *column – str A single column or multiple columns passed as a list

  • **kwargs – str, method or variable The update function, method, or variable to process the update

Returns:

parsons.Table

Also updates self

get_column_max_width(column: str)[source]

Return the maximum width of the column.

Parameters:

column – str The column name.

Returns:

int

convert_columns_to_str()[source]

Convenience function to convert all non-string or mixed columns in a Parsons table to string (e.g. for comparison)

Returns:

parsons.Table

Also updates self

coalesce_columns(dest_column, source_columns, remove_source_columns=True)[source]

Coalesces values from one or more source columns into a destination column, by selecting the first non-empty value. If the destination column doesn’t exist, it will be added.

Parameters:
  • dest_column – str Name of destination column

  • source_columns – list List of source column names

  • remove_source_columns – bool Whether to remove the source columns after the coalesce. If the destination column is also one of the source columns, it will not be removed.

Returns:

parsons.Table

Also updates self
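
The "first non-empty value" rule can be sketched per-row in plain Python (an illustrative re-implementation, not the parsons source):

```python
def coalesce(row, source_columns):
    """Sketch: pick the first non-empty value among the source columns."""
    for col in source_columns:
        value = row.get(col)
        if value not in (None, ""):
            return value
    return None

row = {"cell": "", "home_phone": None, "work_phone": "512-555-0100"}
coalesce(row, ["cell", "home_phone", "work_phone"])
```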

map_columns(column_map, exact_match=True)[source]

Standardizes column names based on multiple possible values. This method is helpful when your input table might have multiple and unknown column names.

tbl = Table([{'fn': 'Jane', 'lastname': 'Doe', 'dob': '1980-01-01'}])

column_map = {
    'first_name': ['fn', 'first', 'firstname'],
    'last_name': ['ln', 'last', 'lastname'],
    'date_of_birth': ['dob', 'birthday']
}

tbl.map_columns(column_map)
print (tbl)
>> {'first_name': 'Jane', 'last_name': 'Doe', 'date_of_birth': '1980-01-01'}
Parameters:
  • column_map – dict A dictionary of columns and possible values that map to it

  • exact_match – boolean If True will only map if an exact match. If False will ignore case, spaces and underscores.

Returns:

parsons.Table

Also updates self

map_and_coalesce_columns(column_map)[source]

Coalesces columns based on multiple possible values. The columns in the map do not need to be in your table, so you can create a map with all possibilities. The coalesce will occur in the order that the columns are listed, unless the destination column name already exists in the table, in which case that value will be preferenced. This method is helpful when your input table might have multiple and unknown column names.

tbl = Table([{'first': None, 'fn': 'Jane', 'lastname': 'Doe', 'dob': '1980-01-01'}])

column_map = {
    'first_name': ['fn', 'first', 'firstname'],
    'last_name': ['ln', 'last', 'lastname'],
    'date_of_birth': ['dob', 'birthday']
}

tbl.map_and_coalesce_columns(column_map)

print (tbl)
>> {'first_name': 'Jane', 'last_name': 'Doe', 'date_of_birth': '1980-01-01'}
Parameters:

column_map – dict A dictionary of columns and possible values that map to it

Returns:

parsons.Table

Also updates self

get_column_types(column)[source]

Return all of the Python types for values in a given column

Parameters:

column – str Name of the column to analyze

Returns:

list

A list of Python types
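
What this reports can be sketched over a plain list of column values (an illustrative re-implementation, not the parsons source):

```python
def get_column_types(values):
    """Sketch: the distinct Python type names present in a column."""
    return sorted({type(v).__name__ for v in values})

get_column_types([1, "a", None, 2.5])
```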

get_columns_type_stats()[source]

Return descriptive stats for all columns

Returns:

list[dict]

A list of dicts, each containing a column ‘name’ and a ‘type’ list

convert_table(*args)[source]

Transform all cells in a table via arbitrary functions, method invocations or dictionary translations. This method is useful for cleaning fields and data hygiene functions such as regex. This method leverages the petl convert() method. Example usage can be found in the petl documentation (https://petl.readthedocs.io/en/v0.24/transform.html#petl.convert).

Parameters:

*args – str, method or variable The update function, method, or variable to process the update.

Returns:

parsons.Table

Also updates self

unpack_dict(column, keys=None, include_original=False, sample_size=5000, missing=None, prepend=True, prepend_value=None)[source]

Unpack dictionary values from one column into separate columns

Parameters:
  • column – str The column name to unpack

  • keys – list The dict keys in the column to unpack. If None will unpack all.

  • include_original – boolean Retain original column after unpacking

  • sample_size – int Number of rows to sample before determining columns

  • missing – str If a value is missing, the value to fill it with

  • prepend – Prepend the column name of the unpacked values. Useful for avoiding duplicate column names

  • prepend_value – Value to prepend new columns if prepend=True. If None, will set to column name.
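
The unpacking can be sketched over dict rows in plain Python. This is an illustrative re-implementation (not the parsons source) covering only the column and prepend/missing arguments:

```python
def unpack_dict_column(rows, column, prepend=True, missing=None):
    """Sketch: spread the dict values in `column` into separate columns."""
    keys = sorted({k for row in rows for k in (row.get(column) or {})})
    for row in rows:
        packed = row.pop(column, None) or {}
        for k in keys:
            name = f"{column}_{k}" if prepend else k
            row[name] = packed.get(k, missing)
    return rows

rows = [{"id": 1, "contact": {"email": "a@example.org"}},
        {"id": 2, "contact": {"phone": "555-0100"}}]
unpack_dict_column(rows, "contact")
```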

unpack_list(column, include_original=False, missing=None, replace=False, max_columns=None)[source]

Unpack list values from one column into separate columns. Numbers the columns.

# Begin with a list in column
json = [
    {
        'id': '5421',
        'name': 'Jane Green',
        'phones': ['512-699-3334', '512-222-5478']
    }
]

tbl = Table(json)
print (tbl)
>>> {'id': '5421', 'name': 'Jane Green', 'phones': ['512-699-3334', '512-222-5478']}

tbl.unpack_list('phones', replace=True)
print (tbl)
>>> {'id': '5421', 'name': 'Jane Green', 'phones_0': '512-699-3334', 'phones_1': '512-222-5478'}
Parameters:
  • column – str The column name to unpack

  • include_original – boolean Retain original column after unpacking

  • missing – str If a value is missing, the value to fill it with

  • replace – boolean Return new table or replace existing

  • max_columns – int The maximum number of columns to unpack

unpack_nested_columns_as_rows(column, key='id', expand_original: bool | int = False)[source]

Unpack list or dict values from one column into separate rows. Not recommended for JSON columns (i.e. lists of dicts), but can handle columns with any mix of types. Makes use of PETL’s melt() method.

Parameters:
  • column – str The column name to unpack

  • key – str The column to use as a key when unpacking. Defaults to id

  • expand_original – boolean or int If True, add the resulting unpacked rows (with all other columns) to the original. If an int, add them to the original unless the maximum added per key exceeds that number. If False (the default), return the unpacked rows (with the key column only) as a standalone table. Packed list and dict rows are removed from the original either way.

Returns:

parsons.Table

If expand_original, original table with packed rows replaced by unpacked rows. Otherwise, standalone table with key column and unpacked values only

long_table(key, column, key_rename=None, retain_original=False, prepend=True, prepend_value=None)[source]

Create a new long parsons table from a column, including the foreign key.

# Begin with nested dicts in a column
json = [
    {
        'id': '5421',
        'name': 'Jane Green',
        'emails': [
            {'home': 'jane@gmail.com'},
            {'work': 'jane@mywork.com'}
        ]
    }
]
tbl = Table(json)
print (tbl)
>>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]}
>>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]}

# Create skinny table of just the nested dicts
email_skinny = tbl.long_table(['id'], 'emails')

print (email_skinny)
>>> {'id': '5421', 'emails_home': 'jane@gmail.com', 'emails_work': None}
>>> {'id': '5421', 'emails_home': None, 'emails_work': 'jane@mywork.com'}
Parameters:
  • key – list The columns to retain in the long table (e.g. foreign keys)

  • column – str The column name to make long

  • key_rename – dict The new name for the foreign key to better identify it. For example, you might want to rename id to person_id. Ex. {‘KEY_NAME’: ‘NEW_KEY_NAME’}

  • retain_original – boolean Retain the original column from the source table.

  • prepend – Prepend the column name of the unpacked values. Useful for avoiding duplicate column names

  • prepend_value – Value to prepend new columns if prepend=True. If None, will set to column name.

Returns:

parsons.Table

The new long table

cut(*columns)[source]

Return a table of selection of columns

Parameters:

*columns – str Columns in the parsons table

Returns:

parsons.Table

Selected columns

select_rows(*filters)[source]

Select specific rows from a Parsons table based on the passed filters.

Example filters:

tbl = Table(
    [
        ['foo', 'bar', 'baz'],
        ['c', 4, 9.3],
        ['a', 2, 88.2],
        ['b', 1, 23.3]
    ]
)

# You can structure the filter in multiple ways

# Lambda Function
tbl2 = tbl.select_rows(lambda row: row.foo == 'a' and row.baz > 88.1)
tbl2
>>> {'foo': 'a', 'bar': 2, 'baz': 88.2}

# Expression String
tbl3 = tbl.select_rows("{foo} == 'a' and {baz} > 88.1")
tbl3
>>> {'foo': 'a', 'bar': 2, 'baz': 88.2}
Parameters:

*filters – function or str

Returns:

A new parsons table containing the selected rows

remove_null_rows(columns, null_value=None)[source]

Remove rows if the values in a column are None. If multiple columns are passed as list, it will remove all rows with null values in any of the passed columns.

Parameters:
  • columns – str or list The column or columns to analyze

  • null_value – int or float or str The null value
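
The multi-column behavior can be sketched in plain Python (an illustrative re-implementation over dict rows, not the parsons source):

```python
def remove_null_rows(rows, columns, null_value=None):
    """Sketch: drop any row whose value in one of `columns` equals the null value."""
    if isinstance(columns, str):
        columns = [columns]
    return [row for row in rows
            if all(row.get(col) != null_value for col in columns)]

rows = [{"a": 1, "b": 2}, {"a": None, "b": 3}, {"a": 4, "b": None}]
remove_null_rows(rows, "a")          # drops the row where a is None
remove_null_rows(rows, ["a", "b"])   # drops rows where a OR b is None
```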

stack(*tables, missing=None)[source]

Stack Parsons tables on top of one another.

Similar to table.concat(), except no attempt is made to align fields from different tables.

Parameters:
  • tables – parsons.Table or list A single table, or a list of tables

  • missing – The value to use when padding missing values

concat(*tables, missing=None)[source]

Concatenates one or more tables onto this one.

Note that the tables do not need to share exactly the same fields. Any missing fields will be padded with None, or whatever is provided via the missing keyword argument.

Parameters:
  • tables – parsons.Table or list A single table, or a list of tables

  • missing – The value to use when padding missing values

chunk(rows: int)[source]

Divides a Parsons table into smaller tables of a specified row count. If the table cannot be divided evenly, then the final table will only include the remainder.

Parameters:

rows – int The number of rows of each new Parsons table

Returns:

list[parsons.Table]
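
The chunking rule (equal-size chunks, remainder in the last) can be sketched over a plain row list. This is an illustrative re-implementation, not the parsons source:

```python
def chunk(rows, size):
    """Sketch: split a row list into consecutive chunks of `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

chunk(list(range(7)), 3)   # the last chunk holds the remainder
```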

static get_normalized_column_name(column_name: str) → str[source]

Returns a column name with whitespace removed, non-alphanumeric characters removed, and everything lowercased.

Parameters:

column_name – str

Returns:

str

Normalized column name
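
The normalization can be sketched with a one-line regex (an illustrative re-implementation, not the parsons source):

```python
import re

def get_normalized_column_name(column_name):
    """Sketch: drop whitespace and non-alphanumerics, lowercase the rest."""
    return re.sub(r"[^0-9a-z]", "", column_name.lower())

get_normalized_column_name("FIRST NAME")   # 'firstname'
get_normalized_column_name("first_name")   # 'firstname'
```

This is the same normalization that match_columns applies when fuzzy_match is set.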

match_columns(desired_columns, fuzzy_match=True, if_extra_columns: Literal['remove', 'ignore', 'fail'] = 'remove', if_missing_columns: Literal['add', 'ignore', 'fail'] = 'add')[source]

Changes the column names and ordering in this Table to match a list of desired column names.

Parameters:
  • desired_columns – list Ordered list of desired column names

  • fuzzy_match – bool Whether to normalize column names when matching against the desired column names, removing whitespace and non-alphanumeric characters, and lowercasing everything. Eg. With this flag set, “FIRST NAME” would match “first_name”. If the Table has two columns that normalize to the same string (eg. “FIRST NAME” and “first_name”), the latter will be considered an extra column.

  • if_extra_columns – string If the Table has columns that don’t match any desired columns, either ‘remove’ them, ‘ignore’ them, or ‘fail’ (raising an error).

  • if_missing_columns – string If the Table is missing some of the desired columns, either ‘add’ them (with a value of None), ‘ignore’ them, or ‘fail’ (raising an error).

Returns:

parsons.Table

Also updates self

reduce_rows(columns, reduce_func, headers, presorted=False, **kwargs)[source]

Group rows by a column or columns, then reduce the groups to a single row.

For example, the output from the query to get a table’s definition is returned as one component per row. The reduce_rows method can be used to reduce all those rows to a single row containing the entire query.

Based on the rowreduce petl function.

ddl = rs.query(sql_to_get_table_ddl)
ddl.table

+--------------+--------------+----------------------------------------------------+
| schemaname   | tablename    | ddl                                                |
+==============+==============+====================================================+
| 'db_scratch' | 'state_fips' | '--DROP TABLE db_scratch.state_fips;'              |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | 'CREATE TABLE IF NOT EXISTS db_scratch.state_fips' |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | '('                                                |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | '\tstate VARCHAR(1024) ENCODE RAW'                 |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | '\t,stusab VARCHAR(1024) ENCODE RAW'               |
+--------------+--------------+----------------------------------------------------+

reducer_fn = lambda cols, rows: [
    f"{cols[0]}.{cols[1]}",
    "\n".join([row[2] for row in rows])
]
ddl.reduce_rows(
    ['schemaname', 'tablename'],
    reducer_fn,
    ['tablename', 'ddl'],
    presorted=True
)
ddl.table

+-------------------------+----------------------------------------------------+
| tablename               | ddl                                                |
+=========================+====================================================+
| 'db_scratch.state_fips' | '--DROP TABLE db_scratch.state_fips;\nCREATE       |
|                         | TABLE IF NOT EXISTS db_scratch.state_fips\n(\n     |
|                         | \tstate VARCHAR(1024) ENCODE RAW\n\t,stusab        |
|                         | VARCHAR(1024) ENCODE RAW\n\t,state_name            |
|                         | VARCHAR(1024) ENCODE RAW\n\t,statens               |
|                         | VARCHAR(1024) ENCODE RAW\n)\nDISTSTYLE EVEN\n;'    |
+-------------------------+----------------------------------------------------+

Parameters:
  • columns (list) – The column(s) by which to group the rows.

  • reduce_func (function) – The function by which to reduce the rows. It should take two arguments, the columns list and the rows list, and return a list: reducer(columns: list, rows: list) -> list

  • headers (list) – The list of headers for modified table. The length of headers should match the length of the list returned by the reduce function.

  • presorted (bool) – If False, the rows will be sorted first.

  • **kwargs – Extra options to pass to petl.rowreduce

Returns:

parsons.Table

Also updates self

sort(columns=None, reverse=False)[source]

Sort the rows of a table.

Parameters:
  • columns – list or str Sort by a single column or a list of columns. If None, sorts columns from left to right.

  • reverse – boolean Sort rows in reverse order.

Returns:

Parsons Table and also updates self

set_header(new_header)[source]

Replace the header row of the table.

Parameters:

new_header – list List of new header column names

Returns:

parsons.Table

Also updates self

use_petl(petl_method, *args, **kwargs)[source]

Call a petl function on the current table.

This convenience method exposes the petl functions to the current Table. This is useful in cases where one might need a petl function that has not yet been implemented for parsons.Table.

# https://petl.readthedocs.io/en/v1.6.0/transform.html#petl.transform.basics.skipcomments
tbl = Table(
    [
        ['col1', 'col2'],
        ['# this is a comment row'],
        ['a', 1],
        ['#this is another comment', 'this is also ignored'],
        ['b', 2]
    ]
)
tbl.use_petl('skipcomments', '#', update_table=True)

>>> {'col1': 'a', 'col2': 1}
>>> {'col1': 'b', 'col2': 2}

+------+------+
| col1 | col2 |
+======+======+
| 'a'  |    1 |
+------+------+
| 'b'  |    2 |
+------+------+
Parameters:
  • petl_method – str The petl function to call

  • update_table – bool If True, updates the parsons.Table. Defaults to False.

  • to_petl – bool If True, returns a petl table, otherwise a parsons.Table. Defaults to False.

  • *args – Any The arguments to pass to the petl function.

  • **kwargs – Any The keyword arguments to pass to the petl function.

Returns:

parsons.Table or petl table

deduplicate(keys=None, presorted=False)[source]

Deduplicates table based on an optional keys argument, which can contain any number of keys or None.

Method considers all keys specified in the keys argument when deduplicating, not each key individually. For example, if keys=['a', 'b'], the method will not remove a record unless it’s identical to another record in both columns a and b.

tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]])

+---+---+
| a | b |
+===+===+
| 1 | 3 |
+---+---+
| 1 | 2 |
+---+---+
| 1 | 2 |
+---+---+
| 2 | 3 |
+---+---+

tbl.deduplicate('a')
# removes all subsequent rows with {'a': 1}

+---+---+
| a | b |
+===+===+
| 1 | 3 |
+---+---+
| 2 | 3 |
+---+---+

tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]]) # reset
tbl.deduplicate(['a', 'b'])
# sorted on both ('a', 'b') so (1, 2) was placed before (1, 3)
# did not remove second instance of {'a': 1} or {'b': 3}

+---+---+
| a | b |
+===+===+
| 1 | 2 |
+---+---+
| 1 | 3 |
+---+---+
| 2 | 3 |
+---+---+

 tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]]) # reset
 tbl.deduplicate('a').deduplicate('b')
 # can chain method to sort/dedupe on 'a', then sort/dedupe on 'b'

+---+---+
| a | b |
+===+===+
| 1 | 3 |
+---+---+

tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]]) # reset
tbl.deduplicate('b').deduplicate('a')
# Order DOES matter when deduping on one column at a time

+---+---+
| a | b |
+===+===+
| 1 | 2 |
+---+---+
Parameters:
  • keys – str or list[str] or None keys to deduplicate (and optionally sort) on.

  • presorted – bool If False, the rows will be sorted first.

Returns:

parsons.Table

Also updates self

Materialize API

class parsons.etl.table.Table(lst: list | tuple | Table | _EmptyDefault = _EmptyDefault.token)[source]

Create a Parsons Table. Accepts one of the following:

  • A list of lists, with list[0] holding field names, and the other lists holding data

  • A list of dicts

  • A petl table

Parameters:
  • lst – list See above for accepted list formats

  • source – str The original data source from which the data was pulled (optional)

  • name – str The name of the table (optional)

property num_rows

Returns: int

Number of rows in the table

property data

Returns an iterable object for iterating over the raw data rows as tuples (without field names)

property columns

Returns: list

List of the table’s column names

property first

Returns the first value in the table. Useful for database queries that only return a single value.

row_data(row_index)[source]

Returns a row in table

Parameters:

row_index – int

Returns:

dict

A dictionary of the row with the column as the key and the cell as the value.

add_column(column, value=None, index=None, if_exists: str = 'fail')

Add a column to your table

Parameters:
  • column – str Name of column to add

  • value – optional A fixed or calculated value

  • index – int, optional The position of the new column in the table

  • if_exists – str (options: ‘fail’, ‘replace’) If set replace, this function will call fill_column if the column already exists, rather than raising a ValueError Defaults to “fail”.

Returns:

parsons.Table

Also updates self

append_avro(target, schema=None, sample=9, **avro_args)

Append table to an existing Avro file.

Write the table into an existing avro file according to schema passed.

This method assumes that each column has values of the same type across all rows of the source table.

Parameters:
  • target – str the file path for creating the avro file.

  • schema – dict Defines the row field structure of the file. Check the fastavro documentation (https://fastavro.readthedocs.io/en/latest/) and the Avro schema specification (https://avro.apache.org/docs/1.8.2/spec.html#schemas) for details.

  • sample – int, optional defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed. Default is 9.

  • **avro_args – kwargs Extra options that are forwarded directly to fastavro. Check the fastavro documentation (https://fastavro.readthedocs.io/en/latest/) for reference.

append_csv(local_path, encoding=None, errors='strict', **csvargs)

Appends table to an existing CSV.

Additional keyword arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Parameters:
  • local_path – str The local path of an existing CSV file. If it ends in “.gz”, the file will be compressed.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str How encoding errors should be handled. Defaults to ‘strict’, which raises an error.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

str

The path of the file

chunk(rows: int)

Divides a Parsons table into smaller tables of a specified row count. If the table cannot be divided evenly, then the final table will only include the remainder.

Parameters:

rows – int The number of rows of each new Parsons table

Returns:

list[parsons.Table]

coalesce_columns(dest_column, source_columns, remove_source_columns=True)

Coalesces values from one or more source columns into a destination column, by selecting the first non-empty value. If the destination column doesn’t exist, it will be added.

Parameters:
  • dest_column – str Name of destination column

  • source_columns – list List of source column names

  • remove_source_columns – bool Whether to remove the source columns after the coalesce. If the destination column is also one of the source columns, it will not be removed.

Returns:

parsons.Table

Also updates self

column_data(column_name)[source]

Returns the data in the column as a list.

Parameters:

column_name – str The name of the column

Returns:

list

A list of data in the column.

concat(*tables, missing=None)

Concatenates one or more tables onto this one.

Note that the tables do not need to share exactly the same fields. Any missing fields will be padded with None, or whatever is provided via the missing keyword argument.

Parameters:
  • tables – parsons.Table or list A single table, or a list of tables

  • missing – The value to use when padding missing values

convert_column(*column, **kwargs)

Transform values under one or more fields via arbitrary functions, method invocations or dictionary translations. This leverages the petl convert() method. Example usage can be found in the petl documentation.

Parameters:
  • *column – str A single column or multiple columns passed as a list

  • **kwargs – str, method or variable The update function, method, or variable to process the update

Returns:

parsons.Table

Also updates self

convert_columns_to_str()

Convenience function to convert all non-string or mixed columns in a Parsons table to string (e.g. for comparison)

Returns:

parsons.Table

Also updates self

convert_table(*args)

Transform all cells in a table via arbitrary functions, method invocations or dictionary translations. This method is useful for cleaning fields and data hygiene functions such as regex. This method leverages the petl convert() method. Example usage can be found in the petl documentation (https://petl.readthedocs.io/en/v0.24/transform.html#petl.convert).

Parameters:

*args – str, method or variable The update function, method, or variable to process the update.

Returns:

parsons.Table

Also updates self

cut(*columns)

Return a table of selection of columns

Parameters:

*columns – str Columns in the parsons table

Returns:

parsons.Table

Selected columns

deduplicate(keys=None, presorted=False)

Deduplicates table based on an optional keys argument, which can contain any number of keys or None.

Method considers all keys specified in the keys argument when deduplicating, not each key individually. For example, if keys=['a', 'b'], the method will not remove a record unless it’s identical to another record in both columns a and b.

tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]])

+---+---+
| a | b |
+===+===+
| 1 | 3 |
+---+---+
| 1 | 2 |
+---+---+
| 1 | 2 |
+---+---+
| 2 | 3 |
+---+---+

tbl.deduplicate('a')
# removes all subsequent rows with {'a': 1}

+---+---+
| a | b |
+===+===+
| 1 | 3 |
+---+---+
| 2 | 3 |
+---+---+

tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]]) # reset
tbl.deduplicate(['a', 'b'])
# sorted on both ('a', 'b') so (1, 2) was placed before (1, 3)
# did not remove second instance of {'a': 1} or {'b': 3}

+---+---+
| a | b |
+===+===+
| 1 | 2 |
+---+---+
| 1 | 3 |
+---+---+
| 2 | 3 |
+---+---+

 tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]]) # reset
 tbl.deduplicate('a').deduplicate('b')
 # can chain method to sort/dedupe on 'a', then sort/dedupe on 'b'

+---+---+
| a | b |
+===+===+
| 1 | 3 |
+---+---+

tbl = Table([['a', 'b'], [1, 3], [1, 2], [1, 2], [2, 3]]) # reset
tbl.deduplicate('b').deduplicate('a')
# Order DOES matter when deduping on one column at a time

+---+---+
| a | b |
+===+===+
| 1 | 2 |
+---+---+
Parameters:
  • keys – str or list[str] or None keys to deduplicate (and optionally sort) on.

  • presorted – bool If False, the rows will be sorted first.

Returns:

parsons.Table

Also updates self

fill_column(column_name, fill_value)

Fill a column in a table

Parameters:
  • column_name – str The column to fill

  • fill_value – A fixed or calculated value

Returns:

parsons.Table

Also updates self

fillna_column(column_name, fill_value)

Fill None values in a column in a table

Parameters:
  • column_name – str The column to fill

  • fill_value – A fixed or calculated value

Returns:

parsons.Table

Also updates self

classmethod from_avro(local_path, limit=None, skips=0, **avro_args)

Create a parsons table from an Avro file.

Parameters:
  • local_path – str The path to the Avro file.

  • limit – int, optional The maximum number of rows to extract. Default is None (all rows).

  • skips – int, optional The number of rows to skip from the start. Default is 0.

  • **avro_args – kwargs Additional arguments passed to fastavro.reader.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_bigquery(sql: str, app_creds: str | None = None, project: str | None = None)

Create a parsons table from a BigQuery statement.

To pull an entire BigQuery table, use a query like SELECT * FROM {{ table }}.

Parameters:
  • sql – str A valid SQL statement

  • app_creds – str A credentials json string or a path to a json file. Not required if GOOGLE_APPLICATION_CREDENTIALS env variable set.

  • project – str The project which the client is acting on behalf of. If not passed then will use the default inferred environment.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_columns(cols, header=None)

Create a parsons table from a list of lists organized as columns

Parameters:
  • cols – list A list of lists organized as columns

  • header – list List of column names. If not specified, will use dummy column names

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_csv(local_path, **csvargs)

Create a parsons table object from a CSV file

Parameters:
  • local_path – obj A csv formatted local path, url or ftp. If this is a file path that ends in “.gz”, the file will be decompressed first.

  • **csvargs – kwargs csv_reader optional arguments

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_csv_string(str, **csvargs)

Create a parsons table object from a string representing a CSV.

Parameters:
  • str – str The string object to convert to a table

  • **csvargs – kwargs csv_reader optional arguments

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_dataframe(dataframe, include_index=False)

Create a parsons table from a Pandas dataframe.

Parameters:
  • dataframe – dataframe A valid Pandas dataframe object

  • include_index – boolean Include index column

classmethod from_json(local_path, header=None, line_delimited=False)

Create a parsons table from a json file

Parameters:
  • local_path – list A JSON formatted local path, url or ftp. If this is a file path that ends in “.gz”, the file will be decompressed first.

  • header – list List of columns to use for the destination table. If omitted, columns will be inferred from the initial data in the file.

  • line_delimited – bool Whether the file is line-delimited JSON (with a row on each line), or a proper JSON file.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_postgres(sql, username=None, password=None, host=None, db=None, port=None)

Create a parsons table from a Postgres query.

Parameters:
  • sql – str A valid SQL statement

  • username – str Required if env variable PGUSER not populated

  • password – str Required if env variable PGPASSWORD not populated

  • host – str Required if env variable PGHOST not populated

  • db – str Required if env variable PGDATABASE not populated

  • port – int Required if env variable PGPORT not populated.

classmethod from_redshift(sql, username=None, password=None, host=None, db=None, port=None)

Create a parsons table from a Redshift query.

To pull an entire Redshift table, use a query like SELECT * FROM tablename.

Parameters:
  • sql – str A valid SQL statement

  • username – str Required if env variable REDSHIFT_USERNAME not populated

  • password – str Required if env variable REDSHIFT_PASSWORD not populated

  • host – str Required if env variable REDSHIFT_HOST not populated

  • db – str Required if env variable REDSHIFT_DB not populated

  • port – int Required if env variable REDSHIFT_PORT not populated. Port 5439 is typical.

Returns:

Parsons Table

See Parsons Table for output options.

classmethod from_s3_csv(bucket, key, from_manifest=False, aws_access_key_id=None, aws_secret_access_key=None, **csvargs)

Create a parsons table from a key in an S3 bucket.

Parameters:
  • bucket – str The S3 bucket.

  • key – str The S3 key

  • from_manifest – bool If True, treats key as a manifest file and loads all urls into a parsons.Table. Defaults to False.

  • aws_access_key_id – str Required if not included as environmental variable.

  • aws_secret_access_key – str Required if not included as environmental variable.

  • **csvargs – kwargs csv_reader optional arguments

Returns:

parsons.Table object

get_column_max_width(column: str)

Return the maximum width of the column.

Parameters:

column – str The column name.

Returns:

int

get_column_types(column)

Return all of the Python types for values in a given column

Parameters:

column – str Name of the column to analyze

Returns:

list

A list of Python types

get_columns_type_stats()

Return descriptive stats for all columns

Returns:

list[dict]

A list of dicts, each containing a column ‘name’ and a ‘type’ list

static get_normalized_column_name(column_name: str) → str

Returns a column name with whitespace removed, non-alphanumeric characters removed, and everything lowercased.

Parameters:

column_name – str

Returns:

str

Normalized column name

head(n: int = 5)

Return the first n rows of the table

Parameters:

n – int The number of rows to return. Defaults to 5.

Returns:

Parsons Table

long_table(key, column, key_rename=None, retain_original=False, prepend=True, prepend_value=None)

Create a new long parsons table from a column, including the foreign key.

# Begin with nested dicts in a column
json = [
    {
        'id': '5421',
        'name': 'Jane Green',
        'emails': [
            {'home': 'jane@gmail.com'},
            {'work': 'jane@mywork.com'}
        ]
    }
]
tbl = Table(json)
print (tbl)
>>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]}
>>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]}

# Create skinny table of just the nested dicts
email_skinny = tbl.long_table(['id'], 'emails')

print (email_skinny)
>>> {'id': '5421', 'emails_home': 'jane@gmail.com', 'emails_work': None}
>>> {'id': '5421', 'emails_home': None, 'emails_work': 'jane@mywork.com'}
Parameters:
  • key – list The columns to retain in the long table (e.g. foreign keys)

  • column – str The column name to make long

  • key_rename – dict The new name for the foreign key to better identify it. For example, you might want to rename id to person_id. Ex. {‘KEY_NAME’: ‘NEW_KEY_NAME’}

  • retain_original – boolean Retain the original column from the source table.

  • prepend – Prepend the column name of the unpacked values. Useful for avoiding duplicate column names

  • prepend_value – Value to prepend new columns if prepend=True. If None, will set to column name.

Returns:

parsons.Table

The new long table

map_and_coalesce_columns(column_map)

Coalesces columns based on multiple possible values. The columns in the map do not need to be in your table, so you can create a map with all possibilities. The coalesce will occur in the order that the columns are listed, unless the destination column name already exists in the table, in which case that value will take precedence. This method is helpful when your input table might have multiple and unknown column names.

tbl = [
    {'first': None},
    {'fn': 'Jane'},
    {'lastname': 'Doe'},
    {'dob': '1980-01-01'}
]

column_map = {
    'first_name': ['fn', 'first', 'firstname'],
    'last_name': ['ln', 'last', 'lastname'],
    'date_of_birth': ['dob', 'birthday']
}

tbl.map_and_coalesce_columns(column_map)

print (tbl)
>>> {'first_name': 'Jane', 'last_name': 'Doe', 'date_of_birth': '1980-01-01'}
Parameters:

column_map – dict A dictionary of columns and possible values that map to it

Returns:

parsons.Table

Also updates self

map_columns(column_map, exact_match=True)

Standardizes column names based on multiple possible values. This method is helpful when your input table might have multiple and unknown column names.

tbl = [
    {'fn': 'Jane'},
    {'lastname': 'Doe'},
    {'dob': '1980-01-01'}
]

column_map = {
    'first_name': ['fn', 'first', 'firstname'],
    'last_name': ['ln', 'last', 'lastname'],
    'date_of_birth': ['dob', 'birthday']
}

tbl.map_columns(column_map)
print (tbl)
>>> {'first_name': 'Jane', 'last_name': 'Doe', 'date_of_birth': '1980-01-01'}
Parameters:
  • column_map – dict A dictionary of columns and possible values that map to it

  • exact_match – boolean If True will only map if an exact match. If False will ignore case, spaces and underscores.

Returns:

parsons.Table

Also updates self

match_columns(desired_columns, fuzzy_match=True, if_extra_columns: Literal['remove', 'ignore', 'fail'] = 'remove', if_missing_columns: Literal['add', 'ignore', 'fail'] = 'add')

Changes the column names and ordering in this Table to match a list of desired column names.

Parameters:
  • desired_columns – list Ordered list of desired column names

  • fuzzy_match – bool Whether to normalize column names when matching against the desired column names, removing whitespace and non-alphanumeric characters, and lowercasing everything. Eg. With this flag set, “FIRST NAME” would match “first_name”. If the Table has two columns that normalize to the same string (eg. “FIRST NAME” and “first_name”), the latter will be considered an extra column.

  • if_extra_columns – string If the Table has columns that don’t match any desired columns, either ‘remove’ them, ‘ignore’ them, or ‘fail’ (raising an error).

  • if_missing_columns – string If the Table is missing some of the desired columns, either ‘add’ them (with a value of None), ‘ignore’ them, or ‘fail’ (raising an error).

Returns:

parsons.Table

Also updates self
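The fuzzy matching with the default 'remove'/'add' behavior can be sketched on a plain list of dicts (a hypothetical stand-in for the petl-backed implementation; `normalize` here mirrors the normalization described above):

```python
import re

def normalize(name):
    # Lowercase and strip whitespace / non-alphanumerics, as fuzzy_match does
    return re.sub(r"[^0-9a-z]", "", name.lower())

def match_columns(rows, desired_columns, fuzzy_match=True):
    """Rename and reorder columns of a list-of-dicts table to match
    desired_columns. Extras are removed, missing columns added as None
    (the default if_extra_columns / if_missing_columns behavior)."""
    lookup = {(normalize(c) if fuzzy_match else c): c for c in desired_columns}
    out = []
    for row in rows:
        new_row = {c: None for c in desired_columns}          # 'add' missing
        for col, val in row.items():
            key = normalize(col) if fuzzy_match else col
            if key in lookup:                                 # 'remove' extras
                new_row[lookup[key]] = val
        out.append(new_row)
    return out

rows = [{"FIRST NAME": "Jane", "extra": 1}]
print(match_columns(rows, ["first_name", "last_name"]))
# [{'first_name': 'Jane', 'last_name': None}]
```

Note how "FIRST NAME" normalizes to the same string as "first_name" and is therefore matched, while "extra" is dropped.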

move_column(column, index)

Move a column

Parameters:
  • column – str The column name to move

  • index – The new index for the column

Returns:

parsons.Table

Also updates self

reduce_rows(columns, reduce_func, headers, presorted=False, **kwargs)

Group rows by a column or columns, then reduce the groups to a single row.

For example, the output from the query to get a table’s definition is returned as one component per row. The reduce_rows method can be used to reduce all those to a single row containing the entire query.

Based on the rowreduce petl function.

ddl = rs.query(sql_to_get_table_ddl)
ddl.table

+--------------+--------------+----------------------------------------------------+
| schemaname   | tablename    | ddl                                                |
+==============+==============+====================================================+
| 'db_scratch' | 'state_fips' | '--DROP TABLE db_scratch.state_fips;'              |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | 'CREATE TABLE IF NOT EXISTS db_scratch.state_fips' |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | '('                                                |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | '\tstate VARCHAR(1024) ENCODE RAW'                 |
+--------------+--------------+----------------------------------------------------+
| 'db_scratch' | 'state_fips' | '\t,stusab VARCHAR(1024) ENCODE RAW'               |
+--------------+--------------+----------------------------------------------------+

reducer_fn = lambda cols, rows: [
    f"{cols[0]}.{cols[1]}",
    "\n".join([row[2] for row in rows])
]
ddl.reduce_rows(
    ['schemaname', 'tablename'],
    reducer_fn,
    ['tablename', 'ddl'],
    presorted=True
)
ddl.table

+-------------------------+-----------------------------------------------------+
| tablename               | ddl                                                 |
+=========================+=====================================================+
| 'db_scratch.state_fips' | '--DROP TABLE db_scratch.state_fips;\nCREATE TABLE  |
|                         | IF NOT EXISTS db_scratch.state_fips\n(\n\tstate     |
|                         | VARCHAR(1024) ENCODE RAW\n\t,stusab VARCHAR(1024)   |
|                         | ENCODE RAW\n\t,state_name VARCHAR(1024) ENCODE RAW  |
|                         | \n\t,statens VARCHAR(1024) ENCODE RAW\n)\nDISTSTYLE |
|                         | EVEN\n;'                                            |
+-------------------------+-----------------------------------------------------+

Parameters:
  • columns (list) – The column(s) by which to group the rows.

  • reduce_func (function) – The function by which to reduce the rows. Should take two arguments, the columns list and the rows list, and return a list. reducer(columns: list, rows: list) -> list

  • headers (list) – The list of headers for modified table. The length of headers should match the length of the list returned by the reduce function.

  • presorted (bool) – If False, the rows will be sorted before grouping.

  • **kwargs – Extra options to pass to petl.rowreduce

Returns:

parsons.Table

Also updates self

remove_column(*columns)

Remove a column from your table

Parameters:

*columns – str Column names

Returns:

parsons.Table

Also updates self

remove_null_rows(columns, null_value=None)

Remove rows if the values in a column are None. If multiple columns are passed as list, it will remove all rows with null values in any of the passed columns.

Parameters:
  • columns – str or list The column or columns to analyze

  • null_value – int or float or str The null value
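The behavior can be sketched on a plain list of dicts (an illustrative stand-in for the petl-backed implementation): a row is dropped if its value in any of the given columns equals the null value.

```python
def remove_null_rows(rows, columns, null_value=None):
    """Drop any row whose value in ANY of the given columns equals null_value."""
    if isinstance(columns, str):
        columns = [columns]
    return [r for r in rows if all(r.get(c) != null_value for c in columns)]

rows = [
    {"id": 1, "email": "a@b.com"},
    {"id": 2, "email": None},
]
print(remove_null_rows(rows, "email"))
# [{'id': 1, 'email': 'a@b.com'}]
```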

rename_column(column_name, new_column_name)

Rename a column

Parameters:
  • column_name – str The current column name

  • new_column_name – str The new column name

Returns:

parsons.Table

Also updates self

rename_columns(column_map)

Rename multiple columns

Parameters:

column_map

dict A dictionary of columns and new names. The key is the old name and the value is the new name.

Example dictionary: {‘old_name’: ‘new_name’, ‘old_name2’: ‘new_name2’}

Returns:

parsons.Table

Also updates self

select_rows(*filters)

Select specific rows from a Parsons table based on the passed filters.

Example filters:

tbl = Table(
    [
        ['foo', 'bar', 'baz'],
        ['c', 4, 9.3],
        ['a', 2, 88.2],
        ['b', 1, 23.3]
    ]
)

# You can structure the filter in multiple ways

# Lambda Function
tbl2 = tbl.select_rows(lambda row: row.foo == 'a' and row.baz > 88.1)
tbl2
>>> {'foo': 'a', 'bar': 2, 'baz': 88.2}

# Expression String
tbl3 = tbl.select_rows("{foo} == 'a' and {baz} > 88.1")
tbl3
>>> {'foo': 'a', 'bar': 2, 'baz': 88.2}
Parameters:

*filters – function or str

Returns:

A new parsons table containing the selected rows

set_header(new_header)

Replace the header row of the table.

Parameters:

new_header – list List of new header column names

Returns:

parsons.Table

Also updates self

sort(columns=None, reverse=False)

Sort the rows of a table.

Parameters:
  • columns – list or str Sort by a single column or a list of columns. If None, the table will be sorted by all columns from left to right.

  • reverse – boolean Sort rows in reverse order.

Returns:

Parsons Table and also updates self

stack(*tables, missing=None)

Stack Parsons tables on top of one another.

Similar to table.concat(), except no attempt is made to align fields from different tables.

Parameters:
  • tables – parsons.Table or list A single table, or a list of tables

  • missing – The value to use when padding missing values

tail(n: int = 5)

Return the last n rows of the table

Parameters:

n – int, optional The number of rows to return. Defaults to 5.

Returns:

parsons.Table

to_avro(target, schema=None, sample=9, codec: Literal['null', 'deflate', 'bzip2', 'snappy', 'zstandard', 'lz4', 'xz'] = 'deflate', compression_level=None, **avro_args)

Outputs table to an Avro file.

In order to use this method, you must have the fastavro library installed. If using limited dependencies, you can install it with pip install parsons[avro].

Write the table into a new avro file according to the schema passed.

This method assumes that each column has values of the same type for all rows of the source table.

Avro is a data serialization framework that is generally faster and safer than text formats like JSON, XML or CSV.

Parameters:
  • target – str the file path for creating the avro file. Note that if a file already exists at the given location, it will be overwritten.

  • schema – dict defines the rows field structure of the file. Check fastavro [documentation](https://fastavro.readthedocs.io/en/latest/) and Avro schema [reference](https://avro.apache.org/docs/1.8.2/spec.html#schemas) for details.

  • sample – int, optional defines how many rows are inspected for discovering the field types and building a schema for the avro file when the schema argument is not passed. Default is 9.

  • codec – str, optional The codec argument sets the compression codec used to shrink data in the file. It can be ‘null’, ‘deflate’ (default), ‘bzip2’, ‘snappy’, ‘zstandard’, ‘lz4’, or ‘xz’ (if installed)

  • compression_level – int, optional sets the level of compression to use with the specified codec (if the codec supports it)

  • **avro_args – kwargs Additional options passed in the argument **avro_args are forwarded directly to fastavro. [Check the fastavro documentation](https://fastavro.readthedocs.io/en/latest/) for reference.

Example usage for writing files:

>>> # set up a Avro file to demonstrate with
>>> table2 = [['name', 'friends', 'age'],
...           ['Bob', 42, 33],
...           ['Jim', 13, 69],
...           ['Joe', 86, 17],
...           ['Ted', 23, 51]]
...
>>> schema2 = {
...     'doc': 'Some people records.',
...     'name': 'People',
...     'namespace': 'test',
...     'type': 'record',
...     'fields': [
...         {'name': 'name', 'type': 'string'},
...         {'name': 'friends', 'type': 'int'},
...         {'name': 'age', 'type': 'int'},
...     ]
... }
...
>>> # now demonstrate writing with to_avro()
>>> from parsons import Table

>>> Table(table2).to_avro('example.file2.avro', schema=schema2)
...
>>> # this was what was saved above
>>> tbl2 = Table.from_avro('example.file2.avro')
>>> tbl2
+-------+---------+-----+
| name  | friends | age |
+=======+=========+=====+
| 'Bob' |      42 |  33 |
+-------+---------+-----+
| 'Jim' |      13 |  69 |
+-------+---------+-----+
| 'Joe' |      86 |  17 |
+-------+---------+-----+
| 'Ted' |      23 |  51 |
+-------+---------+-----+
to_bigquery(table_name: str, app_creds: str | None = None, project: str | None = None, **kwargs)

Write a table to BigQuery

Parameters:
  • table_name – str Table name to write to in BigQuery; this should be in schema.table format

  • app_creds – str A credentials json string or a path to a json file. Not required if GOOGLE_APPLICATION_CREDENTIALS env variable set.

  • project – str The project which the client is acting on behalf of. If not passed then will use the default inferred environment.

  • **kwargs – kwargs Additional keyword arguments passed into the .copy() function (if_exists, max_errors, etc.)

to_civis(table, api_key=None, db=None, max_errors=None, existing_table_rows: Literal['fail', 'truncate', 'append', 'drop'] = 'fail', diststyle: Literal['even', 'all', 'key'] | None = None, distkey=None, sortkey1=None, sortkey2=None, wait=True, **civisargs)

Write the table to a Civis Redshift cluster. Additional keyword arguments can be passed to civis.io.dataframe_to_civis()

Args
table: str

The schema and table you want to upload to. E.g., ‘scratch.table’. Schemas or tablenames with periods must be double quoted, e.g. ‘scratch.”my.table”’.

api_key: str

Your Civis API key. If not given, the CIVIS_API_KEY environment variable will be used.

db: str or int

The Civis Database. Can be database name or ID

max_errors: int

The maximum number of rows with errors to remove from the import before failing.

diststyle: str

The distribution style for the table. One of ‘even’, ‘all’ or ‘key’.

existing_table_rows: str

The behaviour if a table with the requested name already exists. One of ‘fail’, ‘truncate’, ‘append’ or ‘drop’. Defaults to ‘fail’.

distkey: str

The column to use as the distkey for the table.

sortkey1: str

The column to use as the sortkey for the table.

sortkey2: str

The second column in a compound sortkey for the table.

wait: boolean

Wait for write job to complete before exiting method.

to_csv(local_path=None, temp_file_compression=None, encoding=None, errors='strict', write_header=True, csv_name=None, **csvargs)

Outputs table to a CSV. Additional key word arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.

Warning

If a file already exists at the given location, it will be overwritten.

Parameters:
  • local_path – str The path to write the csv locally. If it ends in “.gz” or “.zip”, the file will be compressed. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • temp_file_compression – str If a temp file is requested (ie. no local_path is specified), the compression type for that file. Currently “None”, “gzip” or “zip” are supported. If a local_path is specified, this argument is ignored.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • csv_name – str If zip compression (either specified or inferred), the name of csv file within the archive.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

str

The path of the new file
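The suffix-driven compression rule can be illustrated with a small stand-alone helper (`write_csv` is a hypothetical sketch using only the standard library, not the Parsons implementation): a path ending in ".gz" is gzip-compressed, anything else is written as plain text.

```python
import csv
import gzip
import os
import tempfile

def write_csv(rows, local_path, write_header=True, **csvargs):
    """Sketch: write a list of dicts to a CSV, gzip-compressing
    when the path ends in '.gz'."""
    opener = gzip.open if local_path.endswith(".gz") else open
    with opener(local_path, "wt", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]), **csvargs)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return local_path

path = os.path.join(tempfile.mkdtemp(), "people.csv.gz")
write_csv([{"name": "Jane", "age": 40}], path)
with gzip.open(path, "rt") as f:
    print(f.read())
```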

to_dataframe(index=None, exclude=None, columns=None, coerce_float=False)

Outputs table as a Pandas Dataframe

Parameters:
  • index – str, list Field of array to use as the index, alternately a specific set of input labels to use

  • exclude – list Columns or fields to exclude

  • columns – list Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns)

Returns:

dataframe

Pandas DataFrame object

to_dicts()

Output table as a list of dicts.

Returns:

list

to_gcs_csv(bucket_name, blob_name, gcs_client=None, app_creds=None, project=None, compression=None, encoding=None, errors='strict', write_header=True, public_url=False, public_url_expires=60, **csvargs)

Writes the table to a Google Cloud Storage blob as a CSV.

Parameters:
  • bucket_name – str The bucket to upload to

  • blob_name – str The blob to name the file. If it ends in ‘.gz’ or ‘.zip’, the file will be compressed.

  • app_creds – str A credentials json string or a path to a json file. Not required if GOOGLE_APPLICATION_CREDENTIALS env variable set.

  • project – str The project which the client is acting on behalf of. If not passed then will use the default inferred environment.

  • compression – str The compression type for the csv. Currently “None”, “zip” and “gzip” are supported. If specified, will override the key suffix.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • public_url – boolean Create a public link to the file

  • public_url_expires – int The time, in minutes, until the url expires if public_url is set to True. Defaults to 60.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

Public url if specified, otherwise None.

to_html(local_path=None, encoding=None, errors='strict', index_header=False, caption=None, tr_style=None, td_styles=None, truncate=None)

Outputs table to html.

Warning

If a file already exists at the given location, it will be overwritten.

Parameters:
  • local_path – str The path to write the html locally. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • encoding

    str The encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • index_header – boolean Prepend index to column names; Defaults to False.

  • caption – str A caption to include with the html table.

  • tr_style – str or callable Style to be applied to the table row.

  • td_styles – str, dict or callable Styles to be applied to the table cells.

  • truncate – int Length of cell data.

Returns:

str

The path of the new file

to_json(local_path=None, temp_file_compression=None, line_delimited=False)

Outputs table to a JSON file

Warning

If a file already exists at the given location, it will be overwritten.

Parameters:
  • local_path – str The path to write the JSON locally. If it ends in “.gz”, it will be compressed first. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • temp_file_compression – str If a temp file is requested (ie. no local_path is specified), the compression type for that file. Currently “None” and “gzip” are supported. If a local_path is specified, this argument is ignored.

  • line_delimited – bool Whether the file will be line-delimited JSON (with a row on each line), or a proper JSON file.

Returns:

str

The path of the new file
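The difference between line-delimited output and a "proper" JSON file can be shown with the standard library (an illustrative sketch of the two formats, not the Parsons code itself):

```python
import json

rows = [{"id": 1, "name": "Jane"}, {"id": 2, "name": "Joe"}]

# A "proper" JSON file holds one array of objects...
proper = json.dumps(rows)

# ...while a line-delimited file puts one JSON object per line,
# which lets consumers stream rows without parsing the whole file.
line_delimited = "\n".join(json.dumps(r) for r in rows)

print(proper)
print(line_delimited)
```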

to_postgres(table_name, username=None, password=None, host=None, db=None, port=None, **copy_args)

Write a table to a Postgres database.

Parameters:
  • table_name – str The table name and schema (my_schema.my_table) to point the file.

  • username – str Required if env variable PGUSER not populated

  • password – str Required if env variable PGPASSWORD not populated

  • host – str Required if env variable PGHOST not populated

  • db – str Required if env variable PGDATABASE not populated

  • port – int Required if env variable PGPORT not populated.

  • **copy_args – kwargs See copy() for options.

to_redshift(table_name, username=None, password=None, host=None, db=None, port=None, **copy_args)

Write a table to a Redshift database. Note, this requires you to pass AWS S3 credentials or store them as environmental variables.

Parameters:
  • table_name – str The table name and schema (my_schema.my_table) to point the file.

  • username – str Required if env variable REDSHIFT_USERNAME not populated

  • password – str Required if env variable REDSHIFT_PASSWORD not populated

  • host – str Required if env variable REDSHIFT_HOST not populated

  • db – str Required if env variable REDSHIFT_DB not populated

  • port – int Required if env variable REDSHIFT_PORT not populated. Port 5439 is typical.

  • **copy_args – kwargs See copy() for options.

to_s3_csv(bucket, key, aws_access_key_id=None, aws_secret_access_key=None, compression=None, encoding=None, errors='strict', write_header=True, acl='bucket-owner-full-control', public_url=False, public_url_expires=3600, use_env_token=True, **csvargs)

Writes the table to an s3 object as a CSV

Parameters:
  • bucket – str The s3 bucket to upload to

  • key – str The s3 key to name the file. If it ends in ‘.gz’ or ‘.zip’, the file will be compressed.

  • aws_access_key_id – str Required if not included as environmental variable

  • aws_secret_access_key – str Required if not included as environmental variable

  • compression – str The compression type for the s3 object. Currently “None”, “zip” and “gzip” are supported. If specified, will override the key suffix.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • public_url – boolean Create a public link to the file

  • public_url_expires – int The time, in seconds, until the url expires if public_url is set to True. Defaults to 3600.

  • acl – str The S3 permissions on the file

  • use_env_token – boolean Controls use of the AWS_SESSION_TOKEN environment variable for S3. Defaults to True. Set to False in order to ignore the AWS_SESSION_TOKEN env variable even if the aws_session_token argument was not passed in.

  • **csvargs – kwargs csv_writer optional arguments

Returns:

Public url if specified, otherwise None.

to_sftp_csv(remote_path, host, username, password, port=22, encoding=None, compression=None, errors='strict', write_header=True, rsa_private_key_file=None, **csvargs)

Writes the table to a CSV file on a remote SFTP server

Parameters:
  • remote_path – str The remote path of the file. If it ends in ‘.gz’, the file will be compressed.

  • host – str The remote host

  • username – str The username to access the SFTP server

  • password – str The password to access the SFTP server

  • port – int The port number of the SFTP server

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • rsa_private_key_file – str Absolute path to a private RSA key used to authenticate the sftp connection

  • **csvargs – kwargs csv_writer optional arguments

to_zip_csv(archive_path=None, csv_name=None, encoding=None, errors='strict', write_header=True, if_exists: Literal['replace', 'append'] = 'replace', **csvargs)

Outputs table to a CSV in a zip archive. Additional key word arguments are passed to csv.writer(). So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument. Use this method if you would like to write multiple csv files to the same archive.

Warning

If a file already exists in the archive, it will be overwritten.

Parameters:
  • archive_path – str The path to the zip archive. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.

  • csv_name – str The name of the csv file to be stored in the archive. If None will use the archive name.

  • encoding

    str The CSV encoding type for csv.writer()

  • errors – str Raise an Error if encountered

  • write_header – boolean Include header in output

  • if_exists – str If archive already exists, one of ‘replace’ or ‘append’

  • **csvargs – kwargs csv_writer optional arguments

Returns:

str

The path of the archive
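The 'replace' vs 'append' behavior can be sketched with the standard library's zipfile module (`to_zip_csv` here is a hypothetical stand-alone helper, not the Parsons implementation):

```python
import csv
import io
import os
import tempfile
import zipfile

def to_zip_csv(rows, archive_path, csv_name, if_exists="replace"):
    """Sketch: write a CSV into a zip archive; 'append' adds another
    csv file to an existing archive instead of replacing it."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    mode = "a" if if_exists == "append" and os.path.exists(archive_path) else "w"
    with zipfile.ZipFile(archive_path, mode) as zf:
        zf.writestr(csv_name, buf.getvalue())
    return archive_path

path = os.path.join(tempfile.mkdtemp(), "tables.zip")
to_zip_csv([{"a": 1}], path, "first.csv")
to_zip_csv([{"b": 2}], path, "second.csv", if_exists="append")
with zipfile.ZipFile(path) as zf:
    print(zf.namelist())  # ['first.csv', 'second.csv']
```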

unpack_dict(column, keys=None, include_original=False, sample_size=5000, missing=None, prepend=True, prepend_value=None)

Unpack dictionary values from one column into separate columns

Parameters:
  • column – str The column name to unpack

  • keys – list The dict keys in the column to unpack. If None will unpack all.

  • include_original – boolean Retain original column after unpacking

  • sample_size – int Number of rows to sample before determining columns

  • missing – str If a value is missing, the value to fill it with

  • prepend – Prepend the column name of the unpacked values. Useful for avoiding duplicate column names

  • prepend_value – Value to prepend new columns if prepend=True. If None, will set to column name.
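The unpacking and prepend behavior can be sketched on a plain list of dicts (an illustrative stand-in for the petl-backed implementation):

```python
def unpack_dict(rows, column, prepend=True, prepend_value=None, missing=None):
    """Sketch: spread each dict in `column` into its own columns,
    optionally prefixing the new column names to avoid collisions."""
    prefix = (prepend_value or column) + "_" if prepend else ""
    # Collect every key seen so all rows end up with the same columns
    keys = []
    for row in rows:
        for k in (row.get(column) or {}):
            if k not in keys:
                keys.append(k)
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        packed = row.get(column) or {}
        for k in keys:
            new_row[prefix + k] = packed.get(k, missing)
        out.append(new_row)
    return out

rows = [{"id": "5421", "contact": {"home": "jane@gmail.com"}}]
print(unpack_dict(rows, "contact"))
# [{'id': '5421', 'contact_home': 'jane@gmail.com'}]
```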

unpack_list(column, include_original=False, missing=None, replace=False, max_columns=None)

Unpack list values from one column into separate columns. Numbers the columns.

# Begin with a list in column
json = [
    {
        'id': '5421',
        'name': 'Jane Green',
        'phones': ['512-699-3334', '512-222-5478']
    }
]

tbl = Table(json)
print (tbl)
>>> {'id': '5421', 'name': 'Jane Green', 'phones': ['512-699-3334', '512-222-5478']}

tbl.unpack_list('phones', replace=True)
print (tbl)
>>> {'id': '5421', 'name': 'Jane Green', 'phones_0': '512-699-3334', 'phones_1': '512-222-5478'}
Parameters:
  • column – str The column name to unpack

  • include_original – boolean Retain original column after unpacking

  • missing – str If a value is missing, the value to fill it with

  • replace – boolean Return new table or replace existing

  • max_columns – int The maximum number of columns to unpack

unpack_nested_columns_as_rows(column, key='id', expand_original: bool | int = False)

Unpack list or dict values from one column into separate rows. Not recommended for JSON columns (i.e. lists of dicts), but can handle columns with any mix of types. Makes use of PETL’s melt() method.

Parameters:
  • column – str The column name to unpack

  • key – str The column to use as a key when unpacking. Defaults to id

  • expand_original – boolean or int If True, add resulting unpacked rows (with all other columns) to the original. If an int, add to the original unless the max added per key is above the given number. If False (default), return unpacked rows (with key column only) as a standalone table. Packed list and dict rows are removed from the original either way.

Returns:

parsons.Table

If expand_original, original table with packed rows replaced by unpacked rows. Otherwise, standalone table with key column and unpacked values only

use_petl(petl_method, *args, **kwargs)

Call a petl function on the current table.

This convenience method exposes the petl functions to the current Table. This is useful in cases where one might need a petl function that has not yet been implemented for parsons.Table.

# https://petl.readthedocs.io/en/v1.6.0/transform.html#petl.transform.basics.skipcomments
tbl = Table(
    [
        ['col1', 'col2'],
        ['# this is a comment row'],
        ['a', 1],
        ['#this is another comment', 'this is also ignored'],
        ['b', 2]
    ]
)
tbl.use_petl('skipcomments', '#', update_table=True)

>>> {'col1': 'a', 'col2': 1}
>>> {'col1': 'b', 'col2': 2}

+------+------+
| col1 | col2 |
+======+======+
| 'a'  |    1 |
+------+------+
| 'b'  |    2 |
+------+------+
Parameters:
  • petl_method – str The petl function to call

  • update_table – bool If True, updates the parsons.Table. Defaults to False.

  • to_petl – bool If True, returns a petl table, otherwise a parsons.Table. Defaults to False.

  • *args – Any The arguments to pass to the petl function.

  • **kwargs – Any The keyword arguments to pass to the petl function.

Returns:

parsons.Table or petl table

materialize()[source]

“Materializes” a Table, meaning all data is loaded into memory and all pending transformations are applied.

Use this if petl’s lazy-loading behavior is causing you problems, eg. if you want to read data from a file immediately.

materialize_to_file(file_path=None)[source]

“Materializes” a Table, meaning all pending transformations are applied.

Unlike the original materialize function, this method does not bring the data into memory, but instead loads the data into a local temp file.

This method updates the current table in place.

Parameters:

file_path – str The path to the file to materialize the table to; if not specified, a temp file will be created.

Returns:

str

Path to the temp file that now contains the table

is_valid_table()[source]

Performs some simple checks on a Table. Specifically, verifies that we have a valid petl table within the Parsons Table.

Returns:

bool

empty_column(column)[source]

Checks if a given column is empty. Returns True if empty and False if not empty.

Parameters:

column – str The column name

Returns:

bool