Parsons Table
Overview
Most methods and functions in Parsons return a Table
, which is a 2D list-like object. (It’s similar to a pandas DataFrame, if you are familiar with that). You can call the following methods on the returned object to output it into a variety of formats or storage types. (For a full list of Table
methods, scroll down to the class documentation.)
From Parsons Table
Method |
Destination Type |
Description |
---|---|---|
|
CSV File |
Write a table to a local csv file |
|
AWS s3 Bucket |
Write a table to a csv stored in S3 |
|
SFTP Server |
Write a table to a csv stored on an SFTP server |
|
A Redshift Database |
Write a table to a Redshift database |
|
A Postgres Database |
Write a table to a Postgres database |
|
Civis Redshift Database |
Write a table to Civis platform database |
|
Petl table object |
Convert a table a Petl table object |
|
JSON file |
Write a table to a local JSON file |
|
HTML formatted table |
Write a table to a local html file |
|
Pandas Dataframe 1 |
Return a Pandas dataframe |
- 1
Requires optional installation of Pandas package by running
pip install pandas
.
To Parsons Table
Create Parsons Table object using the following methods.
Method |
Source Type |
Description |
---|---|---|
|
File like object, local path, url, ftp. |
Loads a csv object into a Table |
|
File like object, local path, url, ftp. |
Loads a json object into a Table |
|
List object |
Loads lists organized as columns in Table |
|
Redshift table |
Loads a Redshift query into a Table |
|
Postgres table |
Loads a Postgres query into a Table |
|
Pandas Dataframe 2 |
Load a Parsons table from a Pandas Dataframe |
|
S3 CSV |
Load a Parsons table from a file on S3 |
- 2
Requires optional installation of Pandas package by running
pip install pandas
.
You can also use the Table constructor to create a Table from a python list or petl table:
# From a list of dicts
tbl = Table([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
# From a list of lists, the first list holding the field names
tbl = Table([['a', 'b'], [1, 2], [3, 4]])
# From a petl table
tbl = Table(petl_tbl)
Parsons Table Attributes
Attribute |
Description |
---|---|
|
The number of rows in the table |
|
A list of column names in the table |
|
The actual data (rows) of the table, as a list of tuples (without field names) |
|
The first value in the table. Use for database queries where a single value is returned. |
Parsons Table Indexing
To access rows and columns of data within a Parsons table, you can index on them. To access a column
pass in the column name as a string (e.g. tbl['a']
) and to access a row, pass in the row index as
an integer (e.g. tbl[1]
).
tbl = Table([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}])
# Return a column as a list
tbl['a']
>> [1, 3]
# Return a row as a dict
tbl[1]
>> {'a': 3, 'b': 4}
A note on indexing and iterating over a table’s data: If you need to iterate over the data, make sure to use the python iterator syntax, so any data transformations can be applied efficiently. An example:
# Some data transformations
table.add_column('newcol', 'some value')
# Efficient way to grab all the data (applying the data transformations only once)
rows_list = [row for row in table]
Warning
If you must index directly into a table’s data, you can do so, but note that data transformations will be applied each time you do so. So this code will be very inefficient on a large table…
# Inefficient way to grab all the data
rows_list = []
for i in range(0, table.num_rows):
# Data transformations will be applied each time through this loop!
rows_list.append(table[i])
PETL
The Parsons Table
relies heavily on the petl Python package. You can always access the underlying petl table with my_parsons_table.table
, which will allow you to perform any petl-supported ETL operations.
Lazy Loading
The Parsons Table
makes use of “lazy” loading and “lazy” transformations. What this means is that it tries not to load and process your data until absolutely necessary.
An example:
# Specify where to load the data
tbl = Table.from_csv('name_data.csv')
# Specify data transformations
tbl.add_column('full_name', lambda row: row['first_name'] + ' ' + row['last_name'])
tbl.remove_column(['first_name', 'last_name'])
# Save the table elsewhere
# IMPORTANT - The CSV won't actually be loaded and transformed until this step,
# since this is the first time it's actually needed.
tbl.to_redshift('main.name_table')
This “lazy” loading can be very convenient and performant. However, it can make issues hard to debug. Eg. if your data transformations are time-consuming, you won’t actually notice that performance hit until you try to use the data, potentially much later in your code.
So just be aware of this behavior.
Examples
Basic Pipelines
# S3 to Civis
s3 = S3()
csv = s3.get_file('tmc-bucket', 'my_ids.csv')
Table.from_csv(csv).to_civis('TMC','ids.my_ids')
#VAN Activist Codes to a Dataframe
van = VAN(db='MyVoters')
van.activist_codes().to_dataframe()
#VAN Events to an s3 bucket
van = VAN(db='MyVoters')
van.events().to_s3_csv('my-van-bucket','myevents.csv')
To & From API
- class parsons.etl.tofrom.ToFrom[source]
- to_dataframe(index=None, exclude=None, columns=None, coerce_float=False)[source]
Outputs table as a Pandas Dataframe
- Args:
- index: str, list
Field of array to use as the index, alternately a specific set of input labels to use
- exclude: list
Columns or fields to exclude
- columns: list
Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns)
- Returns:
- dataframe
Pandas DataFrame object
- to_html(local_path=None, encoding=None, errors='strict', index_header=False, caption=None, tr_style=None, td_styles=None, truncate=None)[source]
Outputs table to html.
Warning
If a file already exists at the given location, it will be overwritten.
- Args:
- local_path: str
The path to write the html locally. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.
- encoding: str
The encoding type for csv.writer()
- errors: str
Raise an Error if encountered
- index_header: boolean
Prepend index to column names; Defaults to False.
- caption: str
A caption to include with the html table.
- tr_style: str or callable
Style to be applied to the table row.
- td_styles: str, dict or callable
Styles to be applied to the table cells.
- truncate: int
Length of cell data.
- Returns:
- str
The path of the new file
- to_csv(local_path=None, temp_file_compression=None, encoding=None, errors='strict', write_header=True, csv_name=None, **csvargs)[source]
Outputs table to a CSV. Additional key word arguments are passed to
csv.writer()
. So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.Warning
If a file already exists at the given location, it will be overwritten.
- Args:
- local_path: str
The path to write the csv locally. If it ends in “.gz” or “.zip”, the file will be compressed. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.
- temp_file_compression: str
If a temp file is requested (ie. no
local_path
is specified), the compression type for that file. Currently “None”, “gzip” or “zip” are supported. If alocal_path
is specified, this argument is ignored.- encoding: str
The CSV encoding type for csv.writer()
- errors: str
Raise an Error if encountered
- write_header: boolean
Include header in output
- csv_name: str
If
zip
compression (either specified or inferred), the name of csv file within the archive.- **csvargs: kwargs
csv_writer
optional arguments
- Returns:
- str
The path of the new file
- append_csv(local_path, encoding=None, errors='strict', **csvargs)[source]
Appends table to an existing CSV.
Additional additional key word arguments are passed to
csv.writer()
. So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument.- Args:
- local_path: str
The local path of an existing CSV file. If it ends in “.gz”, the file will be compressed.
- encoding: str
The CSV encoding type for csv.writer()
- errors: str
Raise an Error if encountered
- **csvargs: kwargs
csv_writer
optional arguments
- Returns:
- str
The path of the file
- to_zip_csv(archive_path=None, csv_name=None, encoding=None, errors='strict', write_header=True, if_exists='replace', **csvargs)[source]
Outputs table to a CSV in a zip archive. Additional key word arguments are passed to
csv.writer()
. So, e.g., to override the delimiter from the default CSV dialect, provide the delimiter keyword argument. Use thismethod if you would like to write multiple csv files to the same archive.Warning
If a file already exists in the archive, it will be overwritten.
- Args:
- archive_path: str
The path to zip achive. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.
- csv_name: str
The name of the csv file to be stored in the archive. If
None
will use the archive name.- encoding: str
The CSV encoding type for csv.writer()
- errors: str
Raise an Error if encountered
- write_header: boolean
Include header in output
- if_exists: str
If archive already exists, one of ‘replace’ or ‘append’
- **csvargs: kwargs
csv_writer
optional arguments
- Returns:
- str
The path of the archive
- to_json(local_path=None, temp_file_compression=None, line_delimited=False)[source]
Outputs table to a JSON file
Warning
If a file already exists at the given location, it will be overwritten.
- Args:
- local_path: str
The path to write the JSON locally. If it ends in “.gz”, it will be compressed first. If not specified, a temporary file will be created and returned, and that file will be removed automatically when the script is done running.
- temp_file_compression: str
If a temp file is requested (ie. no
local_path
is specified), the compression type for that file. Currently “None” and “gzip” are supported. If alocal_path
is specified, this argument is ignored.- line_delimited: bool
Whether the file will be line-delimited JSON (with a row on each line), or a proper JSON file.
- Returns:
- str
The path of the new file
- to_sftp_csv(remote_path, host, username, password, port=22, encoding=None, compression=None, errors='strict', write_header=True, rsa_private_key_file=None, **csvargs)[source]
Writes the table to a CSV file on a remote SFTP server
- Args:
- remote_path: str
The remote path of the file. If it ends in ‘.gz’, the file will be compressed.
- host: str
The remote host
- username: str
The username to access the SFTP server
- password: str
The password to access the SFTP server
- port: int
The port number of the SFTP server
- encoding: str
The CSV encoding type for csv.writer()
- errors: str
Raise an Error if encountered
- write_header: boolean
Include header in output
- rsa_private_key_file str
Absolute path to a private RSA key used to authenticate stfp connection
- **csvargs: kwargs
csv_writer
optional arguments
- to_s3_csv(bucket, key, aws_access_key_id=None, aws_secret_access_key=None, compression=None, encoding=None, errors='strict', write_header=True, acl='bucket-owner-full-control', public_url=False, public_url_expires=3600, **csvargs)[source]
Writes the table to an s3 object as a CSV
- Args:
- bucket: str
The s3 bucket to upload to
- key: str
The s3 key to name the file. If it ends in ‘.gz’ or ‘.zip’, the file will be compressed.
- aws_access_key_id: str
Required if not included as environmental variable
- aws_secret_access_key: str
Required if not included as environmental variable
- compression: str
The compression type for the s3 object. Currently “None”, “zip” and “gzip” are supported. If specified, will override the key suffix.
- encoding: str
The CSV encoding type for csv.writer()
- errors: str
Raise an Error if encountered
- write_header: boolean
Include header in output
- public_url: boolean
Create a public link to the file
- public_url_expire: 3600
The time, in seconds, until the url expires if
public_url
set toTrue
.- acl: str
The S3 permissions on the file
- **csvargs: kwargs
csv_writer
optional arguments
- Returns:
Public url if specified. If not
None
.
- to_redshift(table_name, username=None, password=None, host=None, db=None, port=None, **copy_args)[source]
Write a table to a Redshift database. Note, this requires you to pass AWS S3 credentials or store them as environmental variables.
- Args:
- table_name: str
The table name and schema (
my_schema.my_table
) to point the file.- username: str
Required if env variable
REDSHIFT_USERNAME
not populated- password: str
Required if env variable
REDSHIFT_PASSWORD
not populated- host: str
Required if env variable
REDSHIFT_HOST
not populated- db: str
Required if env variable
REDSHIFT_DB
not populated- port: int
Required if env variable
REDSHIFT_PORT
not populated. Port 5439 is typical.- **copy_args: kwargs
See
copy`()
for options.
- Returns:
None
- to_postgres(table_name, username=None, password=None, host=None, db=None, port=None, **copy_args)[source]
Write a table to a Postgres database.
- Args:
- table_name: str
The table name and schema (
my_schema.my_table
) to point the file.- username: str
Required if env variable
PGUSER
not populated- password: str
Required if env variable
PGPASSWORD
not populated- host: str
Required if env variable
PGHOST
not populated- db: str
Required if env variable
PGDATABASE
not populated- port: int
Required if env variable
PGPORT
not populated.- **copy_args: kwargs
See
copy`()
for options.
- Returns:
None
- to_civis(table, api_key=None, db=None, max_errors=None, existing_table_rows='fail', diststyle=None, distkey=None, sortkey1=None, sortkey2=None, wait=True, **civisargs)[source]
Write the table to a Civis Redshift cluster. Additional key word arguments can passed to civis.io.dataframe_to_civis() # noqa: E501
- Args
- table: str
The schema and table you want to upload to. E.g., ‘scratch.table’. Schemas or tablenames with periods must be double quoted, e.g. ‘scratch.”my.table”’.
- api_key: str
Your Civis API key. If not given, the CIVIS_API_KEY environment variable will be used.
- db: str or int
The Civis Database. Can be database name or ID
- max_errors: int
The maximum number of rows with errors to remove from the import before failing.
- diststyle: str
The distribution style for the table. One of ‘even’, ‘all’ or ‘key’.
- existing_table_rows: str
The behaviour if a table with the requested name already exists. One of ‘fail’, ‘truncate’, ‘append’ or ‘drop’. Defaults to ‘fail’.
- distkey: str
The column to use as the distkey for the table.
- sortkey1: str
The column to use as the sortkey for the table.
- sortkey2: str
The second column in a compound sortkey for the table.
- wait: boolean
Wait for write job to complete before exiting method.
- classmethod from_csv(local_path, **csvargs)[source]
Create a
parsons table
object from a CSV file- Args:
- local_path: obj
A csv formatted local path, url or ftp. If this is a file path that ends in “.gz”, the file will be decompressed first.
- **csvargs: kwargs
csv_reader
optional arguments
- Returns:
- Parsons Table
See Parsons Table for output options.
- classmethod from_csv_string(str, **csvargs)[source]
Create a
parsons table
object from a string representing a CSV.- Args:
- str: str
The string object to convert to a table
- **csvargs: kwargs
csv_reader
optional arguments
- Returns:
- Parsons Table
See Parsons Table for output options.
- classmethod from_columns(cols, header=None)[source]
Create a
parsons table
from a list of lists organized as columns- Args:
- cols: list
A list of lists organized as columns
- header: list
List of column names. If not specified, will use dummy column names
- Returns:
- Parsons Table
See Parsons Table for output options.
- classmethod from_json(local_path, header=None, line_delimited=False)[source]
Create a
parsons table
from a json file- Args:
- local_path: list
A JSON formatted local path, url or ftp. If this is a file path that ends in “.gz”, the file will be decompressed first.
- header: list
List of columns to use for the destination table. If omitted, columns will be inferred from the initial data in the file.
- line_delimited: bool
Whether the file is line-delimited JSON (with a row on each line), or a proper JSON file.
- Returns:
- Parsons Table
See Parsons Table for output options.
- classmethod from_redshift(sql, username=None, password=None, host=None, db=None, port=None)[source]
Create a
parsons table
from a Redshift query.To pull an entire Redshift table, use a query like
SELECT * FROM tablename
.- Args:
- sql: str
A valid SQL statement
- username: str
Required if env variable
REDSHIFT_USERNAME
not populated- password: str
Required if env variable
REDSHIFT_PASSWORD
not populated- host: str
Required if env variable
REDSHIFT_HOST
not populated- db: str
Required if env variable
REDSHIFT_DB
not populated- port: int
Required if env variable
REDSHIFT_PORT
not populated. Port 5439 is typical.
- Returns:
- Parsons Table
See Parsons Table for output options.
- classmethod from_postgres(sql, username=None, password=None, host=None, db=None, port=None)[source]
- Args:
- sql: str
A valid SQL statement
- username: str
Required if env variable
PGUSER
not populated- password: str
Required if env variable
PGPASSWORD
not populated- host: str
Required if env variable
PGHOST
not populated- db: str
Required if env variable
PGDATABASE
not populated- port: int
Required if env variable
PGPORT
not populated.
- classmethod from_s3_csv(bucket, key, aws_access_key_id=None, aws_secret_access_key=None, **csvargs)[source]
Create a
parsons table
from a key in an S3 bucket.- Args:
- bucket: str
The S3 bucket.
- key: str
The S3 key
- aws_access_key_id: str
Required if not included as environmental variable.
- aws_secret_access_key: str
Required if not included as environmental variable.
- **csvargs: kwargs
csv_reader
optional arguments
- Returns:
parsons.Table object
ETL API
The following methods allow you to manipulate the Parsons table data.
- class parsons.etl.etl.ETL[source]
- add_column(column, value=None, index=None)[source]
Add a column to your table
- Args:
- column: str
Name of column to add
- value:
A fixed or calculated value
- index: int
The position of the new column in the table
- Returns:
Parsons Table and also updates self
- remove_column(*columns)[source]
Remove a column from your table
- Args:
- *columns: str
Column names
- Returns:
Parsons Table and also updates self
- rename_column(column_name, new_column_name)[source]
Rename a column
- Args:
- column_name: str
The current column name
- new_column_name: str
The new column name
- Returns:
Parsons Table and also updates self
- fill_column(column_name, fill_value)[source]
Fill a column in a table
- Args:
- column_name: str
The column to fill
- fill_value:
A fixed or calculated value
- Returns:
Parsons Table and also updates self
- fillna_column(column_name, fill_value)[source]
Fill None values in a column in a table
- Args:
- column_name: str
The column to fill
- fill_value:
Fixed value only
- Returns:
Parsons Table and also updates self
- move_column(column, index)[source]
Move a column
- Args:
- column: str
The column name to move
- index:
The new index for the column
- Returns:
Parsons Table and also updates existing object.
- convert_column(*column, **kwargs)[source]
Transform values under one or more fields via arbitrary functions, method invocations or dictionary translations. This leverages the petl
convert()
method. Example usage can be found here <https://petl.readthedocs.io/en/v0.24/transform.html#petl.convert>`_- Args:
- *column: str
A single column or multiple columns passed as a list
- **kwargs: str, method or variable
The update function, method, or variable to process the update
- Returns:
Parsons Table and also updates self
- get_column_max_width(column)[source]
Return the maximum width of the column.
- Args:
- column: str
The column name.
- Returns:
int
- convert_columns_to_str()[source]
Convenience function to convert all non-string or mixed columns in a Parsons table to string (e.g. for comparison)
- Returns:
Parsons Table and also updates self
- coalesce_columns(dest_column, source_columns, remove_source_columns=True)[source]
Coalesces values from one or more source columns into a destination column, by selecting the first non-empty value. If the destination column doesn’t exist, it will be added.
- Args:
- dest_column: str
Name of destination column
- source_columns: list
List of source column names
- remove_source_columns: bool
Whether to remove the source columns after the coalesce. If the destination column is also one of the source columns, it will not be removed.
- Returns:
Parsons Table and also updates self
- map_columns(column_map)[source]
Standardizes column names based on multiple possible values. This method is helpful when your input table might have multiple and unknown column names. Args:
- column_map: dict
A dictionary of columns and possible values that map to it
- Returns:
Parsons Table and also updates self
tbl = [{fn: 'Jane'}, {lastname: 'Doe'}, {dob: '1980-01-01'}] column_map = {first_name: ['fn', 'first', 'firstname'], last_name: ['ln', 'last', 'lastname'], date_of_birth: ['dob', 'birthday']} tbl.map_columns(column_map) print (tbl) >> {{first_name: 'Jane', last_name: 'Doe', 'date_of_birth': '1908-01-01'}}
- get_column_types(column)[source]
Return all of the Python types for values in a given column
- Args:
- column: str
Name of the column to analyze
- Returns:
- list
A list of Python types
- get_columns_type_stats()[source]
Return descriptive stats for all columns
- Returns:
- list
A list of dicts
- Returns:
- list
A list of dicts, each containing a column ‘name’ and a ‘type’ list
- convert_table(*args)[source]
Transform all cells in a table via arbitrary functions, method invocations or dictionary translations. This method is useful for cleaning fields and data hygiene functions such as regex. This method leverages the petl
convert()
method. Example usage can be found here <https://petl.readthedocs.io/en/v0.24/transform.html#petl.convert>`_.- Args:
- *args: str, method or variable
The update function, method, or variable to process the update. Can also
- Returns:
Parsons Table and also updates self
- unpack_dict(column, keys=None, include_original=False, sample_size=1000, missing=None, prepend=True, prepend_value=None)[source]
Unpack dictionary values from one column into separate columns
- Args:
- column: str
The column name to unpack
- keys: list
The dict keys in the column to unpack. If
None
will unpack all.- include_original: boolean
Retain original column after unpacking
- sample_size: int
Number of rows to sample before determining columns
- missing: str
If a value is missing, the value to fill it with
- prepend:
Prepend the column name of the unpacked values. Useful for avoiding duplicate column names
- prepend_value:
Value to prepend new columns if
prepend=True
. If None, will set to column name.
- unpack_list(column, include_original=False, missing=None, replace=False, max_columns=None)[source]
Unpack list values from one column into separate columns. Numbers the columns.
# Begin with a list in column json = [{'id': '5421', 'name': 'Jane Green', 'phones': ['512-699-3334', '512-222-5478'] } ] tbl = Table(json) print (tbl) >>> {'id': '5421', 'name': 'Jane Green', 'phones': ['512-699-3334', '512-222-5478']} tbl.unpack_list('phones', replace=True) print (tbl) >>> {'id': '5421', 'name': 'Jane Green', 'phones_0': '512-699-3334', 'phones_1': '512-222-5478'} # noqa: E501
- Args:
- column: str
The column name to unpack
- include_original: boolean
Retain original column after unpacking
- sample_size: int
Number of rows to sample before determining columns
- missing: str
If a value is missing, the value to fill it with
- replace: boolean
Return new table or replace existing
- max_columns: int
The maximum number of columns to unpack
- Returns:
None
- unpack_nested_columns_as_rows(column, key='id', expand_original=False)[source]
Unpack list or dict values from one column into separate rows. Not recommended for JSON columns (i.e. lists of dicts), but can handle columns with any mix of types. Makes use of PETL’s melt() method.
- Args:
- column: str
The column name to unpack
- key: str
The column to use as a key when unpacking. Defaults to id
- expand_original: boolean or int
If True: Add resulting unpacked rows (with all other columns) to original If int: Add to original unless the max added per key is above the given number If False (default): Return unpacked rows (with key column only) as standalone Removes packed list and dict rows from original either way.
- Returns:
If expand_original, original table with packed rows replaced by unpacked rows Otherwise, standalone table with key column and unpacked values only
- long_table(key, column, key_rename=None, retain_original=False, prepend=True, prepend_value=None)[source]
Create a new long parsons table from a column, including the foreign key.
# Begin with nested dicts in a column json = [{'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'} ] } ] tbl = Table(json) print (tbl) >>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]} # noqa: E501 >>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]} # noqa: E501 # Create skinny table of just the nested dicts email_skinny = tbl.long_table(['id'], 'emails') print (email_skinny) >>> {'id': '5421', 'emails_home': 'jane@gmail.com', 'emails_work': None} >>> {'id': '5421', 'emails_home': None, 'emails_work': 'jane@mywork.com'}
- Args:
- key: lst
The columns to retain in the long table (e.g. foreign keys)
- column: str
The column name to make long
- key_rename: dict
The new name for the foreign key to better identify it. For example, you might want to rename
id
toperson_id
. Ex. {‘KEY_NAME’: ‘NEW_KEY_NAME’}- retain_original: boolean
Retain the original column from the source table.
- prepend:
Prepend the column name of the unpacked values. Useful for avoiding duplicate column names
- prepend_value:
Value to prepend new columns if
prepend=True
. If None, will set to column name.
- Returns:
- Parsons Table
The new long table
- cut(*columns)[source]
Return a table of selection of columns
- Args:
- *columns: str
Columns in the parsons table
- Returns:
A new parsons table containing the selected columnns
- select_rows(*filters)[source]
Select specific rows from a Parsons table based on the passed filters.
Example filters:
tbl = Table([['foo', 'bar', 'baz'], ['c', 4, 9.3], ['a', 2, 88.2], ['b', 1, 23.3],]) # You can structure the filter in multiple wayss # Lambda Function tbl2 = tbl.select_rows(lambda row: row.foo == 'a' and row.baz > 88.1) tbl2 >>> {foo: 'a', 'bar': 2, 'baz': 88.1} # Expression String tbl3 = tbl.select_rows("{foo} == 'a' and {baz} > 88.1") tbl3 >>> {foo: 'a', 'bar': 2, 'baz': 88.1}
- Args:
*filters: function or str
- Returns:
A new parsons table containing the selected rows
- remove_null_rows(columns, null_value=None)[source]
Remove rows if the values in a column are
None
. If multiple columns are passed as list, it will remove all rows with null values in any of the passed columns.- Args:
- column: str or list
The column or columns to analyze
- null_value: int or float or str
The null value
- Returns:
None
- stack(*tables, missing=None)[source]
Stack Parsons tables on top of one another.
Similar to
table.concat()
, except no attempt is made to align fields from different tables.- Args:
- tables: Parsons Table or list
A single table, or a list of tables
- missing: bool
The value to use when padding missing values
- Returns:
None
- concat(*tables, missing=None)[source]
Concatenates one or more tables onto this one.
Note that the tables do not need to share exactly the same fields. Any missing fields will be padded with None, or whatever is provided via the
missing
keyword argument.- Args:
- tables: Parsons Table or list
A single table, or a list of tables
- missing: bool
The value to use when padding missing values
- Returns:
None
- chunk(rows)[source]
Divides a Parsons table into smaller tables of a specified row count. If the table cannot be divided evenly, then the final table will only include the remainder.
- Args:
- rows: int
The number of rows of each new Parsons table
- Returns:
List of Parsons tables
- static get_normalized_column_name(column_name)[source]
Returns a column name with whitespace removed, non-alphanumeric characters removed, and everything lowercased.
- Returns:
- str
Normalized column name
- match_columns(desired_columns, fuzzy_match=True, if_extra_columns='remove', if_missing_columns='add')[source]
Changes the column names and ordering in this Table to match a list of desired column names.
- Args:
- desired_columns: list
Ordered list of desired column names
- fuzzy_match: bool
Whether to normalize column names when matching against the desired column names, removing whitespace and non-alphanumeric characters, and lowercasing everything. Eg. With this flag set, “FIRST NAME” would match “first_name”. If the Table has two columns that normalize to the same string (eg. “FIRST NAME” and “first_name”), the latter will be considered an extra column.
- if_extra_columns: string
If the Table has columns that don’t match any desired columns, either ‘remove’ them, ‘ignore’ them, or ‘fail’ (raising an error).
- if_missing_columns: string
If the Table is missing some of the desired columns, either ‘add’ them (with a value of None), ‘ignore’ them, or ‘fail’ (raising an error).
- Returns:
Parsons Table and also updates self
- reduce_rows(columns, reduce_func, headers, presorted=False, **kwargs)[source]
Group rows by a column or columns, then reduce the groups to a single row.
Based on the rowreduce petl function.
For example, the output from the query to get a table’s definition is returned as one component per row. The reduce_rows method can be used to reduce all those to a single row containg the entire query.
>>> ddl = rs.query(sql_to_get_table_ddl) >>> ddl.table +--------------+--------------+----------------------------------------------------+ | schemaname | tablename | ddl | +==============+==============+====================================================+ | 'db_scratch' | 'state_fips' | '--DROP TABLE db_scratch.state_fips;' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | 'CREATE TABLE IF NOT EXISTS db_scratch.state_fips' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | '(' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | '\tstate VARCHAR(1024) ENCODE RAW' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | '\t,stusab VARCHAR(1024) ENCODE RAW' | +--------------+--------------+----------------------------------------------------+ >>> reducer_fn = lambda columns, rows: [ ... f"{columns[0]}.{columns[1]}", ... '\n'.join([row[2] for row in rows])] >>> ddl.reduce_rows( ... ['schemaname', 'tablename'], ... reducer_fn, ... ['tablename', 'ddl'], ... presorted=True) >>> ddl.table +-------------------------+-----------------------------------------------------------------------+ | tablename | ddl | +=========================+=======================================================================+ | 'db_scratch.state_fips' | '--DROP TABLE db_scratch.state_fips;\nCREATE TABLE IF NOT EXISTS | | | db_scratch.state_fips\n(\n\tstate VARCHAR(1024) ENCODE RAW\n\t | | | ,db_scratch.state_fips\n(\n\tstate VARCHAR(1024) ENCODE RAW | | | \n\t,stusab VARCHAR(1024) ENCODE RAW\n\t,state_name | | | VARCHAR(1024) ENCODE RAW\n\t,statens VARCHAR(1024) ENCODE | | | RAW\n)\nDISTSTYLE EVEN\n;' | +-------------------------+-----------------------------------------------------------------------+
- Args:
- columns: list
The column(s) by which to group the rows.
- reduce_func: fun
The function by which to reduce the rows. Should take the 2 arguments, the columns list and the rows list and return a list. reducer(columns: list, rows: list) -> list;
- headers: list
The list of headers for modified table. The length of headers should match the length of the list returned by the reduce function.
- presorted: bool
If false, the row will be sorted.
- Returns:
Parsons Table and also updates self
- class parsons.etl.etl.ETL[source]
- add_column(column, value=None, index=None)[source]
Add a column to your table
- Args:
- column: str
Name of column to add
- value:
A fixed or calculated value
- index: int
The position of the new column in the table
- Returns:
Parsons Table and also updates self
- remove_column(*columns)[source]
Remove a column from your table
- Args:
- *columns: str
Column names
- Returns:
Parsons Table and also updates self
- rename_column(column_name, new_column_name)[source]
Rename a column
- Args:
- column_name: str
The current column name
- new_column_name: str
The new column name
- Returns:
Parsons Table and also updates self
- fill_column(column_name, fill_value)[source]
Fill a column in a table
- Args:
- column_name: str
The column to fill
- fill_value:
A fixed or calculated value
- Returns:
Parsons Table and also updates self
- fillna_column(column_name, fill_value)[source]
Fill None values in a column in a table
- Args:
- column_name: str
The column to fill
- fill_value:
Fixed value only
- Returns:
Parsons Table and also updates self
- move_column(column, index)[source]
Move a column
- Args:
- column: str
The column name to move
- index:
The new index for the column
- Returns:
Parsons Table and also updates existing object.
- convert_column(*column, **kwargs)[source]
Transform values under one or more fields via arbitrary functions, method invocations or dictionary translations. This leverages the petl
convert()
method. Example usage can be found here <https://petl.readthedocs.io/en/v0.24/transform.html#petl.convert>`_- Args:
- *column: str
A single column or multiple columns passed as a list
- **kwargs: str, method or variable
The update function, method, or variable to process the update
- Returns:
Parsons Table and also updates self
- get_column_max_width(column)[source]
Return the maximum width of the column.
- Args:
- column: str
The column name.
- Returns:
int
- convert_columns_to_str()[source]
Convenience function to convert all non-string or mixed columns in a Parsons table to string (e.g. for comparison)
- Returns:
Parsons Table and also updates self
- coalesce_columns(dest_column, source_columns, remove_source_columns=True)[source]
Coalesces values from one or more source columns into a destination column, by selecting the first non-empty value. If the destination column doesn’t exist, it will be added.
- Args:
- dest_column: str
Name of destination column
- source_columns: list
List of source column names
- remove_source_columns: bool
Whether to remove the source columns after the coalesce. If the destination column is also one of the source columns, it will not be removed.
- Returns:
Parsons Table and also updates self
- map_columns(column_map)[source]
Standardizes column names based on multiple possible values. This method is helpful when your input table might have multiple and unknown column names. Args:
- column_map: dict
A dictionary of columns and possible values that map to it
- Returns:
Parsons Table and also updates self
tbl = [{fn: 'Jane'}, {lastname: 'Doe'}, {dob: '1980-01-01'}] column_map = {first_name: ['fn', 'first', 'firstname'], last_name: ['ln', 'last', 'lastname'], date_of_birth: ['dob', 'birthday']} tbl.map_columns(column_map) print (tbl) >> {{first_name: 'Jane', last_name: 'Doe', 'date_of_birth': '1908-01-01'}}
- get_column_types(column)[source]
Return all of the Python types for values in a given column
- Args:
- column: str
Name of the column to analyze
- Returns:
- list
A list of Python types
- get_columns_type_stats()[source]
Return descriptive stats for all columns
- Returns:
- list
A list of dicts
- Returns:
- list
A list of dicts, each containing a column ‘name’ and a ‘type’ list
- convert_table(*args)[source]
Transform all cells in a table via arbitrary functions, method invocations or dictionary translations. This method is useful for cleaning fields and data hygiene functions such as regex. This method leverages the petl
convert()
method. Example usage can be found here <https://petl.readthedocs.io/en/v0.24/transform.html#petl.convert>`_.- Args:
- *args: str, method or variable
The update function, method, or variable to process the update. Can also
- Returns:
Parsons Table and also updates self
- unpack_dict(column, keys=None, include_original=False, sample_size=1000, missing=None, prepend=True, prepend_value=None)[source]
Unpack dictionary values from one column into separate columns
- Args:
- column: str
The column name to unpack
- keys: list
The dict keys in the column to unpack. If
None
will unpack all.- include_original: boolean
Retain original column after unpacking
- sample_size: int
Number of rows to sample before determining columns
- missing: str
If a value is missing, the value to fill it with
- prepend:
Prepend the column name of the unpacked values. Useful for avoiding duplicate column names
- prepend_value:
Value to prepend new columns if
prepend=True
. If None, will set to column name.
- unpack_list(column, include_original=False, missing=None, replace=False, max_columns=None)[source]
Unpack list values from one column into separate columns. Numbers the columns.
# Begin with a list in column json = [{'id': '5421', 'name': 'Jane Green', 'phones': ['512-699-3334', '512-222-5478'] } ] tbl = Table(json) print (tbl) >>> {'id': '5421', 'name': 'Jane Green', 'phones': ['512-699-3334', '512-222-5478']} tbl.unpack_list('phones', replace=True) print (tbl) >>> {'id': '5421', 'name': 'Jane Green', 'phones_0': '512-699-3334', 'phones_1': '512-222-5478'} # noqa: E501
- Args:
- column: str
The column name to unpack
- include_original: boolean
Retain original column after unpacking
- sample_size: int
Number of rows to sample before determining columns
- missing: str
If a value is missing, the value to fill it with
- replace: boolean
Return new table or replace existing
- max_columns: int
The maximum number of columns to unpack
- Returns:
None
- unpack_nested_columns_as_rows(column, key='id', expand_original=False)[source]
Unpack list or dict values from one column into separate rows. Not recommended for JSON columns (i.e. lists of dicts), but can handle columns with any mix of types. Makes use of PETL’s melt() method.
- Args:
- column: str
The column name to unpack
- key: str
The column to use as a key when unpacking. Defaults to id
- expand_original: boolean or int
If True: Add resulting unpacked rows (with all other columns) to original If int: Add to original unless the max added per key is above the given number If False (default): Return unpacked rows (with key column only) as standalone Removes packed list and dict rows from original either way.
- Returns:
If expand_original, original table with packed rows replaced by unpacked rows Otherwise, standalone table with key column and unpacked values only
- long_table(key, column, key_rename=None, retain_original=False, prepend=True, prepend_value=None)[source]
Create a new long parsons table from a column, including the foreign key.
# Begin with nested dicts in a column json = [{'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'} ] } ] tbl = Table(json) print (tbl) >>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]} # noqa: E501 >>> {'id': '5421', 'name': 'Jane Green', 'emails': [{'home': 'jane@gmail.com'}, {'work': 'jane@mywork.com'}]} # noqa: E501 # Create skinny table of just the nested dicts email_skinny = tbl.long_table(['id'], 'emails') print (email_skinny) >>> {'id': '5421', 'emails_home': 'jane@gmail.com', 'emails_work': None} >>> {'id': '5421', 'emails_home': None, 'emails_work': 'jane@mywork.com'}
- Args:
- key: lst
The columns to retain in the long table (e.g. foreign keys)
- column: str
The column name to make long
- key_rename: dict
The new name for the foreign key to better identify it. For example, you might want to rename
id
toperson_id
. Ex. {‘KEY_NAME’: ‘NEW_KEY_NAME’}- retain_original: boolean
Retain the original column from the source table.
- prepend:
Prepend the column name of the unpacked values. Useful for avoiding duplicate column names
- prepend_value:
Value to prepend new columns if
prepend=True
. If None, will set to column name.
- Returns:
- Parsons Table
The new long table
- cut(*columns)[source]
Return a table of selection of columns
- Args:
- *columns: str
Columns in the parsons table
- Returns:
A new parsons table containing the selected columnns
- select_rows(*filters)[source]
Select specific rows from a Parsons table based on the passed filters.
Example filters:
tbl = Table([['foo', 'bar', 'baz'], ['c', 4, 9.3], ['a', 2, 88.2], ['b', 1, 23.3],]) # You can structure the filter in multiple wayss # Lambda Function tbl2 = tbl.select_rows(lambda row: row.foo == 'a' and row.baz > 88.1) tbl2 >>> {foo: 'a', 'bar': 2, 'baz': 88.1} # Expression String tbl3 = tbl.select_rows("{foo} == 'a' and {baz} > 88.1") tbl3 >>> {foo: 'a', 'bar': 2, 'baz': 88.1}
- Args:
*filters: function or str
- Returns:
A new parsons table containing the selected rows
- remove_null_rows(columns, null_value=None)[source]
Remove rows if the values in a column are
None
. If multiple columns are passed as list, it will remove all rows with null values in any of the passed columns.- Args:
- column: str or list
The column or columns to analyze
- null_value: int or float or str
The null value
- Returns:
None
- stack(*tables, missing=None)[source]
Stack Parsons tables on top of one another.
Similar to
table.concat()
, except no attempt is made to align fields from different tables.- Args:
- tables: Parsons Table or list
A single table, or a list of tables
- missing: bool
The value to use when padding missing values
- Returns:
None
- concat(*tables, missing=None)[source]
Concatenates one or more tables onto this one.
Note that the tables do not need to share exactly the same fields. Any missing fields will be padded with None, or whatever is provided via the
missing
keyword argument.- Args:
- tables: Parsons Table or list
A single table, or a list of tables
- missing: bool
The value to use when padding missing values
- Returns:
None
- chunk(rows)[source]
Divides a Parsons table into smaller tables of a specified row count. If the table cannot be divided evenly, then the final table will only include the remainder.
- Args:
- rows: int
The number of rows of each new Parsons table
- Returns:
List of Parsons tables
- static get_normalized_column_name(column_name)[source]
Returns a column name with whitespace removed, non-alphanumeric characters removed, and everything lowercased.
- Returns:
- str
Normalized column name
- match_columns(desired_columns, fuzzy_match=True, if_extra_columns='remove', if_missing_columns='add')[source]
Changes the column names and ordering in this Table to match a list of desired column names.
- Args:
- desired_columns: list
Ordered list of desired column names
- fuzzy_match: bool
Whether to normalize column names when matching against the desired column names, removing whitespace and non-alphanumeric characters, and lowercasing everything. Eg. With this flag set, “FIRST NAME” would match “first_name”. If the Table has two columns that normalize to the same string (eg. “FIRST NAME” and “first_name”), the latter will be considered an extra column.
- if_extra_columns: string
If the Table has columns that don’t match any desired columns, either ‘remove’ them, ‘ignore’ them, or ‘fail’ (raising an error).
- if_missing_columns: string
If the Table is missing some of the desired columns, either ‘add’ them (with a value of None), ‘ignore’ them, or ‘fail’ (raising an error).
- Returns:
Parsons Table and also updates self
- reduce_rows(columns, reduce_func, headers, presorted=False, **kwargs)[source]
Group rows by a column or columns, then reduce the groups to a single row.
Based on the rowreduce petl function.
For example, the output from the query to get a table’s definition is returned as one component per row. The reduce_rows method can be used to reduce all those to a single row containg the entire query.
>>> ddl = rs.query(sql_to_get_table_ddl) >>> ddl.table +--------------+--------------+----------------------------------------------------+ | schemaname | tablename | ddl | +==============+==============+====================================================+ | 'db_scratch' | 'state_fips' | '--DROP TABLE db_scratch.state_fips;' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | 'CREATE TABLE IF NOT EXISTS db_scratch.state_fips' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | '(' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | '\tstate VARCHAR(1024) ENCODE RAW' | +--------------+--------------+----------------------------------------------------+ | 'db_scratch' | 'state_fips' | '\t,stusab VARCHAR(1024) ENCODE RAW' | +--------------+--------------+----------------------------------------------------+ >>> reducer_fn = lambda columns, rows: [ ... f"{columns[0]}.{columns[1]}", ... '\n'.join([row[2] for row in rows])] >>> ddl.reduce_rows( ... ['schemaname', 'tablename'], ... reducer_fn, ... ['tablename', 'ddl'], ... presorted=True) >>> ddl.table +-------------------------+-----------------------------------------------------------------------+ | tablename | ddl | +=========================+=======================================================================+ | 'db_scratch.state_fips' | '--DROP TABLE db_scratch.state_fips;\nCREATE TABLE IF NOT EXISTS | | | db_scratch.state_fips\n(\n\tstate VARCHAR(1024) ENCODE RAW\n\t | | | ,db_scratch.state_fips\n(\n\tstate VARCHAR(1024) ENCODE RAW | | | \n\t,stusab VARCHAR(1024) ENCODE RAW\n\t,state_name | | | VARCHAR(1024) ENCODE RAW\n\t,statens VARCHAR(1024) ENCODE | | | RAW\n)\nDISTSTYLE EVEN\n;' | +-------------------------+-----------------------------------------------------------------------+
- Args:
- columns: list
The column(s) by which to group the rows.
- reduce_func: fun
The function by which to reduce the rows. Should take the 2 arguments, the columns list and the rows list and return a list. reducer(columns: list, rows: list) -> list;
- headers: list
The list of headers for modified table. The length of headers should match the length of the list returned by the reduce function.
- presorted: bool
If false, the row will be sorted.
- Returns:
Parsons Table and also updates self