src.preprocess package

Submodules

src.preprocess.utils module

src.preprocess.utils.df_to_records(df: pd.DataFrame, dataset: str, drop_columns=[])

Convert dataframe to a list of record oriented dicts.

Parameters:
  • df (pd.DataFrame) – Input dataset.
  • dataset (str) – Name of provider dataset.
  • drop_columns (list) – Which columns (if any) to drop.
Returns:

List of row-wise dicts.

Return type:

list

src.preprocess.utils.filter_new_hashes(data: pd.DataFrame, ingested_path: str, date_now: str = '2021_03_09', save_ingestion_hashes: bool = False) → pd.DataFrame

Filter records by the row-wise hashes of their content.

Reduces the number of records that need to be processed from each dataset.

Hashes that were ingested on the same day the function is called are not filtered out.

Parameters:
  • data (pd.DataFrame) – Input data.
  • ingested_path (str) – Path to ingested hash reference.
  • date_now (str) – String of current date.
  • save_ingestion_hashes (bool) – Should ingestion hashes be saved?
Returns:

Filtered data.

Return type:

pd.DataFrame
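
The filtering rule described above can be sketched without pandas. This is a minimal illustration, not the package's implementation: the helper name filter_new, the (hash, record) pair layout, and the dict of hash to ingestion date are all assumptions made for the example.

```python
def filter_new(records, ingested, date_now):
    """Keep records whose hash is new or was ingested on date_now.

    records:  list of (row_hash, record) pairs (hypothetical layout)
    ingested: dict mapping row_hash -> ingestion date string (hypothetical)
    """
    kept = []
    for row_hash, record in records:
        first_seen = ingested.get(row_hash)
        # Unseen hashes are new content. Hashes ingested today are kept too,
        # so re-running on the same day does not drop them.
        if first_seen is None or first_seen == date_now:
            kept.append(record)
    return kept


ingested = {"h1": "2021_03_08", "h2": "2021_03_09"}
records = [("h1", {"id": 1}), ("h2", {"id": 2}), ("h3", {"id": 3})]
kept = filter_new(records, ingested, date_now="2021_03_09")
```

Here only the record hashed as h1 is dropped, because it was ingested on an earlier day.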

src.preprocess.utils.get_measure_records(combined_record, stub_names, id_columns, full_value_names)

Function to break rows into individual records by stub group.

For example, subset a row to only the C4 columns and the other retained information, then repeat for every possible measure.

Also drops records where the notes column is blank, i.e. sum(notes columns) == 0.

Parameters:
  • combined_record (dict) – Dict of a single OXCGRT row.
  • stub_names (list) – List of names of each stub group.
  • id_columns (list) – List of columns to be retained as IDs.
  • full_value_names (list) – List of full names of value columns.
Returns:

List of dicts containing all records extracted from a given row.

Return type:

list
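
The row-splitting idea can be sketched as follows. This is a hedged illustration only: the helper name measure_records, the "_Notes" column suffix, and the example column names are assumptions, and the blank-notes test here checks for empty values rather than reproducing the sum-based check mentioned above.

```python
def measure_records(combined_record, stub_names, id_columns, full_value_names):
    """Split one wide row (a dict) into one record per stub group."""
    records = []
    for stub in stub_names:
        # Columns belonging to this measure, e.g. "C4_Flag" and "C4_Notes".
        columns = [c for c in full_value_names if c.startswith(stub + "_")]
        notes = [combined_record.get(c) for c in columns if c.endswith("_Notes")]
        # Assumption: drop the record when none of its notes columns has content.
        if not any(notes):
            continue
        record = {c: combined_record[c] for c in id_columns}
        record.update({c: combined_record.get(c) for c in columns})
        records.append(record)
    return records


row = {
    "CountryCode": "GBR", "Date": "20210309",
    "C4_Flag": 1, "C4_Notes": "Gatherings limited",
    "H2_Flag": 0, "H2_Notes": "",
}
out = measure_records(
    row,
    stub_names=["C4", "H2"],
    id_columns=["CountryCode", "Date"],
    full_value_names=["C4_Flag", "C4_Notes", "H2_Flag", "H2_Notes"],
)
```

The H2 group is skipped because its notes value is empty, so out holds a single C4 record that also carries the ID columns.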

src.preprocess.utils.get_names(ox: pd.DataFrame)

Get the names of columns holding measure information.

These columns begin with the prefix “A1_” etc.

Parameters:ox (pd.DataFrame) – Input OXCGRT dataset.
Returns:
  • full_value_names (list) – The names of all columns with measure information.
  • value_names (list) – The names of measure columns.
  • stub_names (list) – The measure column prefixes (e.g. “A1”).
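
Prefix detection of this kind can be sketched with a regular expression. The pattern, the helper name measure_columns, and the sample column names are assumptions for illustration; the sketch returns only two of the three lists, because the source does not spell out how full_value_names and value_names differ.

```python
import re

# Assumption: a measure column starts with a capital letter, digits, then "_".
STUB_PATTERN = re.compile(r"^([A-Z]\d+)_")


def measure_columns(columns):
    """Return (measure column names, unique stub prefixes in first-seen order)."""
    names, stubs = [], []
    for column in columns:
        match = STUB_PATTERN.match(column)
        if match is None:
            continue  # not a measure column, e.g. "CountryName"
        names.append(column)
        if match.group(1) not in stubs:
            stubs.append(match.group(1))
    return names, stubs


columns = ["CountryName", "Date", "A1_Flag", "A1_Notes", "C4_Notes"]
names, stubs = measure_columns(columns)
```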
src.preprocess.utils.get_row_hashes(data: pd.DataFrame) → list

Get row-wise base64 encoded hashes for a dataframe.

Parameters:data (pd.DataFrame) – Input data.
Returns:List of hashes.
Return type:list
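
A row-wise, base64-encoded hash can be sketched with the standard library. The source does not say which hash algorithm is used, so SHA-256 is an assumption here, as are the helper name row_hashes and the use of plain tuples in place of dataframe rows.

```python
import base64
import hashlib


def row_hashes(rows):
    """Return one base64-encoded SHA-256 digest per row (hypothetical helper)."""
    hashes = []
    for row in rows:
        # Join cell values with a separator unlikely to occur in the data,
        # so that ("ab", "c") and ("a", "bc") hash differently.
        payload = "\x1f".join(str(value) for value in row).encode("utf-8")
        digest = hashlib.sha256(payload).digest()
        hashes.append(base64.b64encode(digest).decode("ascii"))
    return hashes


rows = [("GBR", "2021-03-09", "C4"), ("FRA", "2021-03-09", "C4")]
hashes = row_hashes(rows)
```

Identical rows always produce identical hashes, which is what makes the hashes usable as a reference for previously ingested content.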
src.preprocess.utils.oxcgrt_records(ox: pd.DataFrame, dataset: str, drop_columns: list = [])

Function to convert OXCGRT data to list of record dicts.

This presents an additional challenge because of the wide format of the OXCGRT data.

Parameters:
  • ox (pd.DataFrame) – Input OXCGRT data.
  • dataset (str) – Name of provider dataset.
  • drop_columns (list) – Which columns (if any) to drop.
Returns:

List of record dicts.

Return type:

list

src.preprocess.utils.split_df_by_group(data: pd.DataFrame, group: str)

Split a dataframe by group and return a named dictionary.

Parameters:
  • data (pd.DataFrame) – Input dataset.
  • group (str) – Name of column to be used as group.
Returns:

Dict of dataset slices named by group.

Return type:

dict

src.preprocess.utils.write_records(records: list, dir: str, fn: str)

Write records to a pickle file.

Parameters:
  • records (list) – List of preprocessed records.
  • dir (str) – Output directory.
  • fn (str) – Output file name.
Returns:

Return type:

None
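
Writing a list of records to a pickle file, and reading it back, might look like the sketch below. The helper names write_pickle and read_pickle are hypothetical, and creating the output directory when it is missing is an assumption rather than documented behaviour.

```python
import os
import pickle


def write_pickle(records, out_dir, file_name):
    """Serialise a list of records to out_dir/file_name with pickle."""
    os.makedirs(out_dir, exist_ok=True)  # assumption: create directory if missing
    path = os.path.join(out_dir, file_name)
    with open(path, "wb") as handle:
        pickle.dump(records, handle)


def read_pickle(path):
    """Load the records back, e.g. to check a round trip."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.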

src.preprocess.check module

src.preprocess.check.check_column_names(records: pd.DataFrame, config: pd.DataFrame, log: bool = True)

Function to check that column names agree with config or raise exception.

Parameters:
  • records (pd.DataFrame) – Dataframe of provider data.
  • config (pd.DataFrame) – Reference for accepted column names.
  • log (bool) – Whether or not to log results of checks.
Returns:

Return type:

None

src.preprocess.check.check_date_format(data: pd.DataFrame, config: pd.DataFrame, dataset: str, log: bool = True)

Check that an input date is in the expected format.

Parameters:
  • data (pd.DataFrame) – Dataframe of provider data.
  • config (pd.DataFrame) – Reference for accepted date formats.
  • dataset (str) – Name of provider dataset.
  • log (bool) – Whether or not to log results of checks.
Returns:

Return type:

None

src.preprocess.check.check_input(records: pd.DataFrame, column_config: pd.DataFrame, date_config: pd.DataFrame, dataset: str)

Function to unify all input checks.

Parameters:
  • records (pd.DataFrame) – Dataframe of provider data.
  • column_config (pd.DataFrame) – Reference for accepted column names.
  • date_config (pd.DataFrame) – Reference for accepted date formats.
  • dataset (str) – Name of provider dataset.
Returns:

Description of returned object.

Return type:

type

src.preprocess.check.validate_date_format(date, format)

Return None if a date format does not parse.

Parameters:
  • date (str) – Input date string.
  • format (str) – Accepted format to try.
Returns:

Returns date on successful parse or None on parsing failure.

Return type:

type
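
The parse-or-None behaviour can be sketched with datetime.strptime. The helper name try_date_format is hypothetical, and returning the original date string on success (rather than a parsed datetime) is an assumption based on the description above.

```python
from datetime import datetime


def try_date_format(date, fmt):
    """Return date if it parses with fmt, otherwise None."""
    try:
        datetime.strptime(date, fmt)
    except (ValueError, TypeError):
        # ValueError: the string does not match fmt.
        # TypeError: date is not a string (e.g. a missing value).
        return None
    return date


formats = ["%Y-%m-%d", "%d/%m/%Y"]
matches = [fmt for fmt in formats if try_date_format("09/03/2021", fmt) is not None]
```

Trying each accepted format in turn, as in the last two lines, is one way a caller could find which formats a provider's dates conform to.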

Module contents