src.preprocess package
Submodules
src.preprocess.utils module
-
src.preprocess.utils.df_to_records(df: pd.DataFrame, dataset: str, drop_columns=[])
Convert a dataframe to a list of record-oriented dicts.
Parameters: - df (pd.DataFrame) – Input dataset.
- dataset (str) – Name of provider dataset.
- drop_columns (list) – Which columns (if any) to drop.
Returns: List of row-wise dicts.
Return type: list
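As a rough illustration of the record-oriented output described above, the sketch below relies on pandas' `DataFrame.to_dict(orient="records")`. The helper name `df_to_records_sketch`, the sample columns, and the choice to store the dataset name under a `dataset` key are assumptions made for illustration, not the package's actual implementation.

```python
import pandas as pd


def df_to_records_sketch(df, dataset, drop_columns=None):
    """Hypothetical stand-in for df_to_records."""
    drop_columns = drop_columns or []
    # Drop unwanted columns, ignoring any that are not present.
    trimmed = df.drop(columns=drop_columns, errors="ignore")
    records = trimmed.to_dict(orient="records")
    # Assumption: each record is tagged with the provider dataset name.
    for record in records:
        record["dataset"] = dataset
    return records


df = pd.DataFrame({"country": ["FR", "DE"], "measure": ["a", "b"], "tmp": [1, 2]})
records = df_to_records_sketch(df, dataset="EXAMPLE", drop_columns=["tmp"])
```

Each element of `records` is one row expressed as a plain dict, which is convenient for row-by-row processing downstream.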
-
src.preprocess.utils.filter_new_hashes(data: pd.DataFrame, ingested_path: str, date_now: str = '2021_03_09', save_ingestion_hashes: bool = False) → pd.DataFrame
Filter records by the row-wise hashes of their content.
Reduces the number of records that need to be processed from each dataset.
Hashes that were ingested on the same day the function is called are not filtered out.
Parameters: - data (pd.DataFrame) – Input data.
- ingested_path (str) – Path to ingested hash reference.
- date_now (str) – Current date as a string.
- save_ingestion_hashes (bool) – Whether to save the ingestion hashes.
Returns: Filtered data.
Return type: pd.DataFrame
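The filtering rule above can be sketched as follows. This is a minimal in-memory variant under stated assumptions: instead of reading from `ingested_path`, it takes a hypothetical `ingested` dataframe with `hash` and `date` columns, and it uses SHA-256 over the joined row values as the hashing scheme. Neither the reference layout nor the hash function is confirmed by the documentation.

```python
import hashlib

import pandas as pd


def row_hash(row):
    """Hash a row's values (the hashing scheme is an assumption)."""
    joined = "|".join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def filter_new_hashes_sketch(data, ingested, date_now):
    """Hypothetical variant: `ingested` has columns `hash` and `date`."""
    hashes = data.apply(row_hash, axis=1)
    # Only hashes ingested on an earlier day count as already processed;
    # hashes ingested on `date_now` are deliberately not filtered.
    seen_before = set(ingested.loc[ingested["date"] != date_now, "hash"])
    return data[~hashes.isin(seen_before)]


data = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
ingested = pd.DataFrame(
    {
        "hash": [row_hash([1, "a"]), row_hash([2, "b"])],
        "date": ["2021_03_08", "2021_03_09"],
    }
)
new_rows = filter_new_hashes_sketch(data, ingested, date_now="2021_03_09")
```

Here the first row is dropped because its hash was ingested on an earlier day, while the second row is kept because its hash was ingested on `date_now`.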
-
src.preprocess.utils.get_measure_records(combined_record, stub_names, id_columns, full_value_names)
Break a row into individual records by stub group.
For example, subset a row to only the C4 records and the other information, then repeat for all possible measures.
Also drops records where the notes column is blank, i.e. sum(notes columns) == 0.
Parameters: - combined_record (dict) – Dict of a single OXCGRT row.
- stub_names (list) – List of names of each stub group.
- id_columns (list) – List of columns to be retained as IDs.
- full_value_names (list) – List of full names of value columns.
Returns: List of dicts containing all records extracted from a given row.
Return type: list
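The splitting of one wide row into per-measure records might look roughly like the sketch below. It assumes that value columns are named `<stub>_<something>`, that each stub group's notes column ends in `_Notes`, and that "blank" means an empty string or `None`. These conventions and the sample column names are illustrative assumptions, not taken from the package.

```python
def get_measure_records_sketch(combined_record, stub_names, id_columns, full_value_names):
    """Hypothetical stand-in for get_measure_records."""
    records = []
    for stub in stub_names:
        prefix = stub + "_"
        # Columns belonging to this stub group, e.g. everything starting "C4_".
        value_columns = [name for name in full_value_names if name.startswith(prefix)]
        # Assumption: notes columns end in "_Notes".
        notes = [combined_record.get(name) for name in value_columns if name.endswith("_Notes")]
        if not any(notes):
            continue  # skip stub groups whose notes are blank
        record = {key: combined_record[key] for key in id_columns}
        record.update({name: combined_record[name] for name in value_columns})
        records.append(record)
    return records


row = {
    "CountryCode": "FRA",
    "Date": "20200401",
    "C4_Restrictions on gatherings": 4,
    "C4_Notes": "Gatherings banned",
    "H1_Public information campaigns": 2,
    "H1_Notes": "",
}
measure_records = get_measure_records_sketch(
    row,
    stub_names=["C4", "H1"],
    id_columns=["CountryCode", "Date"],
    full_value_names=[name for name in row if name[:2] in ("C4", "H1")],
)
```

Only the C4 group yields a record in this example, because the H1 notes field is empty.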
-
src.preprocess.utils.get_names(ox: pd.DataFrame)
Get the names of columns holding measure information.
These columns begin with a prefix such as “A1_”.
Parameters: ox (pd.DataFrame) – Input OXCGRT dataset.
Returns: - full_value_names (list) – The names of all columns with measure information.
- value_names (list) – The names of measure columns.
- stub_names (list) – The measure column prefixes (e.g. “A1”).
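One way to derive these three lists is to match column names against a prefix pattern, as in the sketch below. The regular expression, the `_Flag`/`_Notes` suffixes, and the interpretation that `value_names` excludes flag and notes columns are all assumptions. The documentation does not spell out how `value_names` differs from `full_value_names`.

```python
import re

import pandas as pd

# Assumption: measure columns start with a capital letter, digits, and "_".
PREFIX_PATTERN = re.compile(r"^([A-Z]\d+)_")


def get_names_sketch(ox):
    """Hypothetical stand-in for get_names."""
    full_value_names = [col for col in ox.columns if PREFIX_PATTERN.match(col)]
    # Assumption: flag and notes columns are not measure value columns.
    value_names = [
        col for col in full_value_names if not col.endswith(("_Flag", "_Notes"))
    ]
    stub_names = sorted({PREFIX_PATTERN.match(col).group(1) for col in full_value_names})
    return full_value_names, value_names, stub_names


ox = pd.DataFrame(
    columns=[
        "CountryCode",
        "Date",
        "C1_School closing",
        "C1_Flag",
        "C1_Notes",
        "H1_Public information campaigns",
        "H1_Notes",
    ]
)
full_value_names, value_names, stub_names = get_names_sketch(ox)
```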
-
src.preprocess.utils.get_row_hashes(data: pd.DataFrame) → list
Get row-wise base64-encoded hashes for a dataframe.
Parameters: data (pd.DataFrame) – Input data.
Returns: List of hashes.
Return type: list
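A minimal sketch of row-wise, base64-encoded hashing is shown below. The choice of SHA-256 and the way row values are joined into a string before hashing are assumptions. The documentation states only that the hashes are row-wise and base64 encoded.

```python
import base64
import hashlib

import pandas as pd


def get_row_hashes_sketch(data):
    """Hypothetical stand-in for get_row_hashes."""
    hashes = []
    for row in data.itertuples(index=False, name=None):
        # Assumption: a row is serialised by joining its values with "|".
        joined = "|".join(str(value) for value in row)
        digest = hashlib.sha256(joined.encode("utf-8")).digest()
        hashes.append(base64.b64encode(digest).decode("ascii"))
    return hashes


data = pd.DataFrame({"id": [1, 1, 2], "text": ["a", "a", "b"]})
row_hashes = get_row_hashes_sketch(data)
```

Rows with identical content produce identical hashes, which is what makes the hashes usable for detecting records that have already been processed.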
-
src.preprocess.utils.oxcgrt_records(ox: pd.DataFrame, dataset: str, drop_columns: list = [])
Convert OXCGRT data to a list of record dicts.
This presents an additional challenge because of the wide format of the OXCGRT data.
Parameters: - ox (pd.DataFrame) – Input OXCGRT data.
- dataset (str) – Name of provider dataset.
- drop_columns (list) – Which columns (if any) to drop.
Returns: List of record dicts.
Return type: list
-
src.preprocess.utils.split_df_by_group(data: pd.DataFrame, group: str)
Split a dataframe by group and return a named dictionary.
Parameters: - data (pd.DataFrame) – Input dataset.
- group (str) – Name of column to be used as group.
Returns: Dict of dataset slices named by group.
Return type: dict
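Splitting by group can be sketched with pandas' `groupby`, as below. The helper name and sample data are illustrative assumptions, and the real function may differ in details such as index handling.

```python
import pandas as pd


def split_df_by_group_sketch(data, group):
    """Hypothetical stand-in for split_df_by_group."""
    # Iterating a groupby yields (group value, sub-dataframe) pairs.
    return {name: part for name, part in data.groupby(group)}


data = pd.DataFrame({"dataset": ["A", "B", "A"], "value": [1, 2, 3]})
slices = split_df_by_group_sketch(data, "dataset")
```

Each key of `slices` is a distinct value of the grouping column, and each value is the slice of the dataframe for that group.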
src.preprocess.check module
-
src.preprocess.check.check_column_names(records: pd.DataFrame, config: pd.DataFrame, log: bool = True)
Check that column names agree with the config, or raise an exception.
Parameters: - records (pd.DataFrame) – Dataframe of provider data.
- config (pd.DataFrame) – Reference for accepted column names.
- log (bool) – Whether or not to log results of checks.
Return type: None
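A check of this kind could be sketched as follows. The layout of `config` (a single `column` column listing accepted names), the rule that every column in `records` must appear in the config, and the use of `ValueError` are assumptions. The documentation says only that an exception is raised when names do not agree.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_column_names_sketch(records, config, log=True):
    """Hypothetical stand-in for check_column_names."""
    # Assumption: accepted names are listed in a `column` column of the config.
    accepted = set(config["column"])
    unexpected = sorted(set(records.columns) - accepted)
    if unexpected:
        raise ValueError(f"Unexpected columns: {unexpected}")
    if log:
        logger.info("Column names agree with config.")


config = pd.DataFrame({"column": ["country", "date", "measure"]})
good = pd.DataFrame(columns=["country", "date"])
bad = pd.DataFrame(columns=["country", "notes"])

check_column_names_sketch(good, config)  # passes silently
try:
    check_column_names_sketch(bad, config)
    raised = False
except ValueError:
    raised = True
```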
-
src.preprocess.check.check_date_format(data: pd.DataFrame, config: pd.DataFrame, dataset: str, log: bool = True)
Check that an input date is in the expected format.
Parameters: - data (pd.DataFrame) – Dataframe of provider data.
- config (pd.DataFrame) – Reference for accepted date formats.
- dataset (str) – Name of provider dataset.
- log (bool) – Whether or not to log results of checks.
Return type: None
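The date check might be sketched with `pd.to_datetime` and an explicit format string, as below. The `dataset` and `date_format` columns of `config`, the `date` column of `data`, and the decision to raise `ValueError` on a mismatch are assumptions made for illustration.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_date_format_sketch(data, config, dataset, log=True):
    """Hypothetical stand-in for check_date_format."""
    # Assumption: config maps each dataset name to one strftime-style format.
    fmt = config.loc[config["dataset"] == dataset, "date_format"].iloc[0]
    # Raises ValueError if any value does not match the expected format.
    pd.to_datetime(data["date"], format=fmt, errors="raise")
    if log:
        logger.info("Dates for %s match format %s.", dataset, fmt)


config = pd.DataFrame({"dataset": ["EXAMPLE"], "date_format": ["%Y-%m-%d"]})
good = pd.DataFrame({"date": ["2020-04-01", "2020-04-02"]})
bad = pd.DataFrame({"date": ["01/04/2020"]})

check_date_format_sketch(good, config, "EXAMPLE")  # passes silently
try:
    check_date_format_sketch(bad, config, "EXAMPLE")
    raised = False
except ValueError:
    raised = True
```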
-
src.preprocess.check.check_input(records: pd.DataFrame, column_config: pd.DataFrame, date_config: pd.DataFrame, dataset: str)
Unify all input checks in a single function.
Parameters: - records (pd.DataFrame) – Dataframe of provider data.
- column_config (pd.DataFrame) – Reference for accepted column names.
- date_config (pd.DataFrame) – Reference for accepted date formats.
- dataset (str) – Name of provider dataset.