| Class | |
A container managing multiple pandas.DataFrame. |
| Function | add |
Add category label for missing values to a categorical data series. |
| Function | batch |
Read and validate multiple files at once. |
| Function | create |
Create an ICD Catalog based on data from BfArM or WHO. |
| Function | cut |
Cutting values into bins similar to pandas.cut(). |
| Function | cut |
Cutting values into a specific number of bins of the same range. |
| Function | cut |
Cut a dataframe into pieces row by row but keep groups together. |
| Function | cut |
Cut a dataframe horizontally into pieces of (nearly) the same size. |
| Function | generate |
Create labels for the associated intervals. |
| Function | parallelize |
Use multiprocessing to work on a dataframe row by row. |
| Function | read |
Read and validate a CSV file. |
| Function | read |
Read and validate an Excel file. |
| Function | reorder |
Rearrange the columns of a DataFrame. |
| Function | validate |
Validate the structure of a CSV file. |
| Function | validate |
Validate a dataframe's content and format. |
| Function | validate |
Validate if the DataFrame fits the length rules. |
| Function | validate |
Validate if columns contain unique values. |
| Function | validate |
Validate if the DataFrame fits the value rules. |
| Constant | DECREASE |
Number of workers using parallelize_job_on_dataframe() is decreased by this number. |
| Constant | FILENAME |
Prefix used for the WRONG file. See validate_csv() for details. |
| Constant | SERVER |
Machines with this number of logical cores are treated as servers or crunching machines used by multiple users. See SERVER_USE_CORE_FX for details. |
| Constant | SERVER |
When parallelizing tasks using parallelize_job_on_dataframe() on a multi-user server use only this fraction of available cores. |
| Function | _announce |
Undocumented |
| Function | _construct |
Construct a file name suitable for documenting wrong lines. |
| Function | _csv |
See validate_csv() for details. |
| Function | _explode |
Convert lists with values and ranges to ranges only. |
| Function | _file |
Use a path or an in-memory file (buffer) with a with statement. |
| Function | _name |
Replace integers less than 0 with strings of format Unnamed: {idx}. |
| Function | _parse |
Parse, validate and separate the specs_and_rules argument. |
| Function | _replace |
Replace lines and substrings in the content of a file or buffer. |
| Function | _store |
Store na values as list even if there is one value only. |
| Function | _zip |
Workaround to make pandas handle zipfile.Path. |
| Variable | _log |
Undocumented |
pandas.Series, missing_label: str, insert_at_end: bool = True) -> pandas.Series:
Add category label for missing values to a categorical data series.
| Parameters | |
data:pandas.Series | The data series (e.g. a column). |
missingstr | The new label. |
insertbool | Append label to the end (default) or in front (False). |
| Returns | |
pandas.Series | The data series with new categorical dtype attached. |
Read and validate multiple files at once.
The argument read_this is a complex dictionary containing all relevant
information about how to read and validate the files. The following example
illustrates how to use it. The element specs_and_rules is
described in bandas.read_and_validate_csv().
read_this = {
    'Foo':  # name of resulting dataframe
        {
            'file': Path('foo.csv'),
            'specs_and_rules': {...},
            'encoding': 'utf-8',  # optional
            'no_header': True,  # optional
            'skiprows': 1,  # optional
            'skipfooter': 4,  # optional
        },
    'Bar':
        {
            'file': Path('bar.xlsx'),
            'sheet_name': 'Money',  # optional
            # ...
        },
    'FromArchive':
        {
            'file': zipfile.Path('data.zip', 'inzip_file.xlsx'),
            # ...
        },
    # ...
}
| Parameters | |
readdict | A complex dictionary with dataframe names as keys. |
| Returns | |
DataContainer | A complete data container. |
| Raises | |
BaseException | See read_and_validate_csv() and
read_and_validate_excel() for details. |
Create an ICD Catalog based on data from BfArM or WHO.
The data source can be a zip file (download from BfArM or WHO) or a download URL for that zip file (e.g. icd10gm2024syst-meta.zip).
The result looks like this:
   CODE     TYP      CHAPTER  BLOCK    TEXT        CODE_TEXT
0  A00.-    CODE     01       A00-A09  Cholera..   A00.- Cholera..
1  A00.0    CODE     01       A00-A09  Cholera..   A00.0 Cholera..
2  A00-A09  BLOCK    01       A00-A09  Infektiö..  A00-A09 Infekt..
3  A15-A19  BLOCK    01       A15-A19  Tuberkul..  A15-A19 Tuberk..
4  01       CHAPTER  01                Bestimmt..  01 Bestimmte i..
5  02       CHAPTER  02                Neubildu..  02 Neubildunge..
Attention
Most of the downloads offer zip archives, but their internal structure differs. This can cause problems with this function because it doesn't know all structures yet. Please open an issue report in that case.
| Parameters | |
filepathlib.Path | str | File path or URL. |
| Returns | |
pandas.DataFrame | Data frame with columns named ??? |
list | pandas.Series, *, infinity_is_less_than: int = 0, interval_range: int = None, infinity_is_equal_or_more_than: int = None, interval_list: Iterable[ tuple] = None, labels: Iterable[ str] = None, format_string: str = '{} to {}', begin_format_string: str = 'less than {}', end_format_string: str = '{} and more', remove_unused_labels: bool = False) -> list | pandas.Series:
Cutting values into bins similar to pandas.cut().
See also cut_bins_via_interval_count() for a simplified wrapper.
There are two major methods to specify the bins or intervals.
# Specify steps using "interval_range" to get intervals of
# the same length:
cut_bins(the_values, interval_range=10)

# Or use "interval_list" to specify each interval explicitly
# with pairs of start and end values which are both included
# in that interval:
bins = [(0, 4), (5, 9), (10, 50)]
cut_bins(the_values, interval_list=bins)
The interval range arguments are ignored when interval_list is given.
Infinity starts and ends are created automatically when interval_range is used but not with interval_list. For the latter you have to specify them yourself like so:
bins = [(-math.inf, -1), (0, 9), (10, math.inf)]
cut_bins(the_values, interval_list=bins)
The wording of the labels can be modified via format_string and when using infinity ends also via begin_format_string and end_format_string. Or you can set explicit labels via labels.
| Parameters | |
values:list | pandas.Series | List of values to cut. |
infinityint | The right edge (or end) of the first infinity interval. For example if 0 the first two intervals are [(-math.inf, -1), (0, ...)]. |
intervalint | Size for each interval. |
infinityint | Value covered by the last infinity interval. For example if 100 the last two intervals are [(..., 99), (100, math.inf)]. |
intervalIterable[ | List of pairs of start and end points of
intervals. Both points are included in that interval. Overwrites
interval_range and related arguments. If one of values is
not covered in that list a ValueError is raised. |
labels:Iterable[ | Explicit list of labels to use. |
formatstr | Used for in between intervals. |
beginstr | Used for the first interval including infinity start. |
endstr | Used for the last interval including infinity end. |
removebool | Labels removed from the categories if they don't exist in the values. |
| Returns | |
list | pandas.Series | List of labels corresponding to values. |
| Raises | |
ValueError | If interval_list is used and its length isn't equal to the number of provided labels. |
ValueError | If not all values are covered by the bins/intervals (except NA values). |
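The interval_list behaviour described above (both ends included, NA values skipped, uncovered values rejected) can be illustrated with a minimal plain-Python sketch. The helper name label_values is hypothetical and this is not the implementation of cut_bins(); it only mirrors the documented rules for lists, using None as a stand-in for NA.

```python
import math


def label_values(values, interval_list, labels):
    """Map each value to the label of the closed interval containing it.

    Hypothetical helper for illustration only, not part of bandas.
    """
    if len(interval_list) != len(labels):
        raise ValueError('Number of labels and intervals differ.')

    result = []
    for value in values:
        if value is None:
            # Missing values are passed through without a label.
            result.append(None)
            continue
        for (start, end), label in zip(interval_list, labels):
            if start <= value <= end:  # both ends are included
                result.append(label)
                break
        else:
            raise ValueError(f'Value {value} is not covered by any interval.')
    return result


bins = [(-math.inf, -1), (0, 9), (10, math.inf)]
labels = ['less than 0', '0 to 9', '10 and more']
print(label_values([-5, 3, None, 42], bins, labels))
```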
list | pandas.Series, interval_range: int, interval_count: int, first_interval_begin: int = 0, **kwargs) -> list | pandas.Series:
Cutting values into a specific number of bins of the same range.
If one of the values is not covered by the resulting intervals (bins)
a ValueError is raised. NA values are an exception to that rule.
This function is a wrapper around cut_bins().
| Parameters | |
values:list | pandas.Series | List of values to cut. |
intervalint | Size for each interval. |
intervalint | Number of intervals (or bins). |
firstint | First value in the first interval. |
| **kwargs | See cut_bins() for additional arguments. |
| Returns | |
list | pandas.Series | List of labels corresponding to values. |
| Raises | |
ValueError | If not all values are covered by the bins/intervals (except NA values). |
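How a fixed number of equally sized intervals might be derived from the three arguments can be sketched as follows. This is an assumption about the semantics (integer values, both ends included), not the wrapper's actual code, and the name intervals_from_count is hypothetical.

```python
def intervals_from_count(interval_range, interval_count,
                         first_interval_begin=0):
    """Build (start, end) pairs of equal size, both ends included.

    Hypothetical helper; assumes integer values.
    """
    intervals = []
    for idx in range(interval_count):
        start = first_interval_begin + idx * interval_range
        end = start + interval_range - 1
        intervals.append((start, end))
    return intervals


print(intervals_from_count(interval_range=10, interval_count=3))
```

A list built this way has the same shape as the interval_list argument of cut_bins().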
pandas.DataFrame, group_column: str | int, n_pieces: int, sort_kind: str = 'quicksort') -> list[ pandas.DataFrame]:
Cut a dataframe into pieces row by row but keep groups together.
Groups are specified by the column name in group_column. The data will be sorted by group_column using the sort algorithm specified by sort_kind, which is passed on to pandas/numpy. The default quicksort is the fastest. For unit testing, stable should be used. The number of resulting parts is not guaranteed to be n_pieces.
| Parameters | |
data:pandas.DataFrame | The data frame that should be cut. |
groupstr | int | Name or index of the group column. |
nint | Number of resulting pieces. |
sortstr | Used for unittesting. |
| Returns | |
list[ | A list of the data frame parts. |
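Why the number of parts is not guaranteed becomes visible in a small plain-Python sketch of the idea: rows are sorted by group and a new piece is only started at a group boundary. The function split_keep_groups is hypothetical, works on lists of rows instead of a DataFrame, and is not the library's algorithm.

```python
from itertools import groupby


def split_keep_groups(rows, group_key, n_pieces):
    """Split rows into at most n_pieces lists without splitting a group.

    Hypothetical helper for illustration only.
    """
    rows = sorted(rows, key=group_key)  # sorted() is a stable sort
    target = len(rows) / n_pieces
    pieces = [[]]
    for _, members in groupby(rows, key=group_key):
        # Start a new piece once the current one reached the target size.
        if len(pieces[-1]) >= target and len(pieces) < n_pieces:
            pieces.append([])
        pieces[-1].extend(members)
    return pieces


rows = [('a', 1), ('b', 2), ('a', 3), ('c', 4), ('b', 5), ('a', 6)]
print(split_keep_groups(rows, group_key=lambda row: row[0], n_pieces=2))
```

If all rows belong to one group, the sketch returns a single piece regardless of n_pieces.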
Cut a dataframe horizontally into pieces of (nearly) the same size.
These pieces can be used for parallelization. To keep groups of rows
together there is the alternative cut_by_row_keep_group(). The number
of pieces is guaranteed.
| Parameters | |
data:pandas.DataFrame | The data frame that should be cut. |
nint | Number of resulting pieces. |
| Returns | |
list[ | A list of the data frame parts. |
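A minimal sketch of splitting into pieces of nearly the same size, assuming that sizes may differ by at most one row. The helper split_evenly is hypothetical and operates on a list; the same start and stop positions could be applied to a DataFrame via DataFrame.iloc.

```python
def split_evenly(rows, n_pieces):
    """Split rows into exactly n_pieces slices of nearly equal length.

    Hypothetical helper for illustration only.
    """
    base, extra = divmod(len(rows), n_pieces)
    pieces = []
    start = 0
    for idx in range(n_pieces):
        # The first `extra` pieces get one additional row.
        size = base + (1 if idx < extra else 0)
        pieces.append(rows[start:start + size])
        start += size
    return pieces


print(split_evenly(list(range(10)), 3))
```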
Iterable, format_string: str = '{} to {}', begin_format_string: str = 'less than {}', end_format_string: str = '{} and more') -> list[ str]:
Create labels for the associated intervals.
The function is used by cut_bins(). Two ways exist to give the intervals
in the bins argument.
Example
# Pairs of intervals (both ends included)
bins = [(0, 4), (5, 9), (10, 20)]
result = ['0 to 4', '5 to 9', '10 to 20']

# Or with infinity ends
bins = [(-math.inf, 4), (5, 9), (10, math.inf)]
result = ['less than 5', '5 to 9', '10 and more']

# As list of edges (only right end included)
bins = [0, 5, 10]
result = ['1 to 5', '6 to 10']

# Or with infinity ends
bins = [-math.inf, 0, 5, 10, math.inf]
result = ['less than 1', '1 to 5', '6 to 10', '11 and more']
| Parameters | |
bins:Iterable | The intervals as a list of bin edges or a list of start-end pairs. |
formatstr | Used for labels like "x to y". |
beginstr | Used for the first label with infinity begin. |
endstr | Used for the last label with infinity end. |
| Returns | |
list[ | A list of labels. |
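The documented examples can be reproduced with the following sketch. It is not the library's code: the name make_labels is hypothetical, and it assumes integer bins, so that the edge list [0, 5, 10] corresponds to the pairs (1, 5) and (6, 10).

```python
import math


def make_labels(bins, format_string='{} to {}',
                begin_format_string='less than {}',
                end_format_string='{} and more'):
    """Create one label per interval (hypothetical helper, integer bins)."""
    bins = list(bins)
    # A list of edges is converted to pairs; only the right edge is included.
    if bins and not isinstance(bins[0], tuple):
        bins = [(left + 1, right) for left, right in zip(bins, bins[1:])]

    labels = []
    for start, end in bins:
        if start == -math.inf:
            # "less than" names the first value NOT in the interval.
            labels.append(begin_format_string.format(end + 1))
        elif end == math.inf:
            labels.append(end_format_string.format(start))
        else:
            labels.append(format_string.format(start, end))
    return labels


print(make_labels([(-math.inf, 4), (5, 9), (10, math.inf)]))
print(make_labels([-math.inf, 0, 5, 10, math.inf]))
```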
pandas.DataFrame, worker_func: Callable, group_column: str = None, worker_args: tuple = tuple(), n_pieces: int = None, decrease_workers_by: int = None) -> pandas.DataFrame:
Use multiprocessing to work on a dataframe row by row.
A dataframe is cut into multiple dataframes. Each of them is
transferred to another process (not thread). This is fast because
multiple CPU cores are used, but it costs a lot of RAM and some time for
transferring the dataframe pieces (via pickle) to a process and back.
There are two options to cut a dataframe. By default or when using
n_pieces it is cut into pieces with nearly the same number of rows.
The function bandas.cut_into_pieces() is used in that case.
When using group_column the function bandas.cut_by_row_keep_group()
will be used.
Just return the result of the worker:
def the_worker(sub_dataframe):
    sub_dataframe.foo = 7
    return sub_dataframe
To pass additional arguments to the worker, use the worker_args argument and give the values in a tuple.
def the_worker(columns, sub_dataframe):
    sub_dataframe['Extra'] = sub_dataframe.loc[:, columns].apply(
        lambda row: row * 7, axis=1)
    return sub_dataframe

if __name__ == '__main__':
    result = bandas.parallelize_job_on_dataframe(
        data=df,
        worker_func=the_worker,
        group_column='group',
        worker_args=(['colA', 'colD', 'colT'], )
    )
| Parameters | |
data:pandas.DataFrame | The dataframe. |
workerCallable | A function used in each process. |
groupstr | Column whose groups are kept together while cutting the dataframe. |
workertuple | Tuple of arguments used in the worker function. |
nint | Number of pieces the dataframe should be cut into. |
decreaseint | Reduce the number of cores to use by value (default: 0). |
| Returns | |
pandas.DataFrame | The resulting dataframe. |
pathlib.Path | zipfile.Path, specs_and_rules: dict, *, no_header_line: bool = False, encoding: str = 'utf-8', delimiter: str = ';', replace_lines: dict[ str, str] = None, replace_substrings: dict[ str, str] = None, on_bad_lines: str | Callable = 'error', **kwargs) -> pandas.DataFrame:
Read and validate a CSV file.
Read a CSV file with respect to specifications about format and
rules about valid values and return a pandas.DataFrame.
You have to give specifications for all existing columns in the correct
order. The following aspects can be specified:
- Columns to read and columns to ignore.
- The data type of a column. Types from pandas, numpy or Python's builtins are valid.
- Missing values. They are converted to pandas.NA in the resulting data frame.
- Length of a data field in the raw CSV file.
- Valid values in a column (checked in the resulting data frame).
- Ignoring a column is also a specification.
Example
Here you see a complex example with all possible options.
- ColumnA is of type str.
- ColumnB is of type str and the value no answer is treated as missing (pandas.NA).
- ColumnC exists in the *.csv file but will be ignored while reading and won't be part of the resulting data frame.
- ColumnD is of type pandas.Int16Dtype. The value -9 is a missing value. The field length can be 1, 2 or 4 to 8. Possible or valid values are 0, 1, 3 to 9 and the missing value -9. The column needs to have unique values.
- The 5th column (named -1) has no name in the header line. The name Unnamed: 4 will be used for it.
specs_and_rules = {
    'ColumnA': 'str',
    'ColumnB': ('str', 'no answer'),
    'ColumnC': None,
    'ColumnD': (
        'Int16',
        -9,
        {
            'len': [1, 2, (4, 8)],
            'val': [0, 1, (3, 9)],
            'unique': True,
        }
    ),
    -1: 'int',
}
Example
Here we expect a CSV file with three columns, but only ColumnB is in the resulting data frame and the others are ignored while reading. Although the third column ColumnC is not contained in the result, its content will be validated with a val-rule.
specs_and_rules = {
    'ColumnA': None,
    'ColumnB': 'int',
    'ColumnC': (None, None, {'val': [1]}),
}
Important
Do not use objects of type type when specifying the column type.
For example when the column is a string use "str" instead
of str.
Hint
To pass arguments through to pandas.read_csv() the **kwargs can
be used. For example skiprows or skipfooter.
Hint
The file_path can also be of type zipfile.Path to specify an
entry in a ZIP file.
| Parameters | |
filepathlib.Path | zipfile.Path | Path to the CSV file to read from. |
specsdict | A dictionary indexed by column name. |
nobool | Indicates if the first line contains column names. |
encoding:str | Optional encoding type used for reading the CSV file. |
delimiter:str | Delimiter to separate the fields. |
replacedict[ | Replace dictionary to replace complete lines. |
replacedict[ | Replace dictionary to replace sub strings. |
onstr | Callable | See pandas.read_csv() for details. |
| **kwargs | Used to handover arguments to pandas.read_csv(). |
| Returns | |
pandas.DataFrame | The resulting data frame. |
pathlib.Path | zipfile.Path, specs_and_rules: dict, *, no_header_line: bool = False, **kwargs) -> pandas.DataFrame:
Read and validate an Excel file.
Read an Excel file with respect to specifications about format and
rules about valid values and return a pandas.DataFrame.
You have to give specifications for all existing columns in the correct
order. But ignoring a column is also a specification. See
read_and_validate_csv() for details and examples about usage of
specs_and_rules.
Warning
If missing values are specified via specs_and_rules, their length is ignored by len rules. Missing values (.isna()) are not considered in length rules when reading from Excel files.
Tip
To pass arguments through to pandas.read_excel() the **kwargs can
be used. For example sheet_name, skiprows or skipfooter.
pandas.DataFrame, this_columns: list[ str] | str, behind_this_column: str = None) -> pandas.DataFrame:
Rearrange the columns of a DataFrame.
The columns named in this_columns are moved behind the column named via behind_this_column.
| Parameters | |
dataframe:pandas.DataFrame | The complete data frame. |
thislist[ | List of names or one name of column(s) to move. |
behindstr | Name of column before the insertion position. |
| Returns | |
pandas.DataFrame | The new ordered data frame. |
| Raises | |
AttributeError | Column names are not unique. |
KeyError | A column to move does not exist. |
ValueError | The behind column does not exist. |
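The reordering itself can be sketched on a plain list of column names. The helper reorder_names is hypothetical and only illustrates the idea; the resulting list could then be used to select the columns of a DataFrame in the new order.

```python
def reorder_names(columns, this_columns, behind_this_column):
    """Return column names with this_columns moved behind another column.

    Hypothetical helper for illustration only.
    """
    if isinstance(this_columns, str):
        this_columns = [this_columns]

    missing = [name for name in this_columns if name not in columns]
    if missing:
        raise KeyError(f'Columns to move do not exist: {missing}')

    remaining = [name for name in columns if name not in this_columns]
    # list.index() raises ValueError if the behind column does not exist.
    position = remaining.index(behind_this_column) + 1
    return remaining[:position] + list(this_columns) + remaining[position:]


print(reorder_names(['a', 'b', 'c', 'd'], ['d', 'a'], 'b'))
```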
pathlib.Path | zipfile.Path, *, columns: list[ str] | int, delimiter: str = ';', quoting: int = csv.QUOTE_MINIMAL, encoding: str = 'utf-8', field_length_rules: dict = None, stop_after_n_errors: int = 50, skiprows: int = 0, skipfooter: int = 0, nrows: int = None, replace_lines: dict[ str, str] = None, replace_substrings: dict[ str, str] = None) -> bool:
Validate the structure of a CSV file.
See read_and_validate_csv() for more details. Malformed or invalid rows
are logged to a file (*.wrong.csv). The need for that function arises
from the fact that the checks of pandas.read_csv() are quite sluggish.
Note
Development notes: Maybe move that function to buhtzology.misc because there are no pandas dependencies in it.
| Parameters | |
filepathlib.Path | zipfile.Path | Path and name of the csv file to check. |
columns:list[ | Names or count of expected columns. |
delimiter:str | Field delimiter. |
quoting:int | Quoting dialect. |
encoding:str | Used while reading the file. |
fielddict | Dict indexed by column index with a list of valid field lengths. |
stopint | Stop checking for further errors or rule violations when this number is reached. |
skiprows:int | Skipping n rows from the beginning. |
skipfooter:int | Skipping n rows from the end. |
nrows:int | Read nrows from beginning (including header) after skipping. |
replacedict[ | Replace dictionary to replace complete lines. |
replacedict[ | Replace dictionary to replace sub strings. |
| Returns | |
bool | True if everything is fine. Otherwise an exception is raised. |
| Raises | |
ValueError | If the header or one or more lines do not fit the rules. |
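A structural check of this kind needs only the standard csv module. The sketch below is an assumption about how such a check could look, not the function's actual code; check_csv_structure and its return format are hypothetical, and it collects violations instead of writing a *.wrong.csv file.

```python
import csv
import io


def check_csv_structure(buffer, n_columns, field_length_rules=None,
                        delimiter=';'):
    """Collect (line number, problem) pairs for malformed rows.

    Hypothetical helper for illustration only.
    """
    field_length_rules = field_length_rules or {}
    violations = []
    reader = csv.reader(buffer, delimiter=delimiter)
    next(reader, None)  # skip the header line
    for line_number, row in enumerate(reader, start=2):
        if len(row) != n_columns:
            violations.append((line_number, 'column count'))
            continue
        for col_idx, valid_lengths in field_length_rules.items():
            if len(row[col_idx]) not in valid_lengths:
                violations.append((line_number, f'length of field {col_idx}'))
    return violations


content = io.StringIO('A;B\n1;xx\n22;y;z\n333;yy\n')
print(check_csv_structure(content, n_columns=2, field_length_rules={0: [1, 2]}))
```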
pandas.DataFrame, expected_columns: Iterable, ignored_columns: Iterable, value_rules: dict[ Hashable, Iterable], unique_columns: Iterable[ str]):
Validate a dataframe's content and format.
If the dataframe is valid nothing happens; no return value. If something is invalid a TypeError is raised.
The expected_columns is allowed to have negative integers. They are
replaced using _name_unnamed_columns().
| Parameters | |
dfpandas.DataFrame | The dataframe to perform the validation on. |
expectedIterable | List of column names that should exist. |
ignoredIterable | List of columns excluded from the validation. |
valuedict[ | See read_and_validate_csv() for details. |
uniqueIterable[ | Names of columns that should have unique values. |
| Raises | |
TypeError | When the dataframe does not fit the rules. |
pandas.DataFrame, len_rules: dict[ int, list[ int]], ignore_na: bool = False):
Validate if the DataFrame fits the length rules.
Validate if the DataFrame (df_to_check) fits the len_rules.
Otherwise a ValueError is raised.
| Parameters | |
dfpandas.DataFrame | Dataframe to validate. |
lendict[ | Dictionary with column index (not name) as keys and a list of valid lengths as values. |
ignorebool | Missing values are not considered. |
| Raises | |
ValueError | If the length rules are not met. |
pandas.DataFrame, unique_columns: Iterable[ str | int]):
Validate if columns contain unique values.
| Parameters | |
dfpandas.DataFrame | Dataframe to validate. |
uniqueIterable[ | List of column names. |
| Raises | |
ValueError | If a column does not contain unique values. |
Validate if the DataFrame fits the value rules.
Validate if the DataFrame (df_to_check) fits the value_rules.
Otherwise a ValueError is raised.
| Parameters | |
dfpandas.DataFrame | Dataframe to validate. |
valuedict | Column name indexed dict with valid values. |
| Raises | |
ValueError | If the value rules are not met. |
TypeError | If value rules are missing. |
Number of workers using parallelize_job_on_dataframe() is decreased by
this number.
| Value |
|
Machines with this number of logical cores are treated as servers or
crunching machines used by multiple users. See SERVER_USE_CORE_FX for
details.
| Value |
|
When parallelizing tasks using parallelize_job_on_dataframe() on a
multi-user server use only this fraction of available cores.
This prevents taking all cores and blocking other users. See
SERVER_N_CORES for details.
| Value |
|
pathlib.Path, violations: dict, delimiter: str, quoting: int, encoding: str, real_header: list[ str], count_err_msg: str):
Undocumented
pathlib.Path | zipfile.Path | io.IOBase) -> pathlib.Path:
Construct a file name suitable for documenting wrong lines.
- Example 1:
- Based on /home/user/data.csv the path /home/user/_WRONG_data.csv will be returned.
- Example 2:
- Based on /foo.zip/foo/bar/data.csv the path /_WRONG_foo.zip.foo_bar_data.csv will be returned.
| Parameters | |
filepathlib.Path | zipfile.Path | io.IOBase | Path object which is used as a base for construction. |
| Returns | |
pathlib.Path | The new file path. |
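Example 1 can be reproduced with pathlib alone. The sketch below covers only plain file paths, not entries inside a ZIP archive; the function name wrong_file_path is hypothetical and the prefix _WRONG_ is taken from the examples above, not from the value of the FILENAME constant.

```python
from pathlib import PurePosixPath


def wrong_file_path(file_path, prefix='_WRONG_'):
    """Return the path of the file documenting wrong lines.

    Hypothetical helper; handles plain paths only.
    """
    file_path = PurePosixPath(file_path)
    # Keep the folder and put the prefix in front of the file name.
    return file_path.with_name(prefix + file_path.name)


print(wrong_file_path('/home/user/data.csv'))
```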
list[ str], rules: dict[ int, list[ int]], columns: list[ str]) -> bool:
See validate_csv() for details.
Convert lists with values and ranges to ranges only.
Used by _parse_specs_and_rules(). Ranges are specified as tuples with
two elements. Ranges are only allowed with type int.
| Parameters | |
rules:list | A list mixed with single values and value ranges as tuples. E.g. [1, 2, (4, 8), 11] |
| Returns | |
list | A list with single values only. E.g. [1, 2, 4, 5, 6, 7, 8, 11] |
| Raises | |
ValueError | If a range is invalid, e.g. from 5 to 2. |
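The conversion described in the table can be sketched in a few lines. The name explode_ranges is hypothetical and the code is only an illustration of the documented input and output, not the private function itself.

```python
def explode_ranges(rules):
    """Expand (start, end) tuples into single integer values.

    Hypothetical helper for illustration only.
    """
    result = []
    for item in rules:
        if isinstance(item, tuple):
            start, end = item
            if start > end:
                raise ValueError(f'Invalid range from {start} to {end}.')
            # Both ends of the range are included.
            result.extend(range(start, end + 1))
        else:
            result.append(item)
    return result


print(explode_ranges([1, 2, (4, 8), 11]))
```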
def _file_path_as_buffer(path_or_buffer:
pathlib.Path | io.IOBase, mode: str = 'r', encoding: str = 'utf-8', replace_lines: dict[ str, str] = None, replace_substrings: dict[ str, str] = None) -> io.IOBase:
Use a path or an in-memory file (buffer) with a with statement.
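A common way to accept both a path and an already open buffer is a small context manager. The following sketch is an assumption about the general pattern, not the private function's code; open_path_or_buffer is a hypothetical name and the replace arguments are left out.

```python
import contextlib
import io
import pathlib


@contextlib.contextmanager
def open_path_or_buffer(path_or_buffer, mode='r', encoding='utf-8'):
    """Yield a file-like object for a path or an existing buffer.

    Hypothetical helper for illustration only.
    """
    if isinstance(path_or_buffer, io.IOBase):
        # The caller owns the buffer, so it is not closed here.
        yield path_or_buffer
    else:
        path = pathlib.Path(path_or_buffer)
        with path.open(mode, encoding=encoding) as handle:
            yield handle


with open_path_or_buffer(io.StringIO('a;b\n')) as handle:
    print(handle.read())
```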
Iterable, rules: dict, unique: list) -> tuple[ list, dict, list]:
Replace integers less than 0 with strings of format Unnamed: {idx}.
Check for integers less than 0 and replace them with strings of format
Unnamed: {idx}. It is used in validate_dataframe().
| Parameters | |
columns:Iterable | A list of column names including unnamed columns marked with negative integers. |
rules:dict | Considered while renaming and returned as a result. |
unique:list | Considered while renaming and returned as a result. |
| Returns | |
tuple[ |
|
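The renaming of the column list can be sketched as follows, assuming that {idx} is the position of the column, as in the read_and_validate_csv() example where the 5th column becomes Unnamed: 4. The helper name_unnamed is hypothetical and ignores the rules and unique arguments.

```python
def name_unnamed(columns):
    """Replace negative integers with 'Unnamed: {idx}' by position.

    Hypothetical helper for illustration only.
    """
    return [
        f'Unnamed: {idx}' if isinstance(name, int) and name < 0 else name
        for idx, name in enumerate(columns)
    ]


print(name_unnamed(['ColumnA', 'ColumnB', 'ColumnC', 'ColumnD', -1]))
```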
dict, ignore_na_lengths: bool = False) -> tuple[ list, dict, dict, dict, dict]:
Parse, validate and separate the specs_and_rules argument.
Columns without names are marked with negative integers. They are renamed
based on pandas behavior into Unnamed: 1 where the number is the
index of the column.
It is used in read_and_validate_csv().
| Parameters | |
specsdict | The dictionary to process. |
ignorebool | Don't add length of na values to len rules. |
| Returns | |
| A dictionary with six elements | columns, ignored columns, dtypes, missing values, length rules and value rules. |
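How a specs_and_rules dictionary could be separated into its parts is sketched below. This is a simplified assumption based on the examples in read_and_validate_csv(), not the private function: the name split_specs is hypothetical, length and value rules are kept together, and nothing is validated.

```python
def split_specs(specs_and_rules):
    """Separate column specs into columns, ignored, dtypes, NA and rules.

    Hypothetical helper for illustration only.
    """
    columns, ignored, dtypes, na_values, rules = [], [], {}, {}, {}
    for name, spec in specs_and_rules.items():
        columns.append(name)
        if not isinstance(spec, tuple):
            spec = (spec,)
        # Pad the spec to (dtype, missing value, rules).
        dtype, missing, rule = (spec + (None, None))[:3]
        if dtype is None:
            ignored.append(name)
        else:
            dtypes[name] = dtype
        if missing is not None:
            # Store NA values as a list even if there is only one value.
            na_values[name] = missing if isinstance(missing, list) else [missing]
        if rule:
            rules[name] = rule
    return columns, ignored, dtypes, na_values, rules


specs = {
    'ColumnA': None,
    'ColumnB': ('str', 'no answer'),
    'ColumnC': (None, None, {'val': [1]}),
}
print(split_specs(specs))
```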
io.IOBase, replace_lines: dict[ str, str], replace_substrings: dict[ str, str]) -> io.IOBase:
Replace lines and substrings in the content of a file or buffer.
Lines are replaced only as complete lines and are not treated as substrings. Line endings (e.g. \n) need to be part of the line to replace. Substrings are searched for via the in operator. Line replacement has priority.
| Parameters | |
buffer:io.IOBase | The buffer (file-like object) to read content from. |
replacedict[ | Indexed by complete lines to replace. |
replacedict[ | Indexed by substrings replaced in lines. |
| Returns | |
io.IOBase | A buffer (file-like object) rewound to position 0. |
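The replacement rules can be illustrated with io.StringIO. In this sketch "priority" is read as: a line that was replaced completely is not searched for substrings afterwards. That reading is an assumption, and replace_in_buffer is a hypothetical name, not the private function.

```python
import io


def replace_in_buffer(buffer, replace_lines=None, replace_substrings=None):
    """Return a new text buffer with lines and substrings replaced.

    Hypothetical helper for illustration only.
    """
    replace_lines = replace_lines or {}
    replace_substrings = replace_substrings or {}
    result = io.StringIO()
    for line in buffer:
        if line in replace_lines:
            # Complete-line replacement has priority.
            line = replace_lines[line]
        else:
            for old, new in replace_substrings.items():
                if old in line:
                    line = line.replace(old, new)
        result.write(line)
    result.seek(0)  # rewind so the caller can read from the start
    return result


source = io.StringIO('a;b\nbad line\nx;N/A\n')
fixed = replace_in_buffer(
    source,
    replace_lines={'bad line\n': 'good;line\n'},
    replace_substrings={'N/A': ''},
)
print(fixed.getvalue())
```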
Store na values as list even if there is one value only.
Called by _parse_specs_and_rules().
Missing values are specified in specs and stored in NA_VALUES
indexed by column_name.
Workaround to make pandas handle zipfile.Path.
Pandas cannot handle zipfile.Path instances. See: https://github.com/pandas-dev/pandas/issues/49906 Here we open it as a byte stream.
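The idea of the workaround can be sketched with the standard library: the ZIP entry is opened in binary mode and its content is copied into an in-memory byte buffer, which is a file-like object of the kind pandas readers accept. The helper zip_entry_as_buffer is hypothetical and the sketch assumes Python 3.9 or newer, where zipfile.Path.open() supports the mode 'rb'.

```python
import io
import zipfile


def zip_entry_as_buffer(zip_path):
    """Read a zipfile.Path entry into an in-memory byte buffer.

    Hypothetical helper for illustration only.
    """
    with zip_path.open('rb') as handle:
        return io.BytesIO(handle.read())


# Build a small archive in memory to demonstrate the helper.
archive = io.BytesIO()
with zipfile.ZipFile(archive, 'w') as zf:
    zf.writestr('data.csv', 'A;B\n1;2\n')

entry = zipfile.Path(archive, at='data.csv')
buffer = zip_entry_as_buffer(entry)
print(buffer.getvalue())
```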