buhtzology.analy

module documentation


Generate output for analysis in form of tables and figures.

For tables, pandas.DataFrame is used. For figures, seaborn or other matplotlib-based packages are used. The results are intended to be used by report.

Function	`aggregate`	Summarize data using aggregating functions.
Function	`frequencies`	Frequencies for one variable.
Function	`frequencies_and_percentages`	Frequencies and percentages for one variable only.
Function	`percentages`	Percentages for one variable.
Function	`summarize`	Create a summary table with frequencies and percentages for multiple variables using their names and values.
Constant	`DEFAULT_AGGFUNC_LABELS`	Labels used by default in `aggregate()`.
Constant	`INDENTION_DEFAULT`	Spaces to indent, e.g. when indenting row labels (indexes).
Constant	`LABEL_TOTAL`	Default label used for total columns and rows.
Function	`_generic_frequency_or_fraction`	Frequencies or percentages for one variable.
Function	`_join_columns`	Join two columns into one.
Function	`_total_label`	Helper function to determine the correct value for the total column depending on the given argument value.
Variable	`_log`	Handle to the logger.

def aggregate(data: pandas.DataFrame, values_in: Union[str, list, Dict[str, str]], aggfunc: Union[Callable, Iterable[Callable], Dict[Callable, str]], group_by: str = None, total: Union[bool, str] = True, round_digits: int = None) -> pandas.DataFrame:

Summarize data using aggregating functions.

Example output from the titanic dataset as markdown, showing the sum and mean of age:

|     |     Sum |    Mean |
|:----|--------:|--------:|
| age | 21205.2 | 29.6991 |

Adding fare costs:

|      |     Sum |    Mean |
|:-----|--------:|--------:|
| age  | 21205.2 | 29.6991 |
| fare | 28693.9 | 32.2042 |

Mean (with customized label) for age and fare, grouped by gender, with values rounded:

|      | ('female', 'MD') | ('male', 'MD') | ('Total', 'MD') |
|:-----|-----------------:|---------------:|----------------:|
| age  |               28 |             31 |              30 |
| fare |               44 |             26 |              32 |

About values_in: This argument can be a string, a list of strings or a dictionary. Use a string to specify one column in the data or use a list of strings to specify multiple columns. To customize labels in the resulting data frame use a dictionary indexed by the column names with the customized labels as values.
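The three accepted forms of values_in can be thought of as shorthand for one mapping of column name to label. The following is only a sketch of that idea; `normalize_values_in` is a hypothetical helper written for this illustration and is not part of the module.

```python
from typing import Dict, List, Union


def normalize_values_in(
        values_in: Union[str, List[str], Dict[str, str]]) -> Dict[str, str]:
    """Return a mapping of column name to display label.

    Hypothetical helper for illustration; not taken from the module.
    """
    if isinstance(values_in, str):
        # A single column, labelled with its own name
        return {values_in: values_in}

    if isinstance(values_in, dict):
        # Column names as keys, customized labels as values
        return dict(values_in)

    # Any other iterable of column names
    return {name: name for name in values_in}


print(normalize_values_in('age'))
print(normalize_values_in(['age', 'fare']))
print(normalize_values_in({'age': 'Age (years)', 'fare': 'Fare'}))
```

Reducing every form to a dictionary first keeps the later labelling step independent of how the caller spelled the argument.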

About aggfunc: Similar to values_in, this argument can be a single function, a list of functions or a dictionary. The term function refers to a Callable. By default, the labels used for aggregating functions in the resulting table are read from DEFAULT_AGGFUNC_LABELS. To add more or customized labels, use a dict indexed by functions with the customized labels as values.
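The label lookup described for aggfunc might work roughly as sketched here. `DEFAULT_LABELS` is a small stand-in for DEFAULT_AGGFUNC_LABELS, `resolve_aggfunc_labels` is a hypothetical name, and falling back to the function's `__name__` for unknown functions is an assumption of this sketch, not documented behavior.

```python
import statistics
from typing import Callable, Dict

# Assumed stand-in for the module's DEFAULT_AGGFUNC_LABELS (subset only)
DEFAULT_LABELS = {
    statistics.mean: 'Mean',
    statistics.median: 'Median',
    statistics.stdev: 'SD',
    sum: 'Sum',
}


def resolve_aggfunc_labels(aggfunc) -> Dict[Callable, str]:
    """Map each aggregating function to the label shown in the table.

    Hypothetical helper for illustration; not taken from the module.
    """
    if isinstance(aggfunc, dict):
        # Caller supplied customized labels: use them as given
        return dict(aggfunc)

    if callable(aggfunc):
        # A single function is treated like a list with one entry
        aggfunc = [aggfunc]

    # Assumption: unknown functions are labelled with their own name
    return {func: DEFAULT_LABELS.get(func, func.__name__) for func in aggfunc}


print(resolve_aggfunc_labels(sum))
print(resolve_aggfunc_labels({statistics.mean: 'MD'}))
```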

Parameters
data:`pandas.DataFrame`	Data frame with raw data.
values_in:`Union[str, list, Dict[str, str]]`	Specify columns in `data` to aggregate. See details above.
aggfunc:`Union[Callable, Iterable[Callable], Dict[Callable, str]]`	Aggregate function(s) to use. See details above.
group_by:`str`	Optional column in `data` to group the values by.
total:`Union[bool, str]`	Add total column (default: `True`) with optionally customized label when using a string instead of `True`.
round_digits:`int`	Number of digits to round values in all cells.
Returns
`pandas.DataFrame`	Resulting table as `pandas.DataFrame`.
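For orientation, tables of the same shape as the examples above can be produced with plain pandas. This sketch does not call `aggregate()` and makes no claim about its implementation. The four-row data frame is made up for the illustration.

```python
import pandas

# Made-up sample data (not the titanic dataset)
df = pandas.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [10.0, 70.0, 10.0, 50.0],
    'sex': ['male', 'female', 'female', 'female'],
})

# One row per value column, one column per aggregating function
tab = df[['age', 'fare']].agg(['sum', 'mean']).T
tab.columns = ['Sum', 'Mean']
print(tab)

# Means grouped by a column, plus a total column, rounded to 0 digits
grouped = df.groupby('sex')[['age', 'fare']].mean().T
grouped['Total'] = df[['age', 'fare']].mean()
grouped = grouped.round(0)
print(grouped)
```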

def frequencies(data: pandas.Series, dropna: bool = False, result_label: str = 'n', index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame:

Frequencies for one variable.

Example output (markdown style):

| Gender   |   n |
|:---------|----:|
| Female   | 101 |
| Male     |  99 |

See frequencies_and_percentages() for more examples and details.

Parameters
data:`pandas.Series`	A series of values or a data frame's column.
dropna:`bool`	Ignore missing values or not.
result_label:`str`	The label used for the result column.
index:`Union[bool, str]`	Modify row labels. See `frequencies_and_percentages()` for details.
data_label:`str`	Overwrite explanatory data label. See `frequencies_and_percentages()` for details.
Returns
`pandas.DataFrame`	The frequencies table as a data frame.

def frequencies_and_percentages(data: pandas.Series, dropna: bool = False, result_labels: Tuple[str] = ('n', '%'), combine: str = '{} ({})', reverse_columns: bool = False, index: Union[bool, str] = False, data_label: str = None, sort_index: bool = None, round_percentage_digits: int = 2) -> pandas.DataFrame:

Frequencies and percentages for one variable only.

It is a wrapper around frequencies() and percentages().

By default, the index (the row labels) contains the unique values found in data (e.g. female and male). Additionally, an explanatory name (e.g. Gender) can be added to the index in two flavors. If index is True, an indented index is created with row labels indented by four blank spaces (see INDENTION_DEFAULT). The same happens when index == 'indented'. If index == 'multi', a pandas.MultiIndex is created with the explanatory name in the first level and the value labels in the second.
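The two index flavors can be imitated with plain pandas, as sketched here. `INDENT` and `NAME` are names chosen for this illustration. The value 4 mirrors the four blank spaces mentioned above and is not read from the module.

```python
import pandas

INDENT = 4       # assumption: mirrors the four spaces described above
NAME = 'Gender'  # explanatory name to add to the index

tab = pandas.DataFrame({'n': [577, 314]}, index=['male', 'female'])

# Indented flavor: one header row carrying the name, value labels indented
header = pandas.DataFrame({'n': ['']}, index=[NAME])
indented = pandas.concat([header, tab])
indented.index = [NAME] + [' ' * INDENT + label for label in tab.index]
print(indented)

# Multi flavor: name in the first level, value labels in the second
multi = tab.copy()
multi.index = pandas.MultiIndex.from_product([[NAME], tab.index])
print(multi)
```

The indented flavor keeps a flat index, which is convenient for export formats without multi-level row labels. The multi flavor keeps name and value separately addressable.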

Output example from titanic dataset as markdown:

|       |   n |       % |
|:------|----:|--------:|
| man   | 869 | 66.0334 |
| women | 447 | 33.9666 |

Example with index='multi':

|                  |   n |       % |
|:-----------------|----:|--------:|
| ('sex', 'man')   | 869 | 66.0334 |
| ('sex', 'women') | 447 | 33.9666 |

Example with customized variable label (data_label='Gender'):

|                     |   n |       % |
|:--------------------|----:|--------:|
| ('Gender', 'man')   | 869 | 66.0334 |
| ('Gender', 'women') | 447 | 33.9666 |

Example with customized variable label (index=True, data_label='Gender'):

|            | n (%)         |
|:-----------|:--------------|
| Gender     |               |
|     male   | 577.0 (64.76) |
|     female | 314.0 (35.24) |

Parameters
data:`pandas.Series`	A series of values or dataframe column.
dropna:`bool`	Ignore missing values or not.
result_labels:`Tuple[str]`	The labels used for the result columns.
combine:`str`	Combine the two columns into one (e.g. `"n (%)"`).
reverse_columns:`bool`	Order of columns in resulting table.
index:`Union[bool, str]`	Add `data`'s name to the index and/or specify the index kind as `MultiIndex` (value: `multi`) or indented (value: `indented`). See details above.
data_label:`str`	Replace `data`'s name by this label.
sort_index:`bool`	Sorting the row index (e.g. by ordered categories).
round_percentage_digits:`int`	Round percentage values to n digits.
Returns
`pandas.DataFrame`	A table as a data frame.

def percentages(data: pandas.Series, dropna: bool = False, result_label: str = '%', index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame:

Percentages for one variable.

Example output:

|       |       % |
|:------|--------:|
| man   | 66.0334 |
| women | 33.9666 |

See frequencies_and_percentages() for more examples and details.

Parameters
data:`pandas.Series`	A series of values or data frame's column.
dropna:`bool`	Ignore missing values or not.
result_label:`str`	The label used for the result column.
index:`Union[bool, str]`	Modify row labels. See `frequencies_and_percentages()` for details.
data_label:`str`	Overwrite explanatory data label. See `frequencies_and_percentages()` for details.
Returns
`pandas.DataFrame`	The percentages table as a data frame.

def summarize(data: pandas.DataFrame, values_in: Union[str, Iterable[str], Dict[str, str]], frequency: Union[bool, Tuple[bool, bool]] = (True, False), combine: bool = True, group_by: str = None, total: Union[bool, str] = True) -> pandas.DataFrame:

Create a summary table with frequencies and percentages for multiple variables using their names and values.

Simple output example using titanic dataset:

>>> analy.summarize(df, 'sex')
                  n (%)
Total       891 (100.0)
sex
    male    577 (64.76)
    female  314 (35.24)

Example with more variables:

>>> analy.summarize(df, ['sex', 'class', 'alive'])
                  n (%)
Total       891 (100.0)
sex
    male    577 (64.76)
    female  314 (35.24)
class
    Third   491 (55.11)
    First   216 (24.24)
    Second  184 (20.65)
alive
    no      549 (61.62)
    yes     342 (38.38)

Using argument group_by with multiple variables:

                  Betazed        Bajor        Trill        Total
                    n (%)        n (%)        n (%)        n (%)
Total         117 (100.0)  127 (100.0)  105 (100.0)  349 (100.0)
Person
    Sarek      29 (24.79)   14 (11.02)    19 (18.1)   62 (17.77)
    Diana      25 (21.37)   33 (25.98)   20 (19.05)   78 (22.35)
    Picard      22 (18.8)   26 (20.47)   22 (20.95)   70 (20.06)
    Worf        22 (18.8)   29 (22.83)   26 (24.76)   77 (22.06)
    Quark      19 (16.24)   25 (19.69)   18 (17.14)   62 (17.77)
Gender
    diverse    29 (24.79)    32 (25.2)   24 (22.86)   85 (24.36)
    female     25 (21.37)   20 (15.75)   25 (23.81)   70 (20.06)
    androgyn   24 (20.51)   23 (18.11)   17 (16.19)   64 (18.34)
    unknown    20 (17.09)   25 (19.69)   17 (16.19)   62 (17.77)
    male       19 (16.24)   27 (21.26)   22 (20.95)   68 (19.48)

Use the argument frequency to control whether frequencies and/or percentages are used, and in which order.

# Default
>>> analy.summarize(df, 'class')
                n (%)
Total       891 (100.0)
class
    Third   491 (55.11)
    First   216 (24.24)
    Second  184 (20.65)

# Percentage only
>>> analy.summarize(df, 'class', frequency=False)
                    %
Total           100.0
class
    Third   55.106622
    First   24.242424
    Second  20.650954

# Frequency only
>>> analy.summarize(df, 'class', frequency=True)
            n
Total       891
class
    Third   491
    First   216
    Second  184

# Reverse order of frequencies and percentages
>>> analy.summarize(df, 'class', frequency=(False, True))
                % (n)
Total       100.0 (891)
class
    Third   55.11 (491)
    First   24.24 (216)
    Second  20.65 (184)

Parameters
data:`pandas.DataFrame`	The dataframe.
values_in:`Union[str, Iterable[str], Dict[str, str]]`	List of column names in the data to summarize.
frequency:`Union[bool, Tuple[bool, bool]]`	Control if frequencies and/or percentages used.
combine:`bool`	Combine frequencies and percentages into one column.
group_by:`str`	A column in `data`; for each of its values a new column is added to the resulting table.
total:`Union[bool, str]`	Add total column (default: `True`) with optionally customized label when using a string instead of `True`.
Returns
`pandas.DataFrame`	Resulting table as a data frame.
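Judging from the examples above, `True` in the frequency argument stands for frequencies, `False` for percentages, and the position in the tuple gives the column order. A sketch of that reading follows. `columns_from_frequency` is a hypothetical helper and the rule itself is inferred from the examples, not confirmed by the source code.

```python
from typing import List, Tuple, Union


def columns_from_frequency(
        frequency: Union[bool, Tuple[bool, bool]] = (True, False)
) -> List[str]:
    """Translate the ``frequency`` argument into ordered column labels.

    Hypothetical helper; the mapping is inferred from the documented
    examples (True -> 'n', False -> '%').
    """
    if isinstance(frequency, bool):
        # A single bool selects exactly one kind of column
        frequency = (frequency,)

    return ['n' if flag else '%' for flag in frequency]


for arg in [(True, False), False, True, (False, True)]:
    print(arg, columns_from_frequency(arg))
```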

DEFAULT_AGGFUNC_LABELS

Labels used by default in aggregate().

Value

{statistics.mean: 'Mean',
 pandas.Series.mean: 'Mean',
 statistics.median: 'Median',
 pandas.Series.median: 'Median',
 statistics.stdev: 'SD',
 pandas.Series.std: 'SD',
 sum: 'Sum',
...

INDENTION_DEFAULT: int

Spaces to indent, e.g. when indenting row labels (indexes).

Value

LABEL_TOTAL: str

Default label used for total columns and rows.

Value

'Total'

def _generic_frequency_or_fraction(data: pandas.Series, dropna: bool, normalize: bool, column_label: str, index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame:

Frequencies or percentages for one variable.

It is the core function used by frequencies(), percentages() and frequencies_and_percentages().

Example:

>>> tab = _generic_frequency_or_fraction(
...     df.sex, False, False, 'Count', False)
>>> print(tab.to_markdown())
|       |   Count |
|:------|--------:|
| man   |     869 |
| women |     447 |

>>> tab = _generic_frequency_or_fraction(
...     df.sex, False, True, 'perc', True)
>>> print(tab.to_markdown())
|                  |     perc |
|:-----------------|---------:|
| ('sex', 'man')   | 0.660334 |
| ('sex', 'women') | 0.339666 |

See frequencies_and_percentages() for more examples and details.

Parameters
data:`pandas.Series`	A series of values or dataframe column.
dropna:`bool`	Ignore missing values or not.
normalize:`bool`	Do frequencies (`False`) or percentages (`True`).
column_label:`str`	Label used for the column.
index:`Union[bool, str]`	Modify row labels. See `frequencies_and_percentages()` for details.
data_label:`str`	Overwrite explanatory data label. See `frequencies_and_percentages()` for details.
Returns
`pandas.DataFrame`	A table as a data frame.
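The dropna and normalize switches resemble the parameters of the same names on `pandas.Series.value_counts()`. The sketch below builds a comparable table on that call. Whether the module itself uses `value_counts()` is an assumption, and `count_or_fraction` is a hypothetical name.

```python
import pandas

# Made-up sample data with one missing value
sex = pandas.Series(['man', 'woman', 'man', None, 'man'], name='sex')


def count_or_fraction(data: pandas.Series, dropna: bool, normalize: bool,
                      column_label: str) -> pandas.DataFrame:
    """Hypothetical stand-in built on Series.value_counts()."""
    counts = data.value_counts(dropna=dropna, normalize=normalize)

    # Turn the resulting series into a one-column table
    return counts.to_frame(column_label)


freq = count_or_fraction(sex, True, False, 'Count')
frac = count_or_fraction(sex, True, True, 'perc')
print(freq)
print(frac)
```

With normalize=True the values are fractions between 0 and 1, as in the second example above. Scaling them to percentages would be a separate step.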

def _join_columns(data: pandas.DataFrame, columns: Tuple[str] = None, joined_column_name: str = None, join_fstring: str = '{} ({})', round_digits: Union[int, Tuple[int, int]] = None) -> pandas.DataFrame:

Join two columns into one.

It is a helper function to morph existing tables. One use case would be to join frequencies and percentages into one column (e.g. "3 (4.5)").

Input example:

|                |   n |         % |
|:---------------|----:|----------:|
| Ja             | 403 | 62.0955   |
| Nein           | 238 | 36.6718   |
| Weiß ich nicht |   6 |  0.924499 |
| (Fehlend)      |   2 |  0.308166 |

Output example:

|                | n (%)      |
|:---------------|:-----------|
| Ja             | 403 (62,1) |
| Nein           | 238 (36,7) |
| Weiß ich nicht | 6 (0,9)    |
| (Fehlend)      | 2 (0,3)    |

Parameters
data:`pandas.DataFrame`	The data frame to modify.
columns:`Tuple[str]`	Name of two columns to join. Data frame column names are used as default if not present.
joined_column_name:`str`	Name of the new column. If not present it is created based on `join_fstring` argument.
join_fstring:`str`	Format string used to join values of the two columns.
round_digits:`Union[int, Tuple[int, int]]`	Number of digits to round the two values. Use a tuple of two integers to have different number of digits.
Returns
`pandas.DataFrame`	Resulting table as data frame.
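The joining step for a single row can be sketched with `str.format()`. `join_pair` is a hypothetical helper for this illustration. It formats numbers with a decimal point and does not reproduce the decimal comma seen in the output example above.

```python
def join_pair(left, right, join_fstring='{} ({})', round_digits=None):
    """Join two values into one string, optionally rounding them first.

    Hypothetical helper for illustration; not taken from the module.
    """
    if round_digits is not None:
        if isinstance(round_digits, int):
            # One integer applies to both values
            round_digits = (round_digits, round_digits)

        left = round(left, round_digits[0])
        right = round(right, round_digits[1])

    return join_fstring.format(left, right)


rows = [('Ja', 403, 62.0955), ('Nein', 238, 36.6718)]

# The same format string can also produce the joined column name
print('{} ({})'.format('n', '%'))

for label, n, perc in rows:
    print(label, join_pair(n, perc, round_digits=(0, 1)))
```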

def _total_label(total: Union[bool, str] = True) -> str:

Helper function to determine the correct value for the total column depending on the given argument value.

Used by multiple functions in this module.

Parameters
total:`Union[bool, str]`	If `True` the value of `LABEL_TOTAL` is used and if it is of type `str` its own value is used. In all other cases `None` is returned.
Returns
`str`	The total label as a string or `None`.
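The rule in the parameter description can be written out in a few lines. This is a sketch of the documented behavior, not the module's source. `total_label` and the local `LABEL_TOTAL` are re-declared here so the example is self-contained.

```python
from typing import Optional, Union

LABEL_TOTAL = 'Total'  # same value as documented for the module constant


def total_label(total: Union[bool, str] = True) -> Optional[str]:
    """Return the label for the total column, or None if there is none."""
    if total is True:
        # Default label
        return LABEL_TOTAL

    if isinstance(total, str):
        # A string is used as the customized label
        return total

    # False and every other value: no total label
    return None


print(total_label())
print(total_label('All'))
print(total_label(False))
```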

_log

Handle to the logger.