module documentation

Generate output for analysis in the form of tables and figures.

For tables, pandas.DataFrame is used. For figures, seaborn or other matplotlib-based packages are used. The results are intended to be used by report.

Function aggregate Summarize data using aggregating functions.
Function frequencies Frequencies for one variable.
Function frequencies_and_percentages Frequencies and percentages for one variable only.
Function percentages Percentages for one variable.
Function summarize Create summary table with frequency and percentage for multiple variables using their names and values.
Constant DEFAULT_AGGFUNC_LABELS Labels used by default in aggregate().
Constant INDENTION_DEFAULT Spaces to indent, e.g. when indenting row labels (indexes).
Constant LABEL_TOTAL Default label used for total columns and rows.
Function _generic_frequency_or_fraction Frequencies or percentages for one variable.
Function _join_columns Join two columns into one.
Function _total_label Helper function to determine the correct value for the total column depending on the given argument value.
Variable _log Handle to the logger.
def aggregate(data: pandas.DataFrame, values_in: Union[str, list, Dict[str, str]], aggfunc: Union[Callable, Iterable[Callable], Dict[Callable, str]], group_by: str = None, total: Union[bool, str] = True, round_digits: int = None) -> pandas.DataFrame: (source)

Summarize data using aggregating functions.

Output example from the titanic dataset as markdown, showing the sum and mean value of age:

|     |     Sum |    Mean |
|:----|--------:|--------:|
| age | 21205.2 | 29.6991 |

Adding fare costs:

|      |     Sum |    Mean |
|:-----|--------:|--------:|
| age  | 21205.2 | 29.6991 |
| fare | 28693.9 | 32.2042 |

Mean (with customized label) for age and fare, grouped by gender, with values rounded:

|      | ('female', 'MD') | ('male', 'MD') | ('Total', 'MD') |
|:-----|-----------------:|---------------:|----------------:|
| age  |               28 |             31 |              30 |
| fare |               44 |             26 |              32 |

About values_in: This argument can be a string, a list of strings or a dictionary. Use a string to specify one column in the data or use a list of strings to specify multiple columns. To customize labels in the resulting data frame use a dictionary indexed by the column names with the customized labels as values.

About aggfunc: Similar to values_in, this argument can be one function, a list of functions or a dictionary. The term function refers to a Callable. By default, the labels used for the aggregating functions in the resulting table are read from DEFAULT_AGGFUNC_LABELS. To add more labels or to customize them, use a dict indexed by functions with the customized labels as values.
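The three accepted shapes of values_in and aggfunc can be pictured as being reduced to one dictionary each. The following is only a minimal sketch of that idea, not the module's code: the helper names normalize_values_in and normalize_aggfunc, the shortened DEFAULT_LABELS table, and the fallback to the function's __name__ are assumptions made for illustration.

```python
import statistics
from typing import Callable, Dict, Iterable, Union

# Hypothetical, shortened stand-in for DEFAULT_AGGFUNC_LABELS.
DEFAULT_LABELS: Dict[Callable, str] = {
    statistics.mean: 'Mean',
    statistics.median: 'Median',
    sum: 'Sum',
}


def normalize_values_in(
        values_in: Union[str, list, Dict[str, str]]) -> Dict[str, str]:
    """Return a mapping of column name to label (hypothetical helper)."""
    if isinstance(values_in, str):
        # One column, labeled with its own name.
        return {values_in: values_in}
    if isinstance(values_in, dict):
        # Column names with customized labels.
        return dict(values_in)
    # A list of column names, each labeled with its own name.
    return {name: name for name in values_in}


def normalize_aggfunc(
        aggfunc: Union[Callable, Iterable[Callable], Dict[Callable, str]]
) -> Dict[Callable, str]:
    """Return a mapping of function to label (hypothetical helper)."""
    if isinstance(aggfunc, dict):
        # Functions with customized labels.
        return dict(aggfunc)
    if callable(aggfunc):
        aggfunc = [aggfunc]
    # Assumption: fall back to the function name if no default label exists.
    return {func: DEFAULT_LABELS.get(func, func.__name__) for func in aggfunc}


print(normalize_values_in({'age': 'Age (years)'}))
print(list(normalize_aggfunc([sum, statistics.mean]).values()))
```

With a dictionary such as {statistics.mean: 'MD'} the customized label is kept unchanged, which corresponds to the 'MD' column label in the grouped example above.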

Parameters
data (pandas.DataFrame): Data frame with raw data.
values_in (Union[str, list, Dict[str, str]]): Specify columns in data to aggregate. See details above.
aggfunc (Union[Callable, Iterable[Callable], Dict[Callable, str]]): Aggregate function(s) to use. See details above.
group_by (str): Optional column in data to group values by.
total (Union[bool, str]): Add total column (default: True) with optionally customized label when using a string instead of True.
round_digits (int): Number of digits to round values to in all cells.
Returns
pandas.DataFrame: Resulting table as pandas.DataFrame.
def frequencies(data: pandas.Series, dropna: bool = False, result_label: str = 'n', index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame: (source)

Frequencies for one variable.

Example output (markdown style):

| Gender   |   n |
|:---------|----:|
| Female   | 101 |
| Male     |  99 |

See frequencies_and_percentages() for more examples and details.

Parameters
data (pandas.Series): A series of values or a data frame's column.
dropna (bool): Ignore missing values or not.
result_label (str): The label used for the result column.
index (Union[bool, str]): Modify row labels. See frequencies_and_percentages() for details.
data_label (str): Overwrite explanatory data label. See frequencies_and_percentages() for details.
Returns
pandas.DataFrame: The frequencies table as a data frame.
def frequencies_and_percentages(data: pandas.Series, dropna: bool = False, result_labels: Tuple[str] = ('n', '%'), combine: str = '{} ({})', reverse_columns: bool = False, index: Union[bool, str] = False, data_label: str = None, sort_index: bool = None, round_percentage_digits: int = 2) -> pandas.DataFrame: (source)

Frequencies and percentages for one variable only.

It is a wrapper around frequencies() and percentages().
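What such a combined table contains can be reproduced with plain pandas. This is a rough sketch under assumptions, not the wrapper's actual implementation: it uses toy data instead of the titanic dataset and relies only on pandas.Series.value_counts() and pandas.concat().

```python
import pandas

# Toy data, not the titanic dataset used elsewhere on this page.
sex = pandas.Series(['male', 'male', 'male', 'female'], name='sex')

# Absolute frequencies; dropna=False keeps missing values as their own row.
n = sex.value_counts(dropna=False).rename('n')

# Relative frequencies scaled to percentages and rounded to two digits.
pct = (
    sex.value_counts(normalize=True, dropna=False)
    .mul(100)
    .round(2)
    .rename('%')
)

# Align both results on the shared index and place them side by side.
table = pandas.concat([n, pct], axis=1)
print(table)
```

The two intermediate series correspond roughly to what frequencies() and percentages() return on their own.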

The index (the row labels) by default contains the unique values found in data (e.g. female and male). Additionally, an explanatory name (e.g. Gender) can be added to the index in two flavors. If index is True, an indented index is created with row labels indented by four blank spaces (see INDENTION_DEFAULT). The same happens when index == 'indented'. If index == 'multi', a pandas.MultiIndex is created with the explanatory name in the first level and the value labels in the second level.
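The two index flavors differ only in how the explanatory name is attached to the value labels. The sketch below illustrates this with plain lists; the helper names indented_labels and multi_labels are hypothetical, and INDENT merely mirrors the documented value of INDENTION_DEFAULT.

```python
from typing import List, Tuple

INDENT = 4  # mirrors the documented value of INDENTION_DEFAULT


def indented_labels(name: str, values: List[str],
                    indent: int = INDENT) -> List[str]:
    """Explanatory name first, then each value label shifted right."""
    return [name] + [' ' * indent + value for value in values]


def multi_labels(name: str, values: List[str]) -> List[Tuple[str, str]]:
    """Pair the explanatory name with each value label (two levels)."""
    return [(name, value) for value in values]


print(indented_labels('Gender', ['male', 'female']))
print(multi_labels('Gender', ['male', 'female']))
```

The tuples produced by multi_labels() have the shape that pandas.MultiIndex.from_tuples() accepts, while the indented variant adds one extra row for the explanatory name itself.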

Output example from titanic dataset as markdown:

|       |   n |       % |
|:------|----:|--------:|
| man   | 869 | 66.0334 |
| women | 447 | 33.9666 |

Example with index='multi':

|                  |   n |       % |
|:-----------------|----:|--------:|
| ('sex', 'man')   | 869 | 66.0334 |
| ('sex', 'women') | 447 | 33.9666 |

Example with customized variable label (data_label='Gender'):

|                     |   n |       % |
|:--------------------|----:|--------:|
| ('Gender', 'man')   | 869 | 66.0334 |
| ('Gender', 'women') | 447 | 33.9666 |

Example with indented index and customized variable label (index=True, data_label='Gender'):

|            | n (%)         |
|:-----------|:--------------|
| Gender     |               |
|     male   | 577.0 (64.76) |
|     female | 314.0 (35.24) |
Parameters
data (pandas.Series): A series of values or a data frame's column.
dropna (bool): Ignore missing values or not.
result_labels (Tuple[str]): The labels used for the result columns.
combine (str): Combine the two columns into one (e.g. "n (%)").
reverse_columns (bool): Order of columns in resulting table.
index (Union[bool, str]): Add data's name to the index and/or specify the index kind as MultiIndex (value: multi) or indented (value: indented). See details above.
data_label (str): Replace data's name by this label.
sort_index (bool): Sort the row index (e.g. by ordered categories).
round_percentage_digits (int): Round percentage values to n digits.
Returns
pandas.DataFrame: A table as a data frame.
def percentages(data: pandas.Series, dropna: bool = False, result_label: str = '%', index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame: (source)

Percentages for one variable.

Example output:

|       |       % |
|:------|--------:|
| man   | 66.0334 |
| women | 33.9666 |

See frequencies_and_percentages() for more examples and details.

Parameters
data (pandas.Series): A series of values or a data frame's column.
dropna (bool): Ignore missing values or not.
result_label (str): The label used for the result column.
index (Union[bool, str]): Modify row labels. See frequencies_and_percentages() for details.
data_label (str): Overwrite explanatory data label. See frequencies_and_percentages() for details.
Returns
pandas.DataFrame: The percentages table as a data frame.
def summarize(data: pandas.DataFrame, values_in: Union[str, Iterable[str], Dict[str, str]], frequency: Union[bool, Tuple[bool, bool]] = (True, False), combine: bool = True, group_by: str = None, total: Union[bool, str] = True) -> pandas.DataFrame: (source)

Create summary table with frequency and percentage for multiple variables using their names and values.

Simple output example using titanic dataset:

>>> analy.summarize(df, 'sex')
                  n (%)
Total       891 (100.0)
sex
    male    577 (64.76)
    female  314 (35.24)

Example with more variables:

>>> analy.summarize(df, ['sex', 'class', 'alive'])
                  n (%)
Total       891 (100.0)
sex
    male    577 (64.76)
    female  314 (35.24)
class
    Third   491 (55.11)
    First   216 (24.24)
    Second  184 (20.65)
alive
    no      549 (61.62)
    yes     342 (38.38)

Using argument group_by with multiple variables:

                  Betazed        Bajor        Trill        Total
                    n (%)        n (%)        n (%)        n (%)
Total         117 (100.0)  127 (100.0)  105 (100.0)  349 (100.0)
Person
    Sarek      29 (24.79)   14 (11.02)    19 (18.1)   62 (17.77)
    Diana      25 (21.37)   33 (25.98)   20 (19.05)   78 (22.35)
    Picard      22 (18.8)   26 (20.47)   22 (20.95)   70 (20.06)
    Worf        22 (18.8)   29 (22.83)   26 (24.76)   77 (22.06)
    Quark      19 (16.24)   25 (19.69)   18 (17.14)   62 (17.77)
Gender
    diverse    29 (24.79)    32 (25.2)   24 (22.86)   85 (24.36)
    female     25 (21.37)   20 (15.75)   25 (23.81)   70 (20.06)
    androgyn   24 (20.51)   23 (18.11)   17 (16.19)   64 (18.34)
    unknown    20 (17.09)   25 (19.69)   17 (16.19)   62 (17.77)
    male       19 (16.24)   27 (21.26)   22 (20.95)   68 (19.48)

Use the argument frequency to control whether frequencies and/or percentages are used, and in which order.

# Default
>>> analy.summarize(df, 'class')
                n (%)
Total       891 (100.0)
class
    Third   491 (55.11)
    First   216 (24.24)
    Second  184 (20.65)

# Percentage only
>>> analy.summarize(df, 'class', frequency=False)
                    %
Total           100.0
class
    Third   55.106622
    First   24.242424
    Second  20.650954

# Frequency only
>>> analy.summarize(df, 'class', frequency=True)
            n
Total       891
class
    Third   491
    First   216
    Second  184

# Reverse order (percentage first, then frequency)
>>> analy.summarize(df, 'class', frequency=(False, True))
                % (n)
Total       100.0 (891)
class
    Third   55.11 (491)
    First   24.24 (216)
    Second  20.65 (184)
Parameters
data (pandas.DataFrame): The data frame.
values_in (Union[str, Iterable[str], Dict[str, str]]): Column name or list of column names in the data to summarize.
frequency (Union[bool, Tuple[bool, bool]]): Control whether frequencies and/or percentages are used, and in which order.
combine (bool): Combine frequencies and percentages into one column.
group_by (str): A column in data; for each of its values a new column is added to the resulting table.
total (Union[bool, str]): Add total column (default: True) with optionally customized label when using a string instead of True.
Returns
pandas.DataFrame: Resulting table as a data frame.
DEFAULT_AGGFUNC_LABELS = (source)

Labels used by default in aggregate().

Value
{statistics.mean: 'Mean',
 pandas.Series.mean: 'Mean',
 statistics.median: 'Median',
 pandas.Series.median: 'Median',
 statistics.stdev: 'SD',
 pandas.Series.std: 'SD',
 sum: 'Sum',
...
INDENTION_DEFAULT: int = (source)

Spaces to indent, e.g. when indenting row labels (indexes).

Value
4
LABEL_TOTAL: str = (source)

Default label used for total columns and rows.

Value
'Total'
def _generic_frequency_or_fraction(data: pandas.Series, dropna: bool, normalize: bool, column_label: str, index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame: (source)

Frequencies or percentages for one variable.

It is the core function used by frequencies(), percentages() and frequencies_and_percentages().

Example:

>>> tab = _generic_frequency_or_fraction(
...     df.sex, False, False, 'Count', False)
>>> print(tab.to_markdown())
|       |   Count |
|:------|--------:|
| man   |     869 |
| women |     447 |

>>> tab = _generic_frequency_or_fraction(
...     df.sex, False, True, 'perc', True)
>>> print(tab.to_markdown())
|                  |     perc |
|:-----------------|---------:|
| ('sex', 'man')   | 0.660334 |
| ('sex', 'women') | 0.339666 |

See frequencies_and_percentages() for more examples and details.

Parameters
data (pandas.Series): A series of values or a data frame's column.
dropna (bool): Ignore missing values or not.
normalize (bool): Do frequencies (False) or percentages (True).
column_label (str): Label used for the column.
index (Union[bool, str]): Modify row labels. See frequencies_and_percentages() for details.
data_label (str): Overwrite explanatory data label. See frequencies_and_percentages() for details.
Returns
pandas.DataFrame: A table as a data frame.
def _join_columns(data: pandas.DataFrame, columns: Tuple[str] = None, joined_column_name: str = None, join_fstring: str = '{} ({})', round_digits: Union[int, Tuple[int, int]] = None) -> pandas.DataFrame: (source)

Join two columns into one.

It is a helper function to morph existing tables. One use case would be to join frequencies and percentages into one column (e.g. "3 (4.5)").
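The joining step itself comes down to rounding two values and passing them to a format string. The following sketch shows this on plain tuples instead of a data frame; join_values is a hypothetical helper, and unlike the output example it does not apply any locale-specific decimal separator.

```python
from typing import List, Tuple, Union


def join_values(rows: List[Tuple[float, float]],
                join_fstring: str = '{} ({})',
                round_digits: Union[int, Tuple[int, int]] = 1) -> List[str]:
    """Join each pair of values into one string (hypothetical helper)."""
    if isinstance(round_digits, int):
        # One integer applies to both values.
        round_digits = (round_digits, round_digits)
    left_digits, right_digits = round_digits
    return [
        join_fstring.format(round(left, left_digits),
                            round(right, right_digits))
        for left, right in rows
    ]


rows = [(403, 62.0955), (238, 36.6718), (6, 0.924499), (2, 0.308166)]
print(join_values(rows))
# -> ['403 (62.1)', '238 (36.7)', '6 (0.9)', '2 (0.3)']
```

Passing a different format string such as '{} [{} %]' changes only the presentation, not the rounding.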

Input example:

|                |   n |         % |
|:---------------|----:|----------:|
| Ja             | 403 | 62.0955   |
| Nein           | 238 | 36.6718   |
| Weiß ich nicht |   6 |  0.924499 |
| (Fehlend)      |   2 |  0.308166 |

Output example:

|                | n (%)      |
|:---------------|:-----------|
| Ja             | 403 (62,1) |
| Nein           | 238 (36,7) |
| Weiß ich nicht | 6 (0,9)    |
| (Fehlend)      | 2 (0,3)    |
Parameters
data (pandas.DataFrame): The data frame to modify.
columns (Tuple[str]): Names of the two columns to join. If not given, the data frame's column names are used.
joined_column_name (str): Name of the new column. If not given, it is created based on the join_fstring argument.
join_fstring (str): Format string used to join the values of the two columns.
round_digits (Union[int, Tuple[int, int]]): Number of digits to round the two values to. Use a tuple of two integers to round each value to a different number of digits.
Returns
pandas.DataFrame: Resulting table as a data frame.
def _total_label(total: Union[bool, str] = True) -> str: (source)

Helper function to determine the correct value for the total column depending on the given argument value.

Used by multiple functions in this module.

Parameters
total (Union[bool, str]): If True, the value of LABEL_TOTAL is used; if it is a str, its own value is used. In all other cases None is returned.
Returns
str: The total label as a string or None.
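The rule described for the total argument can be written down in a few lines. This is a sketch that follows the documented behavior only; the function name resolve_total_label is hypothetical and the real helper may differ in detail.

```python
from typing import Optional, Union

LABEL_TOTAL = 'Total'  # documented default label


def resolve_total_label(total: Union[bool, str] = True) -> Optional[str]:
    """Return the label for the total column, or None if there is none."""
    if total is True:
        # True selects the default label.
        return LABEL_TOTAL
    if isinstance(total, str):
        # A string is used as the label itself.
        return total
    # False and every other value: no total label.
    return None


print(resolve_total_label())        # Total
print(resolve_total_label('All'))   # All
print(resolve_total_label(False))   # None
```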

_log = (source)

Handle to the logger.