Generate output for analysis in the form of tables and figures.
For tables, pandas.DataFrame is used. For figures, seaborn or other matplotlib-based packages are used. The results are intended to be used by report.
Function | aggregate | Summarize data using aggregating functions. |
Function | frequencies | Frequencies for one variable. |
Function | frequencies_and_percentages | Frequencies and percentages for one variable only. |
Function | percentages | Percentages for one variable. |
Function | summarize | Create a summary table with frequencies and percentages for multiple variables using their names and values. |
Constant | DEFAULT_AGGFUNC_LABELS | Labels used by default in aggregate(). |
Constant | INDENTION_DEFAULT | Spaces to indent, e.g. when indenting row labels (indexes). |
Constant | LABEL_TOTAL | Default label used for total columns and rows. |
Function | _generic_frequency_or_fraction | Frequencies or percentages for one variable. |
Function | _join | Join two columns into one. |
Function | _total | Helper function to determine the correct value for the total column depending on the given argument value. |
Variable | _log | Handle to the logger. |
def aggregate(data: pandas.DataFrame, values_in: Union[str, list, Dict[str, str]], aggfunc: Union[Callable, Iterable[Callable], Dict[Callable, str]], group_by: str = None, total: Union[bool, str] = True, round_digits: int = None) -> pandas.DataFrame:
Summarize data using aggregating functions.
Output example from the titanic dataset as markdown, showing the sum and mean of age:

|     |     Sum |    Mean |
|:----|--------:|--------:|
| age | 21205.2 | 29.6991 |

Adding fare costs:

|      |     Sum |    Mean |
|:-----|--------:|--------:|
| age  | 21205.2 | 29.6991 |
| fare | 28693.9 | 32.2042 |

Mean (with a customized label) for age and fare, grouped by gender, with values rounded:

|      | ('female', 'MD') | ('male', 'MD') | ('Total', 'MD') |
|:-----|-----------------:|---------------:|----------------:|
| age  |               28 |             31 |              30 |
| fare |               44 |             26 |              32 |
About values_in: This argument can be a string, a list of strings or a dictionary. Use a string to specify one column in the data or use a list of strings to specify multiple columns. To customize labels in the resulting data frame use a dictionary indexed by the column names with the customized labels as values.
About aggfunc: Similar to values_in, this argument can be a single function, a list of functions or a dictionary. The term function refers to a Callable. By default, the labels used for aggregating functions in the resulting table are read from DEFAULT_AGGFUNC_LABELS. To add more labels or to customize them, use a dict indexed by the functions with the customized labels as values.
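The labelling idea behind values_in and aggfunc can be pictured with plain pandas. This is only a minimal sketch, not the implementation of aggregate(): the sample data and the names `values_in` and `agg_labels` are made up for illustration, and the label dictionary is keyed by pandas' aggregation names rather than by the functions themselves.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the titanic examples.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Dictionary form: column name -> customized row label.
values_in = {"age": "Age (years)", "fare": "Fare"}
# Assumption: labels keyed by aggregation name instead of by Callable.
agg_labels = {"sum": "Sum", "mean": "Mean"}

table = (
    df[list(values_in)]
    .agg(list(agg_labels))   # rows: sum, mean; columns: age, fare
    .T                       # rows: age, fare; columns: sum, mean
    .rename(index=values_in, columns=agg_labels)
)
print(table)
```

The transpose puts the variables in the rows and the aggregating functions in the columns, which matches the layout of the example tables above.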
Parameters | |
data: pandas.DataFrame | Data frame with raw data. |
values_in: Union[str, list, Dict[str, str]] | Specify columns in data to aggregate. See details above. |
aggfunc: Union[Callable, Iterable[Callable], Dict[Callable, str]] | Aggregate function(s) to use. See details above. |
group_by: str | Optional column in data to group the values by. |
total: Union[bool, str] | Add a total column (default: True), optionally with a customized label when a string is given instead of True. |
round_digits: int | Number of digits to round the values to in all cells. |
Returns | |
pandas.DataFrame | Resulting table as pandas.DataFrame. |
def frequencies(data: pandas.Series, dropna: bool = False, result_label: str = 'n', index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame:
Frequencies for one variable.
Example output (markdown style):
| Gender   |   n |
|:---------|----:|
| Female   | 101 |
| Male     |  99 |

See frequencies_and_percentages() for more examples and details.
Parameters | |
data: pandas.Series | A series of values or a data frame's column. |
dropna: bool | Ignore missing values or not. |
result_label: str | The label used for the result column. |
index: Union[bool, str] | Modify row labels. See frequencies_and_percentages() for details. |
data_label: str | Overwrite explanatory data label. See frequencies_and_percentages() for details. |
Returns | |
pandas.DataFrame | The frequencies table as a data frame. |
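A frequency table of this shape can be approximated with `Series.value_counts()`. The following is a minimal sketch under that assumption, not the module's own code; `frequencies_sketch` and the sample series are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
gender = pd.Series(["Female", "Male", "Female", None], name="Gender")


def frequencies_sketch(data: pd.Series, dropna: bool = False,
                       result_label: str = "n") -> pd.DataFrame:
    """Count each unique value and return the counts as a one-column table."""
    counts = data.value_counts(dropna=dropna)
    return counts.to_frame(name=result_label)


tab = frequencies_sketch(gender)
print(tab)
```

With `dropna=False` the missing value gets its own row; with `dropna=True` it is left out of the table.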
def frequencies_and_percentages(data: pandas.Series, dropna: bool = False, result_labels: Tuple[str] = (...), combine: str = '{} ({})', reverse_columns: bool = False, index: Union[bool, str] = False, data_label: str = None, sort_index: bool = None, round_percentage_digits: int = 2) -> pandas.DataFrame:
Frequencies and percentages for one variable only.
It is a wrapper around frequencies() and percentages().
By default, the index (the row labels) contains the unique values of data (e.g. female and male). Additionally, an explanatory name (e.g. Gender) can be added to the index in two flavors. If index is True, an indented index is created, with row labels indented by four blank spaces (see INDENTION_DEFAULT). The same happens when index == 'indented'. If index == 'multi', a pandas.MultiIndex is created with the explanatory name in the first level and the value labels in the second level.
Output example from the titanic dataset as markdown:

|       |   n |       % |
|:------|----:|--------:|
| man   | 869 | 66.0334 |
| women | 447 | 33.9666 |

Example with index='multi':

|                  |   n |       % |
|:-----------------|----:|--------:|
| ('sex', 'man')   | 869 | 66.0334 |
| ('sex', 'women') | 447 | 33.9666 |

Example with customized variable label (data_label='Gender'):

|                     |   n |       % |
|:--------------------|----:|--------:|
| ('Gender', 'man')   | 869 | 66.0334 |
| ('Gender', 'women') | 447 | 33.9666 |

Example with customized variable label (index=True, data_label='Gender'):

|            | n (%)         |
|:-----------|:--------------|
| Gender     |               |
| male       | 577.0 (64.76) |
| female     | 314.0 (35.24) |
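The two index flavors described above can be reproduced with plain pandas. This is a rough sketch only: `INDENT` is a hypothetical stand-in for INDENTION_DEFAULT, assumed to be four spaces as stated in the text, and the table values are copied from the examples.

```python
import pandas as pd

INDENT = " " * 4  # assumption: four blank spaces, as described in the text

tab = pd.DataFrame({"n": [869, 447]}, index=["man", "women"])
label = "Gender"

# Flavor 'multi': explanatory name in the first level, values in the second.
multi = tab.copy()
multi.index = pd.MultiIndex.from_product([[label], tab.index])

# Flavor 'indented': a header row with the name, then indented value labels.
header = pd.DataFrame({"n": [""]}, index=[label])
indented = pd.concat([header, tab.rename(index=lambda value: INDENT + value)])

print(multi)
print(indented)
```

The indented flavor keeps a flat index, which is why the explanatory name needs a row of its own with an empty cell.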
Parameters | |
data: pandas.Series | A series of values or a data frame's column. |
dropna: bool | Ignore missing values or not. |
result_labels: Tuple[str] | The labels used for the result columns. |
combine: str | Combine the two columns into one (e.g. "n (%)"). |
reverse_columns: bool | Order of columns in the resulting table. |
index: Union[bool, str] | Add data's name to the index and/or specify the index kind as MultiIndex (value: multi) or indented (value: indented). See details above. |
data_label: str | Replace data's name with this label. |
sort_index: bool | Sort the row index (e.g. by ordered categories). |
round_percentage_digits: int | Round percentage values to n digits. |
Returns | |
pandas.DataFrame | A table as a data frame. |
def percentages(data: pandas.Series, dropna: bool = False, result_label: str = '%', index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame:
Percentages for one variable.
Example output:
|       |       % |
|:------|--------:|
| man   | 66.0334 |
| women | 33.9666 |

See frequencies_and_percentages() for more examples and details.
Parameters | |
data: pandas.Series | A series of values or a data frame's column. |
dropna: bool | Ignore missing values or not. |
result_label: str | The label used for the result column. |
index: Union[bool, str] | Modify row labels. See frequencies_and_percentages() for details. |
data_label: str | Overwrite explanatory data label. See frequencies_and_percentages() for details. |
Returns | |
pandas.DataFrame | The percentages table as a data frame. |
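A percentage table like the one above can be sketched with `Series.value_counts(normalize=True)`, scaled from fractions to percent. The helper name `percentages_sketch` and the sample data are hypothetical, and the real function may differ in detail.

```python
import pandas as pd

# Hypothetical sample data.
sex = pd.Series(["man", "man", "women", "man"], name="sex")


def percentages_sketch(data: pd.Series, dropna: bool = False,
                       result_label: str = "%") -> pd.DataFrame:
    """Relative frequency of each unique value, expressed in percent."""
    fractions = data.value_counts(normalize=True, dropna=dropna)
    return (fractions * 100).to_frame(name=result_label)


tab = percentages_sketch(sex)
print(tab)
```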
def summarize(data: pandas.DataFrame, values_in: Union[str, Iterable[str], Dict[str, str]], frequency: Union[bool, Tuple[bool, bool]] = (...), combine: bool = True, group_by: str = None, total: Union[bool, str] = True) -> pandas.DataFrame:
Create a summary table with frequencies and percentages for multiple variables using their names and values.
Simple output example using titanic dataset:
>>> analy.summarize(df, 'sex')
                     n (%)
Total          891 (100.0)
sex   male     577 (64.76)
      female   314 (35.24)
Example with more variables:
>>> analy.summarize(df, ['sex', 'class', 'alive'])
                     n (%)
Total          891 (100.0)
sex   male     577 (64.76)
      female   314 (35.24)
class Third    491 (55.11)
      First    216 (24.24)
      Second   184 (20.65)
alive no       549 (61.62)
      yes      342 (38.38)
Using argument group_by with multiple variables:
                     Betazed        Bajor        Trill        Total
                       n (%)        n (%)        n (%)        n (%)
Total            117 (100.0)  127 (100.0)  105 (100.0)  349 (100.0)
Person Sarek      29 (24.79)   14 (11.02)    19 (18.1)   62 (17.77)
       Diana      25 (21.37)   33 (25.98)   20 (19.05)   78 (22.35)
       Picard      22 (18.8)   26 (20.47)   22 (20.95)   70 (20.06)
       Worf        22 (18.8)   29 (22.83)   26 (24.76)   77 (22.06)
       Quark      19 (16.24)   25 (19.69)   18 (17.14)   62 (17.77)
Gender diverse    29 (24.79)    32 (25.2)   24 (22.86)   85 (24.36)
       female     25 (21.37)   20 (15.75)   25 (23.81)   70 (20.06)
       androgyn   24 (20.51)   23 (18.11)   17 (16.19)   64 (18.34)
       unknown    20 (17.09)   25 (19.69)   17 (16.19)   62 (17.77)
       male       19 (16.24)   27 (21.26)   22 (20.95)   68 (19.48)
Use the argument frequency to control whether frequencies and/or percentages are used, and in which order.

# Default
>>> analy.summarize(df, 'class')
                     n (%)
Total          891 (100.0)
class Third    491 (55.11)
      First    216 (24.24)
      Second   184 (20.65)

# Percentage only
>>> analy.summarize(df, 'class', frequency=False)
                       %
Total              100.0
class Third    55.106622
      First    24.242424
      Second   20.650954

# Frequency only
>>> analy.summarize(df, 'class', frequency=True)
                 n
Total          891
class Third    491
      First    216
      Second   184

# Reversed order
>>> analy.summarize(df, 'class', frequency=(False, True))
                     % (n)
Total          100.0 (891)
class Third    55.11 (491)
      First    24.24 (216)
      Second   20.65 (184)
Parameters | |
data: pandas.DataFrame | The data frame. |
values_in: Union[str, Iterable[str], Dict[str, str]] | List of column names in the data to summarize. |
frequency: Union[bool, Tuple[bool, bool]] | Control whether frequencies and/or percentages are used. |
combine: bool | Combine frequencies and percentages into one column. |
group_by: str | A column in data; a new column is added to the resulting table for each of its values. |
total: Union[bool, str] | Add a total column (default: True), optionally with a customized label when a string is given instead of True. |
Returns | |
pandas.DataFrame | Resulting table as a data frame. |
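The two-level row layout of the summary table (variable name, then value) can be imitated by stacking one small table per variable with `pandas.concat`. This is a simplified sketch with hypothetical data and a hypothetical helper `one_variable`; it leaves out the Total row, the combined "n (%)" column and the group_by argument.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "alive": ["no", "yes", "yes", "no"],
})


def one_variable(series: pd.Series) -> pd.DataFrame:
    """Frequencies and rounded percentages for a single column."""
    n = series.value_counts()
    pct = (n / n.sum() * 100).round(2)
    return pd.DataFrame({"n": n, "%": pct})


# Passing a dict to concat turns its keys into the first index level.
table = pd.concat({name: one_variable(df[name]) for name in ["sex", "alive"]})
print(table)
```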
DEFAULT_AGGFUNC_LABELS: Labels used by default in aggregate().
def _generic_frequency_or_fraction(data: pandas.Series, dropna: bool, normalize: bool, column_label: str, index: Union[bool, str] = False, data_label: str = None) -> pandas.DataFrame:
Frequencies or percentages for one variable.
It is the core function used by frequencies(), percentages() and frequencies_and_percentages().
Example:
>>> tab = _generic_frequency_or_fraction(
...     df.sex, False, False, 'Count', False)
>>> print(tab.to_markdown())
|       |   Count |
|:------|--------:|
| man   |     869 |
| women |     447 |
>>> tab = _generic_frequency_or_fraction(
...     df.sex, False, True, 'perc', True)
>>> print(tab.to_markdown())
|                  |     perc |
|:-----------------|---------:|
| ('sex', 'man')   | 0.660334 |
| ('sex', 'women') | 0.339666 |
See frequencies_and_percentages() for more examples and details.
Parameters | |
data: pandas.Series | A series of values or a data frame's column. |
dropna: bool | Ignore missing values or not. |
normalize: bool | Compute frequencies (False) or percentages (True). |
column_label: str | Label used for the column. |
index: Union[bool, str] | Modify row labels. See frequencies_and_percentages() for details. |
data_label: str | Overwrite explanatory data label. See frequencies_and_percentages() for details. |
Returns | |
pandas.DataFrame | A table as a data frame. |
def _join(data: pandas.DataFrame, columns: Tuple[str] = None, joined_column_name: str = None, join_fstring: str = '{} ({})', round_digits: Union[int, Tuple[int, int]] = None) -> pandas.DataFrame:
Join two columns into one.
It is a helper function to morph existing tables. One use case would be to join frequencies and percentages into one column (e.g. "3 (4.5)").
Input example:

|                |   n |         % |
|:---------------|----:|----------:|
| Ja             | 403 | 62.0955   |
| Nein           | 238 | 36.6718   |
| Weiß ich nicht |   6 |  0.924499 |
| (Fehlend)      |   2 |  0.308166 |

Output example:

|                | n (%)      |
|:---------------|:-----------|
| Ja             | 403 (62,1) |
| Nein           | 238 (36,7) |
| Weiß ich nicht | 6 (0,9)    |
| (Fehlend)      | 2 (0,3)    |
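Joining two columns through a format string can be sketched as follows. This is an illustration under assumptions, not the module's helper: `join_columns_sketch` is a hypothetical name, it always uses the first two columns, rounds to one digit by default, and does not produce the decimal comma seen in the output example.

```python
import pandas as pd

# Two rows taken from the input example above.
tab = pd.DataFrame(
    {"n": [403, 238], "%": [62.0955, 36.6718]},
    index=["Ja", "Nein"],
)


def join_columns_sketch(data: pd.DataFrame, join_fstring: str = "{} ({})",
                        round_digits: int = 1) -> pd.DataFrame:
    """Join the first two columns into one string column."""
    first, second = data.columns[:2]
    # The new column name is built with the same format string.
    name = join_fstring.format(first, second)
    joined = [
        join_fstring.format(round(a, round_digits), round(b, round_digits))
        for a, b in zip(data[first].tolist(), data[second].tolist())
    ]
    return pd.DataFrame({name: joined}, index=data.index)


result = join_columns_sketch(tab)
print(result)
```

Reusing the format string for the column name gives "n (%)" without a separate argument, which mirrors the documented fallback when no joined column name is given.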
Parameters | |
data: pandas.DataFrame | The data frame to modify. |
columns: Tuple[str] | Names of the two columns to join. If not given, the data frame's column names are used. |
joined_column_name: str | Name of the new column. If not given, it is created based on the join_fstring argument. |
join_fstring: str | Format string used to join the values of the two columns. |
round_digits: Union[int, Tuple[int, int]] | Number of digits to round the two values to. Use a tuple of two integers to round each value to a different number of digits. |
Returns | |
pandas.DataFrame | Resulting table as a data frame. |
Helper function to determine the correct value for the total column depending on the given argument value.
Used by multiple functions in this module.
Parameters | |
total: Union[bool, str] | If True, the value of LABEL_TOTAL is used; if it is of type str, its own value is used. In all other cases None is returned. |
Returns | |
str | The total label as a string or None. |
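The rule described for the total argument can be written down in a few lines. This is a sketch of the documented behavior only: the function name `total_label_sketch` is hypothetical, and the value 'Total' for LABEL_TOTAL is an assumption taken from the column label visible in the example tables.

```python
from typing import Optional, Union

LABEL_TOTAL = "Total"  # assumption: matches the label seen in the examples


def total_label_sketch(total: Union[bool, str]) -> Optional[str]:
    """Return the label for the total column, or None if no total is wanted."""
    if total is True:
        return LABEL_TOTAL
    if isinstance(total, str):
        return total
    # False, None and any other value mean: no total column.
    return None


print(total_label_sketch(True))
print(total_label_sketch("All"))
print(total_label_sketch(False))
```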