class documentation

A container managing multiple pandas.DataFrame.

The container can operate in two modes. In folder mode the container is located in a folder on the file system and each data frame has its own file. In archive mode the container is one single compressed (ZIP) archive with each data frame as an entry in it.

Two types of data frames are distinguished. On-demand data frames are loaded when they are accessed for the first time. A permanent data frame is loaded right away when the container is instantiated. Which storage mode is used depends on the path name: *.zip results in archive mode and everything else in folder mode.

In folder mode a data frame can be stored as a *.pickle or as a (ZIP) compressed *.pickle.zip file. By default the former is used for permanent and the latter for on-demand data frames.
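For illustration, a folder mode container with a permanent frame 'Bar' and an on-demand frame 'Foo' might look roughly like this on disk. This is a sketch only; it combines the _ filename prefix described in load() with the suffix constants listed below, and the names are hypothetical.

folder/
    __meta           <- meta data file
    _Bar.pickle      <- permanent, uncompressed by default
    Foo.pickle.zip   <- on-demand, ZIP compressed by default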

Warning

Although a container stores (multiple) data frames in the file system, it is not intended as a long-term archive format. It is temporary storage and can be used between running two different scripts or steps of research.

Example

import pathlib
import pandas

# Data for the container organized in a dict.
dataframes = {
    'Foo': pandas.DataFrame(),
    'Bar': pandas.DataFrame()
}

# Create container object
dc = DataContainer(dataframes=dataframes, permanent_names=['Bar'])

# Save the container as one ZIP file (archive mode)
fp = pathlib.Path('data.zip')
dc.save(fp)

# Load the container
dc = DataContainer.load(fp)

# 'Bar' is loaded immediately because it was marked as
# permanent. But 'Foo' is not loaded yet.
print(dc.names)  # ['Bar', 'Foo']
print(dc.Bar.head())
print(dc.Foo.head())  # <- Now "Foo" is going to be loaded

# Add a new data frame
dc.Elke = pandas.DataFrame()

# Save in ZIP
dc.save(fp)

# Save again in folder mode
fp = pathlib.Path('folder')
dc.save(fp)
Class Method load Load a data container from the path.
Method __delattr__ Delete a data frame by its attribute name (e.g. del dc.foo).
Method __delitem__ Delete a data frame by its key name (e.g. del dc['foo']).
Method __getattr__ Get data frame as instance attribute.
Method __getitem__ Dict like behavior of the container.
Method __init__ Create a container in memory.
Method __len__ Count of all data frames.
Method __setattr__ Add new data frame via attribute name.
Method __setitem__ Add new data frame via key.
Method __str__ Literal description of the container containing number of data frames, names of permanent and on-demand data frames and meta information.
Method describe_with_markdown Describe all data frames of the container as markdown tables.
Method keep_row_count Context manager monitoring the number of rows of a data frame.
Method load_all Load all data frames into memory.
Method save Save the container in the file system at file_path.
Method update_meta_info New keys are added and existing keys are overwritten; nested dicts are merged recursively.
Property meta The meta data (e.g. version number) as dict.
Property names List all data frame names no matter if they are on-demand or permanent type.
Property names_permanent List names of permanent data frames.
Class Method _format_log_data A helper method for _log_save() and _log_loaded().
Class Method _load_from_archive Load a container from an archive file.
Class Method _load_from_folder Load a container from a folder.
Class Method _write_folder Write data frames into a folder.
Class Method _write_zip_file Write data frames to a ZIP archive.
Method _assemble_log_message Undocumented
Method _check_meta_info Compare the meta data given by meta_check and in meta.
Method _compute_filepath_for_dataframe Try to find the path to a data frame's file.
Method _copy_dataframe_file Copy a data frame file to a new folder.
Method _generate_creation_info No summary
Method _hash_dataframe Create a hash value for a data frame and return it.
Method _is_compressed Determine if the data frame file is compressed.
Method _is_dataframe_modified Return the data frame's modified state.
Method _load_dataframe Load one specific data frame from its file.
Method _load_dataframe_from_archive Load a data frame from an archive.
Method _load_dataframe_from_folder Load a data frame from a folder.
Method _log_loaded Create log message about a loaded container.
Method _log_save Create log message about a stored container.
Method _save_as_archive Save the container instance as ZIP compressed file.
Method _save_as_folder Store the container in folder mode.
Method _save_previos_archive_as_folder Undocumented
Method _unload_dataframe Unload a data frame from the container.
Constant _ARCHIVE_SUFFIX Filename suffix for ZIP compressed container in archive mode.
Constant _META_FILENAME Name of the meta data holding file.
Constant _PICKLE_PROTOCOL The Pickle protocol version used.
Constant _PICKLE_SUFFIX Filename suffix for pickle files.
Constant _PICKLE_ZIP_SUFFIX Filename suffix for zip-compressed pickle files.
Instance Variable _dataframe_hashes Hash values of loaded data frames (folder mode only).
Instance Variable _dataframes Dictionary of data frames.
Instance Variable _meta The meta data (e.g. version number) as dict.
Instance Variable _names_on_demand List of data frame names which are of type on-demand.
Instance Variable _source_obj Path to the file or folder the container was loaded from.
@classmethod
def load(cls, file_path: Union[pathlib.Path], meta_to_check: Dict = None) -> DataContainer: (source)

Load a data container from the path.

Data frame files beginning with a _ prefix are treated as permanent and loaded immediately. Other data frames are treated as on-demand and loaded from files only when they are accessed.

If the container contains meta information (file __meta) and the argument meta_to_check is given, the two dictionaries are compared. An exception is raised when they are not equal. This could be used for a version number or a date. See _check_meta_info() for details.

Parameters
file_path (Union[pathlib.Path]): Path to the file or folder of the container.
meta_to_check (Dict): Meta information to validate.
Returns
DataContainer: The container object.
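A minimal sketch of the meta check; the meta values are hypothetical and assume the container was saved with matching meta data:

import pathlib

# Raises an exception if the stored meta data differs,
# e.g. when the file was written by an older script version.
dc = DataContainer.load(
    pathlib.Path('data.zip'),
    meta_to_check={'version': 2}
)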
def __delattr__(self, name): (source)

Delete a data frame by its attribute name (e.g. del dc.foo).

The underlying files are not deleted.

def __delitem__(self, key): (source)

Delete a data frame by its key name (e.g. del dc['foo']).

The underlying files are not deleted.

def __getattr__(self, attr: str) -> pandas.DataFrame: (source)

Get data frame as instance attribute.

Get a data frame by using its name as instance attribute name (dc.foobar). On-demand data frames are loaded automatically in the background.

Example

dc = bandas.DataContainer.load('data.zip')

# Access dataframe as instance attribute.
# The dataframe's name here is "foobar".
dc.foobar.head()
Parameters
attr (str): Name of a data frame.
Returns
pandas.DataFrame: The data frame.
def __getitem__(self, key_or_idx: Union[int, str]) -> Union[str, pandas.DataFrame]: (source)

Dict like behavior of the container.

Get a data frame by using its name as key (dc[name]). Using a numerical index (dc[0]) will return the name of a data frame as str. On-demand data frames are loaded automatically in the background.

Example

dc = bandas.DataContainer.load('data.zip')

# Iterate over dataframe names (like a dict)
for name in dc:
    print(name)

    # Use the name like a key
    print(dc[name].head())
Parameters
key_or_idx (Union[int, str]): Name of a data frame or index of its name.
Returns
Union[str, pandas.DataFrame]: The data frame when a name is given, or the name as str when a numerical index is given.
def __init__(self, dataframes: Dict[str, pandas.DataFrame] = None, permanent_names: List[str] = None, meta: Dict = None): (source)

Create a container in memory.

Parameters
dataframes (Dict[str, pandas.DataFrame]): Dictionary of pandas.DataFrame's.
permanent_names (List[str]): List of data frame names. See load() for details.
meta (Dict): Dictionary with additional data.
def __len__(self) -> int: (source)

Count of all data frames.

The number is based on the names property.

def __setattr__(self, attr: str, value: pandas.DataFrame): (source)

Add new data frame via attribute name.

Parameters
attr (str): Name of the data frame.
value (pandas.DataFrame): The data frame.
def __setitem__(self, key: str, dataframe: pandas.DataFrame): (source)

Add new data frame via key.

The data frame is added to the container as on-demand. It can be marked as permanent when calling save().

Parameters
key (str): Name of the data frame.
dataframe (pandas.DataFrame): The data frame.
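A short sketch, assuming an existing container dc; the frame name is hypothetical:

import pathlib
import pandas

# Added as on-demand by default.
dc['Elke'] = pandas.DataFrame()

# Mark it as permanent on the next save.
dc.save(pathlib.Path('data.zip'), permanent_names=['Elke'])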
def __str__(self) -> str: (source)

Literal description of the container containing number of data frames, names of permanent and on-demand data frames and meta information.

def describe_with_markdown(self, head_n: int = 4, width: int = 80, **kwargs) -> str: (source)

Describe all data frames of the container as markdown tables.

The width is respected for markdown tables also.

Parameters
head_n (int): Number of data rows per frame.
width (int): Width in characters of each table.
**kwargs: Used to hand over arguments to buhtzology.break_paragraph().
Returns
str: The result as a string.
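A minimal usage sketch, assuming an existing container dc:

# Markdown tables with 10 preview rows per data frame,
# each table limited to 100 characters in width.
print(dc.describe_with_markdown(head_n=10, width=100))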
@contextlib.contextmanager
def keep_row_count(self, name: str) -> DataContainer: (source)

Context manager monitoring the number of rows of a data frame.

The context manager can be used to make sure that an operation on the data does not delete or add rows. This can accidentally happen when using pandas.merge().

dc = get_container()

with dc.keep_row_count('MyDataFrame'):
    dc.MyDataFrame = pandas.merge(
        left=dc.MyDataFrame,
        right=other_data_frame
    )
Parameters
name (str): Name of the data frame.
Returns
DataContainer: Itself.
Raises
TypeError: If the length of the data frame changed.
def load_all(self): (source)

Load all data frames into memory.

def save(self, file_path: pathlib.Path, permanent_names: List[str] = None, zip_names: List[str] = None, meta: Dict = None): (source)

Save the container in the file system at file_path.

If file_path is a ZIP file then the container is stored as a compressed archive containing all data frame files as pickles. Otherwise the path is used as a folder and each data frame is stored as its own file.

Hint

In folder mode by default the on-demand data frames are ZIP compressed. But if zip_names is given, only data frames with those names are compressed, no matter if they are on-demand or permanent. If all data frame files should be uncompressed, set zip_names=[].

Previously existing meta data in the container is not lost; it is updated via update_meta_info() with the given meta argument.

Parameters
file_path (pathlib.Path): Path to a ZIP file or a folder.
permanent_names (List[str]): Names of data frames marked as permanent.
zip_names (List[str]): Works only on modified data frames.
meta (Dict): Dictionary to update the container's meta data with.
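A sketch of the compression variants in folder mode, assuming an existing container dc; the frame names are hypothetical:

import pathlib

folder = pathlib.Path('folder')

# Default: on-demand frames compressed, permanent frames not.
dc.save(folder, permanent_names=['Bar'])

# Compress only 'Foo', no matter if it is on-demand or permanent.
dc.save(folder, zip_names=['Foo'])

# Store all data frame files uncompressed.
dc.save(folder, zip_names=[])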
def update_meta_info(self, value: dict): (source)

New keys are added and existing keys are overwritten; nested dicts are merged recursively.
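A sketch of the merge behavior, assuming the container's meta currently is {'version': 1, 'limits': {'rows': 10}}; all values are hypothetical:

dc.update_meta_info({'author': 'jane', 'limits': {'cols': 5}})
# The meta data is now:
# {'version': 1, 'author': 'jane', 'limits': {'rows': 10, 'cols': 5}}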

@property
meta: dict = (source)

The meta data (e.g. version number) as dict.

TODO explain '_created' key, etc pp

@property
names: list[str] = (source)

List all data frame names no matter if they are on-demand or permanent type.

@property
names_permanent: list[str] = (source)

List names of permanent data frames.

@classmethod
def _format_log_data(cls, entries: list[tuple], meta: dict, njust: int) -> str: (source)

A helper method for _log_save() and _log_loaded().

@classmethod
def _load_from_archive(cls, archive_path: pathlib.Path) -> DataContainer: (source)

Load a container from an archive file.

See load() for details.

@classmethod
def _load_from_folder(cls, folder_path: pathlib.Path) -> DataContainer: (source)

Load a container from a folder.

See load() for details.

@classmethod
def _write_folder(cls, folder_path: pathlib.Path, dataframes: dict[str, pandas.DataFrame], permanent_names: Iterable[str], meta: Dict, zip_names: Iterable[str]) -> list[tuple[str]]: (source)

Write data frames into a folder.

By default on-demand data frames are written as compressed and permanent data frames as uncompressed pickle files. This default behavior is used if zip_names is None. If zip_names is empty ([]), none of the data frames are compressed.

Parameters
folder_path (pathlib.Path): Destination path.
dataframes (dict[str, pandas.DataFrame]): Dict of pandas.DataFrame's indexed by their name.
permanent_names (Iterable[str]): List of data frame names that need to be persistent.
meta (Dict): Dictionary with additional (user defined) meta data.
zip_names (Iterable[str]): List of data frame names that should be compressed no matter if they are on-demand or permanent.
Returns
list[tuple[str]]: A list of tuples indicating the status of each data frame.
@classmethod
def _write_zip_file(cls, file_like_obj: Union[pathlib.Path, zipfile.ZipFile], dataframes: Dict[str, pandas.DataFrame], permanent_names: List[str], meta: Dict) -> list[tuple[str]]: (source)

Write data frames to a ZIP archive.

Parameters
file_like_obj (Union[pathlib.Path, zipfile.ZipFile]): Path to the ZIP file to create or a zipfile.ZipFile object.
dataframes (Dict[str, pandas.DataFrame]): Dictionary with pandas.DataFrame's.
permanent_names (List[str]): List of data frame names to mark as 'permanent'.
meta (Dict): Meta data to add to the ZIP file.
Returns
list[tuple[str]]: A list of tuples indicating the status of each data frame.
def _assemble_log_message(self, path: pathlib.Path, mode: str, indicators: list[tuple[str, str]]) -> str: (source)

Undocumented

def _check_meta_info(self, meta_check: Dict): (source)

Compare the meta data given by meta_check with the container's meta.

The ordering of the keys is ignored.

Parameters
meta_check (Dict): A dictionary with the expected meta data.
Raises
ValueError: If the dicts have different keys and/or values.
def _compute_filepath_for_dataframe(self, name_prefix: str, df_name: str) -> pathlib.Path: (source)

Try to find the path to a data frame's file.

Returns
pathlib.Path: The path to the file if it exists.
Raises
FileNotFoundError: If no file for the data frame could be found.
def _copy_dataframe_file(self, name: str, folder_path: pathlib.Path, permanent: bool, zipped: bool): (source)

Copy a data frame file to a new folder.

The data frame to copy is specified by name and must exist in the current data container. The status can be toggled between permanent and on-demand using the permanent argument. Apart from that, the argument zipped is used to validate whether the copy is possible: a ValueError is raised if the source is compressed but the destination is not, or the other way around.

Parameters
name (str): Name of the data frame used in this data container.
folder_path (pathlib.Path): Destination folder to copy the data frame file into.
permanent (bool): Whether the data frame should be permanent after copying.
zipped (bool): Indicates if the destination file is compressed or not. A ValueError is raised if this differs from the source file.
Returns
A string used as status indicator in _log_save(), e.g. P_, Oz.
Raises
ValueError: If the destination file should be compressed but the source file is not, or the other way around.
def _generate_creation_info(self) -> dict: (source)
def _hash_dataframe(self, df_name: str) -> str: (source)

Create a hash value for a data frame and return it.

Parameters
df_name (str): Name of the data frame.
Returns
str: The hash (sha1) as a string (hexdigits).
Raises
KeyError: If the data frame is not loaded yet.
def _is_compressed(self, name: str) -> bool: (source)

Determine if the data frame file is compressed.

Parameters
name (str): Name of a data frame.
Returns
bool: True if the data frame file is compressed, False otherwise.
Raises
TypeError: If the container was not stored yet or is in archive mode.
def _is_dataframe_modified(self, df_name: str) -> bool: (source)

Return the data frame's modified state.

On-demand data frames that are not loaded are assumed to be unmodified and False is returned. If a data frame was attached (from memory) to the container but not stored yet, it is assumed to be modified and True is returned.

Parameters
df_name (str): Name of the data frame.
Returns
bool: Boolean giving the modified state.
def _load_dataframe(self, df_name: str): (source)

Load one specific data frame from its file.

Parameters
df_name (str): Name of the data frame.
def _load_dataframe_from_archive(self, name_prefix: str, df_name: str): (source)

Load a data frame from an archive.

def _load_dataframe_from_folder(self, name_prefix: str, df_name: str): (source)

Load a data frame from a folder.

def _log_loaded(self, path: pathlib.Path, mode: str, indicators: list[tuple[str, str]]): (source)

Create log message about a loaded container.

def _log_save(self, path: pathlib.Path, mode: str, indicators: list[tuple[str, str]]): (source)

Create log message about a stored container.

def _save_as_archive(self, file_path: pathlib.Path, permanent_names: List[str]): (source)

Save the container instance as ZIP compressed file.

See save() for details.

def _save_as_folder(self, folder_path: pathlib.Path, permanent_names: List[str], zip_names: List[str]): (source)

Store the container in folder mode.

Parameters
folder_path (pathlib.Path): Destination path.
permanent_names (List[str]): List of data frame names that need to be persistent.
zip_names (List[str]): List of data frame names that should be compressed no matter if they are on-demand or permanent.
def _save_previos_archive_as_folder(self, folder_path: pathlib.Path, permanent_names: List[str], zip_names: List[str]): (source)

Undocumented

def _unload_dataframe(self, df_name: str): (source)

Unload a data frame from the container.

Only data frames of on-demand type can be unloaded. Permanent data frames will raise a ValueError. If the data frame exists but is not loaded, nothing is raised.

Parameters
df_name (str): Name of the data frame.
Raises
KeyError: If the data frame does not exist.
ValueError: If the data frame is of permanent type.
TypeError: If the data frame is modified.
_ARCHIVE_SUFFIX: str = (source)

Filename suffix for ZIP compressed container in archive mode.

Value
'.zip'
_META_FILENAME: str = (source)

Name of the meta data holding file.

Value
'__meta'
_PICKLE_PROTOCOL: int = (source)

The Pickle protocol version used.

Value
4
_PICKLE_SUFFIX: str = (source)

Filename suffix for pickle files.

Value
'.pickle'
_PICKLE_ZIP_SUFFIX: str = (source)

Filename suffix for zip-compressed pickle files.

Value
'.pickle.zip'
_dataframe_hashes: dict = (source)

Hash values of loaded data frames (folder mode only).

_dataframes: dict = (source)

Dictionary of data frames.

_meta: dict = (source)

The meta data (e.g. version number) as dict.

_names_on_demand = (source)

List of data frame names which are of type on-demand.

_source_obj = (source)

Path to the file or folder the container was loaded from.