class documentation

A container managing multiple pandas.DataFrame.

The container can operate in two modes. In folder mode the container is located in a folder on the file system and each data frame has its own file. In archive mode the container is one single compressed (ZIP) archive with each data frame as an entry in it.

Two types of data frames are distinguished. On-demand data frames are loaded when they are accessed for the first time. A permanent data frame is loaded right away when the container is instantiated. Which storage mode is used depends on the path name: *.zip results in archive mode and everything else in folder mode.

In folder mode a data frame can be stored as a *.pickle or as a (ZIP) compressed *.pickle.zip file. By default the former is used for permanent and the latter for on-demand data frames.
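For illustration, a folder mode container with a permanent frame 'Bar' and an on-demand frame 'Foo' might look roughly like this on disk. This is a sketch only; it combines the _ filename prefix described in load() with the suffix constants listed below, and the names are hypothetical.

folder/
    __meta           <- meta data file
    _Bar.pickle      <- permanent, uncompressed by default
    Foo.pickle.zip   <- on-demand, ZIP compressed by default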

Warning

Although a container stores (multiple) data frames in the file system, it is not intended as a long-term archive format. It is temporary storage and can be used between running two different scripts or steps of research.

Example

import pathlib
import pandas

# Data for the container organized in a dict.
dataframes = {
    'Foo': pandas.DataFrame(),
    'Bar': pandas.DataFrame()
}

# Create container object
dc = DataContainer(dataframes=dataframes, permanent_names=['Bar'])

# Save the container as one ZIP file (archive mode)
fp = pathlib.Path('data.zip')
dc.save(fp)

# Load the container
dc = DataContainer.load(fp)

# 'Bar' is loaded immediately because it was marked as
# permanent. But 'Foo' is not loaded yet.
print(dc.names)  # ['Bar', 'Foo']
print(dc.Bar.head())
print(dc.Foo.head())  # <- Now "Foo" is going to be loaded

# Add a new data frame
dc.Elke = pandas.DataFrame()

# Save in ZIP
dc.save(fp)

# Save again in folder mode
fp = pathlib.Path('folder')
dc.save(fp)
Class Method load Load a data container from the path.
Method __delattr__ Delete a data frame by its attribute name (e.g. del dc.foo).
Method __delitem__ Delete a data frame by its key name (e.g. del dc['foo']).
Method __getattr__ Get data frame as instance attribute.
Method __getitem__ Dict like behavior of the container.
Method __init__ Create a container in memory.
Method __len__ Count of all data frames.
Method __setattr__ Add new data frame via attribute name.
Method __setitem__ Add new data frame via key.
Method __str__ Literal description of the container containing number of data frames, names of permanent and on-demand data frames and meta information.
Method describe_with_markdown Describe all data frames of the container as markdown tables.
Method keep_row_count Context manager monitoring the number of rows of a data frame.
Method load_all Load all data frames into memory.
Method save Save the container in the file system at file_path.
Method update_meta_info New keys are added and existing keys are overwritten; nested dicts are merged recursively.
Property meta The meta data (e.g. version number) as dict.
Property names List all data frame names no matter if they are on-demand or permanent type.
Property names_permanent List names of permanent data frames.
Class Method _format_log_data A helper method for _log_save() and _log_loaded().
Class Method _load_from_archive Load a container from an archive file.
Class Method _load_from_folder Load a container from a folder.
Class Method _write_folder Write data frames into a folder.
Class Method _write_zip_file Write data frames to a ZIP archive.
Method _assemble_log_message Undocumented
Method _check_meta_info Compare the meta data given by meta_check and in meta.
Method _compute_filepath_for_dataframe Try to find the path to a data frame's file.
Method _copy_dataframe_file Copy a data frame file to a new folder.
Method _generate_creation_info No summary
Method _hash_dataframe Create a hash value for a data frame and return it.
Method _is_compressed Determine if the data frame file is compressed.
Method _is_dataframe_modified Return the data frame's modified state.
Method _load_dataframe Load one specific data frame from its file.
Method _load_dataframe_from_archive Load a data frame from an archive.
Method _load_dataframe_from_folder Load a data frame from a folder.
Method _log_loaded Create log message about a loaded container.
Method _log_save Create log message about a stored container.
Method _save_as_archive Save the container instance as ZIP compressed file.
Method _save_as_folder Store the container in folder mode.
Method _save_previos_archive_as_folder Undocumented
Method _unload_dataframe Unload a data frame from the container.
Constant _ARCHIVE_SUFFIX Filename suffix for ZIP compressed container in archive mode.
Constant _META_FILENAME Name of the meta data holding file.
Constant _PICKLE_PROTOCOL The Pickle protocol version used.
Constant _PICKLE_SUFFIX Filename suffix for pickle files.
Constant _PICKLE_ZIP_SUFFIX Filename suffix for zip-compressed pickle files.
Instance Variable _dataframe_hashes Hash values of loaded data frames (folder mode only).
Instance Variable _dataframes Dictionary of data frames.
Instance Variable _meta The meta data (e.g. version number) as dict.
Instance Variable _names_on_demand List of data frame names which are of type on-demand.
Instance Variable _source_obj Path to the file or folder the container was loaded from.
@classmethod
def load(cls, file_path: Union[pathlib.Path], meta_to_check: Dict = None) -> DataContainer: (source)

Load a data container from the path.

Data frame files beginning with a _ prefix are treated as permanent and loaded immediately. Other data frames are treated as on-demand and loaded from files only when they are accessed.

If the container contains meta information (file __meta) and the argument meta_to_check is given, the two dictionaries are compared. An exception is raised when they are not equal. This could be used for a version number or a date. See _check_meta_info() for details.

Parameters
file_path (Union[pathlib.Path]): Path to the file or folder of the container.
meta_to_check (Dict): Meta information to validate.
Returns
DataContainer: The container object.
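A minimal sketch of the meta check; the meta values are hypothetical and assume the container was saved with matching meta data:

import pathlib

# Raises an exception if the stored meta data differs,
# e.g. when the file was written by an older script version.
dc = DataContainer.load(
    pathlib.Path('data.zip'),
    meta_to_check={'version': 2}
)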
def __delattr__(self, name): (source)

Delete a data frame by its attribute name (e.g. del dc.foo).

The underlying files are not deleted.

def __delitem__(self, key): (source)

Delete a data frame by its key name (e.g. del dc['foo']).

The underlying files are not deleted.

def __getattr__(self, attr: str) -> pandas.DataFrame: (source)

Get data frame as instance attribute.

Get a data frame by using its name as instance attribute name (dc.foobar). On-demand data frames are loaded automatically in the background.

Example

dc = bandas.DataContainer.load('data.zip')

# Access dataframe as instance attribute.
# The dataframe's name here is "foobar".
dc.foobar.head()
Parameters
attr (str): Name of a data frame.
Returns
pandas.DataFrame: The data frame.
def __getitem__(self, key_or_idx: Union[int, str]) -> Union[str, pandas.DataFrame]: (source)

Dict like behavior of the container.

Get a data frame by using its name as key (dc[name]). Using a numerical index (dc[0]) will return the name of a data frame as str. On-demand data frames are loaded automatically in the background.

Example

dc = bandas.DataContainer.load('data.zip')

# Iterate over dataframe names (like a dict)
for name in dc:
    print(name)

    # Use the name like a key
    print(dc[name].head())
Parameters
key_or_idx (Union[int, str]): Name of a data frame or index of its name.
Returns
Union[str, pandas.DataFrame]: The data frame when a name is given, or the name as str when a numerical index is given.
def __init__(self, dataframes: Dict[str, pandas.DataFrame] = None, permanent_names: List[str] = None, meta: Dict = None): (source)

Create a container in memory.

Parameters
dataframes (Dict[str, pandas.DataFrame]): Dictionary of pandas.DataFrame's.
permanent_names (List[str]): List of data frame names. See load() for details.
meta (Dict): Dictionary with additional data.
def __len__(self) -> int: (source)

Count of all data frames.

The number is based on the names property.

def __setattr__(self, attr: str, value: pandas.DataFrame): (source)

Add new data frame via attribute name.

Parameters
attr (str): Name of the data frame.
value (pandas.DataFrame): The data frame.
def __setitem__(self, key: str, dataframe: pandas.DataFrame): (source)

Add new data frame via key.

The data frame is added to the container as on-demand. It can be marked as permanent when calling save().

Parameters
key (str): Name of the data frame.
dataframe (pandas.DataFrame): The data frame.
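A short sketch, assuming an existing container dc; the frame name is hypothetical:

import pathlib
import pandas

# Added as on-demand by default.
dc['Elke'] = pandas.DataFrame()

# Mark it as permanent on the next save.
dc.save(pathlib.Path('data.zip'), permanent_names=['Elke'])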
def __str__(self) -> str: (source)

Literal description of the container containing number of data frames, names of permanent and on-demand data frames and meta information.

def describe_with_markdown(self, head_n: int = 4, width: int = 80, **kwargs) -> str: (source)

Describe all data frames of the container as markdown tables.

The width is respected for markdown tables also.

Parameters
head_n (int): Number of data rows per frame.
width (int): Width in characters of each table.
**kwargs: Used to hand over arguments to buhtzology.break_paragraph().
Returns
str: The result as a string.
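A minimal usage sketch, assuming an existing container dc:

# Markdown tables with 10 preview rows per data frame,
# each table limited to 100 characters in width.
print(dc.describe_with_markdown(head_n=10, width=100))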
@contextlib.contextmanager
def keep_row_count(self, name: str) -> DataContainer: (source)

Context manager monitoring the number of rows of a data frame.

The context manager can be used to make sure that an operation on the data does not delete or add rows. This can accidentally happen when using pandas.merge().

dc = get_container()

with dc.keep_row_count('MyDataFrame'):
    dc.MyDataFrame = pandas.merge(
        left=dc.MyDataFrame,
        right=other_data_frame
    )
Parameters
name (str): Name of the data frame.
Returns
DataContainer: Itself.
Raises
TypeError: If the length of the data frame changed.
def load_all(self): (source)

Load all data frames into memory.

def save(self, file_path: pathlib.Path, permanent_names: List[str] = None, zip_names: List[str] = None, meta: Dict = None): (source)

Save the container in the file system at file_path.

If file_path is a ZIP file then the container is stored as a compressed archive containing all data frame files as pickles. Otherwise the path is used as a folder and each data frame is stored as its own file.

Hint

In folder mode by default the on-demand data frames are ZIP compressed. But if zip_names is given, only data frames with those names are compressed, no matter if they are on-demand or permanent. If all data frame files should be uncompressed, set zip_names=[].

Previously existing meta data in the container is not lost; it is updated via update_meta_info() with the given meta argument.

Parameters
file_path (pathlib.Path): Path to a ZIP file or a folder.
permanent_names (List[str]): Names of data frames marked as permanent.
zip_names (List[str]): Works only on modified data frames.
meta (Dict): Dictionary to update the container's meta data with.
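A sketch of the compression variants in folder mode, assuming an existing container dc; the frame names are hypothetical:

import pathlib

folder = pathlib.Path('folder')

# Default: on-demand frames compressed, permanent frames not.
dc.save(folder, permanent_names=['Bar'])

# Compress only 'Foo', no matter if it is on-demand or permanent.
dc.save(folder, zip_names=['Foo'])

# Store all data frame files uncompressed.
dc.save(folder, zip_names=[])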
def update_meta_info(self, value: dict): (source)

New keys are added and existing keys are overwritten; nested dicts are merged recursively.
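A sketch of the merge behavior, assuming the container's meta currently is {'version': 1, 'limits': {'rows': 10}}; all values are hypothetical:

dc.update_meta_info({'author': 'jane', 'limits': {'cols': 5}})
# The meta data is now:
# {'version': 1, 'author': 'jane', 'limits': {'rows': 10, 'cols': 5}}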

@property
meta: dict = (source)

The meta data (e.g. version number) as dict.

TODO explain '_created' key, etc pp

@property
names: list[str] = (source)

List all data frame names no matter if they are on-demand or permanent type.

@property
names_permanent: list[str] = (source)

List names of permanent data frames.

@classmethod
def _format_log_data(cls, entries: list[tuple], meta: dict, njust: int) -> str: (source)

A helper method for _log_save() and _log_loaded().

@classmethod
def _load_from_archive(cls, archive_path: pathlib.Path) -> DataContainer: (source)

Load a container from an archive file.

See load() for details.

@classmethod
def _load_from_folder(cls, folder_path: pathlib.Path) -> DataContainer: (source)

Load a container from a folder.

See load() for details.

@classmethod
def _write_folder(cls, folder_path: pathlib.Path, dataframes: dict[str, pandas.DataFrame], permanent_names: Iterable[str], meta: Dict, zip_names: Iterable[str]) -> list[tuple[str]]: (source)

Write data frames into a folder.

By default on-demand data frames are written as compressed and permanent data frames as uncompressed pickle files. This default behavior is used if zip_names is None. If zip_names is empty ([]), none of the data frames are compressed.

Parameters
folder_path (pathlib.Path): Destination path.
dataframes (dict[str, pandas.DataFrame]): Dict of pandas.DataFrame's indexed by their name.
permanent_names (Iterable[str]): List of data frame names that need to be persistent.
meta (Dict): Dictionary with additional (user defined) meta data.
zip_names (Iterable[str]): List of data frame names that should be compressed no matter if they are on-demand or permanent.
Returns
list[tuple[str]]: A list of tuples indicating the status of each data frame.
@classmethod
def _write_zip_file(cls, file_like_obj: Union[pathlib.Path, zipfile.ZipFile], dataframes: Dict[str, pandas.DataFrame], permanent_names: List[str], meta: Dict) -> list[tuple[str]]: (source)

Write data frames to a ZIP archive.

Parameters
file_like_obj (Union[pathlib.Path, zipfile.ZipFile]): Path to the ZIP file to create or a zipfile.ZipFile object.
dataframes (Dict[str, pandas.DataFrame]): Dictionary with pandas.DataFrame's.
permanent_names (List[str]): List of data frame names to mark as 'permanent'.
meta (Dict): Meta data to add to the ZIP file.
Returns
list[tuple[str]]: A list of tuples indicating the status of each data frame.
def _assemble_log_message(self, path: pathlib.Path, mode: str, indicators: list[tuple[str, str]]) -> str: (source)

Undocumented

def _check_meta_info(self, meta_check: Dict): (source)

Compare the meta data given by meta_check with the container's meta.

The ordering of the keys is ignored.

Parameters
meta_check (Dict): A dictionary with the expected meta data.
Raises
ValueError: If the dicts have different keys and/or values.
def _compute_filepath_for_dataframe(self, name_prefix: str, df_name: str) -> pathlib.Path: (source)

Try to find the path to a data frame's file.

Returns
pathlib.Path: The path to the file if it exists.
Raises
FileNotFoundError: If no file for the data frame could be found.
def _copy_dataframe_file(self, name: str, folder_path: pathlib.Path, permanent: bool, zipped: bool): (source)

Copy a data frame file to a new folder.

The data frame to copy is specified by name and must exist in the current data container. The status can be toggled between permanent and on-demand using the permanent argument. Apart from that, the argument zipped is used to validate whether the copy is possible: a ValueError is raised if the source is compressed but the destination is not, or the other way around.

Parameters
name (str): Name of the data frame used in this data container.
folder_path (pathlib.Path): Destination folder to copy the data frame file into.
permanent (bool): Whether the data frame should be permanent after copying.
zipped (bool): Indicates if the destination file is compressed or not. A ValueError is raised if this differs from the source file.
Returns
A string used as status indicator in _log_save(), e.g. P_, Oz.
Raises
ValueError: If the destination file should be compressed but the source file is not, or the other way around.
def _generate_creation_info(self) -> dict: (source)
def _hash_dataframe(self, df_name: str) -> str: (source)

Create a hash value for a data frame and return it.

Parameters
df_name (str): Name of the data frame.
Returns
str: The hash (sha1) as a string (hexdigits).
Raises
KeyError: If the data frame is not loaded yet.
def _is_compressed(self, name: str) -> bool: (source)

Determine if the data frame file is compressed.

Parameters
name (str): Name of a data frame.
Returns
bool: True if the data frame file is compressed, False otherwise.
Raises
TypeError: If the container was not stored yet or is in archive mode.
def _is_dataframe_modified(self, df_name: str) -> bool: (source)

Return the data frame's modified state.

On-demand data frames that are not loaded are assumed to be unmodified and False is returned. If a data frame was attached (from memory) to the container but not stored yet, it is assumed to be modified and True is returned.

Parameters
df_name (str): Name of the data frame.
Returns
bool: Boolean giving the modified state.
def _load_dataframe(self, df_name: str): (source)

Load one specific data frame from its file.

Parameters
df_name (str): Name of the data frame.
def _load_dataframe_from_archive(self, name_prefix: str, df_name: str): (source)

Load a data frame from an archive.

def _load_dataframe_from_folder(self, name_prefix: str, df_name: str): (source)

Load a data frame from a folder.

def _log_loaded(self, path: pathlib.Path, mode: str, indicators: list[tuple[str, str]]): (source)

Create log message about a loaded container.

def _log_save(self, path: pathlib.Path, mode: str, indicators: list[tuple[str, str]]): (source)

Create log message about a stored container.

def _save_as_archive(self, file_path: pathlib.Path, permanent_names: List[str]): (source)

Save the container instance as ZIP compressed file.

See save() for details.

def _save_as_folder(self, folder_path: pathlib.Path, permanent_names: List[str], zip_names: List[str]): (source)

Store the container in folder mode.

Parameters
folder_path (pathlib.Path): Destination path.
permanent_names (List[str]): List of data frame names that need to be persistent.
zip_names (List[str]): List of data frame names that should be compressed no matter if they are on-demand or permanent.
def _save_previos_archive_as_folder(self, folder_path: pathlib.Path, permanent_names: List[str], zip_names: List[str]): (source)

Undocumented

def _unload_dataframe(self, df_name: str): (source)

Unload a data frame from the container.

Only data frames of on-demand type can be unloaded. Permanent data frames will raise a ValueError. If the data frame exists but is not loaded, nothing is raised.

Parameters
df_name (str): Name of the data frame.
Raises
KeyError: If the data frame does not exist.
ValueError: If the data frame is of permanent type.
TypeError: If the data frame is modified.
_ARCHIVE_SUFFIX: str = (source)

Filename suffix for ZIP compressed container in archive mode.

Value
'.zip'
_META_FILENAME: str = (source)

Name of the meta data holding file.

Value
'__meta'
_PICKLE_PROTOCOL: int = (source)

The Pickle protocol version used.

Value
4
_PICKLE_SUFFIX: str = (source)

Filename suffix for pickle files.

Value
'.pickle'
_PICKLE_ZIP_SUFFIX: str = (source)

Filename suffix for zip-compressed pickle files.

Value
'.pickle.zip'
_dataframe_hashes: dict = (source)

Hash values of loaded data frames (folder mode only).

_dataframes: dict = (source)

Dictionary of data frames.

_meta: dict = (source)

The meta data (e.g. version number) as dict.

_names_on_demand = (source)

List of data frame names which are of type on-demand.

_source_obj = (source)

Path to the file or folder the container was loaded from.