class DataContainer: (source)
Constructors: DataContainer.load(file_path, meta_to_check), DataContainer(dataframes, permanent_names, meta), DataContainer._load_from_archive(archive_path), DataContainer._load_from_folder(folder_path)
A container managing multiple pandas.DataFrame.
The container can operate in two modes. In folder mode the container is located in a folder on the file system and each data frame has its own file. In archive mode the container is one single compressed (ZIP) archive with each data frame as an entry in it. Which mode is used depends on the path name: *.zip results in archive mode and everything else in folder mode.
Two types of data frames are distinguished. An on-demand data frame is loaded when it is accessed for the first time. A permanent data frame is loaded right away when the container is instantiated.
In folder mode a data frame can be stored as *.pickle or as a (ZIP) compressed *.pickle.zip file. By default the former is used for permanent and the latter for on-demand data frames.
Warning
Although a container stores (multiple) data frames in the file system, it is not intended as a long-term archive format. It is temporary storage that can be used between running two different scripts or steps of research.
Example
# Data for the container organized in a dict.
dataframes = {
    'Foo': pandas.DataFrame(),
    'Bar': pandas.DataFrame()
}

# Create the container object
dc = DataContainer(dataframes=dataframes, permanent_names=['Bar'])

# Save the container as one ZIP file (archive mode)
fp = pathlib.Path('data.zip')
dc.save(fp)

# Load the container
dc = DataContainer.load(fp)

# 'Bar' is loaded immediately because it was marked as
# permanent. But 'Foo' is not loaded yet.
print(dc.names)  # ['Bar', 'Foo']
print(dc.Bar.head())
print(dc.Foo.head())  # <- Now "Foo" is going to be loaded

# Add a new data frame
dc.Elke = pandas.DataFrame()

# Save as ZIP again (archive mode)
dc.save(fp)

# Save again in folder mode
fp = pathlib.Path('folder')
dc.save(fp)
Class Method | load | Load a data container from the path.
Method | __delattr__ | Delete a data frame by its attribute name (e.g. del dc.foo).
Method | __delitem__ | Delete a data frame by its key name (e.g. del dc['foo']).
Method | __getattr__ | Get a data frame as instance attribute.
Method | __getitem__ | Dict-like behavior of the container.
Method | __init__ | Create a container in memory.
Method | __len__ | Count of all data frames.
Method | __setattr__ | Add a new data frame via attribute name.
Method | __setitem__ | Add a new data frame via key.
Method | __str__ | Literal description of the container containing the number of data frames, the names of permanent and on-demand data frames and meta information.
Method | describe | Describe all data frames of the container as markdown tables.
Method | keep | Context manager monitoring the number of rows of a data frame.
Method | load | Load all data frames into memory.
Method | save | Save the container in the file system at file_path.
Method | update | New keys are added and existing keys are overwritten, with respect to nested dicts.
Property | meta | The meta data (e.g. version number) as dict.
Property | names | List all data frame names, no matter if they are of on-demand or permanent type.
Property | names | List names of permanent data frames.
Class Method | _format | A helper method for _log_save() and _log_loaded().
Class Method | _load | Load a container from an archive file.
Class Method | _load | Load a container from a folder.
Class Method | _write | Write data frames into a folder.
Class Method | _write | Write data frames to a ZIP archive.
Method | _assemble | Undocumented
Method | _check | Compare the meta data given by meta_check with the data in meta.
Method | _compute | Try to find the path to a data frame's file.
Method | _copy | Copy a data frame file to a new folder.
Method | _generate | No summary
Method | _hash | Create a hash value for a data frame and return it.
Method | _is | Determine if the data frame file is compressed.
Method | _is | Return the data frame's modified state.
Method | _load | Load one specific data frame from its file.
Method | _load | Load a data frame from an archive.
Method | _load | Load a data frame from a folder.
Method | _log | Create a log message about a loaded container.
Method | _log | Create a log message about a stored container.
Method | _save | Save the container instance as a ZIP compressed file.
Method | _save | Store the container in folder mode.
Method | _save | Undocumented
Method | _unload | Unload a data frame from the container.
Constant | _ARCHIVE | Filename suffix for ZIP compressed containers in archive mode.
Constant | _META | Name of the meta data holding file.
Constant | _PICKLE | The pickle protocol version used.
Constant | _PICKLE | Filename suffix for pickle files.
Constant | _PICKLE | Filename suffix for ZIP compressed pickle files.
Instance Variable | _dataframe | Hash values of loaded data frames (folder mode only).
Instance Variable | _dataframes | Dictionary of data frames.
Instance Variable | _meta | The meta data (e.g. version number) as dict.
Instance Variable | _names | List of data frame names which are of on-demand type.
Instance Variable | _source | Path to the file or folder the container was loaded from.
def load(cls, file_path: Union[pathlib.Path], meta_to_check: Dict = None) -> DataContainer: (source)
Load a data container from the path.
Data frame files beginning with a _ prefix are treated as permanent and loaded immediately. Other data frames are treated as on-demand and loaded from files only when they are accessed.
If the container contains meta information (file __meta) and the argument meta_to_check is given, these two dictionaries are compared. An exception is raised when they are not equal. This could be used for a version number or a date. See _check_meta_info() for details.
Parameters
file_path: Union[pathlib.Path] | Path to the file or folder of the container.
meta_to_check: Dict | Meta info to validate.
Returns
DataContainer | The container object.
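A minimal sketch of the meta check on load; the file name and meta values are invented for illustration and the package is assumed to be importable as bandas, as in the other examples:

import pathlib
import pandas
import bandas

# Build and save a container carrying meta data.
dc = bandas.DataContainer(
    dataframes={'Foo': pandas.DataFrame()},
    meta={'version': 2}
)
dc.save(pathlib.Path('data.zip'))

# Loading succeeds because the given meta data matches.
dc = bandas.DataContainer.load('data.zip', meta_to_check={'version': 2})

# A different value, e.g. meta_to_check={'version': 3}, would raise an exception.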
Get a data frame as instance attribute.
Get a data frame by using its name as instance attribute name (dc.foobar). On-demand data frames are loaded automatically in the background.
Example
dc = bandas.DataContainer.load('data.zip')
# Access the dataframe as instance attribute.
# The dataframe's name here is "foobar".
dc.foobar.head()
Parameters
attr: str | Name of a data frame.
Returns
pandas.DataFrame | Undocumented
Dict-like behavior of the container.
Get a data frame by using its name as key (dc[name]). Using a numerical index (dc[0]) will return the name of a data frame as str. On-demand data frames are loaded automatically in the background.
Example
dc = bandas.DataContainer.load('data.zip')
# Iterate over dataframe names (like a dict)
for name in dc:
    print(name)
    # Use the name like a key
    print(dc[name].head())
Parameters
keyUnion[ | Name of a data frame or the index of its name.
Returns
Union[ | Undocumented
def __init__(self, dataframes: Dict[str, pandas.DataFrame] = None, permanent_names: List[str] = None, meta: Dict = None): (source)
Create a container in memory.
Parameters
dataframes: Dict[str, pandas.DataFrame] | Dictionary of pandas.DataFrame's.
permanent_names: List[str] | List of data frame names. See load() for details.
meta: Dict | Dictionary with additional data.
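A short sketch of creating a container in memory; the data frame name and the meta content are invented for illustration and the package is assumed to be importable as bandas:

import pandas
import bandas

dataframes = {'Measurements': pandas.DataFrame({'x': [1, 2, 3]})}
dc = bandas.DataContainer(
    dataframes=dataframes,
    permanent_names=['Measurements'],
    meta={'created_by': 'analysis_step_1'}
)
print(len(dc))   # 1
print(dc.meta)   # {'created_by': 'analysis_step_1'}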
Add a new data frame via attribute name.
Parameters
attr: str | Name of the data frame.
value: pandas.DataFrame | The data frame.
Add a new data frame via key.
The data frame is added to the container as on-demand. Change it to permanent when using save().
Parameters
key: str | Name of the data frame.
dataframe: pandas.DataFrame | Undocumented
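As an illustration of both ways to add a data frame (the names are invented; bandas is assumed as package name):

import pandas
import bandas

dc = bandas.DataContainer(dataframes={'Foo': pandas.DataFrame({'a': [1]})})
# Add a further data frame via key ...
dc['Bar'] = pandas.DataFrame({'b': [2]})
# ... or via attribute name.
dc.Baz = pandas.DataFrame({'c': [3]})
print(dc.names)  # e.g. ['Bar', 'Baz', 'Foo']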
Literal description of the container containing number of data frames, names of permanent and on-demand data frames and meta information.
Describe all data frames of the container as markdown tables.
The width is also respected for the markdown tables.
Parameters
headint | Number of data rows per frame.
width: int | Width in characters of each table.
**kwargs | Used to hand over arguments to buhtzology.break_paragraph().
Returns
str | The result as a string.
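A hedged sketch; only the width argument is used here because its name is documented above, and 'data.zip' is an invented file name:

import bandas

dc = bandas.DataContainer.load('data.zip')
# Render every data frame as a markdown table, 80 characters wide.
print(dc.describe(width=80))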
Context manager monitoring the number of rows of a data frame.
The context manager can be used to make sure that an operation on the data does not delete or add rows. This can happen accidentally when using pandas.merge().
dc = get_container()
with dc.keep_row_count('MyDataFrame'):
    dc.MyDataFrame = pandas.merge(
        left=dc.MyDataFrame,
        right=other_data_frame
    )
Parameters
name: str | Name of the data frame.
Returns
DataContainer | Itself.
Raises
TypeError | If the length of the data frame changed.
def save(self, file_path: pathlib.Path, permanent_names: List[str] = None, zip_names: List[str] = None, meta: Dict = None): (source)
Save the container in the file system at file_path.
If file_path is a ZIP file then the container is stored as a compressed archive containing all data frame files as pickles. Otherwise the path is used as a folder and each data frame is stored as a single file.
Hint
In folder mode the on-demand data frames are ZIP compressed by default. But if zip_names is given, only data frames with those names are compressed, no matter if they are on-demand or permanent. If all data frame files should be uncompressed, set zip_names=[].
Previously existing meta data in the container is not lost but only updated with the meta argument using update_meta_info().
Parameters
file_path: pathlib.Path | Path to a ZIP file or a folder.
permanent_names: List[str] | Names of data frames marked as permanent.
zip_names: List[str] | Works only on modified data frames.
meta: Dict | To update the container's meta data with.
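A sketch of controlling compression when saving in folder mode; the paths and the data frame name are invented:

import pathlib
import bandas

dc = bandas.DataContainer.load('data.zip')
# Folder mode: only 'Foo' is written ZIP compressed (*.pickle.zip),
# every other (modified) data frame as a plain *.pickle file.
dc.save(pathlib.Path('folder'), zip_names=['Foo'])
# Folder mode without any compression.
dc.save(pathlib.Path('folder'), zip_names=[])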
def _format_log_data(cls, entries: list[tuple], meta: dict, njust: int) -> str: (source)
A helper method for _log_save() and _log_loaded().
def _write_folder(cls, folder_path: pathlib.Path, dataframes: dict[str, pandas.DataFrame], permanent_names: Iterable[str], meta: Dict, zip_names: Iterable[str]) -> list[tuple[str]]: (source)
Write data frames into a folder.
By default on-demand data frames are written as compressed and permanent data frames as uncompressed pickle files. That default behavior is used if zip_names is None. Otherwise if zip_names is empty ([]) none of the data frames are compressed.
Parameters
folder_path: pathlib.Path | Destination path.
dataframes: dict[str, pandas.DataFrame] | Dict of pandas.DataFrame's indexed by their name.
permanent_names: Iterable[str] | List of data frame names that need to be persistent.
meta: Dict | Dictionary with additional (user defined) meta data.
zip_names: Iterable[str] | List of data frame names that should be compressed, no matter if they are on-demand or permanent.
Returns
list[tuple[str]] | A list of tuples indicating the status of each data frame.
def _write_zip_file(cls, file_like_obj: Union[pathlib.Path, zipfile.ZipFile], dataframes: Dict[str, pandas.DataFrame], permanent_names: List[str], meta: Dict) -> list[tuple[str]]: (source)
Write data frames to a ZIP archive.
Parameters
file_like_obj: Union[pathlib.Path, zipfile.ZipFile] | Path to the ZIP file to create or a zipfile.ZipFile object.
dataframes: Dict[str, pandas.DataFrame] | Dictionary with pandas.DataFrame's.
permanent_names: List[str] | List of data frame names to mark as 'permanent'.
meta: Dict | Meta data to add to the ZIP file.
Returns
list[tuple[str]] | Undocumented
pathlib.Path, mode: str, indicators: list[tuple[str, str]]) -> str: (source)
Undocumented
str, df_name: str) -> pathlib.Path: (source)
Try to find the path to a data frame's file.
Returns
pathlib.Path | The path to the file if it exists.
Raises
FileNotFoundError
str, folder_path: pathlib.Path, permanent: bool, zipped: bool): (source)
Copy a data frame file to a new folder.
The data frame to copy is specified by name and must exist in the current data container. The status can be toggled between permanent and on-demand using the permanent argument. Apart from that, the argument zipped is used to validate if the copy is possible. A ValueError is raised if the source is compressed and the destination is not, or the other way around.
Parameters
name: str | Name of the data frame used in this data container.
folder_path: pathlib.Path | Destination folder to copy the data frame file into.
permanent: bool | Whether the data frame should be permanent after copying it.
zipped: bool | Indicates if the destination file is compressed or not. Will raise ValueError if different from the source file.
Returns
A string used as status indicator in _log_save(), e.g. P_, Oz.
Raises
ValueError | If the destination file should be compressed but the source file is not, or the other way around.
Return the data frame's modified state.
On-demand data frames that are not loaded are assumed to be unmodified and False is returned. If a data frame was attached (from memory) to the container but not stored yet, it is assumed to be modified and True is returned.
Parameters
dfstr | Name of the data frame.
Returns
bool | Boolean giving the modified state.
pathlib.Path, permanent_names: List[str], zip_names: List[str]): (source)
Store the container in folder mode.
Parameters
folder_path: pathlib.Path | Destination path.
permanent_names: List[str] | List of data frame names that need to be persistent.
zip_names: List[str] | List of data frame names that should be compressed, no matter if they are on-demand or permanent.
pathlib.Path, permanent_names: List[str], zip_names: List[str]): (source)
Undocumented
Unload a data frame from the container.
Only data frames of on-demand type can be unloaded. Permanent data frames will raise a ValueError. If the data frame is not loaded but does exist, nothing is raised.
Parameters
dfstr | Name of the data frame.
Raises
KeyError | If the data frame does not exist.
ValueError | If the data frame is of permanent type.
TypeError | If the data frame is modified.