malet.plot_utils.data_processor

Attributes

ValueLike

Functions

`select_df`(→ pandas.DataFrame)	Select df rows with matching values from `filt_dict` except `exclude_fields`.
`homogenize_df`(→ pandas.DataFrame)	Homogenize index values of `df` with reference to `select_df(ref_df, filt_dict)`.
`avgbest_df`(→ pandas.DataFrame)	Average over `avg_over` and get best result over `best_over`.

Module Contents

malet.plot_utils.data_processor.ValueLike

malet.plot_utils.data_processor.select_df(df: pandas.DataFrame, filt_dict: Dict[str, ValueLike], *exclude_fields: str, equal: bool = True, drop: bool = False, validate: bool = True) → pandas.DataFrame

Select df rows with matching values from filt_dict except exclude_fields.

This is a vectorized, single-pass version of the original implementation.

Original behavior preserved:

- Asserts that df is non-empty.
- Asserts that filt_dict keys exist in df.index.names.
- Validates that requested values exist in each index level.
- Raises early if intermediate filtering yields an empty dataframe.
- Supports equal (keep matches) and drop (drop filtered levels).

Performance notes:

- Builds ONE boolean mask and slices once, instead of repeated df.loc calls.
- Avoids repeated DataFrame materialization inside Python loops.
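The single-mask idea from the performance notes can be sketched with plain pandas. This is a minimal illustration, not malet's implementation: the helper name `mask_select`, the toy level names (`optimizer`, `lr`, `seed`), and the data are all assumptions, and the `equal`, `validate`, and `exclude_fields` handling is left out.

```python
import numpy as np
import pandas as pd

# Toy MultiIndex frame; level names and values are made up for illustration.
index = pd.MultiIndex.from_product(
    [["sgd", "adam"], [0.1, 0.01], [0, 1]],
    names=["optimizer", "lr", "seed"],
)
df = pd.DataFrame({"metric": range(len(index))}, index=index)


def mask_select(df, filt_dict, drop=False):
    """Hypothetical stand-in for the single-mask approach (not malet's code)."""
    mask = np.ones(len(df), dtype=bool)
    for level, values in filt_dict.items():
        # Accept either a single value or a collection of allowed values.
        if isinstance(values, (list, tuple, set)):
            allowed = list(values)
        else:
            allowed = [values]
        # AND one per-level membership test into the shared mask.
        mask &= df.index.get_level_values(level).isin(allowed)
    out = df[mask]  # slice once, after the mask is complete
    if drop:
        out = out.droplevel(list(filt_dict))
    return out


selected = mask_select(df, {"optimizer": "adam", "lr": [0.1, 0.01]}, drop=True)
print(selected)
```

Because the per-level tests are combined before any slicing, the frame is indexed only once regardless of how many levels `filt_dict` names.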

Parameters:

df (pandas.DataFrame) – DataFrame with MultiIndex.
filt_dict (Dict[str, Any]) – Mapping from index level to allowed values.
exclude_fields (str) – Index levels to exclude from filtering.
equal (bool) – If True, keep matching rows; otherwise exclude them.
drop (bool) – If True, drop filtered index levels.
validate (bool) – If True, run key/value existence checks.

Returns:

Filtered DataFrame.

Return type:

pandas.DataFrame

malet.plot_utils.data_processor.homogenize_df(df: pandas.DataFrame, ref_df: pandas.DataFrame, filt_dict: Dict[str, ValueLike], *exclude_fields: str, validate: bool = True) → pandas.DataFrame

Homogenize index values of df with reference to select_df(ref_df, filt_dict).

Original intent (unchanged):

- Align df so that its remaining index grid matches the grid induced by select_df(ref_df, filt_dict, drop=True).

Original caveats (preserved verbatim):

- grid should be complete, else some fields in filt_dict will be missing.
- also, when metric in filt_dict, step and total_steps can be metric-dependent and could return empty df.

Performance improvement:

- Replaces per-row select_df + concat with a single vectorized MultiIndex membership test using isin.
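A vectorized MultiIndex membership test of the kind mentioned above might look like the following. It is a sketch under stated assumptions rather than the function's actual body: the level names (`model`, `lr`, `seed`), the data, and the way the reference grid is built are all illustrative.

```python
import pandas as pd

names = ["model", "lr", "seed"]

# Frame to homogenize (toy data).
df = pd.DataFrame(
    {"metric": [0.1, 0.2, 0.3, 0.4]},
    index=pd.MultiIndex.from_tuples(
        [("a", 0.1, 0), ("a", 0.01, 0), ("b", 0.1, 0), ("b", 0.01, 0)],
        names=names,
    ),
)

# Reference frame (toy data).
ref_df = pd.DataFrame(
    {"metric": [0.5, 0.6]},
    index=pd.MultiIndex.from_tuples(
        [("ref", 0.1, 0), ("ref", 0.1, 1)],
        names=names,
    ),
)

# Reference grid: rows with model == "ref", with the filtered level dropped.
# This plays the role of select_df(ref_df, filt_dict, drop=True).
ref_rows = ref_df[ref_df.index.get_level_values("model") == "ref"]
ref_grid = ref_rows.droplevel("model").index

# Project df's index onto the levels the grid has, then test membership
# for all rows at once instead of selecting and concatenating row by row.
extra_levels = [n for n in df.index.names if n not in ref_grid.names]
df_keys = df.index.droplevel(extra_levels)
keep = df_keys.isin(list(ref_grid))

homogenized = df[keep]
print(homogenized)
```

Only rows whose `(lr, seed)` combination appears in the reference grid survive, which is the alignment the docstring describes.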

Parameters:

df (pandas.DataFrame) – DataFrame to homogenize.
ref_df (pandas.DataFrame) – Reference DataFrame.
filt_dict (Dict[str, Any]) – Filter used to define the reference grid.
exclude_fields (str) – Index levels excluded from filtering.
validate (bool) – Run validation checks.

Returns:

Homogenized DataFrame.

Return type:

pandas.DataFrame

malet.plot_utils.data_processor.avgbest_df(df: pandas.DataFrame, metric_field: str, avg_over: Set[str] = set(), best_over: Set[str] = set(), best_of: Dict[str, Any] = dict(), best_at_max: bool = True, validate: bool = True) → pandas.DataFrame

Average over avg_over and get best result over best_over.

Original semantics preserved:

- avg_over: aggregate (mean + SEM) over these index levels.
- best_over: choose hyperparameter values yielding best metric_field.
- best_of: restrict best search to a fixed subset of index values, then apply the chosen hyperparameter globally.

best_at_max controls argmax vs argmin selection.

Original internal logic (preserved):

- aggregate index : avg_over, best_over
- key index : best_of, others

Performance improvements:

- Vectorized filtering and grouping.
- No repeated slicing inside loops.
- homogenize_df uses index membership instead of concat.
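The average-then-best pattern can be illustrated with pandas groupby operations. The sketch below is an assumption-laden approximation, not the function's real logic: the level names, the data, and the output column name `metric_sem` are invented, and the `best_of` restriction is omitted entirely.

```python
import pandas as pd

# Toy results: two models, two learning rates, two seeds.
index = pd.MultiIndex.from_product(
    [["a", "b"], [0.1, 0.01], [0, 1]],
    names=["model", "lr", "seed"],
)
df = pd.DataFrame(
    {"metric": [0.5, 0.7, 0.8, 0.9, 0.3, 0.5, 0.2, 0.2]},
    index=index,
)

avg_over = {"seed"}   # levels to average away
best_over = {"lr"}    # levels to pick the best value of
best_at_max = True    # larger metric is better

# Step 1: mean and standard error of the mean over the avg_over levels.
keep_levels = [n for n in df.index.names if n not in avg_over]
grouped = df["metric"].groupby(level=keep_levels)
avg = pd.DataFrame({"metric": grouped.mean(), "metric_sem": grouped.sem()})

# Step 2: within each remaining key group, keep the row whose metric is best.
key_levels = [n for n in keep_levels if n not in best_over]
by_key = avg.groupby(level=key_levels)["metric"]
best_idx = by_key.idxmax() if best_at_max else by_key.idxmin()
best = avg.loc[list(best_idx)]
print(best)
```

Here `seed` is averaged out first, and the best `lr` is then chosen independently for each `model`; swapping `best_at_max` to False would select the minimum instead.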

Parameters:

df (pandas.DataFrame) – Base dataframe to operate over.
metric_field (str) – Metric used to select best hyperparameter.
avg_over (Set[str]) – MultiIndex levels to average over.
best_over (Set[str]) – MultiIndex levels to select best over.
best_of (Dict[str, Any]) – Fixed index values for best selection.
best_at_max (bool) – True if larger metric is better.
validate (bool) – Enable validation checks.

Returns:

Processed DataFrame.

Return type:

pandas.DataFrame