malet.plot_utils.data_processor

Attributes

Functions

select_df(→ pandas.DataFrame)

Select df rows with matching values from filt_dict except exclude_fields.

homogenize_df(→ pandas.DataFrame)

Homogenize index values of df with reference to select_df(ref_df, filt_dict).

avgbest_df(→ pandas.DataFrame)

Average over avg_over and get best result over best_over.

Module Contents

malet.plot_utils.data_processor.ValueLike

malet.plot_utils.data_processor.select_df(df: pandas.DataFrame, filt_dict: Dict[str, ValueLike], *exclude_fields: str, equal: bool = True, drop: bool = False, validate: bool = True) → pandas.DataFrame

Select df rows with matching values from filt_dict except exclude_fields.

This is a vectorized, single-pass version of the original implementation.

Original behavior preserved:
  • Asserts that df is non-empty.

  • Asserts that filt_dict keys exist in df.index.names.

  • Validates that requested values exist in each index level.

  • Raises early if intermediate filtering yields an empty dataframe.

  • Supports equal (keep matches) and drop (drop filtered levels).

Performance notes:
  • Builds ONE boolean mask and slices once, instead of repeated df.loc calls.

  • Avoids repeated DataFrame materialization inside Python loops.

Parameters:
  • df (pandas.DataFrame) – DataFrame with MultiIndex.

  • filt_dict (Dict[str, ValueLike]) – Mapping from index level to allowed values.

  • exclude_fields (str) – Index levels to exclude from filtering.

  • equal (bool) – If True, keep matching rows; otherwise exclude them.

  • drop (bool) – If True, drop filtered index levels.

  • validate (bool) – If True, run key/value existence checks.

Returns:

Filtered DataFrame.

Return type:

pandas.DataFrame
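
The single-mask approach described in the performance notes can be illustrated with plain pandas. The following is a minimal sketch, not malet's implementation: the helper name select_rows_sketch and the sample data are hypothetical, the value-existence validation is omitted, and equal=False is assumed to invert the combined mask.

```python
import numpy as np
import pandas as pd


def select_rows_sketch(df, filt_dict, *exclude_fields, equal=True, drop=False):
    """Illustrative single-mask filter over MultiIndex levels (not malet's code)."""
    assert not df.empty, "df must be non-empty"
    # Skip any filter keys listed in exclude_fields.
    active = {k: v for k, v in filt_dict.items() if k not in exclude_fields}
    missing = set(active) - set(df.index.names)
    assert not missing, f"unknown index levels: {missing}"

    # Accumulate ONE boolean mask across all filtered levels.
    mask = np.ones(len(df), dtype=bool)
    for level, values in active.items():
        allowed = values if isinstance(values, (list, tuple, set)) else [values]
        mask &= df.index.get_level_values(level).isin(list(allowed))

    # Slice once. Assumption: equal=False negates the combined mask.
    out = df[mask if equal else ~mask]
    if drop and active:
        out = out.droplevel(list(active))
    return out


# Hypothetical experiment grid: optimizer x lr x seed.
idx = pd.MultiIndex.from_product(
    [["sgd", "adam"], [0.1, 0.01], [0, 1]], names=["optimizer", "lr", "seed"]
)
df = pd.DataFrame({"metric": np.arange(8, dtype=float)}, index=idx)

picked = select_rows_sketch(df, {"optimizer": "adam", "lr": [0.1]}, drop=True)
print(picked)
```

With drop=True the filtered levels (optimizer and lr here) are removed, so the result is indexed by seed only.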

malet.plot_utils.data_processor.homogenize_df(df: pandas.DataFrame, ref_df: pandas.DataFrame, filt_dict: Dict[str, ValueLike], *exclude_fields: str, validate: bool = True) → pandas.DataFrame

Homogenize index values of df with reference to select_df(ref_df, filt_dict).

Original intent (unchanged):
  • Align df so that its remaining index grid matches the grid induced by select_df(ref_df, filt_dict, drop=True).

Original caveats (preserved verbatim):
  • grid should be complete, else some fields in filt_dict will be missing.

  • also, when metric in filt_dict, step and total_steps can be metric-dependent and could return empty df.

Performance improvement:
  • Replaces per-row select_df + concat with a single vectorized MultiIndex membership test using isin.

Parameters:
  • df (pandas.DataFrame) – DataFrame to homogenize.

  • ref_df (pandas.DataFrame) – Reference DataFrame.

  • filt_dict (Dict[str, ValueLike]) – Filter used to define the reference grid.

  • exclude_fields (str) – Index levels excluded from filtering.

  • validate (bool) – Run validation checks.

Returns:

Homogenized DataFrame.

Return type:

pandas.DataFrame
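
The MultiIndex membership test mentioned under "Performance improvement" can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: keep_rows_on_reference_grid is a hypothetical helper, and ref_grid stands in for the index of select_df(ref_df, filt_dict, drop=True), assumed here to be a MultiIndex with at least two levels.

```python
import pandas as pd


def keep_rows_on_reference_grid(df, ref_grid):
    """Keep rows of df whose values on ref_grid's levels appear in ref_grid."""
    levels = list(ref_grid.names)
    # Project df's index onto the levels that define the reference grid.
    projected = pd.MultiIndex.from_arrays(
        [df.index.get_level_values(name) for name in levels], names=levels
    )
    # One vectorized membership test replaces per-row selection plus concat.
    return df[projected.isin(ref_grid)]


# Hypothetical reference grid: only two (optimizer, lr) combinations exist.
ref_grid = pd.MultiIndex.from_tuples(
    [("adam", 0.1), ("sgd", 0.01)], names=["optimizer", "lr"]
)
idx = pd.MultiIndex.from_product(
    [["sgd", "adam"], [0.1, 0.01], [0, 1]], names=["optimizer", "lr", "seed"]
)
df = pd.DataFrame({"metric": range(8)}, index=idx)

aligned = keep_rows_on_reference_grid(df, ref_grid)
print(aligned)
```

Rows whose (optimizer, lr) pair is absent from the reference grid are dropped, while levels outside the grid (seed here) are left untouched.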

malet.plot_utils.data_processor.avgbest_df(df: pandas.DataFrame, metric_field: str, avg_over: Set[str] = set(), best_over: Set[str] = set(), best_of: Dict[str, Any] = dict(), best_at_max: bool = True, validate: bool = True) → pandas.DataFrame

Average over avg_over and get best result over best_over.

Original semantics preserved:
  • avg_over: aggregate (mean + SEM) over these index levels.

  • best_over: choose hyperparameter values yielding best metric_field.

  • best_of: restrict best search to a fixed subset of index values, then apply the chosen hyperparameter globally.

  • best_at_max controls argmax vs argmin selection.

Original internal logic (preserved):
  • aggregate index: avg_over, best_over

  • key index: best_of, others

Performance improvements:
  • Vectorized filtering and grouping.

  • No repeated slicing inside loops.

  • homogenize_df uses index membership instead of concat.

Parameters:
  • df (pandas.DataFrame) – Base dataframe to operate over.

  • metric_field (str) – Metric used to select best hyperparameter.

  • avg_over (Set[str]) – MultiIndex levels to average over.

  • best_over (Set[str]) – MultiIndex levels to select best over.

  • best_of (Dict[str, Any]) – Fixed index values for best selection.

  • best_at_max (bool) – True if larger metric is better.

  • validate (bool) – Enable validation checks.

Returns:

Processed DataFrame.

Return type:

pandas.DataFrame
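
The average-then-select pattern described for avg_over and best_over can be sketched with pandas groupby. This is a rough illustration, not the actual avgbest_df: the helper avg_then_best_sketch is hypothetical, best_of is not modeled, the metric is assumed to live in a plain column (malet may store it differently), and the "_sem" column suffix is an invented naming choice.

```python
import pandas as pd


def avg_then_best_sketch(df, metric_field, avg_over, best_over, best_at_max=True):
    """Mean + SEM over avg_over, then pick the best row over best_over."""
    # Levels that survive averaging.
    keep = [n for n in df.index.names if n not in avg_over]
    grouped = df[metric_field].groupby(level=keep)
    stats = pd.DataFrame({
        metric_field: grouped.mean(),
        metric_field + "_sem": grouped.sem(),  # hypothetical column name
    })

    # Key levels: everything that is neither averaged nor searched over.
    key = [n for n in keep if n not in best_over]
    if key:
        by_key = stats[metric_field].groupby(level=key)
        best_idx = list(by_key.idxmax() if best_at_max else by_key.idxmin())
    else:
        col = stats[metric_field]
        best_idx = [col.idxmax() if best_at_max else col.idxmin()]
    return stats.loc[best_idx]


# Hypothetical results: optimizer x lr x seed.
idx = pd.MultiIndex.from_product(
    [["sgd", "adam"], [0.1, 0.01], [0, 1]], names=["optimizer", "lr", "seed"]
)
df = pd.DataFrame({"acc": [0.6, 0.8, 0.5, 0.5, 0.2, 0.4, 0.9, 0.7]}, index=idx)

best = avg_then_best_sketch(df, "acc", avg_over={"seed"}, best_over={"lr"})
print(best)
```

Here seeds are averaged first, and then for each optimizer the learning rate with the highest mean accuracy is kept along with its standard error.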