malet.experiment

Contents

malet.experiment#

This module provides classes and utilities for managing and executing experiments with structured configurations, logging, and checkpointing.

Attributes#

Classes#

ConfigIter

Iterator over experiment configurations defined in a structured YAML file.

ExperimentLog

Logging class for experiment results.

RunInfo

Used for tracking and managing information about a specific run or execution.

Experiment

Executes experiments based on provided configurations.

Module Contents#

malet.experiment.ExpFunc#
class malet.experiment.ConfigIter(exp_config_path: str)[source]#

Iterator over experiment configurations defined in a structured YAML file.

This class reads a YAML file containing both static parameters and parameter grids, then generates all combinations of configurations by expanding the specified grid fields. Each configuration is returned as a ConfigDict, suitable for iterating in experiment loops.

static_configs#

Configuration values that remain constant across all runs.

Type:

dict

grid_fields#

Ordered list of grid field names, specifying expansion order.

Type:

list of str

grid#

Raw grid specifications parsed from the YAML file.

Type:

list of dict

grid_iter#

List of fully expanded configuration dictionaries.

Type:

list of ConfigDict

Example

```python
>>> for config in ConfigIter('exp_file.yaml'):
...     train_func(config)
```

## YAML Schema Example:

The YAML file should define top-level static fields and a grid field. Each entry under grid defines a parameter sweep. Nested group fields are expanded via Cartesian product before merging with other fields.

```yaml
model: ResNet32
dataset: cifar10
...

grid:
  - optimizer: [sgd]
    group:
      pai_type: [[random], [snip]]
      pai_scope: [[local], [global]]
    rho: [0.05]
    seed: [1, 2, 3]

  - optimizer: [sam]
    pai_type: [random, snip, lth]
    pai_sparsity: [0, 0.9, 0.95, 0.98, 0.99]
    rho: [0.05, 0.1, 0.2, 0.3]
    seed: [1, 2, 3]
```

## Notes:
  • ConfigDict is assumed to be a mutable configuration container (e.g., from ml_collections).
  • The expansion order of configurations follows grid_fields if provided; otherwise, it is inferred.
  • Nested group fields are flattened into individual configurations using Cartesian product logic.

Raises:
  • FileNotFoundError – If the YAML file path is invalid.

  • ValueError – If the YAML structure does not conform to the expected schema.
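The core grid expansion can be sketched with plain itertools, assuming a simplified schema where every grid field maps to a flat list of candidate values (the group and nested-list mechanics of the real ConfigIter are omitted):

```python
from itertools import product

def expand_grid(static_configs: dict, grid: dict) -> list:
    """Expand a parameter grid into a list of full configs.

    Each grid field maps to a list of candidate values; the expansion
    is the Cartesian product over all fields, merged with the statics.
    """
    fields = list(grid)  # expansion order follows insertion order here
    return [
        {**static_configs, **dict(zip(fields, values))}
        for values in product(*(grid[f] for f in fields))
    ]

configs = expand_grid(
    {"model": "ResNet32", "dataset": "cifar10"},
    {"optimizer": ["sgd", "sam"], "seed": [1, 2, 3]},
)
print(len(configs))  # 2 * 3 = 6 combinations
```

Iterating over the result then mirrors the `for config in ConfigIter(...)` loop above, with each element being one fully merged configuration.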

static_configs#
grid_fields = []#
grid#
grid_iter#
filter_iter(filt_fn: Callable[[int, dict], bool])[source]#

Filters the ConfigIter using filt_fn, which takes (idx, config) as arguments.

Parameters:
  • filt_fn (Callable[[int, dict], bool]) – Filter function used to filter the ConfigIter.

property grid_dict: dict#

Dictionary of all grid values

field_type(field: str)[source]#

Returns type of a field in grid.

Parameters:

field (str) – Name of the field.

Returns:

Type of the field.

Return type:

Any

class malet.experiment.ExperimentLog[source]#

Logging class for experiment results.

Logs all configs for reproduction, along with pre-defined metrics resulting from the experiment run, as a DataFrame. Varying configs are stored as a multi-index and metrics are stored as columns. The remaining static configs are passed in and stored as a dictionary.

These can be written to a tsv file with a yaml header, and loaded back from it. File locks can be used to prevent race conditions when multiple processes are reading or writing the same log file.
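The file layout (yaml-style header followed by a tab-separated body) can be sketched with a hypothetical minimal writer; the `---` separator and exact formatting here are illustrative assumptions, not the library's actual serialization:

```python
import io

def write_log(handle, static_configs: dict, columns: list, rows: list):
    """Write static configs as a yaml-style header, then a tsv body."""
    for key, value in static_configs.items():
        handle.write(f"{key}: {value}\n")
    handle.write("---\n")  # hypothetical header/body separator
    handle.write("\t".join(columns) + "\n")
    for row in rows:
        handle.write("\t".join(str(v) for v in row) + "\n")

buf = io.StringIO()
write_log(buf, {"model": "ResNet32"}, ["lr", "accuracy"], [[0.1, 0.92]])
print(buf.getvalue())
```

Loading reverses the process: read the header into a dict until the separator, then parse the remainder as tsv.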

df#

DataFrame of experiment results.

Type:

pd.DataFrame

static_configs#

Dictionary of static configs of the experiment.

Type:

dict

logs_file#

File path to tsv file.

Type:

str

use_filelock#

Whether to use file lock for reading/writing log file. Defaults to False.

Type:

bool, optional

df: pandas.DataFrame#
static_configs: dict#
logs_file: str#
use_filelock: bool = False#
property grid_fields#
property metric_fields#
property all_fields#

Get all static, grid, and metric fields in the log.

grid_dict() Dict[str, Any][source]#

Get all values for each index field in the log. This is useful for getting all possible values for each field in the log. For example, if log.df has the three index fields optimizer, lr, and weight_decay with values:

```
optimizer  lr    weight_decay
sgd        0.1   0.01
sgd        0.1   0.001
adam       0.01  0.01
adam       0.01  0.001
```

Then the output will be:

```python
{"optimizer": ["sgd", "adam"], "lr": [0.1, 0.01], "weight_decay": [0.01, 0.001]}
```

Returns:

Dictionary of all values for each index field in the log.

Return type:

dict
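The same behavior can be reproduced for a plain list of row dicts (a sketch of the semantics, not the DataFrame-backed implementation):

```python
def grid_dict(rows: list) -> dict:
    """Collect the unique values of each field, preserving first-seen order."""
    out = {}
    for row in rows:
        for field, value in row.items():
            out.setdefault(field, [])
            if value not in out[field]:
                out[field].append(value)
    return out

rows = [
    {"optimizer": "sgd", "lr": 0.1, "weight_decay": 0.01},
    {"optimizer": "sgd", "lr": 0.1, "weight_decay": 0.001},
    {"optimizer": "adam", "lr": 0.01, "weight_decay": 0.01},
    {"optimizer": "adam", "lr": 0.01, "weight_decay": 0.001},
]
print(grid_dict(rows))
# {'optimizer': ['sgd', 'adam'], 'lr': [0.1, 0.01], 'weight_decay': [0.01, 0.001]}
```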

classmethod from_fields(grid_fields: list, metric_fields: list, static_configs: dict, logs_file: str, use_filelock: bool = False) ExperimentLog[source]#

Create ExperimentLog from grid and metric fields.

Parameters:
  • grid_fields (list) – Field names of configs to be grid-searched.

  • metric_fields (list) – Field names of metrics to be logged from experiment results.

  • static_configs (dict) – Other static configs of the experiment.

  • logs_file (str) – File path to tsv file.

  • use_filelock (bool, optional) – Whether to use file lock for reading/writing log file. Defaults to False.

Returns:

New experiment log object.

Return type:

ExperimentLog

classmethod from_config_iter(config_iter: ConfigIter, metric_fields: list, logs_file: str, use_filelock: bool = False) ExperimentLog[source]#

Create ExperimentLog from ConfigIter object.

Parameters:
  • config_iter (ConfigIter) – ConfigIter object to reference static_configs and grid_fields.

  • metric_fields (list) – list of metric fields.

  • logs_file (str) – File path to tsv file.

  • use_filelock (bool, optional) – Whether to use file lock for reading/writing log file. Defaults to False.

Returns:

New experiment log object.

Return type:

ExperimentLog

classmethod from_tsv(logs_file: str, use_filelock: bool = False, parse_str=True) ExperimentLog[source]#

Create ExperimentLog from tsv file with yaml header.

Parameters:
  • logs_file (str) – File path to tsv file.

  • use_filelock (bool, optional) – Whether to use file lock for reading/writing log file. Defaults to False.

  • parse_str (bool, optional) – Whether to parse and cast string into speculated type. Defaults to True.

Returns:

New experiment log object.

Return type:

ExperimentLog

classmethod from_wandb_sweep(entity: str, project: str, sweep_ids: List[str], logs_file: str, get_all_steps: bool = False, filter_dict: dict | None = None, get_metrics: List[str] | None = None) ExperimentLog[source]#

Create ExperimentLog from wandb sweep.

Parameters:
  • sweep_ids (List[str]) – List of wandb sweep ids to load.

  • entity (str) – wandb entity name.

  • project (str) – wandb project name.

  • logs_file (str) – File path to tsv file.

  • get_all_steps (bool, optional) – Whether to get all steps of the metrics. Defaults to False.

  • filter_dict (dict | None, optional) – Filter for runs. Defaults to None.

  • get_metrics (List[str] | None, optional) – List of metrics to get. Defaults to None.

Returns:

New experiment log object.

Return type:

ExperimentLog

classmethod parse_tsv(log_file: str, parse_str=True) dict[source]#

Parse tsv file into usable data.

Parses a tsv file generated by the ExperimentLog.to_tsv method, which has static_configs as a yaml header and the DataFrame as the tsv body, where the multi-index fields are written on a separate line from the column names.

Parameters:
  • log_file (str) – File path to tsv file.

  • parse_str (bool, optional) – Whether to parse and cast string into speculated type. Defaults to True.

Raises:

Exception – Error while reading log file.

Returns:

Dictionary of pandas.DataFrame, grid_fields, metric_fields, and static_configs.

Return type:

dict

lock_file()[source]#

Decorator that wraps a method with filelock acquire/release.

If self.use_filelock is True, acquires the file lock before calling func and releases it afterward. Otherwise, calls func directly.

Parameters:

func (Callable) – The method to wrap.

Returns:

Wrapped method with conditional file locking.

Return type:

Callable
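The conditional locking pattern can be sketched as follows, using a threading.Lock as a stand-in for the actual file lock (the class and attribute names below are illustrative):

```python
import functools
import threading

def lock_file(func):
    """Acquire self.lock around func only when self.use_filelock is set."""
    @functools.wraps(func)
    def wrapped(self, *args, **kwargs):
        if self.use_filelock:
            with self.lock:  # blocks until the lock is released
                return func(self, *args, **kwargs)
        return func(self, *args, **kwargs)
    return wrapped

class Log:
    def __init__(self, use_filelock: bool):
        self.use_filelock = use_filelock
        self.lock = threading.Lock()

    @lock_file
    def write(self, value):
        return value

print(Log(use_filelock=True).write("ok"))  # "ok", executed under the lock
```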

load_tsv(logs_file: str | None = None, parse_str: bool = True)[source]#

Load tsv with yaml header into ExperimentLog object.

Parameters:
  • logs_file (Optional[str], optional) – Specify other file path to tsv file. Defaults to None.

  • parse_str (bool, optional) – Whether to parse and cast string into speculated type. Defaults to True.

to_tsv(logs_file: str | None = None)[source]#

Write ExperimentLog object to tsv file with yaml header.

Parameters:

logs_file (Optional[str], optional) – Specify other file path to tsv file. Defaults to None.

add_result(configs: Mapping[str, Any], **metrics)[source]#

Add experiment run result to dataframe.

Parameters:
  • configs (Mapping[str, Any]) – Dictionary or Mapping of configurations of the result of the experiment instance to add.

  • **metrics (Any) – Metrics of the result of the experiment instance to add.

derive_field(new_field_name: str, fn: Callable, *fn_arg_fields: str, is_index: bool = False)[source]#

Add new field computed from existing fields in self.df.

Parameters:
  • new_field_name (str) – Name of the new field.

  • fn (Callable) – Function to compute new field.

  • *fn_arg_fields (str) – Field names to be used as arguments for the function.

  • is_index (bool, optional) – Whether to add field as index. Defaults to False.

drop_fields(field_names: List[str])[source]#

Drop fields from the log.

Parameters:

field_names (List[str]) – list of field names to drop.

rename_fields(name_map: Dict[str, str])[source]#

Rename fields in the log.

Parameters:

name_map (Dict[str, str]) – Mapping of old field names to new field names.

resolve_merge_conflicts(other: ExperimentLog) Tuple[ExperimentLog, ExperimentLog][source]#

CLI to summarize merge conflicts and accept user input for resolution.

Parameters:

other (ExperimentLog) – Target log to merge with self.

Returns:

Resolved logs (self, other).

Return type:

Tuple[ExperimentLog, ExperimentLog]

merge(*others: ExperimentLog, same: bool = True)[source]#

Merge multiple logs into self.

Parameters:
  • *others (ExperimentLog) – Logs to merge with self.

  • same (bool, optional) – Whether to raise error when logs are not of matching experiments. Defaults to True.

static merge_tsv(*log_files: str, save_path: str | None = None, same: bool = True) ExperimentLog[source]#

Merge multiple logs into one from tsv file paths.

Parameters:
  • *log_files (str) – Paths to tsv log files.

  • save_path (Optional[str]) – Path to save merged log.

  • same (bool, optional) – Whether to raise error when logs are not of matching experiments. Defaults to True.

static merge_folder(logs_path: str, save_path: str | None = None, same: bool = True) ExperimentLog[source]#

Merge multiple logs into one from tsv files in folder.

Parameters:
  • logs_path (str) – Folder path to logs.

  • save_path (Optional[str], optional) – Path to save merged log. Defaults to None.

  • same (bool, optional) – Whether to raise error when logs are not of matching experiments. Defaults to True.

isin(config: Mapping[str, Any]) bool[source]#

Check if specific experiment config was already executed in log.

Parameters:
  • config (Mapping[str, Any]) – Configuration instance to check for in the log.

Returns:

Whether the config is in the log.

Return type:

bool

get_metric(config: Mapping[str, Any]) dict[source]#

Search matching log with given config dict and return metric_dict, info_dict.

Parameters:

config (Mapping[str, Any]) – Configuration instance to search in the log.

Returns:

Found metric dictionary of the given config.

Return type:

dict

is_same_exp(other: ExperimentLog) bool[source]#

Check if both logs have same config fields.

Parameters:

other (ExperimentLog) – Log to compare with.

Returns:

Whether both logs have same config fields.

Return type:

bool

drop_duplicates()[source]#

Drop duplicate entries and provide a CLI to resolve conflicting duplicates.

Resolves duplicates that have the same grid fields but different metric fields. If duplicates also share the same metric fields, all but one are automatically removed.

melt_and_explode_metric(df: pandas.DataFrame | None = None, step: int | None = None, dropna: bool = True) pandas.DataFrame[source]#

Melt and explode metric values in DataFrame.

Melts column (metric) names into a ‘metric’ field (multi-index) and their values into a ‘metric_value’ column. Explodes metrics with a list of values into multiple rows with new ‘step’ and ‘total_steps’ fields. If step is specified, only that step is selected; otherwise, all steps are exploded.

Parameters:
  • df (Optional[pd.DataFrame], optional) – Base DataFrame to operate over. Defaults to None.

  • step (Optional[int], optional) – Specific step to select. Defaults to None.

  • dropna (bool, optional) – Whether to drop rows with NaN metric values. Defaults to True.

Returns:

Melted and exploded DataFrame.

Return type:

pd.DataFrame
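The melt-and-explode transformation can be illustrated on plain dicts (a sketch of the semantics only; the real method operates on the multi-index DataFrame):

```python
def melt_and_explode(row: dict, metrics: list) -> list:
    """Turn one run row with list-valued metrics into per-(metric, step) rows."""
    out = []
    for metric in metrics:
        values = row[metric]
        values = values if isinstance(values, list) else [values]
        for step, value in enumerate(values):
            if value is None:  # dropna behavior
                continue
            out.append({
                "metric": metric,
                "step": step,
                "total_steps": len(values),
                "metric_value": value,
            })
    return out

rows = melt_and_explode({"lr": 0.1, "train_acc": [0.5, 0.8, 0.9]}, ["train_acc"])
print(len(rows))  # 3 steps exploded into 3 rows
```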

class malet.experiment.RunInfo(prev_duration: datetime.timedelta = timedelta(0))[source]#

Used for tracking and managing information about a specific run or execution, including the start time, duration, and the current Git commit hash.

infos#

A class-level attribute that lists the keys of the run information (‘datetime’, ‘duration’, ‘commit_hash’).

Type:

ClassVar[list]

infos: ClassVar[list] = ['datetime', 'duration', 'commit_hash']#
get()[source]#

Returns the current run info as a dict.

Returns:

Dictionary with keys datetime, duration, and commit_hash.

Return type:

dict

update_and_get()[source]#

Updates the accumulated duration to now and returns run info.

Adds the elapsed time since the last update (or init) to the stored duration, then returns the updated run info dict.

Returns:

Dictionary with keys datetime, duration, and commit_hash.

Return type:

dict
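The duration bookkeeping can be sketched as a hypothetical minimal version (the real class also records the Git commit hash):

```python
from datetime import datetime, timedelta

class RunInfo:
    def __init__(self, prev_duration: timedelta = timedelta(0)):
        self.start = datetime.now()
        self.duration = prev_duration  # carried over from a resumed run

    def update_and_get(self) -> dict:
        """Fold the time elapsed since the last update into duration."""
        now = datetime.now()
        self.duration += now - self.start
        self.start = now  # reset reference point for the next update
        return {"datetime": now, "duration": self.duration}

info = RunInfo(prev_duration=timedelta(seconds=10))
print(info.update_and_get()["duration"] >= timedelta(seconds=10))  # True
```

Passing prev_duration lets a resumed run keep accumulating on top of the time already spent before interruption.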

class malet.experiment.Experiment(exp_folder_path: str, exp_function: ExpFunc, exp_metrics: list | None = None, total_splits: int | str = 1, curr_split: int | str = 0, configs_save: bool = False, checkpoint: bool = False, filelock: bool = False, timeout: float | None = None)[source]#

Execute experiments based on provided configurations. It supports parallel-friendly experiment scheduling and provides mechanisms for logging, resuming, and managing experiment runs.

## Features:
  • Supports two approaches for parallelizing experiments:

    1. Splitting: Splits hyperparameter configs evenly across multiple parallel processes. While this requires predetermining the number of Experiment processes to use, it is more reliable when using very many parallel runs (many GPUs).
    2. Queueing: Each parallel process checks for unexecuted hyperparameter configs (a queue), similarly to WandB Agents. This allows new resources to be allocated dynamically as they become available, but can be less reliable when using many parallel runs. It uses the log file to share currently running configs, by logging empty metrics for running configurations. Here, file locking should be used to avoid race conditions when reading/writing the log file.

  • Automatically resumes from or skips previously run configs from the hyperparameter grid using saved logs.

  • Allows checkpointing metrics during training steps, making it possible to stop and resume mid-training much like model checkpointing. Here, file locking should be used to avoid race conditions from frequent log file access.
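The splitting approach can be sketched with simple striding over the expanded configs (an illustrative sketch; the library's actual partitioning may differ):

```python
def split_configs(configs: list, total_splits: int, curr_split: int) -> list:
    """Assign every total_splits-th config to this process, offset by curr_split."""
    return configs[curr_split::total_splits]

configs = list(range(10))  # stand-in for expanded hyperparameter configs
print(split_configs(configs, total_splits=3, curr_split=0))  # [0, 3, 6, 9]
print(split_configs(configs, total_splits=3, curr_split=1))  # [1, 4, 7]
```

Each of the total_splits processes runs the same experiment file with a distinct curr_split, so every config is executed by exactly one process.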

name#

Name of the experiment.

Type:

str

exp_func#

Function to execute the experiment. It should receive the config (or self when checkpointing is enabled) and return the resulting metrics to log.

Type:

ExpFunc

configs#

Configuration iterator for the experiment.

Type:

ConfigIter

log#

Log object for tracking experiment results.

Type:

ExperimentLog

infos#

List of information fields for logging, including status.

Type:

list

configs_save#

Whether to save only the config (with empty metrics) of the currently running config to the logs file. This is used in queueing mode.

Type:

bool

checkpoint#

Whether mid-train checkpointing is executed in exp_func. When set to True, Experiment.run will pass self in addition to the current config. It is expected that the user will use this to update the log every time a model checkpoint is saved.

Type:

bool

filelock#

Whether to use file locking for the log file. This creates a lock file in the directory of the log file. Whether an Experiment run is currently accessing, or has requested access to, the log file is recorded in the lock file, letting other runs wait until the lock is released.

Type:

bool

timeout#

Expected timeout of the resource system in use. Experiment.run will use this to perform necessary logging before the system timeout terminates the run. In configs_save mode, configs might otherwise terminate while their status is still set as running in the log file, and would then not be executed in the next continuing Experiment run.

Type:

Optional[float]

infos: ClassVar[list] = ['datetime', 'duration', 'commit_hash', 'status']#
name#
exp_func#
configs_save = False#
checkpoint = False#
filelock = False#
timeout = None#
configs#
log#
static get_paths(exp_folder, split=None)[source]#

Constructs and returns the file paths for the configuration file, log file, and figure directory based on the given experiment folder and optional split identifier.

Parameters:
  • exp_folder (str) – The path to the experiment folder.

  • split (int, optional) – The split identifier for log files. If None, the default log file path is used. Defaults to None.

Returns:

A tuple containing:
  • cfg_file (str): Path to the experiment configuration file (‘exp_config.yaml’).
  • tsv_file (str): Path to the log file. If split is provided, the path corresponds to the split-specific log file (‘log_splits/split_{split}.tsv’); otherwise it defaults to ‘log.tsv’.
  • fig_dir (str): Path to the figure directory (‘figure’).

Return type:

tuple

get_metric_info(config)[source]#

Retrieves metric and info dictionaries for a given configuration.

Parameters:

config (str) – The configuration key to look up in the log.

Returns:

A tuple containing two dictionaries:
  • metric_dict (dict): A dictionary of metrics associated with the given configuration, excluding NaN scalar values.
  • info_dict (dict): A dictionary of additional information extracted from the log, containing keys present in self.infos with non-NaN values.

Return type:

tuple

update_log(config, status=None, **metric_dict)[source]#

Updates the log with the given configuration, status, and metrics. This method loads the current log, updates it with the provided configuration, metrics, and run information, and then saves the updated log.

Parameters:
  • config (Mapping) – The current config to log.

  • status (str|None, optional) – The status of the current run. Defaults to None, which sets the status to a predefined running state.

  • **metric_dict – Metrics to log.

run()[source]#

Executes a series of experiments based on the provided configurations, according to the given execution setup.

static resplit_logs(exp_folder_path: str, target_split: int = 1, save_backup: bool = True)[source]#

Resplit existing split logs into target_split number of splits.

Parameters:
  • exp_folder_path (str) – The path to the experiment folder containing the logs.

  • target_split (int, optional) – The number of splits to divide the logs into. Must be greater than 0. Defaults to 1.

  • save_backup (bool, optional) – Whether to save a backup of the merged logs before resplitting. Defaults to True.

classmethod set_log_status_as_failed(exp_folder_path: str)[source]#

Updates the status of logs in the specified experiment folder to ‘FAILED’ if their current status is ‘RUNNING’.