mridc.core.optim package

Submodules

mridc.core.optim.adafactor module

class mridc.core.optim.adafactor.Adafactor(params, lr=None, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=True, warmup_init=False, min_step=0.01)[source]

Bases: Optimizer

Implements the Adafactor algorithm.

This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (see https://arxiv.org/abs/1804.04235). Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options. To use a manual (external) learning rate schedule you should set scale_parameter=False and relative_step=False.

Parameters
  • params (Iterable of parameters to optimize or dicts defining parameter groups.) – iterable

  • lr (External learning rate.) – float (optional), (default: None)

  • eps (Regularization constants for square gradient and parameter scale respectively.) – tuple (float, float), (default: (1e-30, 1e-3))

  • clip_threshold (Threshold of root-mean-square of final gradient update.) – float, (default: 1.0)

  • decay_rate (Coefficient used to compute running averages of square gradient.) – float, (default: -0.8)

  • beta1 (Coefficient used for computing running averages of gradient) – float, (default: None)

  • weight_decay (Weight decay (L2 penalty).) – float (optional), (default: 0)

  • scale_parameter (If True, learning rate is scaled by root-mean-square of parameter.) – bool (default: True)

  • relative_step (If True, time-dependent learning rate is computed instead of external learning rate.) – bool (default: True)

  • warmup_init (Time-dependent learning rate computation depends on whether warm-up initialization is being used.) – bool (default: False)

Return type

Adafactor Optimizer
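
A minimal usage sketch (the small linear module below is only a placeholder): with the defaults the optimizer computes its own time-dependent learning rate, while an external schedule requires scale_parameter=False and relative_step=False, as noted above.

    import torch
    from mridc.core.optim.adafactor import Adafactor

    model = torch.nn.Linear(16, 4)  # placeholder module for illustration

    # Default behaviour: internally adjusted, time-dependent learning rate.
    optimizer = Adafactor(model.parameters(), lr=None, scale_parameter=True, relative_step=True)

    # Manual (external) learning rate, e.g. when pairing with an explicit LR scheduler.
    optimizer = Adafactor(model.parameters(), lr=1e-3, scale_parameter=False, relative_step=False)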

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (A closure that reevaluates the model and returns the loss.) – callable (optional)

property supports_flat_params

Whether the optimizer supports flat parameters.

property supports_memory_efficient_fp16

Whether the optimizer supports memory-efficient fp16.

mridc.core.optim.lr_scheduler module

class mridc.core.optim.lr_scheduler.CosineAnnealing(optimizer, *, max_steps, min_lr=0, last_epoch=-1, **kwargs)[source]

Bases: WarmupAnnealHoldPolicy

Anneal learning rate by cosine.
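
A usage sketch, assuming warmup_steps is forwarded to the WarmupAnnealHoldPolicy base class via **kwargs (the linear module is a placeholder):

    import torch
    from mridc.core.optim.lr_scheduler import CosineAnnealing

    model = torch.nn.Linear(16, 4)  # placeholder module
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # warmup_steps is assumed to be accepted through the base-class kwargs.
    scheduler = CosineAnnealing(optimizer, max_steps=1000, warmup_steps=100, min_lr=1e-5)

    for _ in range(1000):
        optimizer.step()   # forward/backward omitted for brevity
        scheduler.step()   # advance the cosine annealing schedule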

class mridc.core.optim.lr_scheduler.InverseSquareRootAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]

Bases: WarmupPolicy

Inverse square root learning rate annealing.

class mridc.core.optim.lr_scheduler.NoamAnnealing(optimizer, *, d_model, warmup_steps=None, warmup_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Noam learning rate annealing.

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.NoamHoldAnnealing(optimizer, *, max_steps, decay_rate=0.5, min_lr=0.0, last_epoch=-1, **kwargs)[source]

Bases: WarmupHoldPolicy

class mridc.core.optim.lr_scheduler.PolynomialDecayAnnealing(optimizer, *, max_steps, min_lr=0.0, power=1.0, cycle=False, last_epoch=-1, **kwargs)[source]

Bases: WarmupPolicy

Polynomial decay learning rate annealing.

class mridc.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing(optimizer, *, max_steps, min_lr=0.0, power=1.0, cycle=False, last_epoch=-1, **kwargs)[source]

Bases: WarmupHoldPolicy

Polynomial decay learning rate annealing.

class mridc.core.optim.lr_scheduler.SquareAnnealing(optimizer, *, max_steps, min_lr=1e-05, last_epoch=-1, **kwargs)[source]

Bases: WarmupPolicy

Anneal learning rate by square.

class mridc.core.optim.lr_scheduler.SquareRootAnnealing(optimizer, *, max_steps, min_lr=0, last_epoch=-1, **kwargs)[source]

Bases: WarmupPolicy

Anneal learning rate by square root.

class mridc.core.optim.lr_scheduler.SquareRootConstantPolicy(optimizer, *, constant_steps=None, constant_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Adds constant-learning-rate kwargs and logic to the lr policy. All arguments should be passed as kwargs for clarity.

Parameters
  • constant_steps (Number of steps to keep lr constant at.) –

  • constant_ratio (Ratio of steps to keep lr constant.) –

  • max_steps (Total number of steps while training or None for infinite training.) –

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.T5InverseSquareRootAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]

Bases: SquareRootConstantPolicy

Inverse square root learning rate annealing.

class mridc.core.optim.lr_scheduler.WarmupAnnealHoldPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, constant_steps=None, constant_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Adds warmup kwargs and warmup logic to lr policy. All arguments should be passed as kwargs for clarity.

Parameters
  • warmup_steps (Number of training steps in warmup stage) –

  • warmup_ratio (Ratio of warmup steps to total steps) –

  • max_steps (Total number of steps while training or None for infinite training) –

  • min_lr (Minimum lr to hold the learning rate after decay at.) –

  • constant_steps (Number of steps to keep lr constant at.) –

  • constant_ratio (Ratio of steps to keep lr constant.) –

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.WarmupAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]

Bases: WarmupPolicy

Warmup learning rate annealing.

class mridc.core.optim.lr_scheduler.WarmupHoldPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, hold_steps=None, hold_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: WarmupPolicy

Variant of WarmupPolicy which maintains a high learning rate for a defined number of steps. All arguments should be passed as kwargs for clarity.

Parameters
  • warmup_steps (Number of training steps in warmup stage) –

  • warmup_ratio (Ratio of warmup steps to total steps) –

  • hold_steps (Number of training steps to hold the learning rate after warm up) –

  • hold_ratio (Ratio of hold steps to total steps) –

  • max_steps (Total number of steps while training or None for infinite training) –

Results

Learning rate is linearly increased from 0 to 1 over warmup steps, then linearly decreased from 1 to 0 over hold steps.

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.WarmupPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Adds warmup kwargs and warmup logic to lr policy. All arguments should be passed as kwargs for clarity.

Parameters
  • warmup_steps (Number of training steps in warmup stage.) –

  • warmup_ratio (Ratio of warmup steps to total steps.) –

  • max_steps (Total number of steps while training or None for infinite training.) –

Returns

lr

Return type

Learning rate for current step.

get_lr()[source]

Get learning rate at current step.

mridc.core.optim.lr_scheduler.compute_max_steps(max_epochs, accumulate_grad_batches, limit_train_batches, num_workers, num_samples, batch_size, drop_last)[source]

Compute effective max_steps from the provided parameters.
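
For illustration, a call with hypothetical trainer and dataloader settings (all values below are made up):

    from mridc.core.optim.lr_scheduler import compute_max_steps

    max_steps = compute_max_steps(
        max_epochs=10,              # hypothetical trainer setting
        accumulate_grad_batches=1,  # no gradient accumulation
        limit_train_batches=1.0,    # use the full training set each epoch
        num_workers=4,
        num_samples=10000,          # size of the training set
        batch_size=32,
        drop_last=True,
    )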

mridc.core.optim.lr_scheduler.get_scheduler(name: str, **kwargs: Optional[Dict[str, Any]]) → _LRScheduler[source]

Convenience method to obtain an _LRScheduler class and partially instantiate it with optimizer kwargs.

Parameters
  • name (Name of the scheduler in the registry.) –

  • kwargs (Optional kwargs of the scheduler used during instantiation.) –

Return type

A partially instantiated _LRScheduler
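
A sketch, assuming “CosineAnnealing” is a registered scheduler name and that the returned partial only needs the optimizer to complete instantiation:

    import torch
    from mridc.core.optim.lr_scheduler import get_scheduler

    model = torch.nn.Linear(16, 4)  # placeholder module
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    scheduler_partial = get_scheduler("CosineAnnealing", max_steps=1000, min_lr=1e-6)
    scheduler = scheduler_partial(optimizer)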

mridc.core.optim.lr_scheduler.prepare_lr_scheduler(optimizer: Optimizer, scheduler_config: Optional[Union[Dict[str, Any], DictConfig]], train_dataloader: Optional[DataLoader] = None) → Optional[Dict[str, Any]][source]

Constructs an LR Scheduler (optionally) for a given optimizer, based on a config with the following schema.

Parameters
  • optimizer (The optimizer to use for the scheduler.) –

    name: <name of optimizer>
    lr: <maximal learning rate>

    # <additional optimizer arguments>
    args:
      name: auto  # special keyword, resolves to correct optimizer config for given optimizer name
      # cls: mridc.core.config.optimizers.NovogradParams  # explicit instantiation by class path
      params:  # optional override parameters for the optimizer config
        betas: [0.8, 0.5]
        weight_decay: 0.001

  • scheduler_config (The scheduler config.) –

    name: <name of scheduler>
    iters_per_batch: null  # computed at runtime; mandatory to have
    max_steps: null  # computed at runtime or explicitly set here; mandatory to have

    # pytorch lightning args <mandatory>
    monitor: val_loss
    reduce_on_plateau: false

    # <scheduler config override>
    args:
      name: auto  # special keyword, resolves to correct scheduler config for given scheduler name
      # cls: mridc.core.config.schedulers.CosineAnnealingParams  # explicit instantiation by class path
      params:  # optional override parameters for the scheduler config
        warmup_steps: null
        warmup_ratio: null
        min_lr: 0.0
        last_epoch: -1

  • train_dataloader (Optional requirement, must be passed if "iters_per_batch" is defined instead of "max_steps". Used to compute effective "max_steps".) –

Return type

A dictionary containing the LR Scheduler implementation if the config was successfully parsed, along with other parameters required by PyTorch Lightning; otherwise None.
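
A sketch of driving prepare_lr_scheduler from a plain dict, assuming “CosineAnnealing” is a registered scheduler name and that max_steps is set explicitly so no train_dataloader is needed:

    import torch
    from mridc.core.optim.lr_scheduler import prepare_lr_scheduler

    model = torch.nn.Linear(16, 4)  # placeholder module
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    scheduler_config = {
        "name": "CosineAnnealing",   # assumed registry name
        "max_steps": 1000,           # set explicitly, so train_dataloader can be omitted
        "monitor": "val_loss",
        "reduce_on_plateau": False,
        "min_lr": 1e-6,              # assumed to be forwarded to the scheduler
    }

    lr_scheduler_dict = prepare_lr_scheduler(optimizer=optimizer, scheduler_config=scheduler_config)
    # On success this dictionary carries the scheduler instance plus the keys PyTorch Lightning
    # expects (e.g. monitor / reduce_on_plateau); if the config cannot be parsed it is None.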

mridc.core.optim.lr_scheduler.register_scheduler(name: str, scheduler: _LRScheduler, scheduler_params: SchedulerParams)[source]

Checks if the scheduler name exists in the registry, and if it doesn’t, adds it. This allows custom schedulers to be added and called by name during instantiation.

Parameters
  • name (Name of the scheduler. Will be used as key to retrieve the scheduler.) –

  • scheduler (Scheduler class (inherits from _LRScheduler)) –

  • scheduler_params (The parameters as a dataclass of the scheduler) –
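
A hedged sketch of registering a custom scheduler; ConstantLR and ConstantLRParams are hypothetical, and the SchedulerParams import path is only inferred from the cls example shown above (mridc.core.config.schedulers):

    from dataclasses import dataclass

    from torch.optim.lr_scheduler import _LRScheduler

    from mridc.core.config.schedulers import SchedulerParams  # assumed location
    from mridc.core.optim.lr_scheduler import register_scheduler

    class ConstantLR(_LRScheduler):
        """Toy scheduler that simply keeps the base learning rates."""

        def get_lr(self):
            return list(self.base_lrs)

    @dataclass
    class ConstantLRParams(SchedulerParams):
        """Hypothetical (empty) parameter dataclass for ConstantLR."""

    register_scheduler(name="ConstantLR", scheduler=ConstantLR, scheduler_params=ConstantLRParams)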

mridc.core.optim.novograd module

class mridc.core.optim.novograd.Novograd(params, lr=0.001, betas=(0.95, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False, luc=False, luc_trust=0.001, luc_eps=1e-08)[source]

Bases: Optimizer

Implements the Novograd algorithm. It was proposed in “Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks” (https://arxiv.org/abs/1905.11286).

Parameters
  • params (Iterable of parameters to optimize or dicts defining parameter groups.) – iterable

  • lr (Learning rate.) – float, (default: 1e-3)

  • betas (Coefficients used for computing running averages of gradient and its square.) – (Tuple[float, float], optional) (default: (0.95, 0.98))

  • eps (Term added to the denominator to improve numerical stability.) – (float, optional), (default: 1e-8)

  • weight_decay (Weight decay (L2 penalty).) – float (optional), (default: 0)

  • amsgrad (Whether to use the AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and Beyond”.) – boolean (optional), (default: False)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (A closure that reevaluates the model and returns the loss.) –

Returns

loss

Return type

The loss, if a closure was provided.
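
As with any torch optimizer, the closure form can be used when the loss needs to be re-evaluated inside step(); a brief sketch (model and data are placeholders):

    import torch
    from mridc.core.optim.novograd import Novograd

    model = torch.nn.Linear(16, 1)  # placeholder module
    criterion = torch.nn.MSELoss()
    x, y = torch.randn(8, 16), torch.randn(8, 1)

    optimizer = Novograd(model.parameters(), lr=1e-3, betas=(0.95, 0.98))

    def closure():
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        return loss

    loss = optimizer.step(closure)  # the loss is returned because a closure was provided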

mridc.core.optim.optimizer_with_master_params module

class mridc.core.optim.optimizer_with_master_params.GradBucket(numel, chunk_size_mb)[source]

Bases: object

Persistent buffer for main gradients that remains allocated between training iterations.

allreduce_buffer()[source]

Synchronous allreduce of the buffer data.

get(shape, start_index)[source]

Return a tensor with the input shape as a view into the 1-D data starting at start_index.

get_allreduce_tensor()[source]

Get a tensor that can be used for allreduce.

update_chunk_info(grad_chunk_info)[source]

Update the chunk info with the grad_chunk_info.

zero()[source]

Reset the buffer to zero.

class mridc.core.optim.optimizer_with_master_params.MainParamsOptimizerWrapper(optimizer, fp32_grad_accum=False, contiguous_grad_bucket=False, async_grad_allreduce=False, grad_div_ar_fusion=True, grad_allreduce_chunk_size_mb=0)[source]

Bases: Optimizer

Float16 optimizer wrapper for half precision (fp16 and bf16) data types. This optimizer wrapper holds main parameters and gradients in fp32 to support stable convergence.

Parameters
  • optimizer (base optimizer such as Adam or SGD.) –

  • fp32_grad_accum (To enable the use of fp32 in gradient accumulation and allreduce.) –

  • contiguous_grad_bucket (To enable allocating the master gradients in contiguous memory space to reduce memory fragmentation.) –

  • async_grad_allreduce (To enable asynchronous gradient allreduce that is executed along with the training step backprop.) –
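
A rough sketch of wrapping a base optimizer; in practice this wrapper is used inside mridc’s distributed half-precision training setup, so treat the standalone call below as illustrative only (the fp16 model is hypothetical):

    import torch
    from mridc.core.optim.optimizer_with_master_params import MainParamsOptimizerWrapper

    model = torch.nn.Linear(16, 4).half()  # hypothetical fp16 model
    base_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    optimizer = MainParamsOptimizerWrapper(
        base_optimizer,
        fp32_grad_accum=True,         # keep gradient accumulation/allreduce in fp32
        contiguous_grad_bucket=True,  # allocate master gradients contiguously
    )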

allreduce_main_grads()[source]

All reduce main grads.

property async_master_grads_allreduce

Return whether to use async allreduce for master grads.

copy_model_grads_to_main_grads()[source]

Copy model grads to main grads.

property defaults

Promote defaults, so it can be retrieved or set via ‘optimizer_instance.defaults’.

property fp32_grad_accumulation

Return whether to accumulate gradients in fp32.

get_parameters()[source]

Return the parameters of the optimizer.

load_state_dict(state_dict)[source]

Load the state of the optimizer.

no_sync()[source]

A context manager to disable gradient synchronizations across data-parallel ranks.

property param_groups

Promote param_groups, so it can be retrieved or set via ‘optimizer_instance.param_groups’ (for example, to adjust the learning rate).

reload_model_params()[source]

Reload model params.

property state

Promote state, so it can be retrieved or set via ‘optimizer_instance.state’.

state_dict()[source]

Return the state of the optimizer.

step(**kwargs)[source]

Step the optimizer.

zero_grad(set_to_none=True)[source]

We only need to zero the model related parameters, i.e., float16_groups & fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; in the case of set_to_none==True, the space used by this field can be safely deallocated at this point.

mridc.core.optim.optimizers module

mridc.core.optim.optimizers.get_optimizer(name: str, **kwargs: Optional[Dict[str, Any]]) → partial[source]

Convenience method to obtain an Optimizer class and partially instantiate it with optimizer kwargs.

Parameters
  • name (Name of the Optimizer in the registry.) –

  • kwargs (Optional kwargs of the optimizer used during instantiation.) –

Return type

A partially instantiated Optimizer.
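
A sketch, assuming “novograd” is the registry key under which the Novograd optimizer was registered:

    import torch
    from mridc.core.optim.optimizers import get_optimizer

    model = torch.nn.Linear(16, 4)  # placeholder module

    optimizer_partial = get_optimizer("novograd", lr=1e-3, weight_decay=0.001)
    optimizer = optimizer_partial(model.parameters())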

mridc.core.optim.optimizers.parse_optimizer_args(optimizer_name: str, optimizer_kwargs: Union[DictConfig, Dict[str, Any]]) → Union[Dict[str, Any], DictConfig][source]

Parses a list of strings of the format “key=value” or “key2=val1,val2,…” into a dictionary of type {key=value, key2=[val1, val2], …}. This dictionary is then used to instantiate the chosen Optimizer.

Parameters
  • optimizer_name (string name of the optimizer, used for auto resolution of params.) –

  • optimizer_kwargs (Either a list of strings in a specified format, or a dictionary. If a dictionary is provided, it is assumed the dictionary is the final parsed value and simply returned. If a list of strings is provided, each item in the list is parsed into a new dictionary.) –

Return type

A dictionary of the parsed arguments.
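
For illustration, the dictionary case described above, where the dict is treated as the final parsed value:

    from mridc.core.optim.optimizers import parse_optimizer_args

    # Per the description above, a dictionary input is assumed to be the final parsed
    # value and is returned as-is.
    kwargs = parse_optimizer_args("novograd", {"lr": 1e-3, "betas": (0.95, 0.98)})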

mridc.core.optim.optimizers.register_optimizer(name: str, optimizer: Optimizer, optimizer_params: OptimizerParams)[source]

Checks if the optimizer name exists in the registry, and if it doesn’t, adds it. This allows custom optimizers to be added and called by name during instantiation.

Parameters
  • name (Name of the optimizer. Will be used as key to retrieve the optimizer.) –

  • optimizer (Optimizer class.) –

  • optimizer_params (The parameters as a dataclass of the optimizer.) –
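
A hedged sketch mirroring the scheduler registration above; the registry name “my_adamw”, the MyAdamWParams dataclass, and the OptimizerParams import path (mridc.core.config.optimizers) are all assumptions for illustration:

    from dataclasses import dataclass

    import torch

    from mridc.core.config.optimizers import OptimizerParams  # assumed location
    from mridc.core.optim.optimizers import register_optimizer

    @dataclass
    class MyAdamWParams(OptimizerParams):
        """Hypothetical parameter dataclass for torch.optim.AdamW."""

        weight_decay: float = 0.01

    register_optimizer(name="my_adamw", optimizer=torch.optim.AdamW, optimizer_params=MyAdamWParams)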

Module contents