mridc.core.optim package

Submodules

mridc.core.optim.adafactor module

class mridc.core.optim.adafactor.Adafactor(params, lr=None, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=True, warmup_init=False, min_step=0.01)[source]

Bases: Optimizer

Implements the Adafactor algorithm.

This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (see https://arxiv.org/abs/1804.04235). Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options. To use a manual (external) learning rate schedule you should set scale_parameter=False and relative_step=False.

Parameters
  • params (Iterable of parameters to optimize or dicts defining parameter groups.) – iterable

  • lr (External learning rate.) – float (optional), (default: None)

  • eps (Regularization constants for square gradient and parameter scale respectively.) – tuple (float, float), (default: (1e-30, 1e-3))

  • clip_threshold (Threshold of root-mean-square of final gradient update.) – float, (default: 1.0)

  • decay_rate (Coefficient used to compute running averages of square gradient.) – float, (default: -0.8)

  • beta1 (Coefficient used for computing running averages of gradient) – float, (default: None)

  • weight_decay (Weight decay (L2 penalty).) – float (optional), (default: 0)

  • scale_parameter (If True, learning rate is scaled by root-mean-square of parameter.) – bool (default: True)

  • relative_step (If True, time-dependent learning rate is computed instead of external learning rate.) – bool (default: True)

  • warmup_init (Time-dependent learning rate computation depends on whether warm-up initialization is being used.) – bool (default: False)

Return type

Adafactor Optimizer
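
A minimal usage sketch (the small linear module below is only a placeholder): with the defaults the optimizer computes its own time-dependent learning rate, while an external schedule requires scale_parameter=False and relative_step=False, as noted above.

    import torch
    from mridc.core.optim.adafactor import Adafactor

    model = torch.nn.Linear(16, 4)  # placeholder module for illustration

    # Default behaviour: internally adjusted, time-dependent learning rate.
    optimizer = Adafactor(model.parameters(), lr=None, scale_parameter=True, relative_step=True)

    # Manual (external) learning rate, e.g. when pairing with an explicit LR scheduler.
    optimizer = Adafactor(model.parameters(), lr=1e-3, scale_parameter=False, relative_step=False)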

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (A closure that reevaluates the model and returns the loss.) – callable (optional)

property supports_flat_params

Whether the optimizer supports flat parameters.

property supports_memory_efficient_fp16

Whether the optimizer supports memory-efficient fp16.

mridc.core.optim.lr_scheduler module

class mridc.core.optim.lr_scheduler.CosineAnnealing(optimizer, *, max_steps, min_lr=0, last_epoch=-1, **kwargs)[source]

Bases: WarmupAnnealHoldPolicy

Anneal learning rate by cosine.
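
A usage sketch, assuming warmup_steps is forwarded to the WarmupAnnealHoldPolicy base class via **kwargs (the linear module is a placeholder):

    import torch
    from mridc.core.optim.lr_scheduler import CosineAnnealing

    model = torch.nn.Linear(16, 4)  # placeholder module
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # warmup_steps is assumed to be accepted through the base-class kwargs.
    scheduler = CosineAnnealing(optimizer, max_steps=1000, warmup_steps=100, min_lr=1e-5)

    for _ in range(1000):
        optimizer.step()   # forward/backward omitted for brevity
        scheduler.step()   # advance the cosine annealing schedule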

class mridc.core.optim.lr_scheduler.InverseSquareRootAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]

Bases: WarmupPolicy

Inverse square root learning rate annealing.

class mridc.core.optim.lr_scheduler.NoamAnnealing(optimizer, *, d_model, warmup_steps=None, warmup_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Noam learning rate annealing.

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.NoamHoldAnnealing(optimizer, *, max_steps, decay_rate=0.5, min_lr=0.0, last_epoch=-1, **kwargs)[source]

Bases: WarmupHoldPolicy

class mridc.core.optim.lr_scheduler.PolynomialDecayAnnealing(optimizer, *, max_steps, min_lr=0.0, power=1.0, cycle=False, last_epoch=-1, **kwargs)[source]

Bases: WarmupPolicy

Polynomial decay learning rate annealing.

class mridc.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing(optimizer, *, max_steps, min_lr=0.0, power=1.0, cycle=False, last_epoch=-1, **kwargs)[source]

Bases: WarmupHoldPolicy

Polynomial decay learning rate annealing.

class mridc.core.optim.lr_scheduler.SquareAnnealing(optimizer, *, max_steps, min_lr=1e-05, last_epoch=-1, **kwargs)[source]

Bases: WarmupPolicy

Anneal learning rate by square.

class mridc.core.optim.lr_scheduler.SquareRootAnnealing(optimizer, *, max_steps, min_lr=0, last_epoch=-1, **kwargs)[source]

Bases: WarmupPolicy

Anneal learning rate by square root.

class mridc.core.optim.lr_scheduler.SquareRootConstantPolicy(optimizer, *, constant_steps=None, constant_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Adds constant-learning-rate kwargs and logic to the lr policy. All arguments should be passed as kwargs for clarity.

Parameters
  • constant_steps (Number of steps to keep lr constant at.) –

  • constant_ratio (Ratio of steps to keep lr constant.) –

  • max_steps (Total number of steps while training or None for infinite training.) –

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.T5InverseSquareRootAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]

Bases: SquareRootConstantPolicy

Inverse square root learning rate annealing.

class mridc.core.optim.lr_scheduler.WarmupAnnealHoldPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, constant_steps=None, constant_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Adds warmup kwargs and warmup logic to lr policy. All arguments should be passed as kwargs for clarity.

Parameters
  • warmup_steps (Number of training steps in warmup stage) –

  • warmup_ratio (Ratio of warmup steps to total steps) –

  • max_steps (Total number of steps while training or None for infinite training) –

  • min_lr (Minimum lr to hold the learning rate after decay at.) –

  • constant_steps (Number of steps to keep lr constant at.) –

  • constant_ratio (Ratio of steps to keep lr constant.) –

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.WarmupAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]

Bases: WarmupPolicy

Warmup learning rate annealing.

class mridc.core.optim.lr_scheduler.WarmupHoldPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, hold_steps=None, hold_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: WarmupPolicy

Variant of WarmupPolicy which maintains a high learning rate for a defined number of steps. All arguments should be passed as kwargs for clarity.

Parameters
  • warmup_steps (Number of training steps in warmup stage) –

  • warmup_ratio (Ratio of warmup steps to total steps) –

  • hold_steps (Number of training steps to hold the learning rate after warm up) –

  • hold_ratio (Ratio of hold steps to total steps) –

  • max_steps (Total number of steps while training or None for infinite training) –

Results

Learning rate is linearly increased from 0 to 1 over warmup steps, then linearly decreased from 1 to 0 over hold steps.

get_lr()[source]

Get learning rate at current step.

class mridc.core.optim.lr_scheduler.WarmupPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]

Bases: _LRScheduler

Adds warmup kwargs and warmup logic to lr policy. All arguments should be passed as kwargs for clarity.

Parameters
  • warmup_steps (Number of training steps in warmup stage.) –

  • warmup_ratio (Ratio of warmup steps to total steps.) –

  • max_steps (Total number of steps while training or None for infinite training.) –

Returns

lr

Return type

Learning rate for current step.

get_lr()[source]

Get learning rate at current step.

mridc.core.optim.lr_scheduler.compute_max_steps(max_epochs, accumulate_grad_batches, limit_train_batches, num_workers, num_samples, batch_size, drop_last)[source]

Compute effective max_steps from the provided parameters.
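
For illustration, a call with hypothetical trainer and dataloader settings (all values below are made up):

    from mridc.core.optim.lr_scheduler import compute_max_steps

    max_steps = compute_max_steps(
        max_epochs=10,              # hypothetical trainer setting
        accumulate_grad_batches=1,  # no gradient accumulation
        limit_train_batches=1.0,    # use the full training set each epoch
        num_workers=4,
        num_samples=10000,          # size of the training set
        batch_size=32,
        drop_last=True,
    )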

mridc.core.optim.lr_scheduler.get_scheduler(name: str, **kwargs: Optional[Dict[str, Any]]) → _LRScheduler[source]

Convenience method to obtain an _LRScheduler class and partially instantiate it with optimizer kwargs.

Parameters
  • name (Name of the scheduler in the registry.) –

  • kwargs (Optional kwargs of the scheduler used during instantiation.) –

Return type

A partially instantiated _LRScheduler
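
A sketch, assuming “CosineAnnealing” is a registered scheduler name and that the returned partial only needs the optimizer to complete instantiation:

    import torch
    from mridc.core.optim.lr_scheduler import get_scheduler

    model = torch.nn.Linear(16, 4)  # placeholder module
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    scheduler_partial = get_scheduler("CosineAnnealing", max_steps=1000, min_lr=1e-6)
    scheduler = scheduler_partial(optimizer)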

mridc.core.optim.lr_scheduler.prepare_lr_scheduler(optimizer: Optimizer, scheduler_config: Optional[Union[Dict[str, Any], DictConfig]], train_dataloader: Optional[DataLoader] = None) → Optional[Dict[str, Any]][source]

Constructs an LR Scheduler (optionally) for a given optimizer, based on a config with the following schema.

Parameters
  • optimizer (The optimizer to use for the scheduler.) –

    name: <name of optimizer>
    lr: <maximal learning rate>

    # <additional optimizer arguments>
    args:
      name: auto  # special keyword, resolves to correct optimizer config for given optimizer name
      # cls: mridc.core.config.optimizers.NovogradParams  # explicit instantiation by class path
      params:  # optional override parameters for the optimizer config
        betas: [0.8, 0.5]
        weight_decay: 0.001

  • scheduler_config (The scheduler config.) –

    name: <name of scheduler>
    iters_per_batch: null  # computed at runtime; mandatory to have
    max_steps: null  # computed at runtime or explicitly set here; mandatory to have

    # pytorch lightning args <mandatory>
    monitor: val_loss
    reduce_on_plateau: false

    # <scheduler config override>
    args:
      name: auto  # special keyword, resolves to correct scheduler config for given scheduler name
      # cls: mridc.core.config.schedulers.CosineAnnealingParams  # explicit instantiation by class path
      params:  # optional override parameters for the scheduler config
        warmup_steps: null
        warmup_ratio: null
        min_lr: 0.0
        last_epoch: -1

  • train_dataloader (Optional requirement, must be passed if "iters_per_batch" is defined instead of "max_steps". Used to compute effective "max_steps".) –

Return type

A dictionary containing the LR Scheduler implementation if the config was successfully parsed, along with other parameters required by PyTorch Lightning; otherwise None.
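
A sketch of driving prepare_lr_scheduler from a plain dict, assuming “CosineAnnealing” is a registered scheduler name and that max_steps is set explicitly so no train_dataloader is needed:

    import torch
    from mridc.core.optim.lr_scheduler import prepare_lr_scheduler

    model = torch.nn.Linear(16, 4)  # placeholder module
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    scheduler_config = {
        "name": "CosineAnnealing",   # assumed registry name
        "max_steps": 1000,           # set explicitly, so train_dataloader can be omitted
        "monitor": "val_loss",
        "reduce_on_plateau": False,
        "min_lr": 1e-6,              # assumed to be forwarded to the scheduler
    }

    lr_scheduler_dict = prepare_lr_scheduler(optimizer=optimizer, scheduler_config=scheduler_config)
    # On success this dictionary carries the scheduler instance plus the keys PyTorch Lightning
    # expects (e.g. monitor / reduce_on_plateau); if the config cannot be parsed it is None.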

mridc.core.optim.lr_scheduler.register_scheduler(name: str, scheduler: _LRScheduler, scheduler_params: SchedulerParams)[source]

Checks if the scheduler name exists in the registry, and if it doesn’t, adds it. This allows custom schedulers to be added and called by name during instantiation.

Parameters
  • name (Name of the scheduler. Will be used as key to retrieve the scheduler.) –

  • scheduler (Scheduler class (inherits from _LRScheduler)) –

  • scheduler_params (The parameters as a dataclass of the scheduler) –
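
A hedged sketch of registering a custom scheduler; ConstantLR and ConstantLRParams are hypothetical, and the SchedulerParams import path is only inferred from the cls example shown above (mridc.core.config.schedulers):

    from dataclasses import dataclass

    from torch.optim.lr_scheduler import _LRScheduler

    from mridc.core.config.schedulers import SchedulerParams  # assumed location
    from mridc.core.optim.lr_scheduler import register_scheduler

    class ConstantLR(_LRScheduler):
        """Toy scheduler that simply keeps the base learning rates."""

        def get_lr(self):
            return list(self.base_lrs)

    @dataclass
    class ConstantLRParams(SchedulerParams):
        """Hypothetical (empty) parameter dataclass for ConstantLR."""

    register_scheduler(name="ConstantLR", scheduler=ConstantLR, scheduler_params=ConstantLRParams)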

mridc.core.optim.novograd module

class mridc.core.optim.novograd.Novograd(params, lr=0.001, betas=(0.95, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False, luc=False, luc_trust=0.001, luc_eps=1e-08)[source]

Bases: Optimizer

Implements the Novograd algorithm. It was proposed in “Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks” (https://arxiv.org/abs/1905.11286).

Parameters
  • params (Iterable of parameters to optimize or dicts defining parameter groups.) – iterable

  • lr (Learning rate.) – float, (default: 1e-3)

  • betas (Coefficients used for computing running averages of gradient and its square.) – (Tuple[float, float], optional) (default: (0.95, 0.98))

  • eps (Term added to the denominator to improve numerical stability.) – (float, optional), (default: 1e-8)

  • weight_decay (Weight decay (L2 penalty).) – float (optional), (default: 0)

  • amsgrad (Whether to use the AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and Beyond”.) – boolean (optional), (default: False)

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (A closure that reevaluates the model and returns the loss.) –

Returns

loss

Return type

The loss, if a closure was provided.
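
As with any torch optimizer, the closure form can be used when the loss needs to be re-evaluated inside step(); a brief sketch (model and data are placeholders):

    import torch
    from mridc.core.optim.novograd import Novograd

    model = torch.nn.Linear(16, 1)  # placeholder module
    criterion = torch.nn.MSELoss()
    x, y = torch.randn(8, 16), torch.randn(8, 1)

    optimizer = Novograd(model.parameters(), lr=1e-3, betas=(0.95, 0.98))

    def closure():
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        return loss

    loss = optimizer.step(closure)  # the loss is returned because a closure was provided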

mridc.core.optim.optimizer_with_master_params module

class mridc.core.optim.optimizer_with_master_params.GradBucket(numel, chunk_size_mb)[source]

Bases: object

Persistent buffer for main gradients that remains allocated between training iterations.

allreduce_buffer()[source]

Synchronous allreduce of the buffer data.

get(shape, start_index)[source]

Return a tensor with the input shape as a view into the 1-D data starting at start_index.

get_allreduce_tensor()[source]

Get a tensor that can be used for allreduce.

update_chunk_info(grad_chunk_info)[source]

Update the chunk info with the grad_chunk_info.

zero()[source]

Reset the buffer to zero.

class mridc.core.optim.optimizer_with_master_params.MainParamsOptimizerWrapper(optimizer, fp32_grad_accum=False, contiguous_grad_bucket=False, async_grad_allreduce=False, grad_div_ar_fusion=True, grad_allreduce_chunk_size_mb=0)[source]

Bases: Optimizer

Float16 optimizer wrapper for half precision (fp16 and bf16) data types. This optimizer wrapper holds main parameters and gradients in fp32 to support stable convergence.

Parameters
  • optimizer (base optimizer such as Adam or SGD.) –

  • fp32_grad_accum (To enable the use of fp32 in gradient accumulation and allreduce.) –

  • contiguous_grad_bucket (To enable allocating the master gradients in contiguous memory space to reduce memory fragmentation.) –

  • async_grad_allreduce (To enable asynchronous gradient allreduce that is executed along with the training step backprop.) –
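
A rough sketch of wrapping a base optimizer; in practice this wrapper is used inside mridc’s distributed half-precision training setup, so treat the standalone call below as illustrative only (the fp16 model is hypothetical):

    import torch
    from mridc.core.optim.optimizer_with_master_params import MainParamsOptimizerWrapper

    model = torch.nn.Linear(16, 4).half()  # hypothetical fp16 model
    base_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    optimizer = MainParamsOptimizerWrapper(
        base_optimizer,
        fp32_grad_accum=True,         # keep gradient accumulation/allreduce in fp32
        contiguous_grad_bucket=True,  # allocate master gradients contiguously
    )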

allreduce_main_grads()[source]

All reduce main grads.

property async_master_grads_allreduce

Return whether to use async allreduce for master grads.

copy_model_grads_to_main_grads()[source]

Copy model grads to main grads.

property defaults

Promote defaults, so it can be retrieved or set via ‘optimizer_instance.defaults’.

property fp32_grad_accumulation

Return whether to accumulate gradients in fp32.

get_parameters()[source]

Return the parameters of the optimizer.

load_state_dict(state_dict)[source]

Load the state of the optimizer.

no_sync()[source]

A context manager to disable gradient synchronizations across data-parallel ranks.

property param_groups

Promote param_groups, so it can be retrieved or set via ‘optimizer_instance.param_groups’ (for example, to adjust the learning rate).

reload_model_params()[source]

Reload model params.

property state

Promote state, so it can be retrieved or set via ‘optimizer_instance.state’.

state_dict()[source]

Return the state of the optimizer.

step(**kwargs)[source]

Step the optimizer.

zero_grad(set_to_none=True)[source]

We only need to zero the model related parameters, i.e., float16_groups & fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; in the case of set_to_none==True, the space used by this field can be safely deallocated at this point.

mridc.core.optim.optimizers module

mridc.core.optim.optimizers.get_optimizer(name: str, **kwargs: Optional[Dict[str, Any]]) → partial[source]

Convenience method to obtain an Optimizer class and partially instantiate it with optimizer kwargs.

Parameters
  • name (Name of the Optimizer in the registry.) –

  • kwargs (Optional kwargs of the optimizer used during instantiation.) –

Return type

A partially instantiated Optimizer.
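
A sketch, assuming “novograd” is the registry key under which the Novograd optimizer was registered:

    import torch
    from mridc.core.optim.optimizers import get_optimizer

    model = torch.nn.Linear(16, 4)  # placeholder module

    optimizer_partial = get_optimizer("novograd", lr=1e-3, weight_decay=0.001)
    optimizer = optimizer_partial(model.parameters())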

mridc.core.optim.optimizers.parse_optimizer_args(optimizer_name: str, optimizer_kwargs: Union[DictConfig, Dict[str, Any]]) → Union[Dict[str, Any], DictConfig][source]

Parses a list of strings of the format “key=value” or “key2=val1,val2,…” into a dictionary of type {key=value, key2=[val1, val2], …}. This dictionary is then used to instantiate the chosen Optimizer.

Parameters
  • optimizer_name (string name of the optimizer, used for auto resolution of params.) –

  • optimizer_kwargs (Either a list of strings in a specified format, or a dictionary. If a dictionary is provided, it is assumed the dictionary is the final parsed value and simply returned. If a list of strings is provided, each item in the list is parsed into a new dictionary.) –

Return type

A dictionary of the parsed arguments.
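
For illustration, the dictionary case described above, where the dict is treated as the final parsed value:

    from mridc.core.optim.optimizers import parse_optimizer_args

    # Per the description above, a dictionary input is assumed to be the final parsed
    # value and is returned as-is.
    kwargs = parse_optimizer_args("novograd", {"lr": 1e-3, "betas": (0.95, 0.98)})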

mridc.core.optim.optimizers.register_optimizer(name: str, optimizer: Optimizer, optimizer_params: OptimizerParams)[source]

Checks if the optimizer name exists in the registry, and if it doesn’t, adds it. This allows custom optimizers to be added and called by name during instantiation.

Parameters
  • name (Name of the optimizer. Will be used as key to retrieve the optimizer.) –

  • optimizer (Optimizer class.) –

  • optimizer_params (The parameters as a dataclass of the optimizer.) –
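
A hedged sketch mirroring the scheduler registration above; the registry name “my_adamw”, the MyAdamWParams dataclass, and the OptimizerParams import path (mridc.core.config.optimizers) are all assumptions for illustration:

    from dataclasses import dataclass

    import torch

    from mridc.core.config.optimizers import OptimizerParams  # assumed location
    from mridc.core.optim.optimizers import register_optimizer

    @dataclass
    class MyAdamWParams(OptimizerParams):
        """Hypothetical parameter dataclass for torch.optim.AdamW."""

        weight_decay: float = 0.01

    register_optimizer(name="my_adamw", optimizer=torch.optim.AdamW, optimizer_params=MyAdamWParams)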

Module contents