mridc.core.optim package
Submodules
mridc.core.optim.adafactor module
- class mridc.core.optim.adafactor.Adafactor(params, lr=None, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=True, warmup_init=False, min_step=0.01)[source]
Bases:
Optimizer
Implements Adafactor algorithm.
This implementation is based on Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (see https://arxiv.org/abs/1804.04235). Note that this optimizer internally adjusts the learning rate depending on the scale_parameter, relative_step and warmup_init options. To use a manual (external) learning rate schedule, set scale_parameter=False and relative_step=False.
- Parameters
params (Iterable of parameters to optimize or dicts defining parameter groups.) – iterable
lr (External learning rate.) – float (optional), (default: None)
eps (Regularization constants for square gradient and parameter scale respectively.) – tuple (float, float), (default: (1e-30, 1e-3))
clip_threshold (Threshold of root-mean-square of final gradient update.) – float, (default: 1.0)
decay_rate (Coefficient used to compute running averages of square gradient.) – float, (default: -0.8)
beta1 (Coefficient used for computing running averages of gradient) – float, (default: None)
weight_decay (Weight decay (L2 penalty).) – float (optional), (default: 0)
scale_parameter (If True, learning rate is scaled by root-mean-square of parameter.) – bool (default: True)
relative_step (If True, time-dependent learning rate is computed instead of external learning rate.) – bool (default: True)
warmup_init (Time-dependent learning rate computation depends on whether warm-up initialization is being used.) – bool (default: False)
- Return type
Adafactor Optimizer
- step(closure=None)[source]
Performs a single optimization step.
- Parameters
closure (A closure that reevaluates the model and returns the loss.) – callable (optional)
- property supports_flat_params
Whether the optimizer supports flat parameters.
- property supports_memory_efficient_fp16
Whether the optimizer supports memory-efficient fp16.
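To make the interaction between relative_step, warmup_init and scale_parameter more concrete, the following is a minimal sketch of the step-size rule described in the Adafactor paper. The function names are hypothetical and not part of mridc. It is an assumption that min_step corresponds to the paper's 0.01 cap and that the warm-up cap grows as 1e-6 per step; the actual module may differ.

```python
import math


def relative_step_size(step, warmup_init=False, min_step=1e-2):
    """Time-dependent step size, as used when relative_step=True (sketch)."""
    # Assumption: with warm-up, the cap grows linearly as 1e-6 * step.
    cap = 1e-6 * step if warmup_init else min_step
    return min(cap, 1.0 / math.sqrt(step))


def effective_lr(step, param_rms, eps2=1e-3, scale_parameter=True,
                 warmup_init=False):
    """Step size after optional scaling by the parameter's root-mean-square."""
    # eps2 is the second entry of the eps tuple (parameter-scale regularizer).
    scale = max(eps2, param_rms) if scale_parameter else 1.0
    return scale * relative_step_size(step, warmup_init)


def beta2_at(step, decay_rate=-0.8):
    """Decay coefficient for the squared-gradient average: 1 - step ** decay_rate."""
    return 1.0 - math.pow(step, decay_rate)
```

With these defaults the step size stays at the cap for the first 10,000 steps and then decays as the inverse square root of the step count.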
mridc.core.optim.lr_scheduler module
- class mridc.core.optim.lr_scheduler.CosineAnnealing(optimizer, *, max_steps, min_lr=0, last_epoch=-1, **kwargs)[source]
Bases:
WarmupAnnealHoldPolicy
Anneal learning rate by cosine.
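As an illustration of the annealing phase only, the sketch below shows a standard cosine decay from the initial learning rate to min_lr over max_steps. The function name is hypothetical, and the warmup and constant phases inherited from WarmupAnnealHoldPolicy are deliberately left out; whether the module uses exactly this expression is an assumption.

```python
import math


def cosine_annealing(initial_lr, step, max_steps, min_lr=0.0):
    """Cosine decay from initial_lr (step 0) to min_lr (step max_steps)."""
    mult = 0.5 * (1.0 + math.cos(math.pi * step / max_steps))
    return (initial_lr - min_lr) * mult + min_lr
```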
- class mridc.core.optim.lr_scheduler.InverseSquareRootAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]
Bases:
WarmupPolicy
Inverse square root learning rate annealing.
- class mridc.core.optim.lr_scheduler.NoamAnnealing(optimizer, *, d_model, warmup_steps=None, warmup_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]
Bases:
_LRScheduler
Noam learning rate annealing.
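The Noam schedule is commonly written as a linear warmup followed by inverse-square-root decay, scaled by d_model ** -0.5. The sketch below shows that common form; the function name is hypothetical, and it is an assumption that this class follows the same formula and multiplies it by the optimizer's base learning rate.

```python
def noam_lr(initial_lr, step, d_model, warmup_steps):
    """Noam schedule: linear warmup, then decay proportional to step ** -0.5."""
    step = max(step, 1)  # guard against 0 ** -0.5
    scale = d_model ** -0.5
    return initial_lr * scale * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In this form the learning rate peaks at step == warmup_steps, where both branches of min() coincide.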
- class mridc.core.optim.lr_scheduler.NoamHoldAnnealing(optimizer, *, max_steps, decay_rate=0.5, min_lr=0.0, last_epoch=-1, **kwargs)[source]
Bases:
WarmupHoldPolicy
- class mridc.core.optim.lr_scheduler.PolynomialDecayAnnealing(optimizer, *, max_steps, min_lr=0.0, power=1.0, cycle=False, last_epoch=-1, **kwargs)[source]
Bases:
WarmupPolicy
Polynomial decay learning rate annealing.
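A minimal sketch of polynomial decay with the power and cycle options is shown below. The function name is hypothetical. Treating cycle=True as stretching the decay horizon to the next multiple of max_steps is an assumption borrowed from common polynomial-decay implementations, and the warmup phase from WarmupPolicy is omitted.

```python
import math


def polynomial_decay(initial_lr, step, max_steps, min_lr=0.0, power=1.0,
                     cycle=False):
    """Decay from initial_lr to min_lr following (1 - step / decay_steps) ** power."""
    if cycle:
        # Stretch the decay horizon to the next multiple of max_steps.
        multiplier = 1.0 if step == 0 else math.ceil(step / max_steps)
        decay_steps = max_steps * multiplier
    else:
        decay_steps = max_steps
        step = min(step, decay_steps)  # hold at min_lr once max_steps is reached
    fraction = 1.0 - step / decay_steps
    return (initial_lr - min_lr) * fraction ** power + min_lr
```

With power=1.0 this reduces to a linear decay; larger powers drop the learning rate faster early on.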
- class mridc.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing(optimizer, *, max_steps, min_lr=0.0, power=1.0, cycle=False, last_epoch=-1, **kwargs)[source]
Bases:
WarmupHoldPolicy
Polynomial decay learning rate annealing.
- class mridc.core.optim.lr_scheduler.SquareAnnealing(optimizer, *, max_steps, min_lr=1e-05, last_epoch=-1, **kwargs)[source]
Bases:
WarmupPolicy
Anneal learning rate by square.
- class mridc.core.optim.lr_scheduler.SquareRootAnnealing(optimizer, *, max_steps, min_lr=0, last_epoch=-1, **kwargs)[source]
Bases:
WarmupPolicy
Anneal learning rate by square root.
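The square and square-root variants can be sketched as the same remaining-fraction rule with different exponents. The function names below are hypothetical, and it is an assumption that min_lr acts as a floor on the result rather than as an offset; the warmup phase is omitted.

```python
def square_annealing(initial_lr, step, max_steps, min_lr=1e-5):
    """Scale the learning rate by the squared fraction of remaining steps."""
    step = min(step, max_steps)
    mult = ((max_steps - step) / max_steps) ** 2
    return max(initial_lr * mult, min_lr)


def square_root_annealing(initial_lr, step, max_steps, min_lr=0.0):
    """Scale the learning rate by the square root of the fraction of remaining steps."""
    step = min(step, max_steps)  # avoids a negative base under the square root
    mult = ((max_steps - step) / max_steps) ** 0.5
    return max(initial_lr * mult, min_lr)
```

The square variant decays quickly at the start, while the square-root variant keeps the learning rate high for longer and drops steeply near max_steps.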
- class mridc.core.optim.lr_scheduler.SquareRootConstantPolicy(optimizer, *, constant_steps=None, constant_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]
Bases:
_LRScheduler
Adds constant-phase kwargs and logic to lr policy. All arguments should be passed as kwargs for clarity.
- Parameters
constant_steps (Number of steps to keep lr constant at.) –
constant_ratio (Ratio of steps to keep lr constant.) –
max_steps (Total number of steps while training or None for infinite training) –
- class mridc.core.optim.lr_scheduler.T5InverseSquareRootAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]
Bases:
SquareRootConstantPolicy
Inverse square root learning rate annealing.
- class mridc.core.optim.lr_scheduler.WarmupAnnealHoldPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, constant_steps=None, constant_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]
Bases:
_LRScheduler
Adds warmup kwargs and warmup logic to lr policy. All arguments should be passed as kwargs for clarity.
- Parameters
warmup_steps (Number of training steps in warmup stage) –
warmup_ratio (Ratio of warmup steps to total steps) –
max_steps (Total number of steps while training or None for infinite training) –
min_lr (Minimum lr to hold the learning rate after decay at.) –
constant_steps (Number of steps to keep lr constant at.) –
constant_ratio (Ratio of steps to keep lr constant.) –
- class mridc.core.optim.lr_scheduler.WarmupAnnealing(optimizer, *, max_steps, last_epoch=-1, min_lr=0.0, **kwargs)[source]
Bases:
WarmupPolicy
Warmup learning rate annealing.
- class mridc.core.optim.lr_scheduler.WarmupHoldPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, hold_steps=None, hold_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]
Bases:
WarmupPolicy
Variant of WarmupPolicy which maintains a high learning rate for a defined number of steps. All arguments should be passed as kwargs for clarity.
- Parameters
warmup_steps (Number of training steps in warmup stage) –
warmup_ratio (Ratio of warmup steps to total steps) –
hold_steps (Number of training steps to hold the learning rate after warm up) –
hold_ratio (Ratio of hold steps to total steps) –
max_steps (Total number of steps while training or None for infinite training) –
- Results
Learning rate is linearly increased from 0 to 1 over warmup steps, then linearly decreased from 1 to 0 over hold steps.
- class mridc.core.optim.lr_scheduler.WarmupPolicy(optimizer, *, warmup_steps=None, warmup_ratio=None, max_steps=None, min_lr=0.0, last_epoch=-1)[source]
Bases:
_LRScheduler
Adds warmup kwargs and warmup logic to lr policy. All arguments should be passed as kwargs for clarity.
- Parameters
warmup_steps (Number of training steps in warmup stage.) –
warmup_ratio (Ratio of warmup steps to total steps.) –
max_steps (Total number of steps while training or None for infinite training.) –
- Returns
lr
- Return type
Learning rate for current step.
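The relation between warmup_steps, warmup_ratio and max_steps can be sketched as follows. Both helper names are hypothetical. The rule that only one of warmup_steps and warmup_ratio may be given, and the (step + 1) / (warmup_steps + 1) linear ramp, are assumptions about typical behaviour rather than statements about this class.

```python
def resolve_warmup_steps(max_steps, warmup_steps=None, warmup_ratio=None):
    """Turn either an absolute count or a ratio into a number of warmup steps."""
    if warmup_steps is not None and warmup_ratio is not None:
        raise ValueError("Set either warmup_steps or warmup_ratio, not both.")
    if warmup_steps is not None:
        return warmup_steps
    if warmup_ratio is not None:
        return int(warmup_ratio * max_steps)
    return 0  # no warmup requested


def warmup_lr(base_lr, step, warmup_steps):
    """Linear ramp that reaches base_lr at the last warmup step."""
    return base_lr * (step + 1) / (warmup_steps + 1)
```

After the warmup stage, a concrete subclass would supply its own decay rule for the remaining steps.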
- mridc.core.optim.lr_scheduler.compute_max_steps(max_epochs, accumulate_grad_batches, limit_train_batches, num_workers, num_samples, batch_size, drop_last)[source]
Compute effective max_steps from the provided parameters.
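One plausible way to derive the step count from these arguments is sketched below. The function name estimate_max_steps is hypothetical, and the rounding rules (floor when drop_last is set, an int limit_train_batches as an absolute cap, a float as a fraction) are assumptions modelled on common PyTorch Lightning conventions, not a description of the actual implementation.

```python
import math


def estimate_max_steps(max_epochs, accumulate_grad_batches, limit_train_batches,
                       num_workers, num_samples, batch_size, drop_last):
    """Estimate the number of optimizer steps over a whole training run."""
    round_fn = math.floor if drop_last else math.ceil
    # Samples seen by each data-parallel worker in one epoch.
    samples_per_worker = math.ceil(num_samples / max(1, num_workers))
    steps_per_epoch = round_fn(samples_per_worker / batch_size)
    if isinstance(limit_train_batches, int):
        # Integer limit: absolute cap on batches per epoch.
        steps_per_epoch = min(steps_per_epoch, limit_train_batches)
    else:
        # Float limit: fraction of the batches per epoch.
        steps_per_epoch = int(steps_per_epoch * limit_train_batches)
    # Gradient accumulation merges several batches into one optimizer step.
    return math.ceil(steps_per_epoch / accumulate_grad_batches) * max_epochs
```

For example, 1000 samples with a batch size of 32 on one worker give 32 batches per epoch; accumulating over 2 batches for 10 epochs yields 160 optimizer steps.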
- mridc.core.optim.lr_scheduler.get_scheduler(name: str, **kwargs: Optional[Dict[str, Any]]) → _LRScheduler [source]
Convenience method to obtain an _LRScheduler class and partially instantiate it with the given scheduler kwargs.
- Parameters
name (Name of the scheduler in the registry.) –
kwargs (Optional kwargs of the scheduler used during instantiation.) –
- Return type
A partially instantiated _LRScheduler
- mridc.core.optim.lr_scheduler.prepare_lr_scheduler(optimizer: Optimizer, scheduler_config: Optional[Union[Dict[str, Any], DictConfig]], train_dataloader: Optional[DataLoader] = None) → Optional[Dict[str, Any]] [source]
Constructs an LR Scheduler (optionally) for a given optimizer, based on a config with the following schema.
- Parameters
optimizer (The optimizer to use for the scheduler.) –
    name: <name of optimizer>
    lr: <maximal learning rate>
    # <additional optimizer arguments>
    args:
      name: auto # special keyword, resolves to correct optimizer config for given optimizer name
      # cls: mridc.core.config.optimizers.NovogradParams # explicit instantiation by class path
      params: # optional override parameters for the optimizer config
        betas: [0.8, 0.5]
        weight_decay: 0.001
scheduler_config (The scheduler config.) –
    name: <name of scheduler>
    iters_per_batch: null # computed at runtime; mandatory to have
    max_steps: null # computed at runtime or explicitly set here; mandatory to have
    # pytorch lightning args <mandatory>
    monitor: val_loss
    reduce_on_plateau: false
    # <scheduler config override>
    args:
      name: auto # special keyword, resolves to correct scheduler config for given scheduler name
      # cls: mridc.core.config.schedulers.CosineAnnealingParams # explicit instantiation by class path
      params: # optional override parameters for the scheduler config
        warmup_steps: null
        warmup_ratio: null
        min_lr: 0.0
        last_epoch: -1
train_dataloader (Optional requirement, must be passed if "iters_per_batch" is defined instead of "max_steps". Used to compute effective "max_steps".) –
- Return type
A dictionary containing the LR Scheduler implementation, along with other parameters required by PyTorch Lightning, if the config was successfully parsed; otherwise None.
- mridc.core.optim.lr_scheduler.register_scheduler(name: str, scheduler: _LRScheduler, scheduler_params: SchedulerParams)[source]
Checks if the scheduler name exists in the registry, and if it doesn’t, adds it. This allows custom schedulers to be added and called by name during instantiation.
- Parameters
name (Name of the scheduler. Will be used as key to retrieve the scheduler.) –
scheduler (Scheduler class (inherits from _LRScheduler)) –
scheduler_params (The parameters as a dataclass of the scheduler) –
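The register-then-retrieve-by-name pattern described here can be illustrated with a small stand-alone registry. Everything in this sketch is hypothetical (the _SCHEDULERS dict, the ConstantLR toy class, and the register and get helpers); in particular, raising ValueError on a duplicate or unknown name is an assumption, and the scheduler_params dataclass is left out for brevity.

```python
from functools import partial

_SCHEDULERS = {}  # hypothetical stand-in for the module's scheduler registry


class ConstantLR:
    """Toy scheduler used only for this illustration."""

    def __init__(self, optimizer, factor=1.0):
        self.optimizer = optimizer
        self.factor = factor


def register(name, scheduler_cls):
    """Add scheduler_cls under name unless the name is already taken."""
    if name in _SCHEDULERS:
        raise ValueError(f"Scheduler '{name}' is already registered.")
    _SCHEDULERS[name] = scheduler_cls


def get(name, **kwargs):
    """Return the registered class with kwargs bound, still awaiting the optimizer."""
    if name not in _SCHEDULERS:
        raise ValueError(f"Unknown scheduler '{name}'.")
    return partial(_SCHEDULERS[name], **kwargs)


register("constant", ConstantLR)
make_scheduler = get("constant", factor=0.5)
# The optimizer is supplied last; a string placeholder keeps the sketch dependency-free.
scheduler = make_scheduler(optimizer="placeholder-optimizer")
```

Returning a partial keeps configuration parsing separate from construction, since the optimizer instance is usually only available later in the training setup.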
mridc.core.optim.novograd module
- class mridc.core.optim.novograd.Novograd(params, lr=0.001, betas=(0.95, 0.98), eps=1e-08, weight_decay=0, grad_averaging=False, amsgrad=False, luc=False, luc_trust=0.001, luc_eps=1e-08)[source]
Bases:
Optimizer
Implements Novograd algorithm. It has been proposed in “Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks” (https://arxiv.org/abs/1905.11286).
- Parameters
params (Iterable of parameters to optimize or dicts defining parameter groups.) – iterable
lr (Learning rate.) – float, (default: 1e-3)
betas (Coefficients used for computing running averages of gradient and its square.) – (Tuple[float, float], optional) (default: (0.95, 0.98))
eps (Term added to the denominator to improve numerical stability.) – (float, optional), (default: 1e-8)
weight_decay (Weight decay (L2 penalty).) – (float, optional), (default: 0)
amsgrad (Whether to use the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond".) – (boolean, optional), (default: False)
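To show what "layer-wise adaptive moments" means in practice, the sketch below applies one Novograd-style update to a single layer represented as a plain Python list. The function name and the state dictionary layout are hypothetical. It follows the update described in the cited paper under the assumption that the second moment is initialised with the first squared gradient norm; the amsgrad and luc options are not modelled.

```python
import math


def novograd_step(weights, grads, state, lr=1e-3, betas=(0.95, 0.98),
                  eps=1e-8, weight_decay=0.0, grad_averaging=False):
    """Update one layer in place and return its weights (illustrative sketch)."""
    beta1, beta2 = betas
    # One scalar second moment per layer: squared L2 norm of the gradient.
    sq_norm = sum(g * g for g in grads)
    if state.get("v") is None:
        state["v"] = sq_norm  # first step: initialise with the raw norm
    else:
        state["v"] = beta2 * state["v"] + (1.0 - beta2) * sq_norm
    denom = math.sqrt(state["v"]) + eps
    m = state.setdefault("m", [0.0] * len(weights))
    for i, (w, g) in enumerate(zip(weights, grads)):
        update = g / denom + weight_decay * w  # normalised gradient plus L2 term
        if grad_averaging:
            update *= 1.0 - beta1
        m[i] = beta1 * m[i] + update  # first moment (momentum)
        weights[i] = w - lr * m[i]
    return weights
```

Because the second moment is a single scalar per layer rather than one value per weight, the optimizer state is noticeably smaller than Adam's.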
mridc.core.optim.optimizer_with_master_params module
- class mridc.core.optim.optimizer_with_master_params.GradBucket(numel, chunk_size_mb)[source]
Bases:
object
Persistent buffer for main gradients that remains allocated between training iterations.
- class mridc.core.optim.optimizer_with_master_params.MainParamsOptimizerWrapper(optimizer, fp32_grad_accum=False, contiguous_grad_bucket=False, async_grad_allreduce=False, grad_div_ar_fusion=True, grad_allreduce_chunk_size_mb=0)[source]
Bases:
Optimizer
Float16 optimizer wrapper for half precision (fp16 and bf16) data types. This optimizer wrapper holds main parameters and gradients in fp32 to support stable convergence.
- Parameters
optimizer (base optimizer such as Adam or SGD.) –
fp32_grad_accum (to enable the use of fp32 in gradient accumulation and allreduce.) –
contiguous_grad_bucket (to enable allocating the master gradients in the contiguous memory space to reduce memory fragmentation.) –
async_grad_allreduce (enable asynchronous gradient allreduce that is executed along with the training step back prop.) –
- property async_master_grads_allreduce
Return whether to use async allreduce for master grads.
- property defaults
Promote defaults, so it can be retrieved or set via ‘optimizer_instance.defaults’.
- property fp32_grad_accumulation
Return whether to accumulate gradients in fp32.
- no_sync()[source]
A context manager to disable gradient synchronizations across data-parallel ranks.
- property param_groups
Promote param_groups, so it can be retrieved or set via “optimizer_instance.param_groups” (for example, to adjust the learning rate).
- property state
Promote state, so it can be retrieved or set via “optimizer_instance.state”.
- zero_grad(set_to_none=True)[source]
We only need to zero the model related parameters, i.e., float16_groups & fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; in the case of set_to_none==True, the space used by this field can be safely deallocated at this point.
mridc.core.optim.optimizers module
- mridc.core.optim.optimizers.get_optimizer(name: str, **kwargs: Optional[Dict[str, Any]]) → partial [source]
Convenience method to obtain an Optimizer class and partially instantiate it with optimizer kwargs.
- Parameters
name (Name of the Optimizer in the registry.) –
kwargs (Optional kwargs of the optimizer used during instantiation.) –
- Return type
A partially instantiated Optimizer.
- mridc.core.optim.optimizers.parse_optimizer_args(optimizer_name: str, optimizer_kwargs: Union[DictConfig, Dict[str, Any]]) → Union[Dict[str, Any], DictConfig] [source]
Parses a list of strings, of the format “key=value” or “key2=val1,val2,…”, into a dictionary of type {key=value, key2=[val1, val2], …}. This dictionary is then used to instantiate the chosen Optimizer.
- Parameters
optimizer_name (string name of the optimizer, used for auto resolution of params.) –
optimizer_kwargs (Either a list of strings in a specified format, or a dictionary. If a dictionary is provided, it is assumed the dictionary is the final parsed value, and simply returned. If a list of strings is provided, each item in the list is parsed into a new dictionary.) –
- Return type
A dictionary of the parsed arguments.
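The string format described above can be illustrated with a small stand-alone parser. The helper name parse_kv_strings is hypothetical, and the sketch keeps all values as strings; whether the real function converts them to numbers or resolves defaults from optimizer_name is not shown here and should not be inferred from this example.

```python
def parse_kv_strings(optimizer_kwargs):
    """Turn ["key=value", "key2=val1,val2"] into {"key": "value", "key2": [...]}."""
    if isinstance(optimizer_kwargs, dict):
        # A dictionary is assumed to be already parsed and is returned unchanged.
        return optimizer_kwargs
    parsed = {}
    for item in optimizer_kwargs:
        key, _, raw = item.partition("=")
        values = [value.strip() for value in raw.split(",")]
        # A single value stays a scalar; several values become a list.
        parsed[key.strip()] = values if len(values) > 1 else values[0]
    return parsed
```

For instance, the list ["lr=0.001", "betas=0.8,0.5"] maps "lr" to a single string and "betas" to a two-element list.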
- mridc.core.optim.optimizers.register_optimizer(name: str, optimizer: Optimizer, optimizer_params: OptimizerParams)[source]
Checks if the optimizer name exists in the registry, and if it doesn’t, adds it. This allows custom optimizers to be added and called by name during instantiation.
- Parameters
name (Name of the optimizer. Will be used as key to retrieve the optimizer.) –
optimizer (Optimizer class.) –
optimizer_params (The parameters as a dataclass of the optimizer.) –