This repository provides the official implementation of our research paper, which introduces a generalized framework for SGD-variant methods and establishes their convergence properties. Our results guarantee the convergence of SGD, heavy-ball SGD, Lion, and signSGD when training nonsmooth neural networks, such as those with ReLU activations.
The optimizers provided in this repository are designed to be user-friendly and can be invoked in the same way as the built-in SGD optimizer in PyTorch.
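For example, a `GSGD` instance can be constructed and stepped exactly like `torch.optim.SGD`. A minimal sketch (the import path `gsgd` below is a placeholder; adjust it to this repository's layout):

```python
import torch
from gsgd import GSGD  # placeholder import path; adjust to this repository's layout

model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()

# GSGD is constructed and used like torch.optim.SGD
optimizer = GSGD(model.parameters(), lr=0.01, momentum=0.9)

x, y = torch.randn(4, 10), torch.randn(4, 1)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```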
```
class GSGD(params, lr=<required parameter>, momentum=0.9, scaling=None, Dphi_map=lambda tensor: tensor, weight_decay=0, nesterov=0, *, maximize=False)
```
Parameters:
- `params` (iterable): An iterable of parameters to optimize or dictionaries defining parameter groups.
- `lr` (float): The learning rate.
- `momentum` (float, optional): The lower bound for the momentum factor (default: `0.9`).
- `scaling` (float, optional): The ratio between the stepsizes for the momentum terms and the parameters. If set to `None`, the scaling is automatically chosen as `1.0/lr` (default: `None`).
- `Dphi_map` (callable, optional): A mapping that determines the regularization technique used in the SGD-variant method. In `GSGD`, the update direction for each tensor in `params` is given by `Dphi_map(tensor)`, so choosing different `Dphi_map` functions applies different regularization techniques; see the sketch after this list. Detailed requirements for `Dphi_map` can be found in Section 4 of our research paper (default: `lambda tensor: tensor`).
  - When `Dphi_map = lambda tensor: tensor`, the `GSGD` optimizer becomes a variant of the heavy-ball SGD method.
  - When `Dphi_map = lambda tensor: torch.sign(tensor)`, the `GSGD` optimizer becomes the signSGD method.
- `weight_decay` (float, optional): Weight decay (L2 penalty) (default: `0`).
- `nesterov` (float, optional): Enables Nesterov momentum (default: `0`).
- `maximize` (bool, optional): Whether to maximize the objective with respect to the parameters, instead of minimizing it (default: `False`).
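The two `Dphi_map` settings above can be compared side by side. A minimal sketch (the import path is again a placeholder):

```python
import torch
from gsgd import GSGD  # placeholder import path; adjust to this repository's layout

model = torch.nn.Linear(10, 1)

# Identity Dphi_map (the default): a variant of the heavy-ball SGD method
heavy_ball = GSGD(model.parameters(), lr=0.01, Dphi_map=lambda tensor: tensor)

# Sign map: recovers the signSGD method
sign_sgd = GSGD(model.parameters(), lr=0.01, Dphi_map=lambda tensor: torch.sign(tensor))
```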
For the Lion method, we provide an implementation based on the code from lucidrains' repository, where detailed instructions on parameter tuning for the Lion optimizer can also be found.
```
class Lion(params, lr=1e-4, betas=(0.9, 0.99), scaling=None, weight_decay=0.0, nesterov_momentum=0)
```
Parameters:
- `params` (iterable): An iterable of parameters to optimize or dictionaries defining parameter groups.
- `lr` (float, optional): The learning rate (default: `1e-4`).
- `betas` (tuple of floats, optional): Coefficients used for computing running averages of the gradient (default: `(0.9, 0.99)`).
- `scaling` (float, optional): The ratio between the stepsizes for the momentum terms and the parameters. If set to `None`, the scaling is automatically chosen as `1.0/lr` (default: `None`).
- `weight_decay` (float, optional): Weight decay (L2 penalty) (default: `0.0`).
- `nesterov_momentum` (float, optional): Enables Nesterov momentum (default: `0`).