First-order stochastic optimization algorithms have been widely applied to large-scale machine learning tasks. We propose a new stochastic optimization algorithm inspired by an accelerated stochastic variance reduction method originally developed for convex optimization. The proposed algorithm handles nonconvex and nonsmooth problems via the proximal operator and mitigates overfitting through a mini-batch strategy that exploits richer gradient information. We establish sublinear convergence under standard assumptions. Extensive experiments on synthetic classification tasks, real-world datasets, nonconvex matrix factorization problems, and time series prediction tasks show that the proposed algorithm consistently achieves faster convergence and competitive test performance compared with state-of-the-art stochastic optimization algorithms, underscoring its potential for large-scale, nonsmooth learning problems.
Citation: Weihao Cui, Chongyang He, Mingyuan Cao, Yueting Yang. A new mini-batch negative momentum proximal stochastic variance reduction method for nonconvex optimization[J]. AIMS Mathematics, 2026, 11(2): 4759-4786. doi: 10.3934/math.2026194
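The abstract describes a proximal, mini-batch, variance-reduced update. As a rough illustration only, the minimal NumPy sketch below shows a standard mini-batch proximal SVRG-style step for an l1-regularized finite-sum problem; it omits the negative-momentum term named in the title and is not the authors' algorithm. The function prox_svrg_minibatch and its step size, batch size, and epoch count are illustrative assumptions.

```python
# Minimal sketch (not the authors' method): mini-batch proximal SVRG-style update
# for min_x f(x) + lam*||x||_1, where f(x) = (1/n) * sum_i f_i(x) with smooth f_i.
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau*||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_svrg_minibatch(grad_i, x0, n, lam, eta=0.1, batch=16, epochs=5, seed=None):
    """grad_i(x, i) returns the gradient of the i-th smooth component at x."""
    rng = np.random.default_rng(seed)
    x_snap = x0.copy()   # snapshot point for variance reduction
    x = x0.copy()
    for _ in range(epochs):
        # Full gradient at the snapshot (the variance-reduction anchor).
        full_grad = np.mean([grad_i(x_snap, i) for i in range(n)], axis=0)
        for _ in range(max(1, n // batch)):
            idx = rng.integers(0, n, size=batch)
            # Variance-reduced mini-batch gradient estimate.
            v = np.mean([grad_i(x, i) - grad_i(x_snap, i) for i in idx], axis=0) + full_grad
            # Proximal step handles the nonsmooth l1 term.
            x = soft_threshold(x - eta * v, eta * lam)
        x_snap = x.copy()
    return x
```

In this sketch, the full gradient recomputed at each snapshot is what shrinks the variance of the mini-batch estimate v, while the soft-thresholding step handles the nonsmooth l1 term that a plain gradient step cannot.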