A biased stochastic three-term conjugate gradient algorithm with bandwidth-based step size for machine learning

Shuang Wang; Gonglin Yuan; Junyu Lu; Yiqian Wei; Shuang Wang; Gonglin Yuan; Junyu Lu; Yiqian Wei

doi:10.3934/era.2026169

Electronic Research Archive

2026, Volume 34, Issue 6: 3736-3767. doi: 10.3934/era.2026169

Previous Article Next Article

Research article

A biased stochastic three-term conjugate gradient algorithm with bandwidth-based step size for machine learning

1.
School of Mathematics & Center for Applied Mathematics of Guangxi & Post-doctoral Research Station of the First-level Discipline in Mathematics, Guangxi University, Nanning 530004, China
2.
Center for Faculty and Teaching Development, Guangxi Vocational University of Agriculture, Nanning 530007, China

Received: 28 December 2025 Revised: 27 March 2026 Accepted: 08 April 2026 Published: 06 May 2026

In the field of machine learning, the solution of large-scale data optimization problems faces numerous challenges. Traditional conjugate gradient (CG) algorithms, though possessing excellent convergence properties, incur high computational costs when dealing with large-scale problems, thereby restricting their scope of application. On the other hand, stochastic gradient descent (SGD) algorithm, while being computationally inexpensive, has its convergence rates limited by the variance of gradient estimates, making it difficult to achieve satisfactory optimization results. To address these limitations, this paper proposes a biased stochastic three-term conjugate gradient algorithm. The proposed algorithm integrates the stochastic recursive gradient algorithm (SARAH) and a bandwidth-based step size strategy. Without incurring additional computational costs, it automatically incorporates upper and lower bounds on the step size, effectively balancing the flexibility and stability of the step size. Through the theoretical analysis presented in this paper, we demonstrate that the algorithm converges to a global optimum and analyze the linear convergence rate of the non-convex ($ \lambda $-gradient dominated) objective functions. The numerical results of two machine learning models demonstrate the strong competitiveness from the biased stochastic three-term conjugate gradient algorithm with bandwidth-based step size (SCGBW) algorithm.
- machine learning,
- stochastic conjugate gradient,
- SARAH algorithm,
- bandwidth-based step size,
- three-term conjugate gradient method,
- non-convex problems
Citation: Shuang Wang, Gonglin Yuan, Junyu Lu, Yiqian Wei. A biased stochastic three-term conjugate gradient algorithm with bandwidth-based step size for machine learning[J]. Electronic Research Archive, 2026, 34(6): 3736-3767. doi: 10.3934/era.2026169

Related Papers:

Abstract

In the field of machine learning, the solution of large-scale data optimization problems faces numerous challenges. Traditional conjugate gradient (CG) algorithms, though possessing excellent convergence properties, incur high computational costs when dealing with large-scale problems, thereby restricting their scope of application. On the other hand, stochastic gradient descent (SGD) algorithm, while being computationally inexpensive, has its convergence rates limited by the variance of gradient estimates, making it difficult to achieve satisfactory optimization results. To address these limitations, this paper proposes a biased stochastic three-term conjugate gradient algorithm. The proposed algorithm integrates the stochastic recursive gradient algorithm (SARAH) and a bandwidth-based step size strategy. Without incurring additional computational costs, it automatically incorporates upper and lower bounds on the step size, effectively balancing the flexibility and stability of the step size. Through the theoretical analysis presented in this paper, we demonstrate that the algorithm converges to a global optimum and analyze the linear convergence rate of the non-convex ($ \lambda $-gradient dominated) objective functions. The numerical results of two machine learning models demonstrate the strong competitiveness from the biased stochastic three-term conjugate gradient algorithm with bandwidth-based step size (SCGBW) algorithm.

References

[1]	M. A. Cauchy, Méthode générale pour la résolution des systemes déquations simultanées, C. R. Hebd. Seances Acad. Sci., 25 (1874), 536–538.
[2]	L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT's 2010, (2010), 177–186. https://doi.org/10.1007/978-3-7908-2604-3_16
[3]	J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12 (2011), 2121–2159.
[4]	T. Tieleman, G. Hinton, Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning, in Technical Report, University of Toronto, 2017.
[5]	D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, preprint, arXiv: 1412.6980. https://doi.org/10.48550/arXiv.1412.6980
[6]	T. Dozat, Incorporating Nesterov momentum into Adam, in 4th International Conference on Learning Representations, ICLR Workshop Track, 2016.
[7]	N. Roux, M. Schmidt, F. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, Adv. Neural Inf. Process. Syst., 25 (2012).
[8]	A. Defazio, F. Bach, S. Lacoste-Julien, Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, Adv. Neural Inf. Process. Syst., 27 (2014).
[9]	R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Adv. Neural Inf. Process. Syst., 26 (2013).
[10]	L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, SARAH: A novel method for machine learning problems using stochastic recursive gradient, in International Conference on Machine Learning, PMLR, (2017), 2613–2621.
[11]	M. R. Hestenes, E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Natl. Bur. Stand., 49 (1952), 409–436. https://doi.org/10.6028/jres.049.044 doi: 10.6028/jres.049.044
[12]	R. Fletcher, C. M. Reeves, Function minimization by conjugate gradients, Comput. J., 7 (1964), 149–154. https://doi.org/10.1093/comjnl/7.2.149 doi: 10.1093/comjnl/7.2.149
[13]	Y. H. Dai, Y. Yuan, A nonlinear conjugate gradient method with a strong global convergence property, SIAM J. Optim., 10 (1999), 177–182. https://doi.org/10.1137/S1052623497318992 doi: 10.1137/S1052623497318992
[14]	B. T. Polyak, The conjugate gradient method in extremal problems, USSR Comput. Math. Math. Phys., 9 (1969), 94–112. https://doi.org/10.1016/0041-5553(69)90035-4 doi: 10.1016/0041-5553(69)90035-4
[15]	E. Polak, G. Ribiere, Note sur la convergence de méthodes de directions conjuguées, Rev. Fr. d'informatique Rech. Oper. Ser. Rouge, 3 (1969), 35–43. https://doi.org/10.1051/m2an/196903R100351 doi: 10.1051/m2an/196903R100351
[16]	N. Andrei, Another hybrid conjugate gradient algorithm for unconstrained optimization, Numerical Algorithms, 47 (2008), 143–156. https://doi.org/10.1007/s11075-007-9152-9 doi: 10.1007/s11075-007-9152-9
[17]	M. Raydan, The barzilai and borwein gradient method for the large scale unconstrained minimization problem, SIAM J. Optim., 7 (1997), 26–33. https://doi.org/10.1137/S1052623494266365 doi: 10.1137/S1052623494266365
[18]	G. Yu, L. Guan, W. Chen, Spectral conjugate gradient methods with sufficient descent property for large-scale unconstrained optimization, Optim. Methods Software, 23 (2008), 275–293. https://doi.org/10.1080/10556780701661344 doi: 10.1080/10556780701661344
[19]	E. G. Birgin, J. M. Martínez, A spectral conjugate gradient method for unconstrained optimization, Appl. Math. Optim., 43 (2001), 117–128. https://doi.org/10.1007/s00245-001-0003-0 doi: 10.1007/s00245-001-0003-0
[20]	L. Zhang, W. Zhou, D. Li, Some descent three-term conjugate gradient methods and their global convergence, Optim. Methods Software, 22 (2007), 697–711. https://doi.org/10.1080/10556780701223293 doi: 10.1080/10556780701223293
[21]	N. N. Schraudolph, T. Graepel, Combining conjugate direction methods with stochastic approximation of gradients, in Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, PMLR, (2003), 248–253.
[22]	H. Jiang, P. Wilford, A stochastic conjugate gradient method for the approximation of functions, J. Comput. Appl. Math., 236 (2012), 2529–2544. https://doi.org/10.1016/j.cam.2011.12.012 doi: 10.1016/j.cam.2011.12.012
[23]	R. H. Byrd, G. M. Chin, J. Nocedal, Y. Wu, Sample size selection in optimization methods for machine learning, Math. Program., 134 (2012), 127–155. https://doi.org/10.1007/s10107-012-0572-5 doi: 10.1007/s10107-012-0572-5
[24]	X. B. Jin, X. Y. Zhang, K. Huang, G. G. Geng, Stochastic conjugate gradient algorithm with variance reduction, IEEE Trans. Neural Networks Learn. Syst., 30 (2018), 1360–1369. https://doi.org/10.1109/TNNLS.2018.2885887 doi: 10.1109/TNNLS.2018.2885887
[25]	K. A. Alnowibet, S. Mahdi, A. M. Alshamrani, K. M. Sallam, A. W. Mohamed, A family of hybrid stochastic conjugate gradient algorithms for local and global minimization problems, Mathematics, 10 (2022), 3575. https://doi.org/10.3390/math10193595 doi: 10.3390/math10193595
[26]	G. Yuan, J. Lu, Z. Jin, J. Yu, A novel stochastic conjugate gradient algorithm based on a stochastic differential equation perspective, Eur. J. Oper. Res., 332 (2026), 505–521. https://doi.org/10.1016/j.ejor.2026.02.034 doi: 10.1016/j.ejor.2026.02.034
[27]	C. Ouyang, C. Lu, X. Zhao, R. Huang, G. Yuan, Y. Jiang, Stochastic three-term conjugate gradient method with variance technique for non-convex learning, Stat. Comput., 34 (2024), 107. https://doi.org/10.1007/s11222-024-10409-5 doi: 10.1007/s11222-024-10409-5
[28]	J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), 141–148. https://doi.org/10.1093/imanum/8.1.141 doi: 10.1093/imanum/8.1.141
[29]	X. Wang, Y. X. Yuan, On the convergence of stochastic gradient descent with bandwidth-based step size, J. Mach. Learn. Res., 24 (2023), 1–49.
[30]	Z. Wei, G. Yu, G. Yuan, Z. Lian, The superlinear convergence of a modified bfgs-type method for unconstrained optimization, Comput. Optim. Appl., 29 (2004), 315–332. https://doi.org/10.1023/B:COAP.0000044184.25410.39 doi: 10.1023/B:COAP.0000044184.25410.39
[31]	G. Yuan, Z. Wei, Convergence analysis of a modified bfgs method on convex minimizations, Comput. Optim. Appl., 47 (2010), 237–255. https://doi.org/10.1007/s10589-008-9219-0 doi: 10.1007/s10589-008-9219-0
[32]	Z. Mo, C. Ouyang, H. Pham, G. Yuan, A stochastic recursive gradient algorithm with inertial extrapolation for non-convex problems and machine learning, Int. J. Mach. Learn. Cybern., 16 (2025), 4545–4559. https://doi.org/10.1007/s13042-024-02524-6 doi: 10.1007/s13042-024-02524-6
[33]	L. M. Nguyen, K. Scheinberg, M. Takáč, Inexact sarah algorithm for stochastic optimization, Optim. Methods Software, 36 (2021), 237–258. https://doi.org/10.1080/10556788.2020.1818081 doi: 10.1080/10556788.2020.1818081
[34]	Z. Yang, On the step size selection in variance-reduced algorithm for nonconvex optimization, Expert Syst. Appl., 169 (2021), 114336. https://doi.org/10.1016/j.eswa.2020.114336 doi: 10.1016/j.eswa.2020.114336
[35]	Z. Yang, Z. Chen, C. Wang, Accelerating mini-batch SARAH by step size rules, Inf. Sci., 558 (2021), 157–173. https://doi.org/10.1016/j.ins.2020.12.075 doi: 10.1016/j.ins.2020.12.075
[36]	H. Kim, C. Wang, H. Byun, W. Hu, S. Kim, Q. Jiao, et al., Variable three-term conjugate gradient method for training artificial neural networks, Neural Networks, 159 (2023), 125–136. https://doi.org/10.1016/j.neunet.2022.12.001 doi: 10.1016/j.neunet.2022.12.001
[37]	S. J. Reddi, S. Kale, S. Kumar, On the convergence of adam and beyond, preprint, arXiv: 1904.09237. https://doi.org/10.48550/arXiv.1904.09237
[38]	J. Zhuang, T. Tang, Y. Ding, S. C. Tatikonda, N. Dvornek, X. Papademetris, et al., Adabelief optimizer: Adapting stepsizes by the belief in observed gradients, Adv. Neural Inf. Process. Syst., 33, 18795–18806.
[39]	C. Kou, H. Yang, A mini-batch stochastic conjugate gradient algorithm with variance reduction, J. Global Optim., 87 (2023), 1009–1025. https://doi.org/10.1007/s10898-022-01205-4 doi: 10.1007/s10898-022-01205-4
[40]	Z. Yang, Adaptive stochastic conjugate gradient for machine learning, Expert Syst. Appl., 206 (2022), 117719. https://doi.org/10.1016/j.eswa.2022.117719 doi: 10.1016/j.eswa.2022.117719
[41]	R. Huang, Y. Qin, K. Liu, G. Yuan, Biased stochastic conjugate gradient algorithm with adaptive step size for nonconvex problems, Expert Syst. Appl., 238 (2024), 121556. https://doi.org/10.1016/j.eswa.2023.121556 doi: 10.1016/j.eswa.2023.121556
[42]	S. Ghadimi, G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), 2341–2368. https://doi.org/10.1137/120880811 doi: 10.1137/120880811
[43]	R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, P. Richtárik, Sgd: General analysis and improved rates, in International Conference on Machine Learning, (2019), 5200–5209.
[44]	A. Rakhlin, O. Shamir, K. Sridharan, Making gradient descent optimal for strongly convex stochastic optimization, preprint, arXiv: 1109.5647. https://doi.org/10.48550/arXiv.1109.5647
[45]	X. Wang, S. Magnússon, M. Johansson, On the convergence of step decay step-size for stochastic optimization, Adv. Neural Inf. Process. Syst., 34 (2021), 14226–14238.
[46]	S. J. Reddi, A. Hefny, S. Sra, B. Poczos, A. Smola, Stochastic variance reduction for nonconvex optimization, in International Conference on Machine Learning, PMLR, (2016), 314–323.
[47]	L. M. Nguyen, P. H. Nguyen, P. Richtárik, K. Scheinberg, M. Takáč, M. van Dijk, New convergence aspects of stochastic gradient algorithms, J. Mach. Learn. Res., 20 (2019), 1–49.
[48]	B. T. Polyak, Gradient methods for the minimisation of functionals, USSR Comput. Math. Math. Phys., 3 (1963), 864–878. https://doi.org/10.1016/0041-5553(63)90382-3 doi: 10.1016/0041-5553(63)90382-3
[49]	Y. Nesterov, B. T. Polyak, Cubic regularization of newton method and its global performance, Math. Program., 108 (2006), 177–205. https://doi.org/10.1007/s10107-006-0706-8 doi: 10.1007/s10107-006-0706-8
[50]	D. Prokhorov, IJCNN 2001 Neural Network Competition, in Slide Presentation in IJCNN, 1 (2001), 38.
[51]	J. A. Blackard, D. J. Dean, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Comput. Electron. Agric., 24 (1999), 131–151. https://doi.org/10.1016/S0168-1699(99)00046-0 doi: 10.1016/S0168-1699(99)00046-0
[52]	Y. Wang, C. Ouyang, B. Hu, G. Yuan, Y. Wang, Stochastic conjugate gradient algorithm with an inertial extrapolation step for nonconvex optimization in machine learning, J. Appl. Math. Comput., 72 (2026), 1–15. https://doi.org/10.5771/2193-0147-2026-1-72 doi: 10.5771/2193-0147-2026-1-72

Reader Comments

Your name:*

Email:*
© 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)