
Citation: Wei Xue, Pengcheng Wan, Qiao Li, Ping Zhong, Gaohang Yu, Tao Tao. An online conjugate gradient algorithm for large-scale data analysis in machine learning[J]. AIMS Mathematics, 2021, 6(2): 1515-1537. doi: 10.3934/math.2021092
We first establish the key notations used consistently throughout this paper. Vectors are denoted by boldface letters, e.g., {\bf{w}}, and \nabla f({\bf{w}}) denotes the gradient of a function f at the point {\bf{w}}. The superscript "T" denotes the transpose of a vector, and \|{\bf{w}}\| is shorthand for the l_2 -norm of {\bf{w}}. Given a dataset \mathcal{D} = \{({\bf{z}}_i, y_i)\}_{i = 1}^m , where {\bf{z}}_i \in \mathbb{R}^n and y_i \in \mathbb{R} , the focus of this paper is structural risk minimization [28], a fundamental subject in machine learning, which is a combination of two terms:
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n} \ f({\bf{w}}) \triangleq \frac{1}{m}\sum\limits_{i = 1}^m \underbrace{l({\bf{w}}; {\bf{z}}_i, y_i) + r({\bf{w}})}_{f_i({\bf{w}})}. \end{equation} | (1.1) |
Here, {\bf{w}} is the optimization variable (usually called the weight vector in machine learning), l: \mathbb{R}^n \rightarrow \mathbb{R} is the loss function associated with the sample pair ({\bf{z}}_i, y_i) , and the regularization function r: \mathbb{R}^n \rightarrow \mathbb{R} is a penalty on the complexity of {\bf{w}} .
Optimization algorithms for solving (1.1) mainly fall into two broad categories, namely, online learning and batch learning. The two learning mechanisms are illustrated in Figure 1. In online learning, data stream into the learning algorithm (either individually or in mini-batches), and the prediction model is updated as the new data arrive; only the new (real-time) data are needed to update the model. In batch learning, by contrast, the model is trained using the entire dataset, a process also called offline learning. In batch learning, if the model needs to be updated with new data, it must be retrained from scratch using both the previous data and the new data [2].
The prototypical online learning algorithm is the stochastic gradient descent (SGD) method [23], which is defined as
\begin{equation} {\bf{w}}_{k+1} = {\bf{w}}_k - \eta_k \nabla f_{i_k}({\bf{w}}_k), \end{equation} | (1.2) |
where \eta_k is the learning rate (also known as the stepsize in numerical optimization), and the index i_k , corresponding to the sample pair ({\bf{z}}_{i_k}, y_{i_k}) , is randomly chosen from \{1, 2, \ldots, m\} . We describe the standard SGD method in Algorithm 1. Each iteration of SGD only involves the computation of \nabla f_{i_k}({\bf{w}}_k) for a single sample, resulting in a very cheap iteration cost. Further, one can use a mini-batch method, in which a small subset of all samples is randomly selected per iteration and the update becomes
\begin{equation} {\bf{w}}_{k+1} = {\bf{w}}_k - \frac{\eta_k}{|\mathcal{S}_k|} \sum\limits_{i\in \mathcal{S}_k \subseteq \mathcal{D}} \nabla f_i({\bf{w}}_k), \end{equation} | (1.3) |
where |\mathcal{S}_k| denotes the number of elements in the set \mathcal{S}_k .

[Algorithm 1: The standard SGD method]
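To make the update concrete, here is a minimal NumPy sketch of (1.2) and its mini-batch variant (1.3); the gradient oracle grad_fi, the constant learning rate, and the default parameter values are illustrative choices of ours, not part of Algorithm 1 itself.

```python
import numpy as np

def sgd(grad_fi, w0, m, eta=0.01, batch_size=1, n_iters=1000, seed=0):
    """Plain (mini-batch) SGD sketch for min_w (1/m) sum_i f_i(w).

    grad_fi(w, idx) is assumed to return the averaged gradient
    (1/|idx|) * sum_{i in idx} grad f_i(w) of the sampled components.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(n_iters):
        # draw a random index set S_k with |S_k| = batch_size (batch_size = 1 gives (1.2))
        idx = rng.integers(0, m, size=batch_size)
        # w_{k+1} = w_k - (eta_k / |S_k|) * sum_{i in S_k} grad f_i(w_k)
        w -= eta * grad_fi(w, idx)
    return w
```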
In recent years, SGD has been widely studied in the machine learning and optimization communities; see [5,16,19,22,27,29,32] for example. The standard SGD can quickly reach the vicinity of an optimal solution during the learning process, but its convergence may slow down when more accurate solutions are required. Due to the variance of random sampling, SGD with a suitably chosen \eta_k = O(1/k) achieves only a sub-linear convergence rate of O(1/k) .
In order to improve the convergence of SGD, Johnson and Zhang [15] proposed the stochastic variance reduced gradient (SVRG) method to reduce the variance of random sampling; SVRG is described in Algorithm 2. As argued by Bottou et al. [3], during the first epochs SGD is more efficient, but once the iterates approach the solution, the benefits of the fast convergence rate of SVRG can be observed. In [21], Nguyen et al. introduced a stochastic recursive gradient algorithm (SARAH). Different from SVRG, SARAH employs a simple recursive approach to update the stochastic gradient and uses more stable gradient estimates than SVRG. However, the choice of the learning rate η depends on a Lipschitz constant of the objective function; without explicit knowledge of this constant, η is typically chosen by experimentation in both SVRG and SARAH. By incorporating the idea of variance reduction, Moritz et al. [20] proposed a stochastic L-BFGS method and proved a linear rate of convergence. In [26], Tan et al. used the well-known Barzilai-Borwein (BB) method [1] to automatically compute the learning rate for SVRG and established a linear convergence rate. Recently, based on the BB method, an improved SVRG, named the stochastic two-point stepsize gradient method, was proposed by Shao et al. [25]; it also achieves a linear convergence rate for smooth and strongly convex functions. Liu et al. [17] used the BB approach to adaptively compute the learning rate for SARAH. In [14], Jin et al. proposed a stochastic CG method with variance reduction, which converges more quickly than SVRG. In fact, the idea of stochastic CG has been studied before. Schraudolph and Graepel [24] combined CG with the stochastic setting to optimize unconstrained quadratic problems, and the resulting algorithms converge orders of magnitude faster than ordinary SGD. In [30], Xu and Dai employed CG to accelerate the convergence of the Robbins-Monro stochastic approximation algorithm. Jiang and Wilford [12] proposed a stochastic CG method for the approximation of functions, which performs the CG steps using an inner product based on stochastic sampling.
[Algorithm 2: The SVRG method]
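For comparison with the discussion above, a minimal sketch of the SVRG update of [15] might look as follows; the epoch length T, the constant learning rate eta, and the single-component gradient oracle grad_fi are placeholders of ours and do not reproduce Algorithm 2 verbatim.

```python
import numpy as np

def svrg(grad_fi, full_grad, w0, m, eta=0.01, T=50, n_epochs=25, seed=0):
    """SVRG sketch: variance-reduced inner updates around a full-gradient snapshot."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(n_epochs):
        u = full_grad(w)                     # batch gradient at the snapshot w_k
        x = w.copy()
        for t in range(T):
            i = rng.integers(0, m)
            # variance-reduced estimate: grad f_i(x_t) - grad f_i(w_k) + u_k
            g = grad_fi(x, i) - grad_fi(w, i) + u
            x -= eta * g
        w = x                                # one common option: w_{k+1} = x_T
    return w
```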
Motivated by this, we propose in this paper a new online variance reduced CG method and prove that it converges linearly for strongly convex and smooth objective functions. The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to the IFR CG method used in batch optimization. We then present the proposed method along with its theoretical analysis in Section 3. Numerical experiments are reported in Section 4. Finally, we conclude this paper in Section 5.
Due to its low memory requirements and strong convergence properties, the CG method is a very efficient batch algorithm, and it remains one of the most active research topics in optimization [6,7,11,13,18,31]. Take (1.1) as an example. Suppose that f is differentiable; then, given an initial point {\bf{w}}_0 , a CG method generates a sequence \{{\bf{w}}_k\} by using the recurrence
\begin{equation} {\bf{w}}_{k+1} = {\bf{w}}_k + \alpha_k{\bf{d}}_k, \end{equation} | (2.1) |
where \alpha_k is the stepsize, usually obtained by a line search, and the search direction {\bf{d}}_k is computed by
\begin{equation} {\bf{d}}_k = -\nabla f({\bf{w}}_k) + \beta_k{\bf{d}}_{k-1}, \quad {\bf{d}}_0 = -\nabla f({\bf{w}}_0). \end{equation} | (2.2) |
\beta_k is the CG update parameter. Different choices of \beta_k yield different CG methods. Some well-known formulas for \beta_k are \beta_k^{HS} , \beta_k^{FR} , \beta_k^{PR} , \beta_k^{CD} , \beta_k^{LS} , and \beta_k^{DY} ; see the comprehensive review [10] for details. Here, we focus on \beta_k^{FR} proposed by Fletcher and Reeves [8], which takes the form of
\begin{equation} \beta_k^{FR} = \frac{\|\nabla f({\bf{w}}_k)\|^2}{\|\nabla f({\bf{w}}_{k-1})\|^2}. \end{equation} | (2.3) |
Another important issue related to the performance of CG is the line search, which requires sufficient accuracy to ensure that {\bf{d}}_k satisfies the descent condition \nabla f({\bf{w}}_k)^T{\bf{d}}_k < 0, \forall k . A common criterion for the line search is the Wolfe line search
\begin{equation} \left\{ \begin{array}{l} f({\bf{w}}_k+\alpha_k{\bf{d}}_k)-f({\bf{w}}_k) \leq c_1\alpha_k\nabla f({\bf{w}}_k)^T{\bf{d}}_k, \\ \nabla f({\bf{w}}_k+\alpha_k{\bf{d}}_k)^T{\bf{d}}_k \geq c_2\nabla f({\bf{w}}_k)^T{\bf{d}}_k, \end{array} \right. \end{equation} | (2.4) |
where 0 < c_1 < c_2 < 1 . In the strong Wolfe line search, the second condition in (2.4) is replaced by
\begin{equation} |\nabla f({\bf{w}}_k+\alpha_k{\bf{d}}_k)^T{\bf{d}}_k| \leq -c_2\nabla f({\bf{w}}_k)^T{\bf{d}}_k. \end{equation} | (2.5) |
It has been shown that the strong Wolfe line search for \beta_k^{FR} may not yield a descent direction unless c_2 \leq 1/2 [4]. In practice, however, it is often most efficient to choose c_2 close to 1, so the constraint c_2 \leq 1/2 , needed to ensure descent, is a significant restriction on the choice of the parameter c_2 in the Wolfe line search. Note that Gilbert and Nocedal [9] showed that global convergence is retained when \beta_k^{FR} is scaled as \tau_k\beta_k^{FR} with \tau_k \in [-1, 1] . Based on this, Jiang and Jian [13] considered \tau_k = -\frac{|\nabla f({\bf{w}}_k)^T{\bf{d}}_{k-1}|}{\nabla f({\bf{w}}_{k-1})^T{\bf{d}}_{k-1}} and introduced an improved FR (IFR) formula
\begin{equation} \beta_k^{IFR} = -\frac{|\nabla f({\bf{w}}_k)^T{\bf{d}}_{k-1}|}{\nabla f({\bf{w}}_{k-1})^T{\bf{d}}_{k-1}} \times \beta_k^{FR}. \end{equation} | (2.6) |
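Read as code, (2.3) and (2.6) amount to the following short sketch; it assumes that the previous direction d_{k-1} is a descent direction, so that the denominator grad f(w_{k-1})^T d_{k-1} is negative.

```python
import numpy as np

def beta_fr(g, g_prev):
    """Fletcher-Reeves parameter (2.3): ||grad f(w_k)||^2 / ||grad f(w_{k-1})||^2."""
    return np.dot(g, g) / np.dot(g_prev, g_prev)

def beta_ifr(g, g_prev, d_prev):
    """Improved FR parameter (2.6): tau_k * beta_fr with
    tau_k = -|grad f(w_k)^T d_{k-1}| / (grad f(w_{k-1})^T d_{k-1})."""
    tau = -abs(np.dot(g, d_prev)) / np.dot(g_prev, d_prev)
    return tau * beta_fr(g, g_prev)

# The next direction is then d_k = -g + beta_ifr(g, g_prev, d_prev) * d_prev.
```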
The following lemma shows that the search directions generated by IFR CG are all sufficient descent directions, and that the IFR CG method is globally convergent under the strong Wolfe line search.
Lemma 1 (Theorem 3 [13]). Suppose that f({\boldsymbol{w}}) is differentiable, \nabla f({\boldsymbol{w}}) is Lipschitz continuous, and 0 < c_2 < \sqrt{2}/2 , then it holds that
\begin{equation} \frac{-1}{1-c^2_2} \leq \frac{\nabla f({\boldsymbol{w}}_k)^T{\boldsymbol{d}}_k}{\|\nabla f({\boldsymbol{w}}_k)\|^2} \leq \frac{2c^2_2-1}{1-c^2_2}. \end{equation} | (2.7) |
Further, IFR CG is globally convergent in the sense that \liminf_{k\rightarrow \infty}\|\nabla f({\boldsymbol{w}}_k)\| = 0 .
Based on the theories mentioned above, we now describe the core of the proposed online CG algorithm, called stochastic improved Fletcher-Reeves conjugate gradient (SIFR CG for short) method, for solving (1.1) in detail and provide the pseudo-code.
SIFR CG operates in cycles. At the beginning of each cycle, an iteration point {\bf{w}}_k is available, at which the method computes a batch gradient {\bf{u}}_k \leftarrow \nabla f({\bf{w}}_k) . Then, after initializing {\bf{x}}_0 \leftarrow {\bf{w}}_k , {\bf{g}}_0 \leftarrow {\bf{u}}_k , and {\bf{d}}_0 \leftarrow -{\bf{g}}_0 , a set of T inner iterations indexed by t is performed with the update {\bf{x}}_{t+1} \leftarrow {\bf{x}}_t + \alpha_t {\bf{d}}_t , where \alpha_t is chosen according to the strong Wolfe line search. We define the subsampled function f_{\mathcal{S}}({\bf{w}}) as
\begin{equation} f_{\mathcal{S}}({\bf{w}}) = \frac{1}{|\mathcal{S}|}\sum\limits_{i\in \mathcal{S}} f_i({\bf{w}}), \end{equation} | (3.1) |
where |\mathcal{S}| denotes the number of elements in the set \mathcal{S} . The search direction {\bf{d}}_t is computed by -{\bf{g}}_t + \beta_t{\bf{d}}_{t-1} , where {\bf{g}}_{t+1} \leftarrow \nabla f_{\mathcal{S}}({\bf{x}}_{t+1}) - \nabla f_{\mathcal{S}}({\bf{w}}_k) + {\bf{u}}_k , and \mathcal{S} is chosen randomly. The pseudo-code of SIFR CG is given in Algorithm 3.
[Algorithm 3: The SIFR CG method]
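Since the full pseudo-code of Algorithm 3 appears only as a figure, the sketch below reconstructs the cycle described in the text; the strong Wolfe line-search routine, the clipping of alpha_t to [alpha_min, alpha_max] and of beta_t to [0, epsilon] (cf. Remark 1 and the bound beta_{t+1} <= epsilon used in the proof of Theorem 1), and the sampling of S are our own illustrative stand-ins rather than a verbatim transcription.

```python
import numpy as np

def sifr_cg(full_grad, sub_grad, line_search, w0, m, T=50, n_epochs=25,
            eps=10.0, alpha_min=1e-5, alpha_max=1e5, batch_size=None, seed=0):
    """Sketch of one SIFR CG run for problem (1.1).

    full_grad(w)      -> grad f(w)        (batch gradient u_k)
    sub_grad(w, idx)  -> grad f_S(w)      (subsampled gradient over idx)
    line_search(x, d) -> step size alpha  (e.g., a strong Wolfe line search)
    """
    rng = np.random.default_rng(seed)
    batch_size = batch_size or max(1, int(np.sqrt(m)))
    w = np.array(w0, dtype=float)
    for k in range(n_epochs):
        u = full_grad(w)                       # u_k = grad f(w_k)
        x, g = w.copy(), u.copy()              # x_0 = w_k, g_0 = u_k
        d = -g                                 # d_0 = -g_0
        for t in range(T):
            alpha = np.clip(line_search(x, d), alpha_min, alpha_max)
            x_new = x + alpha * d              # x_{t+1} = x_t + alpha_t d_t
            idx = rng.integers(0, m, size=batch_size)     # random sample S
            # variance-reduced gradient: g_{t+1} = grad f_S(x_{t+1}) - grad f_S(w_k) + u_k
            g_new = sub_grad(x_new, idx) - sub_grad(w, idx) + u
            # stochastic IFR parameter, clipped so that beta_{t+1} <= eps
            beta = (-abs(np.dot(g_new, d)) / np.dot(g, d)) * (np.dot(g_new, g_new) / np.dot(g, g))
            beta = float(np.clip(beta, 0.0, eps))
            d = -g_new + beta * d              # d_{t+1} = -g_{t+1} + beta_{t+1} d_t
            x, g = x_new, g_new
        w = x                                  # w_{k+1} = x_T (Remark 3)
    return w
```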
There are a couple of things to note about this algorithm.
Remark 1. To mitigate the effects of some outliers in the training data, we restrict \alpha_t and \beta_t by adding some additional constraints.
Remark 2. For convenience, we use the condition |{\boldsymbol{g}}_{t+1}^T{\boldsymbol{d}}_t| \leq -c_2 {\boldsymbol{g}}_t^T{\boldsymbol{d}}_t in the theoretical analysis.
Remark 3. The following experiments are carried out using {\boldsymbol{w}}_{k+1} \leftarrow {\boldsymbol{x}}_T.
In this section we analyze the convergence property of the proposed method SIFR CG. Our analysis makes use of the following assumptions.
Assumption 1. \forall i , f_i: \mathbb{R}^n \rightarrow \mathbb{R} is twice continuously differentiable.
Assumption 2. \forall {\boldsymbol{w}} \in \mathbb{R}^n and \mathcal{S} \subseteq \mathcal{D} , there exist two positive constants \phi and \varphi such that \phi \boldsymbol{I} \leq \nabla^2 f_{\mathcal{S}}({\boldsymbol{w}}) \leq \varphi \boldsymbol{I} , where \boldsymbol{I} denotes the identity matrix, and \nabla^2 f_{\mathcal{S}}({\boldsymbol{w}}) is the Hessian matrix of f_{\mathcal{S}} at the point {\boldsymbol{w}} .
For Algorithm 3, we immediately have the following result.
Theorem 1. Suppose that Assumptions 1 and 2 hold, and let \xi > 0 be an upper bound on - {\boldsymbol{g}}_t^T{\boldsymbol{d}}_t / |{\boldsymbol{g}}_{t+1}^T{\boldsymbol{d}}_t| , then, \forall t , we have
\begin{equation} \|{\boldsymbol{d}}_t\|^2 \leq \rho(t)\|{\boldsymbol{g}}_{0}\|^2, \end{equation} | (3.3) |
where \rho(t) = (\xi + \frac{4}{3}) \frac{(\epsilon\xi)^t-(\epsilon^2)^t}{\xi-\epsilon} + (\epsilon^2)^t .
Proof. It follows from lines 14–16 of Algorithm 3 that \beta_{t+1}\leq\epsilon , therefore,
\begin{equation} \|{\bf{g}}_t\|^2/\|{\bf{g}}_{t-1}\|^2 \leq - \epsilon{\bf{g}}_t^T{\bf{d}}_t/|{\bf{g}}_{t+1}^T{\bf{d}}_t| \leq \epsilon \xi. \end{equation} | (3.4) |
From the definition of {\bf{d}}_t , we have
\begin{equation} \begin{split} \|{\bf{d}}_t\|^2 & = \|-{\bf{g}}_t + \beta_t{\bf{d}}_{t-1}\|^2 = \|{\bf{g}}_t\|^2 -2\beta_t {\bf{g}}_t^T{\bf{d}}_{t-1} + \beta_t^2\|{\bf{d}}_{t-1}\|^2 \\ & \leq \epsilon(\xi + \frac{2c_2}{1-c_2^2})\|{\bf{g}}_{t-1}\|^2 + \epsilon^2\|{\bf{d}}_{t-1}\|^2. \end{split} \end{equation} | (3.5) |
The last inequality follows from Lemma 1 and Remark 2. Using 0 < c_2 < 1/2 , it follows that
\begin{equation} \begin{split} \|{\bf{d}}_t\|^2 & \leq \epsilon(\xi + \frac{4}{3})\|{\bf{g}}_{t-1}\|^2 + \epsilon^2\|{\bf{d}}_{t-1}\|^2 \\ & \leq \epsilon(\xi + \frac{4}{3})[\|{\bf{g}}_{t-1}\|^2 + \epsilon^2\|{\bf{g}}_{t-2}\|^2 + \ldots + (\epsilon^2)^{t-1}\|{\bf{g}}_0\|^2] + (\epsilon^2)^t\|{\bf{g}}_{0}\|^2. \end{split} \end{equation} | (3.6) |
The second inequality follows from \|{\bf{d}}_0\| = \|{\bf{g}}_0\| (since {\bf{d}}_0 = -{\bf{g}}_0 ). Observing (3.4), we immediately get \|{\bf{g}}_t\|^2 \leq (\epsilon\xi)^t \|{\bf{g}}_0\|^2 , and (3.6) can be rewritten as
\begin{equation} \begin{split} \|{\bf{d}}_t\|^2 & \leq \epsilon(\xi + \frac{4}{3})[(\epsilon\xi)^{t-1} \|{\bf{g}}_0\|^2 + \epsilon^2(\epsilon\xi)^{t-2} \|{\bf{g}}_0\|^2 + \ldots +(\epsilon^2)^{t-1}\|{\bf{g}}_0\|^2] + (\epsilon^2)^t\|{\bf{g}}_{0}\|^2 \\ & = \Big[(\xi + \frac{4}{3}) \frac{(\epsilon\xi)^t-(\epsilon^2)^t}{\xi-\epsilon} + (\epsilon^2)^t \Big] \|{\bf{g}}_{0}\|^2, \end{split} \end{equation} | (3.7) |
which completes the proof.
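As a quick numerical sanity check on the closed form of rho(t), the snippet below unrolls the recursion in (3.5)-(3.6) and compares it with (3.3); epsilon and xi are illustrative values chosen by us (any pair with epsilon != xi works).

```python
# Unroll the bound ||d_t||^2 <= eps*(xi + 4/3)*||g_{t-1}||^2 + eps^2*||d_{t-1}||^2
# (with ||g_{t-1}||^2 <= (eps*xi)^{t-1} * ||g_0||^2) and compare with rho(t) from (3.3).
eps, xi = 0.1, 0.2        # illustrative constants only

def rho_closed(t):
    return (xi + 4/3) * ((eps*xi)**t - (eps**2)**t) / (xi - eps) + (eps**2)**t

rho_rec = 1.0             # rho(0) = 1, since ||d_0|| = ||g_0||
for t in range(1, 10):
    rho_rec = eps * (xi + 4/3) * (eps*xi)**(t - 1) + eps**2 * rho_rec
    assert abs(rho_rec - rho_closed(t)) < 1e-12   # the two expressions coincide
```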
The following two lemmas show a lower bound on \|\nabla f({\bf{x}})\|^2 and an upper bound on the variance reduced gradient estimates {\bf{g}}_t , respectively.
Lemma 2 (Lemma 5 [20], Lemma 1 [14]). Let {\boldsymbol{w}}_* be the unique minimizer of f , and suppose that f is strongly convex with parameter \sigma , then, \forall {\boldsymbol{x}} , we have \|\nabla f({\boldsymbol{x}})\|^2 \geq 2\sigma[f({\boldsymbol{x}}) - f({\boldsymbol{w}}_*)] .
Lemma 3 (Lemma 6 [20], Lemma 2 [14]). Let {\boldsymbol{w}}_* be the unique minimizer of f , {\boldsymbol{u}}_k = \nabla f({\boldsymbol{w}}_k) , and {\boldsymbol{g}}_t = \nabla f_{\mathcal{S}}({\boldsymbol{x}}_t) - \nabla f_{\mathcal{S}}({\boldsymbol{w}}_k) + {\boldsymbol{u}}_k , then by taking an expectation w.r.t. \mathcal{S} , we have \mathbb{E}[\|{\boldsymbol{g}}_t\|^2] \leq 4 \varphi [f({\boldsymbol{x}}_t) - f({\boldsymbol{w}}_*) + f({\boldsymbol{w}}_k) - f({\boldsymbol{w}}_*)] .
Based on the above theoretical basis, we now state our main result in the following theorem.
Theorem 2. Suppose that Assumptions 1 and 2 hold, and let {\boldsymbol{w}}_* be the unique minimizer of f , then, \forall k , we have
\begin{equation} \mathbb{E}[f({\boldsymbol{w}}_k) - f({\boldsymbol{w}}_*)] \leq \theta^k \mathbb{E}[f({\boldsymbol{w}}_0) - f({\boldsymbol{w}}_*)]. \end{equation} | (3.8) |
Here, the convergence rate \theta is given by
\begin{equation} \theta = \frac{1+\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \Lambda} {2\sigma\alpha_{min} T - \frac{8\varphi\epsilon\alpha_{max}T}{3}} < 1 \end{equation} | (3.9) |
with \Lambda = \frac{\xi + 4/3}{\xi - \epsilon} \times \frac{1 - (\xi\epsilon)^T}{1 - \xi\epsilon} - \frac{\epsilon + 4/3}{\xi-\epsilon} \times \frac{1-(\epsilon^2)^T}{1-\epsilon^2} , assuming that we choose 8\varphi\epsilon\alpha_{max} < 3\sigma\alpha_{min} and that we choose T large enough to satisfy
\begin{equation} T \gt \frac{3+12\varphi^2\alpha_{max}^2\Lambda}{6\sigma\alpha_{min} - 16\varphi\epsilon\alpha_{max}}. \end{equation} | (3.10) |
Proof. Using the Lipschitz continuity of \nabla f , which follows from Assumption 2, we have
\begin{equation} \begin{split} f({\bf{x}}_{t+1}) & \leq f({\bf{x}}_t) + \nabla f({\bf{x}}_t)^T({\bf{x}}_{t+1}-{\bf{x}}_t) +\frac{\varphi}{2}\|{\bf{x}}_{t+1}-{\bf{x}}_t\|^2 \\ & = f({\bf{x}}_t) + \alpha_t\nabla f({\bf{x}}_t)^T{\bf{d}}_t + \frac{\varphi \alpha_t^2}{2}\|{\bf{d}}_t\|^2. \end{split} \end{equation} | (3.11) |
Taking expectations of (3.11) and using Lemma 2 gives
\begin{equation} \begin{split} \mathbb{E}[f({\bf{x}}_{t+1})] & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \mathbb{E}[\nabla f({\bf{x}}_t)^T{\bf{d}}_t] +\frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \Big(-2\sigma\mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \beta_t \mathbb{E}[{\bf{g}}_t^T{\bf{d}}_{t-1}]\Big) + \frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \Big(-2\sigma\mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \beta_t\frac{c_2}{1-c_2^2}\mathbb{E}[\|{\bf{g}}_{t-1}\|^2]\Big) + \frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \Big(-2\sigma\mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \frac{2\beta_t}{3}\mathbb{E}[\|{\bf{g}}_{t-1}\|^2]\Big) + \frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2]. \end{split} \end{equation} | (3.12) |
Using Theorem 1 and Lemma 3, (3.12) becomes
\begin{equation} \begin{split} & \mathbb{E}[f({\bf{x}}_{t+1})] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] - 2\sigma\alpha_t \mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \frac{8\varphi\alpha_t\beta_t}{3} \Big(\mathbb{E}[f({\bf{x}}_{t-1}) - f({\bf{w}}_*)] + \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\Big)\\ & \ \ \ \ + \frac{\varphi \alpha_t^2}{2} \times 4\varphi\rho(t) \Big( \mathbb{E}[f({\bf{x}}_0) - f({\bf{w}}_*)] + \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)] \Big) \\ & \leq \mathbb{E}[f({\bf{x}}_t)] - 2\sigma\alpha_{min} \mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] \\ & \ \ \ \ +\frac{8\varphi\epsilon\alpha_{max}}{3}\Big(\mathbb{E}[f({\bf{x}}_{t-1}) - f({\bf{w}}_*)] +\mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\Big) + 4\varphi^2 \alpha_{max}^2\rho(t)\mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.13) |
By summing (3.13) over t = 0, 1, \ldots, T-1 , we obtain
\begin{equation} \begin{split} \mathbb{E}[f({\bf{x}}_T)] & \leq \mathbb{E}[f({\bf{x}}_0)] - 2\sigma\alpha_{min} \sum\limits_{t = 0}^{T-1} \mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] \\ & \ \ \ \ + \frac{8\varphi\epsilon\alpha_{max}}{3} \sum\limits_{t = 0}^{T-2} \Big(\mathbb{E}[f({\bf{x}}_t)-f({\bf{w}}_*)] + \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\Big) \\ & \ \ \ \ + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.14) |
Rearranging (3.14) gives
\begin{equation} \begin{split} & 0 \leq \mathbb{E}[f({\bf{w}}_k)] -\mathbb{E}[f({\bf{x}}_T)] - 2\sigma\alpha_{min} T \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)] \\ & \ \ \ \ + \frac{8\varphi\epsilon\alpha_{max}T}{3} \mathbb{E}[f({\bf{w}}_{k+1}) -f({\bf{w}}_*)] + \frac{8\varphi\epsilon\alpha_{max}T}{3} \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)] \\ & \ \ \ \ + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.15) |
Further, we have
\begin{equation} \begin{split} 0 & \leq \mathbb{E}[f({\bf{w}}_k)] -\mathbb{E}[f({\bf{w}}_*)] + \Big(\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \Big) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\\ & \ \ \ \ + \Big(\frac{8\varphi\epsilon\alpha_{max}T}{3} - 2\sigma\alpha_{min} T \Big) \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)] \\ & = \Big(1+\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \Big) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)] \\ & \ \ \ \ + \Big(\frac{8\varphi\epsilon\alpha_{max}T}{3} - 2\sigma\alpha_{min} T \Big) \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.16) |
The second inequality follows from f({\bf{w}}_*)\leq f({\bf{x}}_T) . Note that \sum_{t = 0}^{T-1} \rho(t) = \frac{\xi + 4/3}{\xi - \epsilon} \times \frac{1 - (\xi\epsilon)^T}{1 - \xi\epsilon} - \frac{\epsilon + 4/3}{\xi-\epsilon} \times \frac{1-(\epsilon^2)^T}{1-\epsilon^2} and denote this result by \Lambda , then we have
\begin{equation} \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)] \leq \theta \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)], \end{equation} | (3.17) |
on condition that \alpha_{max}/\alpha_{min} < 0.75\sigma/(\varphi\epsilon) , where \theta = \frac{1+\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \Lambda} {2\sigma\alpha_{min} T - \frac{8\varphi\epsilon\alpha_{max}T}{3}} . Since we chose T to satisfy (3.10), it follows that the rate \theta is less than one. Thus, we get the result as desired.
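To illustrate the statement of Theorem 2, the short script below evaluates Lambda, the threshold (3.10), and the resulting rate theta for one set of constants chosen by us so that 8*phi*eps*alpha_max < 3*sigma*alpha_min holds; these values are purely illustrative and are unrelated to the experimental settings in the next section.

```python
# Illustrative evaluation of the convergence rate theta in (3.9).
sigma, phi = 1.0, 1.0        # strong convexity and smoothness constants in Assumption 2
eps, xi = 0.1, 0.2           # bound on beta_t and on -g_t^T d_t / |g_{t+1}^T d_t|
a_min = a_max = 0.01         # step-size bounds: 8*phi*eps*a_max = 0.008 < 0.03 = 3*sigma*a_min

def Lam(T):
    """Lambda = sum_{t=0}^{T-1} rho(t), the closed form given after (3.9)."""
    return ((xi + 4/3) / (xi - eps) * (1 - (xi*eps)**T) / (1 - xi*eps)
            - (eps + 4/3) / (xi - eps) * (1 - (eps**2)**T) / (1 - eps**2))

T = 100
T_min = (3 + 12*phi**2*a_max**2*Lam(T)) / (6*sigma*a_min - 16*phi*eps*a_max)   # bound (3.10)
theta = ((1 + 8*phi*eps*a_max*T/3 + 4*phi**2*a_max**2*Lam(T))
         / (2*sigma*a_min*T - 8*phi*eps*a_max*T/3))
print(T > T_min, theta)      # expected: True and theta < 1 (about 0.73 here)
```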
To show the efficiency and effectiveness of the proposed method, we compare it with SVRG [15] (a stochastic learning algorithm) and IFR CG [13] (a batch learning algorithm) on five classification datasets, as shown in Table 1. These datasets are obtained from the LIBSVM database*. All experiments are implemented using the Armadillo C++ library on a PC (Core i9-9900X CPU @ 3.50GHz, 64 GB of memory) with the Ubuntu 18.04.1 operating system.
datasets | m | n |
a9a | 32561 | 123 |
w8a | 49749 | 300 |
ijcnn1 | 49990 | 22 |
susy | 5000000 | 18 |
higgs | 11000000 | 28 |
*Note: m and n represent the number of samples and features of the data set respectively. |
*http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
We evaluate these methods on three popular machine learning problems, including
\bullet l_2 -logistic regression
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n}\frac{1}{m}\sum\limits_{i = 1}^m \log(1+e^{-y_i{\bf{w}}^T{\bf{z}}_i}) + \lambda\|{\bf{w}}\|^2, \end{equation} | (4.1) |
\bullet l_1 -SVM
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n}\frac{1}{m}\sum\limits_{i = 1}^m\max(0, 1-y_i{\bf{w}}^T{\bf{z}}_i) + \lambda\|{\bf{w}}\|^2, \end{equation} | (4.2) |
\bullet l_2 -SVM
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n}\frac{1}{m}\sum\limits_{i = 1}^m\max(0, 1-y_i{\bf{w}}^T{\bf{z}}_i)^2 + \lambda\|{\bf{w}}\|^2, \end{equation} | (4.3) |
where \lambda > 0 is the regularization parameter.
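For completeness, a sketch of the three regularized objectives (4.1)-(4.3) and their (sub)gradients is given below; the dense NumPy representation of the data matrix is an implementation choice of ours (the LIBSVM datasets are sparse in practice), and labels are assumed to take values in {-1, +1}.

```python
import numpy as np

def logistic(w, Z, y, lam):
    """l2-logistic regression (4.1): objective value and gradient."""
    margins = y * (Z @ w)
    obj = np.mean(np.logaddexp(0.0, -margins)) + lam * w @ w
    sig = 0.5 * (1.0 - np.tanh(0.5 * margins))   # numerically stable 1/(1 + exp(margins))
    grad = -(Z.T @ (sig * y)) / len(y) + 2 * lam * w
    return obj, grad

def l1_svm(w, Z, y, lam):
    """l1-SVM (hinge loss, 4.2): objective value and one subgradient."""
    margins = y * (Z @ w)
    active = margins < 1
    obj = np.mean(np.maximum(0.0, 1.0 - margins)) + lam * w @ w
    subgrad = -(Z[active].T @ y[active]) / len(y) + 2 * lam * w
    return obj, subgrad

def l2_svm(w, Z, y, lam):
    """l2-SVM (squared hinge loss, 4.3): objective value and gradient."""
    margins = y * (Z @ w)
    slack = np.maximum(0.0, 1.0 - margins)
    obj = np.mean(slack**2) + lam * w @ w
    grad = -2.0 * (Z.T @ (slack * y)) / len(y) + 2 * lam * w
    return obj, grad
```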
We choose \lambda from \{0.01, 0.001, 0.0001\} and set |\mathcal{S}| = \sqrt{m} throughout the experiments. The number of outer-loop iterations is set to 25, and the inner-loop length T is set to 30, 40, 50, and 60. For SVRG, the learning rate \eta is chosen from \{0.1, 0.05, 0.01, 0.005, 0.001, 0.0001\} . For IFR CG and SIFR CG, we set c_1 = 10^{-4} and c_2 = 0.1 . Additionally, in SIFR CG, we set \alpha_{max} = 1/\alpha_{min} = 10^5 and \epsilon = 10 . We randomly divide each dataset into three parts, 1/3 for testing, 1/5 for validation, and the rest for training. After finding the parameters that maximize the AUC on the validation set, we apply the corresponding model to the test set.
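A minimal sketch of this protocol, assuming a hypothetical training routine train(Z, y, params) that returns a weight vector and using scikit-learn's roc_auc_score for the AUC, might look as follows; the parameter grid shown is only an example.

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

def split_indices(m, seed=0):
    """Random split: 1/3 for testing, 1/5 for validation, the rest for training."""
    idx = np.random.default_rng(seed).permutation(m)
    n_test, n_val = m // 3, m // 5
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

def select_and_test(Z, y, train, grid):
    """Pick the setting with the best validation AUC, then report the test AUC."""
    tr, va, te = split_indices(len(y))
    scores = {p: roc_auc_score(y[va], Z[va] @ train(Z[tr], y[tr], p)) for p in grid}
    best = max(scores, key=scores.get)
    return best, roc_auc_score(y[te], Z[te] @ train(Z[tr], y[tr], best))

# example grid of candidate (lambda, T) pairs (illustrative values only)
grid = [(lam, T) for lam, T in product([1e-2, 1e-3, 1e-4], [30, 40, 50, 60])]
```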
We first adopt the cross-validation technique to obtain the optimal parameters for each learning model; the cross-validation results are summarized in Tables 2 and 3. We can see that, for different learning models and different datasets, the optimal parameters are data-dependent and should be tuned accordingly.
l_2 -logistic regression | T=30 | T=40 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
w8a | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.008 |
ijcnn1 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 |
susy | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 |
l_1 -SVM | T=30 | T=40 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.008 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.008 |
w8a | \lambda=.1 , \eta=.0001 | \lambda=.05 | \lambda=.1 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.1 |
ijcnn1 | \lambda=.1 , \eta=.001 | \lambda=.05 | \lambda=.008 | \lambda=.1 , \eta=.0001 | \lambda=.05 | \lambda=.05 |
susy | \lambda=.008 , \eta=.05 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
higgs | \lambda=.008 , \eta=.1 | \lambda=.1 | \lambda=.005 | \lambda=.05 , \eta=.05 | \lambda=.1 | \lambda=.005 |
l_2 -SVM | T=30 | T=40 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.01 , \eta=.01 | \lambda=.1 | \lambda=.05 | \lambda=.005 , \eta=.01 | \lambda=.1 | \lambda=.05 |
w8a | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.1 | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.05 |
ijcnn1 | \lambda=.1 , \eta=.05 | \lambda=.1 | \lambda=.05 | \lambda=.1 , \eta=.05 | \lambda=.1 | \lambda=.05 |
susy | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
l_2 -logistic regression | T=50 | T=60 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
w8a | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.005 |
ijcnn1 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 |
susy | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.01 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.008 , \eta=.1 | \lambda=.05 | \lambda=.005 |
l_1 -SVM | T=50 | T=60 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.1 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.01 |
w8a | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.01 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.05 |
ijcnn1 | \lambda=.1 , \eta=.0001 | \lambda=.05 | \lambda=.005 | \lambda=.1 , \eta=.001 | \lambda=.05 | \lambda=.008 |
susy | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
higgs | \lambda=.05 , \eta=.05 | \lambda=.1 | \lambda=.005 | \lambda=.008 , \eta=.05 | \lambda=.1 | \lambda=.005 |
l_2 -SVM | T=50 | T=60 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.005 | \lambda=.1 | \lambda=.1 | \lambda=.1 , \eta=.005 | \lambda=.1 | \lambda=.05 |
w8a | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.1 | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.05 |
ijcnn1 | \lambda=.1 , \eta=.05 | \lambda=.1 | \lambda=.01 | \lambda=.05 , \eta=.05 | \lambda=.1 | \lambda=.1 |
susy | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
After obtaining the optimal learning rates for SVRG and the optimal regularization parameters for all methods, we report in Figures 2-5 how the AUC on each test dataset changes with increasing computational cost for the different methods. To compare the methods fairly, all x -axes represent the computational cost, measured by the number of gradient computations divided by m , and all y -axes represent the AUC value. In general, the closer the AUC is to 1, the higher the accuracy of the method. Figures 2-5 show the results for T = 30 , T = 40 , T = 50 , and T = 60 , respectively. Firstly, we can see from the four figures that the proposed method SIFR CG significantly outperforms SVRG and IFR CG on all test datasets; that is, the classification accuracy of our method is the highest. Secondly, SVRG and IFR CG each have their own advantages and disadvantages on different datasets and are therefore hard to differentiate. Thirdly, unlike SVRG and IFR CG, we find that increasing the number of inner iterations has a less obvious effect on the performance of SIFR CG; in other words, SIFR CG takes fewer iterations to reach the optimal solution.
Additionally, Figures 6-17 show the convergence behaviors and the corresponding average computing time (the number of outer-loop iterations is set to 25) of the three methods on each training dataset. In the convergence behavior figures (see Figures 6, 7, 9, 10, 12, 13, 15 and 16), all x -axes again represent the computational cost, measured by the number of gradient computations divided by m , and all y -axes represent the logarithm (base 10) of the objective function value. We see that the convergence of SVRG is not as fast as that of the other two methods because SVRG is sensitive to the learning rate. For all test environments, however, the proposed method SIFR CG has the fastest convergence speed and reaches the smallest objective function value, which is consistent with the AUC results shown in Figures 2-5. In the average computing time figures (see Figures 8, 11, 14, and 17), all x -axes are in logarithmic time (base 10). We can see that SVRG has the least computing time on these datasets. On the contrary, IFR CG employs the entire dataset to conduct the line search at each iteration, resulting in a heavy computational burden. The computing time of SIFR CG lies between those of SVRG and IFR CG; it is close to that of IFR CG when the dataset is small and close to that of SVRG when the dataset is large.
Overall, the preliminary experiments show that the proposed method SIFR CG not only has better learning performance but also converges quickly, and its advantage is more obvious on large-scale datasets.
In this paper, we proposed a stochastic Fletcher-Reeves conjugate gradient algorithm and proved a linear rate of convergence in the strongly convex case. Comparative experiments on several benchmark datasets demonstrate that the proposed algorithm performs well on large-scale smooth and nonsmooth machine learning problems.
There are several interesting issues to study in future work. For example, we can replace the Wolfe line search by a nonmonotone line search to find the optimal solution faster or propose more effective conjugate gradient update parameters.
The authors thank the editor and the reviewers for their constructive comments and suggestions that greatly improved the quality and presentation of this paper. This work was partly supported by the National Natural Science Foundation of China (Grant Nos. 11661007, 61671456 and 61806004), the China Postdoctoral Science Foundation (Grant No. 2020T130767), the Natural Science Foundation of the Anhui Higher Education Institutions of China (Grant No. KJ2019A0082), and the Natural Science Foundation of Zhejiang Province, China (Grant No. LD19A010002).
The authors declare that there is no conflict of interests regarding the publication of this article.
[1] | J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), 141-148. doi: 10.1093/imanum/8.1.141 |
[2] | E. Bisong, Batch vs. online learning, Building Machine Learning and Deep Learning Models on Google Cloud Platform, 2019. |
[3] | L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), 223-311. doi: 10.1137/16M1080173 |
[4] | Y. H. Dai, Y. Yuan, Nonlinear conjugate gradient methods, Shanghai: Shanghai Scientific Technical Publishers, 2000. |
[5] | D. Davis, B. Grimmer, Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems, SIAM J. Optim., 29 (2019), 1908-1930. doi: 10.1137/17M1151031 |
[6] | R. Dehghani, N. Bidabadi, H. Fahs, M. M. Hosseini, A conjugate gradient method based on a modified secant relation for unconstrained optimization, Numer. Funct. Anal. Optim., 41 (2020), 621-634. doi: 10.1080/01630563.2019.1669641 |
[7] | P. Faramarzi, K. Amini, A modified spectral conjugate gradient method with global convergence, J. Optim. Theory Appl., 182 (2019), 667-690. doi: 10.1007/s10957-019-01527-6 |
[8] | R. Fletcher, C. M. Reeves, Function minimization by conjugate gradients, Comput. J., 7 (1964), 149-154. doi: 10.1093/comjnl/7.2.149 |
[9] | J. C. Gilbert, J. Nocedal, Global convergence properties of conjugate gradient methods for optimization, SIAM J. Optim., 2 (1992), 21-42. doi: 10.1137/0802003 |
[10] | W. W. Hager, H. Zhang, Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent, ACM Trans. Math. Software, 32 (2006), 113-137. doi: 10.1145/1132973.1132979 |
[11] | A. S. Halilu, M. Y. Waziri, Y. B. Musa, Inexact double step length method for solving systems of nonlinear equations, Stat. Optim. Inf. Comput., 8 (2020), 165-174. doi: 10.19139/soic-2310-5070-532 |
[12] | H. Jiang, P. Wilford, A stochastic conjugate gradient method for the approximation of functions, J. Comput. Appl. Math., 236 (2012), 2529-2544. doi: 10.1016/j.cam.2011.12.012 |
[13] | X. Jiang, J. Jian, Improved Fletcher-Reeves and Dai-Yuan conjugate gradient methods with the strong Wolfe line search, J. Comput. Appl. Math., 348 (2019), 525-534. doi: 10.1016/j.cam.2018.09.012 |
[14] | X. B. Jin, X. Y. Zhang, K. Huang, G. G. Geng, Stochastic conjugate gradient algorithm with variance reduction, IEEE Trans. Neural Networks Learn. Syst., 30 (2018), 1360-1369. |
[15] | R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems, 2013. |
[16] | X. L. Li, Preconditioned stochastic gradient descent, IEEE Trans. Neural Networks Learn. Syst., 29 (2018), 1454-1466. doi: 10.1109/TNNLS.2017.2672978 |
[17] | Y. Liu, X. Wang, T. Guo, A linearly convergent stochastic recursive gradient method for convex optimization, Optim. Lett., 2020. doi: 10.1007/s11590-020-01550-x |
[18] | M. Lotfi, S. M. Hosseini, An efficient Dai-Liao type conjugate gradient method by reformulating the CG parameter in the search direction equation, J. Comput. Appl. Math., 371 (2020), 112708. doi: 10.1016/j.cam.2019.112708 |
[19] | S. Mandt, M. D. Hoffman, D. M. Blei, Stochastic gradient descent as approximate Bayesian inference, J. Mach. Learn. Res., 18 (2017), 4873-4907. |
[20] | P. Moritz, R. Nishihara, M. I. Jordan, A linearly-convergent stochastic L-BFGS algorithm, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016. |
[21] | L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, SARAH: A novel method for machine learning problems using stochastic recursive gradient, Proceedings of the 34th International Conference on Machine Learning, 2017. |
[22] | A. Nitanda, Accelerated stochastic gradient descent for minimizing finite sums, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016. |
[23] | H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), 400-407. doi: 10.1214/aoms/1177729586 |
[24] | N. N. Schraudolph, T. Graepel, Combining conjugate direction methods with stochastic approximation of gradients, Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003. |
[25] | G. Shao, W. Xue, G. Yu, X. Zheng, Improved SVRG for finite sum structure optimization with application to binary classification, J. Ind. Manage. Optim., 16 (2020), 2253-2266. |
[26] | C. Tan, S. Ma, Y. H. Dai, Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, Advances in Neural Information Processing Systems, 2016. |
[27] | P. Toulis, E. Airoldi, J. Rennie, Statistical analysis of stochastic gradient methods for generalized linear models, Proceedings of the 31st International Conference on Machine Learning, 2014. |
[28] | V. Vapnik, The nature of statistical learning theory, New York: Springer, 1995. |
[29] | L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24 (2014), 2057-2075. doi: 10.1137/140961791 |
[30] | Z. Xu, Y. H. Dai, A stochastic approximation frame algorithm with adaptive directions, Numer. Math. Theory Methods Appl., 1 (2008), 460-474. |
[31] | W. Xue, J. Ren, X. Zheng, Z. Liu, Y. Ling, A new DY conjugate gradient method and applications to image denoising, IEICE Trans. Inf. Syst., 101 (2018), 2984-2990. |
[32] | Q. Zheng, X. Tian, N. Jiang, M. Yang, Layer-wise learning based stochastic gradient descent method for the optimization of deep convolutional neural network, J. Intell. Fuzzy Syst., 37 (2019), 5641-5654. doi: 10.3233/JIFS-190861 |