
Citation: Wei Xue, Pengcheng Wan, Qiao Li, Ping Zhong, Gaohang Yu, Tao Tao. An online conjugate gradient algorithm for large-scale data analysis in machine learning[J]. AIMS Mathematics, 2021, 6(2): 1515-1537. doi: 10.3934/math.2021092
We first establish the key notations used consistently throughout this paper. Vectors are denoted by boldface letters, e.g., {\bf{w}}, and \nabla f({\bf{w}}) denotes the gradient of a function f at the point {\bf{w}}. The superscript "T" denotes the transpose of a vector, and \|{\bf{w}}\| is shorthand for the l_2 -norm of {\bf{w}}. Given a dataset \mathcal{D} = \{({\bf{z}}_i, y_i)\}_{i = 1}^m , where {\bf{z}}_i \in \mathbb{R}^n and y_i \in \mathbb{R} , the focus of this paper is structural risk minimization [28], a fundamental subject in machine learning, which is a combination of two terms:
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n} \ f({\bf{w}}) \triangleq \frac{1}{m}\sum\limits_{i = 1}^m \underbrace{l({\bf{w}}; {\bf{z}}_i, y_i) + r({\bf{w}})}_{f_i({\bf{w}})}. \end{equation} | (1.1) |
Here, {\bf{w}} is the optimization variable (usually called the weight vector in machine learning), l: \mathbb{R}^n \rightarrow \mathbb{R} is the loss function associated with the sample pair ({\bf{z}}_i, y_i) , and the regularization function r: \mathbb{R}^n \rightarrow \mathbb{R} is a penalty on the complexity of {\bf{w}} .
Optimization algorithms for solving (1.1) mainly fall into two broad categories, namely, online learning and batch learning. The two learning mechanisms are illustrated in Figure 1. In online learning, data stream into the learning algorithm (either individually or in mini-batches), and the prediction model is updated as the new data arrive; only the new (real-time) data are needed to update the model. In batch learning, by contrast, the model is trained using the entire dataset, a process also called offline learning. In batch learning, if the model needs to be updated with new data, it must be retrained from scratch using both the previous data and the new data [2].
The prototypical online learning algorithm is the stochastic gradient descent (SGD) method [23], which is defined as
\begin{equation} {\bf{w}}_{k+1} = {\bf{w}}_k - \eta_k \nabla f_{i_k}({\bf{w}}_k), \end{equation} | (1.2) |
where \eta_k is the learning rate (also known as the stepsize in numerical optimization), and the index i_k , corresponding to the sample pair ({\bf{z}}_{i_k}, y_{i_k}) , is randomly chosen from \{1, 2, \ldots, m\} . We describe the standard SGD method in Algorithm 1. Each iteration of SGD only involves the computation of \nabla f_{i_k}({\bf{w}}_k) for a single sample, resulting in a very cheap iteration cost. Further, one can use a mini-batch method, in which a small subset of all samples is randomly selected per iteration and the update becomes
\begin{equation} {\bf{w}}_{k+1} = {\bf{w}}_k - \frac{\eta_k}{|\mathcal{S}_k|} \sum\limits_{i\in \mathcal{S}_k \subseteq \mathcal{D}} \nabla f_i({\bf{w}}_k), \end{equation} | (1.3) |
where |\mathcal{S}_k| denotes the number of elements in the set \mathcal{S}_k .

[Algorithm 1: The standard SGD method]
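To make the update concrete, here is a minimal NumPy sketch of (1.2) and its mini-batch variant (1.3); the gradient oracle grad_fi, the constant learning rate, and the default parameter values are illustrative choices of ours, not part of Algorithm 1 itself.

```python
import numpy as np

def sgd(grad_fi, w0, m, eta=0.01, batch_size=1, n_iters=1000, seed=0):
    """Plain (mini-batch) SGD sketch for min_w (1/m) sum_i f_i(w).

    grad_fi(w, idx) is assumed to return the averaged gradient
    (1/|idx|) * sum_{i in idx} grad f_i(w) of the sampled components.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(n_iters):
        # draw a random index set S_k with |S_k| = batch_size (batch_size = 1 gives (1.2))
        idx = rng.integers(0, m, size=batch_size)
        # w_{k+1} = w_k - (eta_k / |S_k|) * sum_{i in S_k} grad f_i(w_k)
        w -= eta * grad_fi(w, idx)
    return w
```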
In recent years, SGD has been widely studied in the machine learning and optimization communities; see [5,16,19,22,27,29,32] for example. The standard SGD can quickly reach the vicinity of an optimal solution during the learning process, but its convergence may slow down when more accurate solutions are required. Due to the variance of random sampling, SGD with a suitably chosen \eta_k = O(1/k) achieves only a sub-linear convergence rate of O(1/k) .
In order to improve the convergence of SGD, Johnson and Zhang [15] proposed the stochastic variance reduced gradient (SVRG) method to reduce the variance of random sampling; SVRG is described in Algorithm 2. As argued by Bottou et al. [3], during the first epochs SGD is more efficient, but once the iterates approach the solution, the benefits of the fast convergence rate of SVRG can be observed. In [21], Nguyen et al. introduced a stochastic recursive gradient algorithm (SARAH). Different from SVRG, SARAH employs a simple recursive approach to update the stochastic gradient and uses more stable gradient estimates than SVRG. However, the choice of the learning rate η depends on a Lipschitz constant of the objective function; without explicit knowledge of this constant, η is typically chosen by experimentation in both SVRG and SARAH. By incorporating the idea of variance reduction, Moritz et al. [20] proposed a stochastic L-BFGS method and proved a linear rate of convergence. In [26], Tan et al. used the well-known Barzilai-Borwein (BB) method [1] to automatically compute the learning rate for SVRG and established a linear convergence rate. Recently, based on the BB method, an improved SVRG, named the stochastic two-point stepsize gradient method, was proposed by Shao et al. [25]; it also achieves a linear convergence rate for smooth and strongly convex functions. Liu et al. [17] used the BB approach to adaptively compute the learning rate for SARAH. In [14], Jin et al. proposed a stochastic CG method with variance reduction, which converges more quickly than SVRG. In fact, the idea of stochastic CG has been studied before. Schraudolph and Graepel [24] combined CG with the stochastic setting to optimize unconstrained quadratic problems, and the resulting algorithms converge orders of magnitude faster than ordinary SGD. In [30], Xu and Dai employed CG to accelerate the convergence of the Robbins-Monro stochastic approximation algorithm. Jiang and Wilford [12] proposed a stochastic CG method for the approximation of functions, which performs the CG steps using an inner product based on stochastic sampling.
[Algorithm 2: The SVRG method]
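For comparison with the discussion above, a minimal sketch of the SVRG update of [15] might look as follows; the epoch length T, the constant learning rate eta, and the single-component gradient oracle grad_fi are placeholders of ours and do not reproduce Algorithm 2 verbatim.

```python
import numpy as np

def svrg(grad_fi, full_grad, w0, m, eta=0.01, T=50, n_epochs=25, seed=0):
    """SVRG sketch: variance-reduced inner updates around a full-gradient snapshot."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for k in range(n_epochs):
        u = full_grad(w)                     # batch gradient at the snapshot w_k
        x = w.copy()
        for t in range(T):
            i = rng.integers(0, m)
            # variance-reduced estimate: grad f_i(x_t) - grad f_i(w_k) + u_k
            g = grad_fi(x, i) - grad_fi(w, i) + u
            x -= eta * g
        w = x                                # one common option: w_{k+1} = x_T
    return w
```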
Motivated by this, we propose in this paper a new online variance reduced CG method and prove that it converges linearly for strongly convex and smooth objective functions. The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to the IFR CG method used in batch optimization. We then present the proposed method along with its theoretical analysis in Section 3. Numerical experiments are reported in Section 4. Finally, we conclude this paper in Section 5.
Due to its low memory requirements and strong convergence properties, the CG method is a very efficient batch algorithm, and it remains one of the most active research topics in optimization [6,7,11,13,18,31]. Take (1.1) as an example. Suppose that f is differentiable; then, given an initial point {\bf{w}}_0 , a CG method generates a sequence \{{\bf{w}}_k\} by using the recurrence
\begin{equation} {\bf{w}}_{k+1} = {\bf{w}}_k + \alpha_k{\bf{d}}_k, \end{equation} | (2.1) |
where \alpha_k is the stepsize, usually obtained by a line search, and the search direction {\bf{d}}_k is computed by
\begin{equation} {\bf{d}}_k = -\nabla f({\bf{w}}_k) + \beta_k{\bf{d}}_{k-1}, \quad {\bf{d}}_0 = -\nabla f({\bf{w}}_0). \end{equation} | (2.2) |
\beta_k is the CG update parameter. Different choices of \beta_k yield different CG methods. Some well-known formulas for \beta_k are \beta_k^{HS} , \beta_k^{FR} , \beta_k^{PR} , \beta_k^{CD} , \beta_k^{LS} , and \beta_k^{DY} ; see the comprehensive review [10] for details. Here, we focus on \beta_k^{FR} proposed by Fletcher and Reeves [8], which takes the form of
\begin{equation} \beta_k^{FR} = \frac{\|\nabla f({\bf{w}}_k)\|^2}{\|\nabla f({\bf{w}}_{k-1})\|^2}. \end{equation} | (2.3) |
Another important issue related to the performance of CG is the line search, which requires sufficient accuracy to ensure that {\bf{d}}_k satisfies the descent condition \nabla f({\bf{w}}_k)^T{\bf{d}}_k < 0, \forall k . A common criterion for the line search is the Wolfe line search
\begin{equation} \left\{ \begin{array}{l} f({\bf{w}}_k+\alpha_k{\bf{d}}_k)-f({\bf{w}}_k) \leq c_1\alpha_k\nabla f({\bf{w}}_k)^T{\bf{d}}_k, \\ \nabla f({\bf{w}}_k+\alpha_k{\bf{d}}_k)^T{\bf{d}}_k \geq c_2\nabla f({\bf{w}}_k)^T{\bf{d}}_k, \end{array} \right. \end{equation} | (2.4) |
where 0 < c_1 < c_2 < 1 . In the strong Wolfe line search, the second condition in (2.4) is replaced by
\begin{equation} |\nabla f({\bf{w}}_k+\alpha_k{\bf{d}}_k)^T{\bf{d}}_k| \leq -c_2\nabla f({\bf{w}}_k)^T{\bf{d}}_k. \end{equation} | (2.5) |
It has been shown that the strong Wolfe line search for \beta_k^{FR} may not yield a descent direction unless c_2 \leq 1/2 [4]. In practice, however, it is often most efficient to choose c_2 close to 1, so the constraint c_2 \leq 1/2 , needed to ensure descent, is a significant restriction on the choice of the parameter c_2 in the Wolfe line search. Note that Gilbert and Nocedal [9] showed that global convergence is retained when \beta_k^{FR} is scaled as \tau_k\beta_k^{FR} with \tau_k \in [-1, 1] . Based on this, Jiang and Jian [13] considered \tau_k = -\frac{|\nabla f({\bf{w}}_k)^T{\bf{d}}_{k-1}|}{\nabla f({\bf{w}}_{k-1})^T{\bf{d}}_{k-1}} and introduced an improved FR (IFR) formula
\begin{equation} \beta_k^{IFR} = -\frac{|\nabla f({\bf{w}}_k)^T{\bf{d}}_{k-1}|}{\nabla f({\bf{w}}_{k-1})^T{\bf{d}}_{k-1}} \times \beta_k^{FR}. \end{equation} | (2.6) |
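Read as code, (2.3) and (2.6) amount to the following short sketch; it assumes that the previous direction d_{k-1} is a descent direction, so that the denominator grad f(w_{k-1})^T d_{k-1} is negative.

```python
import numpy as np

def beta_fr(g, g_prev):
    """Fletcher-Reeves parameter (2.3): ||grad f(w_k)||^2 / ||grad f(w_{k-1})||^2."""
    return np.dot(g, g) / np.dot(g_prev, g_prev)

def beta_ifr(g, g_prev, d_prev):
    """Improved FR parameter (2.6): tau_k * beta_fr with
    tau_k = -|grad f(w_k)^T d_{k-1}| / (grad f(w_{k-1})^T d_{k-1})."""
    tau = -abs(np.dot(g, d_prev)) / np.dot(g_prev, d_prev)
    return tau * beta_fr(g, g_prev)

# The next direction is then d_k = -g + beta_ifr(g, g_prev, d_prev) * d_prev.
```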
The following lemma shows that the search directions generated by IFR CG are all sufficient descent directions, and that the IFR CG method is globally convergent under the strong Wolfe line search.
Lemma 1 (Theorem 3 [13]). Suppose that f({\boldsymbol{w}}) is differentiable, \nabla f({\boldsymbol{w}}) is Lipschitz continuous, and 0 < c_2 < \sqrt{2}/2 , then it holds that
\begin{equation} \frac{-1}{1-c^2_2} \leq \frac{\nabla f({\boldsymbol{w}}_k)^T{\boldsymbol{d}}_k}{\|\nabla f({\boldsymbol{w}}_k)\|^2} \leq \frac{2c^2_2-1}{1-c^2_2}. \end{equation} | (2.7) |
Further, IFR CG is globally convergent in the sense that \liminf_{k\rightarrow \infty}\|\nabla f({\boldsymbol{w}}_k)\| = 0 .
Based on the theories mentioned above, we now describe the core of the proposed online CG algorithm, called stochastic improved Fletcher-Reeves conjugate gradient (SIFR CG for short) method, for solving (1.1) in detail and provide the pseudo-code.
SIFR CG operates in cycles. At the beginning of each cycle, an iteration point {\bf{w}}_k is available, at which the method computes a batch gradient {\bf{u}}_k \leftarrow \nabla f({\bf{w}}_k) . Then, after initializing {\bf{x}}_0 \leftarrow {\bf{w}}_k , {\bf{g}}_0 \leftarrow {\bf{u}}_k , and {\bf{d}}_0 \leftarrow -{\bf{g}}_0 , a set of T inner iterations indexed by t is performed with the update {\bf{x}}_{t+1} \leftarrow {\bf{x}}_t + \alpha_t {\bf{d}}_t , where \alpha_t is chosen according to the strong Wolfe line search. We define the subsampled function f_{\mathcal{S}}({\bf{w}}) as
\begin{equation} f_{\mathcal{S}}({\bf{w}}) = \frac{1}{|\mathcal{S}|}\sum\limits_{i\in \mathcal{S}} f_i({\bf{w}}), \end{equation} | (3.1) |
where |\mathcal{S}| denotes the number of elements in the set \mathcal{S} . The search direction {\bf{d}}_t is computed by -{\bf{g}}_t + \beta_t{\bf{d}}_{t-1} , where {\bf{g}}_{t+1} \leftarrow \nabla f_{\mathcal{S}}({\bf{x}}_{t+1}) - \nabla f_{\mathcal{S}}({\bf{w}}_k) + {\bf{u}}_k , and \mathcal{S} is chosen randomly. The pseudo-code of SIFR CG is given in Algorithm 3.
[Algorithm 3: The SIFR CG method]
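Since the full pseudo-code of Algorithm 3 appears only as a figure, the sketch below reconstructs the cycle described in the text; the strong Wolfe line-search routine, the clipping of alpha_t to [alpha_min, alpha_max] and of beta_t to [0, epsilon] (cf. Remark 1 and the bound beta_{t+1} <= epsilon used in the proof of Theorem 1), and the sampling of S are our own illustrative stand-ins rather than a verbatim transcription.

```python
import numpy as np

def sifr_cg(full_grad, sub_grad, line_search, w0, m, T=50, n_epochs=25,
            eps=10.0, alpha_min=1e-5, alpha_max=1e5, batch_size=None, seed=0):
    """Sketch of one SIFR CG run for problem (1.1).

    full_grad(w)      -> grad f(w)        (batch gradient u_k)
    sub_grad(w, idx)  -> grad f_S(w)      (subsampled gradient over idx)
    line_search(x, d) -> step size alpha  (e.g., a strong Wolfe line search)
    """
    rng = np.random.default_rng(seed)
    batch_size = batch_size or max(1, int(np.sqrt(m)))
    w = np.array(w0, dtype=float)
    for k in range(n_epochs):
        u = full_grad(w)                       # u_k = grad f(w_k)
        x, g = w.copy(), u.copy()              # x_0 = w_k, g_0 = u_k
        d = -g                                 # d_0 = -g_0
        for t in range(T):
            alpha = np.clip(line_search(x, d), alpha_min, alpha_max)
            x_new = x + alpha * d              # x_{t+1} = x_t + alpha_t d_t
            idx = rng.integers(0, m, size=batch_size)     # random sample S
            # variance-reduced gradient: g_{t+1} = grad f_S(x_{t+1}) - grad f_S(w_k) + u_k
            g_new = sub_grad(x_new, idx) - sub_grad(w, idx) + u
            # stochastic IFR parameter, clipped so that beta_{t+1} <= eps
            beta = (-abs(np.dot(g_new, d)) / np.dot(g, d)) * (np.dot(g_new, g_new) / np.dot(g, g))
            beta = float(np.clip(beta, 0.0, eps))
            d = -g_new + beta * d              # d_{t+1} = -g_{t+1} + beta_{t+1} d_t
            x, g = x_new, g_new
        w = x                                  # w_{k+1} = x_T (Remark 3)
    return w
```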
There are a couple of things to note about this algorithm.
Remark 1. To mitigate the effects of some outliers in the training data, we restrict \alpha_t and \beta_t by adding some additional constraints.
Remark 2. For convenience, we use the condition |{\boldsymbol{g}}_{t+1}^T{\boldsymbol{d}}_t| \leq -c_2 {\boldsymbol{g}}_t^T{\boldsymbol{d}}_t in the theoretical analysis.
Remark 3. The following experiments are carried out using {\boldsymbol{w}}_{k+1} \leftarrow {\boldsymbol{x}}_T.
In this section we analyze the convergence property of the proposed method SIFR CG. Our analysis makes use of the following assumptions.
Assumption 1. \forall i , f_i: \mathbb{R}^n \rightarrow \mathbb{R} is twice continuously differentiable.
Assumption 2. \forall {\boldsymbol{w}} \in \mathbb{R}^n and \mathcal{S} \subseteq \mathcal{D} , there exist two positive constants \phi and \varphi such that \phi \boldsymbol{I} \leq \nabla^2 f_{\mathcal{S}}({\boldsymbol{w}}) \leq \varphi \boldsymbol{I} , where \boldsymbol{I} denotes the identity matrix, and \nabla^2 f_{\mathcal{S}}({\boldsymbol{w}}) is the Hessian matrix of f_{\mathcal{S}} at the point {\boldsymbol{w}} .
For Algorithm 3, we immediately have the following result.
Theorem 1. Suppose that Assumptions 1 and 2 hold, and let \xi > 0 be an upper bound on - {\boldsymbol{g}}_t^T{\boldsymbol{d}}_t / |{\boldsymbol{g}}_{t+1}^T{\boldsymbol{d}}_t| , then, \forall t , we have
\begin{equation} \|{\boldsymbol{d}}_t\|^2 \leq \rho(t)\|{\boldsymbol{g}}_{0}\|^2, \end{equation} | (3.3) |
where \rho(t) = (\xi + \frac{4}{3}) \frac{(\epsilon\xi)^t-(\epsilon^2)^t}{\xi-\epsilon} + (\epsilon^2)^t .
Proof. It follows from lines 14–16 of Algorithm 3 that \beta_{t+1}\leq\epsilon , therefore,
\begin{equation} \|{\bf{g}}_t\|^2/\|{\bf{g}}_{t-1}\|^2 \leq - \epsilon{\bf{g}}_t^T{\bf{d}}_t/|{\bf{g}}_{t+1}^T{\bf{d}}_t| \leq \epsilon \xi. \end{equation} | (3.4) |
From the definition of {\bf{d}}_t , we have
\begin{equation} \begin{split} \|{\bf{d}}_t\|^2 & = \|-{\bf{g}}_t + \beta_t{\bf{d}}_{t-1}\|^2 = \|{\bf{g}}_t\|^2 -2\beta_t {\bf{g}}_t^T{\bf{d}}_{t-1} + \beta_t^2\|{\bf{d}}_{t-1}\|^2 \\ & \leq \epsilon(\xi + \frac{2c_2}{1-c_2^2})\|{\bf{g}}_{t-1}\|^2 + \epsilon^2\|{\bf{d}}_{t-1}\|^2. \end{split} \end{equation} | (3.5) |
The last inequality follows from Lemma 1 and Remark 2. Using 0 < c_2 < 1/2 , it follows that
\begin{equation} \begin{split} \|{\bf{d}}_t\|^2 & \leq \epsilon(\xi + \frac{4}{3})\|{\bf{g}}_{t-1}\|^2 + \epsilon^2\|{\bf{d}}_{t-1}\|^2 \\ & \leq \epsilon(\xi + \frac{4}{3})[\|{\bf{g}}_{t-1}\|^2 + \epsilon^2\|{\bf{g}}_{t-2}\|^2 + \ldots + (\epsilon^2)^{t-1}\|{\bf{g}}_0\|^2] + (\epsilon^2)^t\|{\bf{g}}_{0}\|^2. \end{split} \end{equation} | (3.6) |
The second inequality follows from \|{\bf{d}}_0\| = \|{\bf{g}}_0\| (since {\bf{d}}_0 = -{\bf{g}}_0 ). Observing (3.4), we immediately get \|{\bf{g}}_t\|^2 \leq (\epsilon\xi)^t \|{\bf{g}}_0\|^2 , and (3.6) can be rewritten as
\begin{equation} \begin{split} \|{\bf{d}}_t\|^2 & \leq \epsilon(\xi + \frac{4}{3})[(\epsilon\xi)^{t-1} \|{\bf{g}}_0\|^2 + \epsilon^2(\epsilon\xi)^{t-2} \|{\bf{g}}_0\|^2 + \ldots +(\epsilon^2)^{t-1}\|{\bf{g}}_0\|^2] + (\epsilon^2)^t\|{\bf{g}}_{0}\|^2 \\ & = \Big[(\xi + \frac{4}{3}) \frac{(\epsilon\xi)^t-(\epsilon^2)^t}{\xi-\epsilon} + (\epsilon^2)^t \Big] \|{\bf{g}}_{0}\|^2, \end{split} \end{equation} | (3.7) |
which completes the proof.
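As a quick numerical sanity check on the closed form of rho(t), the snippet below unrolls the recursion in (3.5)-(3.6) and compares it with (3.3); epsilon and xi are illustrative values chosen by us (any pair with epsilon != xi works).

```python
# Unroll the bound ||d_t||^2 <= eps*(xi + 4/3)*||g_{t-1}||^2 + eps^2*||d_{t-1}||^2
# (with ||g_{t-1}||^2 <= (eps*xi)^{t-1} * ||g_0||^2) and compare with rho(t) from (3.3).
eps, xi = 0.1, 0.2        # illustrative constants only

def rho_closed(t):
    return (xi + 4/3) * ((eps*xi)**t - (eps**2)**t) / (xi - eps) + (eps**2)**t

rho_rec = 1.0             # rho(0) = 1, since ||d_0|| = ||g_0||
for t in range(1, 10):
    rho_rec = eps * (xi + 4/3) * (eps*xi)**(t - 1) + eps**2 * rho_rec
    assert abs(rho_rec - rho_closed(t)) < 1e-12   # the two expressions coincide
```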
The following two lemmas show a lower bound on \|\nabla f({\bf{x}})\|^2 and an upper bound on the variance reduced gradient estimates {\bf{g}}_t , respectively.
Lemma 2 (Lemma 5 [20], Lemma 1 [14]). Let {\boldsymbol{w}}_* be the unique minimizer of f , and suppose that f is strongly convex with parameter \sigma , then, \forall {\boldsymbol{x}} , we have \|\nabla f({\boldsymbol{x}})\|^2 \geq 2\sigma[f({\boldsymbol{x}}) - f({\boldsymbol{w}}_*)] .
Lemma 3 (Lemma 6 [20], Lemma 2 [14]). Let {\boldsymbol{w}}_* be the unique minimizer of f , {\boldsymbol{u}}_k = \nabla f({\boldsymbol{w}}_k) , and {\boldsymbol{g}}_t = \nabla f_{\mathcal{S}}({\boldsymbol{x}}_t) - \nabla f_{\mathcal{S}}({\boldsymbol{w}}_k) + {\boldsymbol{u}}_k , then by taking an expectation w.r.t. \mathcal{S} , we have \mathbb{E}[\|{\boldsymbol{g}}_t\|^2] \leq 4 \varphi [f({\boldsymbol{x}}_t) - f({\boldsymbol{w}}_*) + f({\boldsymbol{w}}_k) - f({\boldsymbol{w}}_*)] .
Based on the above theoretical basis, we now state our main result in the following theorem.
Theorem 2. Suppose that Assumptions 1 and 2 hold, and let {\boldsymbol{w}}_* be the unique minimizer of f , then, \forall k , we have
\begin{equation} \mathbb{E}[f({\boldsymbol{w}}_k) - f({\boldsymbol{w}}_*)] \leq \theta^k \mathbb{E}[f({\boldsymbol{w}}_0) - f({\boldsymbol{w}}_*)]. \end{equation} | (3.8) |
Here, the convergence rate \theta is given by
\begin{equation} \theta = \frac{1+\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \Lambda} {2\sigma\alpha_{min} T - \frac{8\varphi\epsilon\alpha_{max}T}{3}} < 1 \end{equation} | (3.9) |
with \Lambda = \frac{\xi + 4/3}{\xi - \epsilon} \times \frac{1 - (\xi\epsilon)^T}{1 - \xi\epsilon} - \frac{\epsilon + 4/3}{\xi-\epsilon} \times \frac{1-(\epsilon^2)^T}{1-\epsilon^2} , assuming that we choose 8\varphi\epsilon\alpha_{max} < 3\sigma\alpha_{min} and that we choose T large enough to satisfy
\begin{equation} T \gt \frac{3+12\varphi^2\alpha_{max}^2\Lambda}{6\sigma\alpha_{min} - 16\varphi\epsilon\alpha_{max}}. \end{equation} | (3.10) |
Proof. Using the Lipschitz continuity of \nabla f , which follows from Assumption 2, we have
\begin{equation} \begin{split} f({\bf{x}}_{t+1}) & \leq f({\bf{x}}_t) + \nabla f({\bf{x}}_t)^T({\bf{x}}_{t+1}-{\bf{x}}_t) +\frac{\varphi}{2}\|{\bf{x}}_{t+1}-{\bf{x}}_t\|^2 \\ & = f({\bf{x}}_t) + \alpha_t\nabla f({\bf{x}}_t)^T{\bf{d}}_t + \frac{\varphi \alpha_t^2}{2}\|{\bf{d}}_t\|^2. \end{split} \end{equation} | (3.11) |
Taking expectations of (3.11) and using Lemma 2 gives
\begin{equation} \begin{split} \mathbb{E}[f({\bf{x}}_{t+1})] & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \mathbb{E}[\nabla f({\bf{x}}_t)^T{\bf{d}}_t] +\frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \Big(-2\sigma\mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \beta_t \mathbb{E}[{\bf{g}}_t^T{\bf{d}}_{t-1}]\Big) + \frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \Big(-2\sigma\mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \beta_t\frac{c_2}{1-c_2^2}\mathbb{E}[\|{\bf{g}}_{t-1}\|^2]\Big) + \frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] + \alpha_t \Big(-2\sigma\mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \frac{2\beta_t}{3}\mathbb{E}[\|{\bf{g}}_{t-1}\|^2]\Big) + \frac{\varphi \alpha_t^2}{2} \mathbb{E}[\|{\bf{d}}_t\|^2]. \end{split} \end{equation} | (3.12) |
Using Theorem 1 and Lemma 3, (3.12) becomes
\begin{equation} \begin{split} & \mathbb{E}[f({\bf{x}}_{t+1})] \\ & \leq \mathbb{E}[f({\bf{x}}_t)] - 2\sigma\alpha_t \mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] + \frac{8\varphi\alpha_t\beta_t}{3} \Big(\mathbb{E}[f({\bf{x}}_{t-1}) - f({\bf{w}}_*)] + \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\Big)\\ & \ \ \ \ + \frac{\varphi \alpha_t^2}{2} \times 4\varphi\rho(t) \Big( \mathbb{E}[f({\bf{x}}_0) - f({\bf{w}}_*)] + \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)] \Big) \\ & \leq \mathbb{E}[f({\bf{x}}_t)] - 2\sigma\alpha_{min} \mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] \\ & \ \ \ \ +\frac{8\varphi\epsilon\alpha_{max}}{3}\Big(\mathbb{E}[f({\bf{x}}_{t-1}) - f({\bf{w}}_*)] +\mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\Big) + 4\varphi^2 \alpha_{max}^2\rho(t)\mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.13) |
By summing (3.13) over t = 0, 1, \ldots, T-1 , we obtain
\begin{equation} \begin{split} \mathbb{E}[f({\bf{x}}_T)] & \leq \mathbb{E}[f({\bf{x}}_0)] - 2\sigma\alpha_{min} \sum\limits_{t = 0}^{T-1} \mathbb{E}[f({\bf{x}}_t) - f({\bf{w}}_*)] \\ & \ \ \ \ + \frac{8\varphi\epsilon\alpha_{max}}{3} \sum\limits_{t = 0}^{T-2} \Big(\mathbb{E}[f({\bf{x}}_t)-f({\bf{w}}_*)] + \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\Big) \\ & \ \ \ \ + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.14) |
Rearranging (3.14) gives
\begin{equation} \begin{split} & 0 \leq \mathbb{E}[f({\bf{w}}_k)] -\mathbb{E}[f({\bf{x}}_T)] - 2\sigma\alpha_{min} T \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)] \\ & \ \ \ \ + \frac{8\varphi\epsilon\alpha_{max}T}{3} \mathbb{E}[f({\bf{w}}_{k+1}) -f({\bf{w}}_*)] + \frac{8\varphi\epsilon\alpha_{max}T}{3} \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)] \\ & \ \ \ \ + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.15) |
Further, we have
\begin{equation} \begin{split} 0 & \leq \mathbb{E}[f({\bf{w}}_k)] -\mathbb{E}[f({\bf{w}}_*)] + \Big(\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \Big) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)]\\ & \ \ \ \ + \Big(\frac{8\varphi\epsilon\alpha_{max}T}{3} - 2\sigma\alpha_{min} T \Big) \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)] \\ & = \Big(1+\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \sum\limits_{t = 0}^{T-1} \rho(t) \Big) \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)] \\ & \ \ \ \ + \Big(\frac{8\varphi\epsilon\alpha_{max}T}{3} - 2\sigma\alpha_{min} T \Big) \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)]. \end{split} \end{equation} | (3.16) |
The second inequality follows from f({\bf{w}}_*)\leq f({\bf{x}}_T) . Note that \sum_{t = 0}^{T-1} \rho(t) = \frac{\xi + 4/3}{\xi - \epsilon} \times \frac{1 - (\xi\epsilon)^T}{1 - \xi\epsilon} - \frac{\epsilon + 4/3}{\xi-\epsilon} \times \frac{1-(\epsilon^2)^T}{1-\epsilon^2} and denote this result by \Lambda , then we have
\begin{equation} \mathbb{E}[f({\bf{w}}_{k+1}) - f({\bf{w}}_*)] \leq \theta \mathbb{E}[f({\bf{w}}_k) - f({\bf{w}}_*)], \end{equation} | (3.17) |
on condition that \alpha_{max}/\alpha_{min} < 0.75\sigma/(\varphi\epsilon) , where \theta = \frac{1+\frac{8\varphi\epsilon\alpha_{max}T}{3} + 4\varphi^2 \alpha_{max}^2 \Lambda} {2\sigma\alpha_{min} T - \frac{8\varphi\epsilon\alpha_{max}T}{3}} . Since we chose T to satisfy (3.10), it follows that the rate \theta is less than one. Thus, we get the result as desired.
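To illustrate the statement of Theorem 2, the short script below evaluates Lambda, the threshold (3.10), and the resulting rate theta for one set of constants chosen by us so that 8*phi*eps*alpha_max < 3*sigma*alpha_min holds; these values are purely illustrative and are unrelated to the experimental settings in the next section.

```python
# Illustrative evaluation of the convergence rate theta in (3.9).
sigma, phi = 1.0, 1.0        # strong convexity and smoothness constants in Assumption 2
eps, xi = 0.1, 0.2           # bound on beta_t and on -g_t^T d_t / |g_{t+1}^T d_t|
a_min = a_max = 0.01         # step-size bounds: 8*phi*eps*a_max = 0.008 < 0.03 = 3*sigma*a_min

def Lam(T):
    """Lambda = sum_{t=0}^{T-1} rho(t), the closed form given after (3.9)."""
    return ((xi + 4/3) / (xi - eps) * (1 - (xi*eps)**T) / (1 - xi*eps)
            - (eps + 4/3) / (xi - eps) * (1 - (eps**2)**T) / (1 - eps**2))

T = 100
T_min = (3 + 12*phi**2*a_max**2*Lam(T)) / (6*sigma*a_min - 16*phi*eps*a_max)   # bound (3.10)
theta = ((1 + 8*phi*eps*a_max*T/3 + 4*phi**2*a_max**2*Lam(T))
         / (2*sigma*a_min*T - 8*phi*eps*a_max*T/3))
print(T > T_min, theta)      # expected: True and theta < 1 (about 0.73 here)
```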
To show the efficiency and effectiveness of the proposed method, we compare it with SVRG [15] (a stochastic learning algorithm) and IFR CG [13] (a batch learning algorithm) on five classification datasets, as shown in Table 1. These datasets are obtained from the LIBSVM database*. All experiments are implemented using the Armadillo C++ library on a PC (Core i9-9900X CPU @ 3.50GHz, 64 GB of memory) with the Ubuntu 18.04.1 operating system.
datasets | m | n |
a9a | 32561 | 123 |
w8a | 49749 | 300 |
ijcnn1 | 49990 | 22 |
susy | 5000000 | 18 |
higgs | 11000000 | 28 |
*Note: m and n represent the number of samples and features of the data set respectively. |
*http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets
We evaluate these methods on three popular machine learning problems, including
\bullet l_2 -logistic regression
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n}\frac{1}{m}\sum\limits_{i = 1}^m \log(1+e^{-y_i{\bf{w}}^T{\bf{z}}_i}) + \lambda\|{\bf{w}}\|^2, \end{equation} | (4.1) |
\bullet l_1 -SVM
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n}\frac{1}{m}\sum\limits_{i = 1}^m\max(0, 1-y_i{\bf{w}}^T{\bf{z}}_i) + \lambda\|{\bf{w}}\|^2, \end{equation} | (4.2) |
\bullet l_2 -SVM
\begin{equation} \min\limits_{{\bf{w}}\in \mathbb{R}^n}\frac{1}{m}\sum\limits_{i = 1}^m\max(0, 1-y_i{\bf{w}}^T{\bf{z}}_i)^2 + \lambda\|{\bf{w}}\|^2, \end{equation} | (4.3) |
where \lambda > 0 is the regularization parameter.
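For completeness, a sketch of the three regularized objectives (4.1)-(4.3) and their (sub)gradients is given below; the dense NumPy representation of the data matrix is an implementation choice of ours (the LIBSVM datasets are sparse in practice), and labels are assumed to take values in {-1, +1}.

```python
import numpy as np

def logistic(w, Z, y, lam):
    """l2-logistic regression (4.1): objective value and gradient."""
    margins = y * (Z @ w)
    obj = np.mean(np.logaddexp(0.0, -margins)) + lam * w @ w
    sig = 0.5 * (1.0 - np.tanh(0.5 * margins))   # numerically stable 1/(1 + exp(margins))
    grad = -(Z.T @ (sig * y)) / len(y) + 2 * lam * w
    return obj, grad

def l1_svm(w, Z, y, lam):
    """l1-SVM (hinge loss, 4.2): objective value and one subgradient."""
    margins = y * (Z @ w)
    active = margins < 1
    obj = np.mean(np.maximum(0.0, 1.0 - margins)) + lam * w @ w
    subgrad = -(Z[active].T @ y[active]) / len(y) + 2 * lam * w
    return obj, subgrad

def l2_svm(w, Z, y, lam):
    """l2-SVM (squared hinge loss, 4.3): objective value and gradient."""
    margins = y * (Z @ w)
    slack = np.maximum(0.0, 1.0 - margins)
    obj = np.mean(slack**2) + lam * w @ w
    grad = -2.0 * (Z.T @ (slack * y)) / len(y) + 2 * lam * w
    return obj, grad
```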
We choose \lambda from \{0.01, 0.001, 0.0001\} and set |\mathcal{S}| = \sqrt{m} throughout the experiments. The number of outer-loop iterations is set to 25, and the inner-loop length T is set to 30, 40, 50, and 60. For SVRG, the learning rate \eta is chosen from \{0.1, 0.05, 0.01, 0.005, 0.001, 0.0001\} . For IFR CG and SIFR CG, we set c_1 = 10^{-4} and c_2 = 0.1 . Additionally, in SIFR CG, we set \alpha_{max} = 1/\alpha_{min} = 10^5 and \epsilon = 10 . We randomly divide each dataset into three parts, 1/3 for testing, 1/5 for validation, and the rest for training. After finding the parameters that maximize the AUC on the validation set, we apply the corresponding model to the test set.
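A minimal sketch of this protocol, assuming a hypothetical training routine train(Z, y, params) that returns a weight vector and using scikit-learn's roc_auc_score for the AUC, might look as follows; the parameter grid shown is only an example.

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

def split_indices(m, seed=0):
    """Random split: 1/3 for testing, 1/5 for validation, the rest for training."""
    idx = np.random.default_rng(seed).permutation(m)
    n_test, n_val = m // 3, m // 5
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]

def select_and_test(Z, y, train, grid):
    """Pick the setting with the best validation AUC, then report the test AUC."""
    tr, va, te = split_indices(len(y))
    scores = {p: roc_auc_score(y[va], Z[va] @ train(Z[tr], y[tr], p)) for p in grid}
    best = max(scores, key=scores.get)
    return best, roc_auc_score(y[te], Z[te] @ train(Z[tr], y[tr], best))

# example grid of candidate (lambda, T) pairs (illustrative values only)
grid = [(lam, T) for lam, T in product([1e-2, 1e-3, 1e-4], [30, 40, 50, 60])]
```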
We first adopt the cross-validation technique to obtain the optimal parameters for each learning model; the cross-validation results are summarized in Tables 2 and 3. We can see that, for different learning models and different datasets, the optimal parameters are data-dependent and should be tuned accordingly.
l_2 -logistic regression | T=30 | T=40 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
w8a | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.008 |
ijcnn1 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 |
susy | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 |
l_1 -SVM | T=30 | T=40 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.008 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.008 |
w8a | \lambda=.1 , \eta=.0001 | \lambda=.05 | \lambda=.1 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.1 |
ijcnn1 | \lambda=.1 , \eta=.001 | \lambda=.05 | \lambda=.008 | \lambda=.1 , \eta=.0001 | \lambda=.05 | \lambda=.05 |
susy | \lambda=.008 , \eta=.05 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
higgs | \lambda=.008 , \eta=.1 | \lambda=.1 | \lambda=.005 | \lambda=.05 , \eta=.05 | \lambda=.1 | \lambda=.005 |
l_2 -SVM | T=30 | T=40 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.01 , \eta=.01 | \lambda=.1 | \lambda=.05 | \lambda=.005 , \eta=.01 | \lambda=.1 | \lambda=.05 |
w8a | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.1 | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.05 |
ijcnn1 | \lambda=.1 , \eta=.05 | \lambda=.1 | \lambda=.05 | \lambda=.1 , \eta=.05 | \lambda=.1 | \lambda=.05 |
susy | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
l_2 -logistic regression | T=50 | T=60 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
w8a | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.1 | \lambda=.1 | \lambda=.005 |
ijcnn1 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.05 , \eta=.1 | \lambda=.05 | \lambda=.005 |
susy | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.1 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.01 , \eta=.1 | \lambda=.05 | \lambda=.005 | \lambda=.008 , \eta=.1 | \lambda=.05 | \lambda=.005 |
l_1 -SVM | T=50 | T=60 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.1 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.01 |
w8a | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.01 | \lambda=.1 , \eta=.01 | \lambda=.05 | \lambda=.05 |
ijcnn1 | \lambda=.1 , \eta=.0001 | \lambda=.05 | \lambda=.005 | \lambda=.1 , \eta=.001 | \lambda=.05 | \lambda=.008 |
susy | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
higgs | \lambda=.05 , \eta=.05 | \lambda=.1 | \lambda=.005 | \lambda=.008 , \eta=.05 | \lambda=.1 | \lambda=.005 |
l_2 -SVM | T=50 | T=60 | ||||
SVRG | IFR CG | SIFR CG | SVRG | IFR CG | SIFR CG | |
a9a | \lambda=.1 , \eta=.005 | \lambda=.1 | \lambda=.1 | \lambda=.1 , \eta=.005 | \lambda=.1 | \lambda=.05 |
w8a | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.1 | \lambda=.1 , \eta=.001 | \lambda=.1 | \lambda=.05 |
ijcnn1 | \lambda=.1 , \eta=.05 | \lambda=.1 | \lambda=.01 | \lambda=.05 , \eta=.05 | \lambda=.1 | \lambda=.1 |
susy | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 | \lambda=.005 , \eta=.05 | \lambda=.005 | \lambda=.005 |
higgs | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 | \lambda=.1 , \eta=.01 | \lambda=.1 | \lambda=.005 |
After obtaining the optimal learning rates for SVRG and the optimal regularization parameters for all methods, we report in Figures 2-5 how the AUC on each test dataset changes with increasing computational cost for the different methods. To compare the methods fairly, all x -axes represent the computational cost, measured by the number of gradient computations divided by m , and all y -axes represent the AUC value. In general, the closer the AUC is to 1, the higher the accuracy of the method. Figures 2-5 show the results for T = 30 , T = 40 , T = 50 , and T = 60 , respectively. Firstly, we can see from the four figures that the proposed method SIFR CG significantly outperforms SVRG and IFR CG on all test datasets; that is, the classification accuracy of our method is the highest. Secondly, SVRG and IFR CG each have their own advantages and disadvantages on different datasets and are therefore hard to differentiate. Thirdly, unlike SVRG and IFR CG, we find that increasing the number of inner iterations has a less obvious effect on the performance of SIFR CG; in other words, SIFR CG takes fewer iterations to reach the optimal solution.
Additionally, Figures 6-17 show the convergence behaviors and the corresponding average computing time (the number of outer-loop iterations is set to 25) of the three methods on each training dataset. In the convergence behavior figures (see Figures 6, 7, 9, 10, 12, 13, 15 and 16), all x -axes again represent the computational cost, measured by the number of gradient computations divided by m , and all y -axes represent the logarithm (base 10) of the objective function value. We see that the convergence of SVRG is not as fast as that of the other two methods because SVRG is sensitive to the learning rate. For all test environments, however, the proposed method SIFR CG has the fastest convergence speed and reaches the smallest objective function value, which is consistent with the AUC results shown in Figures 2-5. In the average computing time figures (see Figures 8, 11, 14, and 17), all x -axes are in logarithmic time (base 10). We can see that SVRG has the least computing time on these datasets. On the contrary, IFR CG employs the entire dataset to conduct the line search at each iteration, resulting in a heavy computational burden. The computing time of SIFR CG lies between those of SVRG and IFR CG; it is close to that of IFR CG when the dataset is small and close to that of SVRG when the dataset is large.
Overall, the preliminary experiments show that the proposed method SIFR CG not only has better learning performance but also converges quickly, and its advantage is more obvious on large-scale datasets.
In this paper, we proposed a stochastic Fletcher-Reeves conjugate gradient algorithm and proved a linear rate of convergence in the strongly convex case. Comparative experiments on several benchmark datasets demonstrate that the proposed algorithm performs well on large-scale smooth and nonsmooth machine learning problems.
There are several interesting issues to study in future work. For example, we can replace the Wolfe line search by a nonmonotone line search to find the optimal solution faster or propose more effective conjugate gradient update parameters.
The authors thank the editor and the reviewers for their constructive comments and suggestions that greatly improved the quality and presentation of this paper. This work was partly supported by the National Natural Science Foundation of China (Grant Nos. 11661007, 61671456 and 61806004), the China Postdoctoral Science Foundation (Grant No. 2020T130767), the Natural Science Foundation of the Anhui Higher Education Institutions of China (Grant No. KJ2019A0082), and the Natural Science Foundation of Zhejiang Province, China (Grant No. LD19A010002).
The authors declare that there is no conflict of interests regarding the publication of this article.
[1] | J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA J. Numer. Anal., 8 (1988), 141-148. doi: 10.1093/imanum/8.1.141 |
[2] | E. Bisong, Batch vs. online learning, Building Machine Learning and Deep Learning Models on Google Cloud Platform, 2019. |
[3] | L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), 223-311. doi: 10.1137/16M1080173 |
[4] | Y. H. Dai, Y. Yuan, Nonlinear conjugate gradient methods, Shanghai: Shanghai Scientific Technical Publishers, 2000. |
[5] | D. Davis, B. Grimmer, Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems, SIAM J. Optim., 29 (2019), 1908-1930. doi: 10.1137/17M1151031 |
[6] | R. Dehghani, N. Bidabadi, H. Fahs, M. M. Hosseini, A conjugate gradient method based on a modified secant relation for unconstrained optimization, Numer. Funct. Anal. Optim., 41 (2020), 621-634. doi: 10.1080/01630563.2019.1669641 |
[7] | P. Faramarzi, K. Amini, A modified spectral conjugate gradient method with global convergence, J. Optim. Theory Appl., 182 (2019), 667-690. doi: 10.1007/s10957-019-01527-6 |
[8] | R. Fletcher, C. M. Reeves, Function minimization by conjugate gradients, Comput. J., 7 (1964), 149-154. doi: 10.1093/comjnl/7.2.149 |
[9] | J. C. Gilbert, J. Nocedal, Global convergence properties of conjugate gradient methods for optimization, SIAM J. Optim., 2 (1992), 21-42. doi: 10.1137/0802003 |
[10] | W. W. Hager, H. Zhang, Algorithm 851: CG_DESCENT, a conjugate gradient method with guaranteed descent, ACM Trans. Math. Software, 32 (2006), 113-137. doi: 10.1145/1132973.1132979 |
[11] | A. S. Halilu, M. Y. Waziri, Y. B. Musa, Inexact double step length method for solving systems of nonlinear equations, Stat. Optim. Inf. Comput., 8 (2020), 165-174. doi: 10.19139/soic-2310-5070-532 |
[12] | H. Jiang, P. Wilford, A stochastic conjugate gradient method for the approximation of functions, J. Comput. Appl. Math., 236 (2012), 2529-2544. doi: 10.1016/j.cam.2011.12.012 |
[13] | X. Jiang, J. Jian, Improved Fletcher-Reeves and Dai-Yuan conjugate gradient methods with the strong Wolfe line search, J. Comput. Appl. Math., 348 (2019), 525-534. doi: 10.1016/j.cam.2018.09.012 |
[14] | X. B. Jin, X. Y. Zhang, K. Huang, G. G. Geng, Stochastic conjugate gradient algorithm with variance reduction, IEEE Trans. Neural Networks Learn. Syst., 30 (2018), 1360-1369. |
[15] | R. Johnson, T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems, 2013. |
[16] | X. L. Li, Preconditioned stochastic gradient descent, IEEE Trans. Neural Networks Learn. Syst., 29 (2018), 1454-1466. doi: 10.1109/TNNLS.2017.2672978 |
[17] | Y. Liu, X. Wang, T. Guo, A linearly convergent stochastic recursive gradient method for convex optimization, Optim. Lett., 2020. doi: 10.1007/s11590-020-01550-x |
[18] | M. Lotfi, S. M. Hosseini, An efficient Dai-Liao type conjugate gradient method by reformulating the CG parameter in the search direction equation, J. Comput. Appl. Math., 371 (2020), 112708. doi: 10.1016/j.cam.2019.112708 |
[19] | S. Mandt, M. D. Hoffman, D. M. Blei, Stochastic gradient descent as approximate Bayesian inference, J. Mach. Learn. Res., 18 (2017), 4873-4907. |
[20] | P. Moritz, R. Nishihara, M. I. Jordan, A linearly-convergent stochastic L-BFGS algorithm, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016. |
[21] | L. M. Nguyen, J. Liu, K. Scheinberg, M. Takáč, SARAH: A novel method for machine learning problems using stochastic recursive gradient, Proceedings of the 34th International Conference on Machine Learning, 2017. |
[22] | A. Nitanda, Accelerated stochastic gradient descent for minimizing finite sums, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016. |
[23] | H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), 400-407. doi: 10.1214/aoms/1177729586 |
[24] | N. N. Schraudolph, T. Graepel, Combining conjugate direction methods with stochastic approximation of gradients, Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, 2003. |
[25] | G. Shao, W. Xue, G. Yu, X. Zheng, Improved SVRG for finite sum structure optimization with application to binary classification, J. Ind. Manage. Optim., 16 (2020), 2253-2266. |
[26] | C. Tan, S. Ma, Y. H. Dai, Y. Qian, Barzilai-Borwein step size for stochastic gradient descent, Advances in Neural Information Processing Systems, 2016. |
[27] | P. Toulis, E. Airoldi, J. Rennie, Statistical analysis of stochastic gradient methods for generalized linear models, Proceedings of the 31st International Conference on Machine Learning, 2014. |
[28] | V. Vapnik, The nature of statistical learning theory, New York: Springer, 1995. |
[29] | L. Xiao, T. Zhang, A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24 (2014), 2057-2075. doi: 10.1137/140961791 |
[30] | Z. Xu, Y. H. Dai, A stochastic approximation frame algorithm with adaptive directions, Numer. Math. Theory Methods Appl., 1 (2008), 460-474. |
[31] | W. Xue, J. Ren, X. Zheng, Z. Liu, Y. Ling, A new DY conjugate gradient method and applications to image denoising, IEICE Trans. Inf. Syst., 101 (2018), 2984-2990. |
[32] | Q. Zheng, X. Tian, N. Jiang, M. Yang, Layer-wise learning based stochastic gradient descent method for the optimization of deep convolutional neural network, J. Intell. Fuzzy Syst., 37 (2019), 5641-5654. doi: 10.3233/JIFS-190861 |