Processing math: 78%
Research article Special Issues

Sequential stochastic blackbox optimization with zeroth-order gradient estimators

  • This work considers stochastic optimization problems in which the objective function values can only be computed by a blackbox corrupted by some random noise following an unknown distribution. The proposed method is based on sequential stochastic optimization (SSO), i.e., the original problem is decomposed into a sequence of subproblems. Each subproblem is solved by using a zeroth-order version of a sign stochastic gradient descent with momentum algorithm (i.e., ZO-signum) and with increasingly fine precision. This decomposition allows a good exploration of the space while maintaining the efficiency of the algorithm once it gets close to the solution. Under the Lipschitz continuity assumption on the blackbox, a convergence rate in mean is derived for the ZO-signum algorithm. Moreover, if the blackbox is smooth and convex or locally convex around its minima, the rate of convergence to an ϵ-optimal point of the problem may be obtained for the SSO algorithm. Numerical experiments are conducted to compare the SSO algorithm with other state-of-the-art algorithms and to demonstrate its competitiveness.

    Citation: Charles Audet, Jean Bigeon, Romain Couderc, Michael Kokkolaras. Sequential stochastic blackbox optimization with zeroth-order gradient estimators[J]. AIMS Mathematics, 2023, 8(11): 25922-25956. doi: 10.3934/math.20231321

    Related Papers:

    [1] Duangdaw Rakjarungkiat, Nimit Nimana . An extrapolated fixed-point optimization method for strongly convex smooth optimizations. AIMS Mathematics, 2024, 9(2): 4259-4280. doi: 10.3934/math.2024210
    [2] Lidiya Melnikova, Valeriy Rozenberg . One dynamical input reconstruction problem: tuning of solving algorithm via numerical experiments. AIMS Mathematics, 2019, 4(3): 699-713. doi: 10.3934/math.2019.3.699
    [3] Yiyuan Cheng, Yongquan Zhang, Xingxing Zha, Dongyin Wang . On stochastic accelerated gradient with non-strongly convexity. AIMS Mathematics, 2022, 7(1): 1445-1459. doi: 10.3934/math.2022085
    [4] Nunthakarn Boonruangkan, Pattrawut Chansangiam . Convergence analysis of a gradient iterative algorithm with optimal convergence factor for a generalized Sylvester-transpose matrix equation. AIMS Mathematics, 2021, 6(8): 8477-8496. doi: 10.3934/math.2021492
    [5] Noura Alhouiti, Fatemah Mofarreh, Akram Ali, Fatemah Abdullah Alghamdi . On gradient normalized Ricci-harmonic solitons in sequential warped products. AIMS Mathematics, 2024, 9(9): 23221-23233. doi: 10.3934/math.20241129
    [6] Fang Cheng, Ye Hu, Mati ur Rahman . Analyzing the continuity of the mild solution in finite element analysis of semilinear stochastic subdiffusion problems. AIMS Mathematics, 2024, 9(4): 9364-9379. doi: 10.3934/math.2024456
    [7] Adisorn Kittisopaporn, Pattrawut Chansangiam . Approximate solutions of the 2D space-time fractional diffusion equation via a gradient-descent iterative algorithm with Grünwald-Letnikov approximation. AIMS Mathematics, 2022, 7(5): 8471-8490. doi: 10.3934/math.2022472
    [8] Shexiang Hai, Liang He . The steepest descent method for fuzzy optimization problems under granular differentiability. AIMS Mathematics, 2025, 10(4): 10163-10186. doi: 10.3934/math.2025463
    [9] Mingyuan Cao, Yueting Yang, Chaoqian Li, Xiaowei Jiang . An accelerated conjugate gradient method for the Z-eigenvalues of symmetric tensors. AIMS Mathematics, 2023, 8(7): 15008-15023. doi: 10.3934/math.2023766
    [10] Yue Li, Yunyan Wang . Strong consistency of the nonparametric kernel estimator of the transition density for the second-order diffusion process. AIMS Mathematics, 2024, 9(7): 19015-19030. doi: 10.3934/math.2024925
  • This work considers stochastic optimization problems in which the objective function values can only be computed by a blackbox corrupted by some random noise following an unknown distribution. The proposed method is based on sequential stochastic optimization (SSO), i.e., the original problem is decomposed into a sequence of subproblems. Each subproblem is solved by using a zeroth-order version of a sign stochastic gradient descent with momentum algorithm (i.e., ZO-signum) and with increasingly fine precision. This decomposition allows a good exploration of the space while maintaining the efficiency of the algorithm once it gets close to the solution. Under the Lipschitz continuity assumption on the blackbox, a convergence rate in mean is derived for the ZO-signum algorithm. Moreover, if the blackbox is smooth and convex or locally convex around its minima, the rate of convergence to an ϵ-optimal point of the problem may be obtained for the SSO algorithm. Numerical experiments are conducted to compare the SSO algorithm with other state-of-the-art algorithms and to demonstrate its competitiveness.



    The present work targets stochastic blackbox optimization problems of the form

    minxRnf(x) where f(x):=Eξ[F(x,ξ)], (1.1)

    and F:Rn×RmR is a blackbox [3] that takes two inputs: a vector of design variables xRn and a vector ξRm that represents uncertainties with unknown distributions. The function F is called a stochastic zeroth-order oracle [20]. The objective function f is obtained by taking the expectation of F over all possible values of the uncertainties ξ. This optimization problem can be found in two different fields. The first is in a machine learning framework wherein the loss function's gradient is unavailable or difficult to compute, such as in the optimization of neural network architecture [36], design of adversarial attacks [15] or game content generation [44]. The second field is when the function F is evaluated by means of a computational procedure [27]. In many cases, it depends on an uncertainty vector ξ due to environmental conditions, costs or effects of repair actions that are unknown [38]. Another source of uncertainty appears when the optimization is conducted at the early stages of the design process, where knowledge, information and data are very limited.

    Stochastic derivative-free optimization has been the subject of research for many years. Traditional derivative-free methods may be divided into two categories [16]: direct search and model-based methods. Algorithms corresponding to both methods have been adapted to a stochastic ZO oracle. Examples include the stochastic Nelder-Mead algorithm [13] and the stochastic versions of the mesh adaptive direct search algorithm [2,4] for the direct search methods. For model-based methods, most studies consider extensions of the trust region method [14,17,33]. A major shortcoming of these methods is their difficulty to scale to large problems.

    Recently, another class of methods, named ZO methods, has been attracting increasing amounts of attention. These methods use stochastic gradient estimators, which are based on the seminal work in [24,37], and they have been extended in [20,34,39,41]. These estimators have the appealing property of being able to estimate the gradient with only one or two function evaluations, regardless of the problem size. ZO methods take advantage of this property to extend first-order methods. For instance, the well known first-order methods conditional gradient, sign stochastic gradient descent (signSGD) [6] and adaptive momentum (ADAM) [26] have been extended to ZSCG [5], ZO-signSGD [30] and ZO-adaMM [15], respectively. More methods, not only based on first-order algorithms, have also emerged to solve regularized optimization problems [11], for very high dimensional blackbox optimization problems [9] and stochastic composition optimization problems [21]. Methods using second-order information based limited function queries have been developed [25]. Some methods handle situations in which the optimizer has only access to a comparison oracle that indicates which of two points has the highest value [10]. For an overview on ZO methods, readers may consult [31].

    Formally, stochastic gradient estimators involve a smooth approximation fβ (see Chapter 7.6 in [39]) which is a convolution product between f and a kernel hβ(u)

    fβ(x):=hβ(u)f(xu)du=hβ(xu)f(u)du. (1.2)

    The kernel must fulfill a set of conditions [39, pp. 263]:

    (1) hβ(u)=1βnh(uβ) is a piecewise differentiable function;

    (2) limβ0hβ(u)=δ(u), where δ(v) is Dirac's delta function;

    (3) limβ0fβ(x)=f(x) if x is a point of continuity of f;

    (4) The kernel hβ(u) is a probability density function, that is fβ(x)=EUhβ(u)[f(xU)]=EUh(u)[f(xβU)].

    Frequently used kernels include the Gaussian distribution and the uniform distribution on a unit ball. Three properties of the smooth approximation are worth noting. First, the smooth approximation may be interpreted as a local weighted average of the function values in the neighborhood of x. Condition 1.2 implies that it is possible to obtain a solution that is arbitrarily close to a local minimum f. Second, the smooth approximation is infinitely differentiable as a consequence of the convolution product, regardless of the degree of smoothness of f. Moreover, according to the chosen kernel, stochastic gradient estimators may be calculated. These estimators are unbiased estimators of fβ and may be constructed on the basis of observations of F(x,ξ) alone. Finally, the smooth approximation allows convexification of the original function f. Previous studies [39,42] show that greater values of β result in better convexification, as illustrated in Figure 1. Additionally, a larger β leads to greater exploration of the space during the calculation of the gradient estimator. It has also been demonstrated in [32] that if the smoothing parameter is too small, the difference in function values cannot be used to accurately represent the function differential, particularly when the noise level is significant.

    Figure 1.  Curves of fβ for uN(0,1) and different values of β.

    Although the two first properties of the smooth approximation are exploited by ZO methods, the last property has not been utilized since the work in [42]. This may be because the convexification phenomenon becomes insignificant when dealing with high-dimensional problems *. However, for problems of relatively small size (n10), this property can be useful. The authors of [42] use an iterative algorithm to minimize the sequence of subproblems

    minxRnfβi(x), (1.3)

    *Note that a blackbox optimization problem with dimensions ranging from 100 to 1000 may be considered large, while problems with n10000 may be considered very large.

    where βi belongs to a finite prescaled sequence of scalars. This approach is limited because the sequence βi does not necessarily converge to 0 and the number of iterations to go from subproblem i to i+1 is arbitrarily fixed a priori. Furthermore, neither a convergence proof nor a convergence rate are provided for the algorithm. Finally, although promising, numerical results are only presented for analytical test problems. These shortcomings motivate the research presented here.

    The main contributions of this paper can be summarized as follows:

    ● A sequential stochastic optimization (SSO) algorithm is developed to solve the sequence of subproblems in Eq (1.3). In the inner loop, a subproblem is solved according to the ZO version of the signum algorithm [6]. The stopping criterion is based on the norm of the momentum, which must be below a certain threshold. In the outer loop, the sequence of βi is proportional to the threshold needed to consider a subproblem solved, and it is driven to 0. Therefore, the smaller the value of βi (and thus better the approximation given by fβi), the larger the computational budget allotted for the resolution of the subproblem.

    ● A theoretical analysis of this algorithm is conducted. First, the expectation of the norm of the momentum is proved to converge to 0, with a convergence rate that depends on the step sizes. Then, the convergence rate in mean of the ZO-signum algorithm toward a stationary point of fβ is derived under Lipschitz continuity of the function F. Finally, if the function F is smooth and fβ is convex or becomes convex around its local minima, the rate of convergence to an ϵ-optimal point is derived for the SSO algorithm.

    ● Numerical experiments were conducted to evaluate the performance of the proposed algorithm for two applications. First, a comparison is made with traditional derivative-free algorithms in terms of the optimization of the storage cost of a solar thermal power plant model, which is a low-dimensional problem. Second, a comparison is made with other ZO algorithms in order to generate blackbox adversarial attacks, which are large-sized problems.

    The remainder of this paper is organized as follows. In Section 2, the main assumptions and the Gaussian gradient estimator are described. In Section 3, the sequential optimization algorithm is presented, and its convergence properties are studied in Section 4. Section 5 presents numerical results, and Section 6 draws conclusions and discusses future work.

    The assumptions concerning the stochastic blackbox function F are as follows:

    Assumption 1. Let (Ω,F,P) be a probability space.

    (a) The function satisfies F(,ξ)L1(Ω,F,P) and f(x):=Eξ[F(x,ξ)] for all xRn.

    (b) F(,ξ) is Lipschitz-continuous for any fixed value of ξ=(ξ1,ξ2), with the constant L0(F)>0, that is

    |F(x,ξ1)F(y,ξ2)|L0(F)||xy||.

    Assumption 1(a) implies that the expectation of F(x,ξ) with respect to ξ is well defined on Rn and that the estimator F(x,ξ) is unbiased. Assumption 1(b) is commonly used to ensure convergence and bound the variance of the stochastic ZO oracle. It is worth noticing that no assumption is made regarding the differentiability of the objective function f or of its estimate F with respect to x, contrary to most work on ZO methods.

    Under Assumption 1, a smooth approximation of the function f may be constructed via its convolution with a Gaussian random vector. Let u be an n-dimensional standard Gaussian random vector and β>0 be the smoothing parameter. Then, a smooth approximation of f is defined as

    fβ(x):=1(2π)n2f(x+βu)e||u||22du=Eu[f(x+βu)]. (2.1)

    This estimator has been studied in the literature (especially in [34]); it has the benefits of several appealing properties. The properties used in this work are summarized in the following lemma:

    Lemma 2.1. Under Assumption 1, the following statements hold for any integrable function f:RnR and its approximation fβ parameterized by β>0.

    (1) fβ is infinitely differentiable: fβC.

    (2) A one-sided unbiased estimator of fβ is

    ˜fβ(x):=u(f(x+βu)f(x))β. (2.2)

    (3) Let β2β10; then, xRn

    ||fβ1(x)fβ2(x)||L1(fβ1)(β2β1)(n+3)32.

    Moreover, for β>0, fβ is L1(fβ)-smooth, i.e., fβC1+ with L1(fβ)=2nβL0(F).

    (4) If f is convex, then fβ is also convex.

    Proof. (1) It is a consequence of the convolution product between an integrable function and an infinitely differentiable kernel.

    (2) See [34, Eq (22)].

    (3) If uN(0,I), define the following for all xRn

    g(x)=fβ1(x)=Eu[f(x+β1u)].

    Let μ=β2β10; it follows that for all xRn

    gμ(x)=Eu[g(x+μu)]=Eu[fβ1(x+μu)]=Eu[f(x+μu+β1u)]=Eu[f(x+β2u)]=fβ2(x).

    Then, since by [34, Lemma 2] under Assumption 1, fβ1 is Lipschitz continuously differentiable, [34, Lemma 3] may be applied to the function g and it follows that

    ||fβ1(x)fβ2(x)||=||g(x)gμ(x)||L1(fβ1)μ(n+3)32=L1(fβ1)(β2β1)(n+3)32.

    (4) See [34, pp. 5].

    The estimator obtained in Eq (2.2) may be adapted to the stochastic ZO oracle F. For instance, a one-sided (mini-batch) estimator of the noised function F is

    ˜fβ(x,ξ)=1qqj=1uj(F(x+βuj,ξj)F(x,ξ0))β, (2.3)

    where {uj}qj=1 and {ξj}qj=0 are q Gaussian random directional vectors and their associated q estimate values of the function F, respectively. This is still an unbiased estimator of fβ because

    Eu,ξ[˜fβ(x,ξ)]=Eu[Eξ[˜fβ(x,ξ)|u]]=fβ(x). (2.4)

    The result of Lemma 2.1(3) is essential to understand why solving a sequence of optimization problems defined by Eq (1.3) may be efficient, although it might seem counterproductive at first sight. Below are examples of the advantages of treating the problem with sequential smoothed function optimization.

    ● The subproblems are approximations of the original problem and it is not necessary to solve them exactly. Thus, an appropriate procedure for solving these problems with increasingly fine precision can be used. Moreover, as seen in Lemma 2.1(3), the norm of the gradient obtained in a subproblem is close to the one of the following subproblem. The computational effort to find a solution to the second subproblem from the solution of the first should therefore not be important.

    ● The information collected during the optimization process for a subproblem may be reused in the subsequent subproblems since they are similar.

    ● A specific interest in the case of smooth approximation is the ability of using a larger value of β to solve the first subproblems. It allows for a better exploration of the space and convexification phenomenon of the function (see Figure 1). Moreover, the new step size may be used for each subproblem; it allows for an increase in the step size momentarily, in the hope of having a greater chance of escaping a local minimum.

    Section 3.1 presents a ZO version of the signum algorithm [6] to solve Subproblem (1.3) for a given βi and Section 3.2 presents the complete algorithm used to solve the sequential optimization problem.

    A ZO version of the signum algorithm (Algorithm 2 of [6]) is used to solve the subproblems. The signum algorithm is a momentum version of the sign-SGD algorithm. In [30], the authors extended the original sign-SGD algorithm to a ZO version of this algorithm. However, a ZO version of signum is not studied in the work of [30]. As the signum algorithm has been shown to be competitive with the ADAM algorithm [6], a ZO version of this algorithm seems interesting to consider. For completeness, the versions of the sign-SGD and the signum algorithms as they originally appeared in [6] are given in Appendix 6. There is an important difference between the original signum algorithm and its ZO version presented in Algorithm 1. Indeed, while the step size of the momentum 1s2 is kept constant in the work of [6], it is driven to 0 in our work.

    Algorithm 1 ZO-signum algorithm to solve subproblem iN.
    1: Input: xi,0,mi,0,βi,si,01,si,02, L, q, M
    2: Set k=0
    3: Define step-size sequences si,k1=si,01(k+1)α1 and si,k2=si,02(k+1)α2
    4: while ||mi,k||>Lβi4β0 or kM do
    5:   Draw q samples uk from the Gaussian distribution N(0,I)
    6:   Calculate the average of the q Gaussian estimate ˜fβi(xi,k,ξi,k) from Eq (2.3)
    7:   Update:
           mi,k+1=si,k2˜fβi(xi,k,ξk)+(1si,k2)mi,k(3.1)
           xi,k+1j=xi,kjsk1sign(mi,k+1j)j[1,n](3.2)
    8:   kk+1
    9: end while
    10: Return mi,k and xi,k

     | Show Table
    DownLoad: CSV

    This leads to two consequences. First, the variance is reduced since the gradient is averaged on a longer time horizon, without using mini-batch sampling. Second, as it has been demonstrated in other stochastic approximation works ([7, Section 3.3] and [40]), with carefully chosen step sizes the norm of the momentum goes to 0 with probability one. In the ZO-signum algorithm, the norm of the momentum is thus used as a stopping criterion.

    The optimization of the subproblem sequence described in Eq (1.3) is driven by the SSO algorithm presented in Algorithm 2. The value of β plays a critical role, as it serves as both the smoothing parameter and the stopping criterion for Algorithms 1 and 2. Algorithm 2 is inspired by the MADS algorithm [1] as it is based on two steps: a search step and a local step. The search step is optional, may use any heuristics and is required only for problems with relatively small dimensions. In Algorithm 2, an example of a search is given; it consists of updating x after M iterations of the ZO-signum algorithm with the best known x found so far. The local step is then used: Algorithm 1 is launched for each subproblem i with specific values of βi and step-size sequences. Once Algorithm 1 meets the stopping criterion (which depends on the value of βi), the value of βi and the initial step-sizes si,01 and si,02 are reduced, and the algorithm proceeds to the next subproblem. The convergence is guaranteed by the local step, since the search step is run only a finite number of times.

    It is worth noting that the decreasing rate of βi is chosen so that the difference between subproblems i and i+1 is not significant. Therefore, the information collected in subproblem i, through the momentum vector m, can be used in subproblem i+1. Furthermore, the initial step sizes si,01 and si,02 decrease with each iteration, allowing us to focus our efforts quickly toward a local optimum when s0,01 and β0 are chosen to be relatively large.

    Algorithm 2 SSO algorithm.
    1: Initialization:
    2: Set x0,0Rn, β0>0 and N as the maximum number of function calls for the search step
    3: Set q as the number of gradient estimates at each iteration of ZO-signum (ZOS) algorithm
    4: Set M the minimum number of iterations made by the ZOS algorithm on a subproblem
    5: C is the cache containing all of the evaluated points
    6: Set m0,0=˜fβ0(x0,0,ξ0) and L=+
    7: Set s0,01>0 and s0,02>0
    8: Set i=0
    9: Search step (optional):
    10: while M(i+1)qN: do
    11:   Solve subproblem i with Algorithm 1:
           mi+1,0=ZOS(xi,0,mi,0,βi,si,01,si,02,L,q,M)xi+1,0argminxCF(x,ξ)
    12:   Update βi, si,01 and si,02 as in Step 18
    13: end while
    14: L=||m0,0||
    15: Local step:
    16: while βi>ϵ do
    17:   Solve subproblem i with Algorithm 1:
           mi+1,0,xi+1,0=ZOS(xi,0,mi,0,βi,si,01,si,02,L,q,M)
    18:   Update:
           βi=β0(i+1)2,si,01=s0,01(i+1)32,si,02=s0,02i+1ii+1
    19: end while
    20: Return xi

     | Show Table
    DownLoad: CSV

    The convergence analysis is conducted in two steps: first the convergence rate in mean is derived for Algorithm 1 and then the rate of convergence to an ϵ-optimal point is derived for Algorithm 2.

    The analysis of Algorithm 1 follows the general methodology given in [6, Appendix E]. In the following subsection, the main result in [6] is recalled for completeness. The next subsections are devoted to bound the variance and bias terms when limksi,k2=0. Finally, these results are used to obtain the convergence rate in mean of Algorithm 1 in the non-convex and convex case. The last subsection is devoted to a theoretical comparison with other ZO methods of the literature. The subproblem index i is kept constant throughout this section. In order to better convey the convergence analysis of the ZO-signum algorithm, a hierarchical workflow of the different theoretical results is presented in Table 1. The main results are presented in Theorem 4.1 and its corollary for the non-convex case, and Theorem 4.2 for the convex case.

    Table 1.  Workflow of lemmas/propositions/theorems for the ZO-signum convergence analysis.
    Assumptions on F Preliminary results Intermediate results Main results When fβ is convex
    Assumption 1 which implies that L1(fβi)=2nL0(F)βi Proposition 4.1
    Lemma 4.1 Proposition 4.2 Theorem 4.1
    Corollary 4.1
    Theorem 4.2
    Lemma 4.2
    Lemma 4.3
    Lemma 4.4 Proposition 4.3
    Lemma 4.5

     | Show Table
    DownLoad: CSV

    The following proposition uses the Lipschitz continuity of the function fβi (proved in Lemma 2.1) to bound the gradient at the kth iteration.

    Proposition 4.1. [6] For the subproblem iN, under Assumption 1 and in the setting of Algorithm 1, we have

    si,k1E[||fβi(xi,k)||1]E[fβi(xi,k)fβi(xi,k+1)]+nL1(fβi)2(si,k1)2+2si,k1E[||ˉmi,k+1fβi(xi,k)||1]bias+2si,k1nE[||mi,k+1ˉmi,k+1||22]variance, (4.1)

    where ˉmi,k+1j is defined recursively as ˉmi,k+1j=si,k2fβi(xi,k)+(1si,k2)ˉmi,kj.

    Proof. See Appendix B.

    Now, it remains to bound the three terms on the right side of Inequality (4.1).

    The three following lemmas are consecrated to bound the variance term. Unlike the work reported in [6], the variance reduction is conducted by driving the step size of the momentum to 0. It avoids the need to sample an increasing number of stochastic gradients at each iteration, which may be problematic, as noted in [30]. To achieve this, the variance term is first decomposed in terms of expectation of the squared norm of the stochastic gradient estimators ˜g.

    Lemma 4.1. For the subproblem iN, let kN and j[1,n]; we have

    E[||mi,k+1ˉmi,k+1||2](si,k2)2E[||˜gi,k||2]+k1r=0(si,r2)2k1t=r(1si,t+12)2E[||˜gi,r||2]+kt=0(1si,t2)2E[||˜gi,0||2],

    where ˜gi,rj=˜fβi(xi,r,ξr),r[0,k] is defined in Eq (2.3) and the norm is ||||2.

    Proof. Let kN; by definition of mi,k and ˉmi,k, it follows that

    ||mi,k+1ˉmi,k+1||2=(si,k2)2||˜gi,kfβi(xi,k)||2+(1si,k2)2||mi,kˉmi,k||2+2si,k2(1si,k2)(˜gi,kfβi(xi,k))T(mi,kˉmi,k).

    The expectation of this expression is

    E[||mi,k+1ˉmi,k+1||2]=(si,k2)2E[||˜gi,kfβi(xi,k)||2]+(1si,k2)2E[||mi,kˉmi,k||2]+2si,k2(1si,k2)E[(˜gi,kfβi(xi,k))T(mi,kˉmi,k)]. (4.2)

    Now, introducing the associated sigma field of the process Fi,k=σ(xj,t,mj,t,ˉmj,t;ji,tk) by the law of total expectation, it follows that

    E[(˜gi,kfβi(xi,k))T(mi,kˉmi,k)]=E[E[(˜gi,kfβi(xi,k))T(mi,kˉmi,k)|Fi,k]]=E[(E[˜gi,k|Fi,k]fβi(xi,k))T(mi,kˉmi,k)]=0,

    where the second equality holds because mi,k,ˉmi,k and fβi(xi,k) are fixed conditioned on Fi,k and because E[˜gi,k|xi,k]=fβi(xi,k) where ˜gi,k is an unbiased estimator of the gradient by Eq (2.4). By substituting this result in (4.2), it follows that

    E[||mi,k+1ˉmi,k+1||2]=(si,k2)2E[||˜gi,kfβi(xi,k)||2]+(1si,k2)2E[||mi,kˉmi,k||2].

    By repeating this process iteratively, we obtain

    E[||mi,k+1ˉmi,k+1||2]=(si,k2)2E[||˜gi,kfβi(xi,k)||2]+k1r=0(si,r2)2k1t=r(1si,t+12)2E[||˜gi,rfβi(xi,r)||2]+kt=0(1si,t2)2E[||˜gi,0fβi(xi,0)||2]. (4.3)

    Finally, by observing that r[0,k],E[˜gi,r|xi,r]=fβi(xi,r) and by the law of total expectation, we obtain

    E[||˜gi,rfβi(xi,r)||2]=E[||˜gi,rE[˜gi,r|xi,r]||2]=E[||˜gi,r||2]E[||fβi(xi,r)||2]E[||˜gi,r||2].

    Introducing this inequality to Eq (4.3) completes the proof.

    Second, the expectation of the squared norm of the stochastic gradient estimators are bounded by a constant depending quadratically on the dimension.

    Lemma 4.2. Let iN, r[0,k] and j[1,n]; then under Assumption 1, we have

    E[||˜gi,r||2]L0(F)2(n+4)2,

    where L0(F) is the Lipschitz constant of F.

    Proof. By Eq (2.3) with q=1, it follows that

    E[||˜gi,r||2]=E[||u||2(βi)2(F(xi,r+βiu,ξ1)F(xi,r,ξ0))2]L0(F)2E[||u||4]L0(F)2(n+4)2

    where the first inequality follows from Assumption 1(b) and the second by [34, Lemma 1].

    Finally, a technical lemma bounds the second term of the decomposition of Lemma 4.1 by a decreasing sequence. It achieves the same rate of convergence as in [6] without sampling any stochastic gradient.

    Lemma 4.3. For the subproblem iN, let si,k2 be defined such that si,k2=si,02(k+1)α2 with α2(0,1) and si,02(0,1); then, for k such that

    k(k+1)α2ln(si,02)+(1+α2)ln(k)si,02, (4.4)

    the following inequality holds

    k1r=0(si,r2)2k1t=r(1si,t+12)29si,02kα2. (4.5)

    Proof. Let kN; as in [6], the strategy consists of breaking up the sum in order to bound both terms separately.

    k1r=0(si,r2)2k1t=r(1si,t+12)2=k/21r=0(si,r2)2k1t=r(1si,t+12)2+k1r=k/2(si,r2)2k1t=r(1si,t+12)2(1si,k2)2k/2k/21r=0(si,r2)2+(si,k/212)2k1r=k/2(1si,k2)2(kr1)(si,02)2k/2(1si,k2)2k/2+8(si,02)2k2α2k/2r=0(1si,k2)2r(si,02)2k(1si,k2)2k/2+8(si,02)2k2α2(1(1si,k2)2)(si,02)2k(1si,k2)2k/2+8si,02kα2(2si,k2).

    Now, we are looking for k such that

    si,02k(1si,k2)2k/21kα2e2k/2ln(1si,k2)1(si,02)k1+α2.

    As, ln(1x)x, it is sufficient to find k such that

    esi,02k(k+1)α21(si,02)k1+α2k(k+1)α2ln(si,02)+(1+α2)ln(k)si,02.

    Taking such a k allows us to complete the proof.

    Combining the three previous lemmas allows the bounding of the variance term in Proposition 4.1.

    Proposition 4.2. In the setting of Lemmas 4.2 and 4.3 and under Assumption 1(b), the variance term of Proposition 4.1 is bounded by

    E[||mi,k+1ˉmi,k+1||22]9si,02L0(F)2(n+4)2kα2+o(1kα2).

    Proof. By Lemmas 4.1 and 4.2, it follows that

    E[||mi,k+1ˉmi,k+1||2](si,k2)2E[||˜gi,k||2]+k1r=0(si,r2)2k1t=r(1si,t+12)2E[||˜gi,r||2]+kt=0(1si,t2)2E[||˜gi,0||2]((si,k2)2+k1r=0(si,r2)2k1t=r(1si,t+12)2+kt=0(1si,t2)2)L0(F)2(n+4)2.

    Now as (si,k2)2=o(1kα2) and kt=0(1si,t2)2=o(1kα2), the result follows from Lemma 4.3.

    First, the bias term is bounded by a sum depending on sk1 and sk2.

    Lemma 4.4. For the subproblem iN and at iteration kN of the Algorithm 1, we have

    E[||ˉmi,k+1fβi(xi,k)||1]2nL1(fβi)(k1l=0si,l1k1t=l(1si,t+12)).

    Proof. Foremost, observe that the quantity

    Si,k:={1,if k=0,si,k2+k1r=0si,r2k1t=r(1si,t+12)+kt=0(1si,t2),otherwise, (4.6)

    may be written recursively as

    Si,k={1,if k=0,si,k2+(1si,k2)Si,k1,otherwise.

    Note that in its second expression Si,k=1 for all k. Therefore, by definition of ˉmi,kj and the previous result on Si,k, it follows that

    ˉmi,k=si,k2fβi(xi,k)+k1r=0si,r2k1t=r(1si,t+12)fβi(xi,r)+kt=0(1si,t2)fβi(xi,0),fβi(xi,k)=(si,k2+k1r=0si,r2k1t=r(1si,t+12)+kt=0(1si,t2))fβi(xi,k).

    Thus

    E[||ˉmi,k+1fβi(xi,k)||1]k1r=0si,r2k1t=r(1si,t+12)E[||fβi(xi,r)fβi(xi,k)||1]+kt=0(1si,t2)E[||fβi(xi,0)fβi(xi,k)||1]. (4.7)

    By the smoothness of the function fβi, Lemma F(3) of [6] ensures that r[0,k1]

    ||fβi(xi,r)fβi(xi,k)||1k1l=r||fβi(xi,l+1)fβi(xi,l)||1  2nL1(fβi)k1l=rsi,l1.

    Substituting this inequality into Eq (4.7) gives

    E[||ˉmi,k+1fβi(xi,k)||1]2nL1(fβi)Si,k1, (4.8)

    where

    Si,k1=k1r=0si,r2k1l=rsi,l1k1t=r(1si,t+12)+k1l=0si,l1kt=0(1si,t2).

    Reordering the terms in Sk1, we obtain

    Si,k1=k1l=0si,l1(lr=0si,r2k1t=r(1si,t+12)+kt=0(1si,t2))=k1l=0si,l1(si,l2k1t=l(1si,t+12)+l1r=0si,r2k1t=r(1si,t+12)+kt=0(1si,t2))=k1l=0si,l1k1t=l(1si,t+12)(si,l2+l1r=0si,r2l1t=r(1si,t+12)+lt=0(1si,t2))Si,l=1=k1l=0si,l1k1t=l(1si,t+12),

    which completes the proof.

    Second, the sum may be bounded by a term decreasing with k.

    Lemma 4.5. For the subproblem iN, let si,k2=si,02(k+1)α2 and si,k1=si,01(k+1)α1 with si,01(0,1),si,02(0,1) and 0<α2<α1<1; then, for k such that

    k(k+1)α22(ln(si,02)+(1+α1α2)ln(k))si,02, (4.9)

    the following inequality holds

    k1l=0si,l1k1t=l(1si,t+12)5si,01si,02kα1α2. (4.10)

    Proof. The proof follows the proof of Lemma 4.3. The sum is partitioned as follows:

    k1l=0si,l1k1t=l(1si,t+12)=k/21l=0si,l1k1t=l(1si,t+12)+k1l=k/21si,l1k1t=l(1si,t+12)(1si,k2)k/2k/21l=0si,l1+si,k/211k1l=k/21(1si,k2)kr1si,01k(1si,k2)k/2+4si,01kα1(1(1si,k2))=si,01si,02k(1si,k2)k/2si,02+4si,01si,02kα1α2.

    Now, as in Lemma 4.3 taking k such that

    k(k+1)α22(ln(si,02)+(1+α1α2)ln(k))si,02

    ensures that si,02k(1si,k2)k/21kα1α2, which completes the proof.

    Finally, using the two previous lemmas allows for bounding of the bias term.

    Proposition 4.3. In the setting of Lemma 4.5, the bias term of Proposition 4.1 is bounded by

    E[||ˉmi,k+1fβi(xi,k)||1]10nL1(fβi)si,01si,02kα1α2.

    Proof. The proof is a straightforward consequence of Lemmas 4.4 and 4.5.

    As the different terms in the inequality of Proposition 4.1 have been bounded, the main result of this section may be derived as in the following theorem.

    Theorem 4.1. For a subproblem iN and under Assumption 1, let α1(0,1), α2(0,α1), 0<si,01, si,02<1 and K>C where CN satisfies Eqs (4.4) and (4.9); we have

    E[||fβi(xi,R)||1]1K1α1CKα1(Difsi,01+nnL0(F)si,01βiKk=C1k2α1+6si,02L0(F)n(n+4)Kk=C1kα1+α22+40L0(F)si,01nnsi,02βiKk=C1k2α1α2), (4.11)

    where fβi(xi,C)minxfβi(x)Dif, L0(F) is the Lipschitz constant of F and R is randomly picked from a uniform distribution in [C,K].

    Proof. Let CN satisfy Eqs (4.4) and (4.9) and, summing over the inequality in Proposition 4.1, it follows that

    Kk=Csi,k1E[||fβi(xi,k)||1]E[fβi(xi,C)fβi(xi,K+1)]+nL1(fβi)2Kk=C(si,k1)2+2nKk=Csi,k1E[||mi,k+1ˉmi,k+1||22]+2Kk=Csi,k1E[||ˉmi,k+1fβi(xi,k)||1].

    By substituting the results of Propositions 4.2 and 4.3 in the previous inequality, we obtain

    Kk=Csi,k1E[||fβi(xi,k)||1]E[fβi(xi,C)fβi(xi,K+1)]+nL1(fβi)2Kk=C(si,k1)2+6si,02L0(F)(n+4)nKk=Csi,01kα1+α22+20L1(fβi)si,01nsi,02Kk=Csi,01k2α1α2.

    Dividing both sides by si,01Kα1(KC), picking R randomly uniformly in [C,K] and using the definition of Dif given that minxf(x)f(x) for all x, we get

    E[||fβi(xi,R)||1]=1KCKk=CE[||fβi(xi,k)||1]1KCKk=CKα1kα1E[||fβi(xi,k)||1]1K1α1CKα1(Difsi,01+nL1(fβi)si,012Kk=C1k2α1+6si,02L0(F)(n+4)nKk=C1kα1+α22+20L1(fβi)si,01nsi,02Kk=C1k2α1α2).

    Recalling that L1(fβi)=2nL0(F)βi (see [34, Lemma 2]) completes the proof.

    This theorem allows one to prove the convergence rate in mean of the norm of the gradient when α1 and α2 are chosen adequately. In particular, the following corollary provides the convergence rate when α1=34 and α2=12.

    Corollary 4.1. Under the same setting of Theorem 4.1 with βi1 α1=34, α2=12, si,01=1n34 and si,021, we have

    E[||fβi(xi,R)||2]=O(n32K1/4ln(K)). (4.12)

    Proof. The result is a direct consequence of Theorem 4.1 with the specified constant, and it can be obtained by noting that ||||2||||1 in Rn.

    In [15,20,30], the function F is assumed to be smooth with a Lipschitz continuous gradient. In the present work, F is only assumed to be Lipschitz continuous. This has two main consequences on the result of convergence: the dependence of the dimension on the convergence rate is larger. Furthermore, while β must be chosen relatively small in the smooth case, it is interesting to note that it does not have to be this way in the nonsmooth case.

    The convergence rate results for the ZO-signum algorithm has been derived in the non-convex case. In the next theorem, they are derived for the case when the function fβi is convex.

    Theorem 4.2. Under Assumption 1, suppose moreover that fβi is convex and there exists ρ such that ρ=maxkN||xi,kxi,||; then, by setting

    si,k1=2ρ(k+1),si,k2=1(k+1)23andΓk:=kl=2(12k+1)=2k(k+1)withΓ1=1, (4.13)

    it follows that

    E[fβi(xi,K)fβi(x)]4ρ2nnL0(F)βiK13 (4.14)

    and

    E[||fβi(xi,R)||]2L0(F)K2+4ρnnL0(F)βiK13, (4.15)

    where R is a random variable in [0,K1] whose the probability distribution is given by

    P(R=k)=si,k1/Γk+1K1k=0si,k1/Γk+1.

    Proof. Under the assumptions in the statement of Theorem 4.2, it follows by Proposition 4.1 that

    E[fβi(xi,k+1)fβi(xi,)]E[fβi(xi,k)fβi(xi,)]si,k1E[||fβi(xi,k)||]+nL1(fβi)2(si,k1)2+2si,k1E[||ˉmi,k+1fβi(xi,k)||1]+2si,k1nE[||mi,k+1ˉmi,k+1||2]E[fβi(xi,k)fβi(xi,)]sk1E[||fβi(xi,k)||]+4ρ2nnL0(F)βi(k+1)43, (4.16)

    where the last inequality follows thanks to Propositions 4.2 and 4.3 with L1(fβi)=2L0(F)nβi and the values of si,k1 and si,k2. Now, by convexity assumption of fβi and the bound ρ, the following holds

    fβi(xi,k)fβi(xi,)fβi(xi,k)T(xi,kxi,)||fβi(xi,k)||||xi,kxi,||ρ||fβi(xi,k)||.

    Thus, by substituting this result into Eq (4.16), it follows that

    E[fβi(xi,k+1)fβi(xi,)](12(k+1))E[fβi(xi,k)fβi(xi,)]+4ρ2nnL0(F)βi(k+1)43.

    Now by dividing both sides of the equation by Γk+1 and summing up the inequalities, it follows that

    E[fβi(xi,K)fβi(xi,)]ΓK4ρ2nnL0(F)βiK1k=01Γk+1(k+1)434ρ2nnL0(F)βiK1k=0(k+1)23.

    Thus

    E[fβi(xi,K)fβi(xi,)]4ρ2nnL0(F)βiΓKK1k=0(k+1)234ρ2nnL0(F)βiK13.

    Now, the second part of the proof may be demonstrated. By Eq (4.16), it also follows that

    si,k1E[||fβi(xi,k)||]E[fβi(xi,k)fβi(xi,)]E[fβi(xi,k+1)fβi(xi,)]+4ρ2nnL0(F)βi(k+1)43.

    As in the previous part, by dividing both sides by Γk+1, summing up the inequalities and noting that ¯fk=E[fβi(xi,k)fβi(xi,)], we obtain

    K1k=0si,k1Γk+1E[||fβi(xi,k)||]K1k=0¯fk¯fk+1Γk+1+4ρ2nnL0(F)βiK1k=01Γk+1(k+1)43.

    Then, again by dividing both sides by K1k=0si,k1Γk+1 it follows that

    E[||fβi(xi,R)||]=K1k=0si,k1Γk+1E[||fβi(xi,k)||]K1k=0si,k1Γk+11K1k=0si,k1Γk+1(K1k=0E[¯fk¯fk+1]Γk+1+4ρ2nnL0(F)βiK1k=01Γk+1(k+1)43),

    where R is a random variable whose distribution is given in the statement of the theorem. Now, as in Eq (2.21) of [5], the following inequalities hold

    K1k=0¯fk¯fk+1Γk+1¯f0+K1k=12Γk+1(k+1)¯fk and K1k=0si,k1Γk+1=ρΓK.

    Thus, by substituting these in the inequality involving the expectation, we obtain

    E[||fβi(xi,R)||]ΓKρ(E[¯f0]+K1k=12Γk+1(k+1)E[¯fk]+4ρ2nnL0(F)βiK1k=01Γk+1(k+1)43)ΓKρ(E[¯f0]+8ρnnL0(F)βiK1k=01Γk+1(k+1)43)2L0(F)K2+8ρnnL0(F)βiK13,

    where the second inequality follows from Eq (4.14).

    The result obtained in Eq (4.12) is consistent with the convergence results of other ZO methods. To gain a better understanding of its performance, this result is compared with those of four other algorithms from different perspectives: the assumptions, the measure used, the convergence rate and the function query complexity. All methods seek a solution to a stochastic optimization problem; the comparison is presented in Table 2. Since the convergence rates of the ZO-signum and ZO-signSGD algorithms are measured by using ||f(x)||, although ||f(x)||2 is used for ZO-adaMM and ZO-SGD, Jensen's inequality is used to rewrite the convergence rates in terms of the gradient norm.

    Table 2.  Summary of convergence rates and query complexity for various ZO algorithms given K iterations.
    Method Assumptions Measure Convergence rate Queries
    ZO-SGD [20] F(,ξ)C1+ E[||f(xR)||2] O(σn14K14+nK) O(K)
    E[||F(x,ξ)f(x)||2]σ2
    ZO-signSGD [30] F(,ξ)C0+ E[||f(xR)||2] O(nK+nb+nbq) O(bqK)
    F(,ξ)C1+
    ||F(x,ξ)||2η
    ZO-adaMM [15] F(,ξ)C0+ E[||f(xR)||2] O((nK14+nK)(ln(K)+ln(n))14) O(K)
    F(,ξ)C1+
    ||F(x,ξ)||η
    ZO-Signum F(,ξ)C0+ E[||fβ(xR)||2] O(nnK14ln(K)) O(K)
    ZO-Signum F(,ξ)C0+, f convex E[fβi(xi,K)fβi(xi,)] O(nnK13) O(K)
    ZO-SGD [34] F(,ξ)C0+, f convex E[f(xi,K)f(xi,)] O(nK) O(K)
    Modified ZSCG [5] F(,ξ)C1+, F convex E[f(xi,K)f(xi,)] O(σnK) O(K)
    E[||F(x,ξ)f(x)||2]σ2

     | Show Table
    DownLoad: CSV

    ● for ZO-SGD [20]

    E[||f(x)||]E[||f(x)||2]O(σnK+nK)O(σn14K14+nK);

    ● for ZO-adaMM [15]

    E[||f(x)||]E[||f(x)||2]O((nK+n2K)ln(K)+ln(n))O((nK14+nK)(ln(K)+ln(n))14),

    where the third inequalities are due to a2+b2a+b, for a,b0. For ZO-signSGD, unless the value of b depends on K, the algorithm's convergence is only guaranteed within some ball around the solution, making it difficult to compare with other methods. Thus, in the non-convex case, after this transformation, it becomes apparent that ZO-signum has a convergence rate that is O(n34σ) and O(n) worse than those of ZO-SGD and ZO-adaMM, respectively. This may be attributed to the milder assumption made on the function F in the present work, which also explains why the convergence is relative to fβ. In the convex case, ZO-signum has a convergence rate that is O(nK16σ) worse that of ZSCG and O(nK16) worse that of ZO-SGD. This may be explained by the sign() operator loosing the magnitude information of the gradient when it is applied. This problem may be fixed as in [23] but it outside the scope of this work. Finally, all methods except ZO-signSGD are momentum-based versions of the original ZO-SGD method. Although the momentum-based versions are mostly used in practice, it is interesting to notice that none of these methods possess a better convergence rate than the original ZO-SGD method. The next section provides some clues about the interests of the momentum-based method.

    The convergence analysis from the previous subsection is in mean, i.e., it establishes the expected convergence performance over many executions of the ZO-signum algorithm. As in [20], we now focus on the performance of a single run. A second hierarchical workflow of the different theoretical results is presented in Table 3.

    Table 3.  Workflow of Lemmas/Propositions/Theorems for the SSO convergence analysis.
    Assumptions on F Preliminary results Intermediate results Main result When fβ is convex
    Assumptions 1, 2 and 3 which imply L1(fβi)L1(f) Lemma 4.6 Theorem 4.3 (ⅰ) Theorem 4.3 (ⅱ)
    Proposition 4.3 Lemma 4.7 Lemma 4.8 Lemma 4.9
    Lemma 2.1(3)
    Theorem 4.1
    Proposition 4.2 Lemma 4.10
    Proposition 4.3

     | Show Table
    DownLoad: CSV

    Unlike [20], our analysis is based on a sequential optimization framework rather than a post-optimization process. Our SSO algorithm uses the norm of the momentum as an indicator of the quality of the current solution. In order to analyze the rate of convergence of this algorithm, the following additional assumptions are made regarding the function F. The first assumption concerns the smoothness of the function F. The assumption of smoothness is used only to guarantee that L1(fβi) is a constant with respect to βi, contrary to the non-smooth case (see [34, Eq (12)]).

    Assumption 2. The function F(,ξ) has a L1(F)-Lipschitz continuous gradient.

    The second assumption concerns the local convexity of the function fβ.

    Assumption 3. Let (xi,0) be a sequence of points produced by Algorithm 2 and xi, a sequence of local minima of fβi. We assume that there exists a threshold IN and a radius ρ>0 such that iI:

    (1) fβi is convex on the ball Bρ(xi,):={xRn:||xxi,||<ρ};

    (2) xi,0Bρ(xi,).

    Under these assumptions, we will prove that if the norm of the momentum vector m is below some threshold, then this threshold can be used to bound the norm of the gradient. Second, an estimate for the number of iterations required to reduce the norm of m below the threshold is provided. The next lemma is simply technical and demonstrates the link between ˉm and m.

    Lemma 4.6. For any subproblem iN and iteration k1, the following equality holds

    E[mi,k|xi,k1]=E[ˉmi,k|xi,k1],

    where ˉmi,k is defined recursively in Proposition 4.1.

    Proof. The proof is conducted by induction on k. For k=1, setting mi,0=˜fβi(xi,0,ξ0) implies that

    mi,1=si,02˜fβi(xi,0,ξ0)+(1si,02)mi,0=˜fβi(xi,0,ξ0).

    In the same way, ˉmi,1=fβi(xi,0). Therefore, we have

    E[mi,1|xi,0]=E[˜fβi(xi,0,ξ0)|xi,0]=fβi(xi,0)=E[fβi(xi,0)|xi,0]=E[ˉmi,1|xi,0].

    Now, suppose that the induction assumption is true for a given kN; then,

    E[mi,k+1|xi,k]=si,k2fβi(xi,k)+(1si,k2)E[mi,k|xi,k].

    Now, by the law of total expectation

    E[mi,k|xi,k]=E[E[mi,k|xi,k,xi,k1]|xi,k]=E[E[mi,k|xi,k1]|xi,k]=E[E[ˉmi,k|xi,k1]|xi,k](by the induction assumption)=E[ˉmi,k|xi,k].

    Thus as E[fβi(xi,k)|xi,k]=fβi(xi,k), it follows that

    E[mi,k+1|xi,k]=si,k2fβi(xi,k)+(1si,k2)E[mi,k|xi,k]=si,k2E[fβi(xi,k)|xi,k]+(1si,k2)E[ˉmi,k|xi,k]=E[ˉmi,k+1|xi,k],

    which completes the proof.

    The following lemma shows that if ||m|| is below a certain threshold, then this threshold can be used to bound the norm of the gradient.

    Lemma 4.7. For a subproblem iN, let KiN denote the first iteration in Algorithm 1 for which ||mi,Ki||Lβi4β0; then, under Assumption 3 the norm of the gradient of the function fβi at xi,K may be bounded as follows

    ||fβi(xi,Ki)||Lβi4β0+10nL1(F)si,01si,02Kα1α2i.

    Moreover, if the problem i+1 is considered, the gradient of the function fβi+1 may be bounded at the point xi,Ki=xi+1,0 as follows:

    ||fβi+1(xi+1,0)||||fβi(xi,Ki)||+L1(F)(n+3)32(βiβi+1).

    Proof. Let Ki be taken as in the statement of the lemma. The norm of the gradient may be bounded as follows:

    ||fβi(xi,Ki)||||E[mi,Ki|xi,Ki]||+||fβi(xi,Ki)E[mi,Ki|xi,Ki]||E[||mi,Ki|||xi,Ki]+||fβi(xi,Ki)E[ˉmi,Ki|xi,Ki]||,

    where the second inequality follows from Jensen's inequality and Lemma 4.6. Now, using ||mi,Ki|| Lβi4β0, E[fβi(xi,K)|xi,Ki]=fβi(xi,Ki), L1(fβi)L1(F) and the result of Proposition 4.3 completes the first part of the proof

    ||fβi(xi,Ki)||Lβi4β0+E[||fβi(xi,Ki)ˉmi,Ki|||xi,Ki]Lβi4β0+10nL1(F)si,01si,02Kα1α2i.

    The second part of the proof follows directly by applying the triangular inequality and the result in Lemma 2.1(3) because xi,Ki=xi+1,0.

    Under Assumption 2, the expected difference between the values of fβi at xi,0 and its optimal value is bounded in the next lemma.

    Lemma 4.8. Let I be the threshold from Assumption 2. If iI, then

    E[fβi+1(xi+1,0)fβi+1(xi+1,)]ρ(Lβi4β0+10nL1(F)si,01si,02Kα1α2i+L1(F)(n+3)32(βiβi+1)). (4.17)

    Proof. Convexity of the function fβi on the ball Bρ(xi,) implies that

    E[fβi+1(xi+1,0)fβi+1(xi+1,)]E[fβi+1(xi+1,0),xi+1,0xi+1,]E[||fβi+1(xi+1,0)||||xi+1,0xi+1,||].

    The result follows by using Lemma 4.7 and because xi+1,0 belongs to the ball Bϵ(xi,).

    Moreover, an estimate of the number of iterations required to reduce the norm of the gradient below some threshold may be given.

    Lemma 4.9. Under Assumptions 1–3, for a subproblem i>I and in the setting of Algorithm 2, let si,02R+ be such that k=1 in Eqs (4.4) and (4.9); assume that L=max(L0(F),L1(F)), α1=34 and α2=12. Then, for a uniformly randomly chosen R[0,Ki], it follows that

    P(||fβi(xi,R)||Lβi4β0)4β0βiK14i(Ai+Bi),

    where Ai and Bi are defined in Eq (4.18).

    Proof. Markov's inequality implies that

    P(||fβi(xi,R)||Lβi4β0)4β0E[||fβi(xi,R)||]Lβi.

    Now, given the result of Theorem 4.1 with the specified values of α1 and α2 and the fact that L1(fβi)L1(F) together with Lemma 4.8, it follows that

    4β0E[||fβi(xi,R)||]Lβi4β0βiK14i(Ai+Bi),

    where

    Ai=ρsi,01(βi14β0+10nsi1,01si1,02K14i1+(n+3)32(βiβi+1)),Bi=nsi,012H(32)k+ln(Ki)(6si,02(n+4)n+20nsi,01si,02), (4.18)

    Ki is the iteration number for subproblem i and H(32)k is the generalized harmonic number.

    The following lemma provides an estimate of the number of iterations required to bound the norm of the difference between m and the gradient below a certain threshold.

    Lemma 4.10. For a subproblem iN and in the setting of Algorithm 2, let si,02R+ be such that k=1 in Eqs (4.4) and (4.9); assume that L=max(L0(F),L1(F)), α1=34 and α2=12. Then, for a uniformly randomly chosen R[0,Ki], it follows that

    P(||mi,Rfβi(xi,R)||Lβi4β0)4β0βiK14i(3si,02(n+4)n+10nsi,01si,02).

    Proof. By Markov's inequality, it follows that

    P(||mi,Rfβi(xi,R)||Lβi4β0)4β0E[||mi,Rfβi(xi,R)||]Lβi=4β0LβiKiKik=0E[||mi,kfβi(xi,k)||]4β0βiK14i(3si,02(n+4)n+10nsi,01si,02),

    where the last inequality holds by Propositions 4.2 and 4.3 with α1=34 and α2=12.

    Finally, the main theorem of this section may be stated.

    Theorem 4.3. Let Assumptions 1–3 hold and let I be the threshold from Assumption 3.

    (i) For iN, set

    βi=1n(i+1)2,si,01=16n(i+1)3/2andsi,02=s2(i+1)

    with s2 so that Eqs (4.4) and (4.9) are satisfied for k=1. Moreover, let us denote Ki as the first iteration for which ||mi,Ki||Lβi4β0 and that without loss of generality L=max(L0(F),L1(F)). Let ϵ>0 be the desired accuracy and let iLϵI. If for any iI,Ki(i+1)6; then after at most

    O(n6L7/2ϵ7/2)

    function evaluations, the following inequality holds

    ||fβi(xi,0)||ϵ. (4.19)

    (ii) Furthermore, when for every iN,fβi is convex; then, under the same setting as Theorem 4.2 given in Eq (4.13), it follows that after at most

    O(n92L7/2ϵ7/2)

    function evaluations, the inequality given by Eq (4.19) holds.

    Proof. For a subproblem iN, a probabilistic upper bound on the iteration KiN such that ||mi,Ki||Lβi4β0 may be provided. We have

    ||mi,Ki||=mink[0,Ki]||mi,k||||mi,R||||mi,Rfβi(xi,R)||+||fβi(xi,R)||, (4.20)

    where RU[0,Ki]. Now, probabilistic upper bounds on the number Ki are required to obtain that both terms on the right-hand side of the previous inequality are below Lβi4β0. For the first term of the right-hand side in Eq (4.20), using the specified values of si,01, si,02 and βi, Lemma 4.10 ensures that

    P(||mi,Rfβi(xi,R)||Lβi4β0)4β0βiK14i(3si,02(n+4)n+10nsi,01si,02)O(nn(i+1)32K14i).

    The second term of the right-hand side in Eq (4.20) depends on the value of I. For subproblems iI, it follows by Markov's inequality and Theorem 4.1 that

    P(||fβi(xi,R)||Lβi4β0)4β0LβiE[||fβi(xi,R)||]4β0βi(Difsi,01+nsi,012H(32)k+ln(Ki)(6si,02(n+4)n+40si,01nsi,02))O(max(n(i+1)72L,nnln(Ki)(i+1)32)K14i).

    For subproblems i>I, Lemma 4.9 ensures that

    P(||fβi(xi,R)||Lβi4β0)4β0βiK14(Ai+Bi),

    where Ai and Bi are given by Eq (4.18). Now, given the condition on Ki, it follows that

    Ai=ρn(i+1)3/2(1i2+10s2i2+2(n+3)i2(i+1))

    and

    Bi=H(32)k2(i+1)3/2+ln(Ki)(6nn+3s2i+1+12s2i+1).

    Thus, we obtain

    P(||fβi(xi,R)||Lβi4β0)O(nn(i+1)32ln(Ki)K14i). (4.21)

    Therefore, to obtain that ||mi,Ki||Lβi4β0, it takes at most the following number of iterations:

    Ki={O(max(n4(i+1)14,n6(i+1)6)),if iI,O((n6(i+1)6)),otherwise.

    Thus, by taking iLϵ, it follows that the number of iterations needed to reach the subproblem i is

    ii=1Ki=Ii=1Ki+ii=I+1Ki=O(max(n4(I+1)15,n6(I+1)7))+O(n6(i)7)=O(n6L7/2ϵ7/2), (4.22)

    where I is a constant with respect to ϵ. Once this number of iterations is reached, it follows that ||mi,0||L(i+1)2ϵ and by Lemma 4.7

    ||fβi(xi,Ki)||L(i+1)2+Li+1(i)322ϵ.

    For the second part of the proof, the bounds on Eq (4.20) do not depend on the value of I since fβi is assumed convex for every iN. With the setting in Eq (4.13), it follows that

    P(||fβi(xi,R)||Lβi4β0)4β0βiE[||fβi(xi,R)||]16ρnn(i+1)2K13i

    and

    P(||mi,Rfβi(xi,R)||Lβi4β0)4β0LβinnLKi1k=02ρΓk+1(k+1)43Ki1k=02ρΓk+1(k+1)8nn(i+1)2K13i,

    where the first inequality follows by Theorem 4.2 and the second one by the definition of the probability density of R together with Propositions 4.2 and 4.3. Therefore, it takes at most Ki=O(n92(i+1)6) iterations to obtain ||mi,Ki||Lβi4β0. Thus, by taking iLϵ, it follows that the number of iterations needed to reach the subproblem i is

    ii=1Ki=O(n92(i)7)=O(n92L72ϵ72).

    It remains to apply Lemma 4.7 as previously done to complete the proof.

    We would like to make a few remarks about this theorem. First, one approach to satisfy the condition Ki(i+1)6 for any iN is to incorporate it into the stopping criterion of Algorithm 1. However, due to the limited number of iterations in practice, this condition is typically replaced by a weaker one, KiM, where M>0 is a constant. Second, the main result of Theorem 8 establishes the rate of convergence to an ϵ-optimal point for a single run of the SSO algorithm, which is the first of its kind to the best of our knowledge. This was made possible by decomposing the problem given in Eq (1.1) into a sequence of subproblems, each of which is solved by using carefully chosen stopping criteria and step sizes. It is worth noting that, in [20], the (ϵ,Λ)-solution of the norm of the gradient is obtained after at most O(nL2σ2ϵ4) function evaluations. Although this bound has a weaker dependence on n and L, it is worse in terms of ϵ. Third, the first term in Eq (4.22) may be significant even if it is fixed, particularly if the region where the function is convex is difficult to reach; indeed, this constant disappears when fβi is convex for every index i. Nevertheless, the bounds given represent the worst set and may be considerably smaller in practice. A way to decrease this term is to decrease the power on i in the denominator of βi, si,01 and si,02 but this would also decrease the asymptotic rate of convergence. Finally, the process used in the SSO algorithm may be extended to other momentum-based methods and give an appealing property for these methods compared to the classical SGD.

    The numerical experiments are conducted for two bounded constrained blackbox optimization problems. In order to handle the bound constraints x[,u]Rn, the update given by Eq (3.2) is simply projected such that xmax(,min(x,u)).

    The first stochastic test problem is SOLAR [19], which simulates a thermal solar power plant and contains several instances allowing for selection of the number of variables, the types of constraints and the objective function to optimize. All of the instances of SOLAR are stochastic and have non-convex constraints and integer variables. In this work, the algorithms developed do not deal with integer variables. Therefore, the problem is altered: all integer variables are fixed to their initial values and the problem is to obtain a feasible solution by optimizing the expectation of constraint violations over the remaining variables. Numerical experiments were conducted for the second instance of the SOLAR framework, which considers 12 variables (2 integers) and 12 constraints

    minx[0,1]12E[mj=1max(0,cj(x,ξ))2],

    where cj denotes the original stochastic constraints and the bound constraints have been normalized. The second instance of SOLAR is computationally expensive; a run may take between several seconds and several minutes. Therefore, the maximum number of function evaluations was set to 1000. Four algorithms were used

    ● SSO, whose hyperparameters values are given in Table 4. The search step given in Algorithm 2 was used for this experiment. A truncated version of the Gaussian gradient based estimate was used for this experiment.

    Table 4.  List of hyperparameters for the SSO algorithm.
    Problem βi si,k1 si,k2 M q
    Cifar10 0.005(i+1)2 0.005(i+1)32k+1 0.9(i+1)(k+1)14 60 10
    ImageNet 0.001(i+1)2 0.003(i+1)32k+1 0.7(i+1)(k+1)14 100 10
    Solar 0.3(i+1)2 0.1(i+1)32k+1 0.5(i+1)(k+1)14 5 10

     | Show Table
    DownLoad: CSV

    ● ZO-adaMM [15] which is a ZO version of the original Adam algorithm. This algorithm appears as one of the most effective according to [15,31] in terms of distortion value, number of function evaluations and success rate. The default parameters defined experimentally in [15] were used for this problem, except that β=0.05 and the learning rate was equal to 0.3. Moreover, the same gradient estimator as that for ZO-signum was used to eliminate its impact on the performance.

    ● CMA-ES [22] an algorithm based on biologically inspired operators. Its name comes from the adaptation of the covariance matrix of the multivariate normal distribution used during the mutation. The version of CMA-ES used was the one of the pymoo [8] library with the default setting.

    ● The NOMAD 3.9.1 software [29], based on the MADS [1] algorithm, a popular blackbox optimization solver.

    The results are presented in Figure 2, which plots the average best result obtained by each algorithm with five different seeds. In this experiment, SSO achieved similar performance to NOMAD and CMA-ES which are state-of-the-art algorithms for this type of problem. ZO-adaMM had difficulty converging even though it is a ZO algorithm.

    Figure 2.  Average of five different seed runs for the NOMAD, CMAES, SSO and ZO-adaMM algorithms.

    This section demonstrates the competitiveness of the SSO algorithm through experiments involving the generation of blackbox adversarial examples for deep neural networks (DNNs) [45]. Generating an adversarial example for a DNN involves adding a well-designed perturbation to the original legal input to cause the DNN to misclassify it. In this work, the attacker considers the DNN model to be unknown, hence the term blackbox. Adversarial attacks against DNNs are not just theoretical, they pose a real safety issue [35]. Having an algorithm that generates effective adversarial examples enables modification of DNN architecture to enhance its robustness against such attacks. An ideal adversarial example is one that can mislead a DNN to recognize it as any target image label, while appearing visually similar to the original input, making the perturbations indiscernible to human eyes. The similarity between the two inputs is typically measured by an p norm. Mathematically, a blackbox adversarial attack can be formalized as follows. Let (y,) denote a legitimate image y with the true label [1,M], where M is the total number of image classes. Let x denote the adversarial perturbation; the adversarial example is then given by y=y+x, and the goal is to solve the problem [15]

    minxλf(y+x)+||x||2, subject to (y+x)[0.5,0.5]n,

    where λ>0 is a regularization parameter and f is the blackbox attack loss function. In our experiments, λ=10 and the loss function is defined for an untargeted attack [12], i.e.,

    f(y)=max{Z(y)maxjZ(y)j,0},

    where Z(y)k denotes the prediction score for class k given the input y. Thus, the minimum value of 0 is reached as the perturbation succeeds to fool the neural network.

    The experiments of generating blackbox adversarial examples were first performed by using an adapted AlexNet [28] on the dataset Cifar10 and then by using InceptionV3 [43] on the dataset ImageNet [18]. Since the NOMAD algorithm is not recommended for large problems, three algorithms are compared: SSO (without search), ZO-adaMM and CMA-ES. In the experiments, the hyperparameters of the ZO-adaMM algorithm were taken as in [15], and those of SSO are given in Table 4; the uniform gradient based estimate is used for both algorithms. Moreover, for the Cifar10 dataset, different initial learning rates for ZO-adaMM were used to observe its influence on the success rate. Experiments were conducted for 100 randomly selected images with a starting point corresponding to a null distortion; the maximum number of function queries was set to 5000. Thus, as the iteration increases, the attack loss decreases until it converges to 0 (indicating a successful attack) while the norm of the distortion could increase.

    The best attack performance involves a trade-off between a fast convergence to a 0 attack loss in terms of function evaluations, a high rate of success, and a low distortion (evaluated by the 2-norm). The results for the Cifar10 dataset are given in Table 5.

    Table 5.  Results of blackbox adversarial attack for the Cifar10 dataset (n=3×32×32).
    Method Attack success rate ||2|| first success Average # of function evaluations
    ZO-adaMM lr=0.01 79% 0.14 582
    ZO-adaMM lr=0.03 96% 0.97 310
    ZO-adaMM lr=0.05 98% 2.10 215
    CMAES σ=0.005 99% 0.33 862
    SSO 100% 0.55 442

     | Show Table
    DownLoad: CSV

    Except for ZO-adaMM with an initial learning rate equal to 0.01, all algorithms achieved a success rate above 95. Among these algorithms, ZO-adaMM with a learning rate equal to 0.05, had the best convergence rate in terms of function evaluations but it had the worst value of distortion. On the contrary, CMA-ES obtained the best value of distortion but had the worst convergence rate. The SSO algorithm achieved balanced results, and it was the only one to reach full success rate.

    Table 6 displays the results for the ImageNet dataset. Only two algorithms are compared since dimensions were too large to invert the covariance matrix in CMA-ES. For this dataset, ZO-adaMM and SSO had the same convergence rate. However, SSO outperformed ZO-adaMM in terms of success rate while having a slightly higher level of distortion.

    Table 6.  Results of blackbox adversarial attack for the ImageNet dataset (n=3×299×299).
    Method Attack success rate ||2|| first success Average # of function evaluations
    ZO-adaMM lr=0.01 59% 19 1339
    SSO 73% 33 1335

     | Show Table
    DownLoad: CSV

    This paper presents a method for computationally expensive stochastic blackbox optimization. The approach uses ZO gradient estimates, which provides three advantages. First, they require few function evaluations to estimate the gradient, regardless of the problem's dimensions. Second, under mild conditions on the noised objective function, the problem is formulated as optimization of a smooth approximation. Third, the smooth approximation may appear to be locally convexified near a local minima.

    Based on these three features, the SSO algorithm was proposed. This algorithm is a sequential one and comprises two steps. The first is an optional search step that improves the exploration of the decision variable space and the algorithm's efficiency. The second is a local search, which ensures the convergence of the algorithm. In this step, the original problem is decomposed into subproblems solved by a ZO-version of a sign stochastic descent with momentum algorithm. More specifically, when the momentum's norm falls below a specified threshold that depends on the smoothing parameter, the subproblem is considered solved. The smoothing parameter's value is then decreased, and the SSO algorithm moves on to the next subproblem.

    A theoretical analysis of the algorithm has been conducted. Under Lipschitz continuity of the stochastic ZO oracle, a convergence rate in mean of the ZO-signum algorithm is derived. Under additional assumptions of smoothness and convexity or local convexity of the objective function near its minima, the rate of convergence of the SSO algorithm to an ϵ-optimal point of the problem has been derived, which is, to the best of our knowledge, the first of its kind.

    Finally, numerical experiments were conducted based on a solar power plant simulation and adversarial blackbox attacks. Both examples were computationally expensive, the former was a small-sized problem (n10) and the latter was a large-sized problem (up to n105). The results demonstrate the SSO algorithm's competitiveness in terms of both performance and convergence rate compared to state-of-the-art algorithms. Further work will extend this approach to constrained stochastic optimization.

    The authors declare that they have not used artificial intelligence tools in the creation of this article.

    The authors are grateful to the anonymous reviewers for their helpful comments and the English editor for improving the English quality of this text. This work was financed by the IVADO Fundamental Research Projects Grant PRF-2019-8079623546 and by the NSERC Alliance grant 544900-19 in collaboration with Huawei-Canada.

    All authors declare no conflicts of interest that may influence the publication of this paper.

    The following list describes symbols used within the body of the document. Throughout the paper, when a symbol is shown in bold then it is a vector; otherwise, it is a scalar.

    n The dimension of the space of the design variables
    Ω The sample space of ξ, i.e., the set of all possible outcomes of ξ
    ξ:ΩRm The vector of uncertainties
    Eξ[] The expectation with respect to the random vector ξ
    F:Rn×RmR The stochastic zeroth-order oracle that takes into account the uncertainty ξ
    f:RnR The expectation of F with respect to ξ
    βR+ A strictly positive scalar for use as a smoothing parameter
    uRn A Gaussian random vector
    fβ=E[f(x+βu)] A smooth approximation of a function f
    L0(f) The Lipschitz constant associated with a function f
    L1(f) The Lipschitz constant associated with the gradient of a function f
    f The gradient of a function f
    ˜f An estimator of the gradient of a function f
    ˜g An estimator of the gradient of a function f based on outputs of the stochastic zeroth-order oracle F(x,ξ)
    j[1,n] The counter associated with the dimension
    iN The outer iteration counter associated with a subproblem
    kN The inner iteration counter
    mRn The momentum vector
    si,k2(0,1) The step size associated with the momentum
    si,k1(0,1) The step size associated with x
    LR+ An approximation of the Lispchitz constant
    qN The size of the mini batch used to estimate ˜
    MN The minimum number of iterations used in the ZO-signum algorithm
    H(α)k The generalized harmonic number of order α
    C0+ Class of Lipschitz continuous functions
    C1+ Class of differentiable functions whose gradient is Lipschitz
    C Class of infinitely differentiable functions

     | Show Table
    DownLoad: CSV

    Proposition B.1. [6] For the subproblem iN, under Assumption 1 and in the setting of Algorithm 1, we have

    si,k1E[||fβi(xi,k)||1]E[fβi(xi,k)fβi(xi,k+1)]+nL1(fβi)2(si,k1)2+2si,k1E[||ˉmi,k+1fβi(xi,k)||1]bias+2si,k1nE[||mi,k+1ˉmi,k+1||22]variance, (B.1)

    where ˉmi,k+1j is defined recursively as ˉmi,k+1j=si,k2fβi(xi,k)+(1si,k2)ˉmi,kj.

    Proof. By L1(fβi)-Lipschitz smoothness of fβi (see Lemma 2.1(3)), it follows that

    fβi(xi,k+1)fβi(xi,k)+fβi(xi,k),xi,k+1xi,k+L1(fβi)2||xi,k+1xi,k||22=fβi(xi,k)si,k1fβi(xi,k),sign(mi,k+1)+L1(fβi)(si,k1)22||sign(mi,k+1)||22=fβi(xi,k)si,k1||fβi(xi,k)||1+nL1(fβi)2(si,k1)2+2si,k1nj=1|jfβi(xi,k)|1{sign(mi,k+1j)sign(jfβi(xi,k))},

    where 1{} is the indicator function. Now, as in [6,30], the expected improvement conditioned on xi,k is given by

    E[fβi(xi,k+1)fβi(xi,k)|xi,k]si,k1||fβi(xi,k)||1+nL1(fβi)2(si,k1)2+2si,k1nj=1|jfβi(xi,k)|E[1{sign(mi,k+1j)sign(jfβi(xi,k))}|xi,k]. (B.2)

    Again, as in [6,30], the expectation that the sign of mi,k+1j is different from the sign of jfβi(xi,k) is relaxed by considering that the set

    {mi,k+1j:sign(mi,k+1j)sign(jfβi(xi,k)}{mi,k+1j:|mi,k+1jjfβi(xi,k)||jfβi(xi,k)|}.

    Therefore, it follows that

    E[1{sign(mi,k+1j)sign(jfβi(xi,k))}|xi,k]E[1{|mi,k+1jjfβi(xi,k)||jfβi(xi,k)|}|xi,k]E[|mi,k+1jjfβi(xi,k)||xi,k]|jfβi(xi,k)|, (B.3)

    where the second inequality comes from the conditional Markov's inequality. Substituting Eq (B.3) into Eq (B.2) and taking the expectation over all of the randomness we obtain

    E[fβi(xi,k+1)fβi(xi,k)]si,k1E[||fβi(xi,k)||1]+nL2(si,k1)2+2si,k1nj=1E[|mi,k+1jjfβi(xi,k)|]. (B.4)

    Moreover, by adding and subtracting ˉmi,k+1 in terms of the sum of Eq (B.4), one gets

    nj=1E[|mi,k+1jjfβi(xi,k)|]=E[||mi,k+1ˉmi,k+1+ˉmi,k+1fβi(xi,k)||1]nE[||mi,k+1ˉmi,k+1||2]+E[||ˉmi,k+1fβi(xi,k)||1]nE[||mi,k+1ˉmi,k+1||22]+E[||ˉmi,k+1fβi(xi,k)||1],

    where the first inequality comes from ||||1n||||2 and the second one from Jensen's inequality. Finally, incorporating the last inequality in Eq (B.4) completes the proof.

    Below are the original versions of the signSGD and signum algorithms.

    Algorithm 3 signSGD algorithm.
    1: Input: x0, s1(0,1)
    2: for k=0,1, do
    3:   Calculate an estimate of the stochastic gradient ˜f(xk) and update:
             xk+1=xks1sign(˜f(xk))
    4: end for
    5: Return xk

     | Show Table
    DownLoad: CSV
    Algorithm 4 Signum algorithm.
    1: Input: x0,m0,s1(0,1),s2(0,1)
    2: for k=0,1, do
    3:   Calculate an estimate of the stochastic gradient ˜f(xk) and update:
             mk+1=s2mk+(1s2)˜f(xk)xk+1=xks1sign(mk+1)
    3: end for
    4: Return xk

     | Show Table
    DownLoad: CSV


    [1] C. Audet, J. Dennis, Mesh adaptive direct search algorithms for constrained optimization, SIAM J. Optimiz., 17 (2006), 188–217. http://dx.doi.org/10.1137/040603371 doi: 10.1137/040603371
    [2] C. Audet, K. Dzahini, M. Kokkolaras, S. Le Digabel, Stochastic mesh adaptive direct search for blackbox optimization using probabilistic estimates, Comput. Optim. Appl., 79 (2021), 1–34. http://dx.doi.org/10.1007/s10589-020-00249-0 doi: 10.1007/s10589-020-00249-0
    [3] C. Audet, W. Hare, Derivative-free and blackbox optimization, Cham: Springer, 2017. http://dx.doi.org/10.1007/978-3-319-68913-5
    [4] C. Audet, A. Ihaddadene, S. Le Digabel, C. Tribes, Robust optimization of noisy blackbox problems using the mesh adaptive direct search algorithm, Optim. Lett., 12 (2018), 675–689. http://dx.doi.org/10.1007/s11590-017-1226-6 doi: 10.1007/s11590-017-1226-6
    [5] K. Balasubramanian, S. Ghadimi, Zeroth-order nonconvex stochastic optimization: handling constraints, high dimensionality, and saddle points, Found. Computat. Math., 22 (2022), 35–76. http://dx.doi.org/10.1007/s10208-021-09499-8 doi: 10.1007/s10208-021-09499-8
    [6] J. Bernstein, Y. Wang, K. Azizzadenesheli, A. Anandkumar, SignSGD: compressed optimisation for non-convex problems, Proceedings of International Conference on Machine Learning, 2018,560–569.
    [7] S. Bhatnagar, H. Prasad, L. Prashanth, Stochastic recursive algorithms for optimization, London: Springer, 2013. http://dx.doi.org/10.1007/978-1-4471-4285-0
    [8] J. Blank, K. Deb, Pymoo: multi-objective optimization in Python, IEEE Access, 8 (2020), 89497–89509. http://dx.doi.org/10.1109/ACCESS.2020.2990567 doi: 10.1109/ACCESS.2020.2990567
    [9] H. Cai, Y. Lou, D. McKenzie, W. Yin, A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization, Proceedings of the 38th International Conference on Machine Learning, 2021, 1193–1203.
    [10] H. Cai, D. McKenzie, W. Yin, Z. Zhang, A one-bit, comparison-based gradient estimator, Appl. Comput. Harmon. Anal., 60 (2022), 242–266. http://dx.doi.org/10.1016/j.acha.2022.03.003 doi: 10.1016/j.acha.2022.03.003
    [11] H. Cai, D. Mckenzie, W. Yin, Z. Zhang, Zeroth-order regularized optimization (zoro): approximately sparse gradients and adaptive sampling, SIAM J. Optim., 32 (2022), 687–714. http://dx.doi.org/10.1137/21M1392966 doi: 10.1137/21M1392966
    [12] N. Carlini, D. Wagner, Towards evaluating the robustness of neural networks, Proceedings of 2017 IEEE Symposium on Security and Privacy, 2017, 39–57. http://dx.doi.org/10.1109/SP.2017.49 doi: 10.1109/SP.2017.49
    [13] K. Chang, Stochastic nelder-mead simplex method-a new globally convergent direct search method for simulation optimization, Eur. J. Oper. Res., 220 (2012), 684–694. http://dx.doi.org/10.1016/j.ejor.2012.02.028 doi: 10.1016/j.ejor.2012.02.028
    [14] R. Chen, M. Menickelly, K. Scheinberg, Stochastic optimization using a trust-region method and random models, Math. Program., 169 (2018), 447–487. http://dx.doi.org/10.1007/s10107-017-1141-8 doi: 10.1007/s10107-017-1141-8
    [15] X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, et al., Zo-adamm: zeroth-order adaptive momentum method for black-box optimization, Proceedings of 33rd Conference on Neural Information Processing Systems, 2019, 1–12.
    [16] A. Conn, K. Scheinberg, L. Vicente, Introduction to derivative-free optimization, Philadelphia: SIAM, 2009. http://dx.doi.org/10.1137/1.9780898718768
    [17] F. Curtis, K. Scheinberg, R. Shi, A stochastic trust region algorithm based on careful step normalization, Informs Journal on Optimization, 1 (2019), 200–220. http://dx.doi.org/10.1287/ijoo.2018.0010 doi: 10.1287/ijoo.2018.0010
    [18] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Imagenet: a large-scale hierarchical image database, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009,248–255. http://dx.doi.org/10.1109/CVPR.2009.5206848 doi: 10.1109/CVPR.2009.5206848
    [19] M. Garneau, Modelling of a solar thermal power plant for benchmarking blackbox optimization solvers, Ph. D Thesis, École Polytechnique de Montréal, 2015.
    [20] S. Ghadimi, G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), 2341–2368. http://dx.doi.org/10.1137/120880811 doi: 10.1137/120880811
    [21] S. Ghadimi, A. Ruszczynski, M. Wang, A single timescale stochastic approximation method for nested stochastic optimization, SIAM J. Optim., 30 (2020), 960–979. http://dx.doi.org/10.1137/18M1230542 doi: 10.1137/18M1230542
    [22] N. Hansen, The CMA evolution strategy: a comparing review, In: Towards a new evolutionary computation, Berlin: Springer, 2006, 75–102. http://dx.doi.org/10.1007/3-540-32494-1_4
    [23] S. Karimireddy, Q. Rebjock, S. Stich, M. Jaggi, Error feedback fixes signsgd and other gradient compression schemes, Proceedings of the 36th International Conference on Machine Learning, 2019, 3252–3261.
    [24] J. Kiefer, J. Wolfowitz, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist., 23 (1952), 462–466. http://dx.doi.org/10.1214/aoms/1177729392 doi: 10.1214/aoms/1177729392
    [25] B. Kim, H. Cai, D. McKenzie, W. Yin, Curvature-aware derivative-free optimization, arXiv:2109.13391.
    [26] D. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980.
    [27] M. Kokkolaras, Z. Mourelatos, P. Papalambros, Impact of uncertainty quantification on design: an engine optimisation case study, International Journal of Reliability and Safety, 1 (2006), 225–237. http://dx.doi.org/10.1504/IJRS.2006.010786 doi: 10.1504/IJRS.2006.010786
    [28] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM, 60 (2017), 84–90. http://dx.doi.org/10.1145/3065386 doi: 10.1145/3065386
    [29] S. Le Digabel, Algorithm 909: NOMAD: nonlinear optimization with the MADS algorithm, ACM T. Math. Software, 37 (2011), 1–15. http://dx.doi.org/10.1145/1916461.1916468 doi: 10.1145/1916461.1916468
    [30] S. Liu, P. Chen, X. Chen, M. Hong, Sign-SGD via zeroth-order oracle, Proceedings of International Conference on Learning Representations, 2019, 1–24.
    [31] S. Liu, P. Chen, B. Kailkhura, G. Zhang, A. Hero, P. Varshney, A primer on zeroth-order optimization in signal processing and machine learning: principals, recent advances, and applications, IEEE Signal Proc. Mag., 37 (2020), 43–54. http://dx.doi.org/10.1109/MSP.2020.3003837 doi: 10.1109/MSP.2020.3003837
    [32] S. Liu, B. Kailkhura, P. Chen, P. Ting, S. Chang, L. Amini, Zeroth-order stochastic variance reduction for nonconvex optimization, Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, 3731–3741.
    [33] A. Maggiar, A. Wachter, I. Dolinskaya, J. Staum, A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling, SIAM J. Optim., 28 (2018), 1478–1507. http://dx.doi.org/10.1137/15M1031679 doi: 10.1137/15M1031679
    [34] Y. Nesterov, V. Spokoiny, Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), 527–566. http://dx.doi.org/10.1007/s10208-015-9296-2 doi: 10.1007/s10208-015-9296-2
    [35] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Berkay Celik, A. Swami, Practical black-box attacks against machine learning, Proceedings of the 2017 ACM on Asia conference on computer and communications security, 2017,506–519. http://dx.doi.org/10.1145/3052973.3053009 doi: 10.1145/3052973.3053009
    [36] E. Real, S. Moore, A. Selle, S. Saxena, Y. Suematsu, J. Tan, et al., Large-scale evolution of image classifiers, Proceedings of the 34th International Conference on Machine Learning, 2017, 2902–2911.
    [37] H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), 400–407. http://dx.doi.org/10.1214/aoms/1177729586 doi: 10.1214/aoms/1177729586
    [38] R. Rockafellar, J. Royset, Risk measures in engineering design under uncertainty, Proceedings of International Conference on Applications of Statistics and Probability, 2015, 1–8. http://dx.doi.org/10.14288/1.0076159 doi: 10.14288/1.0076159
    [39] R. Rubinstein, Simulation and the Monte Carlo method, Hoboken: John Wiley & Sons Inc., 1981. http://dx.doi.org/10.1002/9780470316511
    [40] A. Ruszczynski, W. Syski, Stochastic approximation method with gradient averaging for unconstrained problems, IEEE T. Automat. Contr., 28 (1983), 1097–1105. http://dx.doi.org/10.1109/TAC.1983.1103184 doi: 10.1109/TAC.1983.1103184
    [41] J. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE T. Automat. Contr., 37 (1992), 332–341. http://dx.doi.org/10.1109/9.119632 doi: 10.1109/9.119632
    [42] M. Styblinski, T. Tang, Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing, Neural Networks, 3 (1990), 467–483.
    [43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, 2818–2826. http://dx.doi.org/10.1109/CVPR.2016.308 doi: 10.1109/CVPR.2016.308
    [44] V. Volz, J. Schrum, J. Liu, S. Lucas, A. Smith, S. Risi, Evolving mario levels in the latent space of a deep convolutional generative adversarial network, Proceedings of the Genetic and Evolutionary Computation Conference, 2018,221–228. http://dx.doi.org/10.1145/3205455.3205517 doi: 10.1145/3205455.3205517
    [45] K. Xu, S. Liu, P. Zhao, P. Chen, H. Zhang, Q. Fan, et al., Structured adversarial attack: towards general implementation and better interpretability, Proceedings of International Conference on Learning Representations, 2019, 1–21.
  • Reader Comments
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(1859) PDF downloads(84) Cited by(0)

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog