
This work considers stochastic optimization problems in which the objective function values can only be computed by a blackbox corrupted by some random noise following an unknown distribution. The proposed method is based on sequential stochastic optimization (SSO), i.e., the original problem is decomposed into a sequence of subproblems. Each subproblem is solved by using a zeroth-order version of a sign stochastic gradient descent with momentum algorithm (i.e., ZO-signum) and with increasingly fine precision. This decomposition allows a good exploration of the space while maintaining the efficiency of the algorithm once it gets close to the solution. Under the Lipschitz continuity assumption on the blackbox, a convergence rate in mean is derived for the ZO-signum algorithm. Moreover, if the blackbox is smooth and convex or locally convex around its minima, the rate of convergence to an ϵ-optimal point of the problem may be obtained for the SSO algorithm. Numerical experiments are conducted to compare the SSO algorithm with other state-of-the-art algorithms and to demonstrate its competitiveness.
Citation: Charles Audet, Jean Bigeon, Romain Couderc, Michael Kokkolaras. Sequential stochastic blackbox optimization with zeroth-order gradient estimators[J]. AIMS Mathematics, 2023, 8(11): 25922-25956. doi: 10.3934/math.20231321
[1] | Duangdaw Rakjarungkiat, Nimit Nimana . An extrapolated fixed-point optimization method for strongly convex smooth optimizations. AIMS Mathematics, 2024, 9(2): 4259-4280. doi: 10.3934/math.2024210 |
[2] | Lidiya Melnikova, Valeriy Rozenberg . One dynamical input reconstruction problem: tuning of solving algorithm via numerical experiments. AIMS Mathematics, 2019, 4(3): 699-713. doi: 10.3934/math.2019.3.699 |
[3] | Yiyuan Cheng, Yongquan Zhang, Xingxing Zha, Dongyin Wang . On stochastic accelerated gradient with non-strongly convexity. AIMS Mathematics, 2022, 7(1): 1445-1459. doi: 10.3934/math.2022085 |
[4] | Nunthakarn Boonruangkan, Pattrawut Chansangiam . Convergence analysis of a gradient iterative algorithm with optimal convergence factor for a generalized Sylvester-transpose matrix equation. AIMS Mathematics, 2021, 6(8): 8477-8496. doi: 10.3934/math.2021492 |
[5] | Noura Alhouiti, Fatemah Mofarreh, Akram Ali, Fatemah Abdullah Alghamdi . On gradient normalized Ricci-harmonic solitons in sequential warped products. AIMS Mathematics, 2024, 9(9): 23221-23233. doi: 10.3934/math.20241129 |
[6] | Fang Cheng, Ye Hu, Mati ur Rahman . Analyzing the continuity of the mild solution in finite element analysis of semilinear stochastic subdiffusion problems. AIMS Mathematics, 2024, 9(4): 9364-9379. doi: 10.3934/math.2024456 |
[7] | Adisorn Kittisopaporn, Pattrawut Chansangiam . Approximate solutions of the 2D space-time fractional diffusion equation via a gradient-descent iterative algorithm with Grünwald-Letnikov approximation. AIMS Mathematics, 2022, 7(5): 8471-8490. doi: 10.3934/math.2022472 |
[8] | Shexiang Hai, Liang He . The steepest descent method for fuzzy optimization problems under granular differentiability. AIMS Mathematics, 2025, 10(4): 10163-10186. doi: 10.3934/math.2025463 |
[9] | Mingyuan Cao, Yueting Yang, Chaoqian Li, Xiaowei Jiang . An accelerated conjugate gradient method for the Z-eigenvalues of symmetric tensors. AIMS Mathematics, 2023, 8(7): 15008-15023. doi: 10.3934/math.2023766 |
[10] | Yue Li, Yunyan Wang . Strong consistency of the nonparametric kernel estimator of the transition density for the second-order diffusion process. AIMS Mathematics, 2024, 9(7): 19015-19030. doi: 10.3934/math.2024925 |
This work considers stochastic optimization problems in which the objective function values can only be computed by a blackbox corrupted by some random noise following an unknown distribution. The proposed method is based on sequential stochastic optimization (SSO), i.e., the original problem is decomposed into a sequence of subproblems. Each subproblem is solved by using a zeroth-order version of a sign stochastic gradient descent with momentum algorithm (i.e., ZO-signum) and with increasingly fine precision. This decomposition allows a good exploration of the space while maintaining the efficiency of the algorithm once it gets close to the solution. Under the Lipschitz continuity assumption on the blackbox, a convergence rate in mean is derived for the ZO-signum algorithm. Moreover, if the blackbox is smooth and convex or locally convex around its minima, the rate of convergence to an ϵ-optimal point of the problem may be obtained for the SSO algorithm. Numerical experiments are conducted to compare the SSO algorithm with other state-of-the-art algorithms and to demonstrate its competitiveness.
The present work targets stochastic blackbox optimization problems of the form
minx∈Rnf(x) where f(x):=Eξ[F(x,ξ)], | (1.1) |
and F:Rn×Rm→R is a blackbox [3] that takes two inputs: a vector of design variables x∈Rn and a vector ξ∈Rm that represents uncertainties with unknown distributions. The function F is called a stochastic zeroth-order oracle [20]. The objective function f is obtained by taking the expectation of F over all possible values of the uncertainties ξ. This optimization problem can be found in two different fields. The first is in a machine learning framework wherein the loss function's gradient is unavailable or difficult to compute, such as in the optimization of neural network architecture [36], design of adversarial attacks [15] or game content generation [44]. The second field is when the function F is evaluated by means of a computational procedure [27]. In many cases, it depends on an uncertainty vector ξ due to environmental conditions, costs or effects of repair actions that are unknown [38]. Another source of uncertainty appears when the optimization is conducted at the early stages of the design process, where knowledge, information and data are very limited.
Stochastic derivative-free optimization has been the subject of research for many years. Traditional derivative-free methods may be divided into two categories [16]: direct search and model-based methods. Algorithms corresponding to both methods have been adapted to a stochastic ZO oracle. Examples include the stochastic Nelder-Mead algorithm [13] and the stochastic versions of the mesh adaptive direct search algorithm [2,4] for the direct search methods. For model-based methods, most studies consider extensions of the trust region method [14,17,33]. A major shortcoming of these methods is their difficulty to scale to large problems.
Recently, another class of methods, named ZO methods, has been attracting increasing amounts of attention. These methods use stochastic gradient estimators, which are based on the seminal work in [24,37], and they have been extended in [20,34,39,41]. These estimators have the appealing property of being able to estimate the gradient with only one or two function evaluations, regardless of the problem size. ZO methods take advantage of this property to extend first-order methods. For instance, the well known first-order methods conditional gradient, sign stochastic gradient descent (signSGD) [6] and adaptive momentum (ADAM) [26] have been extended to ZSCG [5], ZO-signSGD [30] and ZO-adaMM [15], respectively. More methods, not only based on first-order algorithms, have also emerged to solve regularized optimization problems [11], for very high dimensional blackbox optimization problems [9] and stochastic composition optimization problems [21]. Methods using second-order information based limited function queries have been developed [25]. Some methods handle situations in which the optimizer has only access to a comparison oracle that indicates which of two points has the highest value [10]. For an overview on ZO methods, readers may consult [31].
Formally, stochastic gradient estimators involve a smooth approximation fβ (see Chapter 7.6 in [39]) which is a convolution product between f and a kernel hβ(u)
fβ(x):=∫∞−∞hβ(u)f(x−u)du=∫∞−∞hβ(x−u)f(u)du. | (1.2) |
The kernel must fulfill a set of conditions [39, pp. 263]:
(1) hβ(u)=1βnh(uβ) is a piecewise differentiable function;
(2) limβ→0hβ(u)=δ(u), where δ(v) is Dirac's delta function;
(3) limβ→0fβ(x)=f(x) if x is a point of continuity of f;
(4) The kernel hβ(u) is a probability density function, that is fβ(x)=EU∼hβ(u)[f(x−U)]=EU∼h(u)[f(x−βU)].
Frequently used kernels include the Gaussian distribution and the uniform distribution on a unit ball. Three properties of the smooth approximation are worth noting. First, the smooth approximation may be interpreted as a local weighted average of the function values in the neighborhood of x. Condition 1.2 implies that it is possible to obtain a solution that is arbitrarily close to a local minimum f∗. Second, the smooth approximation is infinitely differentiable as a consequence of the convolution product, regardless of the degree of smoothness of f. Moreover, according to the chosen kernel, stochastic gradient estimators may be calculated. These estimators are unbiased estimators of ∇fβ and may be constructed on the basis of observations of F(x,ξ) alone. Finally, the smooth approximation allows convexification of the original function f. Previous studies [39,42] show that greater values of β result in better convexification, as illustrated in Figure 1. Additionally, a larger β leads to greater exploration of the space during the calculation of the gradient estimator. It has also been demonstrated in [32] that if the smoothing parameter is too small, the difference in function values cannot be used to accurately represent the function differential, particularly when the noise level is significant.
Although the two first properties of the smooth approximation are exploited by ZO methods, the last property has not been utilized since the work in [42]. This may be because the convexification phenomenon becomes insignificant when dealing with high-dimensional problems *. However, for problems of relatively small size (n≃10), this property can be useful. The authors of [42] use an iterative algorithm to minimize the sequence of subproblems
minx∈Rnfβi(x), | (1.3) |
*Note that a blackbox optimization problem with dimensions ranging from 100 to 1000 may be considered large, while problems with n≥10000 may be considered very large.
where βi belongs to a finite prescaled sequence of scalars. This approach is limited because the sequence βi does not necessarily converge to 0 and the number of iterations to go from subproblem i to i+1 is arbitrarily fixed a priori. Furthermore, neither a convergence proof nor a convergence rate are provided for the algorithm. Finally, although promising, numerical results are only presented for analytical test problems. These shortcomings motivate the research presented here.
The main contributions of this paper can be summarized as follows:
● A sequential stochastic optimization (SSO) algorithm is developed to solve the sequence of subproblems in Eq (1.3). In the inner loop, a subproblem is solved according to the ZO version of the signum algorithm [6]. The stopping criterion is based on the norm of the momentum, which must be below a certain threshold. In the outer loop, the sequence of βi is proportional to the threshold needed to consider a subproblem solved, and it is driven to 0. Therefore, the smaller the value of βi (and thus better the approximation given by fβi), the larger the computational budget allotted for the resolution of the subproblem.
● A theoretical analysis of this algorithm is conducted. First, the expectation of the norm of the momentum is proved to converge to 0, with a convergence rate that depends on the step sizes. Then, the convergence rate in mean of the ZO-signum algorithm toward a stationary point of fβ is derived under Lipschitz continuity of the function F. Finally, if the function F is smooth and fβ is convex or becomes convex around its local minima, the rate of convergence to an ϵ-optimal point is derived for the SSO algorithm.
● Numerical experiments were conducted to evaluate the performance of the proposed algorithm for two applications. First, a comparison is made with traditional derivative-free algorithms in terms of the optimization of the storage cost of a solar thermal power plant model, which is a low-dimensional problem. Second, a comparison is made with other ZO algorithms in order to generate blackbox adversarial attacks, which are large-sized problems.
The remainder of this paper is organized as follows. In Section 2, the main assumptions and the Gaussian gradient estimator are described. In Section 3, the sequential optimization algorithm is presented, and its convergence properties are studied in Section 4. Section 5 presents numerical results, and Section 6 draws conclusions and discusses future work.
The assumptions concerning the stochastic blackbox function F are as follows:
Assumption 1. Let (Ω,F,P) be a probability space.
(a) The function satisfies F(⋅,ξ)∈L1(Ω,F,P) and f(x):=Eξ[F(x,ξ)] for all x∈Rn.
(b) F(⋅,ξ) is Lipschitz-continuous for any fixed value of ξ=(ξ1,ξ2), with the constant L0(F)>0, that is
|F(x,ξ1)−F(y,ξ2)|≤L0(F)||x−y||. |
Assumption 1(a) implies that the expectation of F(x,ξ) with respect to ξ is well defined on Rn and that the estimator F(x,ξ) is unbiased. Assumption 1(b) is commonly used to ensure convergence and bound the variance of the stochastic ZO oracle. It is worth noticing that no assumption is made regarding the differentiability of the objective function f or of its estimate F with respect to x, contrary to most work on ZO methods.
Under Assumption 1, a smooth approximation of the function f may be constructed via its convolution with a Gaussian random vector. Let u be an n-dimensional standard Gaussian random vector and β>0 be the smoothing parameter. Then, a smooth approximation of f is defined as
fβ(x):=1(2π)n2∫f(x+βu)e−||u||22du=Eu[f(x+βu)]. | (2.1) |
This estimator has been studied in the literature (especially in [34]); it has the benefits of several appealing properties. The properties used in this work are summarized in the following lemma:
Lemma 2.1. Under Assumption 1, the following statements hold for any integrable function f:Rn→R and its approximation fβ parameterized by β>0.
(1) fβ is infinitely differentiable: fβ∈C∞.
(2) A one-sided unbiased estimator of ∇fβ is
˜∇fβ(x):=u(f(x+βu)−f(x))β. | (2.2) |
(3) Let β2≥β1≥0; then, ∀x∈Rn
||∇fβ1(x)−∇fβ2(x)||≤L1(fβ1)(β2−β1)(n+3)32. |
Moreover, for β>0, fβ is L1(fβ)-smooth, i.e., fβ∈C1+ with L1(fβ)=2√nβL0(F).
(4) If f is convex, then fβ is also convex.
Proof. (1) It is a consequence of the convolution product between an integrable function and an infinitely differentiable kernel.
(2) See [34, Eq (22)].
(3) If u∼N(0,I), define the following for all x∈Rn
g(x)=fβ1(x)=Eu[f(x+β1u)]. |
Let μ=β2−β1≥0; it follows that for all x∈Rn
gμ(x)=Eu[g(x+μu)]=Eu[fβ1(x+μu)]=Eu[f(x+μu+β1u)]=Eu[f(x+β2u)]=fβ2(x). |
Then, since by [34, Lemma 2] under Assumption 1, fβ1 is Lipschitz continuously differentiable, [34, Lemma 3] may be applied to the function g and it follows that
||∇fβ1(x)−∇fβ2(x)||=||∇g(x)−∇gμ(x)||≤L1(fβ1)μ(n+3)32=L1(fβ1)(β2−β1)(n+3)32. |
(4) See [34, pp. 5].
The estimator obtained in Eq (2.2) may be adapted to the stochastic ZO oracle F. For instance, a one-sided (mini-batch) estimator of the noised function F is
˜∇fβ(x,ξ)=1qq∑j=1uj(F(x+βuj,ξj)−F(x,ξ0))β, | (2.3) |
where {uj}qj=1 and {ξj}qj=0 are q Gaussian random directional vectors and their associated q estimate values of the function F, respectively. This is still an unbiased estimator of ∇fβ because
Eu,ξ[˜∇fβ(x,ξ)]=Eu[Eξ[˜∇fβ(x,ξ)|u]]=∇fβ(x). | (2.4) |
The result of Lemma 2.1(3) is essential to understand why solving a sequence of optimization problems defined by Eq (1.3) may be efficient, although it might seem counterproductive at first sight. Below are examples of the advantages of treating the problem with sequential smoothed function optimization.
● The subproblems are approximations of the original problem and it is not necessary to solve them exactly. Thus, an appropriate procedure for solving these problems with increasingly fine precision can be used. Moreover, as seen in Lemma 2.1(3), the norm of the gradient obtained in a subproblem is close to the one of the following subproblem. The computational effort to find a solution to the second subproblem from the solution of the first should therefore not be important.
● The information collected during the optimization process for a subproblem may be reused in the subsequent subproblems since they are similar.
● A specific interest in the case of smooth approximation is the ability of using a larger value of β to solve the first subproblems. It allows for a better exploration of the space and convexification phenomenon of the function (see Figure 1). Moreover, the new step size may be used for each subproblem; it allows for an increase in the step size momentarily, in the hope of having a greater chance of escaping a local minimum.
Section 3.1 presents a ZO version of the signum algorithm [6] to solve Subproblem (1.3) for a given βi and Section 3.2 presents the complete algorithm used to solve the sequential optimization problem.
A ZO version of the signum algorithm (Algorithm 2 of [6]) is used to solve the subproblems. The signum algorithm is a momentum version of the sign-SGD algorithm. In [30], the authors extended the original sign-SGD algorithm to a ZO version of this algorithm. However, a ZO version of signum is not studied in the work of [30]. As the signum algorithm has been shown to be competitive with the ADAM algorithm [6], a ZO version of this algorithm seems interesting to consider. For completeness, the versions of the sign-SGD and the signum algorithms as they originally appeared in [6] are given in Appendix 6. There is an important difference between the original signum algorithm and its ZO version presented in Algorithm 1. Indeed, while the step size of the momentum 1−s2 is kept constant in the work of [6], it is driven to 0 in our work.
Algorithm 1 ZO-signum algorithm to solve subproblem i∈N. |
1: Input: xi,0,mi,0,βi,si,01,si,02, L, q, M |
2: Set k=0 |
3: Define step-size sequences si,k1=si,01(k+1)α1 and si,k2=si,02(k+1)α2 |
4: while ||mi,k||>Lβi4β0 or k≤M do |
5: Draw q samples uk from the Gaussian distribution N(0,I) |
6: Calculate the average of the q Gaussian estimate ˜∇fβi(xi,k,ξi,k) from Eq (2.3) |
7: Update: |
mi,k+1=si,k2˜∇fβi(xi,k,ξk)+(1−si,k2)mi,k(3.1) |
xi,k+1j=xi,kj−sk1sign(mi,k+1j)∀j∈[1,n](3.2) |
8: k←k+1 |
9: end while |
10: Return mi,k and xi,k |
This leads to two consequences. First, the variance is reduced since the gradient is averaged on a longer time horizon, without using mini-batch sampling. Second, as it has been demonstrated in other stochastic approximation works ([7, Section 3.3] and [40]), with carefully chosen step sizes the norm of the momentum goes to 0 with probability one. In the ZO-signum algorithm, the norm of the momentum is thus used as a stopping criterion.
The optimization of the subproblem sequence described in Eq (1.3) is driven by the SSO algorithm presented in Algorithm 2. The value of β plays a critical role, as it serves as both the smoothing parameter and the stopping criterion for Algorithms 1 and 2. Algorithm 2 is inspired by the MADS algorithm [1] as it is based on two steps: a search step and a local step. The search step is optional, may use any heuristics and is required only for problems with relatively small dimensions. In Algorithm 2, an example of a search is given; it consists of updating x after M iterations of the ZO-signum algorithm with the best known x found so far. The local step is then used: Algorithm 1 is launched for each subproblem i with specific values of βi and step-size sequences. Once Algorithm 1 meets the stopping criterion (which depends on the value of βi), the value of βi and the initial step-sizes si,01 and si,02 are reduced, and the algorithm proceeds to the next subproblem. The convergence is guaranteed by the local step, since the search step is run only a finite number of times.
It is worth noting that the decreasing rate of βi is chosen so that the difference between subproblems i and i+1 is not significant. Therefore, the information collected in subproblem i, through the momentum vector m, can be used in subproblem i+1. Furthermore, the initial step sizes si,01 and si,02 decrease with each iteration, allowing us to focus our efforts quickly toward a local optimum when s0,01 and β0 are chosen to be relatively large.
Algorithm 2 SSO algorithm. |
1: Initialization: |
2: Set x0,0∈Rn, β0>0 and N as the maximum number of function calls for the search step |
3: Set q as the number of gradient estimates at each iteration of ZO-signum (ZOS) algorithm |
4: Set M the minimum number of iterations made by the ZOS algorithm on a subproblem |
5: C is the cache containing all of the evaluated points |
6: Set m0,0=˜∇fβ0(x0,0,ξ0) and L=+∞ |
7: Set s0,01>0 and s0,02>0 |
8: Set i=0 |
9: Search step (optional): |
10: while M(i+1)q≤N: do |
11: Solve subproblem i with Algorithm 1: |
mi+1,0=ZOS(xi,0,mi,0,βi,si,01,si,02,L,q,M)xi+1,0∈argminx∈CF(x,ξ) |
12: Update βi, si,01 and si,02 as in Step 18 |
13: end while |
14: L=||m0,0|| |
15: Local step: |
16: while βi>ϵ do |
17: Solve subproblem i with Algorithm 1: |
mi+1,0,xi+1,0=ZOS(xi,0,mi,0,βi,si,01,si,02,L,q,M) |
18: Update: |
βi=β0(i+1)2,si,01=s0,01(i+1)32,si,02=s0,02i+1i←i+1 |
19: end while |
20: Return xi |
The convergence analysis is conducted in two steps: first the convergence rate in mean is derived for Algorithm 1 and then the rate of convergence to an ϵ-optimal point is derived for Algorithm 2.
The analysis of Algorithm 1 follows the general methodology given in [6, Appendix E]. In the following subsection, the main result in [6] is recalled for completeness. The next subsections are devoted to bound the variance and bias terms when limk→∞si,k2=0. Finally, these results are used to obtain the convergence rate in mean of Algorithm 1 in the non-convex and convex case. The last subsection is devoted to a theoretical comparison with other ZO methods of the literature. The subproblem index i is kept constant throughout this section. In order to better convey the convergence analysis of the ZO-signum algorithm, a hierarchical workflow of the different theoretical results is presented in Table 1. The main results are presented in Theorem 4.1 and its corollary for the non-convex case, and Theorem 4.2 for the convex case.
Assumptions on F | Preliminary results | Intermediate results | Main results | When fβ is convex |
Assumption 1 which implies that L1(fβi)=2√nL0(F)βi | Proposition 4.1 | |||
Lemma 4.1 | Proposition 4.2 | Theorem 4.1 Corollary 4.1 |
Theorem 4.2 | |
Lemma 4.2 | ||||
Lemma 4.3 | ||||
Lemma 4.4 | Proposition 4.3 | |||
Lemma 4.5 |
The following proposition uses the Lipschitz continuity of the function fβi (proved in Lemma 2.1) to bound the gradient at the kth iteration.
Proposition 4.1. [6] For the subproblem i∈N, under Assumption 1 and in the setting of Algorithm 1, we have
si,k1E[||∇fβi(xi,k)||1]≤E[fβi(xi,k)−fβi(xi,k+1)]+nL1(fβi)2(si,k1)2+2si,k1E[||ˉmi,k+1−∇fβi(xi,k)||1]⏟bias+2si,k1√n√E[||mi,k+1−ˉmi,k+1||22]⏟variance, | (4.1) |
where ˉmi,k+1j is defined recursively as ˉmi,k+1j=si,k2∇fβi(xi,k)+(1−si,k2)ˉmi,kj.
Proof. See Appendix B.
Now, it remains to bound the three terms on the right side of Inequality (4.1).
The three following lemmas are consecrated to bound the variance term. Unlike the work reported in [6], the variance reduction is conducted by driving the step size of the momentum to 0. It avoids the need to sample an increasing number of stochastic gradients at each iteration, which may be problematic, as noted in [30]. To achieve this, the variance term is first decomposed in terms of expectation of the squared norm of the stochastic gradient estimators ˜g.
Lemma 4.1. For the subproblem i∈N, let k∈N and j∈[1,n]; we have
E[||mi,k+1−ˉmi,k+1||2]≤(si,k2)2E[||˜gi,k||2]+k−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2E[||˜gi,r||2]+k∏t=0(1−si,t2)2E[||˜gi,0||2], |
where ˜gi,rj=˜∇fβi(xi,r,ξr),∀r∈[0,k] is defined in Eq (2.3) and the norm is ||⋅||2.
Proof. Let k∈N; by definition of mi,k and ˉmi,k, it follows that
||mi,k+1−ˉmi,k+1||2=(si,k2)2||˜gi,k−∇fβi(xi,k)||2+(1−si,k2)2||mi,k−ˉmi,k||2+2si,k2(1−si,k2)(˜gi,k−∇fβi(xi,k))T(mi,k−ˉmi,k). |
The expectation of this expression is
E[||mi,k+1−ˉmi,k+1||2]=(si,k2)2E[||˜gi,k−∇fβi(xi,k)||2]+(1−si,k2)2E[||mi,k−ˉmi,k||2]+2si,k2(1−si,k2)E[(˜gi,k−∇fβi(xi,k))T(mi,k−ˉmi,k)]. | (4.2) |
Now, introducing the associated sigma field of the process Fi,k=σ(xj,t,mj,t,ˉmj,t;j≤i,t≤k) by the law of total expectation, it follows that
E[(˜gi,k−∇fβi(xi,k))T(mi,k−ˉmi,k)]=E[E[(˜gi,k−∇fβi(xi,k))T(mi,k−ˉmi,k)|Fi,k]]=E[(E[˜gi,k|Fi,k]−∇fβi(xi,k))T(mi,k−ˉmi,k)]=0, |
where the second equality holds because mi,k,ˉmi,k and ∇fβi(xi,k) are fixed conditioned on Fi,k and because E[˜gi,k|xi,k]=∇fβi(xi,k) where ˜gi,k is an unbiased estimator of the gradient by Eq (2.4). By substituting this result in (4.2), it follows that
E[||mi,k+1−ˉmi,k+1||2]=(si,k2)2E[||˜gi,k−∇fβi(xi,k)||2]+(1−si,k2)2E[||mi,k−ˉmi,k||2]. |
By repeating this process iteratively, we obtain
E[||mi,k+1−ˉmi,k+1||2]=(si,k2)2E[||˜gi,k−∇fβi(xi,k)||2]+k−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2E[||˜gi,r−∇fβi(xi,r)||2]+k∏t=0(1−si,t2)2E[||˜gi,0−∇fβi(xi,0)||2]. | (4.3) |
Finally, by observing that ∀r∈[0,k],E[˜gi,r|xi,r]=∇fβi(xi,r) and by the law of total expectation, we obtain
E[||˜gi,r−∇fβi(xi,r)||2]=E[||˜gi,r−E[˜gi,r|xi,r]||2]=E[||˜gi,r||2]−E[||∇fβi(xi,r)||2]≤E[||˜gi,r||2]. |
Introducing this inequality to Eq (4.3) completes the proof.
Second, the expectation of the squared norm of the stochastic gradient estimators are bounded by a constant depending quadratically on the dimension.
Lemma 4.2. Let i∈N, r∈[0,k] and j∈[1,n]; then under Assumption 1, we have
E[||˜gi,r||2]≤L0(F)2(n+4)2, |
where L0(F) is the Lipschitz constant of F.
Proof. By Eq (2.3) with q=1, it follows that
E[||˜gi,r||2]=E[||u||2(βi)2(F(xi,r+βiu,ξ1)−F(xi,r,ξ0))2]≤L0(F)2E[||u||4]≤L0(F)2(n+4)2 |
where the first inequality follows from Assumption 1(b) and the second by [34, Lemma 1].
Finally, a technical lemma bounds the second term of the decomposition of Lemma 4.1 by a decreasing sequence. It achieves the same rate of convergence as in [6] without sampling any stochastic gradient.
Lemma 4.3. For the subproblem i∈N, let si,k2 be defined such that si,k2=si,02(k+1)α2 with α2∈(0,1) and si,02∈(0,1); then, for k such that
k(k+1)α2≥ln(si,02)+(1+α2)ln(k)si,02, | (4.4) |
the following inequality holds
k−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2≤9si,02kα2. | (4.5) |
Proof. Let k∈N; as in [6], the strategy consists of breaking up the sum in order to bound both terms separately.
k−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2=⌊k/2⌋−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2+k−1∑r=⌊k/2⌋(si,r2)2k−1∏t=r(1−si,t+12)2≤(1−si,k2)2⌊k/2⌋⌊k/2⌋−1∑r=0(si,r2)2+(si,⌊k/2⌋−12)2k−1∑r=⌊k/2⌋(1−si,k2)2(k−r−1)≤(si,02)2⌊k/2⌋(1−si,k2)2⌊k/2⌋+8(si,02)2k2α2⌊k/2⌋∑r=0(1−si,k2)2r≤(si,02)2k(1−si,k2)2⌊k/2⌋+8(si,02)2k2α2(1−(1−si,k2)2)≤(si,02)2k(1−si,k2)2⌊k/2⌋+8si,02kα2(2−si,k2). |
Now, we are looking for k such that
si,02k(1−si,k2)2⌊k/2⌋≤1kα2⇔e2⌊k/2⌋ln(1−si,k2)≤1(si,02)k1+α2. |
As, ln(1−x)≤−x, it is sufficient to find k such that
e−si,02k(k+1)α2≤1(si,02)k1+α2⇔k(k+1)α2≥ln(si,02)+(1+α2)ln(k)si,02. |
Taking such a k allows us to complete the proof.
Combining the three previous lemmas allows the bounding of the variance term in Proposition 4.1.
Proposition 4.2. In the setting of Lemmas 4.2 and 4.3 and under Assumption 1(b), the variance term of Proposition 4.1 is bounded by
E[||mi,k+1−ˉmi,k+1||22]≤9si,02L0(F)2(n+4)2kα2+o(1kα2). |
Proof. By Lemmas 4.1 and 4.2, it follows that
E[||mi,k+1−ˉmi,k+1||2]≤(si,k2)2E[||˜gi,k||2]+k−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2E[||˜gi,r||2]+k∏t=0(1−si,t2)2E[||˜gi,0||2]≤((si,k2)2+k−1∑r=0(si,r2)2k−1∏t=r(1−si,t+12)2+k∏t=0(1−si,t2)2)L0(F)2(n+4)2. |
Now as (si,k2)2=o(1kα2) and ∏kt=0(1−si,t2)2=o(1kα2), the result follows from Lemma 4.3.
First, the bias term is bounded by a sum depending on sk1 and sk2.
Lemma 4.4. For the subproblem i∈N and at iteration k∈N of the Algorithm 1, we have
E[||ˉmi,k+1−∇fβi(xi,k)||1]≤2nL1(fβi)(k−1∑l=0si,l1k−1∏t=l(1−si,t+12)). |
Proof. Foremost, observe that the quantity
Si,k:={1,if k=0,si,k2+k−1∑r=0si,r2k−1∏t=r(1−si,t+12)+k∏t=0(1−si,t2),otherwise, | (4.6) |
may be written recursively as
Si,k={1,if k=0,si,k2+(1−si,k2)Si,k−1,otherwise. |
Note that in its second expression Si,k=1 for all k. Therefore, by definition of ˉmi,kj and the previous result on Si,k, it follows that
ˉmi,k=si,k2∇fβi(xi,k)+k−1∑r=0si,r2k−1∏t=r(1−si,t+12)∇fβi(xi,r)+k∏t=0(1−si,t2)∇fβi(xi,0),∇fβi(xi,k)=(si,k2+k−1∑r=0si,r2k−1∏t=r(1−si,t+12)+k∏t=0(1−si,t2))∇fβi(xi,k). |
Thus
E[||ˉmi,k+1−∇fβi(xi,k)||1]≤k−1∑r=0si,r2k−1∏t=r(1−si,t+12)E[||∇fβi(xi,r)−∇fβi(xi,k)||1]+k∏t=0(1−si,t2)E[||∇fβi(xi,0)−∇fβi(xi,k)||1]. | (4.7) |
By the smoothness of the function fβi, Lemma F(3) of [6] ensures that ∀r∈[0,k−1]
||∇fβi(xi,r)−∇fβi(xi,k)||1≤k−1∑l=r||∇fβi(xi,l+1)−∇fβi(xi,l)||1 ≤ 2nL1(fβi)k−1∑l=rsi,l1. |
Substituting this inequality into Eq (4.7) gives
E[||ˉmi,k+1−∇fβi(xi,k)||1]≤2nL1(fβi)Si,k1, | (4.8) |
where
Si,k1=k−1∑r=0si,r2k−1∑l=rsi,l1k−1∏t=r(1−si,t+12)+k−1∑l=0si,l1k∏t=0(1−si,t2). |
Reordering the terms in Sk1, we obtain
Si,k1=k−1∑l=0si,l1(l∑r=0si,r2k−1∏t=r(1−si,t+12)+k∏t=0(1−si,t2))=k−1∑l=0si,l1(si,l2k−1∏t=l(1−si,t+12)+l−1∑r=0si,r2k−1∏t=r(1−si,t+12)+k∏t=0(1−si,t2))=k−1∑l=0si,l1k−1∏t=l(1−si,t+12)(si,l2+l−1∑r=0si,r2l−1∏t=r(1−si,t+12)+l∏t=0(1−si,t2))⏟Si,l=1=k−1∑l=0si,l1k−1∏t=l(1−si,t+12), |
which completes the proof.
Second, the sum may be bounded by a term decreasing with k.
Lemma 4.5. For the subproblem i∈N, let si,k2=si,02(k+1)α2 and si,k1=si,01(k+1)α1 with si,01∈(0,1),si,02∈(0,1) and 0<α2<α1<1; then, for k such that
k(k+1)α2≥2(ln(si,02)+(1+α1−α2)ln(k))si,02, | (4.9) |
the following inequality holds
k−1∑l=0si,l1k−1∏t=l(1−si,t+12)≤5si,01si,02kα1−α2. | (4.10) |
Proof. The proof follows the proof of Lemma 4.3. The sum is partitioned as follows:
k−1∑l=0si,l1k−1∏t=l(1−si,t+12)=⌊k/2⌋−1∑l=0si,l1k−1∏t=l(1−si,t+12)+k−1∑l=⌊k/2⌋−1si,l1k−1∏t=l(1−si,t+12)≤(1−si,k2)⌊k/2⌋⌊k/2⌋−1∑l=0si,l1+si,⌊k/2⌋−11k−1∑l=⌊k/2⌋−1(1−si,k2)k−r−1≤si,01k(1−si,k2)⌊k/2⌋+4si,01kα1(1−(1−si,k2))=si,01si,02k(1−si,k2)⌊k/2⌋si,02+4si,01si,02kα1−α2. |
Now, as in Lemma 4.3 taking k such that
k(k+1)α2≥2(ln(si,02)+(1+α1−α2)ln(k))si,02 |
ensures that si,02k(1−si,k2)⌊k/2⌋≤1kα1−α2, which completes the proof.
Finally, using the two previous lemmas allows for bounding of the bias term.
Proposition 4.3. In the setting of Lemma 4.5, the bias term of Proposition 4.1 is bounded by
E[||ˉmi,k+1−∇fβi(xi,k)||1]≤10nL1(fβi)si,01si,02kα1−α2. |
Proof. The proof is a straightforward consequence of Lemmas 4.4 and 4.5.
As the different terms in the inequality of Proposition 4.1 have been bounded, the main result of this section may be derived as in the following theorem.
Theorem 4.1. For a subproblem i∈N and under Assumption 1, let α1∈(0,1), α2∈(0,α1), 0<si,01, si,02<1 and K>C where C∈N satisfies Eqs (4.4) and (4.9); we have
E[||∇fβi(xi,R)||1]≤1K1−α1−CKα1(Difsi,01+n√nL0(F)si,01βiK∑k=C1k2α1+6√si,02L0(F)√n(n+4)K∑k=C1kα1+α22+40L0(F)si,01n√nsi,02βiK∑k=C1k2α1−α2), | (4.11) |
where fβi(xi,C)−minxfβi(x)≤Dif, L0(F) is the Lipschitz constant of F and R is randomly picked from a uniform distribution in [C,K].
Proof. Let C∈N satisfy Eqs (4.4) and (4.9) and, summing over the inequality in Proposition 4.1, it follows that
K∑k=Csi,k1E[||∇fβi(xi,k)||1]≤E[fβi(xi,C)−fβi(xi,K+1)]+nL1(fβi)2K∑k=C(si,k1)2+2√nK∑k=Csi,k1√E[||mi,k+1−ˉmi,k+1||22]+2K∑k=Csi,k1E[||ˉmi,k+1−∇fβi(xi,k)||1]. |
By substituting the results of Propositions 4.2 and 4.3 in the previous inequality, we obtain
K∑k=Csi,k1E[||∇fβi(xi,k)||1]≤E[fβi(xi,C)−fβi(xi,K+1)]+nL1(fβi)2K∑k=C(si,k1)2+6√si,02L0(F)(n+4)√nK∑k=Csi,01kα1+α22+20L1(fβi)si,01nsi,02K∑k=Csi,01k2α1−α2. |
Dividing both sides by si,01K−α1(K−C), picking R randomly uniformly in [C,K] and using the definition of Dif given that minxf(x)≤f(x) for all x, we get
E[||∇fβi(xi,R)||1]=1K−CK∑k=CE[||∇fβi(xi,k)||1]≤1K−CK∑k=CKα1kα1E[||∇fβi(xi,k)||1]≤1K1−α1−CKα1(Difsi,01+nL1(fβi)si,012K∑k=C1k2α1+6√si,02L0(F)(n+4)√nK∑k=C1kα1+α22+20L1(fβi)si,01nsi,02K∑k=C1k2α1−α2). |
Recalling that L1(fβi)=2√nL0(F)βi (see [34, Lemma 2]) completes the proof.
This theorem allows one to prove the convergence rate in mean of the norm of the gradient when α1 and α2 are chosen adequately. In particular, the following corollary provides the convergence rate when α1=34 and α2=12.
Corollary 4.1. Under the same setting of Theorem 4.1 with βi≈1 α1=34, α2=12, si,01=1n34 and si,02≈1, we have
E[||∇fβi(xi,R)||2]=O(n32K1/4ln(K)). | (4.12) |
Proof. The result is a direct consequence of Theorem 4.1 with the specified constant, and it can be obtained by noting that ||⋅||2≤||⋅||1 in Rn.
In [15,20,30], the function F is assumed to be smooth with a Lipschitz continuous gradient. In the present work, F is only assumed to be Lipschitz continuous. This has two main consequences on the result of convergence: the dependence of the dimension on the convergence rate is larger. Furthermore, while β must be chosen relatively small in the smooth case, it is interesting to note that it does not have to be this way in the nonsmooth case.
The convergence rate results for the ZO-signum algorithm has been derived in the non-convex case. In the next theorem, they are derived for the case when the function fβi is convex.
Theorem 4.2. Under Assumption 1, suppose moreover that fβi is convex and there exists ρ such that ρ=maxk∈N||xi,k−xi,∗||; then, by setting
si,k1=2ρ(k+1),si,k2=1(k+1)23andΓk:=k∏l=2(1−2k+1)=2k(k+1)withΓ1=1, | (4.13) |
it follows that
E[fβi(xi,K)−fβi(x∗)]≤4ρ2n√nL0(F)βiK13 | (4.14) |
and
E[||∇fβi(xi,R)||]≤2L0(F)K2+4ρn√nL0(F)βiK13, | (4.15) |
where R is a random variable in [0,K−1] whose the probability distribution is given by
P(R=k)=si,k1/Γk+1∑K−1k=0si,k1/Γk+1. |
Proof. Under the assumptions in the statement of Theorem 4.2, it follows by Proposition 4.1 that
E[fβi(xi,k+1)−fβi(xi,∗)]≤E[fβi(xi,k)−fβi(xi,∗)]−si,k1E[||∇fβi(xi,k)||]+nL1(fβi)2(si,k1)2+2si,k1E[||ˉmi,k+1−∇fβi(xi,k)||1]+2si,k1√n√E[||mi,k+1−ˉmi,k+1||2]≤E[fβi(xi,k)−fβi(xi,∗)]−sk1E[||∇fβi(xi,k)||]+4ρ2n√nL0(F)βi(k+1)43, | (4.16) |
where the last inequality follows thanks to Propositions 4.2 and 4.3 with L1(fβi)=2L0(F)√nβi and the values of si,k1 and si,k2. Now, by convexity assumption of fβi and the bound ρ, the following holds
fβi(xi,k)−fβi(xi,∗)≤∇fβi(xi,k)T(xi,k−xi,∗)≤||∇fβi(xi,k)||||xi,k−xi,∗||≤ρ||∇fβi(xi,k)||. |
Thus, by substituting this result into Eq (4.16), it follows that
E[fβi(xi,k+1)−fβi(xi,∗)]≤(1−2(k+1))E[fβi(xi,k)−fβi(xi,∗)]+4ρ2n√nL0(F)βi(k+1)43. |
Now by dividing both sides of the equation by Γk+1 and summing up the inequalities, it follows that
E[fβi(xi,K)−fβi(xi,∗)]ΓK≤4ρ2n√nL0(F)βiK−1∑k=01Γk+1(k+1)43≤4ρ2n√nL0(F)βiK−1∑k=0(k+1)23. |
Thus
E[fβi(xi,K)−fβi(xi,∗)]≤4ρ2n√nL0(F)βiΓKK−1∑k=0(k+1)23≤4ρ2n√nL0(F)βiK13. |
Now, the second part of the proof may be demonstrated. By Eq (4.16), it also follows that
si,k1E[||∇fβi(xi,k)||]≤E[fβi(xi,k)−fβi(xi,∗)]−E[fβi(xi,k+1)−fβi(xi,∗)]+4ρ2n√nL0(F)βi(k+1)43. |
As in the previous part, by dividing both sides by Γk+1, summing up the inequalities and noting that ¯fk=E[fβi(xi,k)−fβi(xi,∗)], we obtain
K−1∑k=0si,k1Γk+1E[||∇fβi(xi,k)||]≤K−1∑k=0¯fk−¯fk+1Γk+1+4ρ2n√nL0(F)βiK−1∑k=01Γk+1(k+1)43. |
Then, again by dividing both sides by ∑K−1k=0si,k1Γk+1 it follows that
E[||∇fβi(xi,R)||]=∑K−1k=0si,k1Γk+1E[||∇fβi(xi,k)||]∑K−1k=0si,k1Γk+1≤1∑K−1k=0si,k1Γk+1(K−1∑k=0E[¯fk−¯fk+1]Γk+1+4ρ2n√nL0(F)βiK−1∑k=01Γk+1(k+1)43), |
where R is a random variable whose distribution is given in the statement of the theorem. Now, as in Eq (2.21) of [5], the following inequalities hold
K−1∑k=0¯fk−¯fk+1Γk+1≤¯f0+K−1∑k=12Γk+1(k+1)¯fk and K−1∑k=0si,k1Γk+1=ρΓK. |
Thus, by substituting these in the inequality involving the expectation, we obtain
E[||∇fβi(xi,R)||]≤ΓKρ(E[¯f0]+K−1∑k=12Γk+1(k+1)E[¯fk]+4ρ2n√nL0(F)βiK−1∑k=01Γk+1(k+1)43)≤ΓKρ(E[¯f0]+8ρn√nL0(F)βiK−1∑k=01Γk+1(k+1)43)≤2L0(F)K2+8ρn√nL0(F)βiK13, |
where the second inequality follows from Eq (4.14).
The result obtained in Eq (4.12) is consistent with the convergence results of other ZO methods. To gain a better understanding of its performance, this result is compared with those of four other algorithms from different perspectives: the assumptions, the measure used, the convergence rate and the function query complexity. All methods seek a solution to a stochastic optimization problem; the comparison is presented in Table 2. Since the convergence rates of the ZO-signum and ZO-signSGD algorithms are measured by using ||∇f(x)||, although ||∇f(x)||2 is used for ZO-adaMM and ZO-SGD, Jensen's inequality is used to rewrite the convergence rates in terms of the gradient norm.
Method | Assumptions | Measure | Convergence rate | Queries |
ZO-SGD [20] | F(⋅,ξ)∈C1+ | E[||∇f(xR)||2] | O(√σn14K14+√n√K) | O(K) |
E[||∇F(x,ξ)−∇f(x)||2]≤σ2 | ||||
ZO-signSGD [30] | F(⋅,ξ)∈C0+ | E[||∇f(xR)||2] | O(√n√K+√n√b+n√bq) | O(bqK) |
F(⋅,ξ)∈C1+ | ||||
||∇F(x,ξ)||2≤η | ||||
ZO-adaMM [15] | F(⋅,ξ)∈C0+ | E[||∇f(xR)||2] | O((√nK14+n√K)(ln(K)+ln(n))14) | O(K) |
F(⋅,ξ)∈C1+ | ||||
||∇F(x,ξ)||∞≤η | ||||
ZO-Signum | F(⋅,ξ)∈C0+ | E[||∇fβ(xR)||2] | O(n√nK14ln(K)) | O(K) |
ZO-Signum | F(⋅,ξ)∈C0+, f convex | E[fβi(xi,K)−fβi(xi,∗)] | O(n√nK13) | O(K) |
ZO-SGD [34] | F(⋅,ξ)∈C0+, f convex | E[f(xi,K)−f(xi,∗)] | O(n√K) | O(K) |
Modified ZSCG [5] | F(⋅,ξ)∈C1+, F convex | E[f(xi,K)−f(xi,∗)] | O(σ√n√K) | O(K) |
E[||∇F(x,ξ)−∇f(x)||2]≤σ2 |
● for ZO-SGD [20]
E[||∇f(x)||]≤√E[||∇f(x)||2]≤√O(σ√n√K+nK)≤O(√σn14K14+√n√K); |
● for ZO-adaMM [15]
E[||∇f(x)||]≤√E[||∇f(x)||2]≤√O((n√K+n2K)√ln(K)+ln(n))≤O((√nK14+n√K)(ln(K)+ln(n))14), |
where the third inequalities are due to √a2+b2≤a+b, for a,b≥0. For ZO-signSGD, unless the value of b depends on K, the algorithm's convergence is only guaranteed within some ball around the solution, making it difficult to compare with other methods. Thus, in the non-convex case, after this transformation, it becomes apparent that ZO-signum has a convergence rate that is O(n34√σ) and O(√n) worse than those of ZO-SGD and ZO-adaMM, respectively. This may be attributed to the milder assumption made on the function F in the present work, which also explains why the convergence is relative to fβ. In the convex case, ZO-signum has a convergence rate that is O(nK16σ) worse that of ZSCG and O(√nK16) worse that of ZO-SGD. This may be explained by the sign(⋅) operator loosing the magnitude information of the gradient when it is applied. This problem may be fixed as in [23] but it outside the scope of this work. Finally, all methods except ZO-signSGD are momentum-based versions of the original ZO-SGD method. Although the momentum-based versions are mostly used in practice, it is interesting to notice that none of these methods possess a better convergence rate than the original ZO-SGD method. The next section provides some clues about the interests of the momentum-based method.
The convergence analysis from the previous subsection is in mean, i.e., it establishes the expected convergence performance over many executions of the ZO-signum algorithm. As in [20], we now focus on the performance of a single run. A second hierarchical workflow of the different theoretical results is presented in Table 3.
Assumptions on F | Preliminary results | Intermediate results | Main result | When fβ is convex | ||
Assumptions 1, 2 and 3 which imply L1(fβi)≤L1(f) | Lemma 4.6 | Theorem 4.3 (ⅰ) | Theorem 4.3 (ⅱ) | |||
Proposition 4.3 | Lemma 4.7 | Lemma 4.8 | Lemma 4.9 | |||
Lemma 2.1(3) | ||||||
Theorem 4.1 | ||||||
Proposition 4.2 | Lemma 4.10 | |||||
Proposition 4.3 |
Unlike [20], our analysis is based on a sequential optimization framework rather than a post-optimization process. Our SSO algorithm uses the norm of the momentum as an indicator of the quality of the current solution. In order to analyze the rate of convergence of this algorithm, the following additional assumptions are made regarding the function F. The first assumption concerns the smoothness of the function F. The assumption of smoothness is used only to guarantee that L1(fβi) is a constant with respect to βi, contrary to the non-smooth case (see [34, Eq (12)]).
Assumption 2. The function F(⋅,ξ) has a L1(F)-Lipschitz continuous gradient.
The second assumption concerns the local convexity of the function fβ.
Assumption 3. Let (xi,0) be a sequence of points produced by Algorithm 2 and xi,∗ a sequence of local minima of fβi. We assume that there exists a threshold I∈N and a radius ρ>0 such that ∀i≥I:
(1) fβi is convex on the ball Bρ(xi,∗):={x∈Rn:||x−xi,∗||<ρ};
(2) xi,0∈Bρ(xi,∗).
Under these assumptions, we will prove that if the norm of the momentum vector m is below some threshold, then this threshold can be used to bound the norm of the gradient. Second, an estimate for the number of iterations required to reduce the norm of m below the threshold is provided. The next lemma is simply technical and demonstrates the link between ˉm and m.
Lemma 4.6. For any subproblem i∈N and iteration k≥1, the following equality holds
E[mi,k|xi,k−1]=E[ˉmi,k|xi,k−1], |
where ˉmi,k is defined recursively in Proposition 4.1.
Proof. The proof is conducted by induction on k. For k=1, setting mi,0=˜∇fβi(xi,0,ξ0) implies that
mi,1=si,02˜∇fβi(xi,0,ξ0)+(1−si,02)mi,0=˜∇fβi(xi,0,ξ0). |
In the same way, ˉmi,1=∇fβi(xi,0). Therefore, we have
E[mi,1|xi,0]=E[˜∇fβi(xi,0,ξ0)|xi,0]=∇fβi(xi,0)=E[∇fβi(xi,0)|xi,0]=E[ˉmi,1|xi,0]. |
Now, suppose that the induction assumption is true for a given k∈N; then,
E[mi,k+1|xi,k]=si,k2∇fβi(xi,k)+(1−si,k2)E[mi,k|xi,k]. |
Now, by the law of total expectation
E[mi,k|xi,k]=E[E[mi,k|xi,k,xi,k−1]|xi,k]=E[E[mi,k|xi,k−1]|xi,k]=E[E[ˉmi,k|xi,k−1]|xi,k](by the induction assumption)=E[ˉmi,k|xi,k]. |
Thus as E[∇fβi(xi,k)|xi,k]=∇fβi(xi,k), it follows that
E[mi,k+1|xi,k]=si,k2∇fβi(xi,k)+(1−si,k2)E[mi,k|xi,k]=si,k2E[∇fβi(xi,k)|xi,k]+(1−si,k2)E[ˉmi,k|xi,k]=E[ˉmi,k+1|xi,k], |
which completes the proof.
The following lemma shows that if ||m|| is below a certain threshold, then this threshold can be used to bound the norm of the gradient.
Lemma 4.7. For a subproblem i∈N, let Ki∈N denote the first iteration in Algorithm 1 for which ||mi,Ki||≤Lβi4β0; then, under Assumption 3 the norm of the gradient of the function fβi at xi,K may be bounded as follows
||∇fβi(xi,Ki)||≤Lβi4β0+10nL1(F)si,01si,02Kα1−α2i. |
Moreover, if the problem i+1 is considered, the gradient of the function fβi+1 may be bounded at the point xi,Ki=xi+1,0 as follows:
||∇fβi+1(xi+1,0)||≤||∇fβi(xi,Ki)||+L1(F)(n+3)32(βi−βi+1). |
Proof. Let Ki be taken as in the statement of the lemma. The norm of the gradient may be bounded as follows:
||∇fβi(xi,Ki)||≤||E[mi,Ki|xi,Ki]||+||∇fβi(xi,Ki)−E[mi,Ki|xi,Ki]||≤E[||mi,Ki|||xi,Ki]+||∇fβi(xi,Ki)−E[ˉmi,Ki|xi,Ki]||, |
where the second inequality follows from Jensen's inequality and Lemma 4.6. Now, using ||mi,Ki|| ≤Lβi4β0, E[∇fβi(xi,K)|xi,Ki]=∇fβi(xi,Ki), L1(fβi)≤L1(F) and the result of Proposition 4.3 completes the first part of the proof
||∇fβi(xi,Ki)||≤Lβi4β0+E[||∇fβi(xi,Ki)−ˉmi,Ki|||xi,Ki]≤Lβi4β0+10nL1(F)si,01si,02Kα1−α2i. |
The second part of the proof follows directly by applying the triangular inequality and the result in Lemma 2.1(3) because xi,Ki=xi+1,0.
Under Assumption 2, the expected difference between the values of fβi at xi,0 and its optimal value is bounded in the next lemma.
Lemma 4.8. Let I be the threshold from Assumption 2. If i≥I, then
E[fβi+1(xi+1,0)−fβi+1(xi+1,∗)]≤ρ(Lβi4β0+10nL1(F)si,01si,02Kα1−α2i+L1(F)(n+3)32(βi−βi+1)). | (4.17) |
Proof. Convexity of the function fβi on the ball Bρ(xi,∗) implies that
E[fβi+1(xi+1,0)−fβi+1(xi+1,∗)]≤E[⟨∇fβi+1(xi+1,0),xi+1,0−xi+1,∗⟩]≤E[||∇fβi+1(xi+1,0)||||xi+1,0−xi+1,∗||]. |
The result follows by using Lemma 4.7 and because xi+1,0 belongs to the ball Bϵ(xi,∗).
Moreover, an estimate of the number of iterations required to reduce the norm of the gradient below some threshold may be given.
Lemma 4.9. Under Assumptions 1–3, for a subproblem i>I and in the setting of Algorithm 2, let si,02∈R+ be such that k=1 in Eqs (4.4) and (4.9); assume that L=max(L0(F),L1(F)), α1=34 and α2=12. Then, for a uniformly randomly chosen R∈[0,Ki], it follows that
P(||∇fβi(xi,R)||≥Lβi4β0)≤4β0βiK14i(Ai+Bi), |
where Ai and Bi are defined in Eq (4.18).
Proof. Markov's inequality implies that
P(||∇fβi(xi,R)||≥Lβi4β0)≤4β0E[||∇fβi(xi,R)||]Lβi. |
Now, given the result of Theorem 4.1 with the specified values of α1 and α2 and the fact that L1(fβi)≤L1(F) together with Lemma 4.8, it follows that
4β0E[||∇fβi(xi,R)||]Lβi≤4β0βiK14i(Ai+Bi), |
where
Ai=ρsi,01(βi−14β0+10nsi−1,01si−1,02K14i−1+(n+3)32(βi−βi+1)),Bi=nsi,012H(−32)k+ln(Ki)(6√si,02(n+4)√n+20nsi,01si,02), | (4.18) |
Ki is the iteration number for subproblem i and H(−32)k is the generalized harmonic number.
The following lemma provides an estimate of the number of iterations required to bound the norm of the difference between m and the gradient below a certain threshold.
Lemma 4.10. For a subproblem i∈N and in the setting of Algorithm 2, let si,02∈R+ be such that k=1 in Eqs (4.4) and (4.9); assume that L=max(L0(F),L1(F)), α1=34 and α2=12. Then, for a uniformly randomly chosen R∈[0,Ki], it follows that
P(||mi,R−∇fβi(xi,R)||≥Lβi4β0)≤4β0βiK14i(3√si,02(n+4)√n+10nsi,01si,02). |
Proof. By Markov's inequality, it follows that
P(||mi,R−∇fβi(xi,R)||≥Lβi4β0)≤4β0E[||mi,R−∇fβi(xi,R)||]Lβi=4β0LβiKiKi∑k=0E[||mi,k−∇fβi(xi,k)||]≤4β0βiK14i(3√si,02(n+4)√n+10nsi,01si,02), |
where the last inequality holds by Propositions 4.2 and 4.3 with α1=34 and α2=12.
Finally, the main theorem of this section may be stated.
Theorem 4.3. Let Assumptions 1–3 hold and let I be the threshold from Assumption 3.
(i) For i∈N, set
βi=1√n(i+1)2,si,01=16n(i+1)3/2andsi,02=s2(i+1) |
with s2 so that Eqs (4.4) and (4.9) are satisfied for k=1. Moreover, let us denote Ki as the first iteration for which ||mi,Ki||≤Lβi4β0 and that without loss of generality L=max(L0(F),L1(F)). Let ϵ>0 be the desired accuracy and let i∗≥√Lϵ≥I. If for any i≥I,Ki≥(i+1)6; then after at most
O(n6L7/2ϵ7/2) |
function evaluations, the following inequality holds
||∇fβi∗(xi∗,0)||≤ϵ. | (4.19) |
(ii) Furthermore, when for every i∈N,fβi is convex; then, under the same setting as Theorem 4.2 given in Eq (4.13), it follows that after at most
O(n92L7/2ϵ7/2) |
function evaluations, the inequality given by Eq (4.19) holds.
Proof. For a subproblem i∈N, a probabilistic upper bound on the iteration Ki∈N such that ||mi,Ki||≤Lβi4β0 may be provided. We have
||mi,Ki||=mink∈[0,Ki]||mi,k||≤||mi,R||≤||mi,R−∇fβi(xi,R)||+||∇fβi(xi,R)||, | (4.20) |
where R∼U[0,Ki]. Now, probabilistic upper bounds on the number Ki are required to obtain that both terms on the right-hand side of the previous inequality are below Lβi4β0. For the first term of the right-hand side in Eq (4.20), using the specified values of si,01, si,02 and βi, Lemma 4.10 ensures that
P(||mi,R−∇fβi(xi,R)||≥Lβi4β0)≤4β0βiK14i(3√si,02(n+4)√n+10nsi,01si,02)≤O(n√n(i+1)32K14i). |
The second term of the right-hand side in Eq (4.20) depends on the value of I. For subproblems i≤I, it follows by Markov's inequality and Theorem 4.1 that
P(||∇fβi(xi,R)||≥Lβi4β0)≤4β0LβiE[||∇fβi(xi,R)||]≤4β0βi(Difsi,01+nsi,012H(−32)k+ln(Ki)(6√si,02(n+4)√n+40si,01nsi,02))≤O(max(n(i+1)72L,n√nln(Ki)(i+1)32)K14i). |
For subproblems i>I, Lemma 4.9 ensures that
P(||∇fβi(xi,R)||≥Lβi4β0)≤4β0βiK14(Ai+Bi), |
where Ai and Bi are given by Eq (4.18). Now, given the condition on Ki, it follows that
Ai=ρn(i+1)3/2(1i2+10s2i2+2(n+3)i2(i+1)) |
and
Bi=H(−32)k2(i+1)3/2+ln(Ki)(6n√n+3√s2√i+1+12s2√i+1). |
Thus, we obtain
P(||∇fβi(xi,R)||≥Lβi4β0)≤O(n√n(i+1)32ln(Ki)K14i). | (4.21) |
Therefore, to obtain that ||mi,Ki||≤Lβi4β0, it takes at most the following number of iterations:
Ki={O(max(n4(i+1)14,n6(i+1)6)),if i≤I,O((n6(i+1)6)),otherwise. |
Thus, by taking i∗≥√Lϵ, it follows that the number of iterations needed to reach the subproblem i∗ is
i∗∑i=1Ki=I∑i=1Ki+i∗∑i=I+1Ki=O(max(n4(I+1)15,n6(I+1)7))+O(n6(i∗)7)=O(n6L7/2ϵ7/2), | (4.22) |
where I is a constant with respect to ϵ. Once this number of iterations is reached, it follows that ||mi∗,0||≤L(i∗+1)2≤ϵ and by Lemma 4.7
||∇fβi∗(xi∗,Ki∗)||≤L(i∗+1)2+L√i∗+1(i∗)32≤2ϵ. |
For the second part of the proof, the bounds on Eq (4.20) do not depend on the value of I since fβi is assumed convex for every i∈N. With the setting in Eq (4.13), it follows that
P(||∇fβi(xi,R)||≥Lβi4β0)≤4β0βiE[||∇fβi(xi,R)||]≤16ρn√n(i+1)2K13i |
and
P(||mi,R−∇fβi(xi,R)||≥Lβi4β0)≤4β0Lβin√nL∑Ki−1k=02ρΓk+1(k+1)43∑Ki−1k=02ρΓk+1(k+1)≤8n√n(i+1)2K13i, |
where the first inequality follows by Theorem 4.2 and the second one by the definition of the probability density of R together with Propositions 4.2 and 4.3. Therefore, it takes at most Ki=O(n92(i+1)6) iterations to obtain ||mi,Ki||≤Lβi4β0. Thus, by taking i∗≥√Lϵ, it follows that the number of iterations needed to reach the subproblem i∗ is
i∗∑i=1Ki=O(n92(i∗)7)=O(n92L72ϵ72). |
It remains to apply Lemma 4.7 as previously done to complete the proof.
We would like to make a few remarks about this theorem. First, one approach to satisfy the condition Ki≥(i+1)6 for any i∈N is to incorporate it into the stopping criterion of Algorithm 1. However, due to the limited number of iterations in practice, this condition is typically replaced by a weaker one, Ki≥M, where M>0 is a constant. Second, the main result of Theorem 8 establishes the rate of convergence to an ϵ-optimal point for a single run of the SSO algorithm, which is the first of its kind to the best of our knowledge. This was made possible by decomposing the problem given in Eq (1.1) into a sequence of subproblems, each of which is solved by using carefully chosen stopping criteria and step sizes. It is worth noting that, in [20], the (ϵ,Λ)-solution of the norm of the gradient is obtained after at most O(nL2σ2ϵ4) function evaluations. Although this bound has a weaker dependence on n and L, it is worse in terms of ϵ. Third, the first term in Eq (4.22) may be significant even if it is fixed, particularly if the region where the function is convex is difficult to reach; indeed, this constant disappears when fβi is convex for every index i. Nevertheless, the bounds given represent the worst set and may be considerably smaller in practice. A way to decrease this term is to decrease the power on i in the denominator of βi, si,01 and si,02 but this would also decrease the asymptotic rate of convergence. Finally, the process used in the SSO algorithm may be extended to other momentum-based methods and give an appealing property for these methods compared to the classical SGD.
The numerical experiments are conducted for two bounded constrained blackbox optimization problems. In order to handle the bound constraints x∈[ℓ,u]⊂Rn, the update given by Eq (3.2) is simply projected such that x←max(ℓ,min(x,u)).
The first stochastic test problem is SOLAR [19], which simulates a thermal solar power plant and contains several instances allowing for selection of the number of variables, the types of constraints and the objective function to optimize. All of the instances of SOLAR are stochastic and have non-convex constraints and integer variables. In this work, the algorithms developed do not deal with integer variables. Therefore, the problem is altered: all integer variables are fixed to their initial values and the problem is to obtain a feasible solution by optimizing the expectation of constraint violations over the remaining variables. Numerical experiments were conducted for the second instance of the SOLAR framework, which considers 12 variables (2 integers) and 12 constraints
minx∈[0,1]12E[m∑j=1max(0,cj(x,ξ))2], |
where cj denotes the original stochastic constraints and the bound constraints have been normalized. The second instance of SOLAR is computationally expensive; a run may take between several seconds and several minutes. Therefore, the maximum number of function evaluations was set to 1000. Four algorithms were used
● SSO, whose hyperparameters values are given in Table 4. The search step given in Algorithm 2 was used for this experiment. A truncated version of the Gaussian gradient based estimate was used for this experiment.
Problem | βi | si,k1 | si,k2 | M | q |
Cifar10 | 0.005(i+1)2 | 0.005(i+1)32√k+1 | 0.9(i+1)(k+1)14 | 60 | 10 |
ImageNet | 0.001(i+1)2 | 0.003(i+1)32√k+1 | 0.7(i+1)(k+1)14 | 100 | 10 |
Solar | 0.3(i+1)2 | 0.1(i+1)32√k+1 | 0.5(i+1)(k+1)14 | 5 | 10 |
● ZO-adaMM [15] which is a ZO version of the original Adam algorithm. This algorithm appears as one of the most effective according to [15,31] in terms of distortion value, number of function evaluations and success rate. The default parameters defined experimentally in [15] were used for this problem, except that β=0.05 and the learning rate was equal to 0.3. Moreover, the same gradient estimator as that for ZO-signum was used to eliminate its impact on the performance.
● CMA-ES [22] an algorithm based on biologically inspired operators. Its name comes from the adaptation of the covariance matrix of the multivariate normal distribution used during the mutation. The version of CMA-ES used was the one of the pymoo [8] library with the default setting.
● The NOMAD 3.9.1 software [29], based on the MADS [1] algorithm, a popular blackbox optimization solver.
The results are presented in Figure 2, which plots the average best result obtained by each algorithm with five different seeds. In this experiment, SSO achieved similar performance to NOMAD and CMA-ES which are state-of-the-art algorithms for this type of problem. ZO-adaMM had difficulty converging even though it is a ZO algorithm.
This section demonstrates the competitiveness of the SSO algorithm through experiments involving the generation of blackbox adversarial examples for deep neural networks (DNNs) [45]. Generating an adversarial example for a DNN involves adding a well-designed perturbation to the original legal input to cause the DNN to misclassify it. In this work, the attacker considers the DNN model to be unknown, hence the term blackbox. Adversarial attacks against DNNs are not just theoretical, they pose a real safety issue [35]. Having an algorithm that generates effective adversarial examples enables modification of DNN architecture to enhance its robustness against such attacks. An ideal adversarial example is one that can mislead a DNN to recognize it as any target image label, while appearing visually similar to the original input, making the perturbations indiscernible to human eyes. The similarity between the two inputs is typically measured by an ℓp norm. Mathematically, a blackbox adversarial attack can be formalized as follows. Let (y,ℓ) denote a legitimate image y with the true label ℓ∈[1,M], where M is the total number of image classes. Let x denote the adversarial perturbation; the adversarial example is then given by y′=y+x, and the goal is to solve the problem [15]
minxλf(y+x)+||x||2, subject to (y+x)∈[−0.5,0.5]n, |
where λ>0 is a regularization parameter and f is the blackbox attack loss function. In our experiments, λ=10 and the loss function is defined for an untargeted attack [12], i.e.,
f(y′)=max{Z(y′)ℓ−maxj≠ℓZ(y)j,0}, |
where Z(y′)k denotes the prediction score for class k given the input y′. Thus, the minimum value of 0 is reached as the perturbation succeeds to fool the neural network.
The experiments of generating blackbox adversarial examples were first performed by using an adapted AlexNet [28] on the dataset Cifar10 and then by using InceptionV3 [43] on the dataset ImageNet [18]. Since the NOMAD algorithm is not recommended for large problems, three algorithms are compared: SSO (without search), ZO-adaMM and CMA-ES. In the experiments, the hyperparameters of the ZO-adaMM algorithm were taken as in [15], and those of SSO are given in Table 4; the uniform gradient based estimate is used for both algorithms. Moreover, for the Cifar10 dataset, different initial learning rates for ZO-adaMM were used to observe its influence on the success rate. Experiments were conducted for 100 randomly selected images with a starting point corresponding to a null distortion; the maximum number of function queries was set to 5000. Thus, as the iteration increases, the attack loss decreases until it converges to 0 (indicating a successful attack) while the norm of the distortion could increase.
The best attack performance involves a trade-off between a fast convergence to a 0 attack loss in terms of function evaluations, a high rate of success, and a low distortion (evaluated by the ℓ2-norm). The results for the Cifar10 dataset are given in Table 5.
Method | Attack success rate | Average # of function evaluations | |
ZO-adaMM |
79% | ||
ZO-adaMM |
96% | ||
ZO-adaMM |
98% | ||
CMAES |
99% | ||
SSO | 100% |
Except for ZO-adaMM with an initial learning rate equal to 0.01, all algorithms achieved a success rate above 95. Among these algorithms, ZO-adaMM with a learning rate equal to 0.05, had the best convergence rate in terms of function evaluations but it had the worst value of distortion. On the contrary, CMA-ES obtained the best value of distortion but had the worst convergence rate. The SSO algorithm achieved balanced results, and it was the only one to reach full success rate.
Table 6 displays the results for the ImageNet dataset. Only two algorithms are compared since dimensions were too large to invert the covariance matrix in CMA-ES. For this dataset, ZO-adaMM and SSO had the same convergence rate. However, SSO outperformed ZO-adaMM in terms of success rate while having a slightly higher level of distortion.
Method | Attack success rate | Average # of function evaluations | |
ZO-adaMM |
|||
SSO |
This paper presents a method for computationally expensive stochastic blackbox optimization. The approach uses ZO gradient estimates, which provides three advantages. First, they require few function evaluations to estimate the gradient, regardless of the problem's dimensions. Second, under mild conditions on the noised objective function, the problem is formulated as optimization of a smooth approximation. Third, the smooth approximation may appear to be locally convexified near a local minima.
Based on these three features, the SSO algorithm was proposed. This algorithm is a sequential one and comprises two steps. The first is an optional search step that improves the exploration of the decision variable space and the algorithm's efficiency. The second is a local search, which ensures the convergence of the algorithm. In this step, the original problem is decomposed into subproblems solved by a ZO-version of a sign stochastic descent with momentum algorithm. More specifically, when the momentum's norm falls below a specified threshold that depends on the smoothing parameter, the subproblem is considered solved. The smoothing parameter's value is then decreased, and the SSO algorithm moves on to the next subproblem.
A theoretical analysis of the algorithm has been conducted. Under Lipschitz continuity of the stochastic ZO oracle, a convergence rate in mean of the ZO-signum algorithm is derived. Under additional assumptions of smoothness and convexity or local convexity of the objective function near its minima, the rate of convergence of the SSO algorithm to an ϵ-optimal point of the problem has been derived, which is, to the best of our knowledge, the first of its kind.
Finally, numerical experiments were conducted based on a solar power plant simulation and adversarial blackbox attacks. Both examples were computationally expensive, the former was a small-sized problem (n≈10) and the latter was a large-sized problem (up to n≈105). The results demonstrate the SSO algorithm's competitiveness in terms of both performance and convergence rate compared to state-of-the-art algorithms. Further work will extend this approach to constrained stochastic optimization.
The authors declare that they have not used artificial intelligence tools in the creation of this article.
The authors are grateful to the anonymous reviewers for their helpful comments and the English editor for improving the English quality of this text. This work was financed by the IVADO Fundamental Research Projects Grant PRF-2019-8079623546 and by the NSERC Alliance grant 544900-19 in collaboration with Huawei-Canada.
All authors declare no conflicts of interest that may influence the publication of this paper.
The following list describes symbols used within the body of the document. Throughout the paper, when a symbol is shown in bold then it is a vector; otherwise, it is a scalar.
n | The dimension of the space of the design variables |
Ω | The sample space of ξ, i.e., the set of all possible outcomes of ξ |
ξ:Ω→Rm | The vector of uncertainties |
Eξ[⋅] | The expectation with respect to the random vector ξ |
F:Rn×Rm→R | The stochastic zeroth-order oracle that takes into account the uncertainty ξ |
f:Rn→R | The expectation of F with respect to ξ |
β∈R+∗ | A strictly positive scalar for use as a smoothing parameter |
u∈Rn | A Gaussian random vector |
fβ=E[f(x+βu)] | A smooth approximation of a function f |
L0(f) | The Lipschitz constant associated with a function f |
L1(f) | The Lipschitz constant associated with the gradient of a function f |
∇f | The gradient of a function f |
˜∇f | An estimator of the gradient of a function f |
˜g | An estimator of the gradient of a function f based on outputs of the stochastic zeroth-order oracle F(x,ξ) |
j∈[1,n] | The counter associated with the dimension |
i∈N | The outer iteration counter associated with a subproblem |
k∈N | The inner iteration counter |
m∈Rn | The momentum vector |
si,k2∈(0,1) | The step size associated with the momentum |
si,k1∈(0,1) | The step size associated with x |
L∈R+∗ | An approximation of the Lispchitz constant |
q∈N | The size of the mini batch used to estimate ˜∇ |
M∈N | The minimum number of iterations used in the ZO-signum algorithm |
H(α)k | The generalized harmonic number of order α |
C0+ | Class of Lipschitz continuous functions |
C1+ | Class of differentiable functions whose gradient is Lipschitz |
C∞ | Class of infinitely differentiable functions |
Proposition B.1. [6] For the subproblem i∈N, under Assumption 1 and in the setting of Algorithm 1, we have
si,k1E[||∇fβi(xi,k)||1]≤E[fβi(xi,k)−fβi(xi,k+1)]+nL1(fβi)2(si,k1)2+2si,k1E[||ˉmi,k+1−∇fβi(xi,k)||1]⏟bias+2si,k1√n√E[||mi,k+1−ˉmi,k+1||22]⏟variance, | (B.1) |
where ˉmi,k+1j is defined recursively as ˉmi,k+1j=si,k2∇fβi(xi,k)+(1−si,k2)ˉmi,kj.
Proof. By L1(fβi)-Lipschitz smoothness of fβi (see Lemma 2.1(3)), it follows that
fβi(xi,k+1)≤fβi(xi,k)+⟨∇fβi(xi,k),xi,k+1−xi,k⟩+L1(fβi)2||xi,k+1−xi,k||22=fβi(xi,k)−si,k1⟨∇fβi(xi,k),sign(mi,k+1)⟩+L1(fβi)(si,k1)22||sign(mi,k+1)||22=fβi(xi,k)−si,k1||∇fβi(xi,k)||1+nL1(fβi)2(si,k1)2+2si,k1n∑j=1|∇jfβi(xi,k)|1{sign(mi,k+1j)≠sign(∇jfβi(xi,k))}, |
where 1{⋅} is the indicator function. Now, as in [6,30], the expected improvement conditioned on xi,k is given by
E[fβi(xi,k+1)−fβi(xi,k)|xi,k]≤−si,k1||∇fβi(xi,k)||1+nL1(fβi)2(si,k1)2+2si,k1n∑j=1|∇jfβi(xi,k)|E[1{sign(mi,k+1j)≠sign(∇jfβi(xi,k))}|xi,k]. | (B.2) |
Again, as in [6,30], the expectation that the sign of mi,k+1j is different from the sign of ∇jfβi(xi,k) is relaxed by considering that the set
{mi,k+1j:sign(mi,k+1j)≠sign(∇jfβi(xi,k)}⊂{mi,k+1j:|mi,k+1j−∇jfβi(xi,k)|≥|∇jfβi(xi,k)|}. |
Therefore, it follows that
E[1{sign(mi,k+1j)≠sign(∇jfβi(xi,k))}|xi,k]≤E[1{|mi,k+1j−∇jfβi(xi,k)|≥|∇jfβi(xi,k)|}|xi,k]≤E[|mi,k+1j−∇jfβi(xi,k)||xi,k]|∇jfβi(xi,k)|, | (B.3) |
where the second inequality comes from the conditional Markov's inequality. Substituting Eq (B.3) into Eq (B.2) and taking the expectation over all of the randomness we obtain
E[fβi(xi,k+1)−fβi(xi,k)]≤−si,k1E[||∇fβi(xi,k)||1]+nL2(si,k1)2+2si,k1n∑j=1E[|mi,k+1j−∇jfβi(xi,k)|]. | (B.4) |
Moreover, by adding and subtracting ˉmi,k+1 in terms of the sum of Eq (B.4), one gets
n∑j=1E[|mi,k+1j−∇jfβi(xi,k)|]=E[||mi,k+1−ˉmi,k+1+ˉmi,k+1−∇fβi(xi,k)||1]≤√nE[||mi,k+1−ˉmi,k+1||2]+E[||ˉmi,k+1−∇fβi(xi,k)||1]≤√n√E[||mi,k+1−ˉmi,k+1||22]+E[||ˉmi,k+1−∇fβi(xi,k)||1], |
where the first inequality comes from ||⋅||1≤√n||⋅||2 and the second one from Jensen's inequality. Finally, incorporating the last inequality in Eq (B.4) completes the proof.
Below are the original versions of the signSGD and signum algorithms.
Algorithm 3 signSGD algorithm. |
1: Input: x0, s1∈(0,1) |
2: for k=0,1,… do |
3: Calculate an estimate of the stochastic gradient ˜∇f(xk) and update: |
xk+1=xk−s1sign(˜∇f(xk)) |
4: end for |
5: Return xk |
Algorithm 4 Signum algorithm. |
1: Input: x0,m0,s1∈(0,1),s2∈(0,1) |
2: for k=0,1,… do |
3: Calculate an estimate of the stochastic gradient ˜∇f(xk) and update: |
mk+1=s2mk+(1−s2)˜∇f(xk)xk+1=xk−s1sign(mk+1) |
3: end for |
4: Return xk |
[1] |
C. Audet, J. Dennis, Mesh adaptive direct search algorithms for constrained optimization, SIAM J. Optimiz., 17 (2006), 188–217. http://dx.doi.org/10.1137/040603371 doi: 10.1137/040603371
![]() |
[2] |
C. Audet, K. Dzahini, M. Kokkolaras, S. Le Digabel, Stochastic mesh adaptive direct search for blackbox optimization using probabilistic estimates, Comput. Optim. Appl., 79 (2021), 1–34. http://dx.doi.org/10.1007/s10589-020-00249-0 doi: 10.1007/s10589-020-00249-0
![]() |
[3] | C. Audet, W. Hare, Derivative-free and blackbox optimization, Cham: Springer, 2017. http://dx.doi.org/10.1007/978-3-319-68913-5 |
[4] |
C. Audet, A. Ihaddadene, S. Le Digabel, C. Tribes, Robust optimization of noisy blackbox problems using the mesh adaptive direct search algorithm, Optim. Lett., 12 (2018), 675–689. http://dx.doi.org/10.1007/s11590-017-1226-6 doi: 10.1007/s11590-017-1226-6
![]() |
[5] |
K. Balasubramanian, S. Ghadimi, Zeroth-order nonconvex stochastic optimization: handling constraints, high dimensionality, and saddle points, Found. Computat. Math., 22 (2022), 35–76. http://dx.doi.org/10.1007/s10208-021-09499-8 doi: 10.1007/s10208-021-09499-8
![]() |
[6] | J. Bernstein, Y. Wang, K. Azizzadenesheli, A. Anandkumar, SignSGD: compressed optimisation for non-convex problems, Proceedings of International Conference on Machine Learning, 2018,560–569. |
[7] | S. Bhatnagar, H. Prasad, L. Prashanth, Stochastic recursive algorithms for optimization, London: Springer, 2013. http://dx.doi.org/10.1007/978-1-4471-4285-0 |
[8] |
J. Blank, K. Deb, Pymoo: multi-objective optimization in Python, IEEE Access, 8 (2020), 89497–89509. http://dx.doi.org/10.1109/ACCESS.2020.2990567 doi: 10.1109/ACCESS.2020.2990567
![]() |
[9] | H. Cai, Y. Lou, D. McKenzie, W. Yin, A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization, Proceedings of the 38th International Conference on Machine Learning, 2021, 1193–1203. |
[10] |
H. Cai, D. McKenzie, W. Yin, Z. Zhang, A one-bit, comparison-based gradient estimator, Appl. Comput. Harmon. Anal., 60 (2022), 242–266. http://dx.doi.org/10.1016/j.acha.2022.03.003 doi: 10.1016/j.acha.2022.03.003
![]() |
[11] |
H. Cai, D. Mckenzie, W. Yin, Z. Zhang, Zeroth-order regularized optimization (zoro): approximately sparse gradients and adaptive sampling, SIAM J. Optim., 32 (2022), 687–714. http://dx.doi.org/10.1137/21M1392966 doi: 10.1137/21M1392966
![]() |
[12] |
N. Carlini, D. Wagner, Towards evaluating the robustness of neural networks, Proceedings of 2017 IEEE Symposium on Security and Privacy, 2017, 39–57. http://dx.doi.org/10.1109/SP.2017.49 doi: 10.1109/SP.2017.49
![]() |
[13] |
K. Chang, Stochastic nelder-mead simplex method-a new globally convergent direct search method for simulation optimization, Eur. J. Oper. Res., 220 (2012), 684–694. http://dx.doi.org/10.1016/j.ejor.2012.02.028 doi: 10.1016/j.ejor.2012.02.028
![]() |
[14] |
R. Chen, M. Menickelly, K. Scheinberg, Stochastic optimization using a trust-region method and random models, Math. Program., 169 (2018), 447–487. http://dx.doi.org/10.1007/s10107-017-1141-8 doi: 10.1007/s10107-017-1141-8
![]() |
[15] | X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, et al., Zo-adamm: zeroth-order adaptive momentum method for black-box optimization, Proceedings of 33rd Conference on Neural Information Processing Systems, 2019, 1–12. |
[16] | A. Conn, K. Scheinberg, L. Vicente, Introduction to derivative-free optimization, Philadelphia: SIAM, 2009. http://dx.doi.org/10.1137/1.9780898718768 |
[17] |
F. Curtis, K. Scheinberg, R. Shi, A stochastic trust region algorithm based on careful step normalization, Informs Journal on Optimization, 1 (2019), 200–220. http://dx.doi.org/10.1287/ijoo.2018.0010 doi: 10.1287/ijoo.2018.0010
![]() |
[18] |
J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, Imagenet: a large-scale hierarchical image database, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009,248–255. http://dx.doi.org/10.1109/CVPR.2009.5206848 doi: 10.1109/CVPR.2009.5206848
![]() |
[19] | M. Garneau, Modelling of a solar thermal power plant for benchmarking blackbox optimization solvers, Ph. D Thesis, École Polytechnique de Montréal, 2015. |
[20] |
S. Ghadimi, G. Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), 2341–2368. http://dx.doi.org/10.1137/120880811 doi: 10.1137/120880811
![]() |
[21] |
S. Ghadimi, A. Ruszczynski, M. Wang, A single timescale stochastic approximation method for nested stochastic optimization, SIAM J. Optim., 30 (2020), 960–979. http://dx.doi.org/10.1137/18M1230542 doi: 10.1137/18M1230542
![]() |
[22] | N. Hansen, The CMA evolution strategy: a comparing review, In: Towards a new evolutionary computation, Berlin: Springer, 2006, 75–102. http://dx.doi.org/10.1007/3-540-32494-1_4 |
[23] | S. Karimireddy, Q. Rebjock, S. Stich, M. Jaggi, Error feedback fixes signsgd and other gradient compression schemes, Proceedings of the 36th International Conference on Machine Learning, 2019, 3252–3261. |
[24] |
J. Kiefer, J. Wolfowitz, Stochastic estimation of the maximum of a regression function, Ann. Math. Statist., 23 (1952), 462–466. http://dx.doi.org/10.1214/aoms/1177729392 doi: 10.1214/aoms/1177729392
![]() |
[25] | B. Kim, H. Cai, D. McKenzie, W. Yin, Curvature-aware derivative-free optimization, arXiv:2109.13391. |
[26] | D. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980. |
[27] |
M. Kokkolaras, Z. Mourelatos, P. Papalambros, Impact of uncertainty quantification on design: an engine optimisation case study, International Journal of Reliability and Safety, 1 (2006), 225–237. http://dx.doi.org/10.1504/IJRS.2006.010786 doi: 10.1504/IJRS.2006.010786
![]() |
[28] |
A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM, 60 (2017), 84–90. http://dx.doi.org/10.1145/3065386 doi: 10.1145/3065386
![]() |
[29] |
S. Le Digabel, Algorithm 909: NOMAD: nonlinear optimization with the MADS algorithm, ACM T. Math. Software, 37 (2011), 1–15. http://dx.doi.org/10.1145/1916461.1916468 doi: 10.1145/1916461.1916468
![]() |
[30] | S. Liu, P. Chen, X. Chen, M. Hong, Sign-SGD via zeroth-order oracle, Proceedings of International Conference on Learning Representations, 2019, 1–24. |
[31] |
S. Liu, P. Chen, B. Kailkhura, G. Zhang, A. Hero, P. Varshney, A primer on zeroth-order optimization in signal processing and machine learning: principals, recent advances, and applications, IEEE Signal Proc. Mag., 37 (2020), 43–54. http://dx.doi.org/10.1109/MSP.2020.3003837 doi: 10.1109/MSP.2020.3003837
![]() |
[32] | S. Liu, B. Kailkhura, P. Chen, P. Ting, S. Chang, L. Amini, Zeroth-order stochastic variance reduction for nonconvex optimization, Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, 3731–3741. |
[33] |
A. Maggiar, A. Wachter, I. Dolinskaya, J. Staum, A derivative-free trust-region algorithm for the optimization of functions smoothed via gaussian convolution using adaptive multiple importance sampling, SIAM J. Optim., 28 (2018), 1478–1507. http://dx.doi.org/10.1137/15M1031679 doi: 10.1137/15M1031679
![]() |
[34] |
Y. Nesterov, V. Spokoiny, Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), 527–566. http://dx.doi.org/10.1007/s10208-015-9296-2 doi: 10.1007/s10208-015-9296-2
![]() |
[35] |
N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Berkay Celik, A. Swami, Practical black-box attacks against machine learning, Proceedings of the 2017 ACM on Asia conference on computer and communications security, 2017,506–519. http://dx.doi.org/10.1145/3052973.3053009 doi: 10.1145/3052973.3053009
![]() |
[36] | E. Real, S. Moore, A. Selle, S. Saxena, Y. Suematsu, J. Tan, et al., Large-scale evolution of image classifiers, Proceedings of the 34th International Conference on Machine Learning, 2017, 2902–2911. |
[37] |
H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), 400–407. http://dx.doi.org/10.1214/aoms/1177729586 doi: 10.1214/aoms/1177729586
![]() |
[38] |
R. Rockafellar, J. Royset, Risk measures in engineering design under uncertainty, Proceedings of International Conference on Applications of Statistics and Probability, 2015, 1–8. http://dx.doi.org/10.14288/1.0076159 doi: 10.14288/1.0076159
![]() |
[39] | R. Rubinstein, Simulation and the Monte Carlo method, Hoboken: John Wiley & Sons Inc., 1981. http://dx.doi.org/10.1002/9780470316511 |
[40] |
A. Ruszczynski, W. Syski, Stochastic approximation method with gradient averaging for unconstrained problems, IEEE T. Automat. Contr., 28 (1983), 1097–1105. http://dx.doi.org/10.1109/TAC.1983.1103184 doi: 10.1109/TAC.1983.1103184
![]() |
[41] |
J. Spall, Multivariate stochastic approximation using a simultaneous perturbation gradient approximation, IEEE T. Automat. Contr., 37 (1992), 332–341. http://dx.doi.org/10.1109/9.119632 doi: 10.1109/9.119632
![]() |
[42] | M. Styblinski, T. Tang, Experiments in nonconvex optimization: stochastic approximation with function smoothing and simulated annealing, Neural Networks, 3 (1990), 467–483. |
[43] |
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, 2818–2826. http://dx.doi.org/10.1109/CVPR.2016.308 doi: 10.1109/CVPR.2016.308
![]() |
[44] |
V. Volz, J. Schrum, J. Liu, S. Lucas, A. Smith, S. Risi, Evolving mario levels in the latent space of a deep convolutional generative adversarial network, Proceedings of the Genetic and Evolutionary Computation Conference, 2018,221–228. http://dx.doi.org/10.1145/3205455.3205517 doi: 10.1145/3205455.3205517
![]() |
[45] | K. Xu, S. Liu, P. Zhao, P. Chen, H. Zhang, Q. Fan, et al., Structured adversarial attack: towards general implementation and better interpretability, Proceedings of International Conference on Learning Representations, 2019, 1–21. |
Algorithm 1 ZO-signum algorithm to solve subproblem i∈N. |
1: Input: xi,0,mi,0,βi,si,01,si,02, L, q, M |
2: Set k=0 |
3: Define step-size sequences si,k1=si,01(k+1)α1 and si,k2=si,02(k+1)α2 |
4: while ||mi,k||>Lβi4β0 or k≤M do |
5: Draw q samples uk from the Gaussian distribution N(0,I) |
6: Calculate the average of the q Gaussian estimate ˜∇fβi(xi,k,ξi,k) from Eq (2.3) |
7: Update: |
mi,k+1=si,k2˜∇fβi(xi,k,ξk)+(1−si,k2)mi,k(3.1) |
xi,k+1j=xi,kj−sk1sign(mi,k+1j)∀j∈[1,n](3.2) |
8: k←k+1 |
9: end while |
10: Return mi,k and xi,k |
Algorithm 2 SSO algorithm. |
1: Initialization: |
2: Set x0,0∈Rn, β0>0 and N as the maximum number of function calls for the search step |
3: Set q as the number of gradient estimates at each iteration of ZO-signum (ZOS) algorithm |
4: Set M the minimum number of iterations made by the ZOS algorithm on a subproblem |
5: C is the cache containing all of the evaluated points |
6: Set m0,0=˜∇fβ0(x0,0,ξ0) and L=+∞ |
7: Set s0,01>0 and s0,02>0 |
8: Set i=0 |
9: Search step (optional): |
10: while M(i+1)q≤N: do |
11: Solve subproblem i with Algorithm 1: |
mi+1,0=ZOS(xi,0,mi,0,βi,si,01,si,02,L,q,M)xi+1,0∈argminx∈CF(x,ξ) |
12: Update βi, si,01 and si,02 as in Step 18 |
13: end while |
14: L=||m0,0|| |
15: Local step: |
16: while βi>ϵ do |
17: Solve subproblem i with Algorithm 1: |
mi+1,0,xi+1,0=ZOS(xi,0,mi,0,βi,si,01,si,02,L,q,M) |
18: Update: |
βi=β0(i+1)2,si,01=s0,01(i+1)32,si,02=s0,02i+1i←i+1 |
19: end while |
20: Return xi |
Assumptions on F | Preliminary results | Intermediate results | Main results | When fβ is convex |
Assumption 1 which implies that L1(fβi)=2√nL0(F)βi | Proposition 4.1 | |||
Lemma 4.1 | Proposition 4.2 | Theorem 4.1 Corollary 4.1 |
Theorem 4.2 | |
Lemma 4.2 | ||||
Lemma 4.3 | ||||
Lemma 4.4 | Proposition 4.3 | |||
Lemma 4.5 |
Method | Assumptions | Measure | Convergence rate | Queries |
ZO-SGD [20] | F(⋅,ξ)∈C1+ | E[||∇f(xR)||2] | O(√σn14K14+√n√K) | O(K) |
E[||∇F(x,ξ)−∇f(x)||2]≤σ2 | ||||
ZO-signSGD [30] | F(⋅,ξ)∈C0+ | E[||∇f(xR)||2] | O(√n√K+√n√b+n√bq) | O(bqK) |
F(⋅,ξ)∈C1+ | ||||
||∇F(x,ξ)||2≤η | ||||
ZO-adaMM [15] | F(⋅,ξ)∈C0+ | E[||∇f(xR)||2] | O((√nK14+n√K)(ln(K)+ln(n))14) | O(K) |
F(⋅,ξ)∈C1+ | ||||
||∇F(x,ξ)||∞≤η | ||||
ZO-Signum | F(⋅,ξ)∈C0+ | E[||∇fβ(xR)||2] | O(n√nK14ln(K)) | O(K) |
ZO-Signum | F(⋅,ξ)∈C0+, f convex | E[fβi(xi,K)−fβi(xi,∗)] | O(n√nK13) | O(K) |
ZO-SGD [34] | F(⋅,ξ)∈C0+, f convex | E[f(xi,K)−f(xi,∗)] | O(n√K) | O(K) |
Modified ZSCG [5] | F(⋅,ξ)∈C1+, F convex | E[f(xi,K)−f(xi,∗)] | O(σ√n√K) | O(K) |
E[||∇F(x,ξ)−∇f(x)||2]≤σ2 |
Assumptions on F | Preliminary results | Intermediate results | Main result | When fβ is convex | ||
Assumptions 1, 2 and 3 which imply L1(fβi)≤L1(f) | Lemma 4.6 | Theorem 4.3 (ⅰ) | Theorem 4.3 (ⅱ) | |||
Proposition 4.3 | Lemma 4.7 | Lemma 4.8 | Lemma 4.9 | |||
Lemma 2.1(3) | ||||||
Theorem 4.1 | ||||||
Proposition 4.2 | Lemma 4.10 | |||||
Proposition 4.3 |
Problem | βi | si,k1 | si,k2 | M | q |
Cifar10 | 0.005(i+1)2 | 0.005(i+1)32√k+1 | 0.9(i+1)(k+1)14 | 60 | 10 |
ImageNet | 0.001(i+1)2 | 0.003(i+1)32√k+1 | 0.7(i+1)(k+1)14 | 100 | 10 |
Solar | 0.3(i+1)2 | 0.1(i+1)32√k+1 | 0.5(i+1)(k+1)14 | 5 | 10 |
Method | Attack success rate | Average # of function evaluations | |
ZO-adaMM |
79% | ||
ZO-adaMM |
96% | ||
ZO-adaMM |
98% | ||
CMAES |
99% | ||
SSO | 100% |
Method | Attack success rate | Average # of function evaluations | |
ZO-adaMM |
|||
SSO |
n | The dimension of the space of the design variables |
Ω | The sample space of ξ, i.e., the set of all possible outcomes of ξ |
ξ:Ω→Rm | The vector of uncertainties |
\mathbb{E}_{\boldsymbol{\xi}}[\cdot] | The expectation with respect to the random vector \boldsymbol{\xi} |
F : {{\mathbb{R}}}^n\times {{\mathbb{R}}}^m \to {{\mathbb{R}}} | The stochastic zeroth-order oracle that takes into account the uncertainty \boldsymbol{\xi} |
f : {{\mathbb{R}}}^n \to {{\mathbb{R}}} | The expectation of F with respect to \boldsymbol{\xi} |
\beta \in {{\mathbb{R}}}^{+*} | A strictly positive scalar for use as a smoothing parameter |
{\mathbf{u}} \in {{\mathbb{R}}}^n | A Gaussian random vector |
f^\beta = \mathbb{E}[f({\mathbf{x}} + \beta {\mathbf{u}})] | A smooth approximation of a function f |
L_0(f) | The Lipschitz constant associated with a function f |
L_1(f) | The Lipschitz constant associated with the gradient of a function f |
\nabla f | The gradient of a function f |
\tilde{\nabla} f | An estimator of the gradient of a function f |
\tilde{{\mathbf{g}}} | An estimator of the gradient of a function f based on outputs of the stochastic zeroth-order oracle F({\mathbf{x}}, \boldsymbol{\xi}) |
j \in [1, n] | The counter associated with the dimension |
i \in {{\mathbb{N}}} | The outer iteration counter associated with a subproblem |
k \in {{\mathbb{N}}} | The inner iteration counter |
{\mathbf{m}} \in {{\mathbb{R}}}^n | The momentum vector |
s_2^{i, k} \in (0, 1) | The step size associated with the momentum |
s_1^{i, k} \in (0, 1) | The step size associated with {\mathbf{x}} |
L \in {{\mathbb{R}}}^{+*} | An approximation of the Lispchitz constant |
q \in {{\mathbb{N}}} | The size of the mini batch used to estimate \tilde{\nabla} |
M \in {{\mathbb{N}}} | The minimum number of iterations used in the ZO-signum algorithm |
H_k^{(\alpha)} | The generalized harmonic number of order \alpha |
\mathcal{C}^{0+} | Class of Lipschitz continuous functions |
\mathcal{C}^{1+} | Class of differentiable functions whose gradient is Lipschitz |
\mathcal{C}^{\infty} | Class of infinitely differentiable functions |
Algorithm 3 signSGD algorithm. |
1: Input: {\mathbf{x}}^{0} , s_1 \in (0, 1) |
2: for k = 0, 1, \dots do |
3: Calculate an estimate of the stochastic gradient \tilde{\nabla} f({\mathbf{x}}^k) and update: |
\begin{align*} & {\mathbf{x}}^{k+1} = {\mathbf{x}}^{k} - s_1 \text{sign}(\tilde{\nabla} f({\mathbf{x}}^k)) \end{align*} |
4: end for |
5: Return {\mathbf{x}}^k |
Algorithm 4 Signum algorithm. |
1: Input: {\mathbf{x}}^{0}, {\mathbf{m}}^{0}, s_1 \in (0, 1), s_2 \in (0, 1) |
2: for k = 0, 1, \dots do |
3: Calculate an estimate of the stochastic gradient \tilde{\nabla} f({\mathbf{x}}^k) and update: |
\begin{align*} & {\mathbf{m}}^{k+1} = s_2 {\mathbf{m}}^{k} + (1-s_2) \tilde{\nabla} f({\mathbf{x}}^k) \\ & {\mathbf{x}}^{k+1} = {\mathbf{x}}^k - s_1 \text{sign}({\mathbf{m}}^{k+1}) \end{align*} |
3: end for |
4: Return {\mathbf{x}}^k |
Algorithm 1 ZO-signum algorithm to solve subproblem i \in \mathbb{N}. |
1: Input: {\mathbf{x}}^{i, 0}, {\mathbf{m}}^{i, 0}, \beta^i, s_1^{i, 0}, s_2^{i, 0} , L , q , M |
2: Set k = 0 |
3: Define step-size sequences s_1^{i, k} = \frac{s_1^{i, 0}}{(k+1)^{\alpha_1}} and s_2^{i, k} = \frac{s_2^{i, 0}}{(k+1)^{\alpha_2}} |
4: while ||{\mathbf{m}}^{i, k}|| > \frac{L \beta^i }{4 \beta^0} or k \leq M do |
5: Draw q samples {\mathbf{u}}^k from the Gaussian distribution \mathcal{N}(\mathbf{0}, I) |
6: Calculate the average of the q Gaussian estimate \tilde{\nabla} f^{\beta^i}({\mathbf{x}}^{i, k}, \boldsymbol{\xi}^{i, k}) from Eq (2.3) |
7: Update: |
\begin{align} {\mathbf{m}}^{i, k+1} = s_2^{i, k} \tilde{\nabla} f^{\beta^i}({\mathbf{x}}^{i, k}, \boldsymbol{\xi}^{k}) + (1-s_2^{i, k}){\mathbf{m}}^{i, k} \; \; \; \; \; \; \; \; \; \; \; \; \; \; (3.1)\end{align} |
\begin{align} x_j^{i, k+1} = x_j^{i, k} - s_1^k \text{sign}(m_j^{i, k+1}) \; \forall j \in [1, n] \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; (3.2)\end{align} |
8: k \leftarrow k +1 |
9: end while |
10: Return {\mathbf{m}}^{i, k} and {\mathbf{x}}^{i, k} |
Algorithm 2 SSO algorithm. |
1: Initialization: |
2: Set {\mathbf{x}}^{0, 0} \in {{\mathbb{R}}}^{n} , \beta^0 > 0 and N as the maximum number of function calls for the search step |
3: Set q as the number of gradient estimates at each iteration of ZO-signum (ZOS) algorithm |
4: Set M the minimum number of iterations made by the ZOS algorithm on a subproblem |
5: \mathcal{C} is the cache containing all of the evaluated points |
6: Set {\mathbf{m}}^{0, 0}= \tilde{\nabla} f^{\beta^0}({\mathbf{x}}^{0, 0}, \boldsymbol{\xi}^0) and L = + \infty |
7: Set s_1^{0, 0} > 0 and s_2^{0, 0} > 0 |
8: Set i = 0 |
9: Search step (optional): |
10: while M(i+1)q \leq N : do |
11: Solve subproblem i with Algorithm 1: |
\begin{align*} & {\mathbf{m}}^{i+1, 0} = ZOS({\mathbf{x}}^{i, 0}, {\mathbf{m}}^{i, 0}, \beta^i, s_1^{i, 0}, s_2^{i, 0}, L, q, M) \\ & {\mathbf{x}}^{i+1, 0} \in \mathop{\mathrm{argmin}}_{{\mathbf{x}} \in \mathcal{C}} F({\mathbf{x}}, \boldsymbol{\xi}) \end{align*} |
12: Update \beta^i , s_1^{i, 0} and s_2^{i, 0} as in Step 18 |
13: end while |
14: L = ||{\mathbf{m}}^{0, 0}|| |
15: Local step: |
16: while \beta^i > \epsilon do |
17: Solve subproblem i with Algorithm 1: |
\begin{equation*} {\mathbf{m}}^{i+1, 0}, {\mathbf{x}}^{i+1, 0} = ZOS({\mathbf{x}}^{i, 0}, {\mathbf{m}}^{i, 0}, \beta^i, s_1^{i, 0}, s_2^{i, 0}, L, q, M) \end{equation*} |
18: Update: |
\begin{align*} & \beta^i = \frac{\beta^0}{(i+1)^2}, s_1^{i, 0} = \frac{s_1^{0, 0}}{(i +1)^{\frac{3}{2}}}, \quad s_2^{i, 0} = \frac{s_2^{0, 0}}{i+1} &i \leftarrow i + 1 \end{align*} |
19: end while |
20: Return {\mathbf{x}}^i |
Assumptions on F | Preliminary results | Intermediate results | Main results | When f^\beta is convex |
Assumption 1 which implies that L_1(f^{\beta^i}) = \frac{2 \sqrt{n} L_0(F)}{\beta^i} | Proposition 4.1 | |||
Lemma 4.1 | Proposition 4.2 | Theorem 4.1 Corollary 4.1 |
Theorem 4.2 | |
Lemma 4.2 | ||||
Lemma 4.3 | ||||
Lemma 4.4 | Proposition 4.3 | |||
Lemma 4.5 |
Method | Assumptions | Measure | Convergence rate | Queries |
ZO-SGD [20] | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{1 +} | \mathbb{E}[||\nabla f({\mathbf{x}}^R)||_2] | O\left(\frac{\sqrt{\sigma}n^{\frac{1}{4}}}{K^{\frac{1}{4}}} + \frac{\sqrt{n}}{\sqrt{K}}\right) | O(K) |
\mathbb{E}[||\nabla F({\mathbf{x}}, \boldsymbol{\xi}) - \nabla f({\mathbf{x}})||^2] \leq \sigma^2 | ||||
ZO-signSGD [30] | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{0 +} | \mathbb{E}[||\nabla f({\mathbf{x}}^R)||_2] | O\left(\frac{\sqrt{n}}{\sqrt{K}} + \frac{\sqrt{n}}{\sqrt{b}} + \frac{n}{\sqrt{bq}} \right) | O(bqK) |
F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{1 +} | ||||
||\nabla F({\mathbf{x}}, \boldsymbol{\xi})||_2 \leq \eta | ||||
ZO-adaMM [15] | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{0 +} | \mathbb{E}[||\nabla f({\mathbf{x}}^R)||_2] | O \left(\left(\frac{\sqrt{n}}{K^{\frac{1}{4}}} + \frac{n}{\sqrt{K}}\right) \left(\ln(K) + \ln(n) \right)^{\frac{1}{4}} \right) | O(K) |
F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{1 +} | ||||
||\nabla F({\mathbf{x}}, \boldsymbol{\xi})||_\infty \leq \eta | ||||
ZO-Signum | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{0 +} | \mathbb{E}[||\nabla f^{\beta}({\mathbf{x}}^R)||_2] | O\left(\frac{n \sqrt{n}}{K^{\frac{1}{4}}} \ln(K) \right) | O(K) |
ZO-Signum | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{0 +} , f convex | \mathbb{E}[f^{\beta^i}({\mathbf{x}}^{i, K}) - f^{\beta^i}({\mathbf{x}}^{i, *})] | O\left(\frac{n\sqrt{n}}{K^{\frac{1}{3}}} \right) | O(K) |
ZO-SGD [34] | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{0 +} , f convex | \mathbb{E}[f({\mathbf{x}}^{i, K}) - f({\mathbf{x}}^{i, *})] | O\left(\frac{n}{\sqrt{K}} \right) | O(K) |
Modified ZSCG [5] | F(\cdot, \boldsymbol{\xi}) \in \mathcal{C}^{1 +} , F convex | \mathbb{E}[f({\mathbf{x}}^{i, K}) - f({\mathbf{x}}^{i, *})] | O\left(\frac{\sigma \sqrt{n}}{\sqrt{K}} \right) | O(K) |
\mathbb{E}[||\nabla F({\mathbf{x}}, \boldsymbol{\xi}) - \nabla f({\mathbf{x}})||^2] \leq \sigma^2 |
Assumptions on F | Preliminary results | Intermediate results | Main result | When f^\beta is convex | ||
Assumptions 1, 2 and 3 which imply L_1(f^{\beta^i}) \leq L_1(f) | Lemma 4.6 | Theorem 4.3 (ⅰ) | Theorem 4.3 (ⅱ) | |||
Proposition 4.3 | Lemma 4.7 | Lemma 4.8 | Lemma 4.9 | |||
Lemma 2.1(3) | ||||||
Theorem 4.1 | ||||||
Proposition 4.2 | Lemma 4.10 | |||||
Proposition 4.3 |
Problem | \beta^i | s_1^{i, k} | s_2^{i, k} | M | q |
Cifar10 | \frac{0.005}{(i+1)^2} | \frac{0.005}{(i+1)^{\frac{3}{2}}\sqrt{k+1}} | \frac{0.9}{(i+1) (k+1)^{\frac{1}{4}}} | 60 | 10 |
ImageNet | \frac{0.001}{(i+1)^{2}} | \frac{0.003}{(i+1)^{\frac{3}{2}}\sqrt{k+1}} | \frac{0.7}{(i+1) (k+1)^{\frac{1}{4}}} | 100 | 10 |
Solar | \frac{0.3}{(i+1)^2} | \frac{0.1}{(i+1)^{\frac{3}{2}}\sqrt{k+1}} | \frac{0.5}{(i+1) (k+1)^{\frac{1}{4}}} | 5 | 10 |
Method | Attack success rate | Average # of function evaluations | |
ZO-adaMM |
79% | ||
ZO-adaMM |
96% | ||
ZO-adaMM |
98% | ||
CMAES |
99% | ||
SSO | 100% |
Method | Attack success rate | Average # of function evaluations | |
ZO-adaMM |
|||
SSO |
n | The dimension of the space of the design variables |
\Omega | The sample space of \boldsymbol{\xi} , i.e., the set of all possible outcomes of \boldsymbol{\xi} |
\boldsymbol{\xi}: \Omega \to {{\mathbb{R}}}^m | The vector of uncertainties |
\mathbb{E}_{\boldsymbol{\xi}}[\cdot] | The expectation with respect to the random vector \boldsymbol{\xi} |
F : {{\mathbb{R}}}^n\times {{\mathbb{R}}}^m \to {{\mathbb{R}}} | The stochastic zeroth-order oracle that takes into account the uncertainty \boldsymbol{\xi} |
f : {{\mathbb{R}}}^n \to {{\mathbb{R}}} | The expectation of F with respect to \boldsymbol{\xi} |
\beta \in {{\mathbb{R}}}^{+*} | A strictly positive scalar for use as a smoothing parameter |
{\mathbf{u}} \in {{\mathbb{R}}}^n | A Gaussian random vector |
f^\beta = \mathbb{E}[f({\mathbf{x}} + \beta {\mathbf{u}})] | A smooth approximation of a function f |
L_0(f) | The Lipschitz constant associated with a function f |
L_1(f) | The Lipschitz constant associated with the gradient of a function f |
\nabla f | The gradient of a function f |
\tilde{\nabla} f | An estimator of the gradient of a function f |
\tilde{{\mathbf{g}}} | An estimator of the gradient of a function f based on outputs of the stochastic zeroth-order oracle F({\mathbf{x}}, \boldsymbol{\xi}) |
j \in [1, n] | The counter associated with the dimension |
i \in {{\mathbb{N}}} | The outer iteration counter associated with a subproblem |
k \in {{\mathbb{N}}} | The inner iteration counter |
{\mathbf{m}} \in {{\mathbb{R}}}^n | The momentum vector |
s_2^{i, k} \in (0, 1) | The step size associated with the momentum |
s_1^{i, k} \in (0, 1) | The step size associated with {\mathbf{x}} |
L \in {{\mathbb{R}}}^{+*} | An approximation of the Lispchitz constant |
q \in {{\mathbb{N}}} | The size of the mini batch used to estimate \tilde{\nabla} |
M \in {{\mathbb{N}}} | The minimum number of iterations used in the ZO-signum algorithm |
H_k^{(\alpha)} | The generalized harmonic number of order \alpha |
\mathcal{C}^{0+} | Class of Lipschitz continuous functions |
\mathcal{C}^{1+} | Class of differentiable functions whose gradient is Lipschitz |
\mathcal{C}^{\infty} | Class of infinitely differentiable functions |
Algorithm 3 signSGD algorithm. |
1: Input: {\mathbf{x}}^{0} , s_1 \in (0, 1) |
2: for k = 0, 1, \dots do |
3: Calculate an estimate of the stochastic gradient \tilde{\nabla} f({\mathbf{x}}^k) and update: |
\begin{align*} & {\mathbf{x}}^{k+1} = {\mathbf{x}}^{k} - s_1 \text{sign}(\tilde{\nabla} f({\mathbf{x}}^k)) \end{align*} |
4: end for |
5: Return {\mathbf{x}}^k |
Algorithm 4 Signum algorithm. |
1: Input: {\mathbf{x}}^{0}, {\mathbf{m}}^{0}, s_1 \in (0, 1), s_2 \in (0, 1) |
2: for k = 0, 1, \dots do |
3: Calculate an estimate of the stochastic gradient \tilde{\nabla} f({\mathbf{x}}^k) and update: |
\begin{align*} & {\mathbf{m}}^{k+1} = s_2 {\mathbf{m}}^{k} + (1-s_2) \tilde{\nabla} f({\mathbf{x}}^k) \\ & {\mathbf{x}}^{k+1} = {\mathbf{x}}^k - s_1 \text{sign}({\mathbf{m}}^{k+1}) \end{align*} |
3: end for |
4: Return {\mathbf{x}}^k |