
Despite recent advances in regularization theory, the issue of parameter selection still remains a challenge for most applications. In a recent work the framework of statistical learning was used to approximate the optimal Tikhonov regularization parameter from noisy data. In this work, we improve their results and extend the analysis to the elastic net regularization. Furthermore, we design a data-driven, automated algorithm for the computation of an approximate regularization parameter. Our analysis combines statistical learning theory with insights from regularization theory. We compare our approach with state-of-the-art parameter selection criteria and show that it has superior accuracy.
Citation: Zeljko Kereta, Valeriya Naumova. On an unsupervised method for parameter selection for the elastic net[J]. Mathematics in Engineering, 2022, 4(6): 1-36. doi: 10.3934/mine.2022053
Inverse problems deal with the recovery of an unknown quantity of interest x∈Rd from a corrupted observation y∈Rm. In most cases the relationship between x and y is linear, and can be approximately described by
y = Ax + \sigma w, | (1.1) |
where A∈Rm×d is a known linear forward operator, w is a zero-mean isotropic random vector, modeling the noise, and σ>0 is the noise level. Inverse problems of this type are ubiquitous in image processing, compressed sensing and other scientific fields. In image processing applications they model tasks such as denoising, where A is the identity; deblurring, where A is a convolution operator; and inpainting, where A is a masking operator.
The recovery of the original signal x from the corrupted observation y is an ill-posed inverse problem. Thus, the theory of inverse problems suggests the use of suitable regularization techniques [16]. Specifically, in case of Gaussian noise, x is approximated with the minimizer of a regularized functional
\mathop{\rm argmin}\limits_{z\in\mathbb{R}^d} \|Az - y\|_2^2 + \lambda J(z), | (1.2) |
where ‖⋅‖2 is the Euclidean norm modeling data-fidelity, J is a penalty term encoding an a priori knowledge on the nature of the true solution, and λ is a regularization parameter determining a trade-off between these two terms. Having the penalty term fixed, a central issue concerns the selection of λ. The optimal parameter λopt is the one that minimizes the discrepancy between the minimizer zλ(y) of (1.2) and the exact solution x, i.e.,
\lambda_{\rm opt} = \mathop{\rm argmin}\limits_{\lambda\in(0,+\infty)} \|z^\lambda - x\|_2. | (1.3) |
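For concreteness, the oracle rule (1.3) can only be evaluated when the clean signal x is available. The following minimal sketch illustrates it with a grid search, using a ridge (Tikhonov) solver as a stand-in regularizer; the function names (ridge, oracle_lambda), the grid, and the toy data are illustrative assumptions, not part of the method developed below.

```python
import numpy as np

def ridge(A, y, lam):
    """Stand-in regularized solver: minimizer of ||Az - y||_2^2 + lam * ||z||_2^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

def oracle_lambda(A, y, x_true, lambdas, solver=ridge):
    """Oracle parameter (1.3): the lambda whose solution is closest to the clean x."""
    errors = [np.linalg.norm(solver(A, y, lam) - x_true) for lam in lambdas]
    k = int(np.argmin(errors))
    return lambdas[k], errors[k]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
x = rng.standard_normal(30)
y = A @ x + 0.1 * rng.standard_normal(50)
lam_opt, err = oracle_lambda(A, y, x, np.logspace(-4, 1, 60))
```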
If a dataset of pairs of clean signals and noisy observations were available, we could try to learn a suitable regression functional y↦λopt that would allow us to select a regularization parameter knowing only the noisy observation. Unfortunately though, due to the curse of dimensionality, accurately learning such a high-dimensional function requires a number of samples that scales exponentially with the ambient dimension [10,26]. A common approach to mitigate these effects is to assume that the relevant data are supported on structures of substantially lower dimensionality. On a practical level, regularization parameters are instead typically selected heuristically or through cross-validation.
But, in many applications x is unknown (thus we cannot use supervised methods) and there is no available information about the noise w or the noise level σ. This is the case in classical regularization theory, where the clean image is unknown and hence λopt is approximated using prior knowledge or an estimate of the noise. We add that classical regularization theory is mostly concerned with data that belongs to a function space, and correspondingly most parameter selection methods focus on the recovery of the minimum least-squares norm solution.
Choosing a good approximation to λopt is a non-trivial, problem-dependent task that has motivated significant amounts of research over the last decades. However, there is still no framework that allows a fast and efficient parameter selection, particularly in a completely unsupervised setting. In this paper, we aim at (partially) closing this gap and provide a novel concept for automated parameter selection by recasting the problem in the framework of statistical learning theory. Specifically, inspired by recent and (to our knowledge) first results in this spirit [10] we propose a method for computing an accurate regularized solution zλ(y) for the elastic net, with a nearly optimal regularization parameter λ, that uses a dimension reduction preprocessing step with the help of a dataset of corrupted data samples. We emphasize that the method is unsupervised and requires minimal human interference.
Existing parameter selection methods. Parameter selection rules used in regularization theory can be broadly classified as those based on the discrepancy principle [1,25], generalized cross-validation (GCV) [19], the balancing principle [23,34], quasi-optimality [21,31], and various estimates of the mean-squared error (MSE) (see [11,27] and references therein). GCV is a particularly popular rule for linear methods since it gives a closed form for the regularization parameter and requires neither the tuning of additional parameters nor knowledge of the noise. In specialized cases GCV can be extended to nonlinear problems [36], but the regularization parameter is then no longer given in closed form nor through an implicit equation. The balancing principle is a stable method that has received a lot of attention in the inverse problems community and has also been studied in the framework of learning theory, but it requires the tuning of additional parameters. Quasi-optimality is one of the simplest parameter choice methods; it does not require any information about the problem, but it is not as stable as the (hardened) balancing principle since it does not account for statistical outliers [3]. Discrepancy and MSE-based principles still remain the preferred methods for parameter selection for nonlinear estimators due to their simplicity and accuracy. We refer to [3] for a recent, rather comprehensive comparative study of the existing approaches.
In order to select the regularization parameter, most existing methods require the regularized solution zλ(y) to be computed over a predefined grid of values of λ. The regularization parameter is then chosen according to some criterion, e.g., the loss over a validation set. Finding regularization parameters by exhaustive search is a computationally expensive task, especially for high-dimensional data, and often comes with no guarantees on the quality of the approximation. Moreover, most criteria presuppose that some a priori information is available, such as an accurate estimate of the noise level (e.g., for the discrepancy principle) or bounds on the noise error (e.g., for the balancing principle), and require additional, method-specific parameters to be preselected.
Elastic net regularization. Elastic net regularization was proposed by Zou and Hastie [37], as
z^\lambda(y) = \mathop{\rm argmin}\limits_{z\in\mathbb{R}^d} \|Az - y\|_2^2 + \lambda\left(\|z\|_1 + \alpha\|z\|_2^2\right), | (1.4) |
where α≥0 is a hyperparameter controlling the trade-off between ℓ1 and ℓ2 penalty terms. Our motivation for considering the elastic net is that it produces sparse models comparable to the Lasso (and is thus well suited for problems with data on lower dimensional structures), while often achieving superior accuracy in real-world and simulated data. Moreover, the elastic net overcomes the main limitations of ℓ1 minimization. Namely, it encourages the grouping effect, which is relevant for many real-life applications such as microarray classification and face detection (see [9] and references therein).
To solve (1.4) the authors in [37] rewrite the elastic net functional as Lasso regularization with augmented datum, use LARS [14] to reconstruct the entire solution path, and apply cross-validation to select the optimal regularization parameter. Later work [9] studies theoretical properties of (1.4) in the context of learning theory, analyzes the underlying functional and uses iterative soft-thresholding to compute the solution. For the parameter choice the authors provide an adaptive version of the balancing principle [23,34]. The rule aims to balance approximation and sample errors, which have contrasting behavior with respect to the tuning parameter, but requires (potentially) many evaluations of zλ(y). We will rework some of the arguments from [9] for the computation of zλ, while keeping our focus on an efficient approach for parameter learning. In [22] the authors propose an active set algorithm for solving (1.4). Addressing the problem in the framework of classical regularization theory, the authors consider the discrepancy principle [6,25] for determining the parameter. This requires estimations of the solution zλ for many parameter values, and a pre-tuning of other, method-specific parameters. Moreover, it is assumed that the noise level is known, which is often not the case in practice.
The authors in [24] use a hybrid alternating method for tuning the parameters λ and α for the model fitting problem y_i = a_i^\top x, i=1,\ldots,n, where y_i\in\mathbb{R}, a_i\in\mathbb{R}^p and x\in\mathbb{R}^p. The first step is to update the solution zλ, using coordinate descent, and then to update λ and α in one iteration. The main advantage is efficiency, as one does not need to compute zλ over a fine parameter grid, but only on a much coarser one. The method is in spirit similar to LARS, but has better scalability. It requires solving a non-convex problem, and hence has inherent limitations. Moreover, it cannot be used in the setting of general inverse problems (1.1), where the design matrix A is fixed and each response y is generated by a new clean signal x. In summary, we are not aware of any parameter selection rule for the elastic net that allows one to select λ without a priori assumptions and without extensive manual adjustments.
This work. We leverage the work [10] where the authors propose a data-driven method for determining the optimal parameter for Tikhonov regularization, under the assumption that a training set of independent observations y1,…,yN is made available, each of them associated with an (unknown) signal x1,…,xN through yi=Axi+σwi. The starting point of the method is to find an empirical proxy ˆx of the real solution x by assuming that x1,…,xN are distributed over a lower-dimensional linear subspace and then select the regularization parameter as
\widehat{\lambda}_{\rm opt} = \mathop{\rm argmin}\limits_{\lambda\in(0,+\infty)} \|z^\lambda_{\rm Tik}(y) - \widehat{x}\|_2, | (1.5) |
where z^\lambda_{\rm Tik}(y) is the minimizer of the Tikhonov functional \min_{z\in\mathbb{R}^d}\|Az-y\|_2^2+\lambda\|z\|_2^2. The analysis and techniques regarding ˆx are independent of the chosen optimization scheme, whereas the selection of ˆλopt is defined by the regularization scheme. However, it is worth mentioning that if A is not injective, ˆx is not a good proxy of x. For Tikhonov regularization this is not an issue as, without loss of generality, we can always assume that A is injective. Specifically, one can replace x in (1.3) with x†=A†Ax and recall that z^\lambda_{\rm Tik} belongs to ker⊥(A) for all λ. Therefore, for wider applicability of the suggested framework, it is important to address the selection of ˆλopt for a wider class of regularizers and inverse problems.
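To make the idea of (1.5) concrete, the sketch below shows the empirical selection rule for Tikhonov regularization, with a simple grid search standing in for the actual minimization and the construction of the proxy x̂ deferred to Section 2; the function names are hypothetical and the sketch is not the exact procedure of [10].

```python
import numpy as np

def tikhonov(A, y, lam):
    """Closed-form Tikhonov minimizer of ||Az - y||_2^2 + lam * ||z||_2^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

def select_lambda_empirical(A, y, x_hat, lambdas):
    """Approximate (1.5): minimize the distance to the empirical proxy x_hat
    over a grid of candidate parameters (any 1-D minimizer could be used instead)."""
    errors = [np.linalg.norm(tikhonov(A, y, lam) - x_hat) for lam in lambdas]
    return lambdas[int(np.argmin(errors))]
```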
In this paper, we extend the framework of [10] by providing the analysis for the elastic net regularization, and by improving the theoretical results regarding the empirical estimators through stronger concentration bounds, see Lemma 2.2. Moreover, we develop an efficient, fully automated algorithm for computing a nearly optimal regularization parameter, which we call OptEN. The algorithm uses backtracking line search to minimize a given loss function: the discrepancy between the elastic net solution zλ(y) and the empirical estimator ˆx, which is well suited if A is injective. In case of a non-injective A we develop an adjusted loss function that again ensures near optimal performance. Lastly, whereas in [10] the dimensionality of the subspace to which the real solution x belongs is assumed to be known (and is needed to construct the empirical estimator ˆx), in this work we develop two strategies for its estimation. The first is based on the spectrum of the empirical covariance matrix of noisy observations, and is used when x lies in a true lower-dimensional linear subspace. The second strategy is used in cases where x only approximately lies in a lower-dimensional subspace, as is the case, e.g., in wavelet denoising. The algorithm is extensively tested on synthetic and real-world examples. The last point is the main practical contribution of our paper. We add that our goal is not to introduce a new regularization paradigm but rather to design a fast and unsupervised method for determining a near optimal regularization parameter for existing regularization methods. In summary, we analyze our problem in two settings:
(ⅰ) simplified case A=Id (corresponding to image denoising): We restate the lower level problem and show that in case of a bounded w it admits a unique minimizer, which motivates the algorithm. Furthermore, we provide a bound on |λopt−ˆλopt| for independent Bernoulli random noise and discuss the number of samples needed for optimal learning, see Theorem 4.4. Though the latter model might be oversimplified, it captures the essence of the problem and our experiments confirm the results in more general settings.
(ⅱ) general case: for a general matrix A we provide an unsupervised, efficient, and accurate algorithm for the computation of an approximate optimal parameter. We study the performance of our algorithm, comparing it to state-of-the-art parameter choice methods on synthetic and image denoising problems. The obtained results show that our approach achieves superior accuracy.
In Section 2, we describe the assumptions on the linear inverse problem and the distribution of data that we study, and we define and prove bounds regarding empirical estimators. In Section 3 we discuss minimizers of the elastic net (1.4), and define loss functions that will be used for parameter selection. Section 4 provides the main theoretical results of the paper regarding loss functionals and their minimizers. In Section 5 we present an efficient and accurate algorithm for the computation of an approximate optimal parameter. We study the performance of our method through several numerical experiments on synthetic and imaging data in Section 6. Therein, we compare our method with state-of-the-art parameter selection criteria in terms of accuracy of the solution recovery, closeness to the optimal parameter, sparse recovery and computational time. For imaging tasks our focus is on wavelet-based denoising where we work on synthetic images and real-world brain MRIs. We conclude with a brief discussion about future directions in Section 7. The Appendix contains proofs of auxiliary results.
The Euclidean and the ℓ1-norms of u=(u1,…,ud)⊤ are denoted by ‖u‖2 and ‖u‖1, respectively. The modulus function |⋅|, the sign function sgn(⋅), and the positive part function (⋅)+ are defined component-wise for i=1,…,d, by |u|i=|ui|, sgn(u)i=sgn(ui), and ((u)+)i=(ui)+, where for any u∈R
\operatorname{sgn}(u) = \begin{cases} 1, & \text{if } u > 0, \\ 0, & \text{if } u = 0, \\ -1, & \text{if } u < 0, \end{cases} \qquad\text{and}\qquad (u)_+ = \max\{0, u\}.
The canonical basis of Rd is denoted by {ei}i=1,…,d. We denote the transpose of a matrix M by M⊤, the Moore-Penrose pseudo inverse by M†, and the spectral norm by ‖M‖2. Furthermore, range(M) and ker(M) are the range and the null space of M, respectively. For a square-matrix M, we use trace(M) to denote its trace. The identity matrix is denoted by Id and we use 1D for the indicator function of a set D⊂Rd. For any v∈Rd, v⊗v is the rank one operator acting on w∈Rd as (v⊤w)v.
A random vector ξ is called sub-Gaussian if
\|\xi\|_{\psi_2} := \sup\limits_{\|v\|_2 = 1}\,\sup\limits_{q\geq 1}\, q^{-\frac{1}{2}}\,\mathbb{E}\left[|v^\top\xi|^q\right]^{\frac{1}{q}} < +\infty.
The value ‖ξ‖ψ2 is the sub-Gaussian norm of ξ, with which the space of sub-Gaussian vectors becomes a normed vector space [33]. The (non-centered) covariance of a random vector ξ is denoted as
\Sigma(\xi) := \operatorname{Cov}(\xi) = \mathbb{E}[\xi\otimes\xi].
We write a≲b if there exists an absolute constant C>0 such that a≤Cb.
We consider the following stochastic linear inverse problem: given a deterministic matrix A∈Rm×d, we are interested in recovering a vector x∈Rd from a noisy observation y∈Rm obeying
y = Ax + \sigma w, | (2.1) |
where
(A1) the unknown datum x∈Rd is a sub-Gaussian vector, such that ‖x‖ψ2=1;
(A2) there exists a subset of h indices 1\leq i_1 < \ldots < i_h\leq d, with h≪d, such that
V := \operatorname{range}(\Sigma(x)) = \operatorname{span}\{e_{i_1},\ldots,e_{i_h}\}
and ker(A)∩V={0};
(A3) the noise w∈Rm is an independent sub-Gaussian vector, such that ‖σw‖ψ2≤1, Σ(w)=Id and σ>0 is the noise level.
Conditions (A1) and (A3) are standard assumptions on the distributions of the exact datum x and the noise σw, ensuring that the tails have fast decay. Note also that from a theoretical standpoint, normalization conditions on x and w can always be satisfied by rescaling.
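For intuition, the following sketch generates data satisfying (A1)–(A3) in spirit: x is Gaussian and supported on a fixed set of h coordinates, and the noise is isotropic Gaussian. The dimensions, noise level, and exact normalizations are illustrative assumptions rather than the experimental setup of Section 6.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, h, sigma, N = 60, 100, 5, 0.05, 500

support = np.sort(rng.choice(d, size=h, replace=False))  # indices i_1 < ... < i_h in (A2)
A = rng.standard_normal((m, d)) / np.sqrt(m)              # known forward operator

def sample():
    x = np.zeros(d)
    x[support] = rng.standard_normal(h)   # sub-Gaussian datum supported on V
    w = rng.standard_normal(m)            # isotropic sub-Gaussian noise, Cov(w) = Id
    return x, A @ x + sigma * w

X, Y = map(np.array, zip(*(sample() for _ in range(N))))  # training set of noisy observations
```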
As discussed in the introduction, we consider the recovery of x by means of regularized minimization, zλ(y)=argminz∈Rd‖Az−y‖22+λJ(z), where J(z) is the elastic net functional. More specifically, we are interested in the selection of the regularization parameter λ. To do so, we follow [10] and leverage assumption (A2) to construct an empirical estimator of x, which will be used to select λ by minimizing the suitable notion of discrepancy between zλ(y) and the constructed empirical estimator.
We begin by discussing the construction of the empirical estimator, from the given training data of noisy samples, and developing concentration bounds. First, it follows from the definition that V is the smallest subspace such that x∈V, almost surely. Thus, by (A2), the exact datum x almost surely has at most h non-zero entries, and since ker(A)∩V={0}, it is a unique vector with that property. Define now W=range(Σ(Ax)). The following simple result was shown for an injective A in [10]; here we extend it to the general case.
Lemma 2.1. Under Assumption (A2) we have dimW=h and W=AV.
Proof. A direct computation gives
\Sigma(Ax) = \mathbb{E}[Ax\otimes Ax] = A\,\Sigma(x)\,A^\top = AP\,\Sigma(x)\,(AP)^\top,
where P denotes the orthogonal projection onto V. Assumption (A2) says that A is injective on V, and thus Σ(Ax) and Σ(x) both have rank h. Furthermore, (AP)⊤ maps Rm onto V, so that
\operatorname{range}(\Sigma(Ax)) = (AP\,\Sigma(x))\,V = APV = W,
where Σ(x)V=V, by the definition of V and since Σ(x) is symmetric.
Lemma 2.1 suggests that V could be directly recovered if A were invertible and W were known. In most practical situations though, neither of those assumptions is satisfied: we only have access to noisy observations and A could not only be non-invertible, but also non-injective. We will address this issue by recasting the problem to a statistical learning framework, similar to [10]. Namely, suppose we are given observation samples y1,…,yN such that yi=Axi+σwi for i=1,…,N, where xi,wi and σ are unknown, and let
\widehat{\Sigma}(y) = \frac{1}{N}\sum\limits_{i = 1}^N y_i\otimes y_i
be the empirical covariance of the observations. Standard statistical theory suggests that ˆΣ(y) is a good approximation to Σ(y) provided N is large enough. As a consequence, we will show that a vector space spanned by the first h eigenvectors of ˆΣ(y), denoted by ˆW, is a good estimator of W.
To justify the above claims, observe first that since Σ(w)=Id holds by (A3), we have
\Sigma(y) = \Sigma(Ax) + \sigma^2\,\mathsf{Id}. | (2.2) |
Therefore, Σ(y) and Σ(Ax) have the same eigenvectors and the spectrum of Σ(y) is just a shift of the spectrum of Σ(Ax) by σ2. Let λ1≥…≥λh be the non-zero eigenvalues of Σ(Ax), counted with multiplicity, and μ1≥…≥μm and ˆμ1≥…≥ˆμm be the eigenvalues of Σ(y) and ˆΣ(y), respectively. From (2.2) it follows
\begin{cases} \mu_i = \lambda_i+\sigma^2, & \text{for } i = 1,\ldots,h, \\ \mu_i = \sigma^2, & \text{for } i = h+1,\ldots,m. \end{cases} | (2.3) |
Let Π be the (orthogonal) projection onto W, which has rank h due to Lemma 2.1, and let ˆΠ be the (orthogonal) projection onto ˆW. We now show the fundamental tool of our study: that ˆΠ is an accurate and an unbiased approximation of Π for a sufficiently small noise level σ. We distinguish between bounded and unbounded y and improve upon results in [10].
Lemma 2.2. Assume that σ2<λh and ‖Σ(y)‖2≥1. Given u>0, with probability greater than 1−2exp(−u)
\|\widehat{\Pi}-\Pi\|_2 \lesssim \frac{\lambda_1}{\lambda_h}\left(\sqrt{\frac{h+\sigma^2 m+u}{N}} + \frac{h+\sigma^2 m+u}{N}\right), | (2.4) |
provided N≳(h+σ2m+u). Furthermore, if y is bounded, then with probability greater than 1−exp(−u)
\|\widehat{\Pi}-\Pi\|_2 \lesssim \frac{\lambda_1}{\lambda_h}\left(\sqrt{\frac{\log(h+m)+u}{N}} + \frac{\log(h+m)+u}{N}\right), | (2.5) |
provided N≳(log(2m)+u).
Proof. We will first show (2.4). Using Theorem 9.2.4 and Exercise 9.2.5 in [33], we have
\|\Sigma(y)-\widehat{\Sigma}(y)\|_2 \lesssim \|\Sigma(y)\|_2\left(\sqrt{\frac{r+u}{N}} + \frac{r+u}{N}\right), | (2.6) |
with probability greater than 1−2exp(−u), where r=trace(Σ(y))/‖Σ(y)‖2 is the stable rank of Σ(y)1/2.
Using trace(Σ(y))≤λ1h+mσ2 and ‖Σ(y)‖2=λ1+σ2≤2λ1, we get
\|\Sigma(y)-\widehat{\Sigma}(y)\|_2 \lesssim \lambda_1\left(\sqrt{\frac{h+\sigma^2 m+u}{N}} + \frac{h+\sigma^2 m+u}{N}\right), | (2.7) |
where we used ‖Σ(y)‖2≥1 to bound r. Let μ∗=μh>σ2. By (2.3) it follows that Π is the projection onto the linear span of those eigenvectors of Σ(y) whose corresponding eigenvalue is greater than or equal to μ∗. Using μh−μh+1=λh, by (2.7) we have
\epsilon := \|\Sigma(y)-\widehat{\Sigma}(y)\|_2 < \frac{\mu_h-\mu_{h+1}}{2} = \frac{\lambda_h}{2}, | (2.8) |
provided N≳(h+σ2m+u). Let now Πμ∗ be the projection onto the linear span of those eigenvectors of ˆΣ(y) whose corresponding eigenvalue is greater than or equal to μ∗. As a consequence of Theorem 7.3.1 in [5], there exists an eigenvalue ˆμ∗ of ˆΣ(y) such that
|\mu_*-\widehat{\mu}_*| \leq \epsilon, \quad\text{and}\quad \dim\Pi_{\mu_*} = \dim\Pi | (2.9) |
\widehat{\mu}_j \leq \mu_{h+1}+\epsilon = \sigma^2+\epsilon, \quad \forall\,\widehat{\mu}_j < \widehat{\mu}_*, | (2.10) |
\|\Pi_{\mu_*}-\Pi\|_2 \leq \frac{1}{\lambda_h-\epsilon}\,\|(\mathsf{Id}-\Pi_{\mu_*})(\widehat{\Sigma}(y)-\Sigma(y))\,\Pi\|_2. | (2.11) |
By (2.9) it follows that ˆμ∗=ˆμh so that Πμ∗=ˆΠ and hence
\|\widehat{\Pi}-\Pi\|_2 \leq \frac{1}{\lambda_h-\epsilon}\,\|\widehat{\Sigma}(y)\Pi-\Sigma(y)\Pi\|_2 \leq \frac{2}{\lambda_h}\,\|\widehat{\Sigma}(y)\Pi-\Sigma(y)\Pi\|_2. | (2.12) |
Since ‖ˆΣ(y)Π−Σ(y)Π‖2≤‖ˆΣ(y)−Σ(y)‖2, the claim follows by (2.7).
Assume now that ‖y‖2≤√L holds almost surely and consider a family of independent m×h matrices
S_i = (y_i\otimes y_i)\,\Pi - \Sigma(y)\,\Pi, \qquad i = 1,\ldots,N.
Since \frac{1}{N}\sum_{i = 1}^N S_i = \widehat{\Sigma}(y)\Pi-\Sigma(y)\Pi we can apply the matrix Bernstein inequality for rectangular matrices (Theorem 6.1.1. in [32]). Thus, for u>0 we have
\mathbb{P}\left(\|\widehat{\Sigma}(y)\Pi-\Sigma(y)\Pi\|_2 \geq u\right) \leq (m+h)\exp\left(-\frac{Nu^2}{M+2Lu/3}\right),
where M>0 is a matrix variance constant independent of m, h, and d, such that
\max\left\{\mathbb{E}\|S_i^\top S_i\|_2,\ \mathbb{E}\|S_i S_i^\top\|_2\right\} \leq M.
A direct computation gives M≤L‖Σ(y)‖2. It follows that
\|\widehat{\Sigma}(y)\Pi-\Sigma(y)\Pi\|_2 \lesssim \lambda_1\left(\sqrt{\frac{\log(h+m)+u}{N}} + \frac{\log(h+m)+u}{N}\right),
holds with probability greater than 1−exp(−u) for every u>0. Moreover, by an analogous argument (2.8) holds provided N≳(log(2m)+u); see the Appendix (Section 7) for details. Thus, (2.5) follows by applying (2.12).
The previous result comes with a certain caveat. Namely, the proof implicitly assumes that either h or the spectral gap are known (which informs the choice of the approximate projector ˆΠ). In practice however, the desired eigenspace can only be detected if there is a spectral gap and if it corresponds to the eigenspace we want to recover, i.e., if λh>δ, where
\delta = \max\limits_{i = 1,\ldots,h-1}\left(\lambda_i-\lambda_{i+1}\right) = \max\limits_{i = 1,\ldots,h-1}\left(\mu_i-\mu_{i+1}\right). | (2.13) |
Proposition 2.3. Assume λh>δ, where δ is given by (2.13). Then the empirical covariance matrix has a spectral gap at the h-th eigenvalue, with probability greater than 1−2exp(−u), provided δ<λh and N \gtrsim \frac{\lambda_1^2}{(\lambda_h-\delta)^2}(h+u).
Proof. Assume ‖Σ(y)−ˆΣ(y)‖2<ϵ, for ϵ>0. Since maxj=1,…,m|ˆμj−μj|≤‖Σ(y)−ˆΣ(y)‖2 we get
\left|\widehat{\mu}_j-\widehat{\mu}_{j+1}\right| \leq 2\epsilon + \left|\mu_j-\mu_{j+1}\right|,
by adding and subtracting μj and μj+1 inside the first term. Thus, if j>h then |ˆμj−ˆμj+1|≤2ϵ by (2.3), and if j<h then |ˆμj−ˆμj+1|≤2ϵ+δ, by the definition of δ in (2.13). For j=h on the other hand we have |ˆμh−ˆμh+1|>|μh−μh+1|−2ϵ. In conclusion,
\mathop{\rm argmax}\limits_{i = 1,\ldots,m-1}\left(\widehat{\mu}_i-\widehat{\mu}_{i+1}\right) = h
holds provided \epsilon < \frac{\lambda_h-\delta}{4}. Using (2.6) the claim follows.
It is clear that if δ>λh then the spectral gap of the empirical covariance matrix is achieved at argmaxi=1,…,h−1(λi−λi+1), which is smaller than h. In practice though, the situation is not as pessimistic as this observation would suggest and we can rely on a wealth of ad hoc remedies. We devote more attention to this question in Section 6.1, and suggest alternative heuristics for estimating the intrinsic dimension h.
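A simple heuristic in the spirit of Proposition 2.3 is to place h at the largest gap in the spectrum of the empirical covariance of the observations. The sketch below is one such ad hoc remedy; the optional cap h_max and the function name are illustrative choices.

```python
import numpy as np

def estimate_h(Y, h_max=None):
    """Estimate the intrinsic dimension h as the position of the largest gap
    in the (decreasing) eigenvalues of the empirical covariance of the samples.
    Y is an (N, m) array whose rows are noisy observations y_i."""
    Y = np.asarray(Y)
    Sigma_hat = Y.T @ Y / Y.shape[0]                 # empirical second-moment matrix
    eigvals = np.linalg.eigvalsh(Sigma_hat)[::-1]    # sorted in decreasing order
    if h_max is not None:
        eigvals = eigvals[:h_max + 1]
    gaps = eigvals[:-1] - eigvals[1:]
    return int(np.argmax(gaps)) + 1                  # gap after the h-th eigenvalue
```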
We are ready to define our empirical estimator of x. Let Q=AA† be the orthogonal projection onto range(A)=ker⊥(A⊤), and P=A†A the orthogonal projection onto range(A⊤)=ker⊥(A). The empirical estimator of x is defined as
\widehat{x} = A^\dagger\,\widehat{\Pi}\,y. | (2.14) |
In the following, we use ˆx to learn a nearly optimal regularization parameter for the elastic net, but it can in principle be used for a broader range of problems. This is due to the fact that ˆx is independent of the choice of the optimization scheme. That is, it only leverages the structural assumption (A2).
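Computationally, the estimator (2.14) only requires one eigendecomposition of the empirical covariance and one pseudo-inverse; a minimal sketch (assuming h is known, or estimated as above; the function name is hypothetical):

```python
import numpy as np

def empirical_estimator(Y, A, y, h):
    """Compute x_hat = A^dagger Pi_hat y as in (2.14).
    Y is an (N, m) array of training observations, y a new noisy observation."""
    Y = np.asarray(Y)
    Sigma_hat = Y.T @ Y / Y.shape[0]              # empirical covariance of the y_i
    _, eigvecs = np.linalg.eigh(Sigma_hat)        # eigenvalues in ascending order
    U = eigvecs[:, -h:]                           # top-h eigenvectors span W_hat
    Pi_hat = U @ U.T                              # orthogonal projection onto W_hat
    x_hat = np.linalg.pinv(A) @ (Pi_hat @ y)
    return x_hat, Pi_hat
```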
Remark 2.4. One might want to consider ˆx as an approximate solution by itself, and completely avoid regularization and thus the issue of parameter choice. Even though ˆx can be a good estimator of x in certain situations, it will however often perform poorly. Namely, when A is not injective, when the training set size N is small, or when the noise level is small, the empirical estimator ˆx will perform worse than the regularized solution zλ(y). This can be seen by the empirical evidence, cf. Table 2, where ˆx is the worst performing method. In addition, ˆx does not preserve the structure of the original signal, e.g., ˆx will in general not be sparse for a sparse x. Thus, regularization is indeed needed to ensure the desired structure and optimize the reconstruction performance. Lastly, we remind that we are interested in using an estimator ˆx for which |λopt−ˆλopt| is small with high probability (i.e., the one that can be used to derive an accurate parameter selection) and we are not interested in directly controlling ‖x−ˆx‖2, which is the goal in manifold learning [4].
Observe also that, setting \widehat{\boldsymbol{\eta}} = {{\bf{y}}} - \widehat{\Pi} {{\bf{y}}} , the empirical estimator \widehat{ {{\bf{x}}}} satisfies the (empirical) inverse problem
\begin{equation} { \rm{A}}\widehat{ {{\bf{x}}}} + { \rm{Q}}\widehat{\boldsymbol{\eta}} = { \rm{Q}} {{\bf{y}}}. \end{equation} | (2.15) |
Thus, a direct consequence of (2.15) is that minimizers of the empirical and of the original problem coincide.
Lemma 2.5. Let \widehat{ {{\bf{z}}}}^\lambda{({{\bf{y}}})} = \mathop {{\rm{argmin}}}_{ {{\bf{z}}}\in\mathbb{R}^d} \left\|{{ {A} {{\bf{z}}}- {Q} {{\bf{y}}}}}\right\|_2^2 + \lambda\, J({{\bf{z}}}), where {{Q}} = {{A}} { {A}}^\dagger , for { {A}}\in \mathbb{R}^{m\times d} , and J is the (elastic net) penalty term. Then \widehat{ {{\bf{z}}}}^\lambda{({{\bf{y}}})} = {{\bf{z}}}^\lambda({{\bf{y}}}) , where {{\bf{z}}}^\lambda({{\bf{y}}}) is given by (1.4).
Proof. We compute \left\|{{ {\rm A} {{\bf{z}}} - {{\bf{y}}}}}\right\|_2^2 = \left\|{{ {\rm A} {{\bf{z}}} - {\rm Q} {{\bf{y}}} + ({\rm Q}- \mathsf{{Id}}) {{\bf{y}}}}}\right\|_2^2. Since { \rm{Q}} is an orthogonal projection onto \operatorname{range}({ \rm{A}}) it follows ({ \rm{Q}}- \mathsf{{Id}}) {{\bf{y}}}\in \operatorname{range}^\perp({ \rm{A}}) . Using Pythagoras' theorem we thus have \left\|{{ {\rm A} {{\bf{z}}} - {{\bf{y}}}}}\right\|_2^2 = \left\|{{ {\rm A} {{\bf{z}}} - {\rm Q} {{\bf{y}}}}}\right\|_2^2 + \left\|{{({\rm Q}- \mathsf{{Id}}) {{\bf{y}}}}}\right\|_2^2. Since the second term does not depend on {{\bf{z}}} we get
\mathop {{\rm{argmin}}}\limits_{ {{\bf{z}}}\in\mathbb{R}^d}\left\|{{ { \rm{A}} {{\bf{z}}}- {{\bf{y}}}}}\right\|^2_2 +\lambda J( {{\bf{z}}}) = \mathop {{\rm{argmin}}}\limits_{ {{\bf{z}}}\in\mathbb{R}^d} \left\|{{ { \rm{A}} {{\bf{z}}}- { \rm{Q}} {{\bf{y}}}}}\right\|^2_2 +\lambda\, J( {{\bf{z}}}). |
In this section we discuss the elastic net minimization, and the loss functionals that will be used for the selection of the regularization parameter.
From now on we focus on the parameter choice for the elastic net, where J({{\bf{z}}}) = \left\|{{ {{\bf{z}}}}}\right\|_1 + \alpha\left\|{{ {{\bf{z}}}}}\right\|_2^2 , so that
\begin{equation} {{\bf{z}}}^{\lambda}( {{\bf{y}}}) = \mathop {{\rm{argmin}}}\limits_{ {{\bf{z}}}\in \mathbb{R}^d} \left\|{{ { \rm{A}} {{\bf{z}}}- {{\bf{y}}}}}\right\|_2^2 + \lambda \left({\left\|{{ {{\bf{z}}}}}\right\|_1 + \alpha\left\|{{ {{\bf{z}}}}}\right\|_2^2}\right). \end{equation} | (3.1) |
The term \left\|{{ {{\bf{z}}}}}\right\|_1 , enforces the sparsity of the solution, whereas \left\|{{ {{\bf{z}}}}}\right\|_2^2 enforces smoothness and ensures that in case of highly correlated features we can correctly retrieve all the relevant ones. We first recall some basic facts about existence, uniqueness and sensitivity of elastic net solutions with respect to regularization parameters [22].
Lemma 3.1. The elastic net functional is strictly convex and coercive. Moreover, for each \lambda > 0 the minimizer of (1.2) exists, is unique and the mapping \lambda\mapsto {{\bf{z}}}^\lambda is continuous with respect to \lambda > 0 .
In the remainder of this paper we will recast (3.1) as
\begin{align} {{\bf{z}}}^{t}( {{\bf{y}}}) & = \mathop {{\rm{argmin}}}\limits_{ {{\bf{z}}}\in \mathbb{R}^d} t\left\|{{ { \rm{A}} {{\bf{z}}}- {{\bf{y}}}}}\right\|_2^2 + (1-t)\left({\left\|{{ {{\bf{z}}}}}\right\|_1 + \alpha\left\|{{ {{\bf{z}}}}}\right\|_2^2}\right), \end{align} | (3.2) |
where t\in[0, 1] and \alpha > 0 is a fixed parameter. For t\in(0, 1) the solutions of (3.2) correspond to solutions of (3.1) for \lambda = \frac{1-t}{t}. On the other hand, for t = 0 we get {{\bf{z}}}^0({{\bf{y}}}) = {\bf{0}} , and for t = 1 we define {{\bf{z}}}^1({{\bf{y}}}): = {{\bf{x}}}^\alpha, where
\begin{equation} {{\bf{x}}}^\alpha = \mathop {{\rm{argmin}}}\limits_{ {{\bf{z}}}\in {\cal N}} \left({\left\|{{ {{\bf{z}}}}}\right\|_1 + \alpha\left\|{{ {{\bf{z}}}}}\right\|_2^2}\right), \text{ for } {\cal N} = \{ {{\bf{z}}}\in \mathbb{R}^d\mid { \rm{A}}^\top { \rm{A}} {{\bf{z}}} = { \rm{A}}^\top {{\bf{y}}}\}. \end{equation} | (3.3) |
This definition is driven by the following observations. First, the set {\cal N} = { \rm{A}}^\dagger {{\bf{y}}} \oplus \operatorname{ker}\left({ { \rm{A}}}\right) is non-empty (since { \rm{A}} has finite rank). Furthermore, it was shown in [9] and [22] that in case of the elastic net minimization
\begin{equation} \lim\limits_{t\to 1} {{\bf{z}}}^t = { {{\bf{x}}}}^\alpha. \end{equation} | (3.4) |
In other words, {{\bf{x}}}^\alpha plays the role of the Moore-Penrose solution in linear regularization schemes [16]. By Lemma 3.1 the minimizer of (3.2) always exists and is unique, and the map t\mapsto {{\bf{z}}}^t is continuous for t\in(0, 1) . Equation (3.4) implies that t\mapsto {{\bf{z}}}^t is continuous at t = 1 , and later in (3.7) we show that the continuity also holds at t = 0 .
Solution via soft-thresholding. The elastic net does not admit a closed form solution in case of a general forward matrix { \rm{A}} . In Zou and Hastie [37] the elastic net problem is recast as a Lasso problem with augmented data, which can then be solved by many different algorithms (e.g., the LARS method [14]). Alternative algorithms compute the elastic net minimizer directly, and are generally either of the active set [22] or the iterative soft-thresholding-type [9]. Here we adhere to iterative soft-thresholding, and rework the arguments in [9] to show that the solution to (3.2) can be obtained through fixed point iterations for all t\in[0, 1] . To begin, define the soft-thresholding function by
\begin{align} {\cal S}_\tau( \mathsf{u}) & = \operatorname{sgn}( \mathsf{u})\left({\left|{{ \mathsf{u}}}\right|-\frac{\tau}{2}}\right)_+, \end{align} | (3.5) |
and the corresponding soft-thresholding operator \boldsymbol{ {\cal S}}_\tau(\boldsymbol{ {{\bf{u}}}}) , acting component-wise on vectors {{\bf{u}}}\in \mathbb{R}^m . The next lemma is a direct reworking of the arguments in [9,Corollary 1] and states that (3.2) is a fixed point of a contractive map.
Lemma 3.2. The solution to (3.2), for {{A}}\in \mathbb{R}^{m\times d}, {{\bf{y}}}\in \mathbb{R}^m and t\in (0, 1) , satisfies {{\bf{z}}} = {\cal T}_t({{\bf{z}}}) , where the map {\cal T}_t: \mathbb{R}^d\rightarrow \mathbb{R}^d is a contraction and is defined by
\begin{equation} {{\bf{z}}}^t = {\cal T}_t( {{\bf{z}}}) = \frac{1}{\theta t +(1-t)\alpha} \boldsymbol{{\cal S}}_{1-t}\left({t\left({\theta \mathsf{{Id}}- { \rm{A}}^\top { \rm{A}}}\right) {{\bf{z}}}+t { \rm{A}}^\top {{\bf{y}}}}\right), \end{equation} | (3.6) |
with the Lipschitz constant
\dfrac{t (\sigma^2_M- \sigma_m^2) }{t (\sigma^2_M+ \sigma_m^2) +2\alpha (1-t)} < 1, |
where \theta = \frac{\sigma_m^2+\sigma_M^2}{2} , and \sigma_m and \sigma_M are the smallest and the largest singular values of the matrix { {A}} , respectively.
For t = 0 , the solution is {{\bf{z}}}^0 = {\bf{0}} , which is consistent with (3.6). Furthermore, by (3.5) and (3.6), we get
\begin{equation} {{\bf{z}}}^t = 0 \qquad \text{ if }\quad 0\leq t\leq \frac{1}{1+2 \left\|{{ { \rm{A}}^\top {{\bf{y}}}}}\right\|_2}. \end{equation} | (3.7) |
Our definition of the solution at t = 1 in (3.3) also satisfies {{\bf{z}}} = {\cal T}_t({{\bf{z}}}) since \boldsymbol{{\cal S}}_{1-t} is the identity and thus {{\bf{z}}}^{t = 1}({{\bf{y}}}) = { \rm{A}}^\dagger {{\bf{y}}} , though {\cal T}_1 is not a contraction. In summary, the solutions are consistent with Lemma 3.1, as expected.
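In practice, the fixed point of Lemma 3.2 can be approximated by iterating T_t; the sketch below follows (3.5)–(3.6), while the stopping rule, iteration cap, and function names are illustrative choices.

```python
import numpy as np

def soft_threshold(u, tau):
    """Component-wise soft-thresholding S_tau of (3.5)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau / 2.0, 0.0)

def elastic_net_t(A, y, t, alpha=1.0, max_iter=5000, tol=1e-10):
    """Fixed-point iteration z <- T_t(z) of Lemma 3.2 for the reparametrized
    elastic net (3.2), with t in (0, 1)."""
    svals = np.linalg.svd(A, compute_uv=False)
    theta = (svals.max() ** 2 + svals.min() ** 2) / 2.0   # theta = (sigma_m^2 + sigma_M^2) / 2
    AtA, Aty = A.T @ A, A.T @ y
    z = np.zeros(A.shape[1])
    for _ in range(max_iter):
        z_new = soft_threshold(t * (theta * z - AtA @ z + Aty), 1.0 - t) \
                / (theta * t + (1.0 - t) * alpha)
        if np.linalg.norm(z_new - z) <= tol * max(1.0, np.linalg.norm(z)):
            return z_new
        z = z_new
    return z
```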
Closed form solution. In the case of orthogonal design [37], i.e., { \rm{A}}^\top { \rm{A}} = \mathsf{{Id}} , the solution of (3.1) is given by
\begin{equation} {{\bf{z}}}^\lambda = \frac{(\lvert { \rm{A}}^\top {{\bf{y}}}\lvert -\lambda/2)_+}{1+\alpha\lambda} \operatorname{sgn}( { \rm{A}}^\top {{\bf{y}}}), \end{equation} | (3.8) |
where the product is understood component-wise. Plugging \lambda = \frac{1-t}{t} into (3.8) we have
\begin{equation} {{\bf{z}}}^{t}( {{\bf{y}}}) = \frac{(t(1+2\lvert { \rm{A}}^\top {{\bf{y}}}\lvert)-1)_+}{2(t(1-\alpha)+\alpha)} \operatorname{sgn}\left({ { \rm{A}}^\top {{\bf{y}}}}\right). \end{equation} | (3.9) |
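For orthogonal designs the reparametrized solution (3.9) is available in closed form; a short sketch (the function name is illustrative):

```python
import numpy as np

def elastic_net_orthogonal(Aty, t, alpha=1.0):
    """Closed-form solution (3.9) for orthogonal design A^T A = Id; Aty = A^T y."""
    return np.maximum(t * (1.0 + 2.0 * np.abs(Aty)) - 1.0, 0.0) \
           / (2.0 * (t * (1.0 - alpha) + alpha)) * np.sign(Aty)
```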
To select the regularization parameters we go back to first principles and consider quadratic loss functionals.
Definition 3.3. Functions R, \widehat{R}\colon[0, 1]\rightarrow \mathbb{R} , defined by
\begin{equation} R(t) = \left\|{{ {{\bf{z}}}^t( {{\bf{y}}})- {{\bf{x}}}}}\right\|_2^2, \quad \widehat{R}(t) = \left\|{{ {{\bf{z}}}^t( {{\bf{y}}})-\widehat{ {{\bf{x}}}}}}\right\|_2^2, \end{equation} | (3.10) |
are called the true and the empirical quadratic loss, respectively. Furthermore, define
\begin{equation} t_{{opt}} = \mathop {{\rm{argmin}}}\limits_{t\in[0, 1]} R(t), \quad \widehat{t}_{{opt}} = \mathop {{\rm{argmin}}}\limits_{t\in[0, 1]} \widehat{R}(t). \end{equation} | (3.11) |
In view of Lemma 3.1 and the discussion in Section 3.1, the benefits of recasting \lambda to [0, 1] are clear: R and \widehat{R} are both continuous, defined on a bounded interval, and, hence, achieve a minimum. Thus, our aim is to minimize \widehat{R} while ensuring \left|{{t_{\text{opt}}-\widehat{t}_{\text{opt}}}}\right| is small.
Let us discuss some difficulties associated with elastic net minimization which need to be addressed. On one hand, a closed form solution of (3.2) is available only when { \rm{A}}^\top { \rm{A}} = \mathsf{{Id}} and is otherwise only approximated. Furthermore, as we will see below, loss functionals R and \widehat{R} are globally neither differentiable nor convex, but rather only piecewise. These two issues suggest that the analysis of their minimizers in full detail is challenging. Therefore, in the following we split the analysis into a simplified case for { \rm{A}}^\top { \rm{A}} = \mathsf{{Id}} where we can provide guarantees, and the general case where we provide an efficient algorithm. Furthermore, we need to amend the empirical loss function \widehat{R} in the case when { \rm{A}} is non-injective. This is due to the fact that in case of non-linear methods R(t) cannot be reliably estimated outside the kernel of { \rm{A}} , see [15] and Figure 1. We follow the idea of SURE-based methods [18], which provide an unbiased estimate of R(t) by projecting the regularized solution onto \operatorname{ker}^\perp({ \rm{A}}) . Namely, we define projected and modified loss functions \widehat{R}_p, \widehat{R}_m\colon[0, 1]\rightarrow \mathbb{R} by
\begin{equation} \widehat{R}_p(t) = \left\|{{ { \rm{P}} {{\bf{z}}}^t( {{\bf{y}}})-\widehat{ {{\bf{x}}}}}}\right\|^2_2, \text{ and } \widehat{R}_m(t) = \left\|{{ { \rm{A}} {{\bf{z}}}^t( {{\bf{y}}}) - \widehat{\Pi} {{\bf{y}}}}}\right\|_2^2, \end{equation} | (3.12) |
where { \rm{P}} = { \rm{A}}^\dagger { \rm{A}} is the orthogonal projection onto \operatorname{ker}^\perp({ \rm{A}}) . Define also
\begin{equation} \hat{t}_p = \mathop {{\rm{argmin}}}\limits_{t\in[0, 1]} \widehat{R}_p(t), \text{ and } \hat{t}_m = \mathop {{\rm{argmin}}}\limits_{t\in[0, 1]} \widehat{R}_m(t). \end{equation} | (3.13) |
Note that to define \widehat{R}_m(t) we used the fact that \widehat{ {{\bf{x}}}} = { \rm{A}}^\dagger \widehat{\Pi} {{\bf{y}}} , and thus compared to \widehat{R}_p, we avoid the computation of the Moore-Penrose inverse { \rm{A}}^\dagger , which might be either costly to compute or indeed numerically unstable if { \rm{A}} is poorly conditioned. As we will show in Section 6, and can see in the right-most panel in Figure 1, using \widehat{R}_p and \widehat{R}_m instead of \widehat{R} when { \rm{A}} is non-injective dramatically improves the performance. Note that projecting onto \operatorname{ker}^\perp({ \rm{A}}) makes the loss functional smoother (dampening the gradients).
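In code, the three empirical losses differ only in how the residual is measured; a sketch, assuming a solver for z^t(y) (e.g., the fixed-point iteration sketched above) and the empirical quantities of Section 2 (the function name is hypothetical):

```python
import numpy as np

def empirical_losses(z_t, x_hat, A, Pi_hat, y):
    """Empirical loss (3.10) and its projected / modified variants (3.12)
    for a candidate solution z_t = z^t(y)."""
    P = np.linalg.pinv(A) @ A                               # projection onto ker(A)^perp
    R_hat   = np.linalg.norm(z_t - x_hat) ** 2              # hat R(t) of (3.10)
    R_hat_p = np.linalg.norm(P @ z_t - x_hat) ** 2          # hat R_p(t) of (3.12)
    R_hat_m = np.linalg.norm(A @ z_t - Pi_hat @ y) ** 2     # hat R_m(t) of (3.12)
    return R_hat, R_hat_p, R_hat_m
```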
In the next section we analyze the minimizers t_{\text{opt}} and \widehat{t}_{\text{opt}} , and develop bounds on \left|{{t_{\text{opt}}-\widehat{t}_{\text{opt}}}}\right| in simplified settings. These results motivate us to design an algorithm, which we call OptEN, that minimizes the corresponding empirical loss function ( \widehat{R}(t) , \widehat{R}_p(t) , or \widehat{R}_m(t) ). This is discussed in Section 5.
Since the elastic net solution is in general not available in closed form, a rigorous study of the parameter error is challenging in full generality. Therefore, we restrict our attention to simplified cases, though we emphasize that our approach in practice performs well on significantly broader model assumptions, which we will show in Section 6. In case of orthogonal design { \rm{A}}^\top { \rm{A}} = \mathsf{{Id}} we can, without loss of generality, assume { \rm{A}} = \mathsf{{Id}} (otherwise redefine {{\bf{y}}} as { \rm{A}}^\top {{\bf{y}}} ). Let now {{\bf{y}}} = {{\bf{x}}}+\sigma {{\bf{w}}} and assume \lvert \mathsf{y}_1\lvert\geq\ldots\geq\lvert \mathsf{y}_m\lvert . Plugging (3.9) into (3.10) we get
R(t) = \sum\limits_{i = 1}^m \left(\frac{(t(1+2\lvert \mathsf{y}_i\lvert)-1)_+}{2(t(1-\alpha)+\alpha)} \operatorname{sgn}( \mathsf{y}_i) - \mathsf{x}_i\right)^2. |
Define \mathsf{b}_i = {{1+2\left|{{ \mathsf{y}_i}}\right|}} , for i = 1, \ldots, m . Loss function R(t) is continuous on [0, 1] , and differentiable on intervals
\begin{align*} {\cal I}_0 & = \Big[0, \mathsf{b}_1^{-1}\Big), \quad {\cal I}_{m} = \Big( \mathsf{b}_{m}^{-1}, 1\Big], \text{ and } {\cal I}_k = \left({ \mathsf{b}_k^{-1}, \mathsf{b}_{k+1}^{-1}}\right), \;{\rm{ for }}\; k = 1, \ldots, m-1. \end{align*} |
Considering one interval at a time a direct computation yields the minimizer of each R\lvert_{{ {\cal I}}_k} . Let k = 1, \ldots, m-1 and define a_i = \operatorname{sgn}(\mathsf{y}_i)(1+2\alpha\left|{{ \mathsf{y}_i}}\right|) , c_i = \operatorname{sgn}(\mathsf{y}_i)+2 \mathsf{x}_i(\alpha-1) + 2 \mathsf{y}_i , and d_i = \operatorname{sgn}(\mathsf{y}_i)+2\alpha \mathsf{x}_i , and consider the expression t_k = \frac{\sum_{i = 1}^k a_i d_i}{\sum_{i = 1}^k a_ic_i} . The minimizer t^{\star, k} of R\lvert_{{ {\cal I}}_k} is then given by
\begin{align} t^{\star, k} = \begin{cases} \mathsf{b}_k^{-1}, & \;{\rm{for }}\; t_k < \mathsf{b}_k^{-1} \\ t_k, & \;{\rm{for }}\; \mathsf{b}_{k}^{-1}\leq t_k\leq \mathsf{b}_{k+1}^{-1} \\ \mathsf{b}_{k+1}^{-1}, & \;{\rm{for }}\; t_k > \mathsf{b}_{k+1}^{-1} \end{cases}. \end{align} | (4.1) |
Therefore, t_k is the minimizer if it is inside the interval {\cal I}_k , and otherwise the minimizer is the corresponding edge of the interval. Further details regarding the derivation are in Section 7 in the Appendix. An analogous expression holds for k = m , whereas R(t) is constant on {\cal I}_0 , as argued in (3.7). Therefore, the minimizer of R(t) is t_{ \rm{opt}} = \mathop {{\rm{argmin}}}_{k = 0, \ldots, m} R(t^{\ast, k}) . The empirical loss function \widehat{R}(t) is also continuous on [0, 1] and is piecewise differentiable on the same set of intervals since they depend only on {{\bf{y}}} . Consequently, minimizers \widehat{t}^{\star, k} of \widehat{R}(t) are also of the form (4.1), where we only ought to replace \mathsf{x}_i by \widehat{ \mathsf{x}}_i .
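Since the interval endpoints and the candidate stationary points t_k are explicit, the minimizer can be found by a single scan over the intervals I_k. The sketch below follows (4.1) for A = Id; it assumes generic data (no ties, non-degenerate denominators), the function name is hypothetical, and passing the empirical estimator as x_ref yields the empirical minimizer instead.

```python
import numpy as np

def exact_t_opt(y, x_ref, alpha=1.0):
    """Minimizer of the quadratic loss for A = Id, obtained by scanning
    the intervals I_k on which the loss is differentiable, cf. (4.1)."""
    order = np.argsort(-np.abs(y))                 # sort so that |y_1| >= ... >= |y_m|
    y, x_ref = y[order], x_ref[order]
    m = y.size
    s, b = np.sign(y), 1.0 + 2.0 * np.abs(y)
    a = s * (1.0 + 2.0 * alpha * np.abs(y))
    c = s + 2.0 * x_ref * (alpha - 1.0) + 2.0 * y
    d = s + 2.0 * alpha * x_ref

    def loss(t):                                   # closed form (3.9) with A = Id
        z = np.maximum(t * b - 1.0, 0.0) / (2.0 * (t * (1.0 - alpha) + alpha)) * s
        return float(np.sum((z - x_ref) ** 2))

    endpoints = np.concatenate([1.0 / b, [1.0]])   # 1/b_1 <= ... <= 1/b_m <= 1
    best_t, best_val = 0.0, loss(0.0)              # on I_0 the solution is z^t = 0
    for k in range(1, m + 1):
        t_k = np.sum(a[:k] * d[:k]) / np.sum(a[:k] * c[:k])
        t_k = float(np.clip(t_k, endpoints[k - 1], endpoints[k]))
        if loss(t_k) < best_val:
            best_t, best_val = t_k, loss(t_k)
    return best_t
```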
Notice that unless further assumptions are made, minimizers t_ \rm{opt} and \widehat{t}_ \rm{opt} are not given explicitly: we still need to evaluate R(t) and \widehat{R}(t) at m+1 locations, and it is not clear that there are no local minima or that the minimizer is unique. We will now show that in case of bounded noise there is indeed only one minimum and that it concentrates near t = 1.0 for moderate noise levels. This analysis will also give a theoretical intuition that will drive our algorithm. Furthermore, we will show that in a simplified case of Bernoulli noise we get explicit bounds on the parameter error.
Consider now the case of bounded noise such that there is a gap between the noise and the signal. We show that there exists a unique minimizer and there are no local minima. For simplicity of computation, we let \alpha = 1 , though the results hold for all \alpha > 0 . Let {{\bf{y}}} = {{\bf{x}}}+\sigma {{\bf{w}}} and assume {{\bf{x}}} = \left({ \mathsf{x}_1, \ldots, \mathsf{x}_h, 0, \ldots, 0}\right)^\top where \left|{{ \mathsf{x}_i}}\right| > 2\sigma\left|{{ \mathsf{w}_j}}\right| for all i = 1, \ldots, h and j = 1, \ldots, m . Without loss of generality, we assume that {{\bf{y}}} is ordered so that
\begin{equation*} \left|{{ \mathsf{x}_i+\sigma \mathsf{w}_i}}\right| > \left|{{ \mathsf{x}_j+\sigma \mathsf{w}_j}}\right| \text{ for }1\leq i < j\leq h \text{ and } \left|{{ \mathsf{w}_i}}\right|\geq\left|{{ \mathsf{w}_j}}\right|\text{ for }1\leq i < j\leq m. \end{equation*} |
The loss functional R(t) = \left\|{{ {{\bf{z}}}^t - {{\bf{x}}}}}\right\|_2^2 is thus piecewise differentiable on intervals {\cal I}_k , where \mathsf{b}_i = 1+2\left|{{ \mathsf{x}_i+\sigma \mathsf{w}_i}}\right| for i = 1, \ldots, h , and \mathsf{b}_i = 1+2\sigma\left|{{ \mathsf{w}_i}}\right| for i = h+1, \ldots, m . Also, we have \mathsf{b}_i\geq \mathsf{b}_j for i\leq j . We will show that R(t) can be monotonically increasing only for t larger than some \vartheta_j > \mathsf{b}_{h+1}^{-1} . Thus, for all t smaller than \vartheta_j , R(t) is a monotonically decreasing function. Let thus t\in {\cal I}_j for j < h . The function R(t) is continuously differentiable in {\cal I}_j , so it is sufficient to determine where R'(t) is positive. A direct computation, as in Section 4 and in particular in the Appendix (Section 7), gives
\begin{equation} R'(t) \geq 0 \text{ if } t\geq \frac{\sum\nolimits_{i = 1}^j \mathsf{b}_i\left({1+2 \operatorname{sgn}( \mathsf{y}_i) \mathsf{x}_i }\right)}{\sum\nolimits_{i = 1}^j \mathsf{b}_i^2} = :\vartheta_j. \end{equation} | (4.2) |
It suffices to show \vartheta_j\geq \mathsf{b}_{j+1}^{-1}. Since \operatorname{sgn}(\mathsf{y}_i) = \operatorname{sgn}(\mathsf{x}_i) for i\leq j < h , we have
\begin{align} \mathsf{b}_{j+1}\left({1+2\left|{{ \mathsf{x}_i}}\right|}\right) - \mathsf{b}_i > 4\left|{{ \mathsf{x}_i}}\right| \left|{{ \mathsf{x}_{j+1}+\sigma \mathsf{w}_{j+1}}}\right| > 0. \end{align} | (4.3) |
Therefore, \mathsf{b}_i\left({1+2 \operatorname{sgn}(\mathsf{y}_i) \mathsf{x}_i }\right)\geq \mathsf{b}_{j+1}^{-1} \mathsf{b}_i^2 , and the claim follows. Extending the same analysis to t\in {\cal I}_h we ought to show \mathsf{b}_{h+1}\left({1+2\left|{{ \mathsf{x}_i}}\right|}\right) - \mathsf{b}_i > 0 for 1\leq i \leq h . Computing gives
\mathsf{b}_{h+1}\left({1+2\left|{{ \mathsf{x}_i}}\right|}\right) - \mathsf{b}_i > 2 \sigma\left|{{ \mathsf{w}_{h+1}}}\right|\left({1+2\left|{{ \mathsf{x}_i}}\right|}\right) > 0. |
Hence, t_{\text{opt}} > \mathsf{b}_{h+1}^{-1} , as desired. We will now show that R(t) admits only one minimizer. Assume there exists t^\star such that t^\star\in {\cal I}_{j^\star} for some j^\star > h and R'(t^\star) = 0 . This means
t^\star = \frac{\sum_{i = 1}^h \mathsf{b}_i\left({1+2 \operatorname{sgn}( \mathsf{y}_i) \mathsf{x}_i }\right) + \sum_{i = h+1}^{j^\star} \mathsf{b}_i }{\sum_{i = 1}^{j^\star} \mathsf{b}_i^2} = \vartheta_{j^\star}, \text{ and } \mathsf{b}_{j^\star}^{-1} < \vartheta_{j^\star} < \mathsf{b}_{j^\star+1}^{-1}. |
We proceed by induction showing that R(t) is increasing for all j > j^\star . For t\in {\cal I}_j with j > h , it follows
R'(t) \leq 0 \text{ if } t\leq \frac{\sum\nolimits_{i = 1}^h \mathsf{b}_i\left({1+2 \operatorname{sgn}( \mathsf{y}_i) \mathsf{x}_i }\right) + \sum\nolimits_{i = h+1}^{j} \mathsf{b}_i }{\sum\nolimits_{i = 1}^{j} \mathsf{b}_i^2} = :\vartheta_j. |
Let us show \vartheta_j < \mathsf{b}_j^{-1} for j = j^\star+1 . We have
\begin{align*} \vartheta_j & = \frac{\sum\nolimits_{i = 1}^h \mathsf{b}_i\left({1+2 \operatorname{sgn}( \mathsf{y}_i) \mathsf{x}_i }\right) + \sum\nolimits_{i = h+1}^{j} \mathsf{b}_i }{\sum\nolimits_{i = 1}^{j} \mathsf{b}_i^2} = \frac{\vartheta_{j^\star}{\sum\nolimits_{i = 1}^{j^\star} \mathsf{b}_i^2} + \mathsf{b}_j}{\sum\nolimits_{i = 1}^{j} \mathsf{b}_i^2}\leq \mathsf{b}_{j}^{-1}, \end{align*} |
where we used the fact \vartheta_{j^\star}\leq \mathsf{b}_j^{-1} = \mathsf{b}_{j^\star+1}^{-1} , and \mathsf{b}_j > \mathsf{b}_{j^\star} . The rest of the proof then follows by mathematical induction. Analogous computation yields the same type of a result for the empirical loss function \widehat{R} .
Lemma 4.1. Let {{\bf{y}}} = {{\bf{x}}}+\sigma {{\bf{w}}} , assume {{\bf{x}}} = \left({ \mathsf{x}_1, \ldots, \mathsf{x}_h, 0, \ldots, 0}\right)^\top where \left|{{ \mathsf{x}_i}}\right| > 2\sigma\left|{{ \mathsf{w}_j}}\right| for all i = 1, \ldots, h and j = 1, \ldots, m , and let \alpha > 0 . Loss function R(t) is then either monotonically decreasing on the entire interval [0, 1] , or it is decreasing until some interval {\cal I}_{j^\star} , for j^\star > h+1 where it achieves a (unique) minimum, and it is monotonically increasing on all the subsequent intervals. The same holds for \widehat{R} .
Lemma 4.1 states that R(t) and \widehat{R}(t) achieve a unique minimum in [0, 1] , and that they are monotonically decreasing before, and monotonically increasing after this minimum. Furthermore, the minimizer is bigger than (1+2\sigma\left|{{ \mathsf{w}_{h+1}}}\right|)^{-1} , which means that for moderate noise levels, it will be close to 1 . The issue is that minimizers \vartheta_{j^\star} and \widehat{\vartheta}_{\widehat{j}^\star} do not need to lie in the same interval, i.e., j^\star\neq\widehat{j}^\star , and thus they cannot be directly compared. Instead, we consider a simplified model that encodes the main features of the problem. In particular, let
\begin{align} {{\bf{y}}} = {{\bf{x}}} + \sigma {{\bf{w}}}, \;{\rm{ where }}\; \mathbb{P}\left({ \mathsf{w}_i = \pm 1}\right) = \frac{1}{2}, \end{align} | (4.4) |
and assume {{\bf{x}}} = (\mathsf{x}_1, \ldots, \mathsf{x}_h, 0, \ldots, 0)^\top, and \left|{{ \mathsf{x}_i}}\right|\geq 2\sigma, for i = 1, \ldots, h. As before, without loss of generality we can assume \lvert \mathsf{y}_1\lvert\geq\lvert \mathsf{y}_2\lvert\geq\ldots\geq\lvert \mathsf{y}_m\lvert . It then follows \mathsf{b}_i = 2\left|{{ \mathsf{x}_i\pm\sigma}}\right|+1, for 1\leq i \leq h , and \mathsf{b}_i = 2\sigma + 1 otherwise. Moreover, \mathsf{b}_j > \mathsf{b}_{h+1}, for all j = 1, \ldots, h . In the following, we will for the sake of simplicity consider the case \alpha = 1 . The details regarding the general case, \alpha \neq 1 , are in the Appendix. First, as in Section 4.1 we know that t_ \rm{opt}\geq \mathsf{b}^{-1}_{h+1}. We can now explicitly compute the minimizer of R(t) .
Lemma 4.2. Let {{\bf{x}}} satisfy (4.4) and assume {{\bf{x}}} = (\mathsf{x}_1, \ldots, \mathsf{x}_h, 0, \ldots, 0)^\top, and \left|{{ \mathsf{x}_i}}\right|\geq 2\sigma, for i = 1, \ldots, h. True loss functional R(t) is minimized for t_{ \rm{opt}} = \min\{t^\star, 1\}\in [\mathsf{b}_{h+1}^{-1}, 1] , where
\begin{equation} t^\star = \frac{\sum\nolimits_{i = 1}^h \mathsf{b}_i\left({1+2 \operatorname{sgn}\left({ \mathsf{y}_i}\right) \mathsf{x}_i}\right) + \mathsf{b}_{h+1}(m-h)}{\sum\nolimits_{i = 1}^h \mathsf{b}_i^2 + (m-h) \mathsf{b}_{h+1}^2}. \end{equation} | (4.5) |
Proof. Considering (4.2) for t\in \left({ \mathsf{b}_{h+1}^{-1}, 1}\right) we get
\begin{align} 2R'(t) & = \left({\sum\limits_{i = 1}^h \mathsf{b}_i^2 + (m-h) \mathsf{b}_{h+1}^2}\right)t -\sum\limits_{i = 1}^h \mathsf{b}_i\left({1+2 \operatorname{sgn}\left({ \mathsf{y}_i}\right) \mathsf{x}_i}\right) - \mathsf{b}_{h+1}(m-h). \end{align} | (4.6) |
The root of (4.6) is exactly (4.5). Arguing as in (4.3) we have t^\star > \frac{1}{ \mathsf{b}_{h+1}} . Restricting to [0, 1] the claim follows.
Remark 4.3. The minimizer given by Lemma 4.2 will be in [0, 1] provided \sum_{i = 1}^h \left|{{ \mathsf{y}_i}}\right|\leq (m-h)\sigma and h\leq m/2 .
For the empirical loss function it is in general not true that \widehat{ \mathsf{x}}_i = 0 for i > h , nor is \frac{ \mathsf{y}_i-\widehat{ \mathsf{x}}_i}{\sigma} a Bernoulli random variable. However, {{\bf{y}}} and \mathsf{b}_i 's remain the same, and an entirely analogous computation gives
\begin{equation} \widehat{t}^\star = t^\star + \frac{\sum\nolimits_{i = 1}^m \mathsf{b}_i \operatorname{sgn}\left({ \mathsf{y}_i}\right)( \mathsf{x}_i-\widehat{ \mathsf{x}}_i) }{\sum\nolimits_{ \mathsf{b}_i > 1/t^{\star}} \mathsf{b}_i^2}. \end{equation} | (4.7) |
We can now bound the approximation error for the optimal regularization parameter.
Theorem 4.4. Assume that the conditions in Lemma 4.2 hold, and that t_{ \rm{opt}} < 1 and \sigma\frac{h}{m} < 1 . Given u > 0 , with probability of at least 1-2\exp(-u) we have
\begin{equation} \left|{{t_{ {opt}} - \widehat{t}_{ {opt}}}}\right| \leq \frac{\lambda_1}{\lambda_h} \left({\sqrt{\frac{h+\sigma^2 m+u }{N}} + \frac{h+\sigma^2 m+u}{N}}\right) + \sigma\sqrt{\frac{h}{m}}, \end{equation} | (4.8) |
provided N\gtrsim \left({h+\sigma^2m +u}\right) . Assume now {{\bf{y}}} is bounded. With probability greater than 1 - 3 \exp\left({-u}\right) we then have
\begin{equation} \left|{{t_{ {opt}} - \widehat{t}_{ {opt}}}}\right| \leq \frac{\lambda_1}{\lambda_h} \left({\sqrt{\frac{\log(h+m)+u}{N}} + \frac{\log(h+m)+u}{N} }\right) + \sigma\sqrt{\frac{h}{m}} , \end{equation} | (4.9) |
provided N\gtrsim \left({\log{2m} + u}\right) .
Proof. By assumptions we have t_{ \rm{opt}} = t^\star by Lemma 4.2. We can now rewrite (4.5) and (4.7) as
t^\star = \frac{\left\|{{ {{\bf{b}}}}}\right\|_1 + 2\left < {{ ( \operatorname{sgn}{ {{\bf{y}}}}\cdot {{\bf{b}}})}}, {{ {{\bf{x}}}}}\right > } {\left\|{{ {{\bf{b}}}}}\right\|_2^2}, \quad \widehat t^\star = \frac{\left\|{{ {{\bf{b}}}}}\right\|_1 + 2\left < {{ ( \operatorname{sgn}{ {{\bf{y}}}}\cdot {{\bf{b}}})}}, {{\widehat {{\bf{x}}}}}\right > } {\left\|{{ {{\bf{b}}}}}\right\|_2^2}. |
Therefore, using (4.7) we have
\begin{align*} t^\star - \widehat{t}^\star = \frac{ \left < {{ {{\bf{v}}}}}, {{ {{\bf{x}}}-\widehat {{\bf{x}}}}}\right > }{\left\|{{ {{\bf{b}}}}}\right\|_2^2}, \end{align*} |
with {{\bf{v}}} defined by \mathsf{v}_j = 2 \operatorname{sgn}(\mathsf{y}_j) \mathsf{b}_j for j = 1, \ldots, h , and \mathsf{v}_j = \operatorname{sgn}(\mathsf{w}_j) for j > h . Thus, \left\|{{ {{\bf{v}}}}}\right\|_2\leq\left\|{{ {{\bf{b}}}}}\right\|_2 and we have
\begin{align*} \left|{{t^\star - \widehat{t}^\star}}\right| &\leq 2 \frac{\left\|{{ {{\bf{x}}}-\widehat {{\bf{x}}}}}\right\|_2} {\left\|{{ {{\bf{b}}}}}\right\|_2} \leq 2\left({\frac{\left\|{{ {{\bf{y}}}}}\right\|_2}{\left\|{{ {{\bf{b}}}}}\right\|_2}\left\|{{\Pi-\widehat\Pi}}\right\|_2 +\sigma\frac{\left\|{{\Pi {{\bf{w}}}}}\right\|_2} {\left\|{{ {{\bf{b}}}}}\right\|_2}}\right). \end{align*} |
Using \left\|{{ {{\bf{b}}}}}\right\|_2^2 = m + 4\left\|{{ {{\bf{y}}}}}\right\|_1 + 4\left\|{{ {{\bf{y}}}}}\right\|_2^2 and \left\|{{\Pi {{\bf{w}}}}}\right\|_2 = \sqrt{h} , and provided N\gtrsim \left({h+u+\sigma^2m }\right) , by Lemma 2.2 we have
\left|{{t^\star - \widehat{t}^\star}}\right| \leq C \frac{\lambda_1}{\lambda_h} \left({\sqrt{\frac{h+\sigma^2 m+u}{N}} + \frac{h+\sigma^2 m+u}{N}}\right) + \sigma\sqrt{\frac{h}{m}}, |
with probability greater than 1-2\exp(-u) , and C > 0 is a constant. Since \left|{{t^\star-\widehat{t}^\star}}\right|\geq \left|{{t^\star - \widehat{t}_{ \rm{opt}}}}\right| the claim follows. The proof for a bounded {{\bf{y}}} is entirely analogous.
Remark 4.5. Theorem 4.4 is valid not only for \alpha = 1 but for all \alpha . Namely, in Appendix 7 we show that for any \alpha > 0 the following holds
\left|{{t_{ {opt}} - \widehat{t}_{ {opt}}}}\right| \lesssim \frac{\lambda_1}{\lambda_h} \left({\sqrt{\frac{h+u+\sigma^2 m }{N}} + \frac{h+u+\sigma^2 m}{N}}\right) + \sigma\sqrt{\frac{h}{m}}. |
Driven by the insights of Section 4, we are ready to present an efficient heuristic algorithm for learning the Optimal regularization parameter for the Elastic Net (OptEN). The algorithm is based on the minimization of a given loss function ( \widehat{R} , \widehat{R}_m or \widehat{R}_p ). In Section 4 we showed, in a simplified yet instructive setting, that the optimal parameter tends to be in the vicinity of t = 1 , depending on the noise level and the signal-to-noise gap. This is supported by experimental evidence in more general situations, such as for non-injective { \rm{A}} , as we will see in Section 6. Moreover, the loss function is monotonically decreasing as we move away from t = 1 , until the minimizer is reached. These observations drive our algorithm, which assumes that the minimizer lies in a valley not too far from t = 1, see Figure 1. Therefore, we will perform a line search on the graph of a given loss function, starting from t = 1 .
Line search methods follow iterations t_{k+1} = t_k + s_k \mathsf{p}_k , where \mathsf{p}_k is the search direction and s_k the step size:
● Search direction. We select \mathsf{p}_k by estimating \widehat{R}'(t_k) with central differences, \widehat{R}'(t) \sim \Delta_\epsilon R(t) : = \frac{R(t+\epsilon)-R(t-\epsilon)}{2\epsilon} where \epsilon > 0 is small enough. For t = 1 we instead use \tilde\Delta_\epsilon R(1) : = \frac{R(1)-R(1-\epsilon)}{\epsilon} . Then set \mathsf{p}_k = - \Delta_\epsilon R(t).
● Step size. We estimate s_k with the backtracking line search (consult [2] for an overview of line search methods).
Our approach is presented in Algorithm 1, while an extensive numerical study is provided in the next section.
Algorithm 1 OptEN algorithm for approximating the optimal elastic net regularization parameter using backtracking line search |
Input: {{\bf{y}}}_1, \ldots, {{\bf{y}}}_N, {{\bf{y}}}\in \mathbb{R}^m, \, { \rm{A}}\in \mathbb{R}^{m\times d} ;
Compute \widehat{ {{\bf{x}}}} according to (2.14). Set a loss function \rightarrow \, \widehat{R} \;{\rm{ or }}\; \widehat{R}_p \;{\rm{ or }}\; \widehat{R}_m ; in the rest of the algorithm we refer to it as \tilde R ;
Set \epsilon > 0 , \texttt{tol} > 0 , \texttt{tol2} > 0 , 0 < \alpha < 1 , and c_1, \beta, \gamma > 0 ;
Set k\leftarrow0 , t_0 = 1 ;
Compute r_1=\tilde R(1) , \tilde{r}_1=\tilde R(1-\epsilon) , \mathsf{p}_0=-(r_1-\tilde r_1)/\epsilon , and r_2 = \varphi(\gamma_0) ;
repeat
  \tilde{t} = t_k + \alpha \mathsf{p}_k ; \varphi_0=r_1 , \varphi_0'=- \mathsf{p}_k^2 , \varphi_1=\tilde R(\tilde t) ;
  if \varphi_1-\varphi_0 < c_1 \varphi_0'\alpha then s_k=\alpha ; else s_k = -\frac{1}{2}\frac{\varphi_0'\alpha^2}{\varphi_1- \varphi_0 -\varphi_0'\alpha} ; end if
  if |s_k| < \texttt{tol2} or |s_{k-1}/s_k| > \gamma then s_k = s_{k-1} \cdot \beta ; end if
  Set t_{k+1}=t_k + s_k \mathsf{p}_k ;
  Compute r_1=\tilde R(t_{k+1}) and \mathsf{p}_{k+1} = -(\tilde R(t_{k+1}+\epsilon)-\tilde R(t_{k+1}-\epsilon))/(2\epsilon) ;
  k\leftarrow k+1 ;
until | \mathsf{p}_k| < \texttt{tol} or k \geq \texttt{max_iter} ;
Output: Approximate regularization parameter \hat{t}:=t_k .
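To make the search step concrete, the following Python sketch implements the same idea for a generic loss callable; the function name, default constants, and the simplified backtracking rule are illustrative choices and not the exact implementation used in our experiments.

```python
import numpy as np

def opten_line_search(R_hat, eps=1e-3, alpha0=0.1, c1=1e-4, beta=0.5,
                      tol=1e-6, max_iter=100):
    """Minimal sketch of the search in Algorithm 1 for a loss callable R_hat.

    Starts at t = 1, follows the negative finite-difference slope of R_hat,
    and shrinks the step until a sufficient-decrease (Armijo-type) condition holds.
    """
    t = 1.0
    # one-sided difference at the boundary t = 1; the direction is the negative slope
    p = -(R_hat(1.0) - R_hat(1.0 - eps)) / eps
    for _ in range(max_iter):
        if abs(p) < tol:
            break
        r0, s = R_hat(t), alpha0
        # backtracking: shrink s until R_hat decreases sufficiently along p
        while R_hat(t + s * p) > r0 - c1 * s * p * p and s > 1e-12:
            s *= beta
        t += s * p
        # central difference away from the boundary
        p = -(R_hat(t + eps) - R_hat(t - eps)) / (2 * eps)
    return t
```

For instance, `opten_line_search(lambda t: (t - 0.8) ** 2)` returns a value close to 0.8.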
We now study the performance of our approach and show its adaptivity to different scenarios by conducting experiments on synthetic and imaging data. In the first set of experiments we perform a thorough comparison of our method with state-of-the-art parameter selection rules, exploring their behavior with respect to the noise level and other problem parameters. The second set of experiments deals with image denoising, where we use wavelet-based thresholding with the elastic net. We consider two datasets: natural images and real-world brain MRI data. Note that we do not aim to compare our method with state-of-the-art denoising methods, but only with state-of-the-art methods for the selection of the regularization parameter for the elastic net. We start with a discussion of methods that can be used for the automatic detection of the sparsity level h and show that, when a sufficient number of training points is given, we can reliably estimate h .
In real applications the sparsity level of a vector is either not available or is only an approximate notion, i.e., the desired vector is sparse only when we threshold its entries. Such regimes require h to be estimated, which in our case means looking at the spectrum of the corresponding covariance matrix. This question belongs to the class of low-rank matrix recovery problems since what we are trying to recover is the geometry (i.e., projection onto the range) of the noiseless, lower rank matrix \Sigma({ \rm{A}} {{\bf{x}}}) , using only the covariance matrix of noisy observations \widehat\Sigma({{\bf{y}}}) , which is of full rank. Thus, estimating h boils down to thresholding singular values of the empirical covariance matrix according to some spectral criterion that exploits the underlying structure.
For a positive definite matrix with singular values \lambda_1\geq \ldots\geq \lambda_m > 0 commonly used spectral criteria are
(ⅰ) The spectral gap \mathop {{\rm{argmax}}}\nolimits_k \left|{{\lambda_{k} - \lambda_{k+1}}}\right| ;
(ⅱ) The relative gap \mathop {{\rm{argmax}}}\nolimits_{k}\left({1 - \frac{\lambda_{k+1}}{\lambda_{k}}}\right) ;
(ⅲ) The cumulative spectral energy \sum_{i = 1}^k \lambda_i/\sum_{i = 1}^m \lambda_i ;
(ⅳ) The relative cumulative spectral energy \sum_{i = 1}^k \lambda_i/\sum_{i = 1}^{k+1} \lambda_i .
For the latter two criteria one sets a threshold, say 0.95 , and selects \tilde h as the first k for which the corresponding spectral energy reaches that threshold.
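For illustration, all four criteria can be read off the eigenvalues of the empirical covariance matrix; the Python sketch below is a direct transcription, with the relative gap read as the largest relative drop between consecutive eigenvalues and 0.95 used as an example threshold.

```python
import numpy as np

def estimate_h(Sigma_hat, energy_threshold=0.95):
    """Apply the four spectral criteria (i)-(iv) to an empirical covariance matrix."""
    lam = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1]   # eigenvalues, descending
    lam = np.maximum(lam, 1e-12)                         # guard against exact zeros in the tail
    spectral_gap = int(np.argmax(lam[:-1] - lam[1:])) + 1            # (i)
    relative_gap = int(np.argmax(1.0 - lam[1:] / lam[:-1])) + 1      # (ii)
    cum = np.cumsum(lam) / lam.sum()
    cumulative = int(np.argmax(cum >= energy_threshold)) + 1         # (iii)
    rel_cum = np.cumsum(lam)[:-1] / np.cumsum(lam)[1:]
    relative_cum = int(np.argmax(rel_cum >= energy_threshold)) + 1   # (iv)
    return spectral_gap, relative_gap, cumulative, relative_cum
```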
We study the behavior of these four criteria on three different types of forward matrices { \rm{A}} : random Gaussian, random circulant Rademacher, and random Toeplitz Gaussian matrices. These matrices were chosen because they have different spectral behavior and commonly appear in inverse problems. In each case { \rm{A}} is a 100\times 100 real matrix, normalized so that \left\|{{ { \rm{A}}}}\right\|_2 = 1 , and we take N = 100 samples {{\bf{x}}}_i , sampled according to {{\bf{x}}}_i = \boldsymbol{\xi}_i + { {{\bf{v}}}_i} , where \xi_i\sim {\cal N}(0, 1) and \mathsf{v}_i = 4 \operatorname{sgn}(\xi_i) for 1\leq i\leq 20 , and \xi_i = \mathsf{v}_i = 0 otherwise; furthermore, {{\bf{w}}}_i\sim {\cal N}({\bf{0}}, { \rm{I}}_{100}) and \sigma = 0.3 . We compute \widehat\Sigma({{\bf{y}}}) = \frac{1}{N} \sum_{i = 1}^N {{\bf{y}}}_i\otimes {{\bf{y}}}_i for {{\bf{y}}}_i = { \rm{A}} {{\bf{x}}}_i+\sigma {{\bf{w}}}_i .
In Figure 2, we show the application of the aforementioned spectral criteria to \widehat{\Sigma}({{\bf{y}}}) . It is clear from the results that all four methods would fail if used without taking further information into account. For example, the spectral gap criterion (in the upper right panel) would dictate the selection of h = 3 , but a more careful look at the plot suggests that the behavior of the spectral gap changes dramatically around h = 20 , which corresponds to the true h . Such ad hoc solutions are sensible and often improve the performance but can be hard to quantify, especially on real data.
The relative gap criterion (ⅱ), 1-{\lambda_{k+1}}/{\lambda_k} , is perhaps the most promising, but it is also subject to demands on N , as shown in the bottom row of Figure 2. Namely, if N is not large enough then 1-{\lambda_{k+1}}/{\lambda_k} has a heavy tail on the spectrum of \widehat{\Sigma}({{\bf{y}}}) and would thus suggest a large h , as in the bottom left panel of the figure. Instead, in Section 6.2 we look for the relative gap within the first m/2 singular vectors. We note that for the other three spectral criteria the situation does not change as the number of samples increases. For the relative gap criterion, on the other hand, it does: the heavy tails flatten back to zero for all three choices of random matrices, see the plot in the bottom right corner of Figure 2.
In Section 6.3 we will apply our algorithm to wavelet denoising, where there is no natural choice of h since wavelet coefficients of images are not truly sparse. We will instead consider two scenarios: when we are given an oracle h (i.e., the h giving the highest PSNR), and when h has to be estimated from the data.
Experimental setting. We consider an inverse problem of the type {{\bf{y}}} = { \rm{A}} {{\bf{x}}} + \sigma {{\bf{w}}} , with data generated according to {{\bf{x}}} = \boldsymbol{\xi} + { {{\bf{v}}}} , where
(\textsf{D}1) { \rm{A}}\in \mathbb{R}^{m\times d} is a random Gaussian matrix such that \left\|{{ { \rm{A}}}}\right\|_2 = 1 ,
(\textsf{D}2) \xi_i\sim {\cal N}(0, 1) , and \mathsf{v}_i = 4 \operatorname{sgn}(\xi_i) for 1\leq i \leq h ; \xi_i = \mathsf{v}_i = 0 otherwise,
(\textsf{D}3) {{\bf{w}}}\sim {\cal N}({\bf{0}}, \mathsf{{Id}}_m) .
The rationale behind the distributional choices in (\textsf{D}2) is twofold. First, having {{\bf{v}}} be a non-constant vector ensures that {\cal V} = \operatorname{range}(\Sigma({{\bf{x}}})) is truly and fully h -dimensional (i.e., if {{\bf{v}}} were constant there would be one dominant singular vector). Second, forcing \left|{{ \mathsf{v}_i}}\right| = 4 for i = 1, \ldots, h ensures that the data are not concentrated around the origin (which happens, e.g., if \mathsf{v}_i = 0 for all i in (\textsf{D}2)) and that there is a gap between the original signal and the noise. The gap can be measured by \operatorname{SparseSNR}: = \frac{\max_{1\leq i\leq m} \left|{{\sigma \mathsf{w}_i}}\right|}{\min_{1\leq i\leq h} \left|{{ \mathsf{x}_i}}\right|} . We add that the conclusions and the results of this section, and of Section 6.1, stay the same in the case of more usual distributional assumptions, i.e., for {{\bf{x}}}\sim {\cal N}({\bf{0}}, \mathsf{{Id}}_h) where \mathsf{{Id}}_h is a diagonal matrix with exactly h entries set to 1 and the rest to 0 , and noise levels as in Section 6.3.
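A data set satisfying (\textsf{D}1)–(\textsf{D}3) can be generated as in the following sketch; the dimensions, sparsity level, and noise level are example values.

```python
import numpy as np

def generate_data(m=500, d=100, h=10, sigma=0.3, N=50, rng=None):
    """Sample (x_i, y_i) pairs following (D1)-(D3); sizes are example values."""
    rng = np.random.default_rng(rng)
    # (D1): Gaussian forward matrix, normalized so that ||A||_2 = 1
    A = rng.standard_normal((m, d))
    A /= np.linalg.norm(A, 2)
    # (D2): x = xi + v with xi ~ N(0,1) and v = 4 sgn(xi) on the first h coordinates
    xi = np.zeros((N, d))
    xi[:, :h] = rng.standard_normal((N, h))
    X = xi + 4.0 * np.sign(xi)
    # (D3): isotropic Gaussian noise on the measurements
    W = rng.standard_normal((N, m))
    Y = X @ A.T + sigma * W
    return A, X, Y
```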
Comparison. We compare our algorithm with the following parameter selection methods: the discrepancy principle [25], monotone error rule [29], quasi optimality [30], L-curve [20], (Monte-Carlo) balancing principle [23] and its elastic net counterpart [9], (Monte-Carlo) generalized cross-validation [18] and nonlinear cross validation [17]. In the remainder of this paper we refer to other methods by their acronyms, and to our method as OptEN. The first five methods are commonly used in inverse problems (a detailed account and an experimental study can be found in [3]), whereas Monte-Carlo and nonlinear cross-validation are adaptations of generalized cross-validation for non-linear regularization methods.
Before presenting the results, we provide a concise description of considered methods. Most of the methods require some additional information about the problem, predominantly the noise level \sigma , to be either known or estimated, which affects their performance. We provide the true noise level whenever a given method requires it and furthermore, we perform judicious testing and tuning of all other quantities, taking into account recommendations from relevant literature. We consider a regularization parameter sequence t_n = \frac{1}{1+\nu_0 q^n} , where n\in\{0, 1, \ldots, N_{\max}\} , and \nu_0 > 0 , q > 0 and N_{\max}\in \mathbb{N} are preselected*. For each n we denote the corresponding elastic net solution as {{\bf{z}}}_n: = {{\bf{z}}}^{t_n} .
*This is an adaptation of the parameter sequence from [3] that reflects our reparametrization from \lambda to t , as in (3.2)
Discrepancy Principle [DP]
The discrepancy principle is one of the oldest parameter choice rules; it selects a solution so that the norm of the residual is at the noise level. Thus, the regularization parameter is chosen as the first n\in \mathbb{N} such that
\begin{equation} \left\|{{ { \rm{A}} {{\bf{z}}}_n - {{\bf{y}}}}}\right\|_2 \leq \tau \sigma \sqrt{m}, \end{equation} | (6.1) |
where we fix \tau = 1 .
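Given a precomputed path of elastic net solutions {{\bf{z}}}_n (stored as rows of a matrix), the discrepancy principle reduces to a first-crossing search, as in the sketch below.

```python
import numpy as np

def discrepancy_principle(A, y, Z, sigma, tau=1.0):
    """Return the index of the first solution on the path whose residual
    norm drops to the noise level (6.1); Z stores one solution z_n per row."""
    level = tau * sigma * np.sqrt(A.shape[0])
    residuals = np.linalg.norm(Z @ A.T - y, axis=1)
    hits = np.flatnonzero(residuals <= level)
    return int(hits[0]) if hits.size else len(Z) - 1
```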
Monotone Error Rule [ME]
This rule is based on the observation that the monotone decrease of the error \left\|{{ {{\bf{z}}}_n - {{\bf{x}}}}}\right\|_2 can only be guaranteed for large values of the regularization parameter. Therefore, the best parameter t_{n^*} is chosen as the first t -value for which one can ensure that the error is monotonically decreasing, that is, as the smallest n such that
\begin{equation} \frac{\left < {{ { \rm{A}} {{\bf{z}}}_n- {{\bf{y}}}}}, {{ { \rm{A}}^{-\top}\left({ {{\bf{z}}}_n - {{\bf{z}}}_{n+1}}\right)}}\right > }{\| { \rm{A}}^{-\top}\left({ {{\bf{z}}}_n - {{\bf{z}}}_{n+1}}\right) \|_2} \leq \tau \sigma \sqrt{m}. \end{equation} | (6.2) |
We fix \tau = 1 for our experiments. The left hand side of (6.2) is replaced with (6.1) whenever the denominator is 0 .
Quasi-Optimality Criterion [QO]
Quasi-optimality is a parameter rule that does not need the noise level, and thus has enjoyed reasonable success in practice, especially for Tikhonov regularization and truncated singular value decomposition. The regularization parameter is chosen according to
\begin{equation} n_\star = \mathop {{\rm{argmin}}}\limits_{n\leq N_{\max}} \left\|{{ {{\bf{z}}}_n- {{\bf{z}}}_{n+1}}}\right\|_2. \end{equation} | (6.3) |
L-curve method [LC]
The criterion is based on the fact that the \log-\log plot of (\left\|{{ { \rm{A}} {{\bf{z}}}_n- {{\bf{y}}}}}\right\|_2, \left\|{{ {{\bf{z}}}_n}}\right\|_2) often has a distinct L-shape. As the points on the vertical part correspond to under-smoothed solutions, and those on the horizontal part correspond to over-smoothed solutions, the optimal parameter is chosen at the elbow of that L-curve. There exist several versions of the method; here we use the following criterion
\begin{equation} n_\ast = \mathop {{\rm{argmin}}}\limits_{n\leq N_{\max}} \{\left\|{{ { \rm{A}} {{\bf{z}}}_n- {{\bf{y}}}}}\right\|_2\left\|{{ {{\bf{z}}}_n}}\right\|_2\}. \end{equation} | (6.4) |
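Both the quasi-optimality rule (6.3) and the L-curve rule (6.4) are simple argmin searches over the same solution path; a compact sketch of the two selections:

```python
import numpy as np

def quasi_optimality(Z):
    """(6.3): index minimizing the distance between consecutive solutions."""
    diffs = np.linalg.norm(Z[1:] - Z[:-1], axis=1)
    return int(np.argmin(diffs))

def l_curve(A, y, Z):
    """(6.4): index minimizing the product of residual and solution norms."""
    residuals = np.linalg.norm(Z @ A.T - y, axis=1)
    norms = np.linalg.norm(Z, axis=1)
    return int(np.argmin(residuals * norms))
```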
(Monte-Carlo) Balancing Principle [BP]
The principle aims to balance two error contributions, approximation and sampling errors, which have an opposite behavior with respect to the tuning parameter. More precisely, we select the parameter by
n^* = \mathop {{\rm{argmin}}}\limits_n \{ t_n | \left\|{{ {{\bf{z}}}_n- {{\bf{z}}}_k}}\right\|_2 \leq 4 \kappa \sigma \rho(k), k = n, \ldots, N_{\max} \}, |
where \kappa > 0 is a tuning parameter. More computationally friendly, yet equally accurate, versions of the balancing principle are also available [3]. As our main focus is on the accuracy of the parameter choice, we use the original and more computationally heavy version of the balancing principle.
The value of \sigma \rho(k) is in general unknown but it can be estimated in case of white noise. Following [3], we calculate \rho(k)^2\approx \operatorname{mean} \{ \left\|{{ {\rm A}_n^{-1} \boldsymbol{\xi}_i}}\right\|_2^2\}, where \boldsymbol{\xi}_i\sim {\cal N}({\bf{0}}, \mathsf{{Id}}_m) , 1\leq i \leq L (we use L = 4 ), and { \rm{A}}_n^{-1} is the map that assigns {{\bf{y}}} to {{\bf{z}}}_n.
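A direct, deliberately unoptimized transcription of this selection rule, assuming the solution path and the estimates \rho(k) have been precomputed, could look as follows.

```python
import numpy as np

def balancing_principle(Z, rho, sigma, kappa=0.25):
    """Return the smallest path index n with ||z_n - z_k|| <= 4*kappa*sigma*rho(k)
    for all k >= n, as in the Monte-Carlo balancing principle described above."""
    N_max = len(Z) - 1
    for n in range(N_max + 1):
        dists = np.linalg.norm(Z[n:] - Z[n], axis=1)
        if np.all(dists <= 4.0 * kappa * sigma * rho[n:]):
            return n
    return N_max
```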
Elastic Nets Balancing Principle [ENBP]
In [9] the authors propose a reformulation of the balancing principle for elastic net,
n^* = \mathop {{\rm{argmin}}}\limits_n \{ t_n | \left\|{{ {{\bf{z}}}_k- {{\bf{z}}}_{k+1}}}\right\|_2 \leq \frac{4 C}{\sqrt{d}\alpha \nu_0 q^{k+1} }, k = N_{\max} -1 , \ldots, n \}. |
The method stops the first time two consecutive solutions are sufficiently far apart. The constant C needs to be selected, and in our experience this task requires a delicate touch.
(Monte-Carlo) Generalized Cross-Validation [GCV]
The rule stems from the ordinary cross-validation, which considers all the leave-one-out regularized solutions and chooses the parameter that minimizes the average of the squared prediction errors. Specifically, GCV selects n according to
\begin{equation} n_\ast = \mathop {{\rm{argmin}}}\limits_{n\leq N_{\max}} \frac{m^{-1}\left\|{{ { \rm{A}} {{\bf{z}}}_n - {{\bf{y}}}}}\right\|_2^2}{\left({m^{-1} \operatorname{tr}( \mathsf{{Id}}_m - { \rm{A}} { \rm{A}}_n^{-1})}\right)^2}, \end{equation} | (6.5) |
where { \rm{A}}_n^{-1} is the map such that {{\bf{z}}}_n = { \rm{A}}_n^{-1} ({{\bf{y}}}) . In the case of the elastic net, the map { \rm{A}}_n^{-1} is not linear and, thus, we cannot assign a meaning to its trace. Instead, we follow the ideas of [18] and estimate the trace stochastically using only one data sample.
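The sketch below evaluates a score of the form (6.5) with a single-probe divergence estimate in place of the trace; the callable \texttt{solver} stands for the elastic net map {{\bf{y}}}\mapsto {{\bf{z}}}_n at a fixed path index and is an assumption of this snippet, and the probe-based estimator is one standard choice rather than necessarily the exact scheme of [18].

```python
import numpy as np

def mc_gcv_score(A, y, solver, eps=1e-3, rng=None):
    """Monte-Carlo GCV score in the spirit of (6.5) for a nonlinear solver.

    The trace of Id - A A_n^{-1} is replaced by a one-probe divergence estimate,
    tr(dF/dy) ~= <w, (F(y + eps*w) - F(y)) / eps> with w a Gaussian probe.
    """
    m = A.shape[0]
    rng = np.random.default_rng(rng)
    w = rng.standard_normal(m)
    Fy = A @ solver(y)
    div = w @ ((A @ solver(y + eps * w) - Fy) / eps)   # ~ tr(A A_n^{-1})
    residual = np.linalg.norm(Fy - y) ** 2 / m
    return residual / ((1.0 - div / m) ** 2)
```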
Nonlinear Generalized Cross-Validation [NGCV]
In [17] the authors reconfigure GCV for non-linear shrinkage methods, and n is selected according to
n_\ast = \mathop {{\rm{argmin}}}\limits_{n\leq N_{\max}} \frac{m^{-1}\left\|{{ { \rm{A}} {{\bf{z}}}_n - {{\bf{y}}}}}\right\|_2^2}{\left({1- ds/m}\right)^2}, |
where s = \frac{\left\|{{ {{\bf{z}}}_n}}\right\|_\gamma}{\left\|{{ {{\bf{z}}}^\dagger}}\right\|_\gamma} with \left\|{{\cdot}}\right\|_\gamma : = \left\|{{\cdot}}\right\|_1+\alpha\left\|{{\cdot}}\right\|^2_2.
Comparison/Error. For each method we compute: the (normalized) error in approximating the optimal regularization parameter, the (normalized) reconstruction error, the false discovery proportion (FDP), the true positive proportion (TPP), and the computational time. TPP and FDP are measures that quantify the notions of true and false discovery of relevant features in sparsity-based regression tasks [28]. FDP is the ratio between false discoveries and the total number of discoveries,
\operatorname{FDP}(t) = \frac{\#\left[{j:\, \mathsf{z}^t_j \neq 0 \;{\rm{ and }}\; \mathsf{x}_j = 0}\right]}{\max\left({\#\left[{j:\, \mathsf{z}^t_j \neq 0}\right], 1}\right)}. |
TPP on the other hand is the ratio between true (i.e., correct) discoveries in the reconstruction and true discoveries in the original signal,
\operatorname{TPP}(t) = \frac{\#\left[{j:\, \mathsf{z}^t_j \neq 0 \;{\rm{ and }}\; \mathsf{x}_j \neq 0}\right]}{h} . |
Thus, to recover the structure of the original sparse data we want FDP close to 0 and TPP close to 1 . It is known that there is often an explicit (and sometimes even quantifiable) trade-off between FDP and TPP, in the sense that support overestimation is an (undesirable) side-effect of full support recovery. In other words, a consequence of true support discovery is often a non-trivial false support discovery [28]. When computing FDP and TPP we do not require an entry to be exactly zero; instead, we threshold its magnitude (with 0.5 as the threshold).
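With the 0.5 magnitude threshold mentioned above, the two proportions can be computed as follows.

```python
import numpy as np

def fdp_tpp(z, x, h, threshold=0.5):
    """FDP and TPP of a reconstruction z against the true signal x,
    counting an entry as 'discovered' if its magnitude exceeds the threshold."""
    discovered = np.abs(z) > threshold
    true_support = np.abs(x) > 0
    false_disc = np.count_nonzero(discovered & ~true_support)
    true_disc = np.count_nonzero(discovered & true_support)
    fdp = false_disc / max(np.count_nonzero(discovered), 1)
    tpp = true_disc / h
    return fdp, tpp
```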
Testing setup. To compute the true optimal parameter t_{opt}, we run a dense grid search on [0, 1] using the true expected loss \left\|{{ {{\bf{z}}}^t- {{\bf{x}}}}}\right\|_{2}^2 . As suggested in [3] we use \tau = 1 and provide the true noise level \sigma for discrepancy principle and monotone error rule; balancing principle uses \kappa = 1/4 and true \sigma ; elastic net balancing principle uses C = 1/2500 . The parameter grid for DP, ME, BP, QO, LC, BP, GCV, and NGCV is defined by \nu_0 = 1 , q = 0.95 , and N_{\max} = 100 (thus, t_0 = 0.5 and t_{N_{\max}} = 0.9941143171) , whereas for ENBP, we use t_0 = 0.05 , q = 1.05 , and N_{\max} = 100 . The tests are conducted for m\in\{500,900\}, \, d\in\{100,200\} and h\in\{10, 20, 30\} , where all combinations of \alpha\in\{10^{-5}, 10^{-3}, 10^{-2}, 10^{-1}\} and \sigma\in\{0.05, 0.1, 0.2, 0.3\} are considered. To compute the empirical estimator \widehat{ {{\bf{x}}}} , we generate N = 50 independent random samples of the training data ( {{\bf{x}}}, {{\bf{y}}} ).
Results in Table 1 are averaged over 100 independent runs for \alpha = 10^{-3} , m = 500 , d = 100 , h = 10 , where {\cal V} = \operatorname{span}\left\{{{ {{\bf{e}}}_1, \ldots, {{\bf{e}}}_h}}\right\} , and \sigma = 0.3 , which corresponds to \operatorname{SparseSNR}\approx 0.17 . The first row in the table, x_{t_{opt}} , describes the elastic net minimizer for which the true optimal regularization parameter is provided.
Method | \frac{|t_{opt} - \widehat{t}|}{t_{opt}} | \frac{\| {{\bf{x}}} - {{\bf{z}}}^{\widehat{t}}\|_2}{\| {{\bf{x}}}\|_2} | \text{FDP}(\widehat{t}) | \text{TPP}(\widehat{t}) | \genfrac{}{}{0pt}{}{ \rm{computational}}{ \rm{time [s]}} |
x_{t_{opt}} | 0 | 0.0984 | 0.1583 | 1.000 | 0 |
\widehat {{\bf{x}}} | N/A | 0.1004 | 0.000 | 1.000 | N/A |
OptEN | 0.0254 | 0.0994 | 0.160 | 1.000 | 3.716 |
DP | 0.1196 | 0.1118 | 0.087 | 1.000 | 0.7421 |
ME | 0.2493 | 0.1376 | 0.010 | 1.000 | 0.1757 |
QO | 0.4543 | 0.1843 | 0.716 | 1.000 | 7.193 |
LC | 0.1414 | 0.1174 | 0.188 | 1.000 | 1.564 |
BP | 0.0877 | 0.1057 | 0.095 | 1.000 | 36.77 |
ENBP | 0.2728 | 0.1450 | 0.007 | 1.000 | 4.355 |
GCV | 0.4548 | 0.1844 | 0.716 | 1.000 | 15.91 |
NGCV | 0.3597 | 0.1577 | 0.638 | 1.000 | 7.306 |
Discussion. OptEN consistently returns the value closest to the optimal regularization parameter, and its results are in general comparable to those provided by the minimizer with the optimal parameter. However, one can observe that other methods, e.g., the discrepancy principle, provide a better balance between FDP and TPP (returning solutions that are more sparse), though at the cost of a larger approximation error. The balancing principle also provides very good results, but it is slow unless an effort is made to improve its computational time. Moreover, we observed that the performance of all methods that require the noise level \sigma to be known deteriorates if we do not provide the exact value of the noise level, but only a rough estimate.
The overall results are mostly consistent across all experimental scenarios we looked at, with a couple of exceptions. As expected, FDP and estimation errors deteriorate not only for larger \sigma but also for larger \alpha , though the ranking of the methods and the patterns of behavior remain the same. This is due to the fact that as \alpha increases the elastic net penalty sacrifices sparsity for smoothness. The empirical estimator \widehat {{\bf{x}}} is a very accurate estimator of the original signal, and it sometimes outperforms even the elastic net solution that uses the optimal parameter. However, as has been observed in [10] for Tikhonov regularization and confirmed in Figure 3, the performance of the empirical estimator worsens in the small noise regime.
Comparison with empirical estimator: effects of \sigma and N_{\text{train}}
We study the behavior of the relative estimation error with respect to \sigma and the number of training samples. We compare OptEN with the empirical estimator \widehat {{\bf{x}}} , DP, NGCV, and BP. We use m = 500, \, d = 100, \, h = 10 and (\textsf{D}1)-(\textsf{D}3), with \sigma ranging from 0.1 to 0.5 in the first experiment, whereas we vary the number of training samples N from 20 to 60 in the second experiment, as depicted in Figure 3. Our method again outperforms the other considered parameter selection rules. However, the empirical estimator performs slightly better than OptEN for larger noise levels (it is also better than the optimal elastic net solution), and it performs worse for lower noise levels. This is essentially due to the fact that \widehat{ {{\bf{x}}}} is never truly sparse, but has a lower (thresholded) FDP. Namely, being a projection onto an h -dimensional space, \widehat{ {{\bf{x}}}} has non-zero entries that are very small, whereas the non-zero entries obtained with the elastic net are larger, with their size depending on the noise level. The parameter \alpha plays a similar role; for small \alpha OptEN beats \widehat {{\bf{x}}} , and for larger \alpha the situation is reversed.
Non-injective matrices
We now conduct experiments with non-injective matrices. The setting is as in Table 1, where now { \rm{A}}\in \mathbb{R}^{500\times 100} with \operatorname{rank}\left({ { \rm{A}}}\right) = 40 . As mentioned in Section 3.2, we test our method by minimizing the projected loss functional \widehat{R}_p and the modified error functional \widehat{R}_m . The results can be found in Table 2. Our method (using both the projected and the modified functionals) again outperforms standard parameter selection rules in terms of the accuracy of the selected parameter, and loses out to some methods when it comes to FDP and TPP. We also observe that the performance of the empirical estimator deteriorates, and \widehat {{\bf{x}}} should indeed not be used as the solution itself; some additional regularization is required.
Method | \frac{|t_{opt} - \widehat{t}|}{t_{opt}} | \frac{\| {{\bf{x}}} - {{\bf{z}}}^{\widehat{t}}\|_2}{\| {{\bf{x}}}\|_2} | FDP(\widehat{t}) | TPP(\widehat{t}) | \genfrac{}{}{0pt}{}{ \rm{computational}}{ \rm{time [s]}} |
using t_{opt} | 0 | 0.5704 | 0.4918 | 0.9450 | 0 |
empirical estimator \widehat {{\bf{x}}} | N/A | 0.7978 | 0.8047 | 1.000 | N/A |
projected OptEN | 0.0718 | 0.6033 | 0.507 | 0.930 | 8.16 |
modified OptEN | 0.0763 | 0.6046 | 0.497 | 0.926 | 15.57 |
DP | 0.1316 | 0.6343 | 0.528 | 0.926 | 1.93 |
ME | 0.3203 | 0.7234 | 0.290 | 0.765 | 0.15 |
QO | 0.3167 | 0.7857 | 0.819 | 0.997 | 7.04 |
LC | 0.3389 | 0.7426 | 0.304 | 0.749 | 7.00 |
GCV | 0.3172 | 0.7865 | 0.819 | 0.997 | 14.04 |
NGCV | 0.3172 | 0.7865 | 0.819 | 0.997 | 7.06 |
BP | 0.2636 | 0.7179 | 0.757 | 0.974 | 35.01 |
ENBP | 0.2133 | 0.6549 | 0.362 | 0.847 | 3.69 |
The task of image denoising is to find an estimate \boldsymbol Z of an unknown image \boldsymbol{X} from a noisy measurement \boldsymbol{Y} , where \boldsymbol{Y} = \boldsymbol X + \sigma{\bf{\Xi}} , and {\bf{\Xi}} denotes isotropic white noise. The goal is to improve the image quality by removing noise while preserving important image features such as edges and homogeneous regions.
There are a large number of methods addressing image denoising, starting from 'classical' wavelet thresholding [12,13] and non-linear filters, to stochastic and variational methods [7,8]. Since the primary goal of this paper is to evaluate how the proposed approach performs as a parameter selection method for the elastic net, here we only compare our method with other state-of-the-art parameter selection methods for the elastic net, and do not compare the elastic net with image denoising methods in general. In particular, we compare OptEN with the discrepancy principle and the balancing principle (i.e., the top performers from previous experiments). In all cases the results show that OptEN has superior performance and selects nearly optimal parameters, see Table 3.
PSNR | SSIM | ||||||||||
\genfrac{}{}{0pt}{}{ \rm{using}}{t_{ \rm{opt}}} | noisy | OptEN | DP | BP | \genfrac{}{}{0pt}{}{ \rm{using}}{t_{ \rm{opt}}} | noisy | OptEN | DP | BP | ||
\textsf{space} \textsf{shuttle} | |||||||||||
\sigma=0.05 | 32.33 | 26.27 | 32.32 | 30.26 | 30.63 | 0.929 | 0.687 | 0.929 | 0.850 | 0.862 | |
\sigma=0.075 | 30.26 | 22.87 | 30.25 | 27.32 | 27.69 | 0.908 | 0.516 | 0.908 | 0.748 | 0.765 | |
\sigma=0.1 | 28.77 | 20.50 | 28.76 | 25.32 | 25.60 | 0.892 | 0.394 | 0.891 | 0.660 | 0.676 | |
\textsf{cherries} | |||||||||||
\sigma=0.05 | 35.80 | 26.08 | 35.79 | 30.52 | 31.08 | 0.973 | 0.647 | 0.972 | 0.837 | 0.855 | |
\sigma=0.075 | 33.59 | 22.63 | 33.59 | 27.40 | 27.77 | 0.964 | 0.463 | 0.964 | 0.720 | 0.737 | |
\sigma=0.1 | 31.94 | 20.24 | 31.82 | 25.26 | 25.54 | 0.958 | 0.339 | 0.942 | 0.617 | 0.633 | |
\textsf{cat} | |||||||||||
\sigma=0.05 | 29.34 | 26.02 | 29.33 | 28.72 | 28.93 | 0.890 | 0.767 | 0.890 | 0.861 | 0.868 | |
\sigma=0.075 | 27.06 | 22.51 | 27.03 | 25.85 | 26.04 | 0.825 | 0.612 | 0.823 | 0.758 | 0.767 | |
\sigma=0.1 | 25.69 | 20.04 | 25.24 | 23.88 | 24.03 | 0.771 | 0.486 | 0.741 | 0.664 | 0.671 | |
\textsf{mud flow} | |||||||||||
\sigma=0.05 | 28.38 | 26.02 | 28.37 | 28.20 | 28.30 | 0.870 | 0.777 | 0.869 | 0.856 | 0.860 | |
\sigma=0.075 | 26.07 | 22.50 | 26.06 | 25.46 | 25.60 | 0.795 | 0.629 | 0.795 | 0.755 | 0.762 | |
\sigma=0.1 | 24.71 | 20.01 | 24.46 | 23.58 | 23.70 | 0.735 | 0.506 | 0.717 | 0.662 | 0.668 | |
\textsf{IHC} | |||||||||||
\sigma=0.05 | 29.33 | 26.02 | 29.33 | 28.73 | 28.92 | 0.890 | 0.767 | 0.889 | 0.861 | 0.868 | |
\sigma=0.075 | 27.06 | 22.51 | 27.04 | 25.86 | 26.04 | 0.825 | 0.612 | 0.823 | 0.759 | 0.767 | |
\sigma=0.1 | 25.70 | 20.04 | 25.39 | 23.85 | 24.04 | 0.772 | 0.486 | 0.748 | 0.662 | 0.671 |
Wavelet-based denoising. We denoise the noisy image \boldsymbol{Y} by minimizing
\begin{equation} (1-t)\left\|{{\boldsymbol{Z} - \boldsymbol{Y}}}\right\|_2^2 + t( \left\|{{ \mathbb{W}\boldsymbol{Z}}}\right\|_1 + \alpha\left\|{{\boldsymbol{Z}}}\right\|^2_2), \;{\rm{ for }}\; \boldsymbol{Z}\in[0, 1]^p, \end{equation} | (6.6) |
where \mathbb{W} is the wavelet transform using the family of \texttt{db4} wavelets, \alpha = 10^{-3} , and p is the number of pixels in the image. The wavelet transform sparsifies natural images, and we thus select the empirical estimator in the wavelet domain. Moreover, in the limit with respect to the number of samples, N\rightarrow \infty , the empirical projection \widehat \Pi \boldsymbol{Y} is, for a given h , equivalent to a hard thresholding of \mathbb{W}\boldsymbol{Y} that preserves its h largest wavelet coefficients. Thus, for image denoising we do not use samples \boldsymbol{Y}_i but instead only threshold \mathbb{W}\boldsymbol{Y} for a well chosen h . Here h cannot be chosen by searching for a gap in \mathbb{W}\boldsymbol{Y} , since it most often does not exist. Instead, we say that the true h is the one that minimizes the MSE of the reconstructed image. In our first set of experiments the empirical estimator \widehat{\boldsymbol{X}} is obtained by hard thresholding the wavelet coefficients \mathbb{W}\boldsymbol{Y} with the optimal h .
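A sketch of this construction using PyWavelets; keeping the h largest-magnitude coefficients of \mathbb{W}\boldsymbol{Y} and inverting the transform is all that is needed (the wavelet family and the tie-breaking at the cutoff are incidental choices).

```python
import numpy as np
import pywt

def empirical_estimator(Y, h, wavelet="db4"):
    """Keep only the h largest-magnitude wavelet coefficients of the noisy
    image Y and reconstruct; this plays the role of the empirical estimator."""
    coeffs = pywt.wavedec2(Y, wavelet)
    arr, slices = pywt.coeffs_to_array(coeffs)
    cutoff = np.sort(np.abs(arr).ravel())[-h]      # magnitude of the h-th largest coefficient
    arr_thresh = np.where(np.abs(arr) >= cutoff, arr, 0.0)
    coeffs_thresh = pywt.array_to_coeffs(arr_thresh, slices, output_format="wavedec2")
    return pywt.waverec2(coeffs_thresh, wavelet)
```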
Data and learning setup. We consider five grayscale images: space shuttle, cherries, cat, mud flow, and IHC, each of size 512\times512 pixels. For BP and DP the regularization parameter is selected from a sequence of parameter values t_n = \frac{1}{1+\nu_0 q^n} with \nu_0 = 1 , q = 0.95 , and N_{\max} = 100, same as before. Moreover, we fix \tau = 1 , \kappa = 1/4 for BP and DP, and provide them with the true noise level. For OptEN the empirical estimator is computed with an oracle h , i.e., the one returning the lowest MSE.
Comparison/Error. We use two performance metrics: peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) between the original image \boldsymbol{X} and the recovered version \boldsymbol{Z} . PSNR is a standard pixel-based performance metric, defined through the MSE by
\quad \text{PSNR}(\boldsymbol{X}, \boldsymbol{Z}) = 10 \log_{10}\left({\frac{\left({\max\nolimits_{1\leq i\leq p} X_i - \min\nolimits_{1\leq i\leq p} X_i}\right)^2}{\text{MSE}(\boldsymbol{X}, \boldsymbol{Z})}}\right), \text{ MSE}(\boldsymbol{X}, {\boldsymbol{Z}}) = \frac{1}{p} \left\|{{\boldsymbol{X} - \boldsymbol{Z}}}\right\|_2^2. |
MSE and PSNR are ubiquitous in image and signal analysis due to their simplicity and suitability for optimization tasks, but are also infamous for their inability to capture features salient for human perception of image quality and fidelity [35]. SSIM on the other hand, is a structure-based performance metric that tries to address this issue by using easy-to-compute structural statistics to estimate image similarity. It is defined through
\text{SSIM}(\boldsymbol{X}, \boldsymbol{Z}) = \left({\frac{2{\overline{\boldsymbol{X}}}\, {\overline{\boldsymbol{Z}}} + C_1}{{\overline{\boldsymbol{X}}}^2+{\overline{\boldsymbol{Z}}}^2 + C_1}}\right)\left({\frac{2 std(\boldsymbol{X}) std(\boldsymbol{Z}) + C_2}{std(\boldsymbol{X})^2+std(\boldsymbol{Z})^2 + C_2}}\right), |
where \overline{\boldsymbol{X}}, \, \overline{\boldsymbol{Z}} are the means, and std(\boldsymbol{X}), \, std(\boldsymbol{Z}) are the standard deviations of pixels of corresponding images \boldsymbol{X} and \boldsymbol{Z} , and C_1, C_2 are positive constants†.
†We take C_1 = 0.01, \, C_2 = 0.03 by the convention of \texttt{python}'s \texttt{skimage} package
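For completeness, the two metrics exactly as defined above can be computed as follows (the \texttt{skimage} package provides equivalent, more general implementations).

```python
import numpy as np

def psnr(X, Z):
    """PSNR as defined above, with the dynamic range taken from X."""
    mse = np.mean((X - Z) ** 2)
    data_range = X.max() - X.min()
    return 10.0 * np.log10(data_range ** 2 / mse)

def ssim_global(X, Z, C1=0.01, C2=0.03):
    """Global (whole-image) SSIM based on the mean/std statistics used above."""
    mx, mz = X.mean(), Z.mean()
    sx, sz = X.std(), Z.std()
    luminance = (2 * mx * mz + C1) / (mx ** 2 + mz ** 2 + C1)
    contrast = (2 * sx * sz + C2) / (sx ** 2 + sz ** 2 + C2)
    return luminance * contrast
```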
Table 3 provides the PSNR and SSIM values generated by all algorithms on the considered images, while Figure 4 shows the result of denoising on a 128\times128 detail of each image for \sigma = 0.075 . We can see that our method achieves the highest PSNR on all images and that this effect is more pronounced for larger noise values.
In this set of experiments we study the performance of our method in a situation where the optimal h is not known a priori. We run experiments on a real-world dataset of brain images‡, which consists of in vivo MRIs of 13 patients with brain tumors, taken pre-surgery. For each patient we took an MRI slice, isolated the area around the brain, and then added additional isotropic white noise with \sigma\in\{0.05, 0.075, 0.1\} . We then select h by a heuristically driven procedure. Namely, for each image \boldsymbol{Y} we set an initial h_0\in \mathbb{N} and determine \hat{t}_0 by performing Algorithm 1, where \widehat{\boldsymbol{X}}_{h_0} is constructed by taking the h_0 largest coefficients of \mathbb{W} \boldsymbol{Y} . We then set h_1 = h_0+h_\text{step} , repeat the procedure, and continue iteratively for h_k . The iterations are stopped once the corresponding \hat{t}_k start to decrease or become discontinuous (since heuristically this corresponds to a decrease in the PSNR of the corresponding elastic-net regularized solution). h_0 and h_\text{step} are chosen according to the size of the image. The behavior of this criterion can be seen in Figure 5, which shows that if h is too large the empirical estimator is virtually the same as \boldsymbol{Y} . In other words, the minimizer of \left\|{{\boldsymbol{Z}^t-\widehat{\boldsymbol{X}}_h}}\right\|_2 is then t = 1 (i.e., \lambda = 0 ), as we observe in Figure 5.
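The h -selection loop can be summarized as in the sketch below; \texttt{opten\_for\_h} stands for running Algorithm 1 with the empirical estimator built from the h largest wavelet coefficients, and the initial value, step size, and drop tolerance are illustrative (the discontinuity check is omitted for brevity).

```python
def select_h(Y, opten_for_h, h0=1000, h_step=1000, max_steps=50, drop_tol=1e-3):
    """Increase h until the returned parameter t-hat starts to decrease,
    mirroring the stopping heuristic described above."""
    h, t_prev = h0, None
    for _ in range(max_steps):
        t_hat = opten_for_h(Y, h)
        if t_prev is not None and t_hat < t_prev - drop_tol:
            return h - h_step, t_prev      # the previous h was the last 'good' one
        h, t_prev = h + h_step, t_hat
    return h - h_step, t_prev
```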
‡Obtained from http://nist.mni.mcgill.ca/?page_id=672, therein referred to as group 2
The resulting reconstructions for \sigma = 0.1 can be seen in Figure 6 for four of the images. The effects of denoising are visually not as striking as the results in Section 6.3.1. We attribute this to the fact that the PSNR gains with the best possible choice of parameter for the elastic net are quite small; namely, the PSNR of the noisy image improves only by around 5-7\% when taking the optimal parameter (see Table 4). Other parameter selection rules (discrepancy principle and balancing principle) did not improve the PSNR and are thus not presented.
using t_{ \rm{opt}} | noisy | OptEN |
28.608 | 27.297 | 28.432 |
28.329 | 26.792 | 28.079 |
28.221 | 26.735 | 27.935 |
28.501 | 26.906 | 28.240 |
In this paper, we presented an approach for the estimation of the optimal regularization parameter for the elastic net. Theoretical guarantees are available only in simplified scenarios, but we used the insights gained there to design an efficient algorithm. The algorithm exhibits excellent prediction accuracy, including in cases where there are no theoretical guarantees. Comparisons with state-of-the-art methods show a clear superiority of our method under the studied testing scenarios. Moreover, whereas the other studied methods require adjusting a number of additional parameters in order to achieve satisfactory results, our method is entirely autonomous given a sufficient number of training samples.
We aim to use the ideas presented in this paper in further studies. Namely, we will study the behavior of the solution with respect to the other hyperparameter, \alpha , and consider other optimization schemes, predominantly focusing on imaging applications. We will also work on developing an optimization scheme for a joint minimization of both the regularization functional and the loss functional for the regularization parameter.
V. Naumova and Z. Kereta acknowledge the support from RCN-funded FunDaHD project No 251149/O70.
The authors declare no conflict of interest.
Proofs for Section 4
We will here add more details regarding per-interval minimisers of the loss function R(t) , defined in (3.10). First, we note the expression
R(t) = \sum\limits_{i = 1}^m \left(\frac{(t(1+2\lvert \mathsf{y}_i\lvert)-1)_+}{2(t(1-\alpha)+\alpha)} \operatorname{sgn}( \mathsf{y}_i) - \mathsf{x}_i\right)^2 = :\sum\limits_{i = 1}^m r_i(t), |
established in Section 4. Define \mathsf{b}_i = {{1+2\left|{{ \mathsf{y}_i}}\right|}} , for i = 1, \ldots, m . The loss function R(t) is continuous on [0, 1] , and differentiable on the intervals
\begin{align*} {\cal I}_0 & = \Big[0, \mathsf{b}_1^{-1}\Big), \quad {\cal I}_{m} = \Big( \mathsf{b}_{m}^{-1}, 1\Big], \text{ and } {\cal I}_k = \left({ \mathsf{b}_k^{-1}, \mathsf{b}_{k+1}^{-1}}\right), \;{\rm{ for }}\; k = 1, \ldots, m-1. \end{align*} |
On {\cal I}_0 the loss function is constant, R(t) = \sum_{i = 1}^m \mathsf{x}_i^2 . Define now a_i = \operatorname{sgn}(\mathsf{y}_i)(1+2\alpha\left|{{ \mathsf{y}_i}}\right|) , c_i = \operatorname{sgn}(\mathsf{y}_i)+2 \mathsf{x}_i(\alpha-1) + 2 \mathsf{y}_i , d_i = \operatorname{sgn}(\mathsf{y}_i)+2\alpha \mathsf{x}_i . If 1 < k < m then r_i(t) = \mathsf{x}_i^2 for i > k , and for i\leq k we have r_i(t) = \frac{1}{4}\Big(\frac{tc_i-d_i}{t(1-\alpha)+\alpha}\Big)^2 , and thus
r_i'(t) = -\frac{1}{2}(d_i(1-\alpha)+c_i\alpha)\frac{tc_i-d_i}{(\alpha+t(1-\alpha))^3} = -\frac{1}{2}a_i\frac{tc_i-d_i}{(\alpha+t(1-\alpha))^3}. |
Therefore, R_{ {\cal I}_k}'(t) = -\frac{1}{2(\alpha+t(1-\alpha))^3}\sum_{i = 1}^k a_i(tc_i-d_i). Equating the above expression with zero and restricting the solution to the interval {\cal I}_k , yields the desired expression.
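As a numerical sanity check, R(t) can be evaluated directly from the coordinate functions r_i(t) above and minimized on a grid, which can then be compared against the per-interval stationary points; the data below are illustrative.

```python
import numpy as np

def loss_R(t, y, x, alpha):
    """Evaluate R(t) coordinate-wise from the closed-form expression above."""
    num = np.maximum(t * (1.0 + 2.0 * np.abs(y)) - 1.0, 0.0)
    z = num / (2.0 * (t * (1.0 - alpha) + alpha)) * np.sign(y)
    return np.sum((z - x) ** 2)

# illustrative data: a sparse x observed with sign noise, as in Section 4
rng = np.random.default_rng(0)
m, h, sigma, alpha = 50, 5, 0.2, 1.0
x = np.zeros(m); x[:h] = 2.0
y = x + sigma * rng.choice([-1.0, 1.0], size=m)
grid = np.linspace(0.0, 1.0, 2001)
t_grid = grid[np.argmin([loss_R(t, y, x, alpha) for t in grid])]
print(f"grid minimizer of R: t = {t_grid:.3f}")
```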
Proofs for Theorem 2.2
We will here add an analogue of Eq (2.8) for the case of bounded {{\bf{y}}} . Assume \left\|{{ {{\bf{y}}}}}\right\|_2\leq \sqrt{L} holds almost surely and consider a random matrix { \rm{R}} = {{\bf{y}}}\otimes {{\bf{y}}} . Then \mathbb{E} { \rm{R}} = \Sigma({{\bf{y}}}) , and \left\|{{ { \rm{R}}}}\right\|_2 \leq L . Furthermore, { \rm{R}}^\top = { \rm{R}} and
m_2( { \rm{R}}) = \max \left\{ \left\|{{ \mathbb{E}[ { \rm{R}} { \rm{R}}^\top]}}\right\|_2, \left\|{{ \mathbb{E}[ { \rm{R}}^\top { \rm{R}}]}}\right\|_2 \right\} = \left\|{{ \mathbb{E} { \rm{R}}^\top { \rm{R}}}}\right\|_2\leq L\left\|{{\Sigma( {{\bf{y}}})}}\right\|_2. |
Let now {{\bf{y}}}_i\sim {{\bf{y}}} and define a family of independent m\times m matrices
{ \rm{R}}_i = {{\bf{y}}}_i\otimes {{\bf{y}}}_i, \quad i = 1, \ldots, N, |
so that { \rm{R}}_i\sim { \rm{R}} . The empirical covariance \widehat{\Sigma}({{\bf{y}}}) = \frac{1}{N}\sum_{i = 1}^N { \rm{R}}_i is then the matrix sampling estimator and by Corollary 6.2.1 from [32] we have that for all s\geq 0
\mathbb{P}\left({\left\|{{\widehat{\Sigma}( {{\bf{y}}})-\Sigma( {{\bf{y}}}) }}\right\|_2 \geq s }\right) \leq 2m\exp\left({-\frac{Ns^2/2}{m_2( { \rm{R}}) + 2Ls/3}}\right). |
Writing now 2m\exp\left({-\frac{Ns^2/2}{m_2({\rm R}) + 2Ls/3}}\right) = \exp(-u), we have a quadratic equation for s , whose solution is
s = \frac{4L\varepsilon + \sqrt{(4L\varepsilon)^2+ 72 Nm_2( { \rm{R}})\varepsilon }}{6N} |
for \varepsilon : = u + \log(2m) . It then follows
\begin{align*} s&\leq \frac{4L\varepsilon + 3\sqrt{2Nm_2( { \rm{R}})\varepsilon}}{3N} \leq \frac{\max(4L, 3\sqrt{2m_2( { \rm{R}})})}{3}\, \frac{\varepsilon + \sqrt{N\varepsilon}}{N}\\&\le C \left({\frac{u+\log(2m)}{N} + \sqrt{\frac{u+\log(2m)}{N}}}\right) \end{align*} |
for C = \frac{\max(4L, 3\sqrt{2m_2({ \rm{R}})})}{3} . Plugging it all together we have that with probability at least 1-\exp(-u)
\left\|{{\widehat{\Sigma}( {{\bf{y}}})-\Sigma( {{\bf{y}}}) }}\right\|_2 \lesssim \frac{u+\log(2m)}{N} + \sqrt{\frac{u+\log(2m)}{N}}. |
Thus, provided N\gtrsim u+\log(2m) we have
\left\|{{\widehat{\Sigma}( {{\bf{y}}})-\Sigma( {{\bf{y}}}) }}\right\|_2\leq \lambda_h/2. |
Computations for \alpha\neq 1 in Section 4.2
Let {{\bf{y}}} = {{\bf{x}}} + \sigma {{\bf{w}}}, where \mathbb{P}\left({ \mathsf{w}_i = \pm 1}\right) = \frac{1}{2}, and assume {{\bf{x}}} = (\mathsf{x}_1, \ldots, \mathsf{x}_h, 0, \ldots, 0)^\top, and \left|{{ \mathsf{x}_i}}\right|\geq 2\sigma, for i = 1, \ldots, h. In the following we will use {{\bf{v}}}_{1:k} to denote a vector in \mathbb{R}^k that consists of the first k entries of a vector {{\bf{v}}}\in \mathbb{R}^m , and denote {\bf{ \pmb{\mathsf{ η}} }} = \sigma {{\bf{w}}} . In Section 4 we showed that the minimum of R(t) = \left\|{{ {{\bf{z}}}^t({{\bf{y}}}) - {{\bf{x}}}}}\right\|_2^2 and \widehat R(t) = \left\|{{ {{\bf{z}}}^t({{\bf{y}}}) - \widehat {{\bf{x}}}}}\right\|_2^2 in each sub-interval {\cal I}_k , for k = 1, \ldots, m of [0, 1] is of the form
t^{\ast, k} = \frac{\sum\nolimits_{i = 1}^k a_id_i}{\sum\nolimits_{i = 1}^k a_ic_i}, \quad \widehat{t}^{\ast, k} = \frac{\sum\nolimits_{i = 1}^k a_i\widehat{d}_i}{\sum\nolimits_{i = 1}^k a_i\widehat{c}_i} |
for
\begin{align*} a_i & = \mathsf{s}_i(1+2\alpha\left|{{ \mathsf{y}_i}}\right|), \, \, \, c_i = \mathsf{s}_i+2 \mathsf{x}_i(-1+\alpha) + 2 \mathsf{y}_i, \, \, \, \widehat{c}_i = \mathsf{s}_i+2\widehat{ \mathsf{x}}_i(-1+\alpha) + 2 \mathsf{y}_i, \\ d_i & = \mathsf{s}_i+2\alpha \mathsf{x}_i, \, \, \, \widehat{d}_i = \mathsf{s}_i+2\alpha\widehat{ \mathsf{x}}_i, \end{align*} |
where \mathsf{s}_i = \operatorname{sgn}(\mathsf{y}_i) . Denoting {\bf{err}} = 2(\widehat{ {{\bf{x}}}}- {{\bf{x}}}) , we write
\begin{align*} \widehat{d}_i -d_i & = 2\alpha\left({\widehat{ \mathsf{x}}_i - \mathsf{x}_i}\right) = \alpha\mathsf{err}_i, \text{ and } \widehat{c}_i -c_i = 2(\alpha-1)(\widehat{ \mathsf{x}}_i - \mathsf{x}_i) = (\alpha-1)\mathsf{err}_i. \end{align*} |
We now have
\begin{align*} t^{\ast, k} - \widehat{t}^{\ast, k} & = \frac{\sum\nolimits_{i = 1}^k a_id_i\sum\nolimits_{i = 1}^k a_i\widehat{c}_i - \sum\nolimits_{i = 1}^k a_i\widehat{d}_i \sum\nolimits_{i = 1}^k a_ic_i}{\sum\nolimits_{i = 1}^k a_ic_i\sum\nolimits_{i = 1}^k a_i\widehat{c}_i}\\ & = \frac{\sum\nolimits_{i = 1}^k a_i^2 \left({d_i\widehat{c}_i - \widehat{d}_ic_i}\right) - \sum\nolimits_{1\leq i < j\leq k} a_ia_j\left({d_i\widehat{c}_j+d_j\widehat{c}_i - \widehat{d}_ic_j-\widehat{d}_jc_i}\right)}{\sum\nolimits_{i = 1}^k a_ic_i\sum\nolimits_{i = 1}^k a_i\widehat{c}_i}. \end{align*} |
Writing down each of the terms in the numerator we get
\begin{align*} d_i\widehat{c}_i - \widehat{d}_ic_i = -2\alpha\mathsf{err}_i( \mathsf{y}_i- \mathsf{x}_i) -\mathsf{err}_id_i = -\mathsf{err}_i a_i, \end{align*} |
where we use the fact that c_i = d_i+2(\mathsf{y}_i- \mathsf{x}_i) . We also get
\begin{align*} d_i\widehat{c}_j+d_j\widehat{c}_i - \widehat{d}_ic_j-\widehat{d}_jc_i & = \mathsf{err}_j\left({(\alpha-1)d_i-\alpha c_i}\right) + \mathsf{err}_i\left({(\alpha-1)d_j -\alpha c_j}\right) = -\left({\mathsf{err}_ja_i+\mathsf{err}_ia_j}\right). \end{align*} |
Thus,
\begin{align*} \sum\limits_{i = 1}^k a_id_i\sum\limits_{i = 1}^k a_i\widehat{c}_i - \sum\limits_{i = 1}^k a_i\widehat{d}_i \sum\limits_{i = 1}^k a_ic_i & = -\left({\sum\limits_{i = 1}^k a_i^3\mathsf{err}_i + \sum\limits_{1\leq i < j\leq k} a_ia_j (a_i\mathsf{err}_j+a_j\mathsf{err}_i)}\right)\\ & = -\sum\limits_{i = 1}^k \mathsf{err}_i\left({a_i^3 + a_i\sum\limits_{i\neq j} a_j^2}\right) = -\left\|{{ {{\bf{a}}}_{1:k}}}\right\|_2^2\sum\limits_{i = 1}^k a_i \mathsf{err}_i \end{align*} |
Turning our attention to the denominator we have
\begin{align*} \sum\limits_{i = 1}^k a_ic_i\sum\limits_{i = 1}^k a_i\widehat{c}_i & = \sum\limits_{i = 1}^k a_i^2c_i\widehat{c}_i + \sum\limits_{1\leq i < j\leq k} a_ia_j\left({c_i\widehat{c}_j+\widehat{c}_i c_j}\right)\\& = \left({\sum\limits_{i = 1}^k a_ic_i}\right)^2 + (\alpha-1)\left({\sum\limits_{i = 1}^k a_ic_i}\right) \sum\limits_{j = 1}^k a_j\mathsf{err}_j, \end{align*} |
due to
c_i\widehat{c}_j +\widehat{c_i}c_j = 2c_ic_j + (\alpha-1)\left({\mathsf{err}_jc_i+\mathsf{err}_i c_j}\right), \;{\rm{ and }}\;\, c_i\widehat{c}_i = c_i^2 + (\alpha-1)c_i \mathsf{err}_i. |
Putting it all together and rewriting we have
t^{\ast, k} - \widehat{t}^{\ast, k} = \frac{-\left\|{{ {{\bf{a}}}_{1:k}}}\right\|_2^2\langle {{\bf{a}}}_{1:k}, { {\bf{err}}}_{1:k}\rangle}{ \left({\left\|{{ {{\bf{a}}}_{1:k}}}\right\|_2^2-2(\alpha-1)\langle {{\bf{a}}}_{1:k}, {\bf{ \pmb{\mathsf{ η}} }}_{1:k}\rangle}\right)\left({\left\|{{ {{\bf{a}}}_{1:k}}}\right\|_2^2- 2(\alpha-1)\langle {{\bf{a}}}_{1:k}, ( {{\bf{y}}} - \widehat{ {{\bf{x}}}})_{1:k}\rangle}\right) } |
and recall {\bf{ \pmb{\mathsf{ η}} }}_{1:k} = ({{\bf{y}}}- {{\bf{x}}})_{1:k} . Taking now k = m , by the Cauchy–Schwarz inequality we have
\left|{{t_\text{opt} - \widehat{t}_\text{opt}}}\right| \leq \frac{\left\|{{ {{\bf{a}}}}}\right\|_2^3}{\left|{{ \left({\left\|{{ {{\bf{a}}}}}\right\|_2^2-2(\alpha-1)\langle {{\bf{a}}}, {\bf{ \pmb{\mathsf{ η}} }}\rangle}\right) \left({\left\|{{ {{\bf{a}}}}}\right\|_2^2- 2(\alpha-1)\langle {{\bf{a}}}, ( {{\bf{y}}} - \widehat{ {{\bf{x}}}})\rangle}\right) }}\right|} \left\|{{ {{\bf{x}}}-{\widehat {{\bf{x}}}}}}\right\|_2. |
The term \left\|{{ {{\bf{x}}}-\widehat {{\bf{x}}}}}\right\|_2 can be bounded as in Section 4. To bound the first factor we compute
{\left|{{{\left\|{{ {{\bf{a}}}}}\right\|_2^2-2(\alpha-1)\langle {{\bf{a}}}, {\bf{ \pmb{\mathsf{ η}} }}\rangle} }}\right|} = \left\|{{ {{\bf{a}}}}}\right\|_2^2 \left|{{1 -2(\alpha-1)\frac{\left < {{\frac{ {{\bf{a}}}}{\left\|{{ {{\bf{a}}}}}\right\|_2}}}, {{ {\bf{ \pmb{\mathsf{ η}} }}}}\right > }{\left\|{{ {{\bf{a}}}}}\right\|_2}}}\right|. |
Provided§ \left|{{2(\alpha-1){\left < {{\frac{ {{\bf{a}}}}{\left\|{{ {{\bf{a}}}}}\right\|_2}}}, {{ {\bf{ \pmb{\mathsf{ η}} }}}}\right > }}}\right| \leq \frac{\sqrt{2}}{2} {\left\|{{ {{\bf{a}}}}}\right\|_2} we have
\left|{{1 -2(\alpha-1)\frac{\left < {{\frac{ {{\bf{a}}}}{\left\|{{ {{\bf{a}}}}}\right\|_2}}}, {{ {\bf{ \pmb{\mathsf{ η}} }}}}\right > }{\left\|{{ {{\bf{a}}}}}\right\|_2}}}\right|^{-1} \leq 2 \left|{{1+2(\alpha-1)\frac{\left < {{\frac{ {{\bf{a}}}}{\left\|{{ {{\bf{a}}}}}\right\|_2}}}, {{ {\bf{ \pmb{\mathsf{ η}} }}}}\right > }{\left\|{{ {{\bf{a}}}}}\right\|_2}}}\right|\leq 2 \left({1+2\left|{{\alpha-1}}\right|\frac{\left\|{{ {\bf{ \pmb{\mathsf{ η}} }}}}\right\|_2}{\left\|{{ {{\bf{a}}}}}\right\|_2}}\right), |
§This holds for example if (\alpha-1)^2\sigma^2m\lesssim m + 4\alpha\left\|{{ {{\bf{y}}}}}\right\|_1+4\alpha^2\left\|{{ {{\bf{y}}}}}\right\|_2^2
and
\left|{{1 -2(\alpha-1)\frac{\left < {{\frac{ {{\bf{a}}}}{\left\|{{ {{\bf{a}}}}}\right\|_2}}}, {{ {{\bf{y}}}-\widehat {{\bf{x}}}}}\right > }{\left\|{{ {{\bf{a}}}}}\right\|_2}}}\right|^{-1} \leq 2 \left({1+2\left|{{\alpha-1}}\right|\frac{\left\|{{ {{\bf{y}}}-\widehat {{\bf{x}}}}}\right\|_2}{\left\|{{ {{\bf{a}}}}}\right\|_2}}\right)\leq2 \left({1+2\left|{{\alpha-1}}\right|\frac{\left\|{{ {{\bf{y}}}}}\right\|_2}{\left\|{{ {{\bf{a}}}}}\right\|_2}}\right). |
where in the last line we used {{\bf{y}}}-\widehat {{\bf{x}}} = (\mathsf{{Id}}-\widehat\Pi) {{\bf{y}}} and the fact \left\|{{ \mathsf{{Id}}- { \rm{P}}}}\right\|_2 = \left\|{{ { \rm{P}}}}\right\|_2 for non-trivial (neither null nor identity) orthogonal projections { \rm{P}} . For \alpha\geq1 we have
2\left|{{\alpha-1}}\right|\frac{\left\|{{ {{\bf{y}}}}}\right\|_2}{\left\|{{ {{\bf{a}}}}}\right\|_2} \leq 1, \text{ and } 2\left|{{\alpha-1}}\right|\frac{\left\|{{ {\bf{ \pmb{\mathsf{ η}} }}}}\right\|_2}{\left\|{{ {{\bf{a}}}}}\right\|_2} \leq 1, |
using \left\|{{ {{\bf{a}}}}}\right\|_2 \geq 2\left|{{\alpha}}\right|\left\|{{ {{\bf{y}}}}}\right\|_2 , and the signal-to-noise gap in the last inequality. On the other hand, for 0 < \alpha < 1 we have \left\|{{ {{\bf{y}}}}}\right\| = \left\|{{ {{\bf{x}}}+ {\bf{ \pmb{\mathsf{ η}} }}}}\right\|\lesssim \sqrt{h}+\sigma\sqrt{m} , with high probability, giving
\frac{\left\|{{ {{\bf{a}}}}}\right\|_2^3}{\left|{{ \left({\left\|{{ {{\bf{a}}}}}\right\|_2^2-2(\alpha-1)\langle {{\bf{a}}}, {\bf{ \pmb{\mathsf{ η}} }}\rangle}\right) \left({\left\|{{ {{\bf{a}}}}}\right\|_2^2- 2(\alpha-1)\langle {{\bf{a}}}, {{\bf{y}}} - \widehat{ {{\bf{x}}}}\rangle}\right) }}\right|}\lesssim \frac{1}{\sqrt{m}}. |
In conclusion, for \alpha > 0 we have
\left|{{t_\text{opt} - \widehat{t}_\text{opt}}}\right| \lesssim \left\|{{\Pi-\widehat{\Pi}}}\right\|_2 + \sigma\sqrt{\frac{h}{m}}, |
as desired.
[1] S. W. Anzengruber, R. Ramlau, Morozov's discrepancy principle for Tikhonov-type functionals with nonlinear operators, Inverse Probl., 26 (2010), 025001. doi: 10.1088/0266-5611/26/2/025001
[2] A. Astolfi, Optimization: An introduction, 2006.
[3] F. Bauer, M. A. Lukas, Comparing parameter choice methods for regularization of ill-posed problems, Math. Comput. Simulat., 81 (2011), 1795–1841. doi: 10.1016/j.matcom.2011.01.016
[4] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res., 7 (2006), 2399–2434.
[5] R. Bhatia, Matrix analysis, New York: Springer-Verlag, 1997.
[6] T. Bonesky, Morozov's discrepancy principle and Tikhonov-type functionals, Inverse Probl., 25 (2009), 015015. doi: 10.1088/0266-5611/25/1/015015
[7] A. Chambolle, R. DeVore, N. Lee, B. Lucier, Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage, IEEE Trans. Image Process., 7 (1998), 319–335. doi: 10.1109/83.661182
[8] A. Chambolle, P. L. Lions, Image recovery via total variation minimization and related problems, Numer. Math., 76 (1997), 167–188. doi: 10.1007/s002110050258
[9] C. De Mol, E. De Vito, L. Rosasco, Elastic-net regularisation in learning theory, J. Complexity, 25 (2009), 201–230. doi: 10.1016/j.jco.2009.01.002
[10] E. De Vito, M. Fornasier, V. Naumova, A machine learning approach to optimal Tikhonov regularization I: affine manifolds, Anal. Appl., 2021, in press.
[11] C.-A. Deledalle, S. Vaiter, J. Fadili, G. Peyré, Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection, SIAM J. Imaging Sci., 7 (2014), 2448–2487. doi: 10.1137/140968045
[12] D. L. Donoho, De-noising by soft-thresholding, IEEE Trans. Inform. Theory, 41 (1995), 613–627. doi: 10.1109/18.382009
[13] D. L. Donoho, I. Johnstone, Ideal spatial adaptation via wavelet shrinkage, Biometrika, 81 (1994), 425–455. doi: 10.1093/biomet/81.3.425
[14] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, The Annals of Statistics, 32 (2004), 407–499.
[15] Y. C. Eldar, Generalized SURE for exponential families: applications to regularization, IEEE Trans. Signal Process., 57 (2009), 471–481. doi: 10.1109/TSP.2008.2008212
[16] H. W. Engl, M. Hanke, A. Neubauer, Regularization of inverse problems, Dordrecht: Kluwer Academic Publishers, 1996.
[17] W. J. Fu, Nonlinear GCV and quasi-GCV for shrinkage models, J. Stat. Plan. Infer., 131 (2005), 333–347. doi: 10.1016/j.jspi.2004.03.001
[18] R. Giryes, M. Elad, Y. C. Eldar, The projected GSURE for automatic parameter tuning in iterative shrinkage methods, Appl. Comput. Harmonic Anal., 30 (2011), 407–422. doi: 10.1016/j.acha.2010.11.005
[19] G. H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics, 21 (1979), 215–223. doi: 10.1080/00401706.1979.10489751
[20] P. C. Hansen, Analysis of discrete ill-posed problems by means of the L-curve, SIAM Rev., 34 (1992), 561–580. doi: 10.1137/1034115
[21] B. Hofmann, Regularization for applied inverse and ill-posed problems: a numerical approach, Springer-Verlag, 2013.
[22] B. Jin, D. Lorenz, S. Schiffler, Elastic-net regularisation: error estimates and active set methods, Inverse Probl., 25 (2009), 115022. doi: 10.1088/0266-5611/25/11/115022
[23] O. Lepskii, On a problem of adaptive estimation in Gaussian white noise, Theory Probab. Appl., 35 (1991), 454–466. doi: 10.1137/1135065
[24] A. Lorbert, P. Ramadge, Descent methods for tuning parameter refinement, In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, 469–476.
[25] V. A. Morozov, Methods for solving incorrectly posed problems, Springer Science & Business Media, 2012.
[26] E. Novak, H. Woźniakowski, Optimal order of convergence and (in)tractability of multivariate approximation of smooth functions, Constr. Approx., 30 (2009), 457–473. doi: 10.1007/s00365-009-9069-8
[27] C. M. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Statist., 9 (1981), 1135–1151.
[28] W. Su, M. Bogdan, E. Candès, False discoveries occur early on the Lasso path, Ann. Statist., 45 (2017), 2133–2150.
[29] U. Tautenhahn, U. Hämarik, The use of monotonicity for choosing the regularization parameter in ill-posed problems, Inverse Probl., 15 (1999), 1487–1505. doi: 10.1088/0266-5611/15/6/307
[30] A. Tikhonov, V. Glasko, Use of the best rate of adaptive estimation in some inverse problems, USSR Computational Mathematics and Mathematical Physics, 5 (1965), 93–107. doi: 10.1016/0041-5553(65)90150-3
[31] A. N. Tikhonov, V. Y. Arsenin, Solutions of ill-posed problems, Washington, D.C.: V. H. Winston & Sons, New York: John Wiley & Sons, 1977.
[32] J. A. Tropp, An introduction to matrix concentration inequalities, Now Foundations and Trends, 2015.
[33] R. Vershynin, High-dimensional probability: An introduction with applications in data science, Cambridge University Press, 2018.
[34] E. De Vito, S. Pereverzyev, L. Rosasco, Adaptive kernel methods using the balancing principle, Found. Comput. Math., 10 (2010), 455–479. doi: 10.1007/s10208-010-9064-2
[35] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process., 13 (2004), 600–612. doi: 10.1109/TIP.2003.819861
[36] S. N. Wood, Modelling and smoothing parameter estimation with multiple quadratic penalties, J. Roy. Stat. Soc. B, 62 (2000), 413–428. doi: 10.1111/1467-9868.00240
[37] H. Zou, T. Hastie, Regularisation and variable selection via the elastic net, J. Roy. Stat. Soc. B, 67 (2005), 301–320. doi: 10.1111/j.1467-9868.2005.00503.x
Algorithm 1 OptEN algorithm for approximating the optimal elastic net regularization parameter using backtracking line search |
Input: {{\bf{y}}}_1, \ldots, {{\bf{y}}}_N, {{\bf{y}}}\in \mathbb{R}^m, \, { \rm{A}}\in \mathbb{R}^{m\times d} ; Compute \widehat{ {{\bf{x}}}} according to (2.14). Set a loss function \rightarrow \, \, \widehat{R} \;{\rm{ or }}\; \widehat{R}_p \;{\rm{ or }}\; \widehat{R}_m . In the rest of the algorithm we will refer to it as \tilde R ; Set \epsilon > 0 , \texttt{tol} > 0 , \texttt{tol2} > 0 , 0 < \alpha < 1 , and c_1, \beta, \gamma > 0 ; Set k\leftarrow0 , t_0 = 1 Compute r_1=\tilde R(1) , \tilde{r}_1=\tilde R(1-\epsilon) , \mathsf{p}_0=(r_1-\tilde r_1)/\epsilon , and r_2 = \varphi(\gamma_0) ; repeat \tilde{t} = t_k + \alpha \mathsf{p}_k ; \varphi_0=r_1 , \varphi_0'=- \mathsf{p}_k^2 , \varphi_1=R(\tilde t) ; if \varphi_1-\varphi_0 < c_1 \varphi_0'\alpha then s_k=\alpha ; else s_k = -\frac{1}{2}\frac{\varphi_0'\alpha^2}{\varphi_1- \varphi_0 -\varphi_0'\alpha} ; end if if |s_k| < \texttt{tol2} or |s_{k-1}/s_k| > \gamma then s_k = s_{k-1} \cdot \beta ; end if Set t_{k+1}=t_k + s_k \mathsf{p}_k ; Compute r_1=\tilde R(t_{k+1}) , \mathsf{p}_{k+1} = (\tilde R(t_{k+1}+\epsilon)-\tilde R(t_{k+1}-\epsilon)/(2\epsilon) ; k\leftarrow k+1 ; until | \mathsf{p}_k| < \texttt{tol} or k < \texttt{max_iter} ; Output: Approximate regularization parameter \hat{t}:=t_k . |
Method | \frac{|t_{opt} - \widehat{t}|}{t_{opt}} | \frac{\| {{\bf{x}}} - {{\bf{z}}}^{\widehat{t}}\|_2}{\| {{\bf{x}}}\|_2} | \text{FDP}(\widehat{t}) | \text{TPP}(\widehat{t}) | \genfrac{}{}{0pt}{}{ \rm{computational}}{ \rm{time [s]}} |
x_{t_{opt}} | 0 | 0.0984 | 0.1583 | 1.000 | 0 |
\widehat {{\bf{x}}} | N/A | 0.1004 | 0.000 | 1.000 | N/A |
OptEN | 0.0254 | 0.0994 | 0.160 | 1.000 | 3.716 |
DP | 0.1196 | 0.1118 | 0.087 | 1.000 | 0.7421 |
ME | 0.2493 | 0.1376 | 0.010 | 1.000 | 0.1757 |
QO | 0.4543 | 0.1843 | 0.716 | 1.000 | 7.193 |
LC | 0.1414 | 0.1174 | 0.188 | 1.000 | 1.564 |
BP | 0.0877 | 0.1057 | 0.095 | 1.000 | 36.77 |
ENBP | 0.2728 | 0.1450 | 0.007 | 1.000 | 4.355 |
GCV | 0.4548 | 0.1844 | 0.716 | 1.000 | 15.91 |
NGCV | 0.3597 | 0.1577 | 0.638 | 1.000 | 7.306 |
Method | \frac{|t_{opt} - \widehat{t}|}{t_{opt}} | \frac{\| {{\bf{x}}} - {{\bf{z}}}^{\widehat{t}}\|_2}{\| {{\bf{x}}}\|_2} | FDP(\widehat{t}) | TPP(\widehat{t}) | \genfrac{}{}{0pt}{}{ \rm{computational}}{ \rm{time [s]}} |
using t_{opt} | 0 | 0.5704 | 0.4918 | 0.9450 | 0 |
empirical estimator \widehat {{\bf{x}}} | N/A | 0.7978 | 0.8047 | 1.000 | N/A |
projected OptEN | 0.0718 | 0.6033 | 0.507 | 0.930 | 8.16 |
modified OptEN | 0.0763 | 0.6046 | 0.497 | 0.926 | 15.57 |
DP | 0.1316 | 0.6343 | 0.528 | 0.926 | 1.93 |
ME | 0.3203 | 0.7234 | 0.290 | 0.765 | 0.15 |
QO | 0.3167 | 0.7857 | 0.819 | 0.997 | 7.04 |
LC | 0.3389 | 0.7426 | 0.304 | 0.749 | 7.00 |
GCV | 0.3172 | 0.7865 | 0.819 | 0.997 | 14.04 |
NGCV | 0.3172 | 0.7865 | 0.819 | 0.997 | 7.06 |
BP | 0.2636 | 0.7179 | 0.757 | 0.974 | 35.01 |
ENBP | 0.2133 | 0.6549 | 0.362 | 0.847 | 3.69 |
PSNR | SSIM | ||||||||||
\genfrac{}{}{0pt}{}{ \rm{using}}{t_{ \rm{opt}}} | noisy | OptEN | DP | BP | \genfrac{}{}{0pt}{}{ \rm{using}}{t_{ \rm{opt}}} | noisy | OptEN | DP | BP | ||
\textsf{space} \textsf{shuttle} | |||||||||||
\sigma=0.05 | 32.33 | 26.27 | 32.32 | 30.26 | 30.63 | 0.929 | 0.687 | 0.929 | 0.850 | 0.862 | |
\sigma=0.075 | 30.26 | 22.87 | 30.25 | 27.32 | 27.69 | 0.908 | 0.516 | 0.908 | 0.748 | 0.765 | |
\sigma=0.1 | 28.77 | 20.50 | 28.76 | 25.32 | 25.60 | 0.892 | 0.394 | 0.891 | 0.660 | 0.676 | |
\textsf{cherries} | |||||||||||
\sigma=0.05 | 35.80 | 26.08 | 35.79 | 30.52 | 31.08 | 0.973 | 0.647 | 0.972 | 0.837 | 0.855 | |
\sigma=0.075 | 33.59 | 22.63 | 33.59 | 27.40 | 27.77 | 0.964 | 0.463 | 0.964 | 0.720 | 0.737 | |
\sigma=0.1 | 31.94 | 20.24 | 31.82 | 25.26 | 25.54 | 0.958 | 0.339 | 0.942 | 0.617 | 0.633 | |
\textsf{cat} | |||||||||||
\sigma=0.05 | 29.34 | 26.02 | 29.33 | 28.72 | 28.93 | 0.890 | 0.767 | 0.890 | 0.861 | 0.868 | |
\sigma=0.075 | 27.06 | 22.51 | 27.03 | 25.85 | 26.04 | 0.825 | 0.612 | 0.823 | 0.758 | 0.767 | |
\sigma=0.1 | 25.69 | 20.04 | 25.24 | 23.88 | 24.03 | 0.771 | 0.486 | 0.741 | 0.664 | 0.671 | |
\textsf{mud flow} | |||||||||||
\sigma=0.05 | 28.38 | 26.02 | 28.37 | 28.20 | 28.30 | 0.870 | 0.777 | 0.869 | 0.856 | 0.860 | |
\sigma=0.075 | 26.07 | 22.50 | 26.06 | 25.46 | 25.60 | 0.795 | 0.629 | 0.795 | 0.755 | 0.762 | |
\sigma=0.1 | 24.71 | 20.01 | 24.46 | 23.58 | 23.70 | 0.735 | 0.506 | 0.717 | 0.662 | 0.668 | |
\textsf{IHC} | |||||||||||
\sigma=0.05 | 29.33 | 26.02 | 29.33 | 28.73 | 28.92 | 0.890 | 0.767 | 0.889 | 0.861 | 0.868 | |
\sigma=0.075 | 27.06 | 22.51 | 27.04 | 25.86 | 26.04 | 0.825 | 0.612 | 0.823 | 0.759 | 0.767 | |
\sigma=0.1 | 25.70 | 20.04 | 25.39 | 23.85 | 24.04 | 0.772 | 0.486 | 0.748 | 0.662 | 0.671 |
using t_{ \rm{opt}} | noisy | OptEN |
28.608 | 27.297 | 28.432 |
28.329 | 26.792 | 28.079 |
28.221 | 26.735 | 27.935 |
28.501 | 26.906 | 28.240 |