
The Sigma-Pi-Sigma neural network (SPSNN) is a feed-forward neural network composed of Sigma-Pi units, which can be used to realize a static mapping of multi-layer neural networks [1,2,3]. In [4], a new model that can be regarded as a subclass of networks built on Sigma-Pi units was considered, and the authors traced the Kronecker product representation back to the classical Sigma-Pi units. In [5], a Sigma-Pi network and a new arithmetic were proposed, by which the output representation self-organizes to form a topographic map; the main contribution was to solve frame-of-reference transformation problems through unsupervised learning. Furthermore, in [6,7,8], the approximation, convergence performance, and generalization ability of sparse Sigma-Pi network functions were studied, and the new algorithms were shown to be more efficient than those in the existing literature.
Neural network optimization has emerged as a highly significant research topic in recent years. Research on neural network optimization covers two main topics. The first is weight optimization: for a given network structure, an appropriate learning method is selected to seek the optimal weights so that the training error and the generalization error are small enough [9,10]. The second is structural optimization, namely the selection of an appropriate activation function, number of network layers, connection mode, and so on [11,12,13]. However, the research on neural network structure optimization is far less rich than that on weight optimization. On the other hand, there is no evidence in the literature that more hidden layer neurons yield better generalization ability.
When it comes to neural networks, the number of neurons is a crucial factor. There are two common methods for determining the size of networks. The first method is the growing method, where the network starts with a smaller size and new hidden neurons are added during the training process [14]. Another method is the pruning method [15,16,17,18,19], which begins with a larger network and eventually removes redundant nodes.
Algorithms of this kind separate weight learning from weight training, which is inefficient. There are also many slightly more complex algorithms that introduce additional mechanisms such as particle swarm optimization [20], genetic algorithms [21], eigenvalue analysis [22], statistical analysis, and synthetic minority over-sampling techniques [23,24] to enhance the sparsification efficiency. The disadvantage of these algorithms is that the programs are complex and the computational cost is large.
An appropriately sized network structure is instrumental in enhancing efficiency. Overfitting poses a significant challenge during network training and is particularly problematic for deep neural network learning [25]. Consequently, researchers have extensively explored various forms of sparse regularization techniques, highlighting their indispensability.
In recent years, Lp regularization has been widely used to solve variable selection and parameter estimation problems in machine learning. This regularization method takes the form
$$E(W)=\hat{E}(W)+\tau\|W\|_p^p,$$
where $\hat{E}$ is the normal error function, $\|W\|_p=\bigl(\sum_i|W_i|^p\bigr)^{1/p}$ denotes the p-norm, $\tau$ is the penalty coefficient, and $\|\cdot\|$ represents the Euclidean norm. This regularization term is also called the penalty term.
In general, there are several common forms of regularization: weight elimination, weight decay, and approximate smoothing [26,27,28,29,30,31,32]. Among them, weight elimination is widely used as the penalty term in pruning feedforward neural networks, mainly to reduce unnecessary connections or optimize the network weights [33]. A more detailed introduction to the different penalty terms is given below.
For p = 2, adding the squared L2 norm of the weights to the standard error function makes the optimization better behaved [34,35,36,37]. This practice is called L2 regularization, and it takes the following form:
$$E(W)=\hat{E}(W)+\tau\|W\|_2^2,$$
where $\|W\|_2^2$ denotes the penalty term. The L2-norm solution is popular because of its special relationship with the normal distribution, and it can serve as a brute-force way to avoid excessive weights. Unfortunately, it is not sparse: during the training process, L2 regularization cannot drive unnecessary weights to zero.
The L0 regularization term is the standard tool for feature extraction and variable selection. By constraining the number of nonzero coefficients, the L0 regularizer produces the sparsest solutions, but these are difficult to compute because the resulting problem is a combinatorial optimization problem [38,39].
As an alternative to the L0 regularization term, the increasingly important L1 regularization term (Lasso) has become popular since it only requires solving a quadratic programming problem [40]. In [41], the L1 norm was combined with the capped L1 norm to indicate the amount of information collected by the filter and to control the regularization. The L1-regularized error function is generally written as follows:
$$E(W)=\hat{E}(W)+\tau\|W\|_1.$$
It was shown in [42,43] that, although these algorithms can generate an alternate neural network structure, they do not offer a unified neural network framework to solve a class of problems.
In order to explore more appropriate neural network structures and overcome the obstacles posed by suboptimal models, we propose a new SPSNN algorithm based on the L1 penalty and the L2 penalty, which handles complex and varied tasks within a unified framework and improves the robustness and generality of the model. On top of the L1 norm, an L2 norm is added in SPSNNs to promote a group sparsity effect that selects the relevant population of hidden nodes. Our proposed variant therefore benefits both from ridge regression and from the tendency of the L1 penalty toward sparse solutions, which generates a more suitable neural network structure than using either regularization term alone; the elastic net algorithm can also be used to solve such hybrid penalties [44].
In this paper, the penalty method is applied to the selection of the weights; since both L1 and L2 penalties are used in the same minimization problem, it not only overcomes the shortcomings of the individual L1 and L2 norms but also yields good generalization ability and sparsity. The error function with this hybrid penalty can be expressed as follows:
$$E(W)=\hat{E}(W)+\tau_1\|W\|_1+\tau_2\|W\|_2^2,$$
where the tuning constants τ1 and τ2 are fixed and non-negative. The τ1 term performs variable selection by producing a sparse weight vector, while the τ2 term ensures a unique solution and leads to a grouping effect. The added penalty is thus a combination of the L1 norm $\|W\|_1$ and the squared L2 norm $\|W\|_2^2$ of the parameter W.
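To make the hybrid penalty concrete, the following minimal Python sketch (our own illustration, not code from the paper) evaluates this error for a flat weight vector; `base_error` is a hypothetical placeholder standing in for the unregularized error $\hat{E}(W)$.

```python
import numpy as np

def elastic_net_error(base_error, W, tau1, tau2):
    """E(W) = E_hat(W) + tau1 * ||W||_1 + tau2 * ||W||_2^2 for a flat weight vector W."""
    l1_term = np.sum(np.abs(W))   # ||W||_1
    l2_term = np.sum(W ** 2)      # ||W||_2^2
    return base_error + tau1 * l1_term + tau2 * l2_term

# The L1 part favors driving small weights to zero, while the L2 part
# discourages any single weight from growing too large.
W = np.array([0.8, -0.05, 0.0, 1.2])
print(elastic_net_error(0.37, W, tau1=0.001, tau2=0.001))
```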
During the training process, the usual mixed regularization terms are not differentiable at the origin, which typically gives rise to an oscillation phenomenon. Therefore, we propose a new smoothing algorithm to overcome this difficulty: in a neighborhood of the origin, the absolute value of the weight in the error function is replaced by a smooth approximation. The main contributions and novelty of this article are as follows:
(1) In this article, in order to obtain the optimal architecture with good generalization performance, we will eliminate the weights in the hidden layer by using the idea of elastic net regularization. It means that, by incorporating the L1 and L2 regularization terms, this novel algorithm not only eliminates unnecessary weights but also performs pruning of the network structure in the hidden layers. This pruning reduces the size of the network, leading to an effective optimization of the network structure.
(2) We propose an SPSNN based on elastic net regularized batch gradient methods to obtain an optimal architecture with good generalization performance. By means of the smoothing technique, we effectively overcome the oscillation phenomenon in the process of network learning.
(3) To test the accuracy and robustness, we apply the proposed method to a large number of regression and classification tasks and compare it with other algorithms that have good sparsity and generalization capabilities.
The rest of the article is arranged as follows: In Section 2 the SPSNN is presented; the batch gradient algorithm for this model is described in Section 3, where the novel pruning algorithm based on the L1 plus L2 penalty term is presented in detail. Some numerical experiments are provided in Section 4, and finally some conclusions are drawn.
Next, we describe the structure of SPSNNs, which are composed of an input layer, a hidden layer of summing nodes (∑1), a hidden layer of product nodes (∏), and an output layer (∑2). P, N, Q, and 1 represent the number of nodes of the input layer, the ∑1 layer, the ∏ layer, and the ∑2 layer, respectively (see Figure 1).
We use $W_0=(w_{01},w_{02},\cdots,w_{0Q})^T\in\mathbb{R}^Q$ to denote the weight vector connecting the ∑2 layer and the ∏ layer. The weights connecting the ∏ layer and the ∑1 layer are fixed at 1. Meanwhile, with $W_n=(w_{n1},w_{n2},\cdots,w_{nP})^T$ denoting the weight vector connecting the input layer and the $n$th node of the ∑1 layer, we set
$$W=(W_0^T,W_1^T,\cdots,W_N^T)^T\in\mathbb{R}^{Q+NP}.$$
Let $X=(x_1,x_2,\cdots,x_P)\in\mathbb{R}^P$ denote the input of the network. Assume that $g:\mathbb{R}\longrightarrow\mathbb{R}$ is a given sigmoid activation function for the ∑1 layer. We denote the output vector $\xi\in\mathbb{R}^N$ of the ∑1 layer as
$$\xi=\bigl(g(W_1\cdot X),g(W_2\cdot X),\cdots,g(W_N\cdot X)\bigr)^T,$$
where ⋅ denotes the inner product between vectors.
Similarly, we take $\delta=(\delta_1,\delta_2,\cdots,\delta_Q)^T$ to denote the output vector of the ∏ layer. The ∑1 layer and the ∏ layer can be connected in two different ways: fully connected or sparsely connected. The difference between them lies in the number of product nodes: the former has $2^N$ product nodes, and the latter has fewer than $2^N$. The structure of the SPSNNs can be seen clearly in Figure 1.
Here, $H_q$ ($1\le q\le Q$) represents the set of nodes in the ∑1 layer connected with the $q$-th product node, while the set of all product nodes connected with the $n$-th node in the ∑1 layer is denoted by $F_n$ ($1\le n\le N$). For an arbitrary finite set $a$, let $\varphi(a)$ denote the number of its elements; then we have
$$\sum_{q=1}^{Q}\varphi(H_q)=\sum_{n=1}^{N}\varphi(F_n),$$
which will be used later.
We compute the output vector $\delta\in\mathbb{R}^Q$ of the ∏ layer by
$$\delta_q=\prod_{i\in H_q}\xi_i,\qquad 1\le q\le Q.$$
By convention, we set $\prod_{i\in H_q}\xi_i=1$ when $H_q=\emptyset$. In the SPSNNs, the final output can be written as
$$y=g(W_0\cdot\delta).$$
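To make the layer-by-layer computation above concrete, the following Python sketch (our own illustration, not the authors' code) performs one forward pass of an SPSNN under the assumption that g is the logistic sigmoid; the index sets $H_q$ are represented as Python lists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spsnn_forward(X, W0, Wn, H):
    """One forward pass of an SPSNN.

    X  : input vector, shape (P,)
    W0 : weights from the Pi layer to the output, shape (Q,)
    Wn : weights from the input layer to the Sigma_1 layer, shape (N, P)
    H  : list of Q index sets; H[q] lists the Sigma_1 nodes feeding product node q
    """
    xi = sigmoid(Wn @ X)                              # Sigma_1 layer output, shape (N,)
    # Product layer: delta_q = prod_{i in H_q} xi_i, with the empty product equal to 1.
    delta = np.array([np.prod(xi[list(Hq)]) if len(Hq) > 0 else 1.0 for Hq in H])
    y = sigmoid(W0 @ delta)                           # network output
    return xi, delta, y

# Tiny example with P = 3 inputs, N = 2 summing nodes and Q = 4 product nodes
# (all subsets of {0, 1}, including the empty one, i.e. the fully connected case).
rng = np.random.default_rng(0)
H = [[], [0], [1], [0, 1]]
X = np.array([0.2, -0.1, 0.4])
W0 = rng.uniform(-0.5, 0.5, size=len(H))
Wn = rng.uniform(-0.5, 0.5, size=(2, 3))
print(spsnn_forward(X, W0, Wn, H)[2])
```

With Q = $2^N$ product nodes and H ranging over all subsets of the ∑1 nodes, this reduces to the fully connected case described above; a sparsely connected network simply drops some of the subsets.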
We now introduce the batch gradient algorithm for SPSNNs. Let $\{X^l,O^l\}_{l=1}^{L}\subset\mathbb{R}^P\times\mathbb{R}$ be the given set of training samples, where $X^l$ denotes the $l$th input sample and $O^l$ is the corresponding ideal output. Let $y^l\in\mathbb{R}$ be the actual output for each input $X^l$. The conventional square error function is given as
$$\hat{E}(W)=\frac{1}{2}\sum_{l=1}^{L}\bigl(g(W_0\cdot\delta^l)-O^l\bigr)^2=\sum_{l=1}^{L}\frac{1}{2}\bigl(g(W_0\cdot\delta^l)-O^l\bigr)^2=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l),$$
where $g_l(z)=\frac{1}{2}\bigl(g(z)-O^l\bigr)^2$, $z\in\mathbb{R}$, $1\le l\le L$.
For convenience, we need to provide the following forms:
$$\delta^l=(\delta_1^l,\delta_2^l,\cdots,\delta_Q^l)=\Bigl(\prod_{i\in H_1}\xi_i,\prod_{i\in H_2}\xi_i,\cdots,\prod_{i\in H_Q}\xi_i\Bigr)=\Bigl(\prod_{i\in H_1}g(W_i\cdot X^l),\prod_{i\in H_2}g(W_i\cdot X^l),\cdots,\prod_{i\in H_Q}g(W_i\cdot X^l)\Bigr), \tag{3.1}$$
and thus, by virtue of
$$\hat{E}(W)=\sum_{l=1}^{L}g_l\Bigl(\sum_{q=1}^{Q}w_{0q}\prod_{i\in H_q}g(W_i\cdot X^l)\Bigr) \tag{3.2}$$
and some calculation, we obtain
$$\hat{E}_{W_0}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\,\delta^l.$$
It follows from $\delta_q^l=\prod_{i\in H_q}g(W_i\cdot X^l)$ that
$$\frac{\partial\delta_q^l}{\partial W_n}=\begin{cases}\Bigl(\prod_{i\in H_q\setminus\{n\}}\xi_i\Bigr)g'(W_n\cdot X^l)\,X^l, & \text{if } q\neq 1 \text{ and } n\in H_q,\\[4pt] 0, & \text{if } q=1 \text{ or } n\notin H_q.\end{cases}$$
Then, from the above equality and (3.2), we get
$$\hat{E}_{W_n}(W)=\sum_{l=1}^{L}\Bigl[g_l'(W_0\cdot\delta^l)\sum_{q=1}^{Q}\Bigl(w_{0q}\,\frac{\partial\delta_q^l}{\partial W_n}\Bigr)\Bigr].$$
To simplify the network structure, we add the elastic net regularization to optimize the network on the group level. So, we can get the corresponding form of the error function
$$E(W)=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l)+\tau\Bigl(\alpha\bigl(\|W_0\|_1+\sum_{n=1}^{N}\|W_n\|_1\bigr)+(1-\alpha)\bigl(\|W_0\|_2^2+\sum_{n=1}^{N}\|W_n\|_2^2\bigr)\Bigr), \tag{3.3}$$
where τ and α are the tuning parameters that control the performance of the penalty term. With the gradual increase of α, the L1 regular term dominates, and the error function gets closer to Lasso regression; with the gradual decrease of α, the L2 regular term dominates, and the error function gets closer to ridge regression. In particular, the error function is equivalent to that of ridge regression at α=0, and the cost function is equivalent to that of Lasso regression at α=1. Let τ1=τ⋅α and τ2=2τ⋅(1−α). Then, we have
$$E(W)=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l)+\tau_1\Bigl(\|W_0\|_1+\sum_{n=1}^{N}\|W_n\|_1\Bigr)+\frac{\tau_2}{2}\Bigl(\|W_0\|_2^2+\sum_{n=1}^{N}\|W_n\|_2^2\Bigr). \tag{3.4}$$
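As a quick numerical check of the reparameterization τ1 = τ·α and τ2 = 2τ·(1−α), the short sketch below (our own, not from the paper) verifies that the penalty term in (3.3) coincides with the one in (3.4) for arbitrary weight blocks $W_0, W_1, \dots, W_N$.

```python
import numpy as np

def penalty_33(W_blocks, tau, alpha):
    """Penalty term of (3.3): tau*(alpha*sum ||.||_1 + (1-alpha)*sum ||.||_2^2)."""
    l1 = sum(np.sum(np.abs(w)) for w in W_blocks)
    l2 = sum(np.sum(w ** 2) for w in W_blocks)
    return tau * (alpha * l1 + (1 - alpha) * l2)

def penalty_34(W_blocks, tau1, tau2):
    """Penalty term of (3.4): tau1*sum ||.||_1 + (tau2/2)*sum ||.||_2^2."""
    l1 = sum(np.sum(np.abs(w)) for w in W_blocks)
    l2 = sum(np.sum(w ** 2) for w in W_blocks)
    return tau1 * l1 + 0.5 * tau2 * l2

rng = np.random.default_rng(1)
W_blocks = [rng.uniform(-0.5, 0.5, size=4) for _ in range(5)]   # W_0, W_1, ..., W_4
tau, alpha = 0.001, 0.7
assert np.isclose(penalty_33(W_blocks, tau, alpha),
                  penalty_34(W_blocks, tau * alpha, 2 * tau * (1 - alpha)))
```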
The gradients of the error function with respect to $W_0$ and $W_n$ are, respectively, given as
$$E_{W_0}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\,\delta^l+\tau_1\frac{W_0}{\|W_0\|}+\tau_2 W_0 \tag{3.5}$$
and
$$E_{W_n}(W)=\sum_{l=1}^{L}\Bigl[g_l'(W_0\cdot\delta^l)\sum_{q\in F_n}\Bigl[w_{0q}\Bigl(\prod_{i\in H_q\setminus\{n\}}\xi_i^l\Bigr)g'(W_n\cdot X^l)\,X^l\Bigr]\Bigr]+\tau_1\frac{W_n}{\|W_n\|}+\tau_2 W_n. \tag{3.6}$$
We note that the elastic net regularization in (3.4) combines L1 norm regularization and L2 norm regularization. Clearly, (3.4) is not differentiable at the origin, which will produce an oscillation phenomenon, so we propose a smoothing approximation method to overcome the problem caused by the non-smoothness. For any finite-dimensional vector $u$ and a fixed constant $\gamma>0$, we define a smoothing function of $\|u\|$ as follows:
$$h(u,\gamma)=\begin{cases}\|u\|, & \text{if } \|u\|>\gamma,\\[4pt] \dfrac{\|u\|^2}{2\gamma}+\dfrac{\gamma}{2}, & \text{if } \|u\|\le\gamma.\end{cases} \tag{3.7}$$
We use (3.7) to approximate the elastic net regularization in (3.4). Furthermore, the gradient of h(u,γ) with respect to the vector u is given as follows:
$$\nabla_u h(u,\gamma)=\begin{cases}\dfrac{u}{\|u\|}, & \text{if } \|u\|>\gamma,\\[4pt] \dfrac{u}{\gamma}, & \text{if } \|u\|\le\gamma.\end{cases}$$
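A direct Python transcription of the smoothing function (3.7) and its gradient might look as follows (a sketch of our own; note that the two branches agree at $\|u\|=\gamma$, so the approximation is differentiable everywhere, including at the origin).

```python
import numpy as np

def h(u, gamma):
    """Smoothing approximation (3.7) of the Euclidean norm ||u||."""
    norm = np.linalg.norm(u)
    if norm > gamma:
        return norm
    return norm ** 2 / (2.0 * gamma) + gamma / 2.0

def grad_h(u, gamma):
    """Gradient of h(u, gamma) with respect to u."""
    norm = np.linalg.norm(u)
    if norm > gamma:
        return u / norm
    return u / gamma

# Near the origin h is quadratic (hence smooth at 0); away from it h equals ||u||.
u = np.array([1e-4, -2e-4])
print(h(u, gamma=0.01), grad_h(u, gamma=0.01))
```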
Accordingly, the error function (3.4) can be rewritten as
$$E(W)=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l)+\tau_1\Bigl[h(W_0,\gamma)+\sum_{n=1}^{N}h(W_n,\gamma)\Bigr]+\frac{\tau_2}{2}\Bigl(\|W_0\|_2^2+\sum_{n=1}^{N}\|W_n\|_2^2\Bigr). \tag{3.8}$$
According to (3.8), the gradients for the smoothing elastic net Sigma-Pi-Sigma neural network are
$$E_{W_0}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\,\delta^l+\tau_1\nabla_{W_0}h(W_0,\gamma)+\tau_2 W_0, \tag{3.9}$$
$$E_{W_n}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\sum_{q\in F_n}w_{0q}\Bigl(\prod_{i\in H_q\setminus\{n\}}\xi_i^l\Bigr)g'(W_n\cdot X^l)\,X^l+\tau_1\nabla_{W_n}h(W_n,\gamma)+\tau_2 W_n, \tag{3.10}$$
where $l=1,2,\cdots,L$.
Starting from an arbitrary initial weight vector $W^0$, the weight sequence is defined by the following iterative formulas:
$$W^{k+1}=W^k+\triangle W^k, \tag{3.11}$$
$$\triangle W_0^k=-\eta E_{W_0}(W^k), \tag{3.12}$$
$$\triangle W_n^k=-\eta E_{W_n}(W^k), \tag{3.13}$$
where $\eta$ represents the learning rate.
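The update rules (3.11)–(3.13) amount to plain batch gradient descent on the smoothed error (3.8). The sketch below (our own outline, with `grad_W0`, `grad_Wn`, and `error` as hypothetical callables built from (3.9), (3.10), and (3.8), not library functions) shows how the iteration could be organized.

```python
import numpy as np

def train(W0, Wn, grad_W0, grad_Wn, eta=0.0028, max_iter=40000, tol=1e-4, error=None):
    """Batch gradient iteration (3.11)-(3.13).

    grad_W0(W0, Wn) and grad_Wn(W0, Wn, n) are assumed to return the gradients
    E_{W_0}(W) and E_{W_n}(W) of the smoothed error (3.8); error(W0, Wn) is an
    optional callable used for the stopping criterion.
    """
    for k in range(max_iter):
        dW0 = -eta * grad_W0(W0, Wn)                     # Delta W_0^k
        dWn = np.stack([-eta * grad_Wn(W0, Wn, n)        # Delta W_n^k for each summing node
                        for n in range(Wn.shape[0])])
        W0, Wn = W0 + dW0, Wn + dWn                      # W^{k+1} = W^k + Delta W^k
        if error is not None and error(W0, Wn) < tol:
            break
    return W0, Wn
```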
In this section, the performance of the models with no regularizer, the L2 regularizer, the original L1/2 regularizer (OL1/2), the smoothing L1/2 regularizer (SL1/2), and the original group lasso regularizer (OGL) is compared with the smoothing elastic net regularizer algorithm (SGL) using four examples: a classification problem, a parity problem, a function approximation problem, and a prediction problem.
In this example, we choose 8 benchmark data sets from the UCI machine learning repository to test the performance of the new algorithm (SGL), and compare it with the no regularizer, the L2 regularizer, the OL1/2, the SL1/2, and the OGL algorithms.
Table 1 presents the main characteristics of the relevant data sets, including the dataset size, the number of attributes, the number of classes, and the sizes of the training and testing sets; each dataset is randomly partitioned into two subsets, 70% for training and 30% for testing (a minimal sketch of this split is given after Table 1).
Dataset | Dataset Size | Training set | Testing set | Attributes | Classes |
Ecoli | 336 | 224 | 112 | 8 | 7 |
Olitos | 120 | 80 | 40 | 25 | 4 |
Seeds | 210 | 147 | 63 | 7 | 3 |
Iris | 150 | 105 | 45 | 4 | 3 |
Wine | 178 | 120 | 58 | 13 | 3 |
Liver | 345 | 240 | 105 | 7 | 2 |
Sonar | 208 | 138 | 70 | 60 | 2 |
Diabetes | 768 | 526 | 242 | 8 | 2 |
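The random 70/30 partition referred to above can be sketched as follows (our own helper; the exact split sizes reported in Table 1 depend on how each dataset was rounded and shuffled in the original experiments).

```python
import numpy as np

def split_70_30(X, y, seed=0):
    """Randomly partition a dataset into 70% training and 30% testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))         # random shuffle of sample indices
    cut = int(round(0.7 * len(X)))        # 70% of the samples go to training
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]
```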
As described at the beginning of this paper, we use the structure of SPSNNs (see Figure 1). We select P=13, N=4, Q=16, and 1, representing the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively. For each learning algorithm, the initial weights are randomly selected in the interval [−0.5,0.5], the learning rate η is 0.0028, and the regularization factor τ is 0.001; we conduct 20 trials for every data set to compare the performance of the different algorithms.
To assess the performance of the smoothing elastic net regularizer, for each data set we compare the number of remaining hidden neurons after pruning (RNN), the training accuracy, the testing accuracy, and the training time of each algorithm; all experimental results are recorded in Table 2. From the table, it can be observed that the training accuracy has slightly improved, while the testing accuracy has increased by approximately 1% to 3%. We find that our proposed smoothing elastic net regularizer is superior to the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, and the original group Lasso regularizer algorithms.
Dataset | Algorithm | RNN | Training accuracy | Testing accuracy | Training time(s) |
Ecoli | NoPenalty | 13.30 | 0.9761 | 0.9082 | 17.2756 |
L2 | 13.00 | 0.9747 | 0.8980 | 16.7845 | |
OL1/2 | 12.50 | 0.9749 | 0.9191 | 18.1372 | |
SL1/2 | 11.80 | 0.9771 | 0.9304 | 16.8050 | |
OGL | 12.20 | 0.9749 | 0.9191 | 18.3792 | |
SGL | 11.50 | 0.9780 | 0.9314 | 17.3763 | |
Olitos | NoPenalty | 13.00 | 0.9573 | 0.8914 | 29.9742 |
L2 | 12.33 | 0.9592 | 0.9053 | 29.3393 | |
OL1/2 | 12.00 | 0.9604 | 0.9160 | 30.6111 | |
SL1/2 | 11.67 | 0.9589 | 0.9245 | 30.3067 | |
OGL | 12.00 | 0.9617 | 0.9264 | 33.0595 | |
SGL | 11.00 | 0.9628 | 0.9355 | 30.5098 | |
Seeds | NoPenalty | 13.33 | 0.9737 | 0.9522 | 13.4081 |
L2 | 12.67 | 0.9761 | 0.9554 | 13.5387 | |
OL1/2 | 12.33 | 0.9791 | 0.9582 | 13.8205 | |
SL1/2 | 11.67 | 0.9797 | 0.9676 | 13.2730 | |
OGL | 12.33 | 0.9792 | 0.9629 | 14.7718 | |
SGL | 11.33 | 0.9813 | 0.9749 | 12.7701 | |
Iris | NoPenalty | 13.67 | 0.9715 | 0.9296 | 13.4447 |
L2 | 13.00 | 0.9719 | 0.9390 | 13.4420 | |
OL1/2 | 12.67 | 0.9723 | 0.9458 | 14.3637 | |
SL1/2 | 12.33 | 0.9743 | 0.9554 | 13.5318 | |
OGL | 12.33 | 0.9748 | 0.9522 | 16.1023 | |
SGL | 11.00 | 0.9791 | 0.9629 | 14.2630 | |
Wine | NoPenalty | 12.67 | 0.9872 | 0.9729 | 20.4322 |
L2 | 12.67 | 0.9892 | 0.9753 | 20.4173 | |
OL1/2 | 12.00 | 0.9896 | 0.9770 | 21.3012 | |
SL1/2 | 11.50 | 0.9911 | 0.9814 | 20.5545 | |
OGL | 11.67 | 0.9906 | 0.9798 | 21.1104 | |
SGL | 10.50 | 0.9915 | 0.9833 | 20.1607 | |
Liver | NoPenalty | 13.33 | 0.9937 | 0.9823 | 15.5081 |
L2 | 12.67 | 0.9943 | 0.9838 | 15.4588 | |
OL1/2 | 12.33 | 0.9947 | 0.9858 | 16.4440 | |
SL1/2 | 11.33 | 0.9951 | 0.9861 | 15.9798 | |
OGL | 11.00 | 0.9948 | 0.9868 | 17.4374 | |
SGL | 10.33 | 0.9962 | 0.9902 | 16.1645 | |
Sonar | NoPenalty | 12.67 | 0.9825 | 0.9756 | 12.6535 |
L2 | 12.67 | 0.9831 | 0.9787 | 12.1906 | |
OL1/2 | 12.33 | 0.9860 | 0.9830 | 12.9125 | |
SL1/2 | 12.00 | 0.9909 | 0.9852 | 12.5690 | |
OGL | 12.33 | 0.9892 | 0.9849 | 13.7648 | |
SGL | 11.67 | 0.9918 | 0.9860 | 11.7338 | |
Diabetes | NoPenalty | 13.00 | 0.9925 | 0.9937 | 17.7933 |
L2 | 12.50 | 0.9936 | 0.9947 | 17.1156 | |
OL1/2 | 11.67 | 0.9954 | 0.9950 | 17.4822 | |
SL1/2 | 11.33 | 0.9967 | 0.9955 | 17.1170 | |
OGL | 11.67 | 0.9961 | 0.9953 | 18.4761 | |
SGL | 11.33 | 0.9978 | 0.9961 | 17.3682 |
In addition, we have also compared our approach with other existing methods. In [45], the authors considered the group Lasso regularization method on the Sigma-Pi-Sigma neural network. In [46], the authors applied the group L1/2 regularization term on high-order neural networks. Our proposed elastic net regularization method is on par with these approaches.
For the parity problem, the input set consists of $2^n$ samples in n-dimensional space, and every sample is an n-bit binary vector. We consider a 5-bit parity problem, whose input set contains $2^5=32$ samples in 5-dimensional space; the ideal output equals 1 if the number of ones in the sample is odd, and 0 otherwise. Using the method described above, we test the performance of our proposed smoothing elastic net regularizer.
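For reference, the full 5-bit parity input set and its ideal outputs can be generated as follows (a small sketch of our own).

```python
import numpy as np
from itertools import product

def parity_dataset(n_bits=5):
    """All 2^n binary input vectors with parity labels (1 if the number of ones is odd)."""
    X = np.array(list(product([0, 1], repeat=n_bits)), dtype=float)
    y = (X.sum(axis=1) % 2).astype(float)
    return X, y

X, y = parity_dataset(5)
print(X.shape, y[:8])   # (32, 5) and the labels of the first eight patterns
```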
Similarly, we use the structure of SPSNNs described above. We select P=13, N=4, Q=16, and 1 for the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively. The initial weights are randomly selected in the interval [−0.5,0.5], the learning rate η is 0.0045, and the regularization factor τ is 0.001. For each learning algorithm we carry out 20 experiments, training for up to 40,000 steps or stopping once the error is less than 1e−4. To assess the sparsity and convergence of the smoothing elastic net regularizer, we compare the average error (AVE) and the number of remaining hidden neurons after pruning (RNN) of the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, the original group Lasso regularizer, and the smoothing elastic net regularizer, which are listed in Table 3.
Learning Methods | AVE | RNN |
NoPenalty | 0.004433 | 17.00 |
L2 | 0.003967 | 17.00 |
OL1/2 | 0.003929 | 17.14 |
SL1/2 | 0.004033 | 17.33 |
OGL | 0.003925 | 17.00 |
SGL | 0.003471 | 16.71 |
The results show that the proposed smooth elastic net regularizer outperforms the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, and the original group Lasso regularizer.
Figure 2(a) shows the error performance of the original group Lasso regularizer and the smoothing elastic net regularizer on the 5-bit parity problem. Figure 2(b) shows that the norm of the gradient curve of the error function, based on the 5-bit parity problem, approaches a small positive constant. This indicates that the smoothing elastic net regularizer removes the oscillation that occurs with the original group Lasso regularizer during the learning process.
In this example, we study the multi-dimensional Gabor function to compare the approximation performance of the above algorithms.
$$k(x,y)=\frac{1}{2\pi(0.5)^2}\exp\Bigl(-\frac{x^2+y^2}{2(0.5)^2}\Bigr)\cos\bigl(2\pi(x+y)\bigr).$$
As described at the beginning of this paper, we use the structure of the SPSNNs. We select P=13, N=4, Q=16, and 1 for the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively.
In this experiment, we select 169 training samples from an evenly spaced 13×13 grid on the square −0.5≤x≤0.5 and −0.5≤y≤0.5. The initial weights are randomly selected in the interval [−0.5,0.5], the learning rate η is 0.0028, and the regularization factor τ is 0.001. For each learning algorithm we carry out 20 experiments, training for up to 40,000 iterations or stopping once the error is less than 1e−4.
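Assuming the 169 training samples come from a 13×13 grid as stated above, the target data for this experiment can be generated as follows (our own sketch of the Gabor function and the sampling grid).

```python
import numpy as np

def gabor(x, y, sigma=0.5):
    """2-D Gabor function used as the approximation target."""
    return (1.0 / (2.0 * np.pi * sigma ** 2)
            * np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
            * np.cos(2.0 * np.pi * (x + y)))

# Evenly spaced 13 x 13 grid on [-0.5, 0.5]^2, giving 169 (input, target) pairs.
grid = np.linspace(-0.5, 0.5, 13)
xx, yy = np.meshgrid(grid, grid)
X_train = np.column_stack([xx.ravel(), yy.ravel()])   # shape (169, 2)
t_train = gabor(xx, yy).ravel()                       # shape (169,)
print(X_train.shape, t_train.shape)
```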
To assess the sparsity and convergence of the smoothing elastic net regularizer, we compare the average error (AVE) and the number of remaining hidden neurons after pruning (RNN) with the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, the original group Lasso regularizer, and the smoothing elastic net regularizer, which are shown in Table 4. We can see our proposed smoothing elastic net regularizer is superior to the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, and the original group Lasso regularizer.
Learning Methods | AVE | RNN |
NoPenalty | 0.003940 | 18.00 |
L2 | 0.003560 | 18.00 |
OL1/2 | 0.003783 | 17.67 |
SL1/2 | 0.004486 | 17.37 |
OGL | 0.003523 | 16.87 |
SGL | 0.003286 | 16.14 |
For each learning algorithm, we show the error function and the norm of the gradient for one of the 20 experiments after 40,000 epochs in Figures 3–5. Figure 3(a) shows the oscillation phenomenon of the no regularizer, the L2 regularizer, the original L1/2 regularizer, and the original group lasso regularizer. Figure 3(b) shows the error curves of the smoothing L1/2 regularizer and the smoothing elastic net regularizer. Figure 4(a) shows the norm of the gradient curve of the no regularizer, the L2 regularizer, the original L1/2 regularizer, and the original group Lasso regularizer. Figure 4(b) shows the norm of the gradient curve of the smoothing L1/2 regularizer and the smoothing elastic net regularizer; it clearly approaches a small positive constant. Figure 5 shows a typical result for one of the 20 experiments, and it can be seen that our method achieves a good approximation compared with the other algorithms. With the same parameters for each learning algorithm, we obtain the corresponding results: the learning method with the smoothing elastic net regularizer converges faster than the other learning methods and overcomes the numerical oscillation phenomenon. The error function curves are monotonically decreasing during the iterative process and converge to zero, which also validates our theoretical results.
To further verify the effectiveness of the algorithm, this part takes an interval shield tunneling project of Metro Line 9 in Zhengzhou City, China, as an example. In order to monitor the impact of the subway shield tunneling process on surface buildings and structures, settlement observation points are set up in advance at 10-meter intervals in each shield tunneling section, and the surrounding structures and buildings are monitored. JGC1, the settlement observation point of the signal tower closest to the right line of the shield, is selected as the object of study. A Leica DNA03 level is used to collect data starting 10 days before the shield is excavated to the point closest to JGC1, for a total of 40 days, with one observation per day. In this experiment, we use the data of the first 30 days as the training data set and the data of the last 10 days as the test data set (see Table 5).
Time | JGC1 (mm) | Time | JGC1 (mm) |
2015.3.18 | +0.21000 | 2015.3.31 | 0.003286 |
2015.3.19 | -0.13000 | 2015.4.2 | +0.10000 |
2015.3.20 | +0.11000 | 2015.4.3 | -0.02000 |
2015.3.21 | -0.06000 | 2015.4.4 | +0.08000 |
2015.3.22 | -0.23000 | 2015.4.5 | -0.16000 |
2015.3.23 | -0.23000 | 2015.4.6 | +0.06000 |
2015.3.24 | -0.15000 | 2015.4.7 | -0.18000 |
2015.3.25 | -0.03000 | 2015.4.8 | +0.28000 |
2015.3.26 | 0.003286 | 2015.4.9 | -0.07000 |
2015.3.27 | 0.003286 | 2015.4.10 | +0.07000 |
Time | JGC1 (mm) | Time | JGC1 (mm) |
2015.4.11 | +0.01000 | 2015.4.22 | +0.01000 |
2015.4.12 | -0.01000 | 2015.4.23 | 0.00000 |
2015.4.13 | -0.15000 | 2015.4.24 | +0.19000 |
2015.4.14 | +0.08000 | 2015.4.25 | -0.21000 |
2015.4.15 | -0.09000 | 2015.4.26 | +0.10000 |
2015.4.16 | +0.14000 | 2015.4.27 | -0.12000 |
2015.4.17 | -0.02000 | 2015.4.28 | +0.05000 |
2015.4.18 | -0.26000 | 2015.4.29 | +0.15000 |
2015.4.19 | +0.42000 | 2015.4.30 | -0.17000 |
2015.4.20 | -0.22000 | 2015.5.1 | -0.07000 |
In this experiment, we use the structure of the SPSNNs. We select P=5, N=4, Q=16, and 1 for the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively. We use the sigmoid activation function at the ∑1 and output layers, and our stopping criterion in this experiment is an error of less than 1×10−5 or 5000 iterations.
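The paper does not spell out how the daily settlement series is converted into 5-dimensional input samples; a common construction, which we sketch below purely as an assumption, is a sliding window in which each input stacks five consecutive JGC1 readings and the target is the reading that follows.

```python
import numpy as np

def sliding_window(series, window=5):
    """Build (input, target) pairs: each input is `window` consecutive settlement
    readings and the target is the reading that follows them (an assumed setup)."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    t = series[window:]
    return X, t

# Example with the first few JGC1 readings (mm) from Table 5.
readings = [0.21, -0.13, 0.11, -0.06, -0.23, -0.23, -0.15]
X, t = sliding_window(readings, window=5)
print(X.shape, t)   # (2, 5) and the two targets
```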
Figure 6 shows the error curves of the training set over 5000 iterations, where the red curve is the error without the regularization term and the blue curve is the error with the smoothing elastic net regularization. The error of the network with the smoothing elastic net regularization decreases faster, and after 500 iterations it is smaller than that of the network without regularization, which verifies the theoretical results of this paper and the effectiveness of the proposed algorithm.
In this paper, a new batch gradient algorithm for SPSNNs with an L1 plus L2 regularization term is proposed as an effective weight pruning technique. It can handle multi-output regression and multi-class classification problems within a unified framework. The algorithm inherits the good properties of both Lasso and ridge regression, penalizing the weights by driving redundant weight vectors toward zero, which is more efficient than various other pruning strategies. Moreover, the theoretical results and the advantages of this algorithm are illustrated by numerical experiments.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Conceptualization, methodology, original draft preparation, J. Jiao; software, editing, K. Su.
All authors declare there is no conflict of interest.