
The Sigma-Pi-Sigma neural network (SPSNN) is a feed-forward neural network composed of Sigma-Pi units, which can be used to realize a static mapping of multi-layer neural networks [1,2,3]. In [4], a new model that can be regarded as a subclass of networks built on Sigma-Pi units was considered, and the authors traced the Kronecker product representation back to the classical Sigma-Pi units. In [5], a Sigma-Pi network and a new arithmetic were proposed, by which the output representation self-organizes to form a topographic map; the main contribution was to solve frame-of-reference transformation problems through unsupervised learning. Furthermore, in [6,7,8], the approximation, convergence performance, and generalization ability of sparse Sigma-Pi network functions were studied, and the new algorithms were shown to be more efficient than those in the existing literature.
Neural network optimization has emerged as a highly significant research topic in recent years. Research on neural network optimization covers two main topics. The first is weight optimization: for a given network structure, an appropriate learning method is selected to seek the optimal weights so that the training error and the generalization error are small enough [9,10]. The second is structural optimization, namely the selection of an appropriate activation function, number of network layers, connection mode, and so on [11,12,13]. However, the research on neural network structure optimization is far less rich than that on weight optimization. On the other hand, there is no evidence in the literature that more hidden layer neurons yield better generalization ability.
When it comes to neural networks, the number of neurons is a crucial factor. There are two common methods for determining the size of networks. The first method is the growing method, where the network starts with a smaller size and new hidden neurons are added during the training process [14]. Another method is the pruning method [15,16,17,18,19], which begins with a larger network and eventually removes redundant nodes.
Algorithms of this kind separate weight learning from weight training, which is inefficient. There are also many slightly more complex algorithms that introduce additional mechanisms such as particle swarm optimization [20], genetic algorithms [21], eigenvalue analysis [22], statistical analysis, and synthetic minority over-sampling techniques [23,24] to enhance the sparsification efficiency. The disadvantage of these algorithms is that the programs are complex and the computational cost is large.
An appropriately sized network structure is instrumental in enhancing efficiency. Overfitting poses a significant challenge during network training and is particularly problematic for deep neural network learning [25]. Consequently, researchers have extensively explored various forms of sparse regularization techniques, highlighting their indispensability.
In recent years, Lp regularization has been widely used to solve variable selection and parameter estimation problems in machine learning. This regularization method takes the form
$$E(W)=\hat{E}(W)+\tau\|W\|_p^p,$$
where $\hat{E}$ is the normal error function, $\|W\|_p=\bigl(\sum_i|W_i|^p\bigr)^{1/p}$ denotes the p-norm, $\tau$ is the penalty coefficient, and $\|\cdot\|$ represents the Euclidean norm. This regularization term is also called the penalty term.
In general, there are several common forms of regularization: weight elimination, weight decay, and approximate smoothing [26,27,28,29,30,31,32]. Among them, weight elimination is widely used as the penalty term in pruning feedforward neural networks, mainly to reduce unnecessary connections or optimize the network weights [33]. A more detailed introduction to the different penalty terms is given below.
For p = 2, adding the squared L2 norm of the weights to the standard error function makes the optimization better behaved [34,35,36,37]. This practice is called L2 regularization, and it takes the following form:
$$E(W)=\hat{E}(W)+\tau\|W\|_2^2,$$
where $\|W\|_2^2$ denotes the penalty term. The L2-norm solution is popular because of its special relationship with the normal distribution, and it can serve as a brute-force way to avoid excessive weights. Unfortunately, it is not sparse: during the training process, L2 regularization cannot drive unnecessary weights to zero.
The L0 regularization term is the standard tool for feature extraction and variable selection. By constraining the number of nonzero coefficients, the L0 regularizer produces the sparsest solutions, but these are difficult to compute because the resulting problem is a combinatorial optimization problem [38,39].
As an alternative to the L0 regularization term, the increasingly important L1 regularization term (Lasso) has become popular since it only requires solving a quadratic programming problem [40]. In [41], the L1 norm was combined with the capped L1 norm to indicate the amount of information collected by the filter and to control the regularization. The L1-regularized error function is generally written as follows:
$$E(W)=\hat{E}(W)+\tau\|W\|_1.$$
It was shown in [42,43] that, although these algorithms can generate an alternate neural network structure, they do not offer a unified neural network framework to solve a class of problems.
In order to explore more appropriate neural network structures and overcome the obstacles posed by suboptimal models, we propose a new SPSNN algorithm based on the L1 penalty and the L2 penalty, which handles complex and varied tasks within a unified framework and improves the robustness and generality of the model. On top of the L1 norm, an L2 norm is added in SPSNNs to promote a group sparsity effect that selects the relevant population of hidden nodes. Our proposed variant therefore benefits both from ridge regression and from the tendency of the L1 penalty toward sparse solutions, which generates a more suitable neural network structure than using either regularization term alone; the elastic net algorithm can also be used to solve such hybrid penalties [44].
In this paper, the penalty method is applied to the selection of the weights; since both L1 and L2 penalties are used in the same minimization problem, it not only overcomes the shortcomings of the individual L1 and L2 norms but also yields good generalization ability and sparsity. The error function with this hybrid penalty can be expressed as follows:
$$E(W)=\hat{E}(W)+\tau_1\|W\|_1+\tau_2\|W\|_2^2,$$
where the tuning constants τ1 and τ2 are fixed and non-negative. The τ1 term performs variable selection by producing a sparse weight vector, while the τ2 term ensures a unique solution and leads to a grouping effect. The added penalty is thus a combination of the L1 norm $\|W\|_1$ and the squared L2 norm $\|W\|_2^2$ of the parameter W.
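To make the hybrid penalty concrete, the following minimal Python sketch (our own illustration, not code from the paper) evaluates this error for a flat weight vector; `base_error` is a hypothetical placeholder standing in for the unregularized error $\hat{E}(W)$.

```python
import numpy as np

def elastic_net_error(base_error, W, tau1, tau2):
    """E(W) = E_hat(W) + tau1 * ||W||_1 + tau2 * ||W||_2^2 for a flat weight vector W."""
    l1_term = np.sum(np.abs(W))   # ||W||_1
    l2_term = np.sum(W ** 2)      # ||W||_2^2
    return base_error + tau1 * l1_term + tau2 * l2_term

# The L1 part favors driving small weights to zero, while the L2 part
# discourages any single weight from growing too large.
W = np.array([0.8, -0.05, 0.0, 1.2])
print(elastic_net_error(0.37, W, tau1=0.001, tau2=0.001))
```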
During the training process, the usual mixed regularization terms are not differentiable at the origin, which typically gives rise to an oscillation phenomenon. Therefore, we propose a new smoothing algorithm to overcome this difficulty: in a neighborhood of the origin, the absolute value of the weight in the error function is replaced by a smooth approximation. The main contributions and novelty of this article are as follows:
(1) In this article, in order to obtain the optimal architecture with good generalization performance, we will eliminate the weights in the hidden layer by using the idea of elastic net regularization. It means that, by incorporating the L1 and L2 regularization terms, this novel algorithm not only eliminates unnecessary weights but also performs pruning of the network structure in the hidden layers. This pruning reduces the size of the network, leading to an effective optimization of the network structure.
(2) We propose an SPSNN based on elastic net regularized batch gradient methods to obtain an optimal architecture with good generalization performance. By means of the smoothing technique, we effectively overcome the oscillation phenomenon in the process of network learning.
(3) To test the accuracy and robustness, we apply the proposed method to a large number of regression and classification tasks and compare it with other algorithms that have good sparsity and generalization capabilities.
The rest of the article is arranged as follows: In Section 2 the SPSNN is presented; the batch gradient algorithm for this model is described in Section 3, where the novel pruning algorithm based on the L1 plus L2 penalty term is presented in detail. Some numerical experiments are provided in Section 4, and finally some conclusions are drawn.
Next, we describe the structure of SPSNNs, which are composed of an input layer, a hidden layer of summing nodes (∑1), a hidden layer of product nodes (∏), and an output layer (∑2). P, N, Q, and 1 represent the number of nodes of the input layer, the ∑1 layer, the ∏ layer, and the ∑2 layer, respectively (see Figure 1).
We use $W_0=(w_{01},w_{02},\cdots,w_{0Q})^T\in\mathbb{R}^Q$ to denote the weight vector connecting the ∑2 layer and the ∏ layer. The weights connecting the ∏ layer and the ∑1 layer are fixed at 1. Meanwhile, with $W_n=(w_{n1},w_{n2},\cdots,w_{nP})^T$ denoting the weight vector connecting the input layer and the $n$th node of the ∑1 layer, we set
$$W=(W_0^T,W_1^T,\cdots,W_N^T)^T\in\mathbb{R}^{Q+NP}.$$
Let $X=(x_1,x_2,\cdots,x_P)\in\mathbb{R}^P$ denote the input of the network. Assume that $g:\mathbb{R}\longrightarrow\mathbb{R}$ is a given sigmoid activation function for the ∑1 layer. We denote the output vector $\xi\in\mathbb{R}^N$ of the ∑1 layer as
$$\xi=\bigl(g(W_1\cdot X),g(W_2\cdot X),\cdots,g(W_N\cdot X)\bigr)^T,$$
where ⋅ denotes the inner product between vectors.
Similarly, we take $\delta=(\delta_1,\delta_2,\cdots,\delta_Q)^T$ to denote the output vector of the ∏ layer. The ∑1 layer and the ∏ layer can be connected in two different ways: fully connected or sparsely connected. The difference between them lies in the number of product nodes: the former has $2^N$ product nodes, and the latter has fewer than $2^N$. The structure of the SPSNNs can be seen clearly in Figure 1.
Here, $H_q$ ($1\le q\le Q$) represents the set of nodes in the ∑1 layer connected with the $q$-th product node, while the set of all product nodes connected with the $n$-th node in the ∑1 layer is denoted by $F_n$ ($1\le n\le N$). For an arbitrary finite set $a$, let $\varphi(a)$ denote the number of its elements; then we have
$$\sum_{q=1}^{Q}\varphi(H_q)=\sum_{n=1}^{N}\varphi(F_n),$$
which will be used later.
We compute the output vector $\delta\in\mathbb{R}^Q$ of the ∏ layer by
$$\delta_q=\prod_{i\in H_q}\xi_i,\qquad 1\le q\le Q.$$
By convention, we set $\prod_{i\in H_q}\xi_i=1$ when $H_q=\emptyset$. In the SPSNNs, the final output can be written as
$$y=g(W_0\cdot\delta).$$
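To make the layer-by-layer computation above concrete, the following Python sketch (our own illustration, not the authors' code) performs one forward pass of an SPSNN under the assumption that g is the logistic sigmoid; the index sets $H_q$ are represented as Python lists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spsnn_forward(X, W0, Wn, H):
    """One forward pass of an SPSNN.

    X  : input vector, shape (P,)
    W0 : weights from the Pi layer to the output, shape (Q,)
    Wn : weights from the input layer to the Sigma_1 layer, shape (N, P)
    H  : list of Q index sets; H[q] lists the Sigma_1 nodes feeding product node q
    """
    xi = sigmoid(Wn @ X)                              # Sigma_1 layer output, shape (N,)
    # Product layer: delta_q = prod_{i in H_q} xi_i, with the empty product equal to 1.
    delta = np.array([np.prod(xi[list(Hq)]) if len(Hq) > 0 else 1.0 for Hq in H])
    y = sigmoid(W0 @ delta)                           # network output
    return xi, delta, y

# Tiny example with P = 3 inputs, N = 2 summing nodes and Q = 4 product nodes
# (all subsets of {0, 1}, including the empty one, i.e. the fully connected case).
rng = np.random.default_rng(0)
H = [[], [0], [1], [0, 1]]
X = np.array([0.2, -0.1, 0.4])
W0 = rng.uniform(-0.5, 0.5, size=len(H))
Wn = rng.uniform(-0.5, 0.5, size=(2, 3))
print(spsnn_forward(X, W0, Wn, H)[2])
```

With Q = $2^N$ product nodes and H ranging over all subsets of the ∑1 nodes, this reduces to the fully connected case described above; a sparsely connected network simply drops some of the subsets.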
We now introduce the batch gradient algorithm for SPSNNs. Let $\{X^l,O^l\}_{l=1}^{L}\subset\mathbb{R}^P\times\mathbb{R}$ be the given set of training samples, where $X^l$ denotes the $l$th input sample and $O^l$ is the corresponding ideal output. Let $y^l\in\mathbb{R}$ be the actual output for each input $X^l$. The conventional square error function is given as
$$\hat{E}(W)=\frac{1}{2}\sum_{l=1}^{L}\bigl(g(W_0\cdot\delta^l)-O^l\bigr)^2=\sum_{l=1}^{L}\frac{1}{2}\bigl(g(W_0\cdot\delta^l)-O^l\bigr)^2=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l),$$
where $g_l(z)=\frac{1}{2}\bigl(g(z)-O^l\bigr)^2$, $z\in\mathbb{R}$, $1\le l\le L$.
For convenience, we need to provide the following forms:
$$\delta^l=(\delta_1^l,\delta_2^l,\cdots,\delta_Q^l)=\Bigl(\prod_{i\in H_1}\xi_i,\prod_{i\in H_2}\xi_i,\cdots,\prod_{i\in H_Q}\xi_i\Bigr)=\Bigl(\prod_{i\in H_1}g(W_i\cdot X^l),\prod_{i\in H_2}g(W_i\cdot X^l),\cdots,\prod_{i\in H_Q}g(W_i\cdot X^l)\Bigr), \tag{3.1}$$
and thus, by virtue of
$$\hat{E}(W)=\sum_{l=1}^{L}g_l\Bigl(\sum_{q=1}^{Q}w_{0q}\prod_{i\in H_q}g(W_i\cdot X^l)\Bigr) \tag{3.2}$$
and some calculation, we obtain
$$\hat{E}_{W_0}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\,\delta^l.$$
It follows from $\delta_q^l=\prod_{i\in H_q}g(W_i\cdot X^l)$ that
$$\frac{\partial\delta_q^l}{\partial W_n}=\begin{cases}\Bigl(\prod_{i\in H_q\setminus\{n\}}\xi_i\Bigr)g'(W_n\cdot X^l)\,X^l, & \text{if } q\neq 1 \text{ and } n\in H_q,\\[4pt] 0, & \text{if } q=1 \text{ or } n\notin H_q.\end{cases}$$
Then, from the above equality and (3.2), we get
$$\hat{E}_{W_n}(W)=\sum_{l=1}^{L}\Bigl[g_l'(W_0\cdot\delta^l)\sum_{q=1}^{Q}\Bigl(w_{0q}\,\frac{\partial\delta_q^l}{\partial W_n}\Bigr)\Bigr].$$
To simplify the network structure, we add the elastic net regularization to optimize the network on the group level. So, we can get the corresponding form of the error function
$$E(W)=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l)+\tau\Bigl(\alpha\bigl(\|W_0\|_1+\sum_{n=1}^{N}\|W_n\|_1\bigr)+(1-\alpha)\bigl(\|W_0\|_2^2+\sum_{n=1}^{N}\|W_n\|_2^2\bigr)\Bigr), \tag{3.3}$$
where τ and α are the tuning parameters that control the performance of the penalty term. With the gradual increase of α, the L1 regular term dominates, and the error function gets closer to Lasso regression; with the gradual decrease of α, the L2 regular term dominates, and the error function gets closer to ridge regression. In particular, the error function is equivalent to that of ridge regression at α=0, and the cost function is equivalent to that of Lasso regression at α=1. Let τ1=τ⋅α and τ2=2τ⋅(1−α). Then, we have
$$E(W)=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l)+\tau_1\Bigl(\|W_0\|_1+\sum_{n=1}^{N}\|W_n\|_1\Bigr)+\frac{\tau_2}{2}\Bigl(\|W_0\|_2^2+\sum_{n=1}^{N}\|W_n\|_2^2\Bigr). \tag{3.4}$$
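As a quick numerical check of the reparameterization τ1 = τ·α and τ2 = 2τ·(1−α), the short sketch below (our own, not from the paper) verifies that the penalty term in (3.3) coincides with the one in (3.4) for arbitrary weight blocks $W_0, W_1, \dots, W_N$.

```python
import numpy as np

def penalty_33(W_blocks, tau, alpha):
    """Penalty term of (3.3): tau*(alpha*sum ||.||_1 + (1-alpha)*sum ||.||_2^2)."""
    l1 = sum(np.sum(np.abs(w)) for w in W_blocks)
    l2 = sum(np.sum(w ** 2) for w in W_blocks)
    return tau * (alpha * l1 + (1 - alpha) * l2)

def penalty_34(W_blocks, tau1, tau2):
    """Penalty term of (3.4): tau1*sum ||.||_1 + (tau2/2)*sum ||.||_2^2."""
    l1 = sum(np.sum(np.abs(w)) for w in W_blocks)
    l2 = sum(np.sum(w ** 2) for w in W_blocks)
    return tau1 * l1 + 0.5 * tau2 * l2

rng = np.random.default_rng(1)
W_blocks = [rng.uniform(-0.5, 0.5, size=4) for _ in range(5)]   # W_0, W_1, ..., W_4
tau, alpha = 0.001, 0.7
assert np.isclose(penalty_33(W_blocks, tau, alpha),
                  penalty_34(W_blocks, tau * alpha, 2 * tau * (1 - alpha)))
```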
The gradients of the error function with respect to $W_0$ and $W_n$ are, respectively, given as
$$E_{W_0}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\,\delta^l+\tau_1\frac{W_0}{\|W_0\|}+\tau_2 W_0 \tag{3.5}$$
and
$$E_{W_n}(W)=\sum_{l=1}^{L}\Bigl[g_l'(W_0\cdot\delta^l)\sum_{q\in F_n}\Bigl[w_{0q}\Bigl(\prod_{i\in H_q\setminus\{n\}}\xi_i^l\Bigr)g'(W_n\cdot X^l)\,X^l\Bigr]\Bigr]+\tau_1\frac{W_n}{\|W_n\|}+\tau_2 W_n. \tag{3.6}$$
We note that the elastic net regularization in (3.4) combines L1 norm regularization and L2 norm regularization. Clearly, (3.4) is not differentiable at the origin, which will produce an oscillation phenomenon, so we propose a smoothing approximation method to overcome the problem caused by the non-smoothness. For any finite-dimensional vector $u$ and a fixed constant $\gamma>0$, we define a smoothing function of $\|u\|$ as follows:
$$h(u,\gamma)=\begin{cases}\|u\|, & \text{if } \|u\|>\gamma,\\[4pt] \dfrac{\|u\|^2}{2\gamma}+\dfrac{\gamma}{2}, & \text{if } \|u\|\le\gamma.\end{cases} \tag{3.7}$$
We use (3.7) to approximate the elastic net regularization in (3.4). Furthermore, the gradient of h(u,γ) with respect to the vector u is given as follows:
$$\nabla_u h(u,\gamma)=\begin{cases}\dfrac{u}{\|u\|}, & \text{if } \|u\|>\gamma,\\[4pt] \dfrac{u}{\gamma}, & \text{if } \|u\|\le\gamma.\end{cases}$$
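A direct Python transcription of the smoothing function (3.7) and its gradient might look as follows (a sketch of our own; note that the two branches agree at $\|u\|=\gamma$, so the approximation is differentiable everywhere, including at the origin).

```python
import numpy as np

def h(u, gamma):
    """Smoothing approximation (3.7) of the Euclidean norm ||u||."""
    norm = np.linalg.norm(u)
    if norm > gamma:
        return norm
    return norm ** 2 / (2.0 * gamma) + gamma / 2.0

def grad_h(u, gamma):
    """Gradient of h(u, gamma) with respect to u."""
    norm = np.linalg.norm(u)
    if norm > gamma:
        return u / norm
    return u / gamma

# Near the origin h is quadratic (hence smooth at 0); away from it h equals ||u||.
u = np.array([1e-4, -2e-4])
print(h(u, gamma=0.01), grad_h(u, gamma=0.01))
```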
Accordingly, the error function (3.4) can be rewritten as
$$E(W)=\sum_{l=1}^{L}g_l(W_0\cdot\delta^l)+\tau_1\Bigl[h(W_0,\gamma)+\sum_{n=1}^{N}h(W_n,\gamma)\Bigr]+\frac{\tau_2}{2}\Bigl(\|W_0\|_2^2+\sum_{n=1}^{N}\|W_n\|_2^2\Bigr). \tag{3.8}$$
According to (3.8), the gradients for the smoothing elastic net Sigma-Pi-Sigma neural network are
$$E_{W_0}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\,\delta^l+\tau_1\nabla_{W_0}h(W_0,\gamma)+\tau_2 W_0, \tag{3.9}$$
$$E_{W_n}(W)=\sum_{l=1}^{L}g_l'(W_0\cdot\delta^l)\sum_{q\in F_n}w_{0q}\Bigl(\prod_{i\in H_q\setminus\{n\}}\xi_i^l\Bigr)g'(W_n\cdot X^l)\,X^l+\tau_1\nabla_{W_n}h(W_n,\gamma)+\tau_2 W_n, \tag{3.10}$$
where $l=1,2,\cdots,L$.
Starting from an arbitrary initial weight vector $W^0$, the weight sequence is defined by the following iterative formulas:
$$W^{k+1}=W^k+\triangle W^k, \tag{3.11}$$
$$\triangle W_0^k=-\eta E_{W_0}(W^k), \tag{3.12}$$
$$\triangle W_n^k=-\eta E_{W_n}(W^k), \tag{3.13}$$
where $\eta$ represents the learning rate.
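The update rules (3.11)–(3.13) amount to plain batch gradient descent on the smoothed error (3.8). The sketch below (our own outline, with `grad_W0`, `grad_Wn`, and `error` as hypothetical callables built from (3.9), (3.10), and (3.8), not library functions) shows how the iteration could be organized.

```python
import numpy as np

def train(W0, Wn, grad_W0, grad_Wn, eta=0.0028, max_iter=40000, tol=1e-4, error=None):
    """Batch gradient iteration (3.11)-(3.13).

    grad_W0(W0, Wn) and grad_Wn(W0, Wn, n) are assumed to return the gradients
    E_{W_0}(W) and E_{W_n}(W) of the smoothed error (3.8); error(W0, Wn) is an
    optional callable used for the stopping criterion.
    """
    for k in range(max_iter):
        dW0 = -eta * grad_W0(W0, Wn)                     # Delta W_0^k
        dWn = np.stack([-eta * grad_Wn(W0, Wn, n)        # Delta W_n^k for each summing node
                        for n in range(Wn.shape[0])])
        W0, Wn = W0 + dW0, Wn + dWn                      # W^{k+1} = W^k + Delta W^k
        if error is not None and error(W0, Wn) < tol:
            break
    return W0, Wn
```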
In this section, the performance of the models with no regularizer, the L2 regularizer, the original L1/2 regularizer (OL1/2), the smoothing L1/2 regularizer (SL1/2), and the original group lasso regularizer (OGL) is compared with the smoothing elastic net regularizer algorithm (SGL) using four examples: a classification problem, a parity problem, a function approximation problem, and a prediction problem.
In this example, we choose 8 benchmark data sets from the UCI machine learning repository to test the performance of the new algorithm (SGL), and compare it with the no regularizer, the L2 regularizer, the OL1/2, the SL1/2, and the OGL algorithms.
Table 1 presents the main characteristics of the relevant data sets, including the dataset size, the number of attributes, the number of classes, and the sizes of the training and testing sets; each dataset is randomly partitioned into two subsets, 70% for training and 30% for testing (a minimal sketch of this split is given after Table 1).
Dataset | Dataset Size | Training set | Testing set | Attributes | Classes |
Ecoli | 336 | 224 | 112 | 8 | 7 |
Olitos | 120 | 80 | 40 | 25 | 4 |
Seeds | 210 | 147 | 63 | 7 | 3 |
Iris | 150 | 105 | 45 | 4 | 3 |
Wine | 178 | 120 | 58 | 13 | 3 |
Liver | 345 | 240 | 105 | 7 | 2 |
Sonar | 208 | 138 | 70 | 60 | 2 |
Diabetes | 768 | 526 | 242 | 8 | 2 |
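The random 70/30 partition referred to above can be sketched as follows (our own helper; the exact split sizes reported in Table 1 depend on how each dataset was rounded and shuffled in the original experiments).

```python
import numpy as np

def split_70_30(X, y, seed=0):
    """Randomly partition a dataset into 70% training and 30% testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))         # random shuffle of sample indices
    cut = int(round(0.7 * len(X)))        # 70% of the samples go to training
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]
```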
As described at the beginning of this paper, we use the structure of SPSNNs (see Figure 1). We select P=13, N=4, Q=16, and 1, representing the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively. For each learning algorithm, the initial weights are randomly selected in the interval [−0.5,0.5], the learning rate η is 0.0028, and the regularization factor τ is 0.001; we conduct 20 trials for every data set to compare the performance of the different algorithms.
To assess the performance of the smoothing elastic net regularizer, for each data set we compare the number of remaining hidden neurons after pruning (RNN), the training accuracy, the testing accuracy, and the training time of each algorithm; all experimental results are recorded in Table 2. From the table, it can be observed that the training accuracy has slightly improved, while the testing accuracy has increased by approximately 1% to 3%. We find that our proposed smoothing elastic net regularizer is superior to the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, and the original group Lasso regularizer algorithms.
Dataset | Algorithm | RNN | Training accuracy | Testing accuracy | Training time(s) |
Ecoli | NoPenalty | 13.30 | 0.9761 | 0.9082 | 17.2756 |
L2 | 13.00 | 0.9747 | 0.8980 | 16.7845 | |
OL1/2 | 12.50 | 0.9749 | 0.9191 | 18.1372 | |
SL1/2 | 11.80 | 0.9771 | 0.9304 | 16.8050 | |
OGL | 12.20 | 0.9749 | 0.9191 | 18.3792 | |
SGL | 11.50 | 0.9780 | 0.9314 | 17.3763 | |
Olitos | NoPenalty | 13.00 | 0.9573 | 0.8914 | 29.9742 |
L2 | 12.33 | 0.9592 | 0.9053 | 29.3393 | |
OL1/2 | 12.00 | 0.9604 | 0.9160 | 30.6111 | |
SL1/2 | 11.67 | 0.9589 | 0.9245 | 30.3067 | |
OGL | 12.00 | 0.9617 | 0.9264 | 33.0595 | |
SGL | 11.00 | 0.9628 | 0.9355 | 30.5098 | |
Seeds | NoPenalty | 13.33 | 0.9737 | 0.9522 | 13.4081 |
L2 | 12.67 | 0.9761 | 0.9554 | 13.5387 | |
OL1/2 | 12.33 | 0.9791 | 0.9582 | 13.8205 | |
SL1/2 | 11.67 | 0.9797 | 0.9676 | 13.2730 | |
OGL | 12.33 | 0.9792 | 0.9629 | 14.7718 | |
SGL | 11.33 | 0.9813 | 0.9749 | 12.7701 | |
Iris | NoPenalty | 13.67 | 0.9715 | 0.9296 | 13.4447 |
L2 | 13.00 | 0.9719 | 0.9390 | 13.4420 | |
OL1/2 | 12.67 | 0.9723 | 0.9458 | 14.3637 | |
SL1/2 | 12.33 | 0.9743 | 0.9554 | 13.5318 | |
OGL | 12.33 | 0.9748 | 0.9522 | 16.1023 | |
SGL | 11.00 | 0.9791 | 0.9629 | 14.2630 | |
Wine | NoPenalty | 12.67 | 0.9872 | 0.9729 | 20.4322 |
L2 | 12.67 | 0.9892 | 0.9753 | 20.4173 | |
OL1/2 | 12.00 | 0.9896 | 0.9770 | 21.3012 | |
SL1/2 | 11.50 | 0.9911 | 0.9814 | 20.5545 | |
OGL | 11.67 | 0.9906 | 0.9798 | 21.1104 | |
SGL | 10.50 | 0.9915 | 0.9833 | 20.1607 | |
Liver | NoPenalty | 13.33 | 0.9937 | 0.9823 | 15.5081 |
L2 | 12.67 | 0.9943 | 0.9838 | 15.4588 | |
OL1/2 | 12.33 | 0.9947 | 0.9858 | 16.4440 | |
SL1/2 | 11.33 | 0.9951 | 0.9861 | 15.9798 | |
OGL | 11.00 | 0.9948 | 0.9868 | 17.4374 | |
SGL | 10.33 | 0.9962 | 0.9902 | 16.1645 | |
Sonar | NoPenalty | 12.67 | 0.9825 | 0.9756 | 12.6535 |
L2 | 12.67 | 0.9831 | 0.9787 | 12.1906 | |
OL1/2 | 12.33 | 0.9860 | 0.9830 | 12.9125 | |
SL1/2 | 12.00 | 0.9909 | 0.9852 | 12.5690 | |
OGL | 12.33 | 0.9892 | 0.9849 | 13.7648 | |
SGL | 11.67 | 0.9918 | 0.9860 | 11.7338 | |
Diabetes | NoPenalty | 13.00 | 0.9925 | 0.9937 | 17.7933 |
L2 | 12.50 | 0.9936 | 0.9947 | 17.1156 | |
OL1/2 | 11.67 | 0.9954 | 0.9950 | 17.4822 | |
SL1/2 | 11.33 | 0.9967 | 0.9955 | 17.1170 | |
OGL | 11.67 | 0.9961 | 0.9953 | 18.4761 | |
SGL | 11.33 | 0.9978 | 0.9961 | 17.3682 |
In addition, we have also compared our approach with other existing methods. In [45], the authors considered the group Lasso regularization method on the Sigma-Pi-Sigma neural network. In [46], the authors applied the group L1/2 regularization term on high-order neural networks. Our proposed elastic net regularization method is on par with these approaches.
For the parity problem, the input set consists of $2^n$ samples in n-dimensional space, and every sample is an n-bit binary vector. We consider a 5-bit parity problem, whose input set contains $2^5=32$ samples in 5-dimensional space; the ideal output equals 1 if the number of ones in the sample is odd, and 0 otherwise. Using the method described above, we test the performance of our proposed smoothing elastic net regularizer.
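For reference, the full 5-bit parity input set and its ideal outputs can be generated as follows (a small sketch of our own).

```python
import numpy as np
from itertools import product

def parity_dataset(n_bits=5):
    """All 2^n binary input vectors with parity labels (1 if the number of ones is odd)."""
    X = np.array(list(product([0, 1], repeat=n_bits)), dtype=float)
    y = (X.sum(axis=1) % 2).astype(float)
    return X, y

X, y = parity_dataset(5)
print(X.shape, y[:8])   # (32, 5) and the labels of the first eight patterns
```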
Similarly, we use the structure of SPSNNs described above. We select P=13, N=4, Q=16, and 1 for the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively. The initial weights are randomly selected in the interval [−0.5,0.5], the learning rate η is 0.0045, and the regularization factor τ is 0.001. For each learning algorithm we carry out 20 experiments, training for up to 40,000 steps or stopping once the error is less than 1e−4. To assess the sparsity and convergence of the smoothing elastic net regularizer, we compare the average error (AVE) and the number of remaining hidden neurons after pruning (RNN) of the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, the original group Lasso regularizer, and the smoothing elastic net regularizer, which are listed in Table 3.
Learning Methods | AVE | RNN |
NoPenalty | 0.004433 | 17.00 |
L2 | 0.003967 | 17.00 |
OL1/2 | 0.003929 | 17.14 |
SL1/2 | 0.004033 | 17.33 |
OGL | 0.003925 | 17.00 |
SGL | 0.003471 | 16.71 |
The results show that the proposed smooth elastic net regularizer outperforms the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, and the original group Lasso regularizer.
Figure 2(a) shows the error performance of the original group Lasso regularizer and the smoothing elastic net regularizer on the 5-bit parity problem. Figure 2(b) shows that the norm of the gradient curve of the error function, based on the 5-bit parity problem, approaches a small positive constant. This indicates that the smoothing elastic net regularizer removes the oscillation that occurs with the original group Lasso regularizer during the learning process.
In this example, we study the multi-dimensional Gabor function to compare the approximation performance of the above algorithms.
$$k(x,y)=\frac{1}{2\pi(0.5)^2}\exp\Bigl(-\frac{x^2+y^2}{2(0.5)^2}\Bigr)\cos\bigl(2\pi(x+y)\bigr).$$
As described at the beginning of this paper, we use the structure of the SPSNNs. We select P=13, N=4, Q=16, and 1 for the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively.
In this experiment, we select 169 training samples from an evenly spaced 13×13 grid on the square −0.5≤x≤0.5 and −0.5≤y≤0.5. The initial weights are randomly selected in the interval [−0.5,0.5], the learning rate η is 0.0028, and the regularization factor τ is 0.001. For each learning algorithm we carry out 20 experiments, training for up to 40,000 iterations or stopping once the error is less than 1e−4.
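Assuming the 169 training samples come from a 13×13 grid as stated above, the target data for this experiment can be generated as follows (our own sketch of the Gabor function and the sampling grid).

```python
import numpy as np

def gabor(x, y, sigma=0.5):
    """2-D Gabor function used as the approximation target."""
    return (1.0 / (2.0 * np.pi * sigma ** 2)
            * np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
            * np.cos(2.0 * np.pi * (x + y)))

# Evenly spaced 13 x 13 grid on [-0.5, 0.5]^2, giving 169 (input, target) pairs.
grid = np.linspace(-0.5, 0.5, 13)
xx, yy = np.meshgrid(grid, grid)
X_train = np.column_stack([xx.ravel(), yy.ravel()])   # shape (169, 2)
t_train = gabor(xx, yy).ravel()                       # shape (169,)
print(X_train.shape, t_train.shape)
```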
To assess the sparsity and convergence of the smoothing elastic net regularizer, we compare the average error (AVE) and the number of remaining hidden neurons after pruning (RNN) with the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, the original group Lasso regularizer, and the smoothing elastic net regularizer, which are shown in Table 4. We can see our proposed smoothing elastic net regularizer is superior to the no regularizer, the L2 regularizer, the original L1/2 regularizer, the smoothing L1/2 regularizer, and the original group Lasso regularizer.
Learning Methods | AVE | RNN |
NoPenalty | 0.003940 | 18.00 |
L2 | 0.003560 | 18.00 |
OL1/2 | 0.003783 | 17.67 |
SL1/2 | 0.004486 | 17.37 |
OGL | 0.003523 | 16.87 |
SGL | 0.003286 | 16.14 |
For each learning algorithm, we show the error function and the norm of the gradient for one of the 20 experiments after 40,000 epochs in Figures 3–5. Figure 3(a) shows the oscillation phenomenon of the no regularizer, the L2 regularizer, the original L1/2 regularizer, and the original group lasso regularizer. Figure 3(b) shows the error curves of the smoothing L1/2 regularizer and the smoothing elastic net regularizer. Figure 4(a) shows the norm of the gradient curve of the no regularizer, the L2 regularizer, the original L1/2 regularizer, and the original group Lasso regularizer. Figure 4(b) shows the norm of the gradient curve of the smoothing L1/2 regularizer and the smoothing elastic net regularizer; it clearly approaches a small positive constant. Figure 5 shows a typical result for one of the 20 experiments, and it can be seen that our method achieves a good approximation compared with the other algorithms. With the same parameters for each learning algorithm, we obtain the corresponding results: the learning method with the smoothing elastic net regularizer converges faster than the other learning methods and overcomes the numerical oscillation phenomenon. The error function curves are monotonically decreasing during the iterative process and converge to zero, which also validates our theoretical results.
To further verify the effectiveness of the algorithm, this part takes an interval shield tunneling project of Metro Line 9 in Zhengzhou City, China, as an example. In order to monitor the impact of the subway shield tunneling process on surface buildings and structures, settlement observation points are set up in advance at 10-meter intervals in each shield tunneling section, and the surrounding structures and buildings are monitored. JGC1, the settlement observation point of the signal tower closest to the right line of the shield, is selected as the object of study. A Leica DNA03 level is used to collect data starting 10 days before the shield is excavated to the point closest to JGC1, for a total of 40 days, with one observation per day. In this experiment, we use the data of the first 30 days as the training data set and the data of the last 10 days as the test data set (see Table 5).
Time | JGC1 (mm) | Time | JGC1 (mm) |
2015.3.18 | +0.21000 | 2015.3.31 | 0.003286 |
2015.3.19 | -0.13000 | 2015.4.2 | +0.10000 |
2015.3.20 | +0.11000 | 2015.4.3 | -0.02000 |
2015.3.21 | -0.06000 | 2015.4.4 | +0.08000 |
2015.3.22 | -0.23000 | 2015.4.5 | -0.16000 |
2015.3.23 | -0.23000 | 2015.4.6 | +0.06000 |
2015.3.24 | -0.15000 | 2015.4.7 | -0.18000 |
2015.3.25 | -0.03000 | 2015.4.8 | +0.28000 |
2015.3.26 | 0.003286 | 2015.4.9 | -0.07000 |
2015.3.27 | 0.003286 | 2015.4.10 | +0.07000 |
Time | JGC1 (mm) | Time | JGC1 (mm) |
2015.4.11 | +0.01000 | 2015.4.22 | +0.01000 |
2015.4.12 | -0.01000 | 2015.4.23 | 0.00000 |
2015.4.13 | -0.15000 | 2015.4.24 | +0.19000 |
2015.4.14 | +0.08000 | 2015.4.25 | -0.21000 |
2015.4.15 | -0.09000 | 2015.4.26 | +0.10000 |
2015.4.16 | +0.14000 | 2015.4.27 | -0.12000 |
2015.4.17 | -0.02000 | 2015.4.28 | +0.05000 |
2015.4.18 | -0.26000 | 2015.4.29 | +0.15000 |
2015.4.19 | +0.42000 | 2015.4.30 | -0.17000 |
2015.4.20 | -0.22000 | 2015.5.1 | -0.07000 |
In this experiment, we use the structure of the SPSNNs. We select P=5, N=4, Q=16, and 1 for the number of nodes of the input, ∑1, ∏, and ∑2 layers, respectively. We use the sigmoid activation function at the ∑1 and output layers, and our stopping criterion in this experiment is an error of less than 1×10−5 or 5000 iterations.
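The paper does not spell out how the daily settlement series is converted into 5-dimensional input samples; a common construction, which we sketch below purely as an assumption, is a sliding window in which each input stacks five consecutive JGC1 readings and the target is the reading that follows.

```python
import numpy as np

def sliding_window(series, window=5):
    """Build (input, target) pairs: each input is `window` consecutive settlement
    readings and the target is the reading that follows them (an assumed setup)."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    t = series[window:]
    return X, t

# Example with the first few JGC1 readings (mm) from Table 5.
readings = [0.21, -0.13, 0.11, -0.06, -0.23, -0.23, -0.15]
X, t = sliding_window(readings, window=5)
print(X.shape, t)   # (2, 5) and the two targets
```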
Figure 6 shows the error curves of the training set over 5000 iterations, where the red curve is the error without the regularization term and the blue curve is the error with the smoothing elastic net regularization. The error of the network with the smoothing elastic net regularization decreases faster, and after 500 iterations it is smaller than that of the network without regularization, which verifies the theoretical results of this paper and the effectiveness of the proposed algorithm.
In this paper, a new batch gradient algorithm for SPSNNs with an L1 plus L2 regularization term is proposed as an effective weight pruning technique. It can handle multi-output regression and multi-class classification problems within a unified framework. The algorithm inherits the good properties of both Lasso and ridge regression, penalizing the weights by driving redundant weight vectors toward zero, which is more efficient than various other pruning strategies. Moreover, the theoretical results and the advantages of this algorithm are illustrated by numerical experiments.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Conceptualization, methodology, original draft preparation, J. Jiao; software, editing, K. Su.
All authors declare there is no conflict of interest.