Citation: Giorgia Franchini, Valeria Ruggiero, Federica Porta, Luca Zanni. Neural architecture search via standard machine learning methodologies. Mathematics in Engineering, 2023, 5(1): 1-21. doi: 10.3934/mine.2023012
Abstract
In the context of deep learning, the most expensive computational phase is the full training of the learning methodology. Indeed, its effectiveness depends on the choice of proper values for the so-called hyperparameters, namely the parameters that are not trained during the learning process, and such a selection typically requires an extensive numerical investigation with the execution of a significant number of experimental trials. The aim of this paper is to investigate how to choose the hyperparameters related both to the architecture of a Convolutional Neural Network (CNN), such as the number of filters and the kernel size at each convolutional layer, and to the optimisation algorithm employed to train the CNN itself, such as the steplength, the mini-batch size and the potential adoption of variance reduction techniques. The main contribution of the paper consists in introducing an automatic Machine Learning technique to set these hyperparameters in such a way that a measure of the CNN performance can be optimised. In particular, given a set of values for the hyperparameters, we propose a low-cost strategy to predict the performance of the corresponding CNN, based on its behaviour after only a few steps of the training process. To achieve this goal, we generate a dataset whose input samples are provided by a limited number of hyperparameter configurations, together with the corresponding CNN measures of performance obtained after only a few steps of the CNN training process, while the label of each input sample is the performance corresponding to a complete training of the CNN. This dataset is then used as the training set for Support Vector Machines for Regression and/or Random Forest techniques, which predict the performance of the considered learning methodology given its performance at the initial iterations of its learning process. Furthermore, by a probabilistic exploration of the hyperparameter space, we are able to find, at quite low cost, the setting of the CNN hyperparameters which provides the optimal performance. The results of an extensive numerical experimentation, carried out on CNNs, together with the use of our performance predictor on NAS-Bench-101, highlight how the proposed methodology for the hyperparameter setting appears very promising.
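A rough sketch of the prediction step described above (not the authors' implementation; the feature layout and all numerical values below are purely hypothetical) could be built with scikit-learn as follows:

import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

# Each row encodes one hyperparameter configuration together with the CNN
# accuracy observed after only a few training steps (hypothetical layout):
# [number of filters, kernel size, steplength, mini-batch size, early accuracy]
X = np.array([
    [32, 3, 0.01, 64, 0.41],
    [64, 5, 0.10, 128, 0.55],
    [16, 3, 0.05, 32, 0.38],
])
# Labels: accuracy reached after a complete training of the same CNNs.
y = np.array([0.88, 0.93, 0.84])

svr = SVR(kernel="rbf", C=10.0).fit(X, y)                # SVR predictor
rf = RandomForestRegressor(n_estimators=200).fit(X, y)   # Random Forest predictor

# Predict the final accuracy of a new configuration from its early behaviour.
candidate = np.array([[48, 3, 0.02, 64, 0.47]])
print(svr.predict(candidate), rf.predict(candidate))

During the probabilistic exploration of the hyperparameter space, such predictors can be queried in place of a full training run, so that only the most promising configurations need to be trained to completion.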
1. Introduction
Since neural networks (NNs) are effective in modeling and describing nonlinear dynamics, there has been a remarkable surge in their utilization, and NNs have contributed substantially to signal processing, image processing, combinatorial optimization, pattern recognition, and other scientific and engineering domains [1,2]. Theoretical advancements have laid a solid foundation for this progress, facilitating the establishment of NNs as a powerful tool for a diverse range of applications. Consequently, the stability analysis of delayed neural networks (DNNs) has attracted the attention of many scholars [3,4].
It should be noted that time delays are invariably encountered in NNs as a consequence of intrinsic factors, including but not limited to the finite speed of information processing [5], and they can lead to instability and degraded performance in numerous real-world applications of NNs. If such delays are ignored, the predictive capacity and resilience of the network may suffer severe impairment, undermining its efficacy and dependability. Hence, the stability of these systems must be assessed carefully, with criteria rigorous enough to provide reliable guarantees for the system's operation. Thus, in recent years, researchers have been dedicated to analyzing the stability of DNNs and have invested significant effort in reducing the conservatism of stability criteria [6,7,8]. With regard to a stability criterion, accommodating a wider range of delay tolerance corresponds to a less conservative estimate; consequently, the upper limit of the delay range is of critical significance in assessing and quantifying the degree of conservativeness. Since time delays in actual NNs are usually time-varying, the stability of NNs with variable delays has been widely studied using the Lyapunov-Krasovskii functionals (LKFs) technique.
In order to analyze and solve the stability problems of DNNs, common approaches such as the Lyapunov stability method and linear matrix inequalities (LMIs) are often adopted. Among these, the Lyapunov stability method is widely applied, while LMIs provide a convenient framework in which the resulting conditions are expressed. This approach aims to construct suitable LKFs that yield less conservative stability conditions, ensuring the stability of the studied DNN even when delays vary within the largest possible closed interval. Various types of augmented LKFs were introduced in [9,10,11,12,13,14,15,16,17,18,19,20,21,22] to investigate delay-dependent asymptotic or exponential stability of DNNs with time-varying delays. Utilizing the augmented LKF approach, numerous improved stability criteria have also been established. Moreover, commonly used techniques such as integral inequalities, the free-weighting matrix method, and reciprocally convex combination have been employed to obtain the stability criteria. To reduce the conservatism of stability criteria, recent works [10,13,14,15] have employed delayed state derivative terms as augmented vector elements to estimate the time derivative of the LKF. Although the existing literature indicates that the resulting stability criteria are less conservative, it is worth noting that the dimensions of the corresponding LMI formulations grow substantially. Thus, enhanced LKFs generally increase the difficulty and complexity of the analysis, with a corresponding increase in computational burden and solution time.
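For orientation, a typical (non-augmented) LKF for a system with a constant delay bound $h$ has the form below; the augmented functionals cited above enrich this template with additional integral and cross terms:

$$ V(t)=x^{T}(t)Px(t)+\int_{t-h}^{t}x^{T}(s)Qx(s)\,ds+h\int_{-h}^{0}\!\int_{t+\theta}^{t}\dot{x}^{T}(s)R\dot{x}(s)\,ds\,d\theta,\qquad P,Q,R>0. $$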
The works cited above still require symmetry in the construction of the LKF. In [23], two novel delay-dependent stability criteria for time-delay systems are presented in terms of LMIs. Both are established via asymmetric augmented LKFs, whose positivity is ensured without requiring all matrices involved in the LKFs to be symmetric and positive definite. In [24], the same method was used to study the stability of Takagi-Sugeno fuzzy systems.
In this paper, the primary contributions can be outlined as follows:
(1) An improved asymmetric LKF is proposed, which can be positive definite without requiring that all matrix variables be symmetric or positive definite.
(2) A new stability criterion is formulated in terms of linear matrix inequalities, incorporating integral inequality and reciprocally convex combination techniques; a minimal illustration of how such LMI conditions are checked numerically is sketched below.
(3) Compared with traditional methods, the new approach has less conservatism and lower complexity, which enables it to characterize neural network stability problems more accurately.
Ultimately, the efficacy of this novel approach is demonstrated through a commonly employed numerical example, corroborating its robustness and its advantages over existing methodologies and providing a feasible basis for practical engineering applications.
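Criteria expressed as LMIs are checked numerically by a semidefinite-programming solver. The following minimal sketch (a hypothetical illustration, not the LMIs derived in this paper) verifies the classical delay-free Lyapunov condition $A^{T}P+PA<0$ with $P>0$ using Python and CVXPY:

import numpy as np
import cvxpy as cp

# Hypothetical stable system matrix, used only for illustration.
A = np.array([[-2.0, 0.5],
              [0.3, -1.5]])
n = A.shape[0]

P = cp.Variable((n, n), symmetric=True)   # Lyapunov matrix variable
eps = 1e-6
constraints = [
    P >> eps * np.eye(n),                    # P positive definite
    A.T @ P + P @ A << -eps * np.eye(n),     # Lyapunov LMI
]

# Pure feasibility problem: any feasible P certifies asymptotic stability.
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print(prob.status)   # "optimal" indicates the LMI is feasible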
Notations: $\mathbb{R}^{n}$ denotes the $n$-dimensional Euclidean vector space. $\mathbb{R}^{n\times m}$ is the set of all $n\times m$ real matrices. $\mathbb{D}_{n}^{+}$ represents the set of positive-definite diagonal matrices in $\mathbb{R}^{n\times n}$. $\mathrm{diag}\{\cdots\}$ denotes a block-diagonal matrix. $\mathrm{He}(M)=M+M^{T}$. $\mathbb{N}$ stands for the set of nonnegative integers.
2. Preliminaries
Lemma 2.1. (Jensen's inequality [25]) Given a matrix $Q>0$, for any continuous function $\omega:[a,b]\rightarrow\mathbb{R}^{n}$, the following inequality holds:
$$ (b-a)\int_{a}^{b}\omega^{T}(s)Q\,\omega(s)\,ds\ \ge\ \Big(\int_{a}^{b}\omega(s)\,ds\Big)^{T}Q\,\Big(\int_{a}^{b}\omega(s)\,ds\Big). $$
proves that $\dot{V}(t)<0$. Therefore, if LMIs (3.4)–(3.7) hold, the system (3.1) is asymptotically stable.
The proof is completed. □
Remark 3.1. The underlying expression is rescaled by combining the Wirtinger-based integral inequality and the B-L inequality, with coefficients 1−α and α, respectively. This provides a flexible blend over the range [0,1], which facilitates the search for the optimal combination. When α=1 (so 1−α=0), the bound relies solely on the B-L inequality; conversely, when α=0 (so 1−α=1), only the Wirtinger-based integral inequality is used.
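Since each of the two inequalities yields a valid lower bound for the same integral term, any convex combination of the two bounds is valid as well. Schematically, writing $\mathcal{W}$ and $\mathcal{B}$ for the lower bounds produced by the Wirtinger-based and B-L inequalities (a sketch of the idea, not the exact expression used in the proof):

$$ \int_{a}^{b}\dot{x}^{T}(s)R\,\dot{x}(s)\,ds\;\ge\;(1-\alpha)\,\mathcal{W}+\alpha\,\mathcal{B},\qquad \alpha\in[0,1]. $$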
Remark 3.2. Numerous researchers have introduced various LKFs in the analysis of DNNs, traditionally requiring the involved matrices to be symmetric. This study relaxes that traditional constraint by devising asymmetric forms of LKFs. With such asymmetric constructs, it is not mandatory for each matrix to be symmetric or positive definite when establishing the conditions. Consequently, the conditions become more relaxed, leading to a less conservative theorem and broadening the horizons of analysis in this domain.
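One elementary way to see why symmetry need not be imposed on every matrix appearing in the functional is that a quadratic form depends only on the symmetric part of its matrix (an illustrative observation, not the full positivity argument of the paper):

$$ x^{T}Px=\tfrac{1}{2}\,x^{T}\bigl(P+P^{T}\bigr)x=\tfrac{1}{2}\,x^{T}\mathrm{He}(P)\,x,\qquad\text{so }\ \mathrm{He}(P)>0\ \Rightarrow\ x^{T}Px>0\ \text{for all }x\neq 0. $$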
4. Numerical example
In this section, a numerical example is given to illustrate the effectiveness of the suggested stability criterion. The primary goal is to obtain the acceptable maximum upper bound (AMUB) of the time-varying delays that still guarantees the global asymptotic stability of the neural networks under consideration; the larger the AMUB, the less conservative the stability criterion.
Example 1. We consider DNNs (3.1) with the following parameters [17,18,21,28,29]:
It is evident that the AMUB derived from Theorem 1 surpasses those obtained by the criteria presented in [17,18,21,28,29]. Compared with Theorem 3 in [26], a larger AMUB is obtained when μ ranges from 0.45 to 0.55, while for μ = 0.4 the obtained value is smaller than that of [26]. Importantly, the number of decision variables (NVs) involved in Theorem 1, as shown in Table 2, is $16n^{2}+12n$, which is less than the $79.5n^{2}+13.5n$ reported in [26]; this reduction lowers the computational complexity. When μ = 0.4 and the initial states are 0.5, 0.8, −0.5, and −0.8, the corresponding state trajectories are illustrated in Figure 1; the trajectories tend toward zero at an abscissa of around 350. With μ = 0.55 and the same initial states, the state trajectories are depicted in Figure 2, where they tend toward zero at an abscissa of approximately 300.
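As a quick check of the complexity comparison, substituting a two-neuron network into the formulas quoted above gives

$$ n=2:\qquad 16n^{2}+12n=88,\qquad 79.5n^{2}+13.5n=345, $$

so the criterion of Theorem 1 involves far fewer decision variables even for small networks.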
5. Conclusions
A suitable asymmetric LKF has been constructed, enabling us to establish its positive definiteness without requiring all matrix variables to be symmetric or positive definite. Additionally, a combinatorial optimization approach has been exploited, using a linear combination of multiple integral inequalities to identify the optimal arrangement when bounding the relevant terms. The conditions presented in this study are therefore less conservative, and both theoretical and quantitative analyses confirm that the proposed technique can produce larger maximum allowable delays than several recent works in the literature. Finally, the proposed method can be effectively combined with the delay segmentation technique, dividing the delay interval into N segments for finer processing, which will be further investigated in future work.
Use of AI tools declaration
Throughout the preparation of this work, we utilized the AI-based proofreading tool "Grammarly" to identify and correct grammatical errors. Subsequently, we carefully reviewed the content and made any necessary additional edits. We take complete responsibility for the content of this publication.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant (No. 12061088), the Key R & D Projects of Sichuan Provincial Department of Science and Technology (2023YFG0287) and Sichuan Natural Science Youth Fund Project (Nos. 24NSFSC7038 and 2024NSFSC1404).
Conflict of interest
There are no conflicts of interest regarding this work.
References
[1]
T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019, 2623–2631. http://dx.doi.org/10.1145/3292500.3330701
[2]
B. Baker, O. Gupta, R. Raskar, N. Naik, Accelerating neural architecture search using performance prediction, 2017, arXiv: 1705.10823.
[3]
B. Baker, O. Gupta, N. Naik, R. Raskar, Designing neural network architectures using reinforcement learning, 2017, arXiv: 1611.02167.
[4]
J. F. Barrett, N. Keat, Artifacts in CT: recognition and avoidance, RadioGraphics, 24 (2004), 1679–1691. http://dx.doi.org/10.1148/rg.246045065 doi: 10.1148/rg.246045065
[5]
J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, In: Advances in Neural Information Processing Systems, 2011, 2546–2554.
[6]
J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, J. Mach. Learn. Res., 13 (2012), 281–305.
[7]
L. Bottou, F. E. Curtis, J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), 223–311. http://dx.doi.org/10.1137/16M1080173 doi: 10.1137/16M1080173
[8]
L. Breiman, Random forests, Machine Learning, 45 (2001), 5–32. http://dx.doi.org/10.1023/A:1010933404324 doi: 10.1023/A:1010933404324
[9]
C. J. C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2 (1998), 121–167. http://dx.doi.org/10.1023/A:1009715923555 doi: 10.1023/A:1009715923555
[10]
H. Cai, T. Chen, W. Zhang, Y. Yu, J. Wang, Efficient architecture search by network transformation, 2017, arXiv: 1707.04873.
[11]
T. Domhan, J. T. Springenberg, F. Hutter, Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves, In: IJCAI International Joint Conference on Artificial Intelligence, 2015, 3460–3468.
[12]
T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: a survey, J. Mach. Learn. Res., 20 (2019), 1997–2017.
[13]
T. Elsken, J.-H. Metzen, F. Hutter, Simple and efficient architecture search for convolutional neural networks, 2017, arXiv: 1711.04528.
[14]
G. Franchini, M. Galinier, M. Verucchi, Mise en abyme with artificial intelligence: how to predict the accuracy of NN, applied to hyper-parameter tuning, In: INNSBDDL 2019: Recent advances in big data and deep learning, Cham: Springer, 2020, 286–295. http://dx.doi.org/10.1007/978-3-030-16841-4_30
[15]
D. E. Goldberg, Genetic algorithms in search, optimization, and machine learning, Addison Wesley Publishing Co. Inc., 1989.
[16]
T. Hospedales, A. Antoniou, P. Micaelli, A. Storkey, Meta-learning in neural networks: a survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, in press. http://dx.doi.org/10.1109/TPAMI.2021.3079209
F. Hutter, H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, In: LION 2011: Learning and Intelligent Optimization, Berlin, Heidelberg: Springer, 2011, 507–523. http://dx.doi.org/10.1007/978-3-642-25566-3_40
[19]
D. P. Kingma, J. Ba, Adam: a method for stochastic optimization, 2017, arXiv: 1412.6980.
[20]
N. Loizou, P. Richtarik, Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods, Comput. Optim. Appl., 77 (2020), 653–710. http://dx.doi.org/10.1007/s10589-020-00220-z doi: 10.1007/s10589-020-00220-z
[21]
J. Mockus, V. Tiesis, A. Zilinskas, The application of Bayesian methods for seeking the extremum, In: Towards global optimisation, North-Holland, 2012, 117–129.
[22]
B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proc. IEEE, 104 (2016), 148–175. http://dx.doi.org/10.1109/JPROC.2015.2494218 doi: 10.1109/JPROC.2015.2494218
C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, F. Hutter, NAS-Bench-101: Towards reproducible neural architecture search, In: Proceedings of the 36th International Conference on Machine Learning, 2019, 7105–7114.
[25]
Z. Zhong, J. Yan, W. Wei, J. Shao, C.-L. Liu, Practical block-wise neural network architecture generation, In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, 2423–2432. http://dx.doi.org/10.1109/CVPR.2018.00257
[26]
B. Zoph, Q. V. Le, Neural architecture search with reinforcement learning, 2017, arXiv: 1611.01578.
Figure 2. Some samples from the MNIST database, without and with stripes
Figure 3. A possible CNN scheme
Figure 4. The very bad case
Figure 5. The intermediate case
Figure 6. The good case
Figure 7. One of the reconstructed images obtained from the CNN related to the best hyperparameter configuration
Figure 8. Predicted and real accuracy values obtained by the SVR, RF and HYBRID methodologies for the 23 CNNs used as unseen data
Figure 9. Random search: the red line denotes the accuracy behavior obtained from the final performance of the current CNN as in [24], while the blue, green and black lines show the accuracy behavior with respect to the training time when the SVR, RF or HYBRID methodology is used as predictor of the final performance of the CNN
Figure 10. Evolution search (regularized): the red line denotes the accuracy behavior obtained from the final performance of the current CNN as in [24], while the blue, green and black lines show the accuracy behavior with respect to the training time when the SVR, RF or HYBRID methodology is used as predictor of the final performance of the CNN