Feature selection (FS) in large data sets is a critical aspect of machine learning that involves choosing the most relevant features. It plays a significant role in improving model performance, reducing overfitting, and enhancing interpretability. In this paper, we construct an automatic modification of basic FS techniques such as Lasso, deep neural networks (DNN), random forests (RF), and principal component analysis (PCA): the selection criterion is set by K-means clustering together with the silhouette score, rather than by visualization or by thresholds chosen from background knowledge. Additionally, two hybrid methods are proposed to exploit the advantages of several feature selection methods: the first is a score method that combines multiple types of methods, and the second is a refinement method that enhances the outcome of one method by adapting it to another. Moreover, to evaluate the efficiency of each FS method while minimizing dependence on the choice of regression model, both a linear regression and a nonlinear DNN regression are employed. Through numerical tests, we show that the automatic modification of the conventional methods provides a convenient way to set the selection criterion. Furthermore, based on the results derived from both the linear and DNN regressions, the hybrid FS techniques perform more accurately in both linear and nonlinear regression, without dependence on the data.
Citation: Sunyoung Bu, Inmi Kim. Hybrid feature selection techniques using automatic modification[J]. AIMS Mathematics, 2026, 11(2): 5152-5171. doi: 10.3934/math.2026210
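For concreteness, the sketch below illustrates one plausible reading of the automatic criterion described in the abstract: feature importance scores produced by a base FS method are clustered with K-means, the number of clusters is chosen by the silhouette score, and the highest-scoring cluster of features is retained in place of a hand-tuned threshold or visual inspection. This is a minimal sketch assuming scikit-learn and a Random Forest as the base scorer; the function name `auto_select_features` and all parameter choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an automatic selection criterion: cluster feature
# importance scores with K-means, choose k by silhouette score, and keep
# the cluster with the highest mean importance. Illustrative only; the
# paper's Lasso/DNN/PCA variants would swap in different base scores.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import silhouette_score

def auto_select_features(X, y, max_clusters=6, random_state=0):
    # 1. Score every feature with a base FS method (Random Forest here;
    #    Lasso coefficient magnitudes or PCA loadings could be substituted).
    rf = RandomForestRegressor(n_estimators=200, random_state=random_state)
    rf.fit(X, y)
    scores = rf.feature_importances_.reshape(-1, 1)

    # 2. Choose the number of clusters that maximizes the silhouette score,
    #    replacing a background-knowledge threshold or visual inspection.
    best_k, best_sil, best_labels = 2, -1.0, None
    for k in range(2, max_clusters + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(scores)
        sil = silhouette_score(scores, labels)
        if sil > best_sil:
            best_k, best_sil, best_labels = k, sil, labels

    # 3. Keep the cluster with the largest mean importance.
    means = [scores[best_labels == c].mean() for c in range(best_k)]
    top = int(np.argmax(means))
    return np.where(best_labels == top)[0]  # indices of selected features

# Example usage on synthetic data:
# X = np.random.rand(500, 20); y = X[:, 0] + 2 * X[:, 3]
# print(auto_select_features(X, y))
```

The same clustering step could, in principle, be applied to the outputs of any of the base methods named in the abstract, which is what makes a score-combining or refinement-style hybrid straightforward to assemble on top of it.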