One of the current challenges in applying machine learning is optimizing the constructed models to achieve the best possible performance. This article proposes a novel approach to determining which data preprocessing strategy is likely to be most beneficial for improving the evaluation fit metrics, based on studying the correlation between a dataset's meta-features and the performance response variables. Additionally, the meta-features were categorized by modification cost and by the degree of control they afford over the model's fit, in order to determine which strategies appeared most effective. We also studied whether these transformations could improve the results obtained under several training/testing proportions. The study was conducted on 42 datasets with different configurations, derived from the multiclass classification problem of IP address maliciousness. Three machine learning algorithms and tools were evaluated: Autosklearn, Gaussian Mixture Models, and Extreme Gradient Boosting; all of them have been studied on the same problem in previous works. We also applied five types of data transformations, including scaling, dimensionality reduction techniques, and quantile transformation. The results show that meta-feature correlation analysis significantly improves machine learning performance by guiding data preprocessing and transformation strategies, sometimes even surpassing the best original performance.
Citation: Noemí DeCastro-García, David Escudero García. Meta-feature-based data preprocessing for machine learning through correlation structures and cost-aware benchmarking: A case study in cybersecurity[J]. AIMS Mathematics, 2025, 10(7): 16762-16795. doi: 10.3934/math.2025753
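The core idea described in the abstract (correlate dataset meta-features with a performance response, then pick a transformation that targets the implicated meta-feature) can be sketched roughly as follows. This is a minimal illustration, not the paper's benchmark: the datasets and scores below are synthetic, and the handful of statistical meta-features is a stand-in for the richer sets used in meta-learning.

```python
import numpy as np
from scipy.stats import skew, kurtosis, spearmanr
from sklearn.preprocessing import QuantileTransformer

def meta_features(X):
    """A few simple statistical meta-features of a numeric dataset."""
    upper = np.triu_indices(X.shape[1], k=1)
    return {
        "n_samples": X.shape[0],
        "n_features": X.shape[1],
        "mean_abs_skew": float(np.mean(np.abs(skew(X, axis=0)))),
        "mean_kurtosis": float(np.mean(kurtosis(X, axis=0))),
        "mean_abs_feature_corr": float(
            np.mean(np.abs(np.corrcoef(X, rowvar=False)[upper]))),
    }

rng = np.random.default_rng(0)
# Hypothetical benchmark: dataset configurations of increasing skewness,
# each with a recorded performance score (illustrative values standing in
# for a real response variable such as accuracy or MCC).
datasets = [rng.lognormal(sigma=s, size=(200, 5)) for s in (0.2, 0.5, 1.0, 1.5, 2.0)]
scores = [0.9, 0.8, 0.6, 0.5, 0.3]

# Step 1: correlate one meta-feature with the performance response.
skews = [meta_features(X)["mean_abs_skew"] for X in datasets]
rho, _ = spearmanr(skews, scores)
print(f"Spearman rho(mean_abs_skew, score) = {rho:.2f}")

# Step 2: a strong negative correlation points at a skew-reducing
# transformation, e.g. a quantile transform to a normal distribution.
Xt = QuantileTransformer(output_distribution="normal",
                         n_quantiles=100).fit_transform(datasets[-1])
print(f"mean |skew| before: {skews[-1]:.2f}, "
      f"after: {np.mean(np.abs(skew(Xt, axis=0))):.2f}")
```

In the article's setting this loop would run over the 42 dataset configurations, with real fit metrics as the response and the candidate transformations ranked by how strongly they act on the correlated meta-features.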