Research article

Graph-based feature extraction for drug classification using machine learning

  • Published: 30 April 2026
  • Drug classification is a key task in medical decision-making, and it can be supported by selecting drugs that fit a patient's attributes and medical history. Current state-of-the-art solutions tend to rely on attribute-based representations and may overlook relational structure among the data points. This paper examines a machine learning approach to multiclass drug classification, where the target variable is the prescribed drug type, using a graph-based feature extraction method. The proposed method is applied together with thorough data preprocessing, including handling class imbalance and removing outliers. A 10-fold cross-validation strategy is used to obtain a genuine and impartial evaluation. In each training set, a similarity graph is created, and graph-based attributes such as degree centrality measures and the clustering coefficient are extracted using three different distance measures to capture relationships between samples. For the testing data, graph attributes are computed using a weighted k-nearest neighbor strategy, which prevents information leakage. The graph attributes, together with the other attributes, are used to train different machine learning classifiers. The results show that the distance measure plays a major role in the performance of such classifications.
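
    The graph feature step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Euclidean distance, the symmetric k-NN graph construction, the inverse-distance weights, and all function names (`knn_graph`, `graph_features`, `knn_transfer_features`) are assumptions made for illustration. The abstract does not name the three distance measures or the value of k.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def knn_graph(X_train, k):
    """Symmetric, unweighted k-NN adjacency matrix built from training rows only.
    (Assumed construction; the paper may build its similarity graph differently.)"""
    D = pairwise_euclidean(X_train, X_train)
    np.fill_diagonal(D, np.inf)            # exclude self-loops
    nbrs = np.argsort(D, axis=1)[:, :k]    # k nearest neighbours per node
    n = X_train.shape[0]
    A = np.zeros((n, n), dtype=int)
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1
    return np.maximum(A, A.T)              # edge if either node picks the other

def graph_features(A):
    """Degree centrality and local clustering coefficient for every node."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    degree_centrality = deg / (n - 1)
    triangles = np.diag(A @ A @ A) / 2     # triangles passing through each node
    possible = deg * (deg - 1) / 2         # neighbour pairs that could be linked
    clustering = np.divide(triangles, possible,
                           out=np.zeros(n), where=possible > 0)
    return np.column_stack([degree_centrality, clustering])

def knn_transfer_features(X_train, F_train, X_test, k, eps=1e-12):
    """Graph features for test rows as the inverse-distance-weighted mean of the
    features of their k nearest TRAINING nodes (assumed weighting scheme).
    Test rows never enter the graph, so no test information leaks into training."""
    D = pairwise_euclidean(X_test, X_train)
    nbrs = np.argsort(D, axis=1)[:, :k]
    d = np.take_along_axis(D, nbrs, axis=1)
    w = 1.0 / (d + eps)
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * F_train[nbrs]).sum(axis=1)
```

    Within each cross-validation fold, one would call `knn_graph` and `graph_features` on the training rows only, then `knn_transfer_features` for the held-out rows, and append the resulting columns to the original attributes before fitting a classifier.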

    Citation: Sava Mohammed Hamazyad, Sadegh Sulaimany, Sarbaz H.A. Khoshnaw. Graph-based feature extraction for drug classification using machine learning[J]. AIMS Bioengineering, 2026, 13(2): 209-238. doi: 10.3934/bioeng.2026009


    Acknowledgments



    We would like to express our gratitude to our respective universities for the outstanding facilities and resources they provide. This support is greatly appreciated.

    Conflict of interest



    The authors declare no conflict of interest.

  • © 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)