Research article Special Issues

Integrative approach for classifying male tumors based on DNA methylation 450K data


  • Received: 15 August 2023; Revised: 25 September 2023; Accepted: 26 September 2023; Published: 13 October 2023
  • Malignancies such as bladder urothelial carcinoma, colon adenocarcinoma, liver hepatocellular carcinoma, lung adenocarcinoma and prostate adenocarcinoma significantly impact men's well-being. Accurate cancer classification is vital for determining treatment strategies and improving patient prognosis. This study introduces a method that uses gene selection from high-dimensional datasets to improve the performance of male tumor classification. Using DNA methylation 450K data obtained from The Cancer Genome Atlas (TCGA) database, the method assesses how reliably DNA methylation data distinguish the five most prevalent types of male cancer from normal tissues. First, the chi-square test is used for dimensionality reduction; second, L1-penalized logistic regression is used for feature selection. A stacking ensemble learning technique is then used to integrate seven common multiclass classification models. Experimental results show that the stacked ensemble outperformed every individual base classifier and achieved an overall accuracy (ACC) of 99.2% on independent test data. The approach may also offer new ideas and pathways for early disease detection and treatment.
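
The three stages named in the abstract (chi-square filtering, L1-penalized logistic regression, stacking) can be sketched with scikit-learn as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic beta-distributed values standing in for methylation beta values, the sample and probe counts are hypothetical, and only three base learners (random forest, SVM, k-nearest neighbors) are stacked because the abstract does not name the seven models used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

SEED = 0
N_CLASSES = 6      # five tumor types plus normal tissue
N_PER_CLASS = 40   # hypothetical sample count per class
N_FEATURES = 500   # hypothetical probe count (far fewer than 450K)

# Synthetic stand-in for methylation beta values in [0, 1]; NOT TCGA data.
rng = np.random.default_rng(SEED)
X = rng.beta(2.0, 2.0, size=(N_CLASSES * N_PER_CLASS, N_FEATURES))
y = np.repeat(np.arange(N_CLASSES), N_PER_CLASS)
for label in range(N_CLASSES):
    # Give each class its own block of 10 probes with raised values.
    cols = slice(10 * label, 10 * (label + 1))
    X[y == label, cols] = rng.beta(8.0, 2.0, size=(N_PER_CLASS, 10))

# Stage 3: stacking ensemble. The base learners here are assumptions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=SEED)),
        ("svm", SVC()),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold predictions feed the meta-learner
)

model = Pipeline([
    # Stage 1: chi-square filter (requires non-negative features).
    ("chi2", SelectKBest(chi2, k=100)),
    # Stage 2: keep features with non-zero L1 logistic coefficients.
    ("l1", SelectFromModel(
        LogisticRegression(penalty="l1", solver="saga", C=1.0,
                           max_iter=5000, random_state=SEED))),
    ("stack", stack),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_selected = int(model.named_steps["l1"].get_support().sum())
print(f"features kept after L1 selection: {n_selected}")
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Wrapping both selection steps and the classifier in one `Pipeline` means feature selection is fitted on the training split only, so the held-out accuracy is not inflated by information leaking from the test samples. The accuracy printed here reflects the easy synthetic data and says nothing about the 99.2% reported in the study.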

    Citation: Ji-Ming Wu, Wang-Ren Qiu, Zi Liu, Zhao-Chun Xu, Shou-Hua Zhang. Integrative approach for classifying male tumors based on DNA methylation 450K data. Mathematical Biosciences and Engineering, 2023, 20(11): 19133-19151. doi: 10.3934/mbe.2023845






  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)


