Predicting factors and top gene identification for survival data of breast cancer

Sarada Ghosh; Guruprasad Samanta; Manuel De la Sen

doi:10.3934/biophy.2023006

AIMS Biophysics

2023, Volume 10, Issue 1: 67-89. doi: 10.3934/biophy.2023006

Research article

1. Department of Statistics, Gurudas College, Phool Bagan, Kolkata-700054, India
2. Department of Mathematics, Indian Institute of Engineering Science and Technology, Shibpur, Howrah-711103, India
3. Institute of Research and Development of Processes, University of the Basque Country, 48940 Leioa, Bizkaia, Spain

Received: 26 September 2022; Revised: 02 February 2023; Accepted: 08 February 2023; Published: 17 February 2023

For high-throughput research on biological data sets generated sequentially or by transcriptional microarrays, proteomics or other means, analytic techniques that address their high-dimensional aspects remain desirable. The computational part of this work predicts the tendency towards mortality due to breast cancer (BC) using several classification methods, namely Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA) and Decision Tree (DT), and compares the models' performances. We proceed with the RF method since it achieves higher accuracy than the other models considered. We also present some traditional and competing risk models, illustrate them with real data analysis, describe the behaviour of their curves, and compare their fits using prediction error curves and the concordance index. Furthermore, two different survival splitting rules are applied in separate Random Survival Forest (RSF) models, which are also used to rank the risk factors for breast cancer. The results show that high-level grade and diameter are the most important predictors of mortality progression in the presence of competing events of death, and that lymph nodes, age and angiography are other vital criteria for this purpose. We have also implemented RSF backward selection, which enables selection of the top genes related to mortality progression due to breast cancer. This method identifies c-MYB, CDCA7, NUSAP1, BIRC5, ANGPTL4, JAG1 and IL6ST, among other genes, as mainly responsible for mortality progression due to breast cancer. In this work, R software is used to obtain and evaluate the results.
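As an illustration of one of the comparison measures named above, the concordance index can be sketched in a few lines. This is a minimal sketch of Harrell's C for right-censored data, written in Python for self-containment rather than in the R software used in the paper. The function name `concordance_index`, the toy data and the choice to skip pairs with tied observed times are assumptions made here for illustration; none of them is taken from the authors' analysis.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (illustrative O(n^2) version)."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Simplifying assumption: a pair is comparable only if the two
        # times differ and the shorter one is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher predicted risk failed earlier
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data, made up for illustration (not from the NKI data set).
times = [5, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0]            # 1 = death observed, 0 = censored
risk = [0.9, 0.7, 0.4, 0.8, 0.1]    # hypothetical model risk scores
c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of comparable pairs. In the toy data, seven of the eight comparable pairs are ordered correctly.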
Keywords: breast cancer; random forest; accuracy; Brier score; minimal depth; variable importance
Citation: Sarada Ghosh, Guruprasad Samanta, Manuel De la Sen. Predicting factors and top gene identification for survival data of breast cancer[J]. AIMS Biophysics, 2023, 10(1): 67-89. doi: 10.3934/biophy.2023006

Acknowledgments

The authors are grateful to the learned reviewers and Editors for their careful reading, valuable comments, and helpful suggestions, which have helped them to improve the presentation of this work significantly. They are also thankful to Nirapada Santra (SRF), Bijoy Kumar Das (SRF) and Protyusha Dutta (JRF) of IIEST, Shibpur, for their help during the preparation of this manuscript. The third author (Manuel De la Sen) is grateful to the Spanish Government for its support through grant RTI2018-094336-B-I00 (MCIU/AEI/FEDER, UE) and to the Basque Government for its support through grant IT1555-22.

Data availability statement

The data used to support the findings of this study are included in the references within the article.

Conflict of interest

The authors declare that they have no conflict of interest regarding this work.

References

[1]	Nicolau M, Levine AJ, Carlsson G (2011) Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences of the United States of America 108: 7265-7270. https://doi.org/10.1073/pnas.1102826108
[2]	Baur B, Bozdag S (2016) A feature selection algorithm to compute gene centric methylation from probe level methylation data. PLoS One 11: e0148977. https://doi.org/10.1371/journal.pone.0148977
[3]	Trop I, Dugas A, David J, et al. (2011) Breast abscesses: evidence-based algorithms for diagnosis, management, and follow-up. Radiographics 31: 1683-1699. https://doi.org/10.1148/rg.316115521
[4]	NKI Breast Cancer Data, Data World (2016). Available from: https://data.world/deviramanan2016/nki-breast-cancer-data.
[5]	Ghosh S, Samanta GP (2019) Statistical modeling for cancer mortality. Lett Biomath. https://doi.org/10.1080/23737867.2019.1581104
[6]	Breiman L (2001) Random forests. Mach Learn 45: 5-32. https://doi.org/10.1023/A:1010933404324
[7]	Livingston F (2005) Implementation of Breiman's random forest machine learning algorithm. ECE591Q Mach Learn J Pap: 1-13.
[8]	Cox DR (1972) Regression models and life-tables. J Roy Stat Soc: Ser B (Meth) 34: 187-202. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x
[9]	Ishwaran H, Gerds TA, Kogalur UB, et al. (2014) Random survival forests for competing risks. Biostatistics 15: 757-773. https://doi.org/10.1093/biostatistics/kxu010
[10]	Ishwaran H, Kogalur UB, Blackstone EH, et al. (2008) Random survival forests. Ann Appl Stat 2: 841-860. https://doi.org/10.1214/08-AOAS169
[11]	Ishwaran H, Kogalur UB, Gorodeski EZ, et al. (2010) High-dimensional variable selection for survival data. J Am Stat Assoc 105: 205-217. https://doi.org/10.1198/jasa.2009.tm08622
[12]	Borgan Ø (2005) Nelson–Aalen Estimator. Encyclopedia of Biostatistics. https://doi.org/10.1002/0470011815.b2a11054
[13]	Mogensen UB, Ishwaran H, Gerds TA (2012) Evaluating random forests for survival analysis using prediction error curves. J Stat Softw 50: 1-20. https://doi.org/10.18637/jss.v050.i11
[14]	Strobl C, Boulesteix AL, Zeileis A, et al. (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8: 25. https://doi.org/10.1186/1471-2105-8-25
[15]	Ishwaran H, Kogalur UB, Chen X, et al. (2010) Random survival forests for high-dimensional data. Stat Anal Data Min ASA Data Sci J 4: 115-132. https://doi.org/10.1002/sam.10103
[16]	Calle ML, Urrea V, Boulesteix AL, et al. (2011) AUC-RF: a new strategy for genomic profiling with random forest. Hum Hered 72: 121-132. https://doi.org/10.1159/000330778
[17]	Steyerberg EW, Vickers AJ, Cook NR, et al. (2010) Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 21: 128-138. https://doi.org/10.1097/EDE.0b013e3181c30fb2
[18]	Diaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7: 3. https://doi.org/10.1186/1471-2105-7-3
[19]	Dietrich S, Floegel A, Troll M, et al. (2016) Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int J Epidemiol 45: 1406-1420. https://doi.org/10.1093/ije/dyw145
[20]	Ye L, Li F, Song Y, et al. (2018) Overexpression of CDCA7 predicts poor prognosis and induces EZH2-mediated progression of triple-negative breast cancer. Int J Cancer 143: 2602-2613. https://doi.org/10.1002/ijc.31766
[21]	Chen L, Yang L, Qiao F, et al. (2015) High levels of nucleolar spindle-associated protein and reduced levels of BRCA1 expression predict poor prognosis in triple-negative breast cancer. PLoS One 10: e0140572. https://doi.org/10.1371/journal.pone.0140572
[22]	Kuchenbaecker KB, Hopper JL, Barnes DR, et al. (2017) Risks of breast, ovarian, and contralateral breast cancer for BRCA1 and BRCA2 mutation carriers. JAMA 317: 2402-2416. https://doi.org/10.1001/jama.2017.7112
[23]	Sušac I, Ozretić P, Gregorić M, et al. (2019) Polymorphisms in Survivin (BIRC5 Gene) are associated with age of onset in breast cancer patients. J Oncol 2019: 3483192. https://doi.org/10.1155/2019/3483192
[24]	Cai YC, Yang H, Wang KF, et al. (2020) ANGPTL4 overexpression inhibits tumor cell adhesion and migration and predicts favorable prognosis of triple-negative breast cancer. BMC Cancer 20: 878. https://doi.org/10.1186/s12885-020-07343-w
[25]	Wang J, Xu B (2019) Targeted therapeutic options and future perspectives for HER2-positive breast cancer. Signal Transduct Tar 4: 34. https://doi.org/10.1038/s41392-019-0069-2
[26]	Wilson BJ, Giguère V (2008) Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway. Mol Cancer 7: 49. https://doi.org/10.1186/1476-4598-7-49
[27]	Jiang J, Wang J, He X, et al. (2019) High expression of SPAG5 sustains the malignant growth and invasion of breast cancer cells through the activation of Wnt/β-catenin signalling. Clin Exp Pharmacol Physiol 46: 597-606. https://doi.org/10.1111/1440-1681.13082
[28]	Weng TY, Wang CY, Hung YH, et al. (2016) Differential expression pattern of THBS1 and THBS2 in lung cancer: clinical outcome and a systematic-analysis of microarray databases. PLoS One 11: e0161007. https://doi.org/10.1371/journal.pone.0161007
[29]	Weagel EG, Burrup W, Kovtun R, et al. (2018) Membrane expression of thymidine kinase 1 and potential clinical relevance in lung, breast, and colorectal malignancies. Cancer Cell Int 18: 135. https://doi.org/10.1186/s12935-018-0633-9
[30]	Ahsan H, Halpern J, Kibriya MG, et al. (2014) A genome-wide association study of early-onset breast cancer identifies PFKM as a novel breast cancer gene and supports a common genetic spectrum for breast cancer at any age. Cancer Epidem Biomar 23: 658-669. https://doi.org/10.1158/1055-9965.EPI-13-0340
[31]	Zancan P, Sola-Penna M, Furtado CM, et al. (2010) Differential expression of phosphofructokinase-1 isoforms correlates with the glycolytic efficiency of breast cancer cells. Mol Genet Metab 100: 372-378. https://doi.org/10.1016/j.ymgme.2010.04.006
[32]	Smerc A, Sodja E, Legisa M (2011) Posttranslational modification of 6-phosphofructo-1-kinase as an important feature of cancer metabolism. PLoS One 6: e19645. https://doi.org/10.1371/journal.pone.0019645
[33]	Danilova N, Kumagai A, Lin J (2010) p53 upregulation is a frequent response to deficiency of cell-essential genes. PLoS One 5: e15938. https://doi.org/10.1371/journal.pone.0015938
[34]	Marangoni E, Laurent C, Coussy F, et al. (2018) Capecitabine efficacy is correlated with TYMP and RB1 expression in PDX established from triple-negative breast cancers. Clin Cancer Res 24: 2605-2615. https://doi.org/10.1158/1078-0432.CCR-17-3490
[35]	Wu J, Hicks C (2021) Breast cancer type classification using machine learning. J Pers Med 11: 61. https://doi.org/10.3390/jpm11020061
[36]	Lu Y, Han J (2003) Cancer classification using gene expression data. Inf Syst 28: 243-268. https://doi.org/10.1016/S0306-4379(02)00072-8
[37]	Guyon I, Weston J, Barnhill S, et al. (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46: 389-422. https://doi.org/10.1023/A:1012487302797
[38]	Hernandez Hernandez JC, Duval B, Hao JK (2007) A genetic embedded approach for gene selection and classification of microarray data. Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics: 5th European Conference. Springer Berlin Heidelberg: 90-101.
[39]	Dai B, Chen RC, Zhu SZ, et al. (2018) Using random forest algorithm for breast cancer diagnosis. 2018 International Symposium on Computer, Consumer and Control (IS3C). IEEE: 449-452. https://doi.org/10.1109/IS3C.2018.00119
[40]	Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, 36.

© 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)