
Citation: Ying Sun, Wei Du, Lili Yang, Min Dai, Ziying Dou, Yuxiang Wang, Jining Liu, Gang Zheng. Computational methods for recognition of cancer protein markers in saliva[J]. Mathematical Biosciences and Engineering, 2020, 17(3): 2453-2469. doi: 10.3934/mbe.2020134
Tumor purity is defined as the proportion of cancer cells in a tumor tissue [1,2]. As a confounding factor, tumor purity has been recognized to have a significant impact on a variety of high-throughput data analyses based on gene expression or DNA methylation data. Estimating tumor purity in the admixture of cells constituting the tumor microenvironment is therefore an important step in cancer genomic or epigenetic analyses. Inaccurate estimation of tumor purity would jeopardize downstream analyses such as clustering, association studies and differential analysis between tumor samples [3]. Traditionally, tumor purity has been estimated by pathologists using experimental methods such as immunohistochemistry (IHC). In recent years, a number of computational methods have been developed to estimate tumor purity from different types of genomic data such as DNA copy numbers [4], gene expression [5] and DNA methylation [1,6,7,8]. These methods have been comprehensively reviewed and compared [9,10]. Among these data types, DNA methylation is deemed one of the most suitable for purity estimation for the following reasons: 1) DNA methylation is a long-term and more stable biomarker than gene expression for detecting cancers [11]; 2) nearby CpG sites are highly co-methylated under the mechanism of methyltransferase enzymes, which potentially reduces random noise and increases inference accuracy; 3) each CpG site in an individual cell is either methylated (methylation level = 1) or unmethylated (methylation level = 0), so the methylation ratio of a tumor tissue intrinsically reflects the proportions of certain cell types. Under these considerations, we previously proposed InfiniumPurify, a tool for estimating the purities of tumor samples based on Illumina Infinium 450 k methylation microarray data [7,12]. It estimates tumor purity by exploiting the beta value distribution of the most differentially methylated sites (informative differentially methylated CpG sites, iDMCs). Along this line, PAMES updated the selection of iDMCs by taking advantage of highly clonal cancer-specific CpG sites [6]. MEpurity uses a beta mixture model to estimate tumor purity from tumor methylation data alone [8].
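The third reason above can be made concrete with a small sketch. It assumes, for illustration only, a two-component linear mixture in which the bulk beta value at a CpG site is the purity-weighted average of a pure-cancer value and a normal value; the function names are hypothetical and are not part of any of the tools discussed here.

```python
def mixture_beta(purity, cancer_beta, normal_beta):
    """Expected bulk beta value at one CpG site under a linear two-component mixture.

    Assumption (not taken from the source): the tumor tissue contains only
    cancer cells (fraction `purity`) and normal cells (fraction 1 - purity).
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    return purity * cancer_beta + (1.0 - purity) * normal_beta


def purity_from_site(tumor_beta, cancer_beta, normal_beta):
    """Invert the mixture at a single site where cancer and normal values differ."""
    if cancer_beta == normal_beta:
        raise ValueError("site is uninformative: cancer and normal betas are equal")
    return (tumor_beta - normal_beta) / (cancer_beta - normal_beta)


# A site fully methylated in cancer cells and unmethylated in normal cells:
# the bulk beta value then equals the purity itself.
bulk = mixture_beta(0.7, cancer_beta=1.0, normal_beta=0.0)
recovered = purity_from_site(bulk, cancer_beta=1.0, normal_beta=0.0)
```

In practice a single site is noisy, which is why the methods above pool information across many informative sites rather than inverting one site at a time.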
In spite of the advances in this field, it is still technically challenging for biological and clinical researchers to take advantage of these methodological developments. A main reason is that the aforementioned tools were developed on different platforms: for example, InfiniumPurify [12] and PAMES are R packages, while MEpurity was written in C++. Using these tools can be a daunting task for many researchers because they require non-trivial computational skills. There is thus an urgent need for a set of more accessible and intuitive tools. In this work, we develop Purimeth, an interactive and user-friendly web-based tool for analyzing cancer DNA methylation data, which can perform purity estimation and downstream data analyses accounting for tumor purity in a few clicks. After users upload beta value matrices of both tumor and normal samples, the purity of the tumor samples can be obtained with three state-of-the-art tools for purity estimation from DNA methylation array data, i.e., InfiniumPurify, MEpurity and PAMES. Based on the purity estimates, users can perform a series of DNA methylation analyses accounting for tumor purity, including differential methylation analysis, clustering of tumor samples into subtypes and purification of tumor methylomes. Users can also download and explore the purities of 9364 tumor samples from The Cancer Genome Atlas (TCGA) estimated by nine methods: ESTIMATE [5], ABSOLUTE [4], LUMP, IHC, CPE [1], InfiniumPurify [12], PAMES [6], MEpurity [8] and Consensus (see the Results section for details).
The Purimeth web server consists of five major modules: GetPurity, Differential methylation (DM), Clustering, Purification and TCGA purity exploration. The general workflow of a typical data analysis in Purimeth is illustrated in Figure 1. First, Purimeth requires a matrix of methylation levels (in beta values) for tumor samples and, optionally, either a matrix of methylation levels for normal samples or the cancer type as input (Figure 1A). Users can then estimate tumor purity using one of three methods, i.e., InfiniumPurify, MEpurity or PAMES (Figure 1B). Finally, with the estimated purities, users can perform a variety of downstream data analyses, including differential methylation (DM) analysis, clustering and purification of tumor methylomes (Figure 1C).
Purimeth allows users to upload methylation profiles of tumor and normal samples from either the 450 k or the 850 k array as a .txt, .zip or .gz file, in which rows and columns represent CpG sites and samples, respectively. In some modules, users also need to upload a purity file (in the same format) for the tumor samples. For convenience, we provide on the website example data files for breast tumor samples and matched normal samples from TCGA, together with the corresponding tumor purity file.
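Before uploading, it can help to check a file locally. The following is a minimal sketch of such a check, under assumptions that the text above does not state: the file is tab-delimited, the first row holds sample names, the first column holds CpG identifiers, and missing values are written as `NA` or left empty. The function names are hypothetical, and .zip archives are not handled.

```python
import csv
import gzip
import math

MISSING = {"", "NA", "NaN"}  # assumed missing-value tokens


def parse_beta_matrix(lines, delimiter="\t"):
    """Parse an iterable of text lines into (sample_names, {cpg_id: [beta, ...]})."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # first header cell labels the CpG column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        cpg, cells = row[0], row[1:]
        if len(cells) != len(samples):
            raise ValueError(f"{cpg}: expected {len(samples)} values, got {len(cells)}")
        betas = [math.nan if c.strip() in MISSING else float(c) for c in cells]
        for beta in betas:
            # Beta values are methylation ratios, so they must lie in [0, 1].
            if not math.isnan(beta) and not 0.0 <= beta <= 1.0:
                raise ValueError(f"{cpg}: beta value {beta} is outside [0, 1]")
        matrix[cpg] = betas
    return samples, matrix


def read_beta_matrix(path):
    """Read a plain-text or gzip-compressed beta value matrix from disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return parse_beta_matrix(handle)
```

Calling `read_beta_matrix("tumor.txt")` (a placeholder file name) returns the sample names and a dictionary keyed by CpG identifier, so a mismatch between the columns of the tumor matrix and the rows of the purity file can be caught before upload.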
Increasing attention has been devoted to the relationship between tumor purity and various analyses of tumor samples. To obtain the purity of tumor samples quickly and reliably, the "GetPurity" module allows users to estimate purities with InfiniumPurify, MEpurity or PAMES on the same page, according to the method selected. InfiniumPurify estimates purity from the probability density of methylation levels of iDMCs identified from cancer-normal comparisons. MEpurity estimates tumor purity from tumor-only Illumina Infinium 450 k methylation microarray data using a beta mixture model-based algorithm. PAMES uses the methylation levels of a few dozen highly clonal, tumor type-specific CpG sites to estimate the purity of tumor samples, and works only for 450 k array data in its current edition. For MEpurity, users only need to upload a DNA methylation matrix (where rows are CpG sites and columns are samples). For InfiniumPurify and PAMES, besides the beta value matrix of tumor samples, users should provide either the cancer type, which can be specified with the 'Cancer Type' select button, or normal sample data. In addition, the cancer type should be specified for InfiniumPurify if the uploaded normal samples are insufficient for iDMC identification (fewer than 20). Once a file is uploaded, the first six rows of the data are automatically shown so that the user can confirm that the file is correct (the same holds for the other modules). With a click on the "Run" button, a table of the estimated purities of tumor samples is displayed in the Result panel (Figure 2A), and a barplot visualizing the estimated purities is shown in the Plot panel (Figure 2B). As a reference for users, we tested the running time of the three methods on a typical example data set of 20 tumor samples; the results are shown in Figure 2C. When only tumor samples are input, MEpurity takes more than 10 seconds to return the result, while InfiniumPurify and PAMES each take less than 1 second. When both tumor and matched normal samples are input, MEpurity does not work, while InfiniumPurify still runs faster than PAMES.
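To illustrate what estimating purity "from the probability density of methylation levels of iDMCs" can look like, here is a simplified sketch. It is not the InfiniumPurify implementation: it assumes that hypomethylated iDMCs are flipped to 1 - beta so that all sites point in the same direction, and that the purity is read off as the mode of a Gaussian kernel density evaluated on a grid. The bandwidth, the grid size and the function name are illustrative choices.

```python
import math


def estimate_purity(betas, hyper_flags, bandwidth=0.05, grid_size=101):
    """Mode of a kernel density over direction-aligned iDMC beta values.

    betas       -- beta values of one tumor sample at the informative sites
    hyper_flags -- True where the site is hypermethylated in cancer,
                   False where it is hypomethylated
    """
    if len(betas) != len(hyper_flags) or not betas:
        raise ValueError("betas and hyper_flags must be non-empty and of equal length")
    # Align hypomethylated sites with hypermethylated ones (assumed transformation).
    aligned = [b if hyper else 1.0 - b for b, hyper in zip(betas, hyper_flags)]

    def density(x):
        # Unnormalized Gaussian kernel density; the constant factor does not move the mode.
        return sum(math.exp(-0.5 * ((x - a) / bandwidth) ** 2) for a in aligned)

    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return max(grid, key=density)


# Three hypermethylated and three hypomethylated sites from a sample of about 70% purity.
sample = [0.68, 0.72, 0.70, 0.30, 0.31, 0.29]
flags = [True, True, True, False, False, False]
purity = estimate_purity(sample, flags)
```

Using the mode rather than the mean makes the estimate less sensitive to a minority of sites that do not behave as expected in a given sample.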
Differential methylation (DM) analysis between tumor and normal samples, or between two groups of tumor samples showing different phenotypes, is a central task in cancer epigenomics research. The differentially methylated CpG sites (DMCs) or regions (DMRs) could potentially serve as diagnostic biomarkers or therapeutic targets [13,14,15,16]. In Purimeth, the DM module contains two submodules, "Tumor vs Normal" and "Tumor1 vs Tumor2", which allow users to infer differentially methylated CpG sites while accounting for tumor purity. The two submodules correspond to our two previous works [7,17], both of which use a generalized linear regression model and a Wald test to call DM sites. For "Tumor vs Normal", users need to input beta value matrices of tumor and normal samples, as well as a tumor purity file for the tumor samples, which can be obtained from the "GetPurity" module. For "Tumor1 vs Tumor2", beta value matrices and tumor purity files (obtained from the first module) are required for both subtypes of tumor samples. Purimeth returns a list of differentially methylated CpG sites sorted by their q-values (Figure 3A). It also provides a heat map of the top N differentially methylated CpG sites, where N can be set by users (Figure 3B), and a scatter plot illustrating the log2 fold change of the average corrected methylation level between the two sample groups for CpG sites (Figure 3C).
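The idea of testing for differential methylation while accounting for purity can be sketched for the tumor-versus-normal case as follows. This is a simplified stand-in, not the model of the cited works: it assumes ordinary least squares on beta values with purity as the only covariate (normal samples are given purity 0), so that the slope estimates the cancer-minus-normal difference, and it uses a normal approximation for the Wald p-value. The function name is hypothetical.

```python
import math


def wald_test_purity_slope(betas, purities):
    """Fit beta = intercept + slope * purity and test slope = 0 with a Wald statistic.

    Returns (slope, z, p_value). Under the assumed mixture model, the intercept
    is the normal methylation level and the slope is the difference between
    pure cancer cells and normal cells at this CpG site.
    """
    n = len(betas)
    if n != len(purities) or n < 3:
        raise ValueError("need at least three samples with matching purities")
    mean_x = sum(purities) / n
    mean_y = sum(betas) / n
    sxx = sum((x - mean_x) ** 2 for x in purities)
    if sxx == 0.0:
        raise ValueError("purities must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(purities, betas))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(purities, betas))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    if se == 0.0:
        z = 0.0 if slope == 0.0 else math.copysign(math.inf, slope)
    else:
        z = slope / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided, normal approximation
    return slope, z, p_value


# Three normal samples (purity 0) followed by four tumor samples.
purities = [0.0, 0.0, 0.0, 0.5, 0.6, 0.8, 0.9]
dm_site = [0.21, 0.19, 0.20, 0.50, 0.57, 0.68, 0.73]
null_site = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.50]
dm_result = wald_test_purity_slope(dm_site, purities)
null_result = wald_test_purity_slope(null_site, purities)
```

Running such a test at every CpG site yields one p-value per site; these would then be adjusted for multiple testing to obtain q-values of the kind reported in the result table.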
The identification of tumor subtypes is of great significance for the early diagnosis and clinical treatment of cancer. Given both the DNA methylation profiles of tumor samples and their tumor purities, the "Clustering" module allows users to cluster tumor samples into different subtypes. It models the subtype of a tumor mixture sample as a latent variable in a statistical model and solves the model with the Expectation-Maximization (EM) algorithm [18]. The input purity file can be obtained from the "GetPurity" module, from other purity estimation tools or from pathologists. The module exposes several adjustable parameters, including the number of clusters, the maximum number of iterations and the tolerance for convergence of the EM iterations. The clustering result shows the predicted subtype for each sample (Figure 4A) and visualizes all samples by plotting the first two principal components of the data (Figure 4B).
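The latent-variable formulation can be sketched as below. The details are assumptions made for illustration and are not taken from the cited method: each tumor sample is modeled as a purity-weighted mix of a subtype-specific cancer profile and a known average normal profile, with Gaussian noise of fixed standard deviation `sigma`. The three parameters named in the text (number of clusters, maximum number of iterations, convergence tolerance) appear as function arguments; all names are hypothetical.

```python
import math


def cluster_tumors(tumor, purity, normal_mean, n_clusters=2,
                   max_iter=100, tol=1e-6, sigma=0.05):
    """EM clustering of tumor samples with known purities.

    tumor       -- list of samples, each a list of beta values (one per CpG site)
    purity      -- purity of each sample, with 0 < purity <= 1
    normal_mean -- average normal beta value per CpG site
    Returns (labels, subtype_profiles).
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    n, m = len(tumor), len(normal_mean)

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # Purity-corrected profiles, used only to choose starting centers.
    corrected = [[(y - (1 - lam) * nm) / lam for y, nm in zip(row, normal_mean)]
                 for row, lam in zip(tumor, purity)]
    centers = [corrected[0][:]]
    while len(centers) < n_clusters:  # farthest-point initialization
        far = max(corrected, key=lambda row: min(sqdist(row, c) for c in centers))
        centers.append(far[:])
    weights = [1.0 / n_clusters] * n_clusters
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: posterior probability of each subtype for each sample.
        resp, ll = [], 0.0
        for row, lam in zip(tumor, purity):
            logp = []
            for k in range(n_clusters):
                expected = [lam * c + (1 - lam) * nm
                            for c, nm in zip(centers[k], normal_mean)]
                logp.append(math.log(weights[k]) - sqdist(row, expected) / (2 * sigma ** 2))
            top = max(logp)
            norm = top + math.log(sum(math.exp(v - top) for v in logp))
            resp.append([math.exp(v - norm) for v in logp])
            ll += norm  # log-likelihood up to an additive constant
        # M-step: weighted least-squares update of each subtype profile.
        for k in range(n_clusters):
            rk = [r[k] for r in resp]
            weights[k] = max(sum(rk) / n, 1e-12)
            denom = sum(r * lam ** 2 for r, lam in zip(rk, purity))
            if denom > 0:
                centers[k] = [
                    sum(r * lam * (row[j] - (1 - lam) * normal_mean[j])
                        for r, lam, row in zip(rk, purity, tumor)) / denom
                    for j in range(m)]
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return labels, centers
```

Because the expected value of each sample depends on its own purity, two samples of the same subtype can be assigned to the same cluster even when their raw beta values differ because of different normal cell content.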
Methylation profiles of pure cancer cells can hardly be obtained from real tumor tissues, which are always mixtures of normal and cancer cells. The "Purification" module therefore aims to infer the methylation profiles of pure cancer cells from tumor mixture samples, matched normal samples and tumor purities. This module implements a regression-based model to remove the normal cell signals and obtain pure cancer cell methylomes. After the data are uploaded and the 'Run' button is clicked, Purimeth reports the purified methylation profiles in a table (Figure 5A) and shows boxplots of tumor, normal and purified tumor data for 4 example CpG sites (Figure 5B). Users can also show barplots for any CpG sites of interest (sorted by the average methylation difference between tumor and purified tumor samples) by using the input box "Choose a CpG(s) site for plot".
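As a rough illustration of removing the normal cell signal, the sketch below subtracts the purity-weighted normal contribution from each tumor beta value and rescales the remainder. This per-sample inversion is a deliberately simplified substitute for the regression-based model mentioned above, and clipping the result to [0, 1] is an added assumption; the function name is hypothetical.

```python
def purify_profile(tumor_betas, normal_betas, purity):
    """Infer pure-cancer beta values for one sample from its bulk profile.

    tumor_betas  -- bulk tumor beta values, one per CpG site
    normal_betas -- matched normal beta values at the same sites
    purity       -- estimated fraction of cancer cells, 0 < purity <= 1
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must lie in (0, 1]")
    if len(tumor_betas) != len(normal_betas):
        raise ValueError("tumor and normal profiles must cover the same sites")
    purified = []
    for tumor, normal in zip(tumor_betas, normal_betas):
        value = (tumor - (1.0 - purity) * normal) / purity
        # Noise can push the estimate outside the valid range of a beta value.
        purified.append(min(1.0, max(0.0, value)))
    return purified


purified = purify_profile([0.6, 0.1, 0.45], [0.2, 0.4, 0.5], purity=0.5)
```

The lower the purity, the more the noise in the bulk measurement is amplified by the division, so purified values from low-purity samples deserve extra caution.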
A number of tools have been proposed to estimate tumor purity for TCGA tumor samples from different types of genomics data using different underlying models. However, the estimates for the same samples vary by method, so comparison and integration of the estimates from different methods are needed. Motivated by Aran et al. [1], we created a consensus purity estimate (named "Consensus") by taking the median of the purities estimated by five available methods, ABSOLUTE, ESTIMATE, LUMP, IHC and InfiniumPurify, after normalization. Compared with the original CPE method, our updated method additionally includes the purities from InfiniumPurify. In the last module, Purimeth integrates tumor purity estimates of TCGA tumor samples from Consensus and the following eight state-of-the-art tools: ESTIMATE, ABSOLUTE, LUMP, IHC, CPE, InfiniumPurify, PAMES and MEpurity. Users can select any cancer types, methods and samples of interest to obtain the corresponding purities, which are shown in the result panel (Figure 6A). If multiple samples are entered, they should be separated by commas. The plot panel displays one of two figures, depending on the number of samples from the same cancer type. If there is only one sample for a cancer type, a bar plot is displayed for this cancer type (Figure 6B). Otherwise, a box plot is generated for each selected cancer type or method (Figure 6C). To compare the performance of the different methods, we calculated the Pearson correlation between each pair of methods on 21 cancer types and on all merged samples. As shown in Figure 6D, InfiniumPurify and Consensus show the highest overall consistency across all cancer types compared with the other methods.
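The two computations described above, a median-based consensus and a pairwise Pearson correlation, can be sketched as follows. The sketch assumes that the per-method estimates have already been normalized (the normalization procedure is not specified here) and that a method with no estimate for a sample is simply skipped, which is an assumption of this example. Function names and the numbers in the usage lines are illustrative only.

```python
import math
import statistics


def consensus_purity(estimates):
    """Median of the available normalized purity estimates for one sample.

    estimates -- mapping from method name to purity, with None for missing values
    """
    available = [value for value in estimates.values() if value is not None]
    if not available:
        return None
    return statistics.median(available)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up estimates for one sample; IHC is missing and is skipped.
one_sample = {"ABSOLUTE": 0.80, "ESTIMATE": 0.70, "LUMP": 0.90,
              "IHC": None, "InfiniumPurify": 0.75}
consensus = consensus_purity(one_sample)

# Agreement between two methods across four made-up samples.
agreement = pearson([0.55, 0.70, 0.82, 0.91], [0.50, 0.72, 0.80, 0.95])
```

The median is less affected than the mean by a single method that disagrees strongly with the others for a given sample.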
Estimating and accounting for tumor purity in DNA methylation data are active topics in cancer research. In recent years, multiple purity estimation tools have been developed with different algorithms and on different software platforms. Using these tools can be a daunting task for many researchers, since they require non-trivial computational skills. In this work, we developed Purimeth, an integrated web-based tool for estimating and accounting for tumor purity in DNA methylation studies. Besides Infinium 450 k array data, our tool was also tested on data from the latest EPIC bead chip (850 k array). Since the methylation profiles measured by microarray and bisulfite sequencing are highly consistent, our tool, although designed for microarray data, also works for sequencing data including WGBS, RRBS and HMST-seq. For a given cancer type, users only need to extract the methylation levels of its informative DMC sites (iDMCs) and upload them according to the example file format. We provide the iDMCs for 32 cancer types as a download link in the GetPurity module. As an example, we also provide a demo (in the supplementary file) of using Purimeth on WGBS data of colon cancer samples, including purity estimation and differential methylation analysis. Overall, our study provides a comprehensive web tool for researchers to perform DNA methylation data analyses that account for tumor purity.
Purimeth was developed with Shiny (version 1.6.0) on a Tencent cloud server, which provides better stability and scalability of computing resources. The computation time for purity estimation, DM analysis and purification is less than 2 minutes for a typical data set of 20 tumor and 20 normal samples. The Clustering module is more time-consuming and takes 2 to 10 minutes to return the clustering results, depending on the number of tumor samples and the number of iterations required.
We thank Shijiang Wang for suggestions on the code and web server design.
This work was supported by the National Key R & D Program of China [2018YFA0900600 to X.Z.]; National Natural Science Foundation of China [61972257 to X.Z.]; Key Laboratory of Data Science and Intelligence Education (Hainan Normal University), Ministry of Education [DSIE202002 to X.Z.] and the Hainan Province Natural Science Foundation [No. 2019RC184 to C.L.].
All authors declare that there are no conflicts of interest.
[1] | R. Ruddon, Cancer Biology, Oxford University Press, 2007. |
[2] | Y. Wang, S. Liang, Y. Tian, J. Zhao, W. Du, Y. Liang, et al., Using machine learning to measure relatedness between genes: a multi-features model, Sci. Rep., 9 (2019), 1-15. |
[3] | S. Liang, A. Ma, S. Yang, Y. Wang, Q. Ma, A review of Matched-pairs feature selection methods for gene expression data analysis, Comput. Structur. Biotechnol. J., 16 (2018), 88-97. |
[4] | A. W. Partin, J. Yoo, H. B. Carter, J. D. Pearson, D. W. Chan, J. I. Epstein, et al., The use of prostate specific antigen, clinical stage and Gleason score to predict pathological stage in men with localized prostate cancer, J. Urol., 150 (1993), 110-114. |
[5] | M. Hollstein, D. Sidransky, B. Vogelstein, C. C. Harris, P53 mutations in human cancers, J. Sci., 253 (1991), 49-53. |
[6] | K. E. Stuart, A. J. Anand, R. L. Jenkins, Hepatocellular carcinoma in the United States: prognostic features, treatment outcome, and survival, Cancer Interdiscipl. Int. J. Am. Cancer Soc., 77 (1996), 2217-2222. |
[7] | P. Kuusela, C. Haglund, P. J. Roberts, Comparison of a new tumour marker CA 242 with CA 199, CA 50 and carcinoembryonic antigen (CEA) in digestive tract diseases, British J. Cancer, 63 (1991), 636-640. |
[8] | J. Schneider, H. G. Velcovsky, H. Morr, N. Katz, K. Neu, E. Eigenbrodt, Comparison of the tumor markers tumor M2-PK, CEA, CYFRA 21-1, NSE and SCC in the diagnosis of lung cancer, Anticancer Res., 20 (2000), 5053-5058. |
[9] | L. A. Cole, J. M. Sutton, Selecting an appropriate hCG test for managing gestational trophoblastic disease and cancer, J. Reproduct. Med., 49 (2004), 545-553. |
[10] | J. A. Ludwig, J. N. Weinstein, Biomarkers in cancer staging, prognosis and treatment selection, Nat. Rev. Cancer, 5 (2005), 845-856. |
[11] | G. J. Rustin, M. Marples, A. E. Nelstrop, M. Mahmoudi, T. Meyer, Use of CA-125 to define progression of ovarian cancer in patients with persistently elevated levels, J. Clin. Oncol., 19 (2001), 4054-4057. |
[12] | H. Zheng, R. C. Luo, Diagnostic value of combined detection of TPS, CA153 and CEA in breast cancer, J. First Milit. Med. Univers., 25 (2003), 1293. |
[13] | H. Q. Zhang, R. B. Wang, H. J. Yan, W. Zhao, K. L. Zhu, S. M. Jiang, et al., Prognostic significance of CYFRA21-1, CEA and hemoglobin in patients with esophageal squamous cancer undergoing concurrent chemoradiotherapy, Asian Pacific J. Cancer Prevent., 13 (2012), 199-203. |
[14] | A. Hsu, S. L. Tang, S. Halgamuge, An unsupervised hierarchical dynamic self-organising approach to cancer class discovery and marker gene identification in microarray data, Bioinformatics, 19 (2003), 2131-2140. |
[15] | J. J. Liu, G. Cutler, W. Li, Z. Pan, S. Peng, T. Hoey, et al., Multiclass cancer classification and biomarker discovery using GA-based algorithms, Bioinformatics, 21 (2005), 2691-2697. |
[16] | B. J. Beattie, P. N. Robinson, Binary state pattern clustering: A digital paradigm for class and biomarker discovery in gene microarray studies of cancer, J. Comput. Biol., 13 (2006), 1114-1130. |
[17] | C. Harris, N. Ghaffari, Biomarker discovery across annotated and unannotated microarray datasets using semi-supervised learning, BMC Genomics, 9 (2008), S7. |
[18] | T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, Y. Saeys, Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics, 26 (2010), 392-398. |
[19] | L. Chen, J. Xuan, C. Wang, I. M. Shih, Y. Wang, Z. Zhang, et al., Knowledge-guided multi-scale independent component analysis for biomarker identification, BMC Bioinformatics, 9 (2008), 416. |
[20] | J. Cui, Q. Liu, D. Puett, Y. Xu, Computational prediction of human proteins that can be secreted into the bloodstream, Bioinformatics, 24 (2008), 2370-2375. |
[21] | J. Cui, Y. Chen, W. C. Chou, L. Sun, L. Chen, J. Suo, et al., An integrated transcriptomic and computational analysis for biomarker identification in gastric cancer, Nucleic Acids Res., 39 (2011), 1197-1207. |
[22] | C. S. Hong, J. Cui, Z. Ni, Y. Su, D. Puett, F. Li, et al., A computational method for prediction of excretory proteins and application to identification of gastric cancer markers in urine, PloS One, 6 (2011), e16875. |
[23] | J. Wang, Y. Liang, Y. Wang, J. Cui, M. Liu, W. Du, et al., Computational prediction of human salivary proteins from blood circulation and application to diagnostic biomarker identification, PloS One, 8 (2013), e80211. |
[24] | Y. Sun, W. Du, C. Zhou, Y. Zhou, Z. Cao, Y. Tian, et al., A Computational Method for Prediction of Saliva-Secretory Proteins and its Application to Identification of Head and Neck Cancer Biomarkers for Salivary Diagnosis, IEEE Transact. Nanobiosci., 14 (2015), 167-174. |
[25] | A. Ben-Hur, D. Horn, H. T. Siegelmann, V. Vapnik, A support vector method for clustering, Adv. Neural Inform. Process. Syst., 13 (2001), 367-373. |
[26] | Y. Chen, Y. Zhang, Y. Yin, G. Gao, S. Li, Y. Jiang, et al., SPD-a web-based secreted protein database, Nucleic Acids Res., 33 (2005), D169-D173. |
[27] | J. Sprenger, J. Lynn Fink, S. Karunaratne, K. Hanson, N. A. Hamilton, R. D. Teasdale, LOCATE: A mammalian protein subcellular localization database, Nucleic Acids Res., 36 (2007), D230-D233. |
[28] | M. Magrane, Uniprot knowledgebase: A hub of integrated protein data, Database, 2011 (2011). |
[29] | S. J. Li, M. Peng, H. Li, B. S. Liu, C. Wang, J. R. Wu, et al., Sys-bodyfluid: A systematical database for human body fluid proteome research, Nucleic Acids Res., 37 (2009), 907-912. |
[30] | S. Hu, J. A. Loo, D. T. Wong, Human saliva proteome analysis and disease biomarker discovery, Expert Rev. Proteom., 4 (2007), 531-538. |
[31] | P. Denny, F. K. Hagen, M. Hardt, L. Liao, W. Yan, M. Arellanno, et al., The proteomes of human parotid and submandibular/sublingual gland salivas collected as the ductal secretions, J. Proteom. Res., 7 (2008), 1994-2006. |
[32] | S. El-Gebali, J. Mistry, A. Bateman, S. R. Eddy, A. Luciani, S. C. Potter, et al., The Pfam protein families database in 2019, Nucleic Acids Res., 47 (2019), D427-D432. |