Abbreviations: LUAD: Lung adenocarcinoma; MAG: Metastasis-associated gene
1.
Introduction
Lung cancer is one of the most commonly diagnosed malignancies in the world and accounts for the number one incidence and mortality rates among all human cancers [1,2]. Approximately 40–60% of all lung cancer cases show metastasis at their initial diagnosis [3,4] and the secondary tumors are frequently found in the brain, bones, liver, and adrenal glands [5,6,7,8,9,10]. A large portion (70–90%) of lung cancer patients succumb to the disease as a result of distant tumor metastasis rather than uncontrolled primary tumor growth [11].
Lung adenocarcinoma (LUAD) is the most common subtype of lung cancer and accounts for approximately 40% of all lung cancers. LUAD is a heterogeneous malignant disease and the existence of heterogeneity makes LUAD therapy challenging. Previous studies have widely employed Bulk RNA sequencing (RNA-seq) to clarify the transcriptome of LUAD. However, Bulk RNA sequencing (RNA-seq) only shows the average expression across all cells, but does not reveal the gene expression patterns of individual cells, resulting in the neglect of heterogeneity among individual cells. The use of scRNA-seq could provide a better insight into the heterogeneity of different subgroups of cells. Importantly, this technology has been successfully used to evaluate tumor heterogeneity [12] and has revealed the complexity of the tumor microenvironment [13]. Moreover, recent data using scRNA-seq has shown better identification of tumor heterogeneity, including better illustration of tumor growth, resistance to treatment, and tumor metastasis, as well as understanding of tumor biology [14,15,16,17]. In addition, scRNA-seq could help to identify an association between tumor transcriptional heterogeneity and patient prognosis [18,19]. Therefore, the aim of this study was to reveal the transcpritome and heterogeneity between primary LUAD and metastatic LUAD through scRNA-seq and to identify metastasis-associated genes (MAGs) to provide new insights for metastatic LUAD treatment and prognosis judgment.
In this study, a 7-MAG signature was established based on LUAD scRNA-seq data, and this model performed well in predicting the OS of LUAD patients. Impactful information has been presented for prognostic indicators and novel therapeutic targets of metastatic LUAD.
2.
Material and methods
2.1. Database search and data acquisition
The LUAD scRNA-seq data on 126 LUAD cell samples (datasets of PDX-LC-PT-45 and PDX-LC-MBT-15) were downloaded from the Gene Expression Omnibus Database (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) with the accession number GSE69405. The LC-PT-45 tumor was derived from a 60-year-old male, treatment-naive LUAD patient [20]. The LC-MBT-15 tumor originated from a 57-year-old woman with LUAD heterochronous brain metastasis after standard chemotherapy and erlotinib treatment [20]. Furthermore, the transcriptome profile together with clinicopathological data (including gender, age, tumor grade, TNM stage, pathological stage, and follow-up data) were obtained from The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/). Clinical data for the LUAD patients included in this study are shown in Table 1.
2.2. Analysis of scRNA-seq and TCGA data
The LUAD scRNA-seq data of the raw reads from 126 tumor cell samples were generated using the Illumina Hiseq2500 System and mapped to the GRCH36 human genome sequences. A series of data preprocessing, including readings of the LUAD scRNA-seq data matrix and taking the average of duplicate genes, was carried out. The LUAD scRNA-seq data were then converted into a Seurat object and subjected to data filtering. The exclusion criteria were set as: 1) data on cells with fewer than 200 detected genes; 2) genes identified in fewer than three cells; and 3) mitochondrial DNA reads greater than 11%. Log normalization was used to normalize single-cell transcriptome profiling and FindVaribleGene was used to assess the variable genes across single cells for downstream analysis. Prior to this, ScaleData was used to scale the data and remove unnecessary variable sources, such as technical noise. Subsequently, a principal component analysis (PCA) was conducted to evaluate the most significant principal components for cluster analysis. Afterwards, t-Distributed Stochastic Neighbor Embedding (t-NSE) was performed to assess the cluster classification, and FindAllMarkers was used to screen the co-expressed variable genes in the clusters, which were defined as MAGs. An absolute value of logFC (|Log FC|) > 0.5 and an adjusted P value (adj P) < 0.05 were used as the cut-off values for MAGs. Next, the TCGA-LUAD transcriptome profile normalization was performed using the edgeR package.
2.3. Identification of differentially expressed MAGs (DEMAGs) associated with OS
The MAG expression profile was first extracted from the TCGA-LUAD transcriptome profile and the R package Limma was performed to identify the DEMAGs using logFC > 1.0 and false discovery rate (FDR) < 0.01 as the cut-off criteria. Next, we matched the MAG expression profile with OS of LUAD patients in the entire TCGA-LUAD cohort. The prognostic MAGs were identified through univariate Cox regression analysis when P value < 0.01 in the entire TCGA-LUAD cohort. The prognostic DEMAGs identified by intersecting the results of (a) the prognostic MAGs and (b) the DEMAGs were used for downstream analysis. MAGs with a hazard ratio (HR) > 1 were considered risk factors, whereas MAGs with an HR < 1 were considered protective.
2.4. Construction of a MAG signature and a MAG nomogram model for the prediction of LUAD prognosis
The entire TCGA-LUAD cohort was randomly divided into either the training (n = 236) or the testing group (n = 232), and the training group was utilized to construct a MAG signature for the prediction of LUAD patient OS. Before constructing a MAG signature, these prognostic DEMAGs were initially visualized through univariate Cox analysis, and a LASSO regression analysis was then implemented to further screen and narrow down these prognostic DEMAGs. Multivariate Cox proportional hazards regression analysis was used to construct a prognostic model in the training group, and the HR and regression coefficient for each prognostic DEMAG were calculated. Eventually, seven DEMAGs were identified to establish the MAG signature for the prediction of OS in LUAD patients. The calculation of the MAG signature was as follows: Risk score = Ʃ (βi x Expi), where βi represented the coefficient of gene i, standing for the weight of gene i, and Expi represented the expression level of gene i. Kaplan-Meier curves and log-rank tests were performed to associate this 7-MAG signature with OS of LUAD patients, then a receiver operating characteristic (ROC) curve was used to identify the performance of the MAG signature in predicting OS of LUAD patients in both the training and testing groups, as well as the entire TCGA-LUAD cohort.
Finally, the rms R package was used to construct a nomogram to predict OS of LUAD patients, which incorporated these seven DEMAGs. This nomogram model was a prognostic statistical model made using simple graphs according to previous studies [21,22].
2.5. Functional pathway analysis of the 7-MAG signature
To search for and identify potential molecular mechanisms, Gene Set Enrichment Analysis (GSEA) was performed. The risk scores was defined as the phenotype, and then the entire TCGA-LUAD cohort was divided into either a high- or low-risk group using the median of the risk score as the cutoff value. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database was then used to enrich the functional pathways of these high- and low-risk groups with the FDR < 0.05 as the cut-off value for significance.
2.6. Statistical analysis
All statistical analysis were performed using the R package, and glmnet was used for LASSO coefficient regression to screen and narrow down DEMAGs. The Kaplan-Meier with log-rank analysis was performed to evaluate OS of LUAD patients. The generalized linear model (GLM) was established using the rms software package to develop a quantitative model for the prediction of OS in LUAD patients. A P value < 0.05 was considered statistically significant.
3.
Results
3.1. Identification of MAGs in LUAD samples using scRNA-seq data
A total of 77 and 49 high-quality tumor cell samples were obtained from PDX-LC-PT-45 and PDX-LC-MBT-15, respectively. The quality control chart was shown in Figure 1A, illustrating the range of single cell RNA numbers, the sequencing count, and the proportion of mitochondrial sequencing counts of each cell. In total, 1500 variable genes were found in the single cell samples (Figure 1B). The PCA method was then used to divide these single cell samples into 20 different components (Figure 1C), in which the statistically significant components were used for further analysis. In addition, the single cell samples were mapped into two independent dimensions based on the PC1 and PC2 (Figure 1D). Apart from performing PCA, the t-NSE algorithm was also completed to successfully divide the single-cell sample data into primary- and metastatic- tumor cell subpopulations, defined as cluster 1 and cluster 0, respectively (Figure 1E). Subsequently, 414 co-expressed genes were identified from these two clusters as MAGs using limma software with |Log FC| > 0.5 and an adj P < 0.05, and the heatmap data revealed the top 10 genes between these two clusters (Figure 1F).
3.2. Verification of DEMAGs associated with OS in the entire TCGA-LUAD cohort
The 414 MAGs expression profile was obtained from TCGA-LUAD transcriptome profile consisting of 551 samples (497 LUAD samples and 54 normal lung tissue samples). The 414 MAGs expression profiling was uploaded into R software packages to identify DEMAGs, which revealed a total of 114 DEMAGs using the criteria of logFC > 1.0 and FDR < 0.01, including 95 upregulated and 19 downregulated MAGs. The heatmap and volcano plot of these DEMAGs were shown in Figure 2A, B.
To identify prognostic DEMAGs, we initially defined a cohort of 468 LUAD patients with clinical information from TCGA database as the entire TCGA-LUAD cohort. Next, the 414 MAGs expression profile was incorporated with the clinical information from the entire TCGA-LUAD cohort and univariate Cox regression was performed (Supplementary Figure S1) to identify 51 MAGs associated with OS of LUAD patients. Then, the genes that overlapped for both prognostic MAGs and DEMAGs were defined as prognostic DEMAGs. There were 22 prognostic DEMAGs (Figure 3A, C) found in the entire TCGA-LUAD cohort, and the distribution and correlation of these prognostic DEMAGs was shown in Figure 3B and Figure 3D, respectively. LASSO regression analysis was then performed, which identified 12 prognostic DEMAGs for further analysis (Figure 3E, F).
3.3. Identification of the MAG signature in prediction of LUAD prognosis
The 12 prognostic DEMAGs were further analyzed using multivariate Cox proportional hazards regression analysis and seven genes of interest were identified in the training group (n = 236): serine protease 3 (PRSS3), glucose-6-phosphate isomerase (GPI), chemokine ligand 20 (CCL20), keratin-18 (KRT18), transcobalamin I (TCN1), solute carrier organic anion transporter family member 1B3 (SLCO1B3), and glucosamine-phosphate N-acetyltransferase 1 (GNPNAT1). The MAG signature (Figure 4) to predict the OS of LUAD patients was calculated as follows: Risk score = Exp (PRSS3) × 0.1263 + Exp (GPI) × 0.3664 + Exp (CCL20) × 0.0889 + Exp (KRT18) × 0.3787 + Exp (TCN1) × 0.1570 + Exp (SLCO1B3) × 0.1756 + Exp (GNPNAT1) × 0.3249. The training group (n = 236) was divided into a high- or low-risk group according to the median risk score as the cut-off value for each sample (Figure 5A). As shown in Figure 5A, the distribution of the vitals demonstrated that the high-risk group had more cases of death compared to the low-risk group, and LUAD patients with high risk scores tended to express a higher level of PRSS3, GPI, CCL20, KRT18, TCN1, SLCO1B3 and GNPNAT1 than those with low risk scores. The Kaplan-Meier curve analysis revealed that LUAD patients with high risk scores had significantly worse OS than those with low risk scores (P = 7.947e−05; Figure 5B). The ROC curve also illustrated that the area under curve (AUC) of the MAG signature was 0.767 (Figure 5C), which indicated a moderate prediction value.
3.4. Validation of the MAG signature in prediction of LUAD prognosis in the testing group and the entire TCGA-LUAD cohort
The 7-MAG signature to predict OS of LUAD patients was validated in the testing group (n = 232) and the entire TCGA-LUAD cohort (n = 468). The 7-MAG signature was able to effectively distinguish LUAD patients into two groups with better or worse OS in the testing group (P = 9.465e−03, AUC = 0.682; Figure 6A–C). Similarly, the entire TCGA-LUAD cohort also further confirmed the results, and LUAD patients with high risk scores had a shorter OS than those with low risk scores (P = 1.47e−05, AUC = 0.727; Figure 6D–F). Univariate- and multivariate Cox analysis of the 7-MAG signature in the entire TCGA-LUAD cohort confirmed it to be an independent prognostic predictor for LUAD patients (Figure 7).
3.5. Construction of a MAG nomogram model in accordance with the 7-MAG signature
A quantitative model of the seven DEMAGs to predict OS of LUAD patients was developed by integrating the seven DEMAGs into a nomogram (Figure 8). In the nomogram model, we first assigned points to each variable using a point scale based on the multivariate Cox analysis. Next, a horizontal line was drawn to determine the point of each variable, and the total points of each LUAD patient, which were distributed between 0 and 100, was calculated by summing the points of all variables. Finally, a vertical line between the total points axis and each prognostic axis was utilized to evaluate the 1-, 2-, and 3-year OS of LUAD patients.
3.6. Bioinformatical enrichment of the 7-MAG signature in the entire TCGA-LUAD cohort
To gain specific biological insights of the 7-MAG signature in LUAD, the GSEA was performed that revealed alterations of the cell cycle, DNA replication, mismatch repair, pentose phosphate pathway, proteasome, and the p53 signaling pathway were significantly enriched in LUAD samples with high risk scores, whereas the vascular smooth muscle contraction was significantly enriched in LUAD samples with low risk scores (FDR < 0.05; Figure 9).
4.
Discussion
LUAD is a heterogeneously malignant disease and often shows early-stage tumor metastasis leading to a poor prognosis. scRNA-seq technology could help to effectively dissect tumor cell heterogeneity and identify potential prognosis biomarkers. Therefore, we first screened LUAD scRNA-seq data to identify 414 MAGs from the GSE69405 and found 22 prognostic DEMAGs from the entire TCGA-LUAD cohort. After that, we successfully developed a 7-MAG signature, which could serve as novel indicator for prognosis of LUAD patients and provide potential novel therapeutic targets and molecular mechanism for metastatic LUAD.
To attain potential MAGs, we performed PCA and tNSE analysis on scRNA-seq data of primary LUAD and metastatic LUAD. At the same time, we characterized the transcriptome and heterogeneity between primary and metastatic LUADs and identified a total of 414 MAGs from the GSE69405 and combined with LUAD Bulk RNA-data for subsequent analysis. In the entire TCGA-LUAD cohort, 114 DEMAGs were identified from 414 MAGs using the Limma package in accordance to logFC > 1.0 and FDR < 0.01, and 52 out of 414 MAGs were identified to be associated with OS of LUAD patients via univariate Cox analysis, among which 22 genes were considered as prognostic DEMAGs. To ensure the construction of a MAG signature that was more reasonable and reliable, LASSO regression analysis was executed to narrow down these prognostic DEMAGs. Finally, a 7-MAG signature, including PRSS3, GPI, CCL20, KRT18, TCN1, SLCO1B3, and GNPNAT1, was established through multivariate Cox proportional hazards regression analysis in the training group. To elucidate the prognostic value of the 7-MAG signature, the Kaplan-Meier with log-rank analysis and ROC curve analysis confirmed that the 7-MAG signature showed a good predictive ability for OS of LUAD patients in the training (n = 236, P = 7.947e−05, AUC = 0.767; Figure 5A–C) and the testing (n = 232, P = 9.465e−03, AUC = 0.682; Figure 6A–C) groups, as well as the entire TCGA-LUAD cohort (n = 468, P = 1.47e−05, AUC = 0.727; Figure 6D–F). Additionally, univariate- and multivariate Cox analysis also demonstrated that the 7-MAG signature was an independent factor in the prediction of LUAD prognosis. Importantly, the nomogram based on the 7-MAG signature was established to more intuitively help the prediction of OS in LUAD patients at one, two, and three years. Finally, to show validity of the MAG signature, GSEA illustrated that the cell cycle, DNA replication, mismatch repair, pentose phosphate pathway, proteasome, and the p53 signaling pathway were involved in progression of LUAD.
Importantly, components of the 7-MAG signature possess different biological functions and various patterns of expression in different human cancers. For example, PRSS3 was reported to be involved in tumor metastasis in prostate, pancreatic, and gastric cancers [23,24,25], while Ma et al. revealed that PRSS3 could predict prognosis of LUAD patients and promote growth and invasion in LUAD cells [26]. Ma et al. showed that knockdown of GPI inhibited cancer cell proliferation, invasion, and migration in gastric cancer [27], whereas the high GPI expression was associated with NSCLC metastatic potential [28]. CCL20 was shown to be expressed in various human tissues, including the lung, liver, and lymph nodes [29,30,31]. Moreover, CCL20 could mediate the migration of epithelial cells and likely participates in cancer cell migration and metastasis in a variety of human cancers, such as breast, colorectal, prostate, and pancreatic cancers [32,33,34,35]. Similarly, Wang et al. indicated that CCL20 was significantly overexpressed in NSCLC tissues, and it contributed to cancer cell proliferation and migration through the PI3K pathway in lung cancer [36]. KRT18, also known as cytokeratin 18 (CK18), was shown to be abnormally expressed in various human cancers and has been associated with poor disease progression and prognosis [37,38]. For instance, Ma et al. reported that KRT18 was highly expressed and correlated with poor prognosis in NSCLC, while KRT18 knockdown was prone to inhibit NSCLC cell migration [39]. TCN1 is a type of vitamin B12-binding protein that transports vitamin B12 from the stomach to the intestine. It was reported that TCN1 overexpression was correlated with poor biological behavior and tumorigenesis of various tumor tissues [40]. Liu et al. found that TCN1 was overexpressed in colon cancer, and TCN1 overexpression was a poor prognostic biomarker and predicted neoadjuvant chemosensitivity in colon cancer [41]. Moreover, SLCO1B3, also known as the organic anion transporting polypeptide (AOTP), localizes at chromosome 12p12-31.7 to 12p12-37.2 and plays important roles in transporting various components to cells. Recent studies have shown that a truncated form of SLCO1B3 occurred in human cancer tissues and cells lines [42,43,44]. Hase et al. demonstrated that SLCO1B3 promoted NSCLC progression via mediating epithelial-mesenchymal transition [45]. Finally, GNPNAT1 is a member of the GCN5-related N-acetyltransferase superfamily. Kaushik et al. reported that genetic loss-of-function of GNPNAT1 in castration-resistant prostate cancer (CRPC)-like cells contributed to proliferation and increased tumor cell aggressiveness through the PI3K/AKT signaling pathway [46]. Taken together, these seven genes have been shown to be involved in the occurrence and progression of various human cancers.
In the current study, it was confirmed that the 7-MAG signature possessed the ability to predict LUAD prognosis. However, our current study did have some limitations. For example, we only assessed the available online data but did not use data from our own patients; thus, it is a retrospective study. Future prospective research is warranted to verify our results.
5.
Conclusions
In the current study, we characterized the transcriptome and heterogeneity between primary LUAD and metastatic LUAD based on scRNA-seq, and 414 MAGs were identified from LUAD scRNA-seq data and a 7-MAG signature was established in the training group. The 7-MAG signature was able to predict OS of LUAD patients in the training group, the testing group, the entire TCGA-LUAD cohort and the nomogram model. Furthermore, the GSEA results revealed that LUAD progression was due to alterations in the cell cycle, DNA replication, mismatch repair, pentose phosphate pathway, the proteasome, and the p53 signaling pathway. Future studies will need to verify the usefulness of the 7-MAG signature for the prediction of LUAD prognosis and for the potential to target these biomarkers as novel strategies to control LUAD with metastasis.
Data avaliability
All data in this study are already available from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/) and The Cancer Genome Atlas (https://portal.gdc.cancer.gov/) databases.
Conflicts of interest
The authors declare that there is no conflict of interest in this work.