1. Introduction
According to Global Cancer Statistics 2022, breast, lung, and colorectal cancers account for 51% of all new diagnoses in women, with breast cancer alone accounting for almost one-third [1]. HER2-positive breast cancer is a highly heterogeneous tumor that accounts for approximately 15–20% of invasive breast cancers [2]. Before HER2-targeted therapy became widespread, early-stage HER2-positive patients tended to have shorter times to recurrence, higher rates of metastasis, and higher mortality than HER2-negative patients. With the widespread use of trastuzumab, a monoclonal antibody targeting HER2, and lapatinib, a small-molecule tyrosine kinase inhibitor targeting HER1 and HER2, the prognosis of HER2-positive patients has improved significantly [3,4,5]. However, more than 30% of patients still experience recurrence, metastasis, or death within 10 years after treatment [2].
The purpose of survival analysis is to identify the factors associated with events such as death or recurrence within a certain period after treatment, which has important clinical applications. Through radiomics and genomics, many scholars have attempted to identify factors associated with death or metastasis [6,7,8]. A clinician can more easily choose an appropriate treatment when given a more accurate assessment of a patient's survival risk [9,10,11]. In recent years, with the popularization and development of digital pathology, pathologists can obtain digital pathological slides more quickly and at higher resolution. However, predicting patient prognosis from pathological images and clinical features is still challenging for the following reasons: 1) Whole slide images (WSIs) often contain more than one billion pixels, so mainstream computers and models cannot process them directly. 2) Due to the high heterogeneity of breast cancer, a patient may have several pathological slides with different characteristics. 3) Image features and clinical features differ greatly in nature and dimensionality, which makes fusion difficult.
In this paper, we propose an end-to-end system to predict patient prognosis through the fusion of pathological images and clinical features. The main contributions of this study are as follows: 1) The proposed model is based on MIL and thus does not require pixel-level annotation; it learns WSI representations directly from the pathological images. 2) The proposed model fuses clinical features and image features through compact bilinear pooling (CBP), and the C-index of the fusion model is 0.06 higher than that of the single-modal model. 3) When validated on subgroups stratified by lymph node status, the fusion model still performed well.
Although traditional machine learning techniques, including support vector machines (SVMs) [12] and neural networks [13], have found wide application in fields such as speckle reduction [14], image segmentation [15], and image retrieval [16], their performance needs further improvement, and deep learning can provide it. Deep learning methods for pathological image processing can be divided into supervised learning and weakly supervised learning according to how the data are labeled. Supervised learning typically analyzes partial regions of interest (ROIs) in WSIs, with labels assigned at the ROI level; weakly supervised learning analyzes one or more entire WSIs, with labels typically derived from pathology reports. Classic ROI-based methods require manual annotation by pathologists, after which patient traits are predicted from manually designed quantitative features. Cain et al. [17] established two multivariate models using radiomics features and showed that radiomics features could be used to predict the pathological complete response (pCR) to neoadjuvant therapy (NAT) in patients with triple-negative/HER2-positive breast cancer. Wang et al. [18] proposed a computer-aided diagnosis and survival analysis system for non-small cell lung cancer (NSCLC) based on WSIs; they manually designed and extracted 166 image features for diagnosis and prognosis and reached a classification accuracy of 92%. Unlike traditional learning methods, recent studies have used CNNs to perform feature extraction and classification/prediction [19]. For example, Yang et al. [20] divided ROIs into 512 × 512 pixel tiles and trained a ResNet-50-based model to predict the recurrence risk of breast cancer patients. Yu et al. [21] divided WSIs into 1000 × 1000 pixel tiles and extracted 9879 quantitative image features, which were used for prognosis prediction after the validity of the features was verified through two classification tasks. In our previous study [22], we used a deep fully convolutional neural network to perform end-to-end segmentation of pathological tissue slices and achieved state-of-the-art segmentation accuracy on public nuclei histopathology datasets. Yan et al. [23] combined a CNN and an RNN to establish a cancer classification model with a classification accuracy of 92%.
However, ROIs may not capture all the information in a WSI, and pathologists must manually annotate them, which is time-consuming. WSI-based analysis methods have been proposed for these reasons and directly produce predictions at the WSI or patient level. Zhu et al. [24] proposed an effective whole-slide histopathological image survival analysis framework (WSISA) to predict patient prognosis. Li et al. [25] proposed a WSI-based framework, DeepGraphSurv, and pioneered the use of graph convolutional networks (GCNs) for survival analysis; each WSI was represented as a graph, and their network used graph convolutions to obtain a WSI representation. Yao et al. [26,27] introduced MIL based on prior pathological knowledge and proposed DeepAttnMISL, which clusters each WSI into multiple graphs and then aggregates them into bag representations through an attention-based MIL model. Wu et al. [28] proposed DeepGCNMIL based on this framework to optimize the model structure, which increased the C-index by 0.035. Campanella et al. [29] built a cancer classification model based on MIL, trained it on more than 40,000 WSIs, and achieved an AUC of 0.98. Chen et al. [30] innovatively combined gene expression and pathological image features to predict the WHO grade and prognosis of patients and achieved good results.
Although there have been many WSI-based analysis studies, research on prognosis prediction using MIL remains limited, and owing to the scarcity of patient follow-up data, no existing study focuses on the prognosis of HER2-positive patients. In this study, we established an MIL framework, GATSurvMIL, that integrates clinical features and WSI representations. This formulation matches pathological practice: a patient's prognosis is generally not reflected by the entire WSI, but rather by particular portions of it.
2. Materials and methods
2.1. Data collection and screening
We collected the clinical features and WSIs of 1069 HER2-positive patients treated at West China Hospital (WCH) from 2009 to 2019 and downloaded the data of 160 HER2-positive patients from The Cancer Genome Atlas (TCGA) [31]. The study protocol (2020-427) was approved by the Ethics Committee on Biomedical Research, WCH of Sichuan University.
The distribution of the patients' clinical characteristics is shown in Table 1. The age at diagnosis differs considerably between the two datasets, with the TCGA patients being older overall. The mortality rate in TCGA is also much higher than that in WCH, although fewer TCGA patients experienced recurrence or metastasis.
Data screening in this study was based on the criteria shown in Figure 1. All patients were confirmed as HER2-positive by immunohistochemistry (HER2 3+) or by HER2 gene amplification on fluorescence in situ hybridization (FISH). We then performed clinical pairing of the data to ensure that the patients' clinical characteristics were evenly distributed, and the patients excluded during pairing were selected at random.
2.2. WSI processing
Because a WSI generally contains more than 1 billion pixels, existing machine learning models cannot directly process an entire WSI. Therefore, we cut each WSI into portions for sampling, as shown in Figure 2. First, we used an advanced tissue region segmentation framework [32] to segment each WSI's foreground from the background and then cut the foreground tissue area into patches, each with a size of 256 × 256 pixels. Although patches can be used as the input of common neural networks, a WSI usually has more than 10,000 patches. Thus, we calculated the energy values of all the patches by 2D convolution and then used the 500 patches with the highest energy values to represent the WSI. By doing this, we can reduce the computational complexity and information redundancy.
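As a concrete illustration of the patch-selection step, the sketch below ranks tissue patches by an edge-energy score and keeps the 500 highest-scoring ones; the specific convolution kernel (a Laplacian here) and the grayscale conversion are assumptions, since the exact energy definition is not specified above.

```python
import numpy as np
from scipy.signal import convolve2d

# Assumed energy definition: summed absolute Laplacian response of the grayscale patch.
LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float32)

def patch_energy(patch_rgb):
    """Energy score for a 256 x 256 RGB patch (numpy array of shape (256, 256, 3))."""
    gray = patch_rgb.astype(np.float32).mean(axis=2)     # simple grayscale conversion
    response = convolve2d(gray, LAPLACIAN, mode="same")  # 2D convolution over the patch
    return float(np.abs(response).sum())

def select_top_patches(patches, k=500):
    """Keep the k highest-energy patches to represent the WSI."""
    order = np.argsort([patch_energy(p) for p in patches])[::-1][:k]
    return [patches[i] for i in order]
```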
Next, we used the pretrained ResNet-50 [33] to extract the features of the color-normalized patches. The strong representational ability of ResNet-50 has been demonstrated in many previous related studies [26]. Finally, we used K-means clustering to cluster all patch features from a patient into several phenotypes, which have different predictive powers for patients' clinical outcomes.
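A minimal sketch of this feature-extraction and clustering step is given below, assuming standard ImageNet preprocessing and 10 phenotype clusters per patient; the exact preprocessing pipeline may differ from the one used in practice.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from sklearn.cluster import KMeans

backbone = models.resnet50(pretrained=True)   # ImageNet-pretrained feature extractor
backbone.fc = nn.Identity()                   # drop the classifier head -> 2048-d features
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_patch_features(patches):
    """patches: list of color-normalized RGB patch arrays; returns an (N, 2048) feature matrix."""
    batch = torch.stack([preprocess(p) for p in patches])
    return backbone(batch).cpu().numpy()

def cluster_into_phenotypes(features, n_phenotypes=10):
    """Assign each patch feature to one of n_phenotypes clusters with K-means."""
    return KMeans(n_clusters=n_phenotypes, n_init=10, random_state=0).fit_predict(features)
```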
2.3. Feature aggregation
After WSI processing, each patient was represented by 10 phenotype groups. Unlike the MI-FCN used by Yao et al. [26], which ignores the connections among patches within a phenotype and performs only a simple aggregation, we model these relationships explicitly. The GCN has a strong learning ability for graph data [34,35] and far surpasses CNNs in node classification tasks; building on the GCN, the graph attention network (GAT) introduces an attention mechanism that enables it to process dynamic graphs. Therefore, we used a GAT [36] to learn the relationships between patches within a phenotype group. Figure 2 shows the proposed embedded GAT module, which contains multiple GAT layers with ReLU nonlinearities between them. Before a phenotype group enters the model, we determine the edge index among its patches using knn_graph and initialize the group as a dynamic graph. The output is a feature vector for the phenotype group.
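The sketch below illustrates one possible implementation of the embedded GAT module with PyTorch Geometric; the hidden dimension and the mean-pooling readout are assumptions for illustration, while k = 10 and the three GAT layers follow the settings reported in Section 3.1.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, knn_graph, global_mean_pool

class PhenotypeGAT(torch.nn.Module):
    """Sketch of the embedded GAT module: three GATConv layers with ReLU,
    followed by mean pooling into a single phenotype embedding."""
    def __init__(self, in_dim=2048, hidden_dim=128, k=10):
        super().__init__()
        self.k = k
        self.gat1 = GATConv(in_dim, hidden_dim)
        self.gat2 = GATConv(hidden_dim, hidden_dim)
        self.gat3 = GATConv(hidden_dim, hidden_dim)

    def forward(self, x):
        # x: (num_patches_in_phenotype, in_dim) patch features
        edge_index = knn_graph(x, k=self.k)                # build a dynamic k-NN graph over patches
        h = F.relu(self.gat1(x, edge_index))
        h = F.relu(self.gat2(h, edge_index))
        h = F.relu(self.gat3(h, edge_index))
        batch = torch.zeros(h.size(0), dtype=torch.long)   # all nodes belong to one graph
        return global_mean_pool(h, batch)                  # (1, hidden_dim) phenotype embedding
```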
The data from each patient contains graph embedding vectors for multiple phenotypes, and we aggregated these vectors through a multihead attention network [37,38]. The attention function is a method to map a query and a set of key-value pairs to an output. Multihead attention projects queries and key-value pairs into multiple projection spaces to learn richer semantic information and can speed up model training through parallel learning:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O,$$
where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$, and the projections are the parameter matrices $W_i^Q \in \mathbb{R}^{d_{\mathrm{model}} \times d_q}$, $W_i^K \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}$, and $W^O \in \mathbb{R}^{hd_v \times d_{\mathrm{model}}}$.
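The following sketch shows one way to realize this phenotype-level aggregation with PyTorch's built-in multi-head attention; the learnable patient-level query token is an illustrative assumption rather than the exact design described above.

```python
import torch
import torch.nn as nn

class PhenotypeAttentionPool(nn.Module):
    """Aggregate phenotype embeddings into one patient-level vector with multi-head attention."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learnable patient-level query (assumed)
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads, batch_first=True)

    def forward(self, phenotype_embeddings):
        # phenotype_embeddings: (num_phenotypes, dim) for one patient
        kv = phenotype_embeddings.unsqueeze(0)               # (1, num_phenotypes, dim)
        pooled, attn_weights = self.attn(self.query, kv, kv) # attend over the phenotypes
        return pooled.squeeze(0).squeeze(0), attn_weights    # (dim,) patient vector, attention map
```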
2.4. Feature fusion
Clinical characteristics play a crucial role in selecting treatments for patients. The clinical features used are the intersection of the two datasets: {ER, PR, Age, Lymph, Stage}. Common simple feature fusion methods include concatenation, element-wise product, and element-wise summation. We conducted a series of experiments to investigate the efficacy of these fusion techniques, using the image-feature-only GATSurvMIL model as the baseline. As shown in Table 2, these simple techniques failed to improve the model's performance and may even impede its learning.
Thus, we used CBP to fuse the features. Original bilinear pooling produces fused features of very high dimensionality; its authors applied principal component analysis (PCA) to reduce the dimensions, but the results were not satisfactory. CBP instead uses the idea of a linear kernel machine to avoid the dimensionality problem. We therefore fused each patient's image features and clinical features with CBP and fed the fused features into a fully connected layer to obtain the patient's risk value.
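A minimal sketch of CBP via the Count Sketch and FFT formulation is shown below; the output and input dimensions are illustrative, and the clinical features are assumed to be encoded as a numeric vector.

```python
import torch
import torch.nn as nn

class CompactBilinearPooling(nn.Module):
    """Fuse image and clinical features by sketching each modality and
    multiplying the sketches in the frequency domain (circular convolution)."""
    def __init__(self, dim_img, dim_clin, out_dim=1024):
        super().__init__()
        self.out_dim = out_dim
        # fixed random hash indices and signs for each modality (Count Sketch projections)
        self.register_buffer("h1", torch.randint(out_dim, (dim_img,)))
        self.register_buffer("s1", torch.randint(0, 2, (dim_img,)).float() * 2 - 1)
        self.register_buffer("h2", torch.randint(out_dim, (dim_clin,)))
        self.register_buffer("s2", torch.randint(0, 2, (dim_clin,)).float() * 2 - 1)

    def _sketch(self, x, h, s):
        sk = x.new_zeros(x.size(0), self.out_dim)
        return sk.index_add_(1, h, x * s)              # scatter signed features into the sketch

    def forward(self, img_feat, clin_feat):
        # img_feat: (B, dim_img), clin_feat: (B, dim_clin)
        fft1 = torch.fft.rfft(self._sketch(img_feat, self.h1, self.s1), dim=1)
        fft2 = torch.fft.rfft(self._sketch(clin_feat, self.h2, self.s2), dim=1)
        fused = torch.fft.irfft(fft1 * fft2, n=self.out_dim, dim=1)
        return fused                                   # (B, out_dim) fused representation
```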
2.5. Loss function
We used the following loss function for backpropagation to update the model parameters:
where $o_i$ is the predicted risk of the $i$-th patient, $t_i$ is the patient's event or follow-up time, and $e_i$ is the patient's event status. Maximizing this function through network parameter learning yields the largest partial likelihood estimate. Compared with the simple Cox loss function, this loss represents the overall concordance more accurately.
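For reference, a minimal sketch of the standard Cox negative log partial likelihood is given below; the loss used here is a variant of this objective intended to reflect overall concordance more accurately, so its exact form may differ from this baseline.

```python
import torch

def cox_neg_log_partial_likelihood(risk, time, event):
    """Baseline Cox loss for reference. risk: (N,) predicted risks o_i;
    time: (N,) event/follow-up times t_i; event: (N,) indicators e_i (1 = event, 0 = censored)."""
    order = torch.argsort(time, descending=True)      # sort so each risk set is a prefix
    risk = risk[order]
    event = event[order].float()
    log_risk_set = torch.logcumsumexp(risk, dim=0)    # log sum_{j: t_j >= t_i} exp(o_j)
    return -((risk - log_risk_set) * event).sum() / event.sum().clamp(min=1.0)
```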
3. Results
3.1. Implementation details
We set the initial learning rate to 0.0001 and the maximum number of epochs to 100. We updated the gradient and adjusted the step size with the Adam optimizer with a weight decay of 0.0005. We randomly selected 20% of the data in the training set as the validation set to control the early stopping of the model. We tested the model performance and tuned the hyperparameters using 3-fold cross-validation. The k in knn_graph was set to 10, and the number of GAT layers was set to 3. We then verified the generalizability of the model using the TCGA external test set. We trained two other models with the same data partition: 1) GATSurvMIL only with image features and 2) a Cox model only with clinical features.
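The optimization setup described above can be sketched as follows; the model here is a stand-in placeholder rather than GATSurvMIL, the patience value is an assumption, and the training/validation loop bodies are omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 1)   # placeholder risk predictor standing in for GATSurvMIL
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)

best_val, patience, bad_epochs = float("inf"), 10, 0   # patience value is assumed
for epoch in range(100):                               # maximum of 100 epochs
    # ... one training pass over the training split goes here ...
    val_loss = 0.0                                     # replace with the loss on the held-out 20%
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                     # early stopping on the validation set
            break
```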
3.2. Model prognostic results
We used the C-index as the measure of the model's prognostic performance. We first built an overall survival (OS)-based model with death as the patient outcome. The model achieved a mean C-index of 0.668 in 3-fold cross-validation and a C-index of 0.726 on the TCGA test set. Both the fit and the generalization of the model improved markedly compared with models using only imaging or only clinical features.
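As an illustration, the C-index can be computed with the lifelines package (an assumed tooling choice); the risk scores are negated because higher predicted risk should correspond to shorter survival.

```python
from lifelines.utils import concordance_index

def evaluate_c_index(times, predicted_risks, events):
    """times: follow-up/event times; predicted_risks: model risk scores; events: 1/0 indicators."""
    return concordance_index(times, [-r for r in predicted_risks], event_observed=events)
```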
We then built a disease-free survival (DFS)-based model with recurrence, metastasis, or death as the patient outcome. The 3-fold cross-validation average C-index of the model was 0.645, and the C-index on the TCGA test set was 0.617. Because lymph node metastasis is an important factor affecting treatment, we also modeled patients separately according to lymph node status (LMN). The performance of these two models is similar to that of the image-only model. The specific results are shown in Table 3.
3.3. Comparison of the C-index with other MIL models
We reproduced the experimental results of DeepAttnMISL [26] and DeepGCNMIL [28] and then tested the proposed model with the same parameters. We also fused each patient's pathological features using CBP before the output layer of the model. The results in Table 4 show that the cross-validation results of our model (mean = 0.668) are much better than those of DeepAttnMISL (mean = 0.606) and DeepGCNMIL (mean = 0.592). The cross-validation standard deviation of our model was 0.0713, indicating that it is more stable than the other models.
3.4. Multivariate Cox analysis
We used the image-only model's predictions together with the clinical characteristics as the input of a multivariate Cox model to analyze the primary factors affecting patient survival and prognosis. The results are shown in Table 5. The OS-based and DFS-based Cox models give similar results: the image model's predicted value and tumor stage are strongly associated with patient prognosis. The hazard ratio of the predicted value (HR = 4.15) indicates that the imaging factor has a stronger impact on patient prognosis than the other covariates.
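A sketch of this multivariate Cox analysis with lifelines is shown below; the column names are illustrative placeholders rather than the exact variable names used in Table 5.

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_multivariate_cox(df: pd.DataFrame) -> CoxPHFitter:
    """df is expected to contain 'time', 'event', the image model's 'predicted_risk',
    and clinical covariates such as 'age', 'stage', 'ER', 'PR', and 'lymph'."""
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    cph.print_summary()   # reports hazard ratios (exp(coef)) and p-values per covariate
    return cph
```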
3.5. Kaplan‒Meier survival curves
Currently, clinicians assess a patient's prognostic risk based on the patient's clinical characteristics. To verify the predictive performance of the proposed model, we drew Kaplan–Meier survival curves for the patients in the test set, shown in Figure 3. Since prognostic analysis is time dependent, we calculated specificity and sensitivity with time-dependent ROC curves using Heagerty's method [39] and used the value corresponding to the year with the largest Youden index as the cutoff for the risk grouping.
Figure 3(a, b) shows the OS Cox model built from clinical features, and Figure 3(c, d) shows the GATSurvMIL model fused with imaging features. The high-risk and low-risk groups in WCH could not be significantly separated by clinical characteristics (P = 0.19), but GATSurvMIL significantly separated the high- and low-risk groups in both WCH (P = 0.034) and TCGA (P = 0.0016). Figure 3(e–h) shows the results of the corresponding DFS models: the patients were indistinguishable by clinical features (P = 0.34), and after 20 months the high-risk group even appeared to fare better, whereas the GATSurvMIL fusion model could preliminarily separate the patients (P = 0.079). We performed the same analysis for the models built for LMN+ and LMN− patients, and the results are shown in Figure 4.
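A minimal sketch of this risk-group comparison (Kaplan–Meier curves plus a log-rank test) with lifelines is given below; the cutoff is assumed to be supplied from the time-dependent ROC / Youden index analysis described above.

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

def plot_risk_groups(times, events, risks, cutoff):
    """times, events, risks: 1-D numpy arrays; cutoff: risk threshold from the Youden index."""
    high = risks >= cutoff
    ax = plt.subplot(111)
    for mask, label in [(high, "high risk"), (~high, "low risk")]:
        KaplanMeierFitter().fit(times[mask], events[mask], label=label).plot_survival_function(ax=ax)
    result = logrank_test(times[high], times[~high], events[high], events[~high])
    ax.set_title(f"log-rank P = {result.p_value:.3g}")
    plt.show()
```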
4. Discussion
Breast cancer has become one of the most common malignant tumors worldwide. HER2-positive tumors are aggressive and prone to recurrence and metastasis, and the prognosis of these patients is poor, making such tumors a major threat to public health. Unfortunately, no effective model or product is currently available to predict the prognosis of HER2-positive patients. At present, the C-index of WSI-based breast cancer prognosis prediction is generally between 0.5 and 0.7, and the performance of image-based models still needs to be improved.
Based on the performance of GATSurvMIL on DFS, the proposed model can effectively integrate the advantages of images and clinical features to obtain better performance when clinical features alone cannot accurately represent patient risk. We modeled patients separately by lymph node status, where clinical features barely describe a patient's prognostic risk, and the proposed model still combined the strengths of imaging and clinical features. We believe that when the number of patients is small, the image features retain a strong ability to represent patient risk because pathological images contain rich information, whereas clinical features cannot accurately represent prognosis because of differences in patient distribution. When the number of patients is large, clinical features become more discriminative, and the fused-feature model improves further over the WSI-only model. Therefore, there is still much room for improvement in the prognostic analysis of pathological images. The proposed model is based on MIL, so it does not require pixel-level manual annotation and can be directly applied to most current datasets, and the introduced GAT and attention network provide better interpretability along with improved model performance. Predicting patient prognosis from pathological images remains at the experimental stage; as patient numbers and follow-up data increase, more advanced methods and models should appear. Compared with using a single feature type, multimodal models can predict patient status from a more comprehensive perspective. However, multimodal data require patients to have undergone multiple diagnostic examinations and to have sufficient follow-up time, which is rare, and because WSI, MRI, and clinical features differ greatly in form, finding a suitable feature fusion method is also difficult.
5. Conclusions
This paper proposes an MIL-based multimodal model to predict the prognosis of HER2-positive breast cancer patients. The proposed model can be directly applied to most current hospital data, which can support personalized treatment and assist doctors' decision-making. However, this study has some limitations: 1) Since TCGA provides few clinical features, we included only 5 clinical variables. 2) Because survival analysis requires long-term follow-up records, the amount of data is small. 3) The "black box" problem of neural networks remains unavoidable, even though we added GAT and attention modules. We will continue working to improve the prediction of patient prognosis and plan to study more multimodal models and algorithms in the future by combining features such as gene expression and radiomics. Finally, we look forward to additional multicenter datasets and longer follow-up records to improve model performance.
Acknowledgments
This work was supported by Natural Science Foundation of Sichuan, China (2023NSFSC1393); Scientific Research Starting Project of SWPU (2021QHZ001); Sichuan Nanchong Science and Technology Bureau (SXQHJH046); West China Hospital of Sichuan University-University of Electronic Science and Technology of China Medical-Industrial Integration Interdisciplinary Talent Training Fund (ZYGX2022YGRH015); 1.3.5 Project for Disciplines of Excellence, West China Hospital, Sichuan University (ZYJC21035) and Fund of Sichuan Provincial Department of Science and Technology (2023YFH0095).
Conflict of interest
The authors declare there is no conflict of interest.