Cutaneous squamous cell carcinoma (cSCC) is one of the most frequent types of cutaneous cancer. The composition and heterogeneity of the tumor microenvironment significantly impact patient prognosis and the ability to practice precision therapy. However, no research has been conducted to examine the design of the tumor microenvironment and its interactions with cSCC.
Material and Methods
We retrieved the datasets GSE42677 and GSE45164 from the GEO public database, integrated them, and analyzed them using the SVA method. We then screened the core genes using the WGCNA network and LASSO regression and checked the model's stability using the ROC curve. Finally, we performed enrichment and correlation analyses on the core genes.
Results
We identified four genes as core cSCC genes: DTYMK, CDCA8, PTTG1 and MAD2L1, and discovered that RORA, RORB and RORC were the primary regulators in the gene set. The GO semantic similarity analysis results indicated that CDCA8 and PTTG1 were the two most essential genes among the four core genes. The results of correlation analysis demonstrated that PTTG1 and HLA-DMA, CDCA8 and HLA-DQB2 were significantly correlated.
Conclusions
Examining the expression levels of the four core genes in cSCC aids our understanding of the disease's pathophysiology. Additionally, the core genes were found to be strongly correlated with immune regulatory genes, suggesting novel avenues for cSCC prevention and treatment.
Citation: Jiahua Xing, Muzi Chen, Yan Han. Multiple datasets to explore the tumor microenvironment of cutaneous squamous cell carcinoma[J]. Mathematical Biosciences and Engineering, 2022, 19(6): 5905-5924. doi: 10.3934/mbe.2022276
1. Introduction
In the 21st century, a new virus known as SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) emerged in the city of Wuhan in central China. The novel coronavirus has adversely affected public health and socio-economic activity, as it spread rapidly to different regions of the world within a short span of time. In response to the spread of the infection, the World Health Organization (WHO) declared COVID-19 a pandemic on March 11, 2020. Following this declaration, most countries announced strict lockdown policies, such as the shutdown of national and international airports, travel restrictions and the closure of non-essential services, to control the dissemination of the infection. The fear created by the COVID-19 outbreak has had significant consequences for societal health and well-being, as well as for the global economy [1].
In this situation, the internet plays a key role in connecting individuals with the rest of the world. Most individuals depend on the internet to follow the content circulating on social media platforms regarding the COVID-19 pandemic. Social media platforms such as Facebook, Twitter, Instagram and Reddit allow people to communicate and to share information and other data [2,3]. Because Twitter provides real-time coverage of current events that fluctuate rapidly and locally, like the COVID-19 pandemic, it offers comprehensive and valuable information on the present state of public health and disease spread in a region. Several studies have shown that tweets have been a vital source for the early identification, warning and intervention of infectious diseases like cholera [4], seasonal conjunctivitis [5], Ebola [6], the bubonic plague [7] and E. coli [8]. The WHO has reported that more than 60% of epidemics can be identified early via social media data [6]. Moreover, mining social media data can assist in the early identification of public health emergencies.
The identification of a personal health mention (PHM) is one of the crucial steps in public health surveillance. The basic aim of PHM identification is to determine the health condition of a person from online text. To identify PHMs for the COVID-19 pandemic on Twitter, Luo et al. [9] developed a masked attention model. First, a COVID-19 PHM dataset consisting of 11,231 tweets from February to May 2020 was built. These tweets were then annotated with four types of COVID-19-related conditions. The experimental results revealed that the proposed approach attained superior performance when compared with other approaches. To address the class imbalance problem, Luo et al. [10] proposed a dual convolutional neural network (CNN) framework. The approach was evaluated on a COVID-19 PHM dataset containing more than 11,000 annotated tweets, and the results indicate that it can mitigate the impact of class imbalance and attain promising outcomes.
As the mining of social media data reduces the cost of information acquisition and analysis, it enhances the responsiveness of public health sectors and provides a new perspective on public health [11]. Therefore, mining tweets related to the COVID-19 pandemic can also assist the public health sector in monitoring the status of COVID-19 infection in a timely manner and in making appropriate decisions to control the spread of infection. In the present pandemic situation, people rely heavily on social media, as it carries updates of all kinds, such as updates on COVID-19 and on the stock market [12]. Despite the large volume of content on social media, this content can affect people's lives in contradictory ways, exerting either negative or positive psychological influences [13]. In most cases, information related to COVID-19 circulating on social media has guided people toward wrong decisions. Therefore, converting these posts into resources is highly valuable for making decisions regarding business and policy deployment, which can be done efficiently through sentiment analysis and social mining [14].
Computer technologies play a vital role in the sentiment analysis of social media data, creating opportunities to fight outbreaks of infectious disease [15]. Several researchers have indicated that outbreaks can be effectively controlled if experts make use of social media data [16]. Machine learning approaches are now used in a variety of fields, such as opinion mining, image analysis, wireless sensor networks and fake news identification, because they can learn from data without being explicitly programmed. Owing to their versatility and accuracy, machine learning approaches have gained popularity in sentiment analysis in recent years.
To improve user satisfaction in online shopping, Zhao et al. [17] recommended a machine learning framework, the Local Search Improvised Bat Algorithm based Elman Neural Network (LSIBA-ENN), for the efficient sentiment analysis of online product reviews. The major components of the framework are data collection, pre-processing, feature extraction and polarity classification. In the first step, online customer reviews of a product are gathered by using web scraping tools. In the next step, the scraped data are pre-processed. Then, the processed data undergo term weighting (TW) and feature selection (FS) by applying a hybrid mutation-based earthworm algorithm (HMEWA) and Log Term Frequency-based Modified Inverse Class Frequency (LTF-MICF). In the last step, the customer reviews are classified as positive, neutral or negative by passing the HMEWA output to the LSIBA-ENN. The performance of the framework was analyzed on two benchmark datasets, and the results revealed that the LSIBA-ENN outperformed the other approaches. Moreover, the Elman neural network attained a recall of 87.79 when the LTF-MICF scheme was used, compared with recalls of 86.04, 85.48, 84.03 and 83.55 when the TF-DFS (Term Frequency-Distinguishing Feature Selector), TF-IDF (Term Frequency-Inverse Document Frequency), TF (Term Frequency) and W2V (Word2Vec) schemes were used, respectively.
To predict currency exchange rates from the analysis of Twitter messages, Naeem et al. [18] recommended a machine learning approach. First, a dataset was built from the data on the Forex Trading website. The dataset contains exchange rates between the USD (United States Dollar) and the PKR (Pakistani Rupee), as well as finance-related tweets collected from the Pakistan business community. Data pre-processing was then applied to the raw dataset, after which response variable labels were assigned to the processed data. Further, the authors used principal component analysis and linear discriminant analysis to better represent the dataset in a three-dimensional vector space. To optimize the data, clusters formed through a sampling approach were used. Finally, classifiers such as a random forest, naïve Bayes, a simple logistic classifier, a support vector machine and bagging were evaluated; the results indicate that the simple logistic classifier performed best, with an accuracy of 82.14%.
In recent years, deep learning approaches have been able to learn complex semantic representations of text automatically from data without any feature engineering. Deep learning approaches have also advanced the state of the art in many sentiment analysis tasks, such as the sentiment classification of sentences, sentiment lexicon learning and sentiment extraction. These approaches are therefore emerging as powerful computational models for sentiment analysis. To perform precise sentiment classification of the Chinese language, Li et al. [19] developed a novel deep learning approach known as HEMOS. The authors also analyzed the importance of accounting for humor, slang and pictograms when processing social media data. First, 576 frequently used slang expressions from the internet were collected as a slang lexicon. Then, a Chinese emoji lexicon was constructed from the textual features obtained by converting 109 Weibo emojis. Next, the basis for a four-level sentiment classification of Weibo posts was created by adding "optimistic" and "pessimistic" humorous types to the two polarity annotations. The positive and negative lexicons were then applied to the proposed attention-based bi-directional long short-term memory recurrent neural network (AttBiLSTM) for the precise sentiment classification of the Chinese language. Finally, the approach was evaluated on a small labeled dataset; the results indicate that it can substantially improve on the performance of standard methods in assessing sentiment polarity on Weibo.
Pathak et al. [20] developed a deep learning-based topic-level sentiment analysis model for the efficient discovery and analysis of topics discussed on social media platforms, and for making efficient decisions in a timely manner. The model extracts topics from sentence-level data by means of online latent semantic indexing with a regularization constraint. It then applies a topic-level attention mechanism to carry out sentiment analysis in a long short-term memory (LSTM) network. Because the model supports topic-level sentiment analysis and provides both scalable and dynamic topic modeling of streaming short text data, it is considered distinctive. For in-domain topic-level sentiment analysis, the model achieved a recall of 0.879 on the SemEval-2017 Task 4 Subtask B dataset. On out-of-domain data, the model achieved average recalls of 0.794, 0.824 and 0.846 on datasets built from Twitter by using the hashtags #facebook, #bitcoin and #ethereum, respectively. Moreover, the scalability of the model was analyzed in terms of throughput (topics determined per second) and the average time required to construct feature vectors.
To capture current market conditions from Twitter sentiment by using Google's bidirectional transformer BERT, Leow et al. [21] suggested two novel approaches known as Sentimental All-Weather and Sentimental MPT. The authors used a genetic algorithm to maximize cumulative returns and minimize volatility. The approaches were trained on tweets about the USA stock market collected from August 2018 to December 31, 2019. The empirical results revealed that the suggested models attained superior performance on common measures of portfolio performance, namely the cumulative returns, the Sharpe ratio and the value-at-risk, as compared to benchmarks such as the MPT model, the buy-and-hold SPY index and the CRB approach for the All-Weather Portfolio.
Hu et al. [22] carried out a systematic, data-oriented review of mobility data usage in recent publications on COVID-19 and human mobility, to assist policymakers and reviewers in making decisions related to the COVID-19 pandemic and other infectious disease outbreaks. The authors identified the public transit system, social media-derived mobility data and mobility data as three primary sources of mobility data. Further, the authors described four distinct approaches to determining human mobility: social activity patterns, social media-derived mobility data, index-based mobility data and public transit-based flow. To compare the characteristics of the mobility data, the authors assessed quality, data privacy, space-time coverage, accessibility and high-performance data storage. The authors also presented challenges and future directions for utilizing mobility data.
To address the partial domain adaptation challenge, Li et al. [23] suggested a model known as the deep residual correction network (DRCN). To improve the adaptation from source to target and weaken the influence of irrelevant source classes, the authors plugged one residual block into the source network together with the task-specific feature layer. The plugged residual block, which consists of several fully connected layers, not only deepens the basic network but also improves its capability for feature representation. Moreover, the authors designed a weighted class-wise domain alignment loss to couple the two domains. Extensive experiments on partial, traditional and fine-grained cross-domain visual recognition revealed that the DRCN outperforms other deep domain adaptation approaches.
From the above discussion, it can be observed that different machine learning and deep learning approaches have been used to perform sentiment analysis efficiently on social media data. Sentiment analysis for the present COVID-19 pandemic is also important, as COVID-19 remains a controversial global topic on social media [24]. To observe the sentiments in Bangladeshi people's comments on COVID-19 posts on Facebook, Pran et al. [25] proposed a deep learning approach utilizing a CNN. The architecture uses three classes, namely, analytical, depressed and angry, to investigate the emotions. It was evaluated on a dataset created in the Bangla language, and the results indicate that the CNN attained an accuracy of 97.24% in analyzing COVID-19 posts posted on Facebook by Bangladeshi people.
To perform sentiment analysis of live-stream tweets related to COVID-19 and to predict COVID-19 cases, Das et al. [26] recommended a novel deep learning approach. First, the authors built a large corpus of COVID-19 tweets. They then performed trend analysis and polarity classification by splitting the data into training and testing sets. The outcome of the trend analysis then assists training, producing an incremental learning curve for the neural network. The model was evaluated, and the results indicate that it not only attained an accuracy of 90.67% but also maintained stable performance when validated against various prevalent open-source text corpora.
Different deep learning approaches have been used to address issues related to COVID-19 through the sentiment analysis of social media data. It can be observed from the discussion that approaches using a CNN architecture have limitations that can affect the accuracy of the model [27]. Owing to its ability to capture complex patterns and deliver better performance, several studies have used the rectified linear unit (ReLU) activation function in their neural network models [28]. However, the ReLU function can cause dying neurons, which may restrict the learning process of a neural network. To overcome this limitation, architectures such as LSTM have been recommended. As LSTM helps in solving sequential and time-series problems with superior results, it is considered an important architecture [29]. Therefore, our proposed approach uses an LSTM network because of its ability to learn text sequences and find associations between words or phrases in sentiment analysis. Moreover, it can be used to enrich the semantic information of tweets and to improve the efficiency of the learning model, which can yield superior performance on datasets where a network would otherwise fail to capture the right trends in the data.
The main contributions of the present study are as follows:
i. An LSTM model has been proposed for the classification of COVID-19 related tweets.
ii. To enhance the overall performance of the model, a firefly optimization algorithm is proposed to tune the hyperparameters of the LSTM model.
iii. Further, the performance of the proposed model is compared with other state-of-the-art ensemble and machine learning-based methods, such as the decision tree (DT), the multi-layer perceptron (MLP), k-nearest neighbors (KNN), the random forest (RF), AdaBoost, gradient boosting (GBoost), bagging and extreme gradient boosting (XGBoost), to demonstrate the efficacy of the developed model.
The remainder of this paper is organized as follows. Section 2 describes some of the previous studies carried out on the analysis of Twitter-related data, along with their limitations. Section 3 describes the proposed methodology. Section 4 outlines the experimental setup and the dataset used in the proposed approach. Section 5 provides a comparative analysis of the results of the proposed approach and other state-of-the-art models. Lastly, Section 6 concludes the work with some important future directions.
2. Related work
The outbreak of the COVID-19 pandemic has had a tremendous impact on the health and socio-economic activities of the public all over the world. With the rapid rise in the number of COVID-19 cases, misleading information posted on social media has become a major cause of depression, panic and anxiety among people. Among the social media platforms, Twitter is the fastest at spreading COVID-19 news [30]. This section presents a brief discussion of some of the studies carried out on the analysis of Twitter COVID-19 data by using machine learning and deep learning approaches.
Using the R programming language, Kaur et al. [31] analyzed Twitter data retrieved with keywords such as coronavirus, new case, COVID-19, recovered and deaths. In addition, the authors suggested a novel algorithm, the hybrid heterogeneous support vector machine (H-SVM), to carry out sentiment classification. The efficacy of the H-SVM approach was evaluated by using various metrics, such as precision, accuracy, F1-score and recall. From the empirical results, it was concluded that the H-SVM, with precision, recall and F1-score values of 0.86, 0.69 and 0.77, respectively, performed better than linear support vector machines (SVMs) and neural networks.
To perform sentiment analysis of the coronavirus Twitter data of eight different countries, Basiri et al. [32] suggested a novel approach that combines four deep learning approaches and one conventional machine learning approach. In addition, to detect changes in sentiment patterns at distinct places and times, Google Trends was used to analyze searches related to the coronavirus. The findings showed that COVID-19 drew the attention of the public in distinct countries with differing intensities at different times. Moreover, the news and events in each country, such as the number of deaths, infected cases and recoveries, matched the sentiment in the tweets from that country.
To classify tweets into positive, negative or neutral sentiments, Rahman and Islam [33] proposed three distinct ensemble classifiers: a stacking classifier, a bagging classifier and a voting classifier. For the classification, 12,000 tweets collected from the United Kingdom (UK) were meticulously annotated by three separate reviewers. The study concluded that the stacking classifier performed best, with an F1-score of 83.5%, as compared to the bagging and voting classifiers.
Rustam et al. [34] analyzed tweets related to COVID-19 by using supervised learning approaches such as a support vector classifier, an RF, a DT, XGBoost and an extra trees classifier. Data preprocessing techniques and the TextBlob library were used to clean the data and extract the sentiments. In addition, the authors suggested a feature set for evaluating the models, obtained by integrating term frequency-inverse document frequency and bag-of-words. The suggested approaches and an LSTM approach were evaluated by using various performance metrics, and the results indicate that the extra trees classifier surpassed the other supervised machine learning models and the LSTM approach by attaining an accuracy of 93%. Moreover, the results were compared with the VADER sentiment analysis technique to show the efficiency of the suggested feature set.
To identify misleading information regarding COVID-19 on Twitter, Abdelminaam et al. [35] recommended modified LSTM and modified GRU deep learning approaches. The parameters of the proposed models were optimized by using the Keras tuner. The suggested approaches, along with machine learning approaches such as the DT, KNN, SVM, RF, logistic regression and naïve Bayes, were evaluated on four standard datasets, and two feature extraction methods were used to extract features from these datasets. From the results, it was concluded that the suggested framework obtained better accuracy in detecting misleading tweets related to COVID-19 than the standard machine learning approaches.
In [36], the authors used an ensemble learning approach combining bidirectional encoder representations from transformers (BERT) with a bidirectional LSTM (Bi-LSTM) to perform sentiment analysis of COVID-19-related tweets. The proposed approach comprises two stages. In the first stage, the BERT model [37] acquires domain knowledge from COVID-19 data and is adjusted with a sentiment word dictionary. In the second stage, a Bi-LSTM approach [38] is used to process the data bidirectionally. Finally, the BERT and Bi-LSTM approaches are combined by the ensemble approach to classify the sentiment into positive and negative categories. Experiments were conducted, and the results indicate that the proposed approach attains better results than other standard approaches. Table 1 presents other works carried out on the analysis of sentiments regarding COVID-19.
Table 1. Other works performed on the analysis of sentiments related to COVID-19.
| Author & Year | Approach | Dataset used | Aim of the work | Results | Limitation | Ref |
| Chintalapudi et al. 2021 | Bidirectional Encoder Representations from Transformers (BERT) model | 3090 tweets obtained from github.com within the time span of March 23, 2020 to July 15, 2020 | Introducing fact-checkers on social media platform to overwhelm misleading propaganda about COVID-19 infection | Accuracy = 89% | Addresses only single-country people's emotions on social media sites on present pandemic | |
| | | Tweets related to COVID-19 within the time span of March 2020 to September 2020 from Maharashtra and Delhi states having highest COVID-19 confirmed cases | To analyze sentiment of people in regions having highest COVID-19 cases | Most of the tweets classified as positive tweets with highest confidence level | Does not consider tweets from distinct regions and countries | |
This section outlines the architecture of the proposed LSTM approach and firefly optimization algorithm used for the classification of tweets related to COVID-19 as positive and negative sentiments.
3.1. LSTM architecture
The LSTM network developed by Hochreiter and Schmidhuber [44] is a special type of recurrent neural network which is capable of processing sequences of data. LSTM contains memory cells which are used for holding data for a long period. Due to the recursive nature of the cells, LSTM networks can store and connect previous information to the data acquired in the present. Basically, LSTM networks consist of three gates, known as the input gate, forget gate and output gate. The internal architecture of the LSTM network is shown in Figure 1.
In the above figure, $x_t$ represents the current input, the previous cell state is represented by $C_{t-1}$ and the new cell state is represented by $C_t$. The previous and current outputs are respectively represented by $h_{t-1}$ and $h_t$. The content of the cell state is modified by the forget gate, which is placed beneath the cell state. The output of the forget gate is a number between 0 and 1. If the output of the forget gate is 0, the information is not kept in the cell; the closer the output is to 1, the more of the information is kept. The information that enters the cell state is determined by the input gate. Lastly, the information that passes on to the subsequent hidden state is determined by the output gate. Eqs (1) to (6) are the mathematical formulas used to represent the LSTM network.
The information that needs to be stored has been selected by the input gate by using Eq (1):
it=σ(Wi[ht−1,xt]+bi)
(1)
The information that needs to be abandoned or kept has been selected by the forget gate by using Eq (2):
ft=σ(Wf[ht−1,xt]+bf)
(2)
Finally, the output is determined by using Eq (3):
Ot=σ(WO[ht−1,xt]+b0)
(3)
where the sigmoid function is represented by σ, the weights of the input, forget and output gates are represented by Wi, Wf and W0, respectively, the input at timestamp t is represented by xt, the biases of the input, forget and output gates are respectively represented by bi, bf and b0 and the input, forget and output gates are respectively denoted by it, ft and Ot, respectively.
The candidate cell state $\tilde{C}_t$ is given by Eq (4):
$$\tilde{C}_t = \tanh(W_c[h_{t-1}, x_t] + b_c) \tag{4}$$
where the tanh layer determines the information that the LSTM network adds to or removes from the previous state, $\tilde{C}_t$ denotes the candidate at timestamp $t$, and $b_c$ and $W_c$ denote the bias (control) and weight parameters of the tanh layer.
The cell state $C_t$ is given by Eq (5):
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{5}$$
The final output state $h_t$ is given by Eq (6):
$$h_t = O_t * \tanh(C_t) \tag{6}$$
where $*$ denotes element-wise multiplication of vectors, $h_t$ denotes the predicted output obtained from the current LSTM block, and $C_t$ represents the memory cell state at timestamp $t$.
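The gate equations can be checked numerically. The following is a minimal NumPy sketch of one LSTM time step following Eqs (1) to (6); it is an illustration only, not the implementation used in the experiments. The function name `lstm_step`, the dictionary layout of the weights and the toy dimensions are assumptions introduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs (1)-(6).

    W maps a gate name ("i", "f", "o", "c") to a weight matrix of shape
    (hidden, hidden + inputs); b maps it to a bias vector of shape (hidden,).
    This layout is an assumption made for the sketch.
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # Eq (1): input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # Eq (2): forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # Eq (3): output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # Eq (4): candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # Eq (5): new cell state
    h_t = o_t * np.tanh(c_t)                 # Eq (6): new hidden state
    return h_t, c_t

# Toy run over a sequence of five random input vectors.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = {g: rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for g in "ifoc"}
b = {g: np.zeros(n_hidden) for g in "ifoc"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the output gate lies in (0, 1) and tanh lies in (−1, 1), every component of the hidden state stays strictly between −1 and 1.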
3.2. Firefly algorithm
In 2007, Yang [45] developed a multimodal nature-inspired metaheuristic algorithm known as the firefly algorithm at Cambridge University. The algorithm was inspired by the flashing behavior of tropical fireflies. The fundamental property that enables the attraction of fireflies is their flashing light [46]. In general, fireflies are more active during summer nights [47,48]. Coupling takes place when a firefly finds another firefly nearby. Usually, male fireflies try to attract the female fireflies on the ground through their signals [49,50]. The female fireflies then emit their own flashing light in response, which produces the flashing patterns of both males and females. Normally, female fireflies are more attracted to male fireflies with brighter flashing lights. The intensity of the flashing varies with the distance from the source. However, in some rare situations, female fireflies are unable to differentiate between male fireflies having the brightest and weakest flashes. According to Yang [51], the following are the idealized principles of the firefly algorithm:
• Regardless of the gender, the fireflies are attracted to each other.
• Less bright fireflies are attracted to brighter ones, as the attractiveness of a firefly is proportional to its brightness. However, the brightness decreases as the distance increases.
• The landscape of the objective function determines the brightness of a firefly.
The two phases of the firefly algorithm have been defined as follows:
i) The brightness of a firefly is based on the light intensity that it emits. Assume that there are $n$ fireflies, and let $f(x_i)$ denote the fitness value of the $i$th firefly. The brightness $\beta_i$ of the $i$th firefly is then chosen as shown in Eq (7):
$$\beta_i = f(x_i) \tag{7}$$
ii) Each firefly has its own attractiveness, with which it attracts the other fireflies in the swarm. The attractiveness between two fireflies varies with the distance $d_{ij}$ between them, as shown in Eq (8):
$$d_{ij} = \lVert x_i - x_j \rVert \tag{8}$$
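Then, the firefly's attractiveness is determined by using Eq (9):
$$\beta = \beta_0 e^{-\gamma r^2} \tag{9}$$
where $\gamma$ denotes the coefficient of light absorption, $r$ denotes the distance between the two fireflies and $\beta_0$ denotes the attractiveness of the firefly at $r = 0$.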
Then, the movement of the firefly with less brightness toward a firefly with more brightness is determined by using Eq (10):
$$x_i^{new} = x_i^{old} + \beta_{ij}\left(x_j(k) - x_i^{old}\right) + \alpha(k)\left(\mathrm{rand}(0,1) - 0.5\right)L \tag{10}$$
Based on the above assumptions, the following pseudocode represents the basic steps of the firefly algorithm.
Algorithm 1: Firefly Algorithm
Step 1: Set size of the population as N and number of iterations as MaxGen
Step 2: Set the parameters α, β0 and γ
Step 3: Define the objective function f(x), where x = (x1, x2, …, xn)
Step 4: Formulate the fitness value of the fireflies from the objective function
Step 5: for k = 1 to MaxGen do
Step 6:     for i = 1 to N do
Step 7:         for j = 1 to N do
Step 8:             if f(xj) < f(xi), then
Step 9:                 move xi toward xj using Eq (10)
Step 10:            end if
Step 11:            Attractiveness changes with distance r between fireflies by exp(−γr²)
Step 12:            calculate the fitness value of a new firefly by updating its light intensity
Step 13:        end for j
Step 14:    end for i
Step 15:    Rank fireflies based on their fitness
Step 16:    Find the best one
Step 17: end for k
Step 18: Stop
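The steps of Algorithm 1 can be sketched in a few lines of Python. The version below is a minimal illustration under stated assumptions, not the configuration used in this study: it minimizes a toy sphere function, treats a lower objective value as brighter, takes L in the movement equation to be the width of the search range, and lets α(k) decay geometrically with a factor of 0.97. The names `firefly_minimize` and `sphere` and all default parameter values are hypothetical.

```python
import numpy as np

def sphere(x):
    # Toy objective used only for illustration: f(x) = sum of squares.
    return float(np.sum(x ** 2))

def firefly_minimize(f, dim, n_fireflies=15, max_gen=50, alpha=0.2,
                     beta0=1.0, gamma=1.0, bounds=(-2.0, 2.0), seed=0):
    rng = np.random.default_rng(seed)
    low, high = bounds
    scale = high - low                      # assumed meaning of L in Eq (10)
    x = rng.uniform(low, high, size=(n_fireflies, dim))
    fit = np.array([f(p) for p in x])       # fitness of each firefly
    for k in range(max_gen):
        alpha_k = alpha * 0.97 ** k         # assumed decay schedule for alpha(k)
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if fit[j] < fit[i]:         # firefly j is brighter than firefly i
                    r = np.linalg.norm(x[i] - x[j])           # Eq (8)
                    beta = beta0 * np.exp(-gamma * r ** 2)    # Eq (9)
                    noise = alpha_k * (rng.random(dim) - 0.5) * scale
                    # Eq (10): move i toward j, then keep it inside the bounds.
                    x[i] = np.clip(x[i] + beta * (x[j] - x[i]) + noise, low, high)
                    fit[i] = f(x[i])
    best = int(np.argmin(fit))              # rank the fireflies and find the best
    return x[best].copy(), float(fit[best])

best_x, best_f = firefly_minimize(sphere, dim=2)
```

In this sketch a firefly moves only when a strictly brighter one exists, so the current best firefly never moves and the best fitness in the population cannot get worse from one generation to the next.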
In the proposed model, the firefly algorithm is used to optimize the parameters of the LSTM network and thereby enhance the classification accuracy. The firefly algorithm is a nature-inspired metaheuristic algorithm with stochastic behavior, which allows it to search the solution set by means of randomization. Heuristic algorithms search for solutions by trial and error, so their probability of obtaining the optimal solution in a realistic amount of time is lower. In metaheuristic algorithms, this probability is higher because the search combines two approaches: exploration, which is the process of finding diversified solutions in the search space, and exploitation, which is the process of finding the best solution among neighboring candidates. Because the firefly algorithm makes use of both, it refines the best-fit solution from the recently produced solutions at the lower level, while its randomization at the higher level makes it less likely to become trapped in local minima. In this way, it helps to overcome the drawbacks of both purely random and purely local searches. Furthermore, the firefly algorithm is a population-based algorithm, as it uses cross-operation to find solutions in the search space. Population-based algorithms, however, generate good parameter values and maintain the balance between exploration and exploitation only when their parameters are properly adjusted. The only parameters that need to be adjusted in the firefly algorithm are the light intensity and the attractiveness. Initially, the parameters of the firefly algorithm are set, the initial population of fireflies is generated randomly, and the fitness of each firefly is then calculated by using Eqs (7) to (9).
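In the training phase, the LSTM network is trained on the dataset with the hyperparameters selected by the proposed firefly algorithm. Then, the fitness of the LSTM is computed as the root mean square error (RMSE), as shown in Eq (11):
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n}\left(O_i - \hat{O}_i\right)^2}{n}} \tag{11}$$
where $O_i$ and $\hat{O}_i$ denote the actual and predicted outputs for the $i$th sample, and $n$ is the number of samples.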
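As a small worked example of Eq (11), the sketch below computes the RMSE for four made-up label and prediction pairs. The function name `rmse` and the sample values are illustrative assumptions and are not taken from the experiments.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences, as in Eq (11)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Squared errors are 0.01, 0.04, 0.04 and 0.16, so the mean is 0.0625.
score = rmse([1, 0, 1, 1], [0.9, 0.2, 0.8, 0.6])
```

Here the mean squared error is 0.0625, so the RMSE is 0.25. A firefly whose hyperparameters yield a lower RMSE would be treated as fitter.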
Finally, the loss function of the model is calculated by using Eq (12):
$$Loss = d_j(t) - h_j(t) \tag{12}$$
where the final required output at $(t-1)$ is denoted by $d_j(t)$. The process terminates if the objective function is satisfied. The overall framework of the proposed LSTM and firefly algorithm for the classification of COVID-19 tweets is depicted in Figure 2.
Figure 2. Overall framework of the proposed methodology.
The specific steps for implementing the LSTM with the firefly algorithm are listed below.
Step 1: Initially, set the parameters of the firefly algorithm, such as the number of fireflies, randomness (α), initial attractiveness, maximum generation, absorption coefficient and number of iterations.
Step 2: Determine the light intensity of each firefly and then evaluate the firefly's attractiveness by using Eq (9).
Step 3: Then, determine the movement of the firefly with less brightness toward the firefly with more brightness by using Eq (10).
Step 4: Further, apply the firefly algorithm to optimize the parameters of the LSTM network, such as the learning rate, number of hidden layers, batch size, activation function and epochs.
Step 5: Train the LSTM network by using the optimized parameters and evaluate the fitness function.
Step 6: Repeat Steps 2 to 5 until the termination condition is satisfied.
Step 7: In the final step, compare the performance of the optimal proposed model with other standard machine learning and ensemble machine learning approaches by using evaluation metrics such as the accuracy, F1-score, precision, recall and area under the receiver operating characteristic curve (AUC-ROC).
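Step 4 requires each firefly position to be translated into a concrete set of LSTM hyperparameters before the network can be trained in Step 5. One possible encoding is sketched below: each firefly is a vector in [0, 1]^5, and each coordinate is mapped to one hyperparameter. The search ranges, the option lists and the names `SEARCH_SPACE` and `decode` are hypothetical choices made for illustration; the ranges actually used in the study are those reported in Table 2.

```python
# Hypothetical search space; the values here are illustrative assumptions only.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),            # sampled on a log scale
    "hidden_layers": (1, 3),                  # inclusive integer range
    "batch_size": [16, 32, 64, 128],
    "activation": ["tanh", "relu", "sigmoid"],
    "epochs": (10, 100),                      # inclusive integer range
}

def _pick(options, u):
    # Map u in [0, 1] to one entry of a list of options.
    return options[min(int(u * len(options)), len(options) - 1)]

def _scale_int(bounds, u):
    # Map u in [0, 1] to an integer in the inclusive range given by bounds.
    low, high = bounds
    return low + min(int(u * (high - low + 1)), high - low)

def decode(position):
    """Translate a firefly position in [0, 1]^5 into LSTM hyperparameters."""
    p = [min(max(v, 0.0), 1.0) for v in position]   # clamp to the unit cube
    low, high = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": low * (high / low) ** p[0],
        "hidden_layers": _scale_int(SEARCH_SPACE["hidden_layers"], p[1]),
        "batch_size": _pick(SEARCH_SPACE["batch_size"], p[2]),
        "activation": _pick(SEARCH_SPACE["activation"], p[3]),
        "epochs": _scale_int(SEARCH_SPACE["epochs"], p[4]),
    }

params = decode([0.5, 0.5, 0.5, 0.5, 0.5])
```

With such an encoding, the fitness of a firefly would be obtained by training an LSTM with `decode(position)` and evaluating it on held-out data, which is the expensive step repeated in Step 6.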
4. Experimental setup and dataset
All of the experiments in this study were performed in a Jupyter Notebook using the Python programming language and the Anaconda distribution. The system configuration used for carrying out the experiments was as follows: an HP Pavilion system with the Windows 10 operating system, 16 GB of RAM, 160 GB of programming space allocated by the NVIDIA Corporation and a 10th-generation Intel Core i7 processor with speeds of 1.80 GHz and 2.30 GHz. Moreover, the seaborn, matplotlib and statsmodels packages were used to perform data visualization and statistical analysis of the Twitter data.
4.1. Dataset description
In this study, the dataset of COVID-19 tweets was obtained from IEEE DataPort [52]. For this dataset, entity recognition and translation were done by using a fully automated algorithm that makes use of artificial intelligence with sentiment analysis. The dataset contains 5,016 entities obtained from 1,866 messages collected from July 15, 2021 to August 10, 2021. Out of the 1,866 tweet messages, 990 tweets contain one or more location entities, and 1,322 location entities were identified in the city, continent, country region, language and state entity categories. Initially, data pre-processing was applied to the dataset to improve the training process. Data pre-processing consisted of applying label encoding to the target column to convert the labels into numeric form and applying a min-max scaler transformation to the independent columns.
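The two pre-processing operations can be illustrated with a minimal pure-Python sketch. It is not the code used in the study, which presumably relied on library implementations; the helper names `label_encode` and `min_max_scale` and the sample values are assumptions introduced here.

```python
def label_encode(labels):
    """Replace each distinct label with an integer, in sorted label order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

def min_max_scale(column):
    """Rescale a numeric column linearly to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:                 # a constant column carries no information
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]

encoded, classes = label_encode(["positive", "negative", "positive"])
scaled = min_max_scale([2, 4, 6])
```

In this example the labels become 1, 0, 1 (with "negative" mapped to 0) and the column becomes 0.0, 0.5, 1.0. In practice, the scaling limits should be estimated on the training portion only and then reused for the test portion.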
4.2. Data partitioning
To avoid overfitting, the dataset was divided into two portions, i.e., training and testing. The training tweets were used to identify the patterns in the data and minimize the error rates. The testing tweets were used to evaluate the performance of the model. Of the tweets, 80% were used for training and 20% were used for testing.
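An 80/20 partition of this kind can be sketched as follows. The shuffling seed, the function name `train_test_split` and the placeholder tweets are assumptions for illustration; the study does not state how the split was randomized.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle the indices with a fixed seed and hold out a fraction for testing."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    n_test = round(len(items) * test_fraction)
    test_idx = set(order[:n_test])
    train = [items[i] for i in range(len(items)) if i not in test_idx]
    test = [items[i] for i in order[:n_test]]
    return train, test

# Placeholder data with the same number of messages as the dataset.
tweets = [f"tweet {i}" for i in range(1866)]
train, test = train_test_split(tweets)
```

For 1,866 messages, this rounding rule yields 1,493 training tweets and 373 testing tweets.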
5. Result analysis
This section describes the performance metrics used in the evaluation of the model and the analysis of the results obtained using the proposed approach.
5.1. Performance metrics
The performance of the proposed model has been evaluated by using various performance metrics, such as accuracy, precision, F1-score, recall and AUC-ROC. The mathematical formulas used for the representation of the accuracy, precision, recall and F1-score are shown in Eqs (13) to (16):
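$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \tag{13}$$
$$Precision = \frac{TP}{TP + FP} \tag{14}$$
$$Recall = \frac{TP}{TP + FN} \tag{15}$$
$$F1\text{-}score = \frac{2 \cdot P \cdot R}{P + R} \tag{16}$$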
In the above equations, true positive, false positive, true negative and false negative are represented by TP, FP, TN and FN, respectively. P and R are used to represent the precision and recall.
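Eqs (13) to (16) can be verified on a small confusion matrix. The sketch below uses made-up counts (TP = 8, FP = 2, TN = 9, FN = 1) rather than results from this study, and the function name `classification_metrics` is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)                  # Eq (13)
    precision = tp / (tp + fp) if (tp + fp) else 0.0            # Eq (14)
    recall = tp / (tp + fn) if (tp + fn) else 0.0               # Eq (15)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)      # Eq (16)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=8, fp=2, tn=9, fn=1)
```

For these counts, the accuracy is 17/20 = 0.85, the precision is 8/10 = 0.80, the recall is 8/9 ≈ 0.889 and the F1-score is 16/19 ≈ 0.842.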
5.2. Comparative analysis of the results
In this section, the performance of the proposed approach is analyzed alongside other state-of-the-art models, namely the decision tree (DT), multilayer perceptron (MLP), k-nearest neighbors (KNN), random forest (RF), AdaBoost, gradient boost (GBoost), bagging, extreme gradient boost (XGBoost) and the LSTM network. The hyperparameters of the proposed approach were fine-tuned using the firefly algorithm. Table 2 presents detailed descriptions of the parameters used to train the proposed model and the other state-of-the-art models.
Table 2. Descriptions of parameters used for training the model.
Figure 3(a) and 3(b) show the accuracy and loss curves, respectively, for the LSTM and the proposed approach. From Figure 3(a), it is evident that the proposed LSTM + Firefly approach attained better training and testing accuracy than the LSTM approach from the initial epoch onward. It is also evident from Figure 3(b) that the training and testing loss of LSTM + Firefly was lower than that of the LSTM model.
Figure 3. (a) Accuracy curve of LSTM and proposed approach. (b) Loss curve of LSTM and proposed approach.
Figure 4 presents a comparative analysis of the training and testing accuracy of the proposed approach and the other state-of-the-art models. From the figure, it can be concluded that the proposed approach obtains better training and testing accuracy than the DT, MLP, KNN, RF, AdaBoost, GBoost, XGBoost, Bagging and LSTM approaches.
Figure 4. Comparative analysis of training and testing accuracy of proposed approach and other state-of-the-art models.
Table 3 and Figure 5 present the comparative analysis of the performance metrics (accuracy, precision, recall, F1-score and AUC-ROC) used to evaluate the proposed model and the other considered models. From the table, it is clear that there is a degree of uncertainty in the best and average performances across the metrics used to evaluate all of the models. It can also be noticed from the table that deep learning approaches are dominant over machine learning and ensemble learning approaches in the classification of sentiment. The proposed LSTM + Firefly algorithm outperformed the other models, with 0.9951, 0.9956, 0.9951, 0.9979 and 99.59% for the precision, recall, F1-score, AUC-ROC and accuracy, respectively. The LSTM approach attained the second-highest accuracy, 99.14%, ahead of the conventional models. The AdaBoost approach obtained the lowest accuracy, 80.13%, followed by the MLP with 81.45%. The other models, i.e., the DT, KNN, RF, GBoost, Bagging and XGBoost models, obtained accuracy values of 92.32%, 88.30%, 89.44%, 91.62%, 91.62% and 92.89%, respectively. From Table 3, it is also observed that the ensemble learning approaches obtained better accuracy than the machine learning approaches. Because ensemble approaches combine multiple models and help to reduce bias or variance, they yield higher predictive accuracy than individual machine learning approaches. It is also observed that the deep learning approach obtained superior performance over the ensemble approaches because deep learning approaches optimize the features while extracting them. Hence, it can be concluded from the results that the proposed method produces efficient results because of the fine-tuning of the hyperparameters by the firefly algorithm, which signifies that optimization plays an important role in obtaining better results.
Table 3. Comparative analysis of the performance metrics of the proposed model and other models.
Figure 6(a) to (j) depicts the AUC-ROC curves of the proposed model and the other standard approaches. From the figure, it is observed that the proposed LSTM + Firefly model attained an AUC-ROC of 1.00 for both class labels, unlike the other conventional methods, which indicates that the suggested approach classifies the instances in the data more effectively than the others. Even the micro- and macro-average ROC curve values for the proposed approach were equal to 1.00, which indicates that all instances were correctly classified. The efficiency of the proposed model is mainly due to the fine-tuning of the hyperparameters of the LSTM model by the firefly algorithm. Figure 7 shows how the fitness of the individuals evolves over the course of numerous generations for the LSTM model and the LSTM + Firefly model. From all of the experimental results, it can be concluded that the efficiency of the machine learning and ensemble learning approaches used for the classification of sentiments related to COVID-19 can be enhanced if the hyperparameters are fine-tuned using optimization algorithms.
Figure 6. AUC-ROC curves for a) DT, b) MLP, c) KNN, d) RF, e) AdaBoost, f) GBoost, g) Bagging, h) XGBoost, i) LSTM and j) LSTM + Firefly.
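For a binary classifier, the AUC-ROC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition on made-up scores; the function name `auc_roc` and the sample data are assumptions, and the quadratic pairwise loop is meant for illustration rather than for large datasets.

```python
def auc_roc(labels, scores):
    """AUC-ROC for binary labels (1 = positive, 0 = negative) via pairwise ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count positive-negative pairs ranked correctly; ties count as one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this example three of the four positive-negative pairs are ranked correctly, so the AUC-ROC is 0.75. A value of 1.00 is reached only when every positive instance is scored above every negative instance.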
Moreover, the study also included a comparative analysis of the results obtained by using the proposed method, along with the results obtained in other studies on the sentiment analysis of COVID-19-related tweets, as shown in Table 4. From the table, it can be concluded that our proposed methodology has obtained better performance in terms of accuracy as compared with the performance of the LSTM models used in other studies for the classification of COVID-19 tweets.
Table 4. Analysis of results of proposed method with results of other works on the sentiment analysis of COVID-19 related tweets.
Since its origin, COVID-19 has been considered one of the biggest challenges to human life all over the world. Researchers and government agencies across the world are continuously working to control the spread of this disease. Social media has had a great impact on people's lives, as most people use these media to share and follow information related to COVID-19. In many situations, the information shared on social media platforms has misguided people about the COVID-19 pandemic, which in turn has had an adverse impact on their mental health and well-being. Therefore, in this study, we proposed a deep learning approach based on LSTM for the classification of tweets related to COVID-19 into positive and negative sentiments. The proposed approach also makes use of the firefly algorithm to fine-tune the hyperparameters of the LSTM approach. Further, the proposed model and other state-of-the-art models were evaluated by using various performance metrics; the experimental results revealed that the proposed LSTM + Firefly model outperformed the other approaches with an accuracy of 99.59%, precision and F1-score of 99.51%, recall of 99.56% and AUC-ROC of 99.79%. However, in recent days, many advanced machine learning-based approaches have been found to be successful in solving pattern recognition problems [53]. As future work, a fair comparative analysis may be conducted among the classical machine learning approaches, advanced approaches and the proposed framework to verify its superiority in adapting to various related complex real-world problems. Moreover, the suggested approach can be extended to analyze the reactions regarding COVID-19 vaccinations due to the increase of anti-vaccine sentiments.
Conflict of interest
The authors declare that this manuscript has no conflict of interest with any other published source and has not been published previously (partly or in full). No data have been fabricated or manipulated to support our conclusions. No funding agencies were involved in this research.
References
[1]
C. Fitzmaurice, D. Abate, N. Abbasi, H. Abbastabar, F. Abd-Allah, O. Abdel-Rahman, et al., Global, regional, and national cancer incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 29 cancer groups, 1990 to 2017: a Systematic analysis for the global burden of disease study, JAMA Oncol., 5 (2019), 1749–1768. https://doi.org/10.1001/jamaoncol.2019.2996 doi: 10.1001/jamaoncol.2019.2996
[2]
H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, et al., Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA Cancer J. Clin., 71 (2021), 209–249. https://doi.org/10.3322/caac.21660 doi: 10.3322/caac.21660
[3]
P. S. Karia, J. Han, C. D. Schmults, Cutaneous squamous cell carcinoma: estimated incidence of disease, nodal metastasis, and deaths from disease in the United States, 2012, J. Am. Acad. Dermatol., 68 (2013), 957–966. https://doi.org/10.1016/j.jaad.2012.11.037 doi: 10.1016/j.jaad.2012.11.037
[4]
J. M. Janus, R. F. L. O'Shaughnessy, C. Harwood, T. Maffucci, Phosphoinositide 3-Kinase-Dependent signalling pathways in cutaneous squamous cell carcinomas, Cancers, 9 (2017), 86. https://doi.org/10.3390/cancers9070086 doi: 10.3390/cancers9070086
[5]
M. Piipponen, R. Riihilä, L. Nissinen, V. Kähäri, The role of p53 in progression of cutaneous squamous cell carcinoma, Cancers, 13 (2021), 4507. https://doi.org/10.3390/cancers13184507 doi: 10.3390/cancers13184507
[6]
A. Boutros, F. Cecchi, E. Tanda, E. Croce, R. Gili, L. Arecco, et al., Immunotherapy for the treatment of cutaneous squamous cell carcinoma, Front. Oncol., 11 (2021), 733917. https://doi.org/10.3389/fonc.2021.733917 doi: 10.3389/fonc.2021.733917
[7]
Y. Sawada, M. Nakamura, Daily lifestyle and cutaneous malignancies, Int. J. Mol. Sci., 22 (2021), 5227. https://doi.org/10.3390/ijms22105227 doi: 10.3390/ijms22105227
[8]
K. Suozzi, J. Turban, M. Girardi, Cutaneous photoprotection: a review of the current status and evolving strategies, Yale J. Biol. Med., 93 (2020), 55–67.
[9]
C. Flower, D. Gaskin, S. Bhamjee, Z. Bynoe, High-risk variants of cutaneous squamous cell carcinoma in patients with discoid lupus erythematosus: a case series, Lupus, 22 (2013), 736–739. https://doi.org/10.1177%2F0961203313490243
[10]
K. K. Das, A. Chakaraborty, A. Rahman, S. Khandkar, Incidences of malignancy in chronic burn scar ulcers: experience from Bangladesh, Burns, 41 (2015), 1315–1321. https://doi.org/10.1016/j.burns.2015.02.008 doi: 10.1016/j.burns.2015.02.008
[11]
T. J. Knackstedt, L. K. Collins, Z. Li, S. Yan, F. Samie, Squamous cell carcinoma arising in hypertrophic lichen planus: a review and analysis of 38 cases, Dermatol. Surg., 41 (2015), 1411–1418. http://doi.org/10.1097/DSS.0000000000000565 doi: 10.1097/DSS.0000000000000565
[12]
J. Xing, Z. Jia, Y. Xu, M. Chen, Z. Yang, Y. Chen, et al., KLF9 (Kruppel Like Factor 9) induced PFKFB3 (6-Phosphofructo-2-Kinase/Fructose-2, 6-Biphosphatase 3) downregulation inhibits the proliferation, metastasis and aerobic glycolysis of cutaneous squamous cell carcinoma cells, Bioengineered, 12 (2021), 7563–7576. https://doi.org/10.1080/21655979.2021.1980644 doi: 10.1080/21655979.2021.1980644
[13]
J. G. Newman, M. A. Hall, S. J. Kurley, R. Cook, A. S. Farberg, J. L. Geiger, et al., Adjuvant therapy for high-risk cutaneous squamous cell carcinoma: 10-year review, Head Neck, 43 (2021), 2822–2843. https://doi.org/10.1002/hed.26767 doi: 10.1002/hed.26767
[14]
J. Pang, H. Pan, C. Yang, P. Meng, W. Xie, J. Li, et al., Prognostic value of immune-related multi-incRNA signatures associated with tumor microenvironment in esophageal cancer, Front. Genet., 12 (2021), 722601. https://dx.doi.org/10.3389%2Ffgene.2021.722601
[15]
Y. Pan, H. Han, K. E. Labbe, H. Zhang, W. Wong, Recent advances in preclinical models for lung squamous cell carcinoma, Oncogene, 40 (2021), 2817–2829. https://doi.org/10.1038/s41388-021-01723-7 doi: 10.1038/s41388-021-01723-7
[16]
A. Elmusrati, J. Wang, C. Y. Wang, Tumor microenvironment and immune evasion in head and neck squamous cell carcinoma, Int. J. Oral. Sci., 13 (2021), 24. https://doi.org/10.1038/s41368-021-00131-7 doi: 10.1038/s41368-021-00131-7
[17]
T. Suwa, M. Kobayashi, J. M. Nam, H. Harada, Tumor microenvironment and radioresistance, Exp. Mol. Med., 53 (2021), 1029–1035. https://doi.org/10.1038/s12276-021-00640-9 doi: 10.1038/s12276-021-00640-9
[18]
S. Paget, The distribution of secondary growths in cancer of the breast, Cancer Metastasis Rev., 8 (1889), 98–101.
[19]
H. Wang, M. M. H. Yung, H. Y. S. Ngan, K. Chan, D. W. Chan, The impact of the tumor microenvironment on macrophage polarization in cancer metastatic progression, Int. J. Mol. Sci., 22 (2021), 6560. https://doi.org/10.3390/ijms22126560 doi: 10.3390/ijms22126560
[20]
J. Zhuyan, M. Chen, T. Zhu, X. Bao, T. Zhen, K. Xing, et al., Critical steps to tumor metastasis: alterations of tumor microenvironment and extracellular matrix in the formation of pre-metastatic and metastatic niche, Cell Biosci., 10 (2020), 89. https://doi.org/10.1186/s13578-020-00453-9 doi: 10.1186/s13578-020-00453-9
[21]
Y. Xie, F. Xie, L. Zhang, X. Zhou, J. Huang, F. Wang, et al., Targeted anti-tumor immunotherapy using tumor infiltrating cells, Adv. Sci., e2101672. https://doi.org/10.1002/advs.202101672 doi: 10.1002/advs.202101672
[22]
M. Akhtar, A. Haider, S. Rashid, A. Ai-Nabet, Paget's "Seed and Soil" theory of cancer metastasis: an idea whose time has come, Adv. Anat. Pathol., 26 (2019), 69–74. https://doi.org/10.1097/PAP.0000000000000219 doi: 10.1097/PAP.0000000000000219
[23]
G. Yan, L. Li, S. Zhu, Y. Wu, Y. Zhu, L. Zhu, et al., Single-cell transcriptomic analysis reveals the critical molecular pattern of UV-induced cutaneous squamous cell carcinoma, Cell Death Dis., 13 (2022), 23. https://doi.org/10.1038/s41419-021-04477-y doi: 10.1038/s41419-021-04477-y
[24]
A. Ji, A. Rubin, K. Thrane, S. Jiang, D. L. Reynolds, R. M. Meyers, et al., Multimodal analysis of composition and spatial architecture in human squamous cell carcinoma, Cell, 182 (2020), 497–514. https://doi.org/10.1016/j.cell.2020.05.039 doi: 10.1016/j.cell.2020.05.039
[25]
C. B. Steen, C. L. Liu, A. A. Alizadeh, A. M. Newman, Profiling cell type abundance and expression in bulk tissues with CIBERSORTx, Methods Mol. Biol., 2117 (2020), 135–157. https://doi.org/10.1007/978-1-0716-0301-7_7 doi: 10.1007/978-1-0716-0301-7_7
[26]
J. L. Sevilla, V. Segura, A. Podhorski, E. Guruceaga, J. M. Mato, L. A. Martinez-Cruz, et al., Correlation between gene expression and GO semantic similarity, IEEE/ACM Trans. Comput. Biol. Bioinform., 2 (2005), 330–338. https://doi.org/10.1109/TCBB.2005.50 doi: 10.1109/TCBB.2005.50
[27]
S. Jain, G. D. Bader, An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology, BMC Bioinf., 11 (2010), 562. https://doi.org/10.1186/1471-2105-11-562 doi: 10.1186/1471-2105-11-562
[28]
X. Guo, C. D. Shriver, H. Hu, M. N. Liebman, Analysis of metabolic and regulatory pathways through Gene Ontology-derived semantic similarity measures, in AMIA Annual Symposium Proceedings, American Medical Informatics Association, (2005), 972.
[29]
P. M. Tedder, J. R. Bradford, C. J. Needham, G. A. McConkey, A. J. Bulpitt, D. R. Westhead, Gene function prediction using semantic similarity clustering and enrichment analysis in the malaria parasite Plasmodium falciparum, Bioinformatics, 26 (2010), 2431–2437. https://doi.org/10.1093/bioinformatics/btq450 doi: 10.1093/bioinformatics/btq450
[30]
G. Yu, F. Li, Y. Qin, X. Bo, Y. Wu, S. Wang, GOSemSim: an R package for measuring semantic similarity among GO terms and gene products, Bioinformatics, 26 (2010), 976–978. https://doi.org/10.1093/bioinformatics/btq064 doi: 10.1093/bioinformatics/btq064
[31]
J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, C. F. Chen, A new method to measure the semantic similarity of GO terms, Bioinformatics, 23 (2007), 1274–1281. https://doi.org/10.1093/bioinformatics/btm087 doi: 10.1093/bioinformatics/btm087
[32]
E. Rognoni, M. Widmaier, M. Jakobson, R. Ruppert, S. Ussar, D. Katsougkri, Kindlin-1 controls Wnt and TGF-β availability to regulate cutaneous stem cell proliferation, Nat. Med., 20 (2014), 350–359. https://doi.org/10.1038/nm.3490 doi: 10.1038/nm.3490
[33]
M. Lai, R. Pampena, L. Cornacchia, G. Odorici, A. Piccerillo, G. Pellacani, et al., Cutaneous squamous cell carcinoma in patients with chronic lymphocytic leukemia: a systematic review of the literature, Int. J. Dermatol., 2021 (2021). https://doi.org/10.1111/ijd.15813
[34]
H. B. Jie, P. J. Schuler, S. C. Lee, R. M. Srivastava, A. Argiris, S. Ferrone, et al., CTLA-4⁺ regulatory T cells increased in cetuximab-treated head and neck cancer patients suppress NK cell cytotoxicity and correlate with poor prognosis, Cancer Res., 75 (2015), 2200–2210. https://doi.org/10.1158/0008-5472.CAN-14-2788 doi: 10.1158/0008-5472.CAN-14-2788
[35]
S. Z. Lin, K. J. Chen, Z. Y. Xu, H. Chen, L. Zhou, H. Y. Xie, et al., Prediction of recurrence and survival in hepatocellular carcinoma based on two Cox models mainly determined by FoxP3+ regulatory T cells, Cancer Prev. Res., 6 (2013), 594–602. https://doi.org/10.1158/1940-6207.CAPR-12-0379 doi: 10.1158/1940-6207.CAPR-12-0379
[36]
B. Azzimonti, E. Zavattaro, M. Provasi, M. Vidali, A. Conca, E. Catalano, et al., Intense Foxp3+ CD25+ regulatory T-cell infiltration is associated with high-grade cutaneous squamous cell carcinoma and counterbalanced by CD8+/Foxp3+ CD25+ ratio, Br. J. Dermatol., 172 (2014), 64–73. https://doi.org/10.1111/bjd.13172 doi: 10.1111/bjd.13172
[37]
S. M. Gorsch, V. A. Memoli, T. A. Stukel, L. I. Gold, B. A. Arrick, Immunohistochemical staining for transforming growth factor beta 1 associates with disease progression in human breast cancer, Cancer Res., 52 (1992), 6949–6952.
[38]
M. Ponzoni, F. Pastorino, D. Di Paolo, P. Perri, C. Brignole, Targeting macrophages as a potential therapeutic intervention: impact on inflammatory diseases and cancer, Int. J. Mol. Sci., 19 (2018), 1953. https://doi.org/10.3390/ijms19071953 doi: 10.3390/ijms19071953
[39]
L. Nissinen, M. Farshchian, P. Riihilä, V. Kähäre, New perspectives on role of tumor microenvironment in progression of cutaneous squamous cell carcinoma, Cell Tissue Res., 365 (2016), 691–702. https://doi.org/10.1007/s00441-016-2457-z doi: 10.1007/s00441-016-2457-z
[40]
J. S. Pettersen, J. Fuentes-Duculan, M. Suárez-Fariñas, K. C. Pierson, A. Pitts-Kiefer, L. Fan, et al., Tumor-associated macrophages in the cutaneous SCC microenvironment are heterogeneously activated, J. Invest. Dermatol., 131 (2011), 1322–1330. https://doi.org/10.1038/jid.2011.9 doi: 10.1038/jid.2011.9
[41]
M. Takahara, S. Chen, M. Kido, S. Takeuchi, H. Uchi, Y. Tu, et al., Stromal CD10 expression, as well as increased dermal macrophages and decreased Langerhans cells, are associated with malignant transformation of keratinocytes, J. Cutan. Pathol., 36 (2009), 668–674. https://doi.org/10.1111/j.1600-0560.2008.01139.x doi: 10.1111/j.1600-0560.2008.01139.x
[42]
D. Moussai, H. Mitsui, J. S. Pettersen, K. C. Pierson, K. R. Shah, M. Suárez-Fariñas, et al., The human cutaneous squamous cell carcinoma microenvironment is characterized by increased lymphatic density and enhanced expression of macrophage-derived VEGF-C, J. Invest. Dermatol., 131 (2011), 229–236. https://doi.org/10.1038/jid.2010.266 doi: 10.1038/jid.2010.266
[43]
C. A. Janeway, J. Ron, M. E. Katz, The B cell is the initiating antigen-presenting cell in peripheral lymph nodes, J. Immunol., 138 (1987), 1051–1055.
[44]
D. P. Harris, L. Haynes, P. C. Sayles, D. K. Duso, S. M. Eaton, N. M. Lepak, et al., Reciprocal regulation of polarized cytokine production by effector B and T cells, Nat. Immunol., 1 (2000), 475–482. https://doi.org/10.1038/82717 doi: 10.1038/82717
[45]
A. Sarvaria, J. A. Madrigal, A. Saudemont, B cell regulation in cancer and anti-tumor immunity, Cell Mol. Immunol., 14 (2017), 662–674. https://doi.org/10.1038/cmi.2017.35
[46]
P. Andreu, M. Johansson, N. Affara, F. Pucci, T. Tan, S. Junankar, et al., FcRgamma activation regulates inflammation-associated squamous carcinogenesis, Cancer Cell, 17 (2010), 121–134. https://doi.org/10.1016/j.ccr.2009.12.019
[47]
K. W. de Visser, L. V. Korets, L. M. Coussens, De novo carcinogenesis promoted by chronic inflammation is B lymphocyte dependent, Cancer Cell, 7 (2005), 411–423. https://doi.org/10.1016/j.ccr.2005.04.014
[48]
T. Schioppa, R. Moore, R. G. Thompson, F. R. Balkwill, B regulatory cells and the tumor-promoting actions of TNF-α during squamous carcinogenesis, Proc. Natl. Acad. Sci., 108 (2011), 10662–10667. https://doi.org/10.1073/pnas.1100994108
[49]
G. Crawford, M. D. Hayes, R. C. Seoane, S. Ward, T. Dalessandri, C. Lai, et al., Epithelial damage and tissue γδ T cells promote a unique tumor-protective IgE response, Nat. Immunol., 19 (2018), 859–870. https://doi.org/10.1038/s41590-018-0161-8
[50]
T. Zhou, R. Qin, S. Shi, H. Zhang, C. Niu, G. Ju, et al., DTYMK promote hepatocellular carcinoma proliferation by regulating cell cycle, Cell Cycle, 20 (2021), 1681–1691. https://doi.org/10.1080/15384101.2021.1958502
[51]
Y. Guo, W. Luo, S. Huang, W. Zhao, H. Chen, Y. Ma, et al., DTYMK expression predicts prognosis and chemotherapeutic response and correlates with immune infiltration in hepatocellular carcinoma, J. Hepatocell Carcinoma, 8 (2021), 871–885. https://doi.org/10.2147/JHC.S312604
[52]
T. Jeon, M. J. Ko, Y. R. Seo, S. J. Jung, D. Seo, S. Y. Park, et al., Silencing CDCA8 suppresses hepatocellular carcinoma growth and stemness via restoration of ATF3 tumor suppressor and inactivation of AKT/β-catenin signaling, Cancers, 13 (2021), 1055. https://doi.org/10.3390/cancers13051055
[53]
G. Vlotides, T. Eigler, S. Melmed, Pituitary tumor-transforming gene: physiology and implications for tumorigenesis, Endocr. Rev., 28 (2007), 165–186. https://doi.org/10.1210/er.2006-0042
[54]
H. Hong, Z. Jin, T. Qian, X. Xu, X. Zhu, Q. Fei, et al., Falcarindiol enhances cisplatin chemosensitivity of hepatocellular carcinoma via down-regulating the STAT3-modulated PTTG1 pathway, Front. Pharmacol., 12 (2021), 656697. https://doi.org/10.3389/fphar.2021.656697
[55]
S. W. Chen, H. F. Zhou, H. J. Zhang, R. He, Z. Huang, Y. Dang, et al., The clinical significance and potential molecular mechanism of PTTG1 in esophageal squamous cell carcinoma, Front. Genet., 11 (2021), 583085. https://doi.org/10.3389/fgene.2020.583085
[56]
Z. Chen, K. Cao, Y. Hou, F. Lu, L. Li, L. Wang, et al., PTTG1 knockdown enhances radiation-induced antitumour immunity in lung adenocarcinoma, Life Sci., 277 (2021), 119594. https://doi.org/10.1016/j.lfs.2021.119594
[57]
J. E. Noll, K. Vandyke, D. R. Hewett, K. M. Mrozik, R. J. Bala, S. A. Williams, et al., PTTG1 expression is associated with hyperproliferative disease and poor prognosis in multiple myeloma, J. Hematol. Oncol., 8 (2015), 106. https://doi.org/10.1186/s13045-015-0209-2
[58]
R. Wei, Z. Wang, Y. Zhang, B. Wang, N. Shen, E. Li, et al., Bioinformatic analysis revealing mitotic spindle assembly regulated NDC80 and MAD2L1 as prognostic biomarkers in non-small cell lung cancer development, BMC Med. Genomics, 13 (2020), 112. https://doi.org/10.1186/s12920-020-00762-5
[59]
M. Vleugel, T. A. Hoek, E. Tromer, T. Sliedrecht, V. Groenewold, M. Omerzu, et al., Dissecting the roles of human BUB1 in the spindle assembly checkpoint, J. Cell Sci., 128 (2015), 2975–2982. https://doi.org/10.1242/jcs.169821
[60]
Y. H. Ko, J. H. Roh, Y. I. Son, M. K. Chung, J. Y. Jang, H. Byun, et al., Expression of mitotic checkpoint proteins BUB1B and MAD2L1 in salivary duct carcinomas, J. Oral Pathol. Med., 39 (2010), 349–355. https://doi.org/10.1111/j.1600-0714.2009.00835.x
[61]
M. Abal, A. Obrador-Hevia, K. P. Janssen, L. Casadome, M. Menendez, S. Carpentier, et al., APC inactivation associates with abnormal mitosis completion and concomitant BUB1B/MAD2L1 up-regulation, Gastroenterology, 132 (2007), 2448–2458. https://doi.org/10.1053/j.gastro.2007.03.027
[62]
Y. Wang, Z. Zhou, L. Chen, Y. Li, Z. Zhou, X. Chu, Identification of key genes and biological pathways in lung adenocarcinoma via bioinformatics analysis, Mol. Cell Biochem., 476 (2021), 931–939. https://doi.org/10.1007/s11010-020-03959-5
[63]
R. Marima, R. Hull, C. Penny, Z. Dlamini, Mitotic syndicates Aurora Kinase B (AURKB) and mitotic arrest deficient 2 like 2 (MAD2L2) in cohorts of DNA damage response (DDR) and tumorigenesis, Mutat. Res. Rev. Mutat. Res., 787 (2021), 108376. https://doi.org/10.1016/j.mrrev.2021.108376
Figure 1. Removing batch effects between chips. (A) Box plots of GSE42677 and GSE45164 before correction; (B) box plots of GSE42677 and GSE45164 after correction. Blue represents GSE42677; red represents GSE45164
Figure 2. Microenvironmental immune infiltration analysis. (A) Bar graph of different types of immune cells in each sample; (B) heat map of immune cell correlation; (C) differential analysis of the infiltration of different immune cell types between the tumor and normal groups
Figure 3. (A) A co-expression matrix was constructed to group genes into modules based on their adjacency and similarity, yielding a hierarchical clustering tree of the genes; (B) Gene modules were detected from the topological overlap matrix (TOM). Nine modules were detected: black, blue, greenyellow, grey, magenta, purple, red, salmon, and tan. Correlation analysis between modules and traits showed that the salmon module had the highest correlation with the SCC phenotype (cor = 0.54, P = 0.001)
Figure 4. (A) Top half: enrichment analysis with the Metascape database showed that the genes were mainly enriched in mitotic cell cycle process, microtubule cytoskeleton organization involved in mitosis, centrosome, cell cycle, positive regulation of cell cycle, cell cycle G2/M phase transition, mitotic metaphase plate congression, DNA conformation change, meiotic nuclear division and other pathways; bottom half: cluster network of the enriched pathways, where nodes sharing the same cluster are usually close to each other; (B) protein-protein interaction network obtained from the STRING database and visualized with Cytoscape
Figure 5. Screening of SCC core genes. (A) Ten-fold cross-validation of tuning parameter selection in the LASSO model; (B) LASSO coefficient distribution of differential genes; (C) Coefficients of the four LASSO (core) genes
Figure 6. (A) ROC curve of the prediction model on the training set (GSE42677 and GSE45164); the area under the curve (AUC) is 0.9846; (B) ROC curve of the diagnostic model on the validation set GSE45216; the AUC is 0.8567; (C) ROC curve of the diagnostic model on the validation set GSE53462; the AUC is 1
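The AUC values reported for Figure 6 can be read as the probability that a randomly chosen tumor sample receives a higher model score than a randomly chosen normal sample. The following is a minimal Python sketch of that pairwise definition. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; this is not the authors' pipeline or data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0 (normal) / 1 (tumor); scores: model outputs.
    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the GEO datasets.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 1, as in panel (C), corresponds to every tumor sample scoring above every normal sample in that validation set.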
Figure 7. Gene set enrichment analysis of four core genes. (A) pathway enriched with high CDCA8 expression (KEGG FRUCTOSE AND MANNOSE METABOLISM); (B) pathway enriched with high CDCA8 expression (KEGG CITRATE CYCLE TCA CYCLE); (C) pathway enriched with high DTYMK expression (KEGG HUNTINGTONS DISEASE); (D) pathway enriched with high DTYMK expression (KEGG PYRIMIDINE METABOLISM); (E) pathway enriched with high MAD2L1 expression (KEGG RNA DEGRADATION); (F) pathway enriched with high MAD2L1 expression (KEGG CELL CYCLE); (G) pathway enriched with high PTTG1 expression (KEGG PARKINSONS DISEASE); (H) pathway enriched with high PTTG1 expression (KEGG PATHOGENIC ESCHERICHIA COLI INFECTION)
Figure 8. Motif-based transcriptional regulation analysis of core genes. (A, B) the red line is the mean of the recovery curves of all motifs, the green line is the mean ± standard deviation (sd), and the blue line is the recovery curve of the current motif. The point of maximum distance between the current motif's curve and the green curve (mean ± sd) is selected as the maximum enrichment level; (C) motifs enriched for the four core genes. NES: normalized enrichment score of motifs in the gene set, AUC: area under the curve (used to calculate NES)
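The caption of Figure 8 states that the AUC is used to calculate the NES but does not give the formula. A common convention in motif enrichment tools is to standardize each motif's AUC against the AUC distribution of all motifs, NES = (AUC − mean) / sd. The Python sketch below assumes that convention; the function name and the motif AUC values are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def normalized_enrichment_scores(motif_auc):
    """Return {motif: NES}, assuming NES = (AUC - mean) / sd over all motifs.

    motif_auc: dict mapping a motif name to its recovery-curve AUC.
    Uses the sample standard deviation (n - 1 denominator).
    """
    values = list(motif_auc.values())
    if len(values) < 2:
        raise ValueError("need at least two motifs to estimate the sd")
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        raise ValueError("all AUCs are identical; NES is undefined")
    return {motif: (auc - mu) / sigma for motif, auc in motif_auc.items()}


# Hypothetical AUC values for three made-up motifs.
example_nes = normalized_enrichment_scores(
    {"motif_a": 0.1, "motif_b": 0.2, "motif_c": 0.3}
)
```

Under this convention, a motif with a large positive NES has a recovery curve well above the average motif for the gene set, which is what panels (A, B) visualize.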
Figure 9. Screening of critical genes by GO semantic similarity showed that CDCA8 and PTTG1 are critical in the whole network. The upper and lower limits of the boxes show the 75th and 25th percentiles, and the line in the middle of the box indicates the mean value of similarity. The top two proteins are considered to be key proteins
Figure 10. Correlation analysis of core genes and immune-related genes. (A) Differential analysis of immune regulatory genes showed that HLA-DMA, HLA-DOA, HLA-DOB and HLA-DQB2 were significantly different in the two groups of patients. * denotes P < 0.05, ** denotes P < 0.01, *** denotes P < 0.001; (B) Correlation analysis of critical genes PTTG1 and CDCA8 showed that PTTG1 and HLA-DMA were significantly positively correlated, and CDCA8 was significantly negatively correlated with HLA-DQB2
Figure 11. Immunohistochemical expression results of the four core genes from the Human Protein Atlas (https://www.proteinatlas.org/). (A) CDCA8 expression in normal and tumor tissue (HPA028058); (B) DTYMK expression in normal and tumor tissue (HPA042593); (C) MAD2L1 expression in normal and tumor tissue (HPA003348); (D) PTTG1 expression in normal and tumor tissue (HPA045034)