
Artificial intelligence has been a major focus of Internet technology in recent years, enabling computers to imitate human behavior more intelligently. Within this field, sentiment analysis remains a difficult problem: it asks machines to go beyond mechanical computation and approximate human patterns of thinking. Sentiment analysis illustrates the prospects of human-computer interaction and points the way forward for information technology in the new era, but bridging the gap from simple binary computation to complex and variable human reasoning takes considerable work. Common sentiment analysis methods are built on text data because textual content reflects emotional value well. Some scholars have used parallel convolutional neural networks (CNN) [1] and recurrent neural networks (RNN) [2] for sentiment analysis of text. Wang et al. proposed an iterative algorithm called SentiDiff to predict the sentiment polarities expressed in Twitter messages [3]. Hassonah et al. offered a hybrid machine learning approach to enhance sentiment analysis [4].
However, a single text modality can no longer provide complete information in the face of complex data types. On social platforms such as Weibo, Twitter and WeChat Moments, people share large amounts of text, emoticons, images, audio and video to express their emotions in multiple ways, and this material provides a rich data source for multimodal sentiment analysis. Audio data carries cues such as the volume and tone of the voice, while image data carries cues such as facial expressions and color tones; all of these can help text content express human emotions more fully and improve the accuracy with which a computer judges emotional polarity. Huddar et al. presented a novel attention-based multimodal contextual fusion strategy that extracts contextual information among the utterances before fusion [5]. Jiang et al. proposed a model that uses an interactive information fusion mechanism to learn visual-specific and textual-specific representations interactively [6]. Zadeh et al. proposed the multi-attention recurrent network (MARN) [7], a novel neural architecture for understanding human communication that combines multi-headed attention modules with a hybrid long short-term memory network.
Multimodal sentiment analysis methods still face many problems. In feature extraction, the commonly used reinforcement learning methods often operate at the word level, which ignores the information interaction between words, and most neural network models are based on a single-layer LSTM (long short-term memory) network, which struggles to extract deep, complex features. In feature fusion, there is a heterogeneity gap between modalities: the data live in different distribution spaces and are difficult to compare directly. Existing models are also often limited by the exponential growth in computational and memory cost that comes with tensor representations, and it is challenging to extend fusion to multiple modalities while keeping model complexity reasonable. This paper proposes an attention-based two-layer bidirectional GRU (AB-GRU) multimodal sentiment analysis method to address these problems.
First, we preprocess the text, audio and image data and feed the processed data vectors into the two-layer bidirectional GRU network separately for feature extraction; each result is then passed through the attention module to extract the important information and perform unimodal feature fusion. Second, we feed the extracted multimodal feature vectors into low-rank multimodal fusion (LMF) [8] for feature fusion: a constant 1 is appended to the feature vectors of the three modalities, the three vectors are combined into a three-dimensional Cartesian-product tensor, and this tensor is mapped back to a low-dimensional output vector. Finally, we map the result into the sample space to obtain the output sentiment polarity.
This paper aims to learn human emotion polarity using feature extraction and multimodal fusion techniques, learning how emotions are expressed from features such as the content of the text, the tone of the audio and the facial expressions in images, so as to achieve more efficient and accurate sentiment analysis. The main contributions of this paper are as follows:
1) A two-layer bidirectional GRU network is used to extract multimodal features, which can effectively learn the association of ordered data like text and audio on time series and improve the accuracy of feature extraction. GRU can streamline the gating mechanism and enhance learning efficiency.
2) Connecting two attention mechanisms after a two-layer bidirectional GRU network can capture important information in the feature vector and enhance the learning efficiency of modal features.
3) LMF converts multiple inputs into a high-dimensional tensor and maps them back to a low-dimensional vector, which can effectively improve the efficiency of operations. Different modalities are decoupled from each other so that the model can extend to data with an arbitrary number of modalities.
The world comprises countless complex and varied elements and humans can perceive them through sight, hearing, smell, taste and touch to obtain rich knowledge and information. With the development of Internet technology, scholars are also working on making computers learn to imitate this unique way of human information reception, which is also the research direction of artificial intelligence. The research in this direction has obtained excellent results and has been successfully applied in many fields, such as natural language processing, image recognition, recommendation systems and target detection.
Building on unimodal learning algorithms and techniques from different fields of artificial intelligence, scholars have begun to study multimodal fusion methods. Multimodal fusion [9,10] aims to understand and process several kinds of modal information, such as text, audio, image and video, through machine learning in order to perform prediction or classification. When machines process data, a single modality usually does not carry complete information: learning from one modality alone struggles to reach accurate predictions and easily settles on local optima, so multiple modalities are introduced and fused to improve learning. The basic principle of multimodal fusion is to fuse the features of different modal data: the features of the input data are extracted first, a fusion method combines the extracted features of the different modalities, and the fused features are finally fed into a classification or prediction model, as required, to obtain the output. As shown in Figure 1, multimodal fusion methods are divided into early, late and hybrid fusion according to when the fusion happens.
There are already mature research results on multimodal fusion techniques for different needs. For example, Radford et al. proposed the CLIP [11] model, whose structure consists mainly of a text encoder and an image encoder and which matches text and images by computing the similarity between their vectors. Zadeh et al. proposed the tensor fusion network (TFN) [12], which takes unimodal features as input and uses the 3-fold Cartesian product to explicitly model unimodal, bimodal and trimodal interactions. The memory fusion network (MFN) [13] assigns each view its own LSTM component, encodes it independently and propagates interactions across views through temporal information.
The use of multimodal fusion techniques for sentiment analysis is also the focus of this research paper. Textual content usually expresses human emotions directly but not comprehensively. Human language is very complex. For example, irony, mockery, rhetorical questions and other emotionally contradictory statements are complex for computers to understand accurately. We, therefore, resort to audio and image data to assist computers in understanding and classifying emotions. The voice can reflect whether the speaker is anxious or relaxed, and the tone can reflect whether the speaker is angry or calm. All information is contained in the audio data. Vision data can visually represent people's facial expressions and body movements and even the color shades of photos can reflect the photographer's emotion. This information together forms the database of multimodal technology and achieves a more intelligent and accurate multimodal emotion analysis.
In multimodal sentiment analysis methods, the effectiveness of feature extraction directly affects the downstream tasks. Commonly used feature extraction models are CNN [1], RNN [14,15], LSTM [16,17] and, more recently, the transformer [18] and BERT [19,20]. The sentiment analysis task is mainly based on the text modality: compared with the other modalities, text usually contains the richest and most specific sentiment information, while the other modalities play an auxiliary and corrective role. Among multimodal sentiment analysis methods, the most commonly used feature extractor is therefore the LSTM, which determines the current output from state variables that store past information together with the current input. Through its gating mechanism it also mitigates the weaknesses of plain RNNs, which are dominated by short-term memory, rarely retain complete sequence information and easily miss essential information. Wei et al. proposed a BiLSTM model with multi-polarity orthogonal attention for implicit sentiment analysis [21]. Zhang et al. proposed a recurrent attention LSTM neural network that performs sentiment analysis by iteratively locating attention regions covering key sentiment words [22].
The GRU [23,24,25] network used in this paper simplifies the internal structure of the LSTM, yielding a lighter network with fewer parameters while improving accuracy.
The structure of GRU is shown in Figure 2, which is mainly composed of a reset gate and an update gate.
The reset gate determines how much information from the hidden state of the last moment should be forgotten, sets the share of the new input and combines the retained information with that input.
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$  (2.1)

$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \times h_{t-1}, x_t])$  (2.2)
where x_t is the current input, h_{t-1} is the hidden state saved from the last moment, W is the weight matrix, σ is the sigmoid activation function, which compresses values to between 0 and 1, and tanh is the hyperbolic tangent activation function, which compresses values to between –1 and 1.
r_t adjusts the proportion of the input information x_t. Its value ranges from 0 to 1: the smaller the value, the more the input information is retained. h̃_t is the candidate hidden state. The reset gate helps to capture short-term dependencies in the temporal information.
Update gates are used to process long-term information, decide how much information from the hidden state of the last moment needs to be remembered and pass the remembered information down the line.
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$  (2.3)

$h_t = (1 - z_t) \times h_{t-1} + z_t \times \tilde{h}_t$  (2.4)
z_t adjusts the degree to which historical information is preserved. It takes values from 0 to 1: the smaller the value, the more historical information is preserved. h_t stores the information that is passed on to the next step.
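To make Eqs (2.1)–(2.4) concrete, here is a minimal NumPy sketch of one GRU step; the weight shapes, the concatenation order and the omission of bias terms are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step following Eqs (2.1)-(2.4).

    x_t    : (d_in,)  current input
    h_prev : (d_h,)   hidden state from the previous step
    W_r, W_z, W_h : (d_h, d_in + d_h) reset-gate, update-gate and
                    candidate-state weights (biases omitted).
    """
    concat = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat)                                   # Eq (2.1): reset gate
    z_t = sigmoid(W_z @ concat)                                   # Eq (2.3): update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # Eq (2.2): candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # Eq (2.4): new state

# Toy usage
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
x_t, h_prev = rng.normal(size=d_in), np.zeros(d_h)
W_r, W_z, W_h = (rng.normal(size=(d_h, d_in + d_h)) for _ in range(3))
print(gru_cell(x_t, h_prev, W_r, W_z, W_h))
```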
The two-layer bidirectional GRU model used in this paper combines two propagation directions, positive (forward) and negative (backward), on top of the ordinary GRU model and stacks the bidirectional GRU in two layers. The text and audio modalities are sequential, and earlier content affects the expression of later content to some extent, so there is a certain gap between what positive and negative propagation capture. The bidirectional propagation used in this paper learns the positive and negative propagation features separately and saves the corresponding hidden information. In the attention module, we combine the positive features with the positive hidden information and the negative features with the negative hidden information to focus on the critical information and obtain more targeted modal features.
For complex modal information, this paper stacks the bidirectional GRU to achieve more accurate feature extraction while improving the overall efficiency of the model: the internal structure of GRU is simple, so a two-layer stack still retains a high computation rate.
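As a concrete illustration, the PyTorch sketch below builds a two-layer bidirectional GRU for the text modality and splits its outputs into the positive (forward) and negative (backward) streams; the 300-dimensional input and 128-dimensional hidden size follow the experiment section, while the batch size and variable names are illustrative.

```python
import torch
import torch.nn as nn

# Two-layer bidirectional GRU for the text modality: 300-d GloVe embeddings,
# 128-d hidden state (the sizes reported in the experiment section).
bigru = nn.GRU(input_size=300, hidden_size=128,
               num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(8, 50, 300)      # (batch, sequence length 50, embedding dim)
outputs, h_n = bigru(x)          # outputs: (8, 50, 2 * 128)

# Split the per-step outputs into the positive and negative streams G_T+ / G_T-.
G_pos, G_neg = outputs[..., :128], outputs[..., 128:]

# h_n: (num_layers * 2, batch, 128); the last layer's two directions give the
# positive and negative hidden states h_t+ / h_t-.
h_pos, h_neg = h_n[-2], h_n[-1]
```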
Different fusion methods fit various tasks. Common multimodal tasks are cross-modal retrieval, sentiment analysis, and audio-visual recognition [26]. CLIP targets cross-modal retrieval tasks, which enables image and text matching. GLCM [27] is a self-supervised method for learning audiovisual representations, which can generalize to both the tasks which require global semantic information and the tasks that require fine-grained spatio-temporal information. In this paper, we investigate the sentiment analysis of multimodal book data [28] and propose an attention-based two-layer bi-directional GRU model, which outperforms most of its current counterparts on sentiment classification tasks.
Compared with existing models, AB-GRU achieves better classification accuracy with lower model complexity and can be extended to data with an arbitrary number of modalities. Compared with the currently outstanding CLIP [11], AB-GRU better targets data with more than two modalities and has low model complexity, few parameters and a high training rate. Compared with the traditional TFN [12], AB-GRU uses a stacked GRU network in the feature extraction module and connects two attention layers to enhance the capture of important information; in the feature fusion module, it decomposes the weights into low-rank factors to reduce the number of parameters and improve the computation rate. GRU [29] is widely used in deep learning: for text and audio data with temporal characteristics it learns features well with a simple structure and few parameters, which greatly improves the processing rate in complex multimodal tasks.
The AB-GRU model used in this paper is shown in Figure 3, which consists of a combination of four main modules: input module, feature extraction module, feature fusion module and output module.
1) The input module is used to pre-process the multimodal sentiment analysis data, and the data types used in this paper include text, audio and image data.
2) The feature extraction module, which is the focus of improvement in this paper, uses a two-layer bidirectional GRU model based on attention [30] to extract features from the data of the three modalities and obtain the corresponding three feature vectors.
3) The feature fusion module uses LMF for feature fusion to obtain the fused 3D model, which is then mapped back to the low-dimensional output vector.
4) The output module contains a decision layer: the low-dimensional vector from the previous step is passed through fully connected layers to obtain a single-valued output, which the decision layer maps to the final result. A minimal code sketch of this four-module composition is given below.
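The sketch shows only how the four modules compose in PyTorch; the submodules are placeholders passed in as callables, not the paper's implementation.

```python
import torch.nn as nn

class ABGRU(nn.Module):
    """Skeleton of the four-module pipeline; the submodules are placeholders."""

    def __init__(self, extract_t, extract_a, extract_v, fusion, head):
        super().__init__()
        # attention-based two-layer bidirectional GRU extractors, one per modality
        self.extract_t, self.extract_a, self.extract_v = extract_t, extract_a, extract_v
        self.fusion = fusion   # low-rank multimodal fusion (LMF)
        self.head = head       # fully connected layers + decision layer

    def forward(self, text, audio, vision):
        F_T = self.extract_t(text)
        F_A = self.extract_a(audio)
        F_V = self.extract_v(vision)
        h = self.fusion(F_T, F_A, F_V)   # low-dimensional fused vector
        return self.head(h)              # single-valued sentiment score
```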
We preprocess the data of text, audio and image modalities and then use P2FA to perform word alignment to align the three modalities at word granularity and get the data vector of text modality T=(t1,t2,…,tn), n is the vector length of text modality; the data vector of audio modality A=(a1,a2,…,am), m is the audio modal vector length and the vision modal data vector V=(v1,v2,…,vl), l is the image modal vector length.
The first step in the feature extraction module is to input three modalities, text T=(t1,t2,…,tn), audio A=(a1,a2,…,am), and vision V=(v1,v2,…,vl), into the attention-based two-layer bidirectional GRU network. Figure 4 shows the feature extraction process of the text.
In the second step, we input the text vector T=(t1,t2,…,tn) into the bidirectional GRU network for learning; the input information undergoes update and forget operations in each GRU cell. The output is then fed into a second bidirectional GRU layer, which repeats the above steps. These processes are shown in the second and third modules of Figure 4. Finally, we obtain the positive hidden state ht+ of the text, the negative hidden state ht− and the text output GT=(Gt1,Gt2,…,Gtn) after the GRU.
Since this paper uses a bidirectional GRU network, output GT comprises positive and negative propagation processes, so GT can be decomposed into positive output GT+ and negative output GT−.
Similarly, the audio vector passes through the double-layer bidirectional GRU network to obtain the positive hidden layer state ha+, the negative hidden layer state ha− and the output GA=(Ga1,Ga2,…,Gam), which can be decomposed into the positive output GA+ and the negative output GA−. The vision vector passes through the two-layer bidirectional GRU network to obtain the positive hidden layer state hv+, the negative hidden layer state hv− and the output GV=(Gv1,Gv2,…,Gvl), which can be decomposed into the positive output GV+ and the negative output GV−.
In the third step, we put ht+, ht−, GT+ and GT− into the attention module, the third module in Figure 4. The input content goes through the first layer of attention mechanism Attention1 for unimodal feature fusion: positive hidden features are combined with positive output and negative hidden features with negative output. Then, the attention mechanism learns the critical information of positive and negative directions respectively. Finally, the positive features of the text and the negative features of the text are obtained as follows:
$F_{T+} = \sum \mathrm{softmax}[\mathrm{relu}(h_{t+} \times W_{T+})] \times [\mathrm{relu}(\tanh(G_{T+} \times W_{T+}))] \times G_{T+}$  (3.1)

$F_{T-} = \sum \mathrm{softmax}[\mathrm{relu}(h_{t-} \times W_{T-})] \times [\mathrm{relu}(\tanh(G_{T-} \times W_{T-}))] \times G_{T-}$  (3.2)
where, FT+ is the positive feature obtained from the text feature vector after Attention1 and FT− is the negative feature obtained from the text feature vector after Attention1. WT is the parameter matrix needed to learn. relu and tanh are the activation functions.
Then FT+ and FT− are input into the second layer of attention mechanism Attention2, to combine the positive and negative features and learn the weights of positive and negative features to obtain the full text features:
$F_T = F_{T+} \times \theta_T + F_{T-} \times (1 - \theta_T)$  (3.3)
where FT is the final text feature obtained from the text vector by the attention-based two-layer bidirectional GRU model and θT is the weight needed to learn.
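One possible PyTorch reading of Eqs (3.1)–(3.3) is sketched below; the tensor shapes, the softmax axis, the broadcasting over time steps and the placement of the summation are not fully pinned down by the text, so they are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def attention1(h_dir, G_dir, W_dir):
    """One direction of Attention1 (Eqs (3.1)/(3.2)).

    h_dir : (batch, d)     final hidden state of this direction
    G_dir : (batch, T, d)  per-step GRU outputs of this direction
    W_dir : (d, d)         learnable parameter matrix
    Returns a (batch, d) directional feature.
    """
    score = F.softmax(F.relu(h_dir @ W_dir), dim=-1).unsqueeze(1)  # (batch, 1, d)
    gate = F.relu(torch.tanh(G_dir @ W_dir))                       # (batch, T, d)
    return (score * gate * G_dir).sum(dim=1)                       # sum over time

def attention2(F_pos, F_neg, theta):
    """Attention2 (Eq (3.3)): learned convex combination of the two directions."""
    return F_pos * theta + F_neg * (1.0 - theta)

# Toy usage for the text modality
batch, T, d = 8, 50, 128
h_pos, h_neg = torch.randn(batch, d), torch.randn(batch, d)
G_pos, G_neg = torch.randn(batch, T, d), torch.randn(batch, T, d)
W_pos, W_neg = torch.randn(d, d), torch.randn(d, d)
theta = torch.sigmoid(torch.randn(1))     # keeps the learned weight in (0, 1)

F_T = attention2(attention1(h_pos, G_pos, W_pos),
                 attention1(h_neg, G_neg, W_neg), theta)
```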
For the audio modality, we combine the audio feature GA=(Ga1,Ga2,…,Gam), the positive output GA+ and the negative output GA− of the audio feature and the positive hidden layer state ha+ and the negative hidden layer state ha− of the audio into the first attention mechanism Attention1: Combining the positive hidden feature with the positive output and the negative hidden feature with the negative output. Finally we obtain the positive feature and the negative feature of the audio:
$F_{A+} = \sum \mathrm{softmax}[\mathrm{relu}(h_{a+} \times W_{A+})] \times [\mathrm{relu}(\tanh(G_{A+} \times W_{A+}))] \times G_{A+}$  (3.4)

$F_{A-} = \sum \mathrm{softmax}[\mathrm{relu}(h_{a-} \times W_{A-})] \times [\mathrm{relu}(\tanh(G_{A-} \times W_{A-}))] \times G_{A-}$  (3.5)
where FA+ is the positive feature obtained from the audio feature vector after Attention1 and FA− is the negative feature obtained from the audio feature vector after Attention1. WA is the parameter matrix needed to learn. relu and tanh are the activation functions.
Then FA+ and FA− are input into the second layer of attention mechanism Attention2, to combine the positive and negative features and learn the weights of positive and negative features to get the complete audio features:
$F_A = F_{A+} \times \theta_A + F_{A-} \times (1 - \theta_A)$  (3.6)
where FA is the final audio feature obtained by passing the audio vector through the attention-based two-layer bidirectional GRU model and θA is the weight to be learned.
For the vision modality, we combine the vision feature GV=(Gv1,Gv2,…,Gvl), the positive output GV+ and the negative output GV− of the vision feature and the positive hidden layer state hv+ and the negative hidden layer state hv− of the vision into the first attention mechanism Attention1: combining the positive hidden feature with the positive output and the negative hidden feature with the negative output. Finally, we obtain the positive feature and the negative feature of the vision:
$F_{V+} = \sum \mathrm{softmax}[\mathrm{relu}(h_{v+} \times W_{V+})] \times [\mathrm{relu}(\tanh(G_{V+} \times W_{V+}))] \times G_{V+}$  (3.7)

$F_{V-} = \sum \mathrm{softmax}[\mathrm{relu}(h_{v-} \times W_{V-})] \times [\mathrm{relu}(\tanh(G_{V-} \times W_{V-}))] \times G_{V-}$  (3.8)
where FV+ is the positive feature obtained from the vision feature vector after Attention1, FV− is the negative feature obtained from the vision feature vector after Attention1, WV is the parameter matrix to be learned and relu and tanh are the activation functions.
Then FV+ and FV− are input into the second layer of attention mechanism Attention2, combine the positive and negative features and learn the weights of positive and negative features to obtain the complete vision features:
$F_V = F_{V+} \times \theta_V + F_{V-} \times (1 - \theta_V)$  (3.9)
where FV is the final image feature obtained by passing the image vector through the attention-based two-layer bidirectional GRU model and θV is the weight to be learned.
The final feature extraction module gets the outputs: text feature vector FT = (Ft1, Ft2, …, Ftn), audio feature vector FA=(Fa1,Fa2,…,Fam) and vision feature vector FV=(Fv1,Fv2,…,Fvl).
In particular, a fully connected layer is added after the text modality's attention module to reduce the text features' dimensionality. The size of the fully connected layer is the same as the FT dimension and uses Sigmoid as the activation function.
The low-rank multimodal fusion (LMF) used in the feature fusion module of this paper makes multimodal fusion efficient through a low-rank weight tensor without degrading performance. Tensors are highly expressive and can model the alignment and fusion between different modalities well. The model used here also improves on the tensor fusion network (TFN) [12]: the difference is that LMF decomposes the weights into low-rank factors after the tensor fusion step, reducing the number of parameters in the model. Tensor-based fusion can be computed efficiently through a parallel decomposition of the low-rank weight tensor and the input tensor; this is more effective than simple concatenation or pooling and scales linearly with the number of modalities.
In LMF, multimodal fusion can be described as a multilinear function of:
$f: D_1 \times D_2 \times \cdots \times D_N \to H$  (3.10)
where D1,D2,⋯,DN are the vector spaces of the input modes, N is the number of modes and H is the output vector space.
Multimodal fusion aims to encode the unimodal information of the N different modalities and assemble it into a compact multimodal representation. In this paper, we use tensor fusion: an additional 1 is appended to each unimodal representation to store the multimodal interaction information, and a high-dimensional tensor Z_N containing all the modalities is then obtained by modeling. The tensor is usually computed as the outer product of the input modalities.
$Z_N = \bigotimes_{n=1}^{N} z_n$  (3.11)
where $\bigotimes_{n=1}^{N}$ denotes the tensor outer product over the set of vectors indexed by n, and $z_n$ is the input representation of modality n with the additional 1 appended. As shown in Figure 5, two modalities are aligned through the appended 1 to form a two-dimensional tensor $Z_2$, which is then decomposed by the weight tensor $W_2$ and mapped to a low-dimensional output vector.
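A small PyTorch sketch of Eq (3.11) for two modalities follows; the feature dimensions are arbitrary toy values.

```python
import torch

z_a = torch.randn(4)                      # toy audio feature z_a
z_t = torch.randn(6)                      # toy text feature z_t

# Append the constant 1 so that unimodal terms survive in the outer product.
z_a1 = torch.cat([z_a, torch.ones(1)])    # shape (5,)
z_t1 = torch.cat([z_t, torch.ones(1)])    # shape (7,)

# Two-dimensional fusion tensor Z_2 = z_a1 ⊗ z_t1 (Eq (3.11) with N = 2).
Z2 = torch.outer(z_a1, z_t1)              # shape (5, 7)
print(Z2.shape)
```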
The feature fusion module in this paper is shown in Figure 6, where the text feature vector FT=(Ft1,Ft2,…,Ftn), the audio feature vector FA=(Fa1,Fa2,…,Fam), and the vision feature vector FV=(Fv1,Fv2,…,Fvl) are input into the low-rank tensor fusion model (LMF) for feature fusion:
A vector with feature value of 1 is appended to each modal feature to store the information interactions between different modalities and to obtain the vector representation ZT for text features, ZA for audio features and ZV for vision features, respectively.
Using the additional vector 1 as the intersection point, we then construct the three modes into a three-dimensional Cartesian product model:
$Z = Z_T \otimes Z_A \otimes Z_V$  (3.12)
where Z denotes the three-dimensional tensor obtained by fusing the three modalities.
The three-dimensional tensor Z is then mapped back to a low-dimensional vector space to obtain the output of the feature fusion module h:
$h = g(Z; W, b) = W \cdot Z + b$  (3.13)
where g(⋅) is the linear layer function, h is the vector obtained by passing Z through the linear layer, W is the weight tensor to be learned and b is the bias.
In LMF, we must map the fused multidimensional tensor back to a low-dimensional output vector to improve the fusion efficiency and facilitate the downstream tasks. In this paper, we parameterize g(⋅) as a set of modality-specific low-rank factors that recover the low-rank weight tensor. By decomposing the weights into a set of low-rank factors and exploiting the fact that the tensor Z factorizes into the set {Z_n}, n = 1, …, N, we can compute the output vector h directly, thus reducing the number of parameters involved in the tensorization and bringing the computational complexity down from exponential in the number of modalities to linear.
Thus, the vector h can be decomposed as:
$h = \left( \sum_{i=1}^{r} W_T^{(i)} \otimes W_A^{(i)} \otimes W_V^{(i)} \right) \cdot Z = \left( \sum_{i=1}^{r} W_T^{(i)} \cdot Z_T \right) \circ \left( \sum_{i=1}^{r} W_A^{(i)} \cdot Z_A \right) \circ \left( \sum_{i=1}^{r} W_V^{(i)} \cdot Z_V \right)$  (3.14)
where r is the minimum rank that makes the decomposition valid, WT is the weight tensor of the text modality, WA is the weight tensor of the audio modality and WV is the weight tensor of the vision modality.
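The sketch below shows, in PyTorch, how Eq (3.14) avoids materializing the full three-way tensor: each modality is projected by its r low-rank factors and the projections are combined element-wise. The feature sizes (64 + 1 for text, 4 + 1 for audio, 16 + 1 for vision, output 128) mirror the dimensions reported in the experiment section but are otherwise illustrative assumptions.

```python
import torch

def lmf_fuse(Z_T, Z_A, Z_V, W_T, W_A, W_V):
    """Low-rank fusion following Eq (3.14).

    Z_m : (batch, d_m + 1)     modality feature with the appended 1
    W_m : (r, d_m + 1, d_out)  r low-rank factors for modality m
    Returns h of shape (batch, d_out) without building the
    (d_T + 1) x (d_A + 1) x (d_V + 1) tensor Z.
    """
    proj_T = torch.einsum('bd,rdo->bo', Z_T, W_T)   # sum_i W_T^(i) . Z_T
    proj_A = torch.einsum('bd,rdo->bo', Z_A, W_A)   # sum_i W_A^(i) . Z_A
    proj_V = torch.einsum('bd,rdo->bo', Z_V, W_V)   # sum_i W_V^(i) . Z_V
    return proj_T * proj_A * proj_V                 # element-wise product

batch, r, d_out = 8, 4, 128
Z_T = torch.randn(batch, 64 + 1)
Z_A = torch.randn(batch, 4 + 1)
Z_V = torch.randn(batch, 16 + 1)
W_T, W_A, W_V = (torch.randn(r, d, d_out) for d in (65, 5, 17))
h = lmf_fuse(Z_T, Z_A, Z_V, W_T, W_A, W_V)          # shape (8, 128)
```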
Decision classification will be performed in the output module and the output sentiment polarity will be obtained.
We connect three fully connected layers and a decision layer after the LMF module. The three fully connected layers reduce the dimensionality of the vector h layer by layer: we feed the vector h obtained from the feature fusion module into the classification module and shrink it through the three layers until a single-valued output ρ is obtained. ρ is then passed to the decision layer and mapped to the sample space; the sentiment polarity is positive when ρ≥0 and negative when ρ<0.
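A minimal sketch of this head is shown below, with the three fully connected layers sized as in the experiment section; the input dimension assumes the flattened fusion tensor described there, and the polarity rule is the sign test on ρ.

```python
import torch
import torch.nn as nn

class DecisionHead(nn.Module):
    """Three FC layers shrinking h to a single value rho, then a sign decision.
    Layer sizes follow the experiment section; treat them as illustrative."""

    def __init__(self, fused_dim=(4 + 1) * (16 + 1) * (64 + 1)):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(fused_dim, 128),
            nn.Linear(128, 128),
            nn.Linear(128, 1),
        )

    def forward(self, h):
        rho = self.fc(h).squeeze(-1)                       # single-valued output
        polarity = torch.where(rho >= 0,
                               torch.ones_like(rho),       # positive sentiment
                               -torch.ones_like(rho))      # negative sentiment
        return rho, polarity

head = DecisionHead()
rho, polarity = head(torch.randn(8, (4 + 1) * (16 + 1) * (64 + 1)))
```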
In this paper, we use the multimodal sentiment analysis datasets CMU-MOSI [31] and CMU-MOSEI as the experimental datasets. The CMU-MOSI dataset is a collection of 93 opinion videos from YouTube movie reviews; each video consists of multiple opinion clips, each annotated by five workers and averaged. The sentiment value of each segment ranges from strongly negative to strongly positive on a linear scale from –3 to +3. The CMU-MOSEI dataset is the largest multimodal sentiment and emotion recognition dataset available; it contains 23,453 annotated video clips covering 250 topics from 1000 different speakers, and each clip is aligned with its audio down to the phoneme level.
Each video is divided into clips based on its transcript. Each paragraph corresponds to the audio and vision of that period to obtain a multimodal sentiment dataset consisting of three modalities: text, audio and vision.
Preprocessing is performed for each of the three modalities. The text data are truncated or padded to a length of 50, and word embedding is performed with 300-dimensional GloVe vectors to encode the text sequences into word-vector sequences. The audio data are enhanced and denoised, and audio features are extracted with the COVAREP acoustic analysis framework. The image data are enhanced and denoised, and visual features are extracted with the Facet library.
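For the text branch, a minimal sketch of the truncate/pad step and the GloVe lookup is given below; the `embed` dictionary stands in for a real GloVe table and is a placeholder.

```python
import numpy as np

def pad_or_truncate(tokens, embed=None, max_len=50, dim=300):
    """Cut or zero-pad a token list to max_len and look up 300-d GloVe
    vectors; `embed` is a {word: vector} dict and is a placeholder here."""
    tokens = tokens[:max_len]
    vecs = [embed[w] if embed and w in embed else np.zeros(dim) for w in tokens]
    vecs += [np.zeros(dim)] * (max_len - len(vecs))   # zero-pad to max_len
    return np.stack(vecs)                             # shape (50, 300)

print(pad_or_truncate("this book is wonderful".split()).shape)  # (50, 300)
```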
In this paper, Accuracy (ACC) and F1-score are used as the evaluation metrics of the model. Accuracy is a primary metric to evaluate the classification task and is the ratio of correct samples to the total number of samples in the classification result:
$\mathrm{Acc} = \dfrac{n_{\mathrm{correct}}}{n_{\mathrm{total}}}$  (4.1)
where ncorrect is the number of correctly classified samples, and ntotal is the total number.
The F1-score is the harmonic mean of precision and recall:
$F1 = \dfrac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$  (4.2)
where precision and recall are the precision and recall rates. Precision reflects the model's ability to distinguish negative samples: the higher the value, the better the model rejects negative samples. Recall reflects the model's ability to identify positive samples: the higher the value, the more positive samples the model finds. The F1-score combines the two, and a higher F1-score indicates a more robust model.
The F1-score in this paper is computed with weighted averaging. Where a baseline in the experiments reports no F1 value, the corresponding method did not use weighted averaging in its calculation.
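For reference, both metrics can be computed with scikit-learn as sketched below on toy labels; `average='weighted'` matches the weighted F1 described above.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, -1, 1, 1, -1, 1]      # toy binary sentiment labels
y_pred = [1, -1, -1, 1, -1, 1]

acc = accuracy_score(y_true, y_pred)                 # Eq (4.1)
f1 = f1_score(y_true, y_pred, average='weighted')    # weighted F1, cf. Eq (4.2)
print(f"ACC = {acc:.3f}, F1 = {f1:.3f}")
```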
To enhance the credibility of the experiments, this paper also uses the MAE loss function and the Corr correlation coefficient as the evaluation metrics of the model and the AdamW optimizer as the processor of the network.
$\mathrm{MAE} = \dfrac{\sum_i |y_i - y_i^p|}{n}$  (4.3)
where MAE denotes the mean absolute error, y_i denotes the sentiment value of the sample label, y_i^p denotes the predicted value and n denotes the total number of samples.
$\mathrm{Corr}(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$  (4.4)
where Cov(X, Y) is the covariance between X and Y, Var[X] is the variance of X and Var[Y] is the variance of Y.
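A short NumPy sketch of the two regression metrics follows; the values are toy scores on the [-3, 3] scale and are purely illustrative.

```python
import numpy as np

y_true = np.array([2.4, -1.2, 0.6, -0.8, 1.8])   # toy labels on the [-3, 3] scale
y_pred = np.array([2.0, -0.9, 0.2, -1.1, 1.5])   # toy predictions

mae = np.mean(np.abs(y_true - y_pred))            # Eq (4.3)
corr = np.corrcoef(y_true, y_pred)[0, 1]          # Eq (4.4), Pearson correlation
print(f"MAE = {mae:.3f}, Corr = {corr:.3f}")
```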
The experimental setting of this paper is shown in Table 1.
| Experimental environment | Configuration |
| --- | --- |
| Operating system | Windows 10 |
| Processor | Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz |
| torch | 1.9.0+cpu |
| torchvision | 0.10.0+cpu |
| Programming language | Python 3.8 |
| Deep learning framework | PyTorch |
In the experiments, the loss function is L1Loss, the optimizer is AdamW and the learning rate is 0.001. The activation function used for the text features is sigmoid and that for the vision features is tanh. The dropout value is 0.5 for all three modalities (text, audio and vision). On the CMU-MOSI dataset, the embedding dimensions of the text, audio and vision modalities are 300, 5 and 20, respectively, and the corresponding hidden dimensions in the model are 128, 4 and 16; the batch size is 128 and the number of training epochs is 20. Because CMU-MOSEI is much larger, we ran the model on a GPU with the following settings: the embedding dimensions of the text, audio and vision modalities are 300, 35 and 74, respectively, the corresponding hidden dimensions are 128, 16 and 32, the batch size is 128 and the number of training epochs is 30.
When feature extraction is performed on the text data, an additional fully connected layer of size 128 × 64 reduces the dimensionality of the text features. The three fully connected layers after the feature fusion module reduce the dimensionality of the fusion vector; their sizes are (4+1)×(16+1)×(64+1) × 128, 128 × 128 and 128 × 1, and the final single-valued output is obtained.
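The optimizer and loss settings above correspond to the following PyTorch sketch; the model is a placeholder linear layer and the random tensors stand in for one CMU-MOSI batch, so this is only a schematic of the training configuration.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                 # placeholder standing in for AB-GRU
criterion = nn.L1Loss()                  # L1 loss, as in the experiments
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

x = torch.randn(128, 10)                 # one batch of size 128 (placeholder features)
y = torch.randn(128, 1)                  # sentiment regression targets

for epoch in range(20):                  # 20 training cycles on CMU-MOSI
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```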
To validate the performance of the attention-based two-layer bidirectional GRU model proposed in this paper, we compare it with other multimodal fusion models on the CMU-MOSI dataset and CMU-MOSEI dataset.
AB-GRU: This paper proposes the attention-based two-layer bidirectional GRU multimodal sentiment analysis model.
LMF [8]: Low-rank multimodal fusion, which decomposes the weights into low-rank factors, reduces the number of parameters in the model.
TFN [12]: The tensor fusion network is tailored to address the instability of spoken language and accompanying gestures and speech in online videos. It can learn intra-modal and inter-modal dynamics end-to-end.
TFN+: The improved attention-based two-layer bidirectional GRU network of this paper combined with the tensor fusion model for feature fusion.
GME-LSTM [32]: Gated multimodal embedding can solve the fusion challenge when noise is present in the modalities. LSTM with temporal attention can perform word-level fusion with better fusion resolution.
MARN [7]: Multi-attention recurrent network, which discovers interactions between morphologies by using neural components called multi-attention blocks (MAB) and stores them in a mixed memory of recurrent components called long short term hybrid memory (LSTHM).
MFN [13]: Memory fusion network, which explicitly accounts for two interactions in neural structures and models them continuously over time, sends interactions across views with temporal information.
MFM [33]: Multimodal decomposition model optimizes the common generation-discrimination objective across multimodal data and labels by decomposing the representation into two independent sets of factors: multimodal discriminative factors and modality-specific generative factors.
RMFN [34]: Recurrent multi-stage fusion network, which decomposes the fusion problem into multiple stages, each focusing on a subset of multimodal signals, for specialized and efficient fusion.
The AB-GRU model was evaluated on the CMU-MOSI dataset, and the results are shown in Figure 8. After about 10 training epochs, the ACC and loss curves gradually flatten. Over repeated runs, the final results are an ACC of 80.9% and an MAE of 93.0. The currently popular multimodal sentiment analysis methods were run on CMU-MOSI under the same experimental environment and parameters, and the comparison with our model is shown in Table 2. AB-GRU is the attention-based two-layer bidirectional GRU model proposed in this paper. Compared with the other multimodal sentiment analysis models, AB-GRU shows clear improvements in both ACC and F1-score, reaching 80.9 and 81.0%, respectively. Compared with the original LMF model, our improvements yield a 4.5% increase in classification accuracy. LMF uses LSTM networks for feature extraction; the experiments show that bidirectional GRU networks improve the efficiency of feature extraction, that stacking GRU layers improves accuracy without noticeably slowing down training, and that the attention mechanism focuses on the important information of the different modal data and extracts deep features for the downstream fusion task.
| Model | ACC/% | F1-score/% | MAE% | Corr% |
| --- | --- | --- | --- | --- |
| AB-GRU | 80.9 | 81.0 | 93.0 | 65.8 |
| TFN+ | 80.1 | 80.1 | 91.9 | 69.7 |
| LMF [8] | 76.4 | 75.7 | 91.2 | 66.8 |
| TFN [12] | 77.1 | 77.9 | 95.6 | 67.2 |
| GME-LSTM [32] | 76.5 | —— | 102.0 | 62.1 |
| MARN [7] | 77.1 | 77.0 | 96.8 | 63.2 |
| MFN [13] | 77.4 | 77.3 | 97.1 | 62.5 |
| MFM [33] | 78.1 | 78.0 | 94.5 | 60.7 |
| RMFN [34] | 78.4 | 78.0 | 92.9 | 67.3 |
Compared with RMFN, the strongest baseline, AB-GRU improves classification accuracy by 2.5%. Most current multimodal sentiment classification models focus on improving the modality fusion method. The attention-based two-layer bidirectional GRU model proposed in this paper instead exploits the characteristics of the different modal data, improving the feature extraction module and choosing the more suitable low-rank tensor fusion model for feature fusion, so that overall performance improves while the high computation rate is maintained.
The comparative analysis of AB-GRU and other models on the CMU-MOSI and CMU-MOSEI datasets is shown in Table 3.
| Model | CMU-MOSI ACC/% | CMU-MOSI F1-score/% | CMU-MOSEI ACC/% | CMU-MOSEI F1-score/% |
| --- | --- | --- | --- | --- |
| AB-GRU | 80.7 | 80.9 | 80.3 | 80.1 |
| TFN+ | 80.1 | 88.0 | 78.3 | 78.3 |
| LMF [8] | 76.4 | 75.7 | 75.2 | 75.0 |
| TFN [12] | 77.1 | 77.9 | 76.2 | 76.1 |
| GME-LSTM [32] | 76.5 | —— | 75.6 | —— |
| MARN [7] | 77.1 | 77.0 | 75.9 | 75.8 |
| MFN [13] | 77.4 | 77.3 | 76.0 | 76.0 |
| MFM [33] | 78.1 | 78.0 | 76.8 | 76.5 |
| RMFN [34] | 78.4 | 78.0 | 76.7 | 76.9 |
Because the CMU-MOSEI dataset is more complex, the sentiment analysis results of all models weaken somewhat, but the AB-GRU model can still be seen to outperform the other sentiment analysis models.
The experiments show that the AB-GRU model achieves satisfactory performance on both the CMU-MOSI dataset and the CMU-MOSEI dataset. This indicates that the model has good generalization and can adapt to different sentiment analysis tasks and achieve good results on different datasets.
In this section, ablation experiments are set up to verify the importance of the different modules in the AB-GRU model. The experimental results are shown in Table 4 and plotted as a histogram in Figure 8, which shows that AB-GRU achieves superior results compared with the variants obtained before improving the individual modules. Since LMF is itself an improvement on TFN, we also combined the AB-GRU feature extractor with TFN; the result, listed as TFN+ in Table 4, shows a significant improvement over TFN and again verifies the effectiveness of the attention-based two-layer bidirectional GRU model for multimodal sentiment classification.
| Model | ACC/% | F1-score/% | MAE% | Corr% |
| --- | --- | --- | --- | --- |
| AB-GRU | 80.9 | 81.0 | 93.0 | 65.8 |
| TFN+ | 80.1 | 80.1 | 91.9 | 69.7 |
| Single-layer GRU | 80.3 | 79.9 | 91.8 | 67.1 |
| LSTM | 79.6 | 79.5 | 97.0 | 67.9 |
| ReLU | 80.6 | 81.1 | 93.2 | 65.7 |
| Add Noisy | 80.7 | 80.7 | 92.7 | 67.8 |
Recurrent neural networks work well on temporal information such as text and audio. To choose between LSTM and GRU for feature extraction, we verified that LSTM combined with LMF achieves a multimodal sentiment classification accuracy of 79.6%, which exceeds most similar models but is still lower than the AB-GRU model used in this paper. Second, the stacked two-layer GRU improves accuracy by 0.6% over the single-layer GRU model at the cost of only a slightly lower training rate.
We also tested the effect of different activation functions on performance; the sigmoid function finally adopted performed slightly better than ReLU. In addition, the text modality has high dimensionality and is harder to process, so overfitting can easily occur when the GRU network is used for feature extraction. We therefore considered adding noise to the text data to improve the model's generalization; the results are listed as "Add Noisy" in Table 4, and the ACC value did not improve significantly.
The different modalities are decoupled from each other in the low-rank tensor fusion model, so it can be extended to data with any number of modalities. To explore the effect of the number and type of modalities on performance, we designed a set of experiments whose results are shown in Table 5; the training curves of the different modal combinations over the epochs are shown in Figure 10.
| Model | Modality | ACC/% | F1-score/% |
| --- | --- | --- | --- |
| AB-GRU | T+A+V | 80.9 | 81.0 |
| AB-GRU | T+A | 79.1 | 78.9 |
| AB-GRU | T+V | 75.9 | 75.6 |
| AB-GRU | A+V | 56.3 | 56.7 |
Figure 10 shows that the classification performance of the A+V combination is significantly lower than that of the other groups, which indicates that text data plays the crucial role in the multimodal emotion classification task while audio and image data play a supporting role. Audio data is more compatible with text: the T+A combination reaches an ACC of 79.1% and an F1-score of 78.9%, which already handles the sentiment classification task well. Adding images improves accuracy further; the improvement is small, but even a small improvement matters in the face of complex and redundant information.
To address the heterogeneity gap between different modalities and improve the efficiency of feature extraction, this paper proposed an attention-based two-layer bidirectional GRU multimodal sentiment analysis model. The two-layer bidirectional GRU effectively learns the temporal features of text and audio with a simple structure and fast learning speed, the connected attention layers allow better extraction of the essential features, and the LMF module reduces the dimensionality of the multimodal data, improving both the computation rate and the accuracy. Experimental results show that the performance of the proposed AB-GRU model improves by at least 2.5% compared with other multimodal sentiment analysis models.
In our future work, we will conduct more in-depth research on applying multimodal sentiment analysis methods in different fields. In the medical field, a patient's speech, voice and facial expression can be monitored for condition analysis so that timely feedback and treatment can be given; on short-video platforms, multimodal methods can support classification, integration and recommendation. Moreover, as technology develops, we will continue to improve multimodal sentiment analysis methods, from feature extraction and feature fusion to data preprocessing and other modules, to improve the model's efficiency.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work is supported by the National Natural Science Foundation of China (Grant No.61602161, 61772180), Hubei Province Science and Technology Support Project (Grant No.2020BAB012), Hubei Provincial Science and Technology Program Project (Grant No.2023BCB041), and The Fundamental Research Funds for the Research Fund of Hubei University of Technology (HBUT: 2021046, 21060, 21066)
The authors declare there is no conflict of interest.