Multimodal sentiment analysis (MSA), which integrates text, audio, and visual cues, plays a critical role in affective computing and human-computer interaction. However, existing fusion architectures—typically based on multilayer perceptrons (MLPs)—struggle to capture complex nonlinear dependencies across modalities, limiting their effectiveness in modeling subtle and implicit emotional expressions. To address this, we propose MEGAKANs, a novel framework that introduces a highly expressive and interpretable fusion strategy. Unlike conventional fusion approaches that rely on static MLPs, MEGAKANs incorporates Kolmogorov-Arnold networks (KANs) into the mid-fusion stage, leveraging learnable functional decomposition to flexibly model high-order nonlinear interactions across modalities. Complementarily, KANs are embedded into the global channel-spatial attention (GCSA) module, which adaptively highlights salient emotional patterns across the spatial and channel dimensions, thereby enhancing cross-modal alignment. MEGAKANs was rigorously evaluated on the CMU-MOSI benchmark dataset for binary, multi-class, and regression-based sentiment prediction tasks. Experimental results show that MEGAKANs surpasses state-of-the-art baselines, achieving a binary accuracy of 87.02% and reducing the mean absolute error (MAE) to 0.7265, demonstrating superior robustness and generalization. Notably, the proposed model showed the greatest relative improvement on the underutilized visual modality, validating its ability to capture subtle affective cues. These results not only demonstrate the superior performance of MEGAKANs but also highlight the potential of KANs in multimodal learning, offering a scalable and interpretable solution for real-world affective computing applications.
Citation: Xinglong Shen, Xuesi Ma. MEGAKANs: enhancing intermodal dependence with global channel-spatial attention and Kolmogorov-Arnold networks for multimodal sentiment analysis[J]. Big Data and Information Analytics, 2026, 10: 96-129. doi: 10.3934/bdia.2026006
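The abstract names two KAN-based components: a mid-fusion layer that models cross-modal interactions through learnable functional decomposition, and a KAN-augmented global channel-spatial attention (GCSA) block. The paper's actual implementation is not reproduced here, so the following PyTorch sketch is only illustrative: the Gaussian radial-basis parameterisation stands in for the B-spline functions of the original KAN formulation, the module names (KANLayer, GCSAWithKAN) and all feature dimensions are hypothetical, and the GCSA variant shown simply swaps the usual channel-attention MLP for a small KAN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KANLayer(nn.Module):
    """KAN-style layer: every input-output edge carries a learnable univariate
    function, here parameterised as a weighted sum of Gaussian radial bases
    plus a SiLU base path (a lightweight stand-in for B-splines)."""

    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8,
                 grid_range: tuple = (-2.0, 2.0)):
        super().__init__()
        self.register_buffer(
            "centers", torch.linspace(grid_range[0], grid_range[1], num_basis))
        self.gamma = (num_basis - 1) / (grid_range[1] - grid_range[0])
        # One coefficient vector per (input, output) edge.
        self.coeff = nn.Parameter(0.1 * torch.randn(in_dim, num_basis, out_dim))
        self.base = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, in_dim)
        # Gaussian basis evaluation per scalar input: (batch, in_dim, num_basis).
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) * self.gamma) ** 2)
        spline = torch.einsum("bik,iko->bo", phi, self.coeff)
        return self.base(F.silu(x)) + spline


class GCSAWithKAN(nn.Module):
    """Channel-spatial attention in which the channel-attention MLP is
    replaced by a small KAN, loosely following the GCSA idea in the abstract."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_kan = nn.Sequential(
            KANLayer(channels, channels // reduction),
            KANLayer(channels // reduction, channels),
        )
        # Spatial attention over the sequence axis from pooled channel statistics.
        self.spatial_conv = nn.Conv1d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, channels, length)
        # Channel attention: global average pooling -> KAN -> sigmoid gates.
        gate_c = torch.sigmoid(self.channel_kan(x.mean(dim=-1)))        # (batch, channels)
        x = x * gate_c.unsqueeze(-1)
        # Spatial attention: per-position mean/max over channels -> conv -> sigmoid.
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)         # (batch, 2, length)
        gate_s = torch.sigmoid(self.spatial_conv(stats))                # (batch, 1, length)
        return x * gate_s


if __name__ == "__main__":
    # Toy forward passes with assumed feature sizes for the three modalities.
    text, audio, visual = torch.randn(4, 128), torch.randn(4, 32), torch.randn(4, 64)
    fused = KANLayer(128 + 32 + 64, 64)(torch.cat([text, audio, visual], dim=-1))
    print(fused.shape)       # torch.Size([4, 64])
    refined = GCSAWithKAN(channels=64)(torch.randn(4, 64, 10))
    print(refined.shape)     # torch.Size([4, 64, 10])
```

Because each edge learns its own univariate function, the fusion layer can represent higher-order cross-modal interactions that a fixed-activation MLP of the same width would need extra depth to emulate, which is the expressiveness argument the abstract makes for KAN-based fusion.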