To address high-dimensional sparsity, missing semantic information, and ambiguous topic boundaries in traditional methods, this paper investigates a method for extracting expert information from documents based on Word2Vec and Transformer, with the aim of improving semantic accuracy and clustering effectiveness. First, the Word2Vec model generates semantic embedding vectors, which reduce high-dimensional sparsity and strengthen the semantic representation of documents. Second, the Transformer extracts expert information from the document vectors and differentiates effectively between topics. Experiments show that Word2Vec-based document semantic embedding significantly improves the performance of the Transformer in expert information extraction. Compared with the traditional TF-IDF + Transformer method, the approach performs better in topic consistency and semantic capture.
Citation: Tianyu Yang, Chang Li, Liang Li. Research on expert information extraction based on Word2Vec and improved Transformer[J]. AIMS Electronics and Electrical Engineering, 2026, 10(1): 54-70. doi: 10.3934/electreng.2026003
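The abstract's contrast between sparse count-based features and dense semantic embeddings can be illustrated with a small sketch. It is not the authors' implementation: the three-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical stand-ins for trained Word2Vec vectors, and averaging word vectors into a document vector is an assumption made here for illustration only. The Transformer stage is not shown.

```python
from collections import Counter
from math import sqrt

# Hypothetical 3-dimensional embeddings standing in for trained Word2Vec
# vectors; real models typically use 100 to 300 dimensions.
TOY_EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "transformer": [0.7, 0.3, 0.0],
    "protein": [0.0, 0.1, 0.9],
    "genome": [0.1, 0.0, 0.8],
}


def bag_of_words(tokens, vocabulary):
    """Sparse count vector with one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def mean_embedding(tokens, embeddings):
    """Dense document vector: the average of the known word vectors.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vocabulary = sorted(TOY_EMBEDDINGS)
doc_a = ["neural", "network"]   # machine learning topic
doc_b = ["transformer"]         # same topic, no words shared with doc_a
doc_c = ["protein", "genome"]   # different topic

# Count-based vectors share no non-zero dimension for doc_a and doc_b.
sparse_ab = cosine(bag_of_words(doc_a, vocabulary),
                   bag_of_words(doc_b, vocabulary))

# Dense vectors place related documents close together.
dense_ab = cosine(mean_embedding(doc_a, TOY_EMBEDDINGS),
                  mean_embedding(doc_b, TOY_EMBEDDINGS))
dense_ac = cosine(mean_embedding(doc_a, TOY_EMBEDDINGS),
                  mean_embedding(doc_c, TOY_EMBEDDINGS))

print(f"sparse similarity a-b: {sparse_ab:.3f}")
print(f"dense similarity a-b:  {dense_ab:.3f}")
print(f"dense similarity a-c:  {dense_ac:.3f}")
```

In this toy setting the count-based vectors grow with the vocabulary and treat two related documents without shared words as unrelated, while the dense vectors keep a fixed, small dimension and still separate the two topics.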