Research article

Research on AI-generated Chinese text detection method based on deep learning

  • Published: 05 December 2025
  • This paper proposes a dual-stream feature fusion model that integrates RoBERTa semantic encoding with manually designed statistical text features through a deep-learning fusion layer, and constructs a cross-domain, multi-source corpus to train and validate the model's detection performance. The corpus encompasses the HC3 dataset, a ChatGPT detection dataset, and self-constructed academic-abstract and literary-work datasets. By pairing academic abstracts from CNKI with texts generated by three LLMs (DeepSeek R1, Phi4, and Qwen 2.5), we built a dataset containing both human-written and machine-generated texts. Experiments show that the RoBERTa-text model achieves its best detection performance on Phi4-generated texts (recall: 100%), while Qwen 2.5-generated texts pose a greater challenge because of their human-like writing patterns (accuracy: 88.91%). By categorizing texts into true-positive, false-positive, true-negative, and false-negative groups, we conducted statistical and linguistic analyses: texts with limited sentence-length variation and dense punctuation were more likely to be identified as AI-generated, and comparison of word-frequency distributions and semantic perplexity revealed that LLM-generated texts exhibit repetitive lexical selection, whereas human-written texts show more diverse vocabulary usage. These analyses elucidate the model's decision-making rationale and offer novel perspectives for AI-generated text detection research. A minimal sketch of the dual-stream architecture follows this abstract.
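The dual-stream design described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' released code: the checkpoint name (hfl/chinese-roberta-wwm-ext), the choice of two statistical features, the sentence-splitting regex, and all layer sizes are assumptions made for the example.

```python
import re
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DualStreamDetector(nn.Module):
    """Semantic stream (RoBERTa [CLS] vector) fused with a statistical stream."""
    def __init__(self, encoder_name="hfl/chinese-roberta-wwm-ext", n_stat_feats=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # semantic stream
        hidden = self.encoder.config.hidden_size
        self.stat_proj = nn.Linear(n_stat_feats, 32)            # statistical stream
        self.classifier = nn.Sequential(
            nn.Linear(hidden + 32, 128),
            nn.ReLU(),
            nn.Linear(128, 2),                                  # human vs. AI-generated
        )

    def forward(self, input_ids, attention_mask, stat_feats):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                       # [CLS] semantic vector
        fused = torch.cat([cls, torch.relu(self.stat_proj(stat_feats))], dim=-1)
        return self.classifier(fused)

def stat_features(text: str) -> torch.Tensor:
    """Two hand-crafted cues named in the abstract: sentence-length variation
    and punctuation density. The exact definitions here are assumptions."""
    sents = [s for s in re.split(r"[。！？!?]", text) if s.strip()]
    lens = torch.tensor([float(len(s)) for s in sents] or [0.0])
    punct_density = sum(ch in "，。！？；：、,.!?;:" for ch in text) / max(len(text), 1)
    return torch.tensor([lens.std(unbiased=False).item(), punct_density])
```

A short usage example; the label ordering (0 = human-written, 1 = AI-generated) is likewise assumed:

```python
tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = DualStreamDetector().eval()
text = "本文提出一种基于深度学习的中文AI生成文本检测方法。"
enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(enc["input_ids"], enc["attention_mask"],
                   stat_features(text).unsqueeze(0))
pred = logits.argmax(dim=-1)
```

Concatenating the projected statistical vector with the [CLS] embedding before classification lets surface cues (punctuation density, uniform sentence lengths) complement the semantic signal; the abstract reports exactly these cues as predictive of AI-generated text.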

    Citation: Chang Su, Yaqi Jiang, Jianlin Wang, Junfang Zhao. Research on AI-generated Chinese text detection method based on deep learning[J]. Big Data and Information Analytics, 2025, 9: 328-349. doi: 10.3934/bdia.2025016

  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)