Research article Special Issues

Topic-based automatic summarization algorithm for Chinese short text

  • Received: 23 March 2020 Accepted: 05 May 2020 Published: 12 May 2020
  • Most current automatic summarization methods target English text. Words in Chinese text vary widely, parts of speech are numerous and complex, and polysemous or ambiguous words appear frequently. As a result, extracting useful feature words is harder for Chinese text than for English text. Because of the complex syntax of Chinese, relatively few automatic summarization methods exist for Chinese text. Earlier methods could only select important sentences from the original text and arrange them simply, which yields summaries with disordered sentences and poor coherence. Because Chinese short text also tends to contain redundant information and irregular sentence structure, we propose a topic-based automatic summarization method for Chinese short text. First, we propose a key sentence selection method that combines topic words with TF-IDF to score each text in the original data against its corresponding topic. The sentence with the highest score is then selected as the topic sentence for that topic. Because Weibo short text may contain a lot of irrelevant information and sometimes lacks important components of the topic, we propose three retouching mechanisms to improve the conciseness, richness, and readability of the extracted topic sentences. We validate our approach on natural disaster and social hot event datasets from Sina Weibo. The experimental results show that the polished topic summary not only reflects the exact relationship between topic sentences and natural disasters or social hot events, but also carries rich semantic information. More importantly, the topic sentence conveys nearly all the basic elements of a natural disaster or social hot event, which can help the government guide disaster relief and help users quickly obtain information about social hot events.
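
    The key sentence selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the code assumes that a text's score is the sum of the TF-IDF weights of the topic words it contains, uses a smoothed IDF variant, and expects texts that are already word-segmented. The function names `tfidf_weights` and `select_topic_sentence` are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per tokenized document.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1. The paper's
    exact weighting is not stated in the abstract.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


def select_topic_sentence(docs, topic_words):
    """Score each text against a topic and return (best_index, scores).

    Assumption: score = sum of TF-IDF weights of the topic words
    that occur in the text.
    """
    weights = tfidf_weights(docs)
    scores = [sum(w.get(t, 0.0) for t in topic_words) for w in weights]
    best = max(range(len(docs)), key=scores.__getitem__)
    return best, scores


# Toy, pre-segmented Weibo-style texts (illustrative data only).
docs = [
    ["地震", "发生", "四川"],
    ["今天", "天气", "好"],
    ["四川", "地震", "救援", "地震"],
]
best, scores = select_topic_sentence(docs, {"地震", "四川"})
print(best, scores)
```

    In this toy run the third text is selected because topic words make up the largest share of its tokens. The three retouching mechanisms mentioned in the abstract are not covered by this sketch.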

    Citation: Tinghuai Ma, Hongmei Wang, Yuwei Zhao, Yuan Tian, Najla Al-Nabhan. Topic-based automatic summarization algorithm for Chinese short text[J]. Mathematical Biosciences and Engineering, 2020, 17(4): 3582-3600. doi: 10.3934/mbe.2020202



  • © 2020 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)