A new document representation based on global policy for supervised term weighting schemes in text categorization

Longjia Jia; Bangzuo Zhang

doi:10.3934/mbe.2022245

Mathematical Biosciences and Engineering

2022, Volume 19, Issue 5: 5223-5240.

Research article

Longjia Jia ¹, Bangzuo Zhang ²

1. School of Mathematics and Statistics, Northeast Normal University, Changchun, China
2. School of Information Science and Technology, Northeast Normal University, Changchun, China

Academic Editor: Xiangtao Li

Received: 06 September 2021 Revised: 27 February 2022 Accepted: 04 March 2022 Published: 23 March 2022

Keywords: document representation strategy; global policy; text categorization; machine learning
Citation: Longjia Jia, Bangzuo Zhang. A new document representation based on global policy for supervised term weighting schemes in text categorization[J]. Mathematical Biosciences and Engineering, 2022, 19(5): 5223-5240. doi: 10.3934/mbe.2022245

Abstract

Document classification involves two main factors: the document representation method and the classification algorithm. In this study, we focus on the document representation method and demonstrate that the choice of representation method affects the quality of classification results. We propose a document representation strategy for supervised text classification, named document representation based on global policy (DRGP), which obtains an appropriate document representation according to the distribution of terms. The main idea of DRGP is to construct an optimization function from the importance of terms to different categories. In the experiments, we investigate the effects of DRGP on the 20 Newsgroups and Reuters-21578 datasets, using SVM as the classifier. The results show that DRGP outperforms other text representation strategies, such as Document Max, Document Two Max and global policy.
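
To make the setting concrete, the sketch below shows why a representation strategy is needed at all under supervised term weighting. Term weights depend on the category, so a document whose category is unknown has one candidate vector per category. The weighting used here is tf.rf (relevance frequency), chosen only as a familiar supervised scheme; the abstract does not say which schemes were used. The toy corpus, the function names, and the `max_over_categories` rule are all hypothetical, and the sketch does not implement DRGP or claim to match Document Max, Document Two Max or global policy, whose definitions are not given in this abstract.

```python
import math
from collections import Counter

# Toy labelled corpus (hypothetical data, for illustration only).
TRAIN = [
    ("sports", "team wins match"),
    ("sports", "team loses match"),
    ("finance", "market wins investors"),
    ("finance", "market falls"),
]


def rf_weights(train, category):
    """Relevance frequency of each term with respect to one category.

    rf = log2(2 + a / max(1, c)), where a and c count the training
    documents containing the term inside and outside the category.
    """
    inside, outside = Counter(), Counter()
    for label, text in train:
        target = inside if label == category else outside
        target.update(set(text.split()))  # document frequency, not raw counts
    vocab = set(inside) | set(outside)
    return {t: math.log2(2 + inside[t] / max(1, outside[t])) for t in vocab}


def candidate_vectors(train, text):
    """One tf.rf vector per category for a document of unknown category."""
    tf = Counter(text.split())
    vectors = {}
    for category in sorted({label for label, _ in train}):
        rf = rf_weights(train, category)
        vectors[category] = {t: n * rf.get(t, 0.0) for t, n in tf.items()}
    return vectors


def max_over_categories(candidates):
    """Hypothetical combination rule: keep each term's largest weight.

    Illustration only; not the DRGP optimization described in the abstract.
    """
    terms = {t for vec in candidates.values() for t in vec}
    return {t: max(vec.get(t, 0.0) for vec in candidates.values()) for t in terms}


candidates = candidate_vectors(TRAIN, "team wins market")
combined = max_over_categories(candidates)
```

In this toy run the same document receives different weights for "team" and "market" depending on which category is assumed, which is the ambiguity a document representation strategy has to resolve before the vector is passed to a classifier such as an SVM.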

© 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)