Combining statistical, structural, and linguistic features for keyword extraction from web pages

Himat Shah; Pasi Fränti

doi:10.3934/aci.2022007

Applied Computing and Intelligence

2022, Volume 2, Issue 2: 115-132. doi: 10.3934/aci.2022007

Research article

Himat Shah, Pasi Fränti

School of Computing, University of Eastern Finland, Joensuu, Finland

Academic Editor: Chih-Cheng Hung

Received: 09 June 2022 Revised: 07 July 2022 Accepted: 12 September 2022 Published: 20 September 2022

Keywords are commonly used to summarize text documents. In this paper, we perform a systematic comparison of methods for automatic keyword extraction from web pages. The methods are based on three different types of features: statistical, structural and linguistic. Statistical features are the most common, but there are other clues in web documents that can also be used. Structural features utilize styling codes like header tags and links, but also the structure of the web page. Linguistic features can be based on detecting synonyms, semantic similarity of the words and part-of-speech tagging, but also concept hierarchy or a concept graph derived from Wikipedia. We compare different types of features to find out the importance of each of them. One of the key results is that stop word removal and other pre-processing steps are the most critical. The most successful linguistic feature was a pre-constructed list of words that had no synonyms in WordNet. A new method called ACI‑rank is also compiled from the best working combination.

Keywords: web mining, text analysis, keyword extraction, document object model tree

Citation: Himat Shah, Pasi Fränti. Combining statistical, structural, and linguistic features for keyword extraction from web pages[J]. Applied Computing and Intelligence, 2022, 2(2): 115-132. doi: 10.3934/aci.2022007
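
To make the feature types named in the abstract more concrete, the following is a minimal sketch of how a statistical feature (word frequency), a structural feature (extra weight for words inside title, header, and link tags), and stop word removal could be combined into one score. It is not the ACI-rank method described in the paper: the stop word list, the tag weights in `TAG_WEIGHTS`, and the names `WeightedTextParser` and `extract_keywords` are all assumptions chosen for illustration. Linguistic features are left out because they would need an external resource such as WordNet.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "the", "to", "with"}

# Hypothetical weights for words that appear inside selected tags.
TAG_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "a": 1.5}

# Text inside these tags is never visible content, so it is ignored.
SKIPPED_TAGS = {"script", "style"}


class WeightedTextParser(HTMLParser):
    """Accumulates a weighted frequency for every non-stop word on a page."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # word -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching tag (tolerates unclosed tags).
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.open_tags):
            return
        # Structural feature: use the highest weight among enclosing tags.
        weight = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags],
                     default=1.0)
        # Statistical feature: each occurrence adds its weight to the score.
        for word in re.findall(r"[a-z]+", data.lower()):
            if word not in STOP_WORDS and len(word) > 2:
                self.scores[word] += weight


def extract_keywords(html, top_k=5):
    """Return the top_k highest scoring words of an HTML string."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return [word for word, _ in parser.scores.most_common(top_k)]
```

Calling `extract_keywords(page_html, top_k=10)` on the raw HTML of a page returns the ten highest scoring words. Taking the maximum tag weight rather than the sum is one possible design choice; it keeps deeply nested markup from inflating a word's score.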

© 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)