A novel comprehensive method for customer segmentation based on identifying topics and sentiments from unstructured online product reviews

Chaolong Ding; Xuesi Ma

doi:10.3934/bdia.2026001

Big Data and Information Analytics

2026, Volume 10, 1–28.

Research article

School of Mathematics and Information Science, Henan Polytechnic University, Jiaozuo 45400, China

Received: 01 December 2025; Revised: 18 December 2025; Accepted: 19 December 2025; Published: 12 January 2026

Topic extraction and sentiment analysis are crucial components of product development and business insight. Because of the information overload in e-commerce reviews and the diverse preferences of customers, traditional research methods fail to identify commonalities among customers effectively. To overcome these challenges, we proposed a five-stage ensemble approach for customer segmentation. First, TextRank was employed in data preprocessing to extract key textual features and filter relevant content. Second, key topics were identified with a Word2Vec-based topic identification model. Third, to improve the accuracy of topic-level sentiment scores, clause-level sentiment analysis was conducted using BERT, and the resulting sentiment scores were fine-tuned through TF-IDF weighting for finer granularity. Fourth, interpretable machine learning (IML) algorithms were employed to analyze user satisfaction (USAT), providing both predictive performance and model transparency. Finally, deep embedded clustering (DEC) was used to segment customers based on the extracted key topic-sentiment features. The proposed method was validated through a real-world case study involving 22,320 online user reviews. The results showed that categorical boosting (CatBoost) achieved the highest performance, with an F1-score of 0.9433, demonstrating high accuracy and transparency in predicting USAT determinants. The findings facilitate the identification of innovative product concepts.
Keywords: user satisfaction, topic identification, sentiment analysis, customer segmentation, explainable artificial intelligence
Citation: Chaolong Ding, Xuesi Ma. A novel comprehensive method for customer segmentation based on identifying topics and sentiments from unstructured online product reviews[J]. Big Data and Information Analytics, 2026, 10: 1-28. doi: 10.3934/bdia.2026001
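
The abstract describes clause-level sentiment scores being combined into topic-level scores with TF-IDF weighting, but it does not give the formula. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each clause is already tokenized, that each clause already has a sentiment score in [-1, 1] (hard-coded here as a stand-in for BERT output), and that a topic is a set of keywords. The function names `tfidf_weights` and `topic_sentiment`, the smoothed IDF variant, and the weighted-mean aggregation are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def tfidf_weights(clauses):
    """Return one {term: tf-idf} dict per tokenized clause (smoothed idf)."""
    n = len(clauses)
    # Document frequency: number of clauses in which each term occurs.
    df = Counter(term for clause in clauses for term in set(clause))
    weights = []
    for clause in clauses:
        tf = Counter(clause)
        weights.append({
            term: (count / len(clause)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


def topic_sentiment(clauses, sentiments, topic_terms):
    """Weighted mean of clause sentiment per topic (assumed aggregation).

    A clause's weight for a topic is the summed tf-idf of the topic
    keywords it contains; clauses without any keyword contribute nothing.
    """
    weights = tfidf_weights(clauses)
    num = defaultdict(float)
    den = defaultdict(float)
    for clause_weights, score in zip(weights, sentiments):
        for topic, terms in topic_terms.items():
            w = sum(clause_weights.get(t, 0.0) for t in terms)
            num[topic] += w * score
            den[topic] += w
    # Topics that never appear are omitted rather than given a score of 0.
    return {t: num[t] / den[t] for t in topic_terms if den[t] > 0}


# Hypothetical toy data: three clauses with made-up sentiment scores.
clauses = [
    ["battery", "lasts", "long"],
    ["screen", "is", "dim"],
    ["battery", "drains", "fast"],
]
sentiments = [0.9, -0.6, -0.5]
topic_terms = {"battery": {"battery"}, "display": {"screen"}}

scores = topic_sentiment(clauses, sentiments, topic_terms)
print(scores)
```

In this toy input the two battery clauses receive equal weight, so the battery score is the plain average of their sentiments, while the display score comes from a single clause. With real reviews, clauses that use a topic keyword more distinctively would pull the topic score further toward their own sentiment.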

© 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)