The increasing use of large language models (LLMs) in mental health support necessitates a detailed evaluation of their recommendation capabilities. This study compared four modern LLMs (Gemma 2, GPT-3.5-Turbo, GPT-4o, and Claude 4 Sonnet) on the task of recommending mental health applications. We constructed a structured dataset of 55 mental health apps using RoBERTa-based sentiment analysis and keyword similarity scoring, focusing on depression, anxiety, ADHD, and insomnia. Baseline LLMs showed inconsistent overall accuracy (ranging from 60% to 75%) and often relied on outdated or generic information. In contrast, our retrieval-augmented generation (RAG) pipeline enabled all models to achieve 100% accuracy while maintaining good diversity and recommending apps with significantly better user ratings. These findings demonstrate that dataset-enhanced LLMs, whether open-source or proprietary, can excel in domain-specific applications such as mental health resource recommendation, potentially improving access to quality mental health support tools.
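To make the pipeline concrete, the snippet below is a minimal sketch of the retrieval step only, assuming a FAISS index built over sentence-transformer embeddings of app descriptions. The embedding model, the toy app records, and the field names are illustrative assumptions, not the paper's exact implementation; the full dataset contains 55 structured entries.

```python
# Minimal sketch of the retrieval step in a RAG pipeline for app recommendation.
# Assumptions: sentence-transformers for embeddings, FAISS for similarity search,
# and a toy list of app records standing in for the structured dataset.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative records; real entries would also carry ratings and sentiment scores.
apps = [
    {"name": "AppA", "condition": "anxiety", "description": "CBT-based exercises for anxiety."},
    {"name": "AppB", "condition": "insomnia", "description": "Sleep tracking and wind-down routines."},
]

model = SentenceTransformer("sentence-transformers/all-roberta-large-v1")
embeddings = model.encode([a["description"] for a in apps], normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 3):
    """Return the k apps whose descriptions are most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(apps[i]["name"], float(s)) for i, s in zip(ids[0], scores[0]) if i != -1]

print(retrieve("apps that help with falling asleep"))
```

In the full system, the retrieved app records (name, target condition, rating, description) would be injected into the LLM prompt, grounding recommendations in the curated dataset rather than the model's parametric knowledge.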
Citation: Kris Prasad, Md Abdullah Al Hafiz Khan, Yong Pei. Large language model enabled mental health app recommendations using structured datasets[J]. Applied Computing and Intelligence, 2025, 5(2): 154-167. doi: 10.3934/aci.2025010