Police narrative reports from 911 responses contain valuable signs for early behavioral health intervention, but extracting these signs manually is time-consuming and tedious. We present RECAP, a human-AI collaboration framework that fine-tunes three large language models (Mistral-7B, Llama-3-8B, and TinyLlama) to classify behavioral health (BH) cases in narratives and to generate short, sentence-level rationales. The models are trained on an annotated corpus with supervised instruction data and then aligned using direct preference optimization (DPO), allowing patrol officer preferences to continually shape system behavior without disrupting existing workflows. On a held-out test set, Mistral-7B achieves 85.2% weighted accuracy and 84.2% F1-score, matching strong prior baselines while improving interpretability through short, text-span-linked explanations; Llama-3-8B performs similarly, and TinyLlama offers competitive accuracy at a lower compute cost. RECAP is designed to reduce manual effort and surface behavioral health signs earlier in police narratives, while providing rationales and maintaining officer control.
Citation: William A. Stigall, Francis Nweke, Hailey N. Walker, Md Abdullah Al Hafiz Khan, Sharon Perry, Yong Pei, Dominic Thomas, Monica Nandan. RECAP: reinforced, explainable LLM classifier for behavioral-health analysis in police narratives[J]. Applied Computing and Intelligence, 2025, 5(2): 337-347. doi: 10.3934/aci.2025019
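The abstract describes a two-stage pipeline: supervised instruction fine-tuning on annotated narratives, followed by alignment with direct preference optimization over officer preference pairs. As a minimal sketch of the alignment step, assuming per-response summed token log-probabilities are already available from the trainable policy and a frozen SFT reference model (the function name, variable names, and beta value below are illustrative and not taken from the paper), the DPO objective of Rafailov et al. can be computed as:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct preference optimization loss over a batch of preference pairs.

    Each tensor holds the summed token log-probability of the preferred
    ("chosen") or dispreferred ("rejected") response under the trainable
    policy or the frozen reference (SFT) model. beta controls how far the
    policy may drift from the reference; 0.1 is an illustrative default.
    """
    # Implicit rewards: how much more (in log-odds) the policy favors each
    # response than the reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin pushes the policy toward the
    # officer-preferred response while staying anchored to the SFT model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    lp = lambda: torch.randn(4)
    print(dpo_loss(lp(), lp(), lp(), lp()).item())
```

In practice, the models described here would likely be trained with a library-level DPO trainer rather than a hand-rolled loss; the sketch only makes the preference-margin computation explicit.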