Crash diagrams are essential tools in transportation safety analysis, yet their manual preparation remains time-consuming and prone to human variability. This study investigates the use of Vision-Language Models (VLMs) to automate crash diagram generation from police crash reports, focusing on multilane roundabouts as a challenging test case. A three-part structured prompt framework was developed to guide model reasoning through interpretation, extraction, and visual synthesis, while a 10-metric evaluation system was designed to assess diagram quality in terms of semantic accuracy, spatial fidelity, and visual clarity. Three popular models, GPT-4o, Gemini-1.5-Flash, and Janus-4o, were tested on 79 crash reports. GPT-4o achieved the highest average performance (6.29 out of 10), followed by Gemini-1.5-Flash (5.28) and Janus-4o (3.64). The analysis revealed GPT-4o's superior spatial reasoning and stronger alignment between extracted and visualized crash data. These results highlight both the promise and the current limitations of VLMs in engineering visualization tasks. The study lays the groundwork for integrating generative AI into crash analysis workflows to improve efficiency, consistency, and interpretability.
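To make the pipeline concrete, the three-part prompt framework and 10-metric averaging described above can be sketched as follows. This is an illustrative sketch only: the paper's actual prompt wording, metric names, and model API calls are not reproduced here, so every identifier below (`CrashReport`, `build_structured_prompt`, `average_score`, and the field names) is an assumption for demonstration.

```python
# Hedged sketch of a three-part prompt (interpretation -> extraction ->
# visual synthesis) and a simple average over a 10-metric rubric.
# All names and prompt wording are illustrative assumptions, not the
# study's actual artifacts.
from dataclasses import dataclass


@dataclass
class CrashReport:
    narrative: str      # officer's free-text crash description
    diagram_notes: str  # textual notes accompanying the report, if any


def build_structured_prompt(report: CrashReport) -> str:
    """Compose the three reasoning stages into one structured prompt."""
    return "\n\n".join([
        "Step 1 (Interpretation): Read the crash narrative and summarize "
        "the sequence of events at the multilane roundabout.",
        "Step 2 (Extraction): List each vehicle, its approach leg, lane, "
        "travel direction, maneuver, and point of impact as structured fields.",
        "Step 3 (Visual synthesis): Using the extracted fields, draw a "
        "top-down crash diagram of the roundabout with labeled vehicles "
        "and the impact location.",
        f"Crash narrative:\n{report.narrative}",
        f"Additional notes:\n{report.diagram_notes}",
    ])


def average_score(metric_scores: dict) -> float:
    """Average a rubric (e.g. 10 metrics, each scored 0-10) into one value."""
    return sum(metric_scores.values()) / len(metric_scores)


if __name__ == "__main__":
    report = CrashReport(
        narrative="Vehicle 1 entered the roundabout from the north leg...",
        diagram_notes="Impact near the east exit, inner lane.",
    )
    prompt = build_structured_prompt(report)
    print(prompt.splitlines()[0])
    # Two of the ten rubric dimensions shown, purely for illustration.
    print(average_score({"semantic_accuracy": 7, "spatial_fidelity": 6}))
```

In practice the composed prompt would be sent to each VLM (GPT-4o, Gemini-1.5-Flash, Janus-4o) alongside the scanned report, and the rubric average would be computed per generated diagram; those integration details depend on each provider's API and are omitted here.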
Citation: Xiao Lu, Hao Zhen, Jidong J. Yang. Automating crash diagram generation using vision-language models: a case study on multilane roundabouts[J]. Applied Computing and Intelligence, 2026, 6(1): 38-57. doi: 10.3934/aci.2026003