This paper introduces a multi-agent framework for comprehensive highway scene understanding, designed around a mixture-of-experts strategy. In this framework, a large generic vision-language model (VLM), such as GPT-4o, is contextualized with domain knowledge to generate task-specific chain-of-thought prompts. These fine-grained prompts then guide a smaller, efficient VLM in reasoning over short videos, together with complementary modalities where available. The framework simultaneously addresses multiple critical perception tasks, including weather classification, pavement wetness assessment, and traffic congestion detection, achieving robust multi-task reasoning while balancing accuracy and computational efficiency. To support empirical validation, we curated three specialized datasets aligned with these tasks. Notably, the pavement wetness dataset is multimodal, combining video streams with road weather sensor data and highlighting the benefits of multimodal reasoning. Experimental results consistently demonstrate strong performance across diverse traffic and environmental conditions. From a deployment perspective, the framework can be readily integrated with existing traffic camera systems and strategically applied to high-risk rural locations, such as sharp curves, flood-prone lowlands, and icy bridges. By continuously monitoring these targeted sites, the system enhances situational awareness and delivers timely alerts, even in resource-constrained environments.
Citation: Yunxiang Yang, Ningning Xu, Jidong J. Yang. Multi-agent visual-language reasoning for comprehensive highway scene understanding[J]. Applied Computing and Intelligence, 2025, 5(2): 315-336. doi: 10.3934/aci.2025018
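The abstract describes a two-stage pipeline: a large teacher VLM turns domain knowledge into task-specific chain-of-thought prompts, and a smaller student VLM applies those prompts to short video clips, optionally augmented with road weather sensor readings, across three perception tasks. The snippet below is a minimal sketch of that flow, not the authors' implementation: the `large_vlm` and `small_vlm` callables, the `SceneInputs` container, and the task names are placeholders assumed for illustration; real VLM clients (e.g., GPT-4o for prompt generation and a smaller model such as Qwen2.5-VL for inference) would be plugged in where indicated.

```python
"""Minimal sketch (not the authors' code) of the two-stage mixture-of-experts
pipeline described in the abstract: a large "teacher" VLM converts domain
knowledge into a task-specific chain-of-thought prompt, which then guides a
smaller "student" VLM reasoning over short video clips plus optional sensor data.
All model calls are stand-ins for real VLM clients."""

from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A "VLM" here is any callable mapping (text prompt, media items) -> text answer.
VLMCall = Callable[[str, Sequence[bytes]], str]

# The three perception tasks named in the abstract.
TASKS = ("weather_classification", "pavement_wetness", "traffic_congestion")


@dataclass
class SceneInputs:
    video_frames: Sequence[bytes]            # frames sampled from a short clip
    sensor_readings: Optional[dict] = None   # e.g., road weather sensor values


def build_cot_prompt(large_vlm: VLMCall, task: str, domain_notes: str) -> str:
    """Stage 1: the large, generic VLM is contextualized with domain knowledge
    and asked to write a fine-grained chain-of-thought prompt for one task."""
    meta_prompt = (
        f"You are a highway-operations expert. Using the notes below, write a "
        f"step-by-step reasoning prompt that a smaller vision-language model can "
        f"follow to perform {task} from roadside camera video.\n\n"
        f"Domain notes:\n{domain_notes}"
    )
    return large_vlm(meta_prompt, [])


def run_task(small_vlm: VLMCall, cot_prompt: str, scene: SceneInputs) -> str:
    """Stage 2: the smaller VLM follows the generated prompt over the video
    frames, appending sensor readings when available (multimodal reasoning)."""
    prompt = cot_prompt
    if scene.sensor_readings:
        prompt += f"\n\nRoad weather sensor readings: {scene.sensor_readings}"
    return small_vlm(prompt, scene.video_frames)


def analyze_scene(large_vlm: VLMCall, small_vlm: VLMCall,
                  scene: SceneInputs, domain_notes: str) -> dict:
    """Run all three perception tasks over the same scene."""
    return {
        task: run_task(small_vlm,
                       build_cot_prompt(large_vlm, task, domain_notes),
                       scene)
        for task in TASKS
    }
```

In such a design, the expensive large-VLM call can presumably be made once per task and cached, so only the lightweight VLM sits on the per-video path, which is consistent with the accuracy-versus-efficiency balance the abstract claims.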