Accurate prediction of winter wheat yield is essential for precision agricultural management. Traditional methods, which rely primarily on vegetation indices, often fall short of high prediction accuracy. In this study, we propose an approach that integrates multi-source data, including vegetation, geographical, and soil features, with 61 features extracted across growth stages. Using these features, an XGBoost regression model was applied to predict winter wheat yield in Henan Province, China. The model was trained on data from 2016 to 2020 and validated on the 2021 dataset. It achieved a mean absolute error (MAE) of 371.36 kg/hm², a root mean square error (RMSE) of 516.97 kg/hm², and a coefficient of determination (R²) of 0.85, significantly outperforming other models such as random forest (RF), gradient boosting decision tree (GBDT), and Lasso. Notably, the XGBoost model reached high accuracy 2–3 growth stages before harvest, enabling earlier yield prediction than traditional methods. To enhance interpretability, SHAP (SHapley Additive exPlanations) was employed to quantify the influence of each input variable on yield and to determine the direction of that influence. This study underscores the potential of multi-source data and advanced machine learning techniques for accurate and interpretable winter wheat yield prediction, providing a valuable solution for precision agriculture.
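The evaluation protocol described above (fit on the 2016 to 2020 seasons, validate on 2021, report MAE, RMSE, and R²) can be illustrated with a minimal sketch. This is not the authors' code: the record layout, the function names, and the yield values are hypothetical, and a training-mean baseline stands in for the XGBoost regressor so that the example runs with the standard library alone.

```python
import math


def split_by_year(records, train_years, valid_year):
    """Temporal hold-out: fit on earlier seasons, validate on a later one."""
    train = [r for r in records if r["year"] in train_years]
    valid = [r for r in records if r["year"] == valid_year]
    return train, valid


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (here kg/hm²)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error; penalises large misses more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical records: the yields (kg/hm²) are made-up toy values,
# not data from the study.
records = [
    {"year": 2016, "yield": 5800.0},
    {"year": 2017, "yield": 6000.0},
    {"year": 2018, "yield": 6100.0},
    {"year": 2019, "yield": 6200.0},
    {"year": 2020, "yield": 6400.0},
    {"year": 2021, "yield": 6300.0},
    {"year": 2021, "yield": 6500.0},
]

train, valid = split_by_year(records, range(2016, 2021), 2021)

# Placeholder model (an assumption for illustration): predict the training mean.
# In the study this role is played by the trained XGBoost regressor.
baseline = sum(r["yield"] for r in train) / len(train)
y_true = [r["yield"] for r in valid]
y_pred = [baseline] * len(y_true)

print(f"MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}  "
      f"R2={r2(y_true, y_pred):.2f}")
```

Splitting by year rather than at random keeps every validation sample later in time than the training samples, which matches how a yield forecast would be used in practice. A negative R² for the baseline simply means it predicts worse than the mean of the validation targets.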
Citation: Hua Li, Jingyi Gao, Yadan Guo, Xianzhi George Yuan. Application of XGBoost model and multi-source data for winter wheat yield prediction in Henan Province of China[J]. Big Data and Information Analytics, 2025, 9: 29-47. doi: 10.3934/bdia.2025002