1.
Introduction
Ensuring grain security is a critical strategic issue for economic development and social stability, and it serves as a cornerstone of national security. For China, a country with a large population, the level of agricultural development and grain production capacity directly impact the national economy and people's livelihoods [1]. Consequently, accurate monitoring and prediction of grain yields are essential for formulating national and regional socio-economic development plans. They not only aid in the development of agricultural import and export strategies but also ensure national grain security and provide a vital basis for guiding and regulating the macro-level structure of the planting industry [2].
Traditional grain yield prediction methods typically rely on field surveys conducted during the growing season or empirical knowledge of grain growth conditions. However, these approaches are often limited by small sample sizes and constraints in human resources, leading to sampling and non-sampling errors that can compromise the reliability of the results. In contrast, satellite remote sensing, with its extensive coverage and short revisit cycles, has emerged as a critical technological tool for large-scale grain yield prediction. Vegetation indices (VIs) derived from satellite data are the most commonly used means for predicting grain yields, with hundreds of VIs currently available. These VIs can be categorized into three types based on their calculation formulas and functional requirements: Simple VIs, modified VIs, and functional VIs. Here, we list different VIs and their related literature in Table 1, showing that the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI), based on visible and near-infrared (NIR) bands, are the most widely used. However, studies have shown that other indices may outperform these traditional VIs in specific contexts. For example, Zhang et al. [3] found that the green chlorophyll vegetation index (GCVI) outperformed other VIs, although this result was not validated in this study. Bolton et al. [4] demonstrated that the enhanced vegetation index (EVI2) is the best indicator for predicting maize yield in non-semi-arid regions. Qu et al. [5] pointed out that the green normalized difference vegetation index (GNDVI) is highly correlated with nitrogen and is suitable for characterizing canopy biomass during the heading and tasseling stages of crops. Additionally, Qiao et al. [6] showed that combined indices such as the modified chlorophyll absorption ratio index (MCARI) and the optimized soil-adjusted vegetation index (OSAVI) have strong capabilities for estimating chlorophyll content.
VIs offer valuable insights into vegetation conditions. However, they alone cannot fully capture the diverse environmental stresses crops experience during growth and development. To address this limitation, recent studies have increasingly incorporated environmental variables that consider abiotic factors to enhance the accuracy of crop yield predictions. These variables, including temperature, precipitation, and soil data, have shown significant effectiveness in some cases [4]. Specifically, Zhou et al. [7] used the following environmental variables: maximum temperature (tmx), minimum temperature (tmn), daily precipitation (pre), reference evapotranspiration (pet), vapor pressure (vap), and vapor pressure deficit (vpd). Lang et al. [8] also employed the following environmental variables: the palmer drought severity index (pdsi), precipitation accumulation (pr), and soil moisture (soil). However, traditional air temperature (Tair) primarily reflects external controls, and rainfall alone is insufficient to represent soil moisture accurately [9]. To overcome these challenges, we introduce a more comprehensive set of environmental factors, detailed in Table 2. The data are sourced from the ERA5* global reanalysis climate dataset released by the European Centre for Medium-Range Weather Forecasts (ECMWF). Notably, the feasibility of daytime land surface temperature (LST_Day_1km) and nighttime land surface temperature (LST_Night_1km) has been validated in previous studies [9,10].
*ERA5 is the fifth-generation ECMWF global climate atmospheric reanalysis data system, produced by C3S, covering data from January 1940 onwards. ERA5 provides hourly estimates of a large number of atmospheric, land, and oceanic climate variables, covering the Earth on a 30 km grid, with 137 levels from the surface up to 80 km in altitude.
Geographical location is a critical factor influencing crop yield, necessitating the incorporation of geographical features into yield prediction models. In this study, we introduce administrative division codes of counties and cities as one-dimensional geographical features, with "NAME" used to identify the geographical locations. Integrating such data enables a more in-depth analysis and understanding of the dynamic variations in crop yield.
The formation of crop yield is a complex and prolonged process, influenced by numerous factors that may vary in their impact across growth stages. For example, Jiang et al. [23] demonstrated that the transition from the vegetative to the reproductive development stage is most critical for maize yield prediction. Similarly, Ren et al. [16] found that the key features during the vegetative and reproductive growth stages of maize are growth status features and water-related features, respectively. These studies highlight the importance of analyzing various growth stages and their corresponding features for accurate yield prediction. However, in-depth research on the features of different growth stages remains relatively limited, and our comprehensive understanding of the mechanisms underlying crop yield formation requires further enhancement. To better comprehend the dynamic variations in crop yield, it is essential to conduct thorough investigations into the significance of various features during different growth stages for yield prediction.
Traditional crop yield prediction methods primarily rely on crop growth models and statistical regression techniques. However, crop growth models demand extensive input data, including climate, variety, management, and soil conditions, which can be challenging to obtain and process. Moreover, empirical regression models are often limited by their locality and insufficient spatial generalization capabilities. To overcome these limitations, machine learning (ML) algorithms provide a more flexible and efficient alternative. Studies (see Table 3) have highlighted the significant potential of ML in crop yield prediction. For instance, Shahhosseini et al. [25] evaluated several ML models, including LASSO regression, RF, and XGBoost, and demonstrated that XGBoost is the most accurate model for yield prediction. The XGBoost model in Chen et al. [26] further showed that the XGBoost model exhibits strong generalization capabilities, allowing it to adapt to various types of data and complex relationships. Additionally, Li et al. [27] highlighted that XGBoost has clear advantages in feature selection and classification performance compared to logistic regression and other tree-based models.
While researchers [28] have achieved significant success in winter wheat yield prediction using deep learning, we focus on constructing an XGBoost regression model to explore the precise prediction of winter wheat yield in Henan Province. Notably, although the researchers in [29] also employed the XGBoost model for winter wheat yield prediction, we enhance feature selection by integrating multiple feature sources, including VIs, geographical factors, and soil features. In this study, we construct a combination model with multiple feature variable sets to achieve regression prediction of winter wheat yield in Henan Province and compare it with other ML models. Furthermore, we adopt the shapley additive explanations (SHAP) explainable framework [30] to conduct an in-depth analysis of the selected features. Our core objectives of this research are to address the following four objectives:
(1) Evaluate the predictive performance of the model by combining wheat growth stages, VIs, and newly introduced feature variables;
(2) Analyze the prediction results of winter wheat across growth stages to identify the stage with the highest prediction accuracy;
(3) Compare the performance of the XGBoost model with other ML models in winter wheat yield prediction;
(4) Assess the importance of selected features across growth stages to gain a deeper understanding of their contribution to yield prediction.
The remainder of this paper is organized as follows: In section 2, we introduce the research area and data, providing a detailed description of the specific region and data types used, along with preliminary analysis and organization of the collected data to ensure accuracy and reliability. In section 3, we present the framework of the prediction model, summarizing its core components. In section 4, we discuss the results and analysis, comprehensively validating the model's effectiveness through feature verification, prediction result presentation, multi-model comparison, and model interpretation. Finally, In section 5, we provide the major conclusions, summarizing the primary outcomes of the study and offer recommendations based on the research results.
2.
Research area and research data
2.1. Research area
The study area is in Henan Province, which encompasses 102 counties (Figure 1). Located in the middle and lower reaches of the Yellow River, Henan Province spans geographical coordinates from 31∘23′ to 36∘22′ north latitude and 110∘21′ to 116∘39′ east longitude. Predominantly within the warm temperate zone, the province extends into the subtropical zone in the south, characterized by a continental monsoon climate that transitions from the northern subtropical to the warm temperate zone. The region also features a climatic gradient from plains in the east to hilly and mountainous areas in the west, with distinct seasons and concurrent rainfall and heat. Over the past decade, Henan Province has experienced an average annual temperature ranging from 15.1∘C to 15.9∘C, with average annual precipitation of 512.6 mm to 1129.1 mm and average annual sunshine duration of 1774.5 to 2024.1 hours. These climatic conditions, combined with fertile soil, make Henan one of China's most important grain production areas. Within this region, wheat accounts for approximately one-quarter of the national planting area and yield, significantly contributing to China's overall grain production.
2.2. Research data
2.2.1. Data source
Here, we leverage remote sensing data from Google Earth Engine (GEE), specifically two MODIS data products from the TERRA satellite: MYD11A2 and MYD15A2H. The MYD11A2 data provide detailed information on land surface temperature, including daytime and nighttime temperatures. The MYD15A2H dataset is utilized to obtain ecological parameters such as the leaf area index. Additionally, various spectral reflectance bands of the TERRA satellite data are combined linearly and nonlinearly to generate VIs for comprehensive vegetation growth monitoring. The ERA5 global reanalysis climate dataset from the ECMWF is also employed, encompassing major meteorological parameters like soil moisture and temperature at different depths. Furthermore, data from the Global Land Data Assimilation System (GLDAS) of the National Aeronautics and Space Administration (NASA) serve as supplementary data. The wheat yield data for Henan Province are sourced from the Henan Statistical Yearbook, which compiles the wheat planting area and yield across 102 counties and cities.
2.2.2. Data processing
To ensure data accuracy, we initially employ the boxplot method to identify and handle outliers in the feature data. Subsequently, the data are integrated according to the wheat growth stages, and growth stage images are plotted to facilitate preliminary feature selection through visualization. To further refine the feature selection, the correlation between 32 features and yield is calculated across different growth stages, and a correlation heatmap is generated (Figure 2). Features with an absolute correlation value greater than 0.6 in each growth stage are selected to ensure their significant impact on wheat yield. Additionally, domain knowledge and feature importance evaluation are combined to further select features that have a substantial influence on wheat yield from the initially screened set. The detailed process of feature selection is illustrated in Figure 3. Ultimately, features corresponding to 8 growth stages are selected, along with the geographical feature "NAME", resulting in a total of 61 features across growth stages. Table 4 presents the selected growth stages and their corresponding features, while Table 5 provides the division information of the growth stages.
The data and code supporting this paper are available in https://gitee.com/zzufinlab/bookcode/blob/master/Prediction_of_Wheat_Yield.
3.
Establishment of the prediction module framework.
3.1. Establishment of the XGBoost winter wheat yield prediction model
XGBoost, an efficient gradient boosting framework, is based on a C++ implementation of the Gradient Boosting Machine and can handle regression and classification problems. It leverages the multi-threading capabilities of the CPU to perform parallel computations, significantly enhancing the efficiency of model training. The core mechanism involves integrating multiple weak learners (decision trees) by selecting partial sample features to generate weak learners and continuously fitting the residuals of previous models to minimize the objective function. Through iterative processes, a robust predictive model is constructed.
In the process of building an XGBoost model for yield prediction, the selected feature dataset is divided into two parts based on time. Data from 2016 to 2020 are used as the training set, while data from 2021 serve as the test set. The training set comprises 510 records of 62-dimensional data extracted from 2016 to 2020, with the 61 selected features serving as independent variable x and yield data as the dependent variable y. The model training process is described as follows:
● Initialization: A simple decision tree is initialized as the base model to predict the approximate values of the initial target;
● Residual Calculation: The residuals between the predicted values of the current model and the actual target values are calculated;
● Weight Calculation: The weights of each sample are calculated based on these residuals, and these weights are used to train the next decision tree;
● Prediction Accumulation: In each iteration, the predictions of the newly trained decision tree are weighted and accumulated with the predictions of all previous models to obtain new predicted values;
● Iteration Termination: The iterative process terminated when the preset maximum number of trees is reached or when the model performance no longer improved significantly.
The formula for each tree can be expressed as follows:
where F represents the function space containing all trees, K is the total number of trees (or boosting iterations) up to the current step t, fk(xi) denotes the weight of the i-th sample in the leaf node of the k-th tree, and ˆy(t)i represents the predicted value of the i-th sample at the t iteration.
To simplify the objective function of the model and effectively prevent overfitting, a regularization term is introduced, thereby improving the model's performance when handling large-scale and high-dimensional data.
After model training is complete, the trained model is applied to the test dataset for evaluation to verify its performance and generalization capability. To comprehensively evaluate the effectiveness and accuracy of the XGBoost prediction model, we adopte the root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R2)as evaluation metrics. These metrics reflect the model's predictive performance from different perspectives, including the magnitude of errors, the proportion of deviation between predicted and actual values, and the model's ability to explain data variability.
where yi is the true value of the i sample, ˆyi is the predicted value of the i sample, ¯y is the sample mean, and n is the number of samples.
3.2. Overview of key points in model explanation
SHAP is a powerful explainable framework proposed by Lundberg et al. [30] for interpreting the predictions of ML models. It provides an intuitive and rigorous way to understand how a model makes predictions based on input features by calculating the contribution value of each feature to the prediction outcome. These contribution values can be positive or negative, indicating the direction of each feature's influence on the prediction. A positive value signifies that a feature positively impacts the prediction, while a negative value indicates a negative influence. Moreover, the greater the absolute value of a feature's contribution, the more important the feature is in the model.
In this study, leveraging the SHAP framework, we calculate the SHAP values for each feature to generate local explanations for each prediction. These SHAP values quantify the specific contribution of each feature to the model's predictions, thereby providing a measure of feature importance across different growth stages. By analyzing these values, we can gain insights into how the model utilizes input features to make predictions and identify the most influential features at each growth stage. This analysis not only enhances the interpretability of the XGBoost model but also addresses the "black box" problem often associated with machine learning models in yield prediction.
4.
Results and analysis
In this study, data from 2016 to 2020 are selected as the training samples to construct the XGBoost model, while data from 2021 are used as the test samples. The prediction model is applied to obtain the predicted yields of winter wheat for 102 counties in Henan Province in 2021, and the prediction results for winter wheat across growth stages are analyzed. The results are then compared with those of RF, GBDT, and Lasso. Subsequently, based on the SHAP explainable framework, the SHAP values for each feature were calculated to generate a local explanation for each prediction.
4.1. Overview of feature validation
The yield data for winter wheat in each county and city are calculated by aggregating the wheat planting area and yield data. Missing values are imputed with the average yield of the respective county or city. Table 6 presents the winter wheat yields for selected regions in Henan Province. This approach ensures the completeness and accuracy of the yield data, providing a reliable basis for the subsequent analysis and modeling.
The initially selected 32 features are validated through a comparative analysis of data from high-yield and low-yield areas in some regions of Henan Province in 2021 (as shown in Figure 4). Taking DVI, EVI, CanopInt_inst, LST_Night_1km, SoilMoi10_40cm_inst, and soil_temperature_level_3 as examples, the analysis results indicate that during the overwintering to grain-filling stages of wheat, the DVI and EVI values in high-yield areas are significantly higher than those in low-yield areas. Additionally, while the differences in LST_Night_1km, soil_temperature_level_3, and SoilMoi10_40cm_inst between high-yield and low-yield areas are not significant, these features exhibit clear periodic patterns. In contrast, CanopInt_inst shows no distinct differentiation between high-yield and low-yield areas, and its variation pattern is not evident. Therefore, CanopInt_inst is excluded from subsequent analyses.
4.2. Presentation and discussion of prediction results
Applying the XGBoost model for yield prediction reveals that models incorporating geographical and soil variables, as well as combining these variables with NAME, consistently outperform models using only VIs. Specifically, the model that integrates VIs, soil variables, and NAME demonstrates the best predictive performance. As shown in Table 7, compared to the model using only VIs, the model incorporating VIs and soil variables achieves a higher R2, indicating enhanced explanatory power. Additionally, the RMSE and MAE decrease, signifying improved prediction accuracy. Furthermore, the model combining VIs, soil variables, and NAME exhibits even better predictive performance than the model that includes only VIs and soil variables. These results highlight the significant role of soil variables and NAME in enhancing the predictive performance of the model.
4.3. Prediction of winter wheat across growth stages
In this section, we conduct yield prediction for cumulative growth stages of the winter wheat (tillering stage, tillering to elongation stage, tillering to booting stage, tillering to heading stage, tillering to flowering stage, tillering to grain filling stage, tillering to ripening stage, and tillering to harvesting stage) to evaluate the optimal time window for the XGBoost model in predicting winter wheat yield. The variable NAME is excluded during yield estimation, and the results are shown in Figure 5. However, even when including this variable, the trends remain similar.
The XGBoost yield prediction model performs poorly in the early stages of winter wheat growth, specifically exhibiting higher RMSE and MAE values and a lower R2. This phenomenon may be attributed to the model's inability to capture sufficient crop growth, geographical, and soil information during the initial stages. As time progresses, the model gradually integrates more information, leading to improved performance. From the elongation stage to the booting stage, the RMSE and MAE decrease compared to the tillering stage, while R2 increases, indicating a significant improvement in prediction accuracy. From the booting stage to the grain filling stage, the evaluation metrics exhibit minor fluctuations but remain relatively stable. After the grain filling stage, these metrics stabilize. Throughout the growth stages, the grain filling stage achieves the lowest MAE of 527.40 kg/hm2, with an RMSE of 670.50 kg/hm2. In contrast, the heading stage achieves the lowest RMSE of 662.29 kg/hm2, a 1.2% reduction compared to the grain filling stage, while the MAE increases by 3% to 547.74 kg/hm2. These results indicate that the XGBoost model provides high accuracy 2–3 growth stages before harvest, enabling earlier yield prediction than traditional methods. Therefore, the conclusions of this study align with the findings in the literature [28], and the prediction performance of this study has further improved compared to [28].
4.4. Comparison and discussion of data results from different models
To further evaluate the performance of the XGBoost model in predicting winter wheat yield, we compare it with several other machine learning models, including RF, GBDT, and Lasso. Figure 6 shows scatter plots of the predicted versus actual yield values for counties in Henan Province using these four prediction models, while Table 8 provides the evaluation metrics for the different models. The XGBoost model, with an MAE of 371.36 kg/hm2, RMSE of 516.97 kg/hm2, and R2 of 0.85, significantly outperforms the GBDT, RF, and Lasso models. This conclusion is further supported by Figure 6. Compared to the researchers in [29], we not only use VIs but also incorporate geographical and soil features, resulting in a higher R2 than [29]. Although the RMSE is slightly higher than in [29], the difference is not significant, with only a 3% increase. This result indicates that, despite the slight increase in RMSE, the overall performance of the model has improved, particularly in terms of explaining the relationship between variables and yield.
4.5. Model interpretation and findings
1) Analysis of individual samples.
Each sample in the test set has a corresponding explanation result, where the SHAP values represent the impact of each feature on the predicted yield for that sample. Here, we take the 10th sample in the test set as an example to analyze how the model generates the final prediction result. For the XGBoost model in this study, the SHAP explainable framework provides a baseline value of 6152.88, meaning that in the absence of any information, the model's predicted yield is 6152.88. The impact of each of the 61 features on the prediction result is then analyzed individually. The SHAP value of the feature variable NAME is 139.44, indicating that it increases the predicted yield from 6152.88 to 6292.32. This suggests that the feature has a positive effect on the yield prediction, increasing the predicted value. The SHAP value of the feature variable AvgSurfT_inst during the ripening stage is 282.70, indicating that it increases the predicted yield from 6292.32 to 6575.02. This suggests that the feature has a positive effect on the yield prediction, further increasing the predicted value. The SHAP value of the feature variable LST_Day_1km during the heading stage is −29.28, indicating that it decreases the predicted yield from 6575.02 to 6545.74. This suggests that the feature has a negative effect on the yield prediction, reducing the predicted value.
By analogy, the impact of the remaining 58 features on the yield prediction can be analyzed. Under the combined influence of all 61 features, the SHAP predicted value for this sample increases from the baseline value of 6152.88 to 7029.09. The SHAP explainable framework clearly demonstrates the role each feature plays in this prediction process. The results are visualized using the SHAP explainable framework, as shown in Figure 7.
2) Analysis and explanation of the overall samples.
To gain a clearer understanding of the impact of features on the prediction results, we present the explanation results and feature importance for all samples, as shown in Figure 8. In the left plot, each point corresponds to the SHAP value of a sample, with the color gradient indicating the magnitude of the feature value: red represents higher feature values, while blue represents lower feature values. The SHAP value of 0 serves as a dividing line, where points to the left indicate a negative impact on the final prediction result and points to the right indicate a positive impact, with the distance from the line reflecting the magnitude of the effect. In the right plot, the x-axis represents the mean of the absolute SHAP values for each feature across all explained samples, with larger values indicating a greater average influence on the model's output.
By combining the insights from both figures, it can be observed that the following features have a significant impact on the model's prediction results: (1) WDVI during the heading stage, (2) the geographical feature NAME, (3) AvgSurfT_inst during the ripening stage, (4) WDVI during the flowering stage, and (5) SoilMoi0_10cm_inst during the heading stage.
These findings underscore the importance of integrating multisource-data, including vegetation indices, geographical, and soil features, in enhancing the accuracy and interpretability of winter wheat yield prediction. The SHAP explainable framework not only quantifies the contribution of each feature to the model's predictions but also identifies the most influential features across different growth stages, thereby addressing the "black box" problem often associated with machine learning models. This comprehensive analysis provides valuable insights for precision agricultural management, enabling more informed decision-making and contributing to enhanced food security and sustainable agricultural development in Henan Province.
5.
Conclusions
Based on data from 2016 to 2020, we establish a prediction model for winter wheat yield in Henan Province in 2021 using the XGBoost model and conducts explainability analysis and feature impact assessment using the SHAP method. The following key findings and conclusions are drawn:
(1) Feature selection and model performance: Through feature selection, we identify 8 growth stages and their corresponding features, along with the NAME, resulting in a total of 61 features included in the model. Compared to models using only VIs or VIs combined with soil variables, the model incorporating all newly introduced features performs slightly better. This highlights the significant role of soil variables and NAME in enhancing the model's predictive performance.
(2) Yield prediction across growth stages: By predicting yield across cumulative growth stages, we found that the proposed yield prediction model can accurately predict winter wheat yield during the mid-to-late growth stages, with the grain filling stage showing the best performance. Notably, the XGBoost model provides high accuracy 2–3 growth stages before harvest, enabling earlier yield prediction than traditional methods.
(3) Model evaluation metrics: The evaluation metrics of the XGBoost model indicate an MAE of 371.36 kg/hm2, an RMSE of 516.97 kg/hm2, and an R2 of 0.85. When comparing the predictive performance of different models, the XGBoost model slightly outperforms other models in predicting winter wheat yield in Henan Province.
(4) SHAP analysis for model explainability: Using SHAP to explain the model, the analysis of individual samples reveals the impact of each feature on the prediction results. Additionally, the analysis of the overall samples demonstrates that the following features have a significant influence on the model's predictions: 1) WDVI during the heading stage, 2) the geographical feature NAME, 3) AvgSurfT_inst during the ripening stage, 4) WDVI during the flowering stage, and 5) SoilMoi0_10cm_inst during the heading stage.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (71971031), and General Program of National Social Science Foundation of China (18BJT021).