Abbreviations: ML: Machine Learning; RF: Random Forest; RNN: Recurrent Neural Networks; GIS: Geographic Information Systems; LSTM: Long Short-Term Memory
1. Introduction
Solar radiation is the energy source of the planet, and knowledge of its magnitude in any given spatiotemporal frame is necessary for a multitude of applications, from power generation [1] to irrigation planning [2]. Focusing on its agricultural applications, and with the advent of smart agriculture, weather stations connected to the Internet are often used to supply these important measurements directly to a monitoring team tens or hundreds of kilometers away. However, an uncommon but persistent issue of operating a meteorological station at a remote location is the loss of data due to malfunctioning sensors. Because of the difficulty of coordination and access, it can take days for a maintenance team to fix the problem after it is first detected, wasting precious data and poking holes in otherwise complete datasets. This issue requires an accurate yet easy-to-implement solution, as complex methodologies may be unappealing or wasteful for tackling such a mundane problem.
A similar problem is the absence of solar radiation measurements in general meteorological observation stations, as observed in studies like those of Zang et al. [3,4], Ağbulut et al. [5], and Soulis et al. [6]. The absence of solar radiation measurements is even more pronounced in the case of historical data [6]. In all these cases, it was necessary in some capacity to estimate solar radiation from the more readily available data for other meteorological parameters, such as air temperature and air relative humidity, either through empirical functions as proposed by Hargreaves and Samani [7] or Meza and Yebra [8], or, more recently, through Machine Learning (ML) algorithms [3,5].
Table 1 offers a review of previous studies presenting robust methodologies for predicting daily average global solar radiation from other meteorological variables. These studies incorporate ML methods, either as individual models [9,10,11] or as hybrid ML-empirical models [4], reporting promising results. Some researchers [5,12,13] have even compared different ML methods to discern the best one for tackling this problem. However, literature comparing the performance of traditional empirical methods and the relatively new ML methods, while taking into account the availability of the required variables, the methods' simplicity, and the computational requirements in this setting, seems scarce. Such an evaluation is a worthwhile endeavor, since ML methods are inherently less accessible than the well-established and extensively calibrated traditional methods, with many researchers opting for the latter over the former for the sake of convenience.
We aim to compare the accuracy of different methods for estimating daily global solar radiation from other routine meteorological variables for the purpose of filling in missing measurements, while considering the differences in the availability of measured parameters at any given station, simplicity, and computational requirements. Considering the focus of this analysis on simplicity and speed, the methods selected for this comparison were the well-established empirical equation proposed by Hargreaves and Samani [7] and the modified version of the same equation proposed by Valiantzas [14]. These traditional methods were compared with various implementations of two ML models: the Random Forest, selected for its relatively simple implementation without sacrificing accuracy [15], and Recurrent Neural Networks, selected for their specialization in handling sequential data [16].
2. Materials and methods
2.1. Study area, data sourcing and processing
The data used in this paper came from the meteorological station network operated by the GIS Research Unit of the Agricultural University of Athens in the area of Nemea, comprising ten (10) stations (Figure 1) that record precipitation, temperature, wind speed, relative humidity, and solar radiation on a 15-minute time step. The data were drawn from the period 1/10/2019 to 27/12/2023.
These data were then used to calculate daily sums for precipitation and daily average, maximum, and minimum values for temperature, relative humidity, wind speed, and solar radiation. This process was completed using a semi-automatic algorithm developed in MS Excel VBA to reduce processing time. Time steps with missing or evidently erroneous measurement values, identified by simple criteria based on the acceptable value range for each parameter, were omitted entirely.
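For illustration, the aggregation and screening step could look as follows in R (the original implementation was in MS Excel VBA); the data frame `raw`, its column names, and the plausibility thresholds are assumptions made for this sketch:

```r
library(dplyr)

# Hypothetical sketch of the daily aggregation; `raw` holds 15-minute records
# with assumed columns: timestamp (POSIXct), precip, temp, rh, wind, solar.
daily <- raw %>%
  # omit time steps with missing or evidently erroneous values
  filter(!is.na(precip), !is.na(temp), !is.na(rh), !is.na(wind), !is.na(solar),
         temp > -30, temp < 50,              # assumed acceptable value ranges
         rh >= 0, rh <= 100, solar >= 0) %>%
  mutate(date = as.Date(timestamp)) %>%
  group_by(date) %>%
  summarise(precip_sum = sum(precip),
            temp_avg = mean(temp), temp_max = max(temp), temp_min = min(temp),
            rh_avg = mean(rh), wind_avg = mean(wind),
            solar_avg = mean(solar))
```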
Daily extraterrestrial radiation is a very important variable in estimating average daily solar radiation [7,14], and its values were calculated in MS Excel VBA for each of the stations' locations for all selected dates using the following equation [2]:

$$R_a = \frac{24 \times 60}{\pi} G_{sc} d_r \left[ \omega_s \sin(\varphi) \sin(\delta) + \cos(\varphi) \cos(\delta) \sin(\omega_s) \right] \tag{1}$$

where Ra is the extraterrestrial radiation in MJ m-2 day-1, Gsc is the solar constant (0.0820 MJ m-2 min-1), dr is the inverse relative distance Earth-Sun, ωs is the sunset hour angle in radians, φ is the latitude in radians, and δ is the solar declination angle in radians.
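A compact R equivalent of this calculation (the intermediate terms dr, δ, and ωs follow the standard FAO-56 formulas [2]; the function name and the example latitude are illustrative):

```r
# Daily extraterrestrial radiation (Eq. 1) after FAO-56 [2].
# `date` is a Date vector; `lat_deg` is the station latitude in degrees.
extraterrestrial_radiation <- function(date, lat_deg) {
  Gsc   <- 0.0820                                 # solar constant, MJ m-2 min-1
  J     <- as.numeric(format(date, "%j"))         # day of the year
  phi   <- lat_deg * pi / 180                     # latitude, radians
  dr    <- 1 + 0.033 * cos(2 * pi * J / 365)      # inverse relative Earth-Sun distance
  delta <- 0.409 * sin(2 * pi * J / 365 - 1.39)   # solar declination, radians
  ws    <- acos(-tan(phi) * tan(delta))           # sunset hour angle, radians
  (24 * 60 / pi) * Gsc * dr *                     # Ra in MJ m-2 day-1
    (ws * sin(phi) * sin(delta) + cos(phi) * cos(delta) * sin(ws))
}

# e.g., extraterrestrial_radiation(as.Date("2020-06-21"), 37.8)  # latitude near Nemea (assumed)
```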
All the variables were then investigated for correlation with average solar radiation using IBM SPSS Statistics 25. The correlation coefficients used were Pearson's and Spearman's rank, to capture both linear and non-linear relationships; the results are presented in Table 2. The variables that showed a significantly high correlation with average solar radiation under both coefficients across all ten locations were extraterrestrial radiation, average relative humidity, and maximum, minimum, and average temperature. Average wind speed was highly significant for some stations but gave significance values slightly above 0.05 for others; it was nevertheless decided not to omit it. Furthermore, the same parameters are used in existing empirical methods [7,14]. Consequently, these were the characteristics that could potentially be used for the interpolation of average solar radiation by the ML algorithms used in this paper.
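The screening itself was performed in SPSS; for reference, an equivalent check in R on an aggregated data frame such as the `daily` sketch above (the column names, including `ra` for extraterrestrial radiation, are assumptions) could be:

```r
# Pearson and Spearman correlations of each candidate predictor with
# average solar radiation, together with the associated p-values.
vars <- c("ra", "temp_max", "temp_min", "temp_avg", "rh_avg", "wind_avg")
for (v in vars) {
  p <- cor.test(daily[[v]], daily$solar_avg, method = "pearson")
  s <- cor.test(daily[[v]], daily$solar_avg, method = "spearman")
  cat(sprintf("%-9s Pearson r = %+.3f (p = %.3g)  Spearman rho = %+.3f (p = %.3g)\n",
              v, p$estimate, p$p.value, s$estimate, s$p.value))
}
```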
Another piece of data that had to be derived was the dew point temperature, which was calculated at a daily time step in MS Excel using the formula of Allen et al. (1998) [2]:

$$T_{dew} = \frac{116.91 + 237.3 \ln(e_a)}{16.78 - \ln(e_a)} \tag{2}$$

where ea is the actual vapor pressure in kPa, calculated by the following formula:

$$e_a = \frac{RH}{100} \times 0.6108 \exp\left( \frac{17.27\,T}{T + 237.3} \right) \tag{3}$$

where RH is the relative humidity in % and T is the average daily temperature in ℃.
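Equations 2 and 3 translate directly into a small R helper (a sketch; the original calculation was done in MS Excel):

```r
# Daily dew point temperature (deg C) from average daily temperature (deg C)
# and average relative humidity (%), after Allen et al. (1998) [2].
dew_point <- function(T_avg, RH_avg) {
  e0 <- 0.6108 * exp(17.27 * T_avg / (T_avg + 237.3))  # saturation vapor pressure, kPa
  ea <- (RH_avg / 100) * e0                            # actual vapor pressure, kPa (Eq. 3)
  (116.91 + 237.3 * log(ea)) / (16.78 - log(ea))       # dew point (Eq. 2)
}
```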
The data were then split into two datasets: the data from 1/10/2019 to 30/9/2022 were used for the training of the machine learning models and the calibration of the relevant empirical equation versions, while the data from 1/10/2022 to 27/12/2023 were used for a supplementary calibration of the relevant equation versions and for the validation and comparison of both the equations and the machine learning models. Given ten locations, the training dataset comprised a total of 10,970 data points and the validation and comparison dataset 4,499 data points.
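In R, this split amounts to filtering the pooled ten-station table (here assumed to be `daily_all` with a `date` column) by date:

```r
# Training period: 1/10/2019-30/9/2022; validation period: 1/10/2022-27/12/2023.
train <- subset(daily_all, date >= as.Date("2019-10-01") & date <= as.Date("2022-09-30"))
valid <- subset(daily_all, date >= as.Date("2022-10-01") & date <= as.Date("2023-12-27"))
```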
2.2. Empirical equation methods
The equations traditionally used in the estimation of missing solar radiation values that are examined in this paper are the equation proposed by Hargreaves and Samani (Eq. 4) [7] and the modified version of the same equation proposed by Valiantzas (Eq. 5) [14], which uses dew point temperature in place of minimum temperature:

$$R_s = k_{Rs} \sqrt{T_{max} - T_{min}}\; R_a \tag{4}$$

$$R_s = k_{Rs} \sqrt{T_{max} - T_{dew}}\; R_a \tag{5}$$

where Rs is the solar radiation in W m-2, Ra is the extraterrestrial radiation in W m-2, Tmax is the maximum daily temperature in ℃, Tmin is the minimum daily temperature in ℃, Tdew is the calculated dew point temperature in ℃, and kRs is an empirical solar radiation adjustment coefficient that generally varies from 0.12 to 0.25.
Each equation was examined in three use cases. The first, dubbed "non-calibrated", adopted the conventional kRs value of 0.17 for all stations; the second, dubbed "local", calibrated kRs [17] for each of the ten locations separately using the training dataset; and the third, dubbed "total", calibrated kRs simultaneously on the data of all ten locations. The calibration was done up to the third decimal digit of kRs and was performed by slightly adjusting the value until the corresponding equation's result gave the lowest Root Mean Square Error (RMSE) when compared to the ground truth data of the training dataset. RMSE is calculated as follows [4]:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y_{pi})^2} \tag{6}$$

where yi is the observed value of the i-th time step, ypi is the predicted value of the i-th time step, and n is the number of time steps.
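A minimal sketch of this calibration in R, assuming the `train` data frame from above with columns solar_avg, temp_max, temp_min, t_dew, ra (in units consistent with Eqs. 4-5), and station; the search bounds are an assumption:

```r
rmse <- function(y, yp) sqrt(mean((y - yp)^2))  # Eq. 6

# Grid search for kRs to the third decimal digit, minimizing RMSE on the
# training data; use_tdew switches between Eq. 4 (Tmin) and Eq. 5 (Tdew).
calibrate_krs <- function(data, use_tdew = FALSE) {
  t_low <- if (use_tdew) data$t_dew else data$temp_min
  grid  <- seq(0.100, 0.300, by = 0.001)
  errs  <- sapply(grid, function(k)
    rmse(data$solar_avg, k * sqrt(data$temp_max - t_low) * data$ra))
  grid[which.min(errs)]
}

# "total" use case:  calibrate_krs(train)
# "local" use case:  sapply(split(train, train$station), calibrate_krs)
```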
2.3. Random forest
Out of many possible options, the Random Forest (RF) algorithm was chosen for its robustness in dealing with regression problems and its relative ease of development and use. Using RStudio, three RF models were constructed, each using a diminishing number of input variables. The first iteration, dubbed "RFcomplete", used all the variables found to have a significant correlation with average solar radiation, as explained in section 2.1, namely maximum temperature, minimum temperature, average temperature, average wind speed, average relative humidity, and extraterrestrial solar radiation. The second model, dubbed "RFhalf", did away with average wind speed and average temperature to be comparable to Equation 5 in terms of prerequisite data. The final model, dubbed "RFminimal", used only extraterrestrial solar radiation and maximum and minimum temperature, in direct juxtaposition to Equation 4 and its minimal need for data.
Using the grid search method, models with different combinations of hyperparameters were trained for each iteration in order to discern the best-performing combination in terms of R2 on the independent validation dataset. The coefficient of determination (R2) expresses the proportion of variability explained by the regression model and is calculated by the following formula [18]:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - y_{pi})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{7}$$

where yi is the observed value of the i-th time step, ypi is the predicted value of the i-th time step, and ȳ is the average of all observed values.
The hyperparameters in question were the number of trees, the minimum terminal node size, the sample size for each tree, the depth of the decision trees, and the number of variables available for splitting at each tree node. The hyperparameters chosen for each model are presented in Table 3.
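As an illustration of this step, a grid search for the "RFminimal" feature set using the ranger package could look as follows (the grid values are assumptions standing in for the ranges actually searched, and `train`/`valid` are the datasets defined above):

```r
library(ranger)

# Candidate hyperparameter combinations (illustrative ranges).
grid <- expand.grid(num.trees = c(200, 400, 600, 800, 1000),
                    mtry = 1:3,                    # variables tried at each split
                    min.node.size = c(1, 5, 10),
                    sample.fraction = c(0.6, 0.8, 1.0),
                    max.depth = c(0, 10, 20))      # 0 = unlimited depth in ranger

r2_valid <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- ranger(solar_avg ~ ra + temp_max + temp_min, data = train,
                num.trees = grid$num.trees[i], mtry = grid$mtry[i],
                min.node.size = grid$min.node.size[i],
                sample.fraction = grid$sample.fraction[i],
                max.depth = grid$max.depth[i], seed = 1)
  pred <- predict(fit, data = valid)$predictions
  1 - sum((valid$solar_avg - pred)^2) /
      sum((valid$solar_avg - mean(valid$solar_avg))^2)   # Eq. 7
})

grid[which.max(r2_valid), ]   # best-performing combination
```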
The first two models favored less random trees and less dense forests, while the RFminimal model was most effective with a denser but more random forest. This could potentially be explained by the number and importance of the variables fed to each model, as shown in Figure 2. With a higher number of more important input variables, the average tree in the former models is a more robust predictor than its RFminimal counterpart [19]; thus, fewer trees are needed to obtain an accurate prediction, and a medium degree of randomness is required so as not to compromise that robustness while still allowing for variability [20,21]. Inversely, a more randomized forest with higher density makes sense if each single decision tree is relatively weaker. This interpretation is supported by comparing the importance of each feature for its respective model, as average relative humidity was identified as the second most important feature in both models in which it is used as a predictor.
Additionally, all of the models were most effective when choosing a number of trees in the middle-lower part of the range of available options (200 to 1000 trees). This is an extension of the previous observation, suggesting that given a high number of trees with medium or low randomness, the model overfits the training dataset without adding substantially to the accuracy [22], while highly random trees are either too inefficient as learners or require a forest density beyond the range allowed in our training to reach comparable accuracy levels [23].
2.4. Recurrent neural networks
Artificial Neural Networks (ANNs) are computational models that simulate the workings of the human brain to make decisions [16] and are often portrayed as one of the most accurate machine learning methods available [24]. Various ANN architectures specialize in different applications; for regression problems dealing with temporal sequences, Recurrent Neural Networks (RNNs) appear to be the most fitting choice [25]. The RNN used here has a Long Short-Term Memory (LSTM) layer as the recipient of the input, followed by a dropout layer and a number of hidden layers, each comprising one dense layer followed by a dropout layer. To avoid features with larger numerical values having an inordinate effect on the learning process, Z-score normalization is performed on each feature of the data [26] before it is fed into the LSTM layer, using the following equation:

$$Z_i = \frac{X_i - \bar{X}}{S} \tag{8}$$

where Zi is the normalized value of the i-th observation, Xi is the original value of the i-th observation, X̄ is the mean value of all observations, and S is the standard deviation of all observations.
Three RNN iterations were developed using RStudio, dubbed "ANNcomplete", "ANNhalf", and "ANNminimal", categorized with respect to the features used in accordance with their respective RF counterparts, as explained in the previous section. Similarly, different models with different combinations of hyperparameters were trained for each iteration using grid search, selecting the best-performing combination in terms of R2 on the independent validation dataset. These hyperparameters were the number of neurons for each layer, the number of hidden layers, the dropout rate, and the number of epochs, as shown in Table 4.
Foregoing the inclusion of time steps in the hyperparametrization process may appear an odd choice for the training of RNNs. However, to allow for maximum flexibility in applying the model to temporal datasets of wildly different lengths, the time step hyperparameter was set to 1 [27], as the LSTM layer should be capable of capturing long-term temporal dependencies on its own [28]. Additionally, it is important to mention that the layer architecture for each RNN model is identical, as shown in Figure 3.
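A sketch of this shared architecture in keras3 (the unit counts, dropout rate, and training settings below are placeholders for the grid-searched hyperparameters of Table 4, not the selected values):

```r
library(keras3)

n_features <- 3  # e.g., the "ANNminimal" feature set

# One LSTM input layer, a dropout layer, then hidden blocks of one dense
# layer plus one dropout layer each; the time step is fixed to 1.
model <- keras_model_sequential(input_shape = c(1, n_features)) |>
  layer_lstm(units = 64) |>
  layer_dropout(rate = 0.2) |>
  layer_dense(units = 32, activation = "relu") |>  # one hidden block
  layer_dropout(rate = 0.2) |>
  layer_dense(units = 1)                           # single regression output

model |> compile(optimizer = "adam", loss = "mse")

# x_train must be an array of shape (samples, 1, n_features), Z-score
# normalized per feature (Eq. 8), e.g., via scale() on the training data.
# model |> fit(x_train, y_train, epochs = 50, batch_size = 32)
```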
The final hyperparameters seem to suggest a relatively simple relationship between the selected features and solar radiation, as all models performed best with the lowest available number of epochs and hidden layers. Furthermore, the "ANNcomplete" model favored a higher dropout rate than its counterparts using fewer predictor features, possibly because the additional features, shown to be less significant in the preceding section, led to misestimated connection weights, so a higher deactivation rate allows for more consistent use of the more robust nodal connections [29]. However, it could also be explained by the marginally higher total number of neurons in the "ANNcomplete" model compared to the other two, as it could afford to deactivate more of them [30].
Regarding the number of neurons, no model achieved better accuracy with the highest possible number of neurons, which would suggest that noise within the dataset is quite prevalent [30]. Furthermore, as expected, the model with the lowest number of predictor variables also has the lowest available number of neurons for its hidden layers, though the highest number was picked for the LSTM layer. The "ANNhalf" model has a lower number of neurons in the LSTM layer, and given that relative humidity is a major predictor in this model but missing in the "ANNminimal" model, it would be fair to assume that the relationship between relative humidity and average daily solar radiation is not particularly dependent on long-term interactions [28].
2.5. Metrics for comparison
Different metrics were used to judge the performance of the final models from different perspectives [4]; for example, RMSE penalizes errors of greater magnitude more severely than Mean Absolute Error (MAE), which could be misleading [31] but could also potentially reveal model differences more clearly [32]. In addition to R2 and RMSE, which were discussed above and calculated using Eq. 7 and Eq. 6, respectively, the following metrics were used to compare the results:
Relative Root Mean Square Error (rRMSE), calculated as [4,33]:

$$rRMSE = \frac{RMSE}{\bar{y}} \times 100\% \tag{9}$$

Mean Absolute Error (MAE), calculated as [32]:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - y_{pi} \right| \tag{10}$$

Mean Bias Error (MBE), calculated as [31]:

$$MBE = \frac{1}{n} \sum_{i=1}^{n} (y_{pi} - y_i) \tag{11}$$

where yi is the observed value of the i-th time step, ypi is the predicted value of the i-th time step, n is the number of time steps, and ȳ is the average of all observed values.
For RMSE, rRMSE, and MAE, values closer to 0 denote better performance; for R2 this is reversed, with values closer to 1 denoting higher accuracy. MBE values are indicative of positive or negative bias, with values closer to 0 denoting a more equal distribution of overestimations and underestimations, not necessarily higher model accuracy.
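For reference, these metrics amount to one-liners in R (y denoting the observations and yp the predictions):

```r
rmse  <- function(y, yp) sqrt(mean((y - yp)^2))                      # Eq. 6
r2    <- function(y, yp) 1 - sum((y - yp)^2) / sum((y - mean(y))^2)  # Eq. 7
rrmse <- function(y, yp) 100 * rmse(y, yp) / mean(y)                 # Eq. 9, in %
mae   <- function(y, yp) mean(abs(y - yp))                           # Eq. 10
mbe   <- function(y, yp) mean(yp - y)                                # Eq. 11; sign = bias direction
```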
3. Results and Discussion
On the totality of the validation dataset, the ANNcomplete model (RMSE = 41.25, R2 = 0.810) showed the best results in all metrics, followed by the ANNhalf model (RMSE = 43.52, R2 = 0.789). The ANNminimal model (RMSE = 48.20, R2 = 0.740) had a marginally worse performance than the RFcomplete model (RMSE = 47.39, R2 = 0.746) and performed almost equally to the RFhalf model (RMSE = 47.93, R2 = 0.740). The non-calibrated equation using Tdew [14] (RMSE = 48.75, R2 = 0.736) performed similarly to the aforementioned, scoring marginally worse in all metrics except for MAE, where it had a slight edge. The next cluster of comparable performers (in order of descending accuracy) comprised the locally calibrated Tdew equation (RMSE = 50.85, R2 = 0.713), the locally calibrated Tmin equation (RMSE = 51.10, R2 = 0.710), and the RFminimal model (RMSE = 51.71, R2 = 0.697). Finally, the worst performers were the non-calibrated Tmin equation (RMSE = 53.18, R2 = 0.686), and the Tdew (RMSE = 53.00, R2 = 0.687) and Tmin (RMSE = 54.12, R2 = 0.674) equations calibrated on the totality of locations. All the corresponding results are presented in Table 5.
As for bias, all three ANN models were geared towards minor underestimation, with the ANNcomplete iteration having an almost equal distribution of over- and underestimations. The Random Forest models, on the other hand, consistently underestimated average solar radiation, with all but the RFminimal model holding the lowest (most negative) MBE scores (Table 5). The calibrated variations of the equations using Tmin were almost equally as prone to underestimation as the RFcomplete and RFhalf models, with the respective Tdew equations also leaning towards underestimation, but of a smaller magnitude. Finally, the two non-calibrated equations were the only methods to have positive MBE scores, though the magnitude of overestimation was on par with the middle-higher part of the spectrum of underestimation values.
A deeper look at the results (Table 5) shows that the RNN models consistently outperformed the RF models using the same or even more predictor variables; the expected "weakest" RNN model performed almost equally to the middle RF model in all metrics but MAE, where the RNN performed better. This could be explained by the inherent limitations of RF algorithms versus RNNs in handling regression tasks. Due to the inner workings of RF models, predictions tend to form clusters in a stepwise fashion [34], essentially creating biases against predicting values between these clusters, which could contribute to a significant loss of accuracy on the total dataset. This behavior is evident in Figure 4, where, in the respective plots, areas of lower point density can be observed between areas of higher point density more frequently than in their RNN counterparts. For comparison, the respective scatterplots for the equation methods (Figure 5) show continuous scatter clouds with no evident stepwise clustering.
Furthermore, as Random Forest relies on averaging the predictions of multiple decision trees, the final predictions tend to be stabilized closer to the expected values, reducing variability [19]; on one hand this reduces the risk of overfitting, but on the other it creates a relative weakness in predicting extreme or rare values [35]. Figure 4 again seems to support this idea, as the scatters of the RF models diverge more steeply from the perfect prediction line towards the higher end of values compared to their RNN counterparts. That is not to say that RNNs are not subject to the aforementioned observations, as both a degree of stepwise change in predictions and a drop-off in accuracy at the extremes are visible. In fact, the latter is much more pronounced on the lower end of values for the RNNs compared to the RFs, with a handful of predictions even being negative, a potential sign that some of the weights reducing the final prediction value were overestimated, leading to unrealistic results. Given that the number of negative values was negligible compared to the total sample size, this does not appear to be a cause for major concern, especially since they seem to be restricted to dates and locations that measured surprisingly low daily average solar radiation values. However, this should be further studied in future research.
However, the discrepancy in accuracy could be explained better, not by the RF's shortcomings, as they seem to be shared between the two techniques to some degree, but by the RNNs' ability to recognize long-term dependencies, which, given that the structure of the input data is that of continuous temporal sequences, gives them a distinct advantage. Though temporal sequentiality is implicitly taken into account by both ML techniques through extraterrestrial solar radiation, only the RNNs have an explicit perception of it [22,24], and thus they can capture complex temporal dependencies that the RF models cannot.
As for the equation methods, the Tdew iterations fared better than their Tmin counterparts in every use case and almost every metric, indicating that Valiantzas's [14] method is a better alternative to the Hargreaves and Samani [7] method. The iterations where kRs was calibrated on the totality of the dataset performed significantly worse than the other two use cases, and the non-calibrated iteration of the Tdew equation was the top performer among the equation methods, even against its locally calibrated counterpart. This could be explained by the fact that, when the same calibration procedure was performed on the validation dataset, the discovered "optimal" kRs values for the Tdew iterations in this specific timeframe were generally closer to the commonly accepted 0.17 than the values derived from calibrating kRs on the training dataset (Table 6). The locally calibrated Tmin equation, however, was the top performer among the Tmin iterations, suggesting that calibration based on location-specific historical data can be a viable approach, though its accuracy seems to involve a degree of randomness.
A closer look at the models' performance for each station (Table 7) provides some interesting insights. The ANNcomplete model was the top performer for 7 out of 10 locations, in addition to being the best performer on the total dataset and the second best on AUA10. However, for the two remaining stations it ranked in the lower half of performers, with significantly lower scores than the top performers; the ANNhalf model performed slightly better. This could potentially arise from the relationships between the selected variables being roughly equivalent in the majority of stations but significantly different in the remaining ones, creating bias [36]. Thus, an interesting avenue for future work would be the comparison of equivalent ML models with the integration of explicit spatial awareness features against ML models trained exclusively in and for each location.
The three stations where the ANNcomplete model did not rank at the top were AUA07, AUA09, and AUA10. The first was dominated by the RFcomplete model, closely followed by the RFhalf model, with RFminimal also ranking within the top half of performers. As for the latter two locations, the non-calibrated Tdew equation achieved the highest accuracy, in AUA09 specifically by a significant margin. As mentioned before, in AUA10 the ANNcomplete model ranked second, though it is important to mention that ANNhalf and ANNminimal ranked third and fourth in terms of accuracy.
Invariably, the worst performer in every location was some iteration of the equation methods. Somewhat unsurprisingly, the totally calibrated Tmin equation ranked as the worst performer in three locations, and its Tdew counterpart in two. The locally calibrated Tdew equation was the worst performer in AUA09, closely followed by its Tmin counterpart, both at a significant distance from the third worst rank held by ANNminimal. Last, despite the non-calibrated Tdew equation being the top performer in two locations, it ranked as the worst performer in two others, mirroring its Tmin counterpart. These observations seem to imply that simultaneous calibration for multiple locations within an area as large and as geomorphologically diverse as Nemea is ill-advised. Furthermore, they seem to vindicate Samani's recommendation of calibrating kRs based on monthly temperature ranges for each specific location [17], as opposed to calibrations based on historical site-specific data.
Regarding site-specific bias (Table 8), there seems to be a stable ordering of the methods' relative MBE values across sites: methods that consistently overestimate hold comparatively higher values than the other methods even where their scores are negative for a specific location, and vice versa. Expectedly, this does not hold for the locally calibrated equations, as they were independently calibrated for each station, as opposed to the other methods, which were invariably optimized for the totality of locations. This sliding scale of bias further supports the idea of incorporating explicit spatial awareness in potential follow-up ML models. Additional information on all statistical measures is provided in Appendix Tables A1 to A3.
4. Conclusions
The aim of this paper was to examine various methods for filling in missing or erroneous solar radiation readings in the event of sensor malfunction and under different regimes of substitute weather parameter availability. Considering only sheer accuracy with a generalized approach, the ANNcomplete model is evidently the best pick, with the equation methods showing, on average, worse performance than the average performance of the ML models. However, to properly select the most appropriate of the examined methods for a given application, they should be compared in groups based on required input parameters. The "complete" ML models are only comparable to one another, since they both use the totality of available routine meteorological data; in that regard, the ANNcomplete model is the better option with the exception of a single location. The "half" ML models are directly comparable to Eq. 5, as they utilize the same set of predictor variables, though in different forms. In most cases, again, the RNN seems to be the best pick, though it loses out against RF in the same location as its "complete" counterpart, and in two locations the non-calibrated Tdew equation performed better than the respective ML methods. Finally, the "minimal" variants are comparable to the Tmin equations, as all require the absolute minimum amount of necessary predictor variables. In this category, the picture painted is almost identical to the previous one, albeit with lower performance overall.
Additionally, the relative difficulty of development and implementation of each method should be taken into account. The equation methods are the most readily available, requiring no specialized software or hardware and directly offering results of acceptable accuracy without a lengthy preparation period. The ML methods, on the other hand, require specialized software, a moderate degree of technical knowledge, and relatively powerful machines to both develop and implement, making them less accessible. Once developed, however, any given ML model can be deployed for any relevant application with ease, requiring roughly the same preparation effort and time as the empirical methods. Between the two, RNNs required the larger amount of development time and processing power by a non-negligible margin. In our experience during this application, given the same number of input variables, time steps, and tunable hyperparameter combinations, RNN models took about 70-150% longer to train than their equivalent RF models, not to mention that they are also considerably less straightforward to implement and significantly harder to interpret [37].
Given our findings, the selection of the most appropriate method is a multifaceted process with no concrete universal answer. Given sufficient resources and a focus on sheer accuracy, RNNs are the best choice, but they are also the most demanding of the methods examined. Random Forest's performance did not constitute a significant enough improvement over the equations' to be considered a viable option in most cases. The empirical methods proved competitive against the ML models, and their deficiencies in accuracy compared to the equivalent ML models can be made up for by their relative ease of use. It is also important to mention that the equations examined are spatially generalized in their function, while the trained ML models are locally restricted to the study area and are expected to underperform in different locations; even within this one study area, their performance varied significantly across sites. It is our hope that this study will serve as a useful resource to inform interested parties in deciding the most cost-effective method for solving the problem of invalid solar radiation measurements according to their needs and capabilities.
Future research could examine the viability of location-specific ML models for the express purpose of filling in data gaps at known station locations, or the possibility of creating more generalized models through the integration of explicit spatial and temporal parameters, which, in addition to serving the aforementioned purpose, could find applications in regressing solar radiation values for stations that do not normally record them.
Author Contributions
Soulis K: Conceptualization, Methodology, Data curation, Supervision, Coding, Resources, Review & Editing. Nikitakis E: Coding, Methodology, Data curation, Writing, Validation, Investigation, Visualization, Formal analysis. Katsogiannou A: Data curation, Coding. Kalivas D: Resources, Supervision.
Acknowledgements
The authors would like to thank the anonymous reviewers for their contributions and constructive criticism.
Funding
The installation of the meteorological stations used in this study was co-funded by the NSRF and the European Union. This work was supported by the DT-Agro project, grant number 014815, which is carried out within the framework of the National Recovery and Resilience Plan Greece 2.0, funded by the European Union through NextGenerationEU (implementation body: HFRI), https://greece20.gov.gr.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Data availability
The data used in this study can be made available upon reasonable request.
Conflicts of interest
The authors hereby declare no conflict of interest.
Appendix
Additional Metric Tables
R Packages Used in this paper
● base R Core Team (2024) R: A Language and Environment for Statistical Computing. (R version 4.4.1) R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
● keras3 Kalinowski T, Allaire J, Chollet F (2024) keras3: R Interface to "Keras". R package version 1.2.0, https://CRAN.R-project.org/package=keras3.
● tensorflow Allaire J, Tang Y (2024) tensorflow: R Interface to "TensorFlow". R package version 2.16.0, https://CRAN.R-project.org/package=tensorflow.
● caret Kuhn M (2008) Building Predictive Models in R Using the caret Package. J Stat Software 28: 1–26. https://doi.org/10.18637/jss.v028.i05
● ranger Wright MN, Ziegler A (2017) ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J Stat Software 77: 1–17. https://doi.org/10.18637/jss.v077.i01
● randomForest Liaw A, Wiener M (2002) Classification and Regression by randomForest. R News 2: 18–22. https://CRAN.R-project.org/doc/Rnews/.
● readxl Wickham H, Bryan J (2023) readxl: Read Excel Files. R package version 1.4.3, https://CRAN.R-project.org/package=readxl.
● openxlsx Schauberger P, Walker A (2024) openxlsx: Read, Write and Edit xlsx Files. R package version 4.2.7.1, https://CRAN.R-project.org/package=openxlsx.
● ggplot2 Wickham H (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York.
● gridExtra Auguie B (2017) gridExtra: Miscellaneous Functions for "Grid" Graphics. R package version 2.3, https://CRAN.R-project.org/package=gridExtra.
● writexl Ooms J (2024) writexl: Export Data Frames to Excel "xlsx" Format. R package version 1.5.0, https://CRAN.R-project.org/package=writexl.
● dplyr Wickham H, François R, Henry L, et al. (2023) dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.
● moments Komsta L, Novomestky F (2022) moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests. R package version 0.14.1, https://CRAN.R-project.org/package=moments.