Machine learning models for water quality prediction often encounter challenges due to incomplete datasets resulting from issues such as sensor failures or data transmission errors. However, XGBoost can learn from incomplete data and make predictions even when some input values are missing, without requiring any manual intervention by the user. This offers a considerable advantage over other models that require complete datasets. To test this capability, we utilized XGBoost to forecast salinity levels in the Chao Phraya River, Thailand, based on investigating forecasts at lead times of 24 and 48 hours, using four predictors: the water levels from three stations and the salinity level. The tested model demonstrated high predictive performance across both complete and incomplete subsets of the data. Where the complete input dataset was available, our XGBoost model outperformed the single-level ANN model and was comparable to the multilevel ANN model. Additionally, we enhanced XGBoost's ability to forecast under incomplete data conditions by utilizing both real and synthetic datasets. The synthetic dataset was generated by removing 1–3 predictor variables from the complete dataset. Based on the results, incorporating synthetic data into the training dataset substantially enhanced the model's robustness against missing data. Notably, training with synthetic data that excluded one variable at a time was sufficient for accurate predictions. These results underscore XGBoost's practicality and reliability for real-world water quality forecasting, particularly under conditions of data scarcity.
Citation: Jiramate Changklom, Trang Prommana, Phakawat Lamchuan, Adichai Pornprommin. Salinity forecasting in Chao Phraya River using XGBoost with missing data handling[J]. AIMS Environmental Science, 2025, 12(5): 916-935. doi: 10.3934/environsci.2025040
Machine learning models for water quality prediction often encounter challenges due to incomplete datasets resulting from issues such as sensor failures or data transmission errors. However, XGBoost can learn from incomplete data and make predictions even when some input values are missing, without requiring any manual intervention by the user. This offers a considerable advantage over other models that require complete datasets. To test this capability, we utilized XGBoost to forecast salinity levels in the Chao Phraya River, Thailand, based on investigating forecasts at lead times of 24 and 48 hours, using four predictors: the water levels from three stations and the salinity level. The tested model demonstrated high predictive performance across both complete and incomplete subsets of the data. Where the complete input dataset was available, our XGBoost model outperformed the single-level ANN model and was comparable to the multilevel ANN model. Additionally, we enhanced XGBoost's ability to forecast under incomplete data conditions by utilizing both real and synthetic datasets. The synthetic dataset was generated by removing 1–3 predictor variables from the complete dataset. Based on the results, incorporating synthetic data into the training dataset substantially enhanced the model's robustness against missing data. Notably, training with synthetic data that excluded one variable at a time was sufficient for accurate predictions. These results underscore XGBoost's practicality and reliability for real-world water quality forecasting, particularly under conditions of data scarcity.
| [1] |
Changklom J, Lamchuan P, Pornprommin A (2022) Salinity forecasting on raw water for water supply in the Chao Phraya River. Water 14: 741. https://doi.org/10.3390/w14050741 doi: 10.3390/w14050741
|
| [2] | Tomkratoke S, Kongkulsiri S, Narenpitak P, et al. (2025) Drought and salinity intrusion in the Lower Chao Phraya River: Variability analysis and modeling mitigation approaches. EGUsphere https://doi.org/10.5194/egusphere-2024-4052 |
| [3] |
Kulmart K, Charoenroongruang C (2024) Mathematical forecasting simulation of salinity intrusion in Chao Phraya River. J Inf Optim Sci 45: 181–198. https://doi.org/10.47974/JIOS-1410 doi: 10.47974/JIOS-1410
|
| [4] |
Zhao Q, Zhu Y, Wan D, et al. (2018) Research on the data-driven quality control method of hydrological time series data. Water 10: 1712. https://doi.org/10.3390/w10121712 doi: 10.3390/w10121712
|
| [5] | Ribeiro SM, de Castro CL (2022) Missing data in time series: A review of imputation methods and case study. Learn Nonlinear Models 20: 31–46. |
| [6] |
Nijman SWJ, Leeuwenberg AM, Beekers I, et al. (2022) Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol 142: 218–229. https://doi.org/10.1016/j.jclinepi.2021.11.023 doi: 10.1016/j.jclinepi.2021.11.023
|
| [7] |
Gravesteijn BY, Sewalt CA, Venema E, et al. (2021) Missing Data in Prediction Research: A Five-Step Approach for Multiple Imputation, Illustrated in the CENTER-TBI Study. J Neurotrauma 38: 1842–1857. https://doi.org/10.1089/neu.2020.7218 doi: 10.1089/neu.2020.7218
|
| [8] | Wen H, Pinson P, Gu J, et al. (2022) Wind energy forecasting with missing values within a fully conditional specification framework. Int J Forecasting https://doi.org/10.1016/j.ijforecast.2022.10.012 |
| [9] |
Nijman SWJ, Hoogland J, Groenhof TKJ, et al. (2021) Real-time imputation of missing predictor values in clinical practice. Eur Heart J Digit Health 2: 154–164. https://doi.org/10.1093/ehjdh/ztaa016 doi: 10.1093/ehjdh/ztaa016
|
| [10] |
Zhang Y, Thorburn PJ (2022) Handling missing data in near real-time environmental monitoring: A system and a review of selected methods. Future Gener Comput Syst 128: 63–72. https://doi.org/10.1016/j.future.2021.09.033 doi: 10.1016/j.future.2021.09.033
|
| [11] |
Matusowsky M, Ramotsoela DT, Abu-Mahfouz AM (2020) Data imputation in wireless sensor networks using a machine learning-based virtual sensor. J Sens Actuator Netw 9: 25. https://doi.org/10.3390/jsan9020025 doi: 10.3390/jsan9020025
|
| [12] |
Chen D, Yang S, Zhou F (2019) Transfer learning based fault diagnosis with missing data due to multi-rate sampling. Sensors 19: 1826. https://doi.org/10.3390/s19081826 doi: 10.3390/s19081826
|
| [13] | Kaveh M, Mesgari MS, Kaveh M (2025) A novel evolutionary deep learning approach for PM2.5 prediction using remote sensing and spatial–temporal data: A case study of Tehran. ISPRS Int J Geo-Inf 14: 42. https://doi.org/10.3390/ijgi14020042 |
| [14] | Mena F, Arenas D, Dengel A (2024) Increasing the robustness of model predictions to missing sensors in Earth observation. arXiv preprint arXiv: 2407.15512. https://doi.org/10.48550/arXiv.2407.15512 |
| [15] |
Kaya A, Keçeli AS, Catal C, et al. (2020) Sensor failure tolerable machine learning-based food quality prediction model. Sensors 20: 3173. https://doi.org/10.3390/s20113173 doi: 10.3390/s20113173
|
| [16] | Ok E, Emmanuel M. Handling Missing Data in XGBoost. ResearchGate, 2024. Available from: https://www.researchgate.net/publication/390138079_Handling_Missing_Data_in_XGBoost |
| [17] |
Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. Proc ACM SIGKDD Int Conf Knowl Discov Data Min 22: 785–794. https://doi.org/10.1145/2939672.2939785 doi: 10.1145/2939672.2939785
|
| [18] | März A, Rasul K (2024) Forecasting with Hyper-Trees. arXiv preprint arXiv: 2405.07836. https://doi.org/10.48550/arXiv.2405.07836 |
| [19] | Petneházi G (2019) Recurrent neural networks for time series forecasting. arXiv preprint arXiv: 1901.00069. https://doi.org/10.48550/arXiv.1901.00069 |
| [20] | Khaldi R, El Afia A, Chiheb R, et al. (2024) What is the best RNN-cell structure to forecast each time series behavior? Expert Syst Appl https://doi.org/10.48550/arXiv.2303.07844 |
| [21] |
Che Z, Purushotham S, Cho K, et al. (2018) Recurrent neural networks for multivariate time series with missing values. Sci Rep 8:6085. https://doi.org/10.1038/s41598-018-24271-9 doi: 10.1038/s41598-018-24271-9
|
| [22] | Mehdipour Ghazi M, Nielsen M, Pai A, et al. (2018) Robust training of recurrent neural networks to handle missing data for disease progression modeling. arXiv preprint arXiv: 1808.05500. https://arXiv.org/abs/1808.05500 |
| [23] | Lipton ZC, Kale DC, Wetzel R (2016) Modeling Missing Data in Clinical Time Series with RNNs. Proceedings of the Machine Learning for Healthcare Conference. Available from: https://proceedings.mlr.press/v56/Lipton16.pdf |
| [24] | Bekkerman R (2015) The present and the future of the KDD Cup competition: an outsider's perspective. LinkedIn. Available from: https://www.linkedin.com/pulse/present-future-kdd-cup-competition-outsiders-ron-bekkerman |
| [25] |
Makumbura RK, Mampitiya L, Rathnayake N, et al. (2024) Advancing water quality assessment and prediction using machine learning models, coupled with explainable artificial intelligence (XAI) techniques like shapley additive explanations (SHAP). Results Eng 23: 102831. https://doi.org/10.1016/j.rineng.2024.102831 doi: 10.1016/j.rineng.2024.102831
|
| [26] |
Al Saleem M, Harrou F, Sun Y (2024) Explainable machine learning methods for predicting water treatment plant features under varying weather conditions. Results Eng 21: 101930. https://doi.org/10.1016/j.rineng.2024.101930 doi: 10.1016/j.rineng.2024.101930
|
| [27] |
Ahmed Y, Dutta KR, Chowdhury Nepu SN, et al. (2025) Optimizing photocatalytic dye degradation: A machine learning and metaheuristic approach for predicting methylene blue in contaminated water. Results Eng 25: 103538. https://doi.org/10.1016/j.rineng.2024.103538 doi: 10.1016/j.rineng.2024.103538
|
| [28] | Metropolitan Waterworks Authority. (2022). MWA Consumer Confidence Report 2022. Available from: https://www.mwa.co.th/wp-content/uploads/2023/03/2022-Annual-Water-Quality-Report-%E0%B8%AD%E0%B8%B1%E0%B8%87%E0%B8%81%E0%B8%A4%E0%B8%A9.pdf |
| [29] |
Pokavanich T, Guo X (2024) Saltwater intrusion in Chao Phraya Estuary: A long, narrow and meandering partially mixed estuary influenced by water regulation and abstraction. J Hydrol Reg Stud 52: 101686. https://doi.org/10.1016/j.ejrh.2024.101686 doi: 10.1016/j.ejrh.2024.101686
|
| [30] |
Heyse J, Sheybani L, Vulliémoz S, et al. (2021) Evaluation of directed causality measures and lag estimations in multivariate time-series. Front Syst Neurosci 15: 620338. https://doi.org/10.3389/fnsys.2021.620338 doi: 10.3389/fnsys.2021.620338
|
| [31] |
Nash JE, Sutcliffe JV (1970) River flow forecasting through conceptual models part I — A discussion of principles. J Hydrol 10: 282–290. https://doi.org/10.1016/0022-1694(70)90255-6 doi: 10.1016/0022-1694(70)90255-6
|
| [32] |
Schouten RM, Lugtig P, Vink G (2018) Generating missing values for simulation purposes: A multivariate amputation procedure. J Stat Comput Simul 88: 2909–2930. https://doi.org/10.1080/00949655.2018.1491577 doi: 10.1080/00949655.2018.1491577
|
| [33] |
Rubright JD, Nandakumar R, Glutting JJ (2014) A simulation study of missing data with multiple missing X's. Pract Assess Res Eval 19: 10. https://doi.org/10.7275/9ew5-zd12 doi: 10.7275/9ew5-zd12
|
| [34] |
Bourguignon L, Lukas LP, Guest JD, et al. (2024) Studying missingness in spinal cord injury data: challenges and impact of data imputation. BMC Med Res Methodol 24:5. https://doi.org/10.1186/s12874-023-02125-x doi: 10.1186/s12874-023-02125-x
|
| [35] | Saar-Tsechansky M, Provost F (2007) Handling missing values when applying classification models. J Mach Learn Res 8: 1623–1657. |
| [36] |
López-Chacón SR, Salazar F, Bladé E (2023) Combining synthetic and observed data to enhance machine learning model performance for streamflow prediction. Water 15: 2020. https://doi.org/10.3390/w15112020 doi: 10.3390/w15112020
|
| [37] |
Yang S, Kim K-D, Ariji E, et al. (2023) Evaluating the performance of generative adversarial network-synthesized periapical images in classifying C-shaped root canals. Sci Rep 13: 18038. https://doi.org/10.1038/s41598-023-45290-1 doi: 10.1038/s41598-023-45290-1
|
| [38] | Duffy W, O'Connell E, McCarroll N, et al. (2025) Evaluating rule-based and generative data augmentation techniques for legal document classification. Knowl Inf Syst. https://doi.org/10.1007/s10115-025-02454-x |
| [39] |
Chawla NV, Bowyer KW, Hall LO, et al. (2002) SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res 16: 321–357. https://doi.org/10.1613/jair.953 doi: 10.1613/jair.953
|
| [40] | Li Y, Bonatti R, Abdali S, et al. (2024) Data generation using large language models for text classification: An empirical case study. Proc Mach Learn Res 235: 1–17. |
Environ-12-05-040 - s001.pdf |
![]() |