Research article

Multivariate prediction of total organic carbon in river water using random forest and deep learning regression algorithms


  • Received: 25 May 2025; Revised: 13 October 2025; Accepted: 29 October 2025; Published: 04 November 2025
  • Total organic carbon (TOC) measures the total amount of organic compounds in water and has been used for decades as an indicator of purity in industrial water. Measuring TOC is often time-consuming and expensive, requiring multiple high-precision sensors. The main aim of this study was to compare the Random Forest (RF) algorithm, a one-dimensional convolutional neural network (1D-CNN), and a multilayer perceptron (MLP) for predicting TOC from water quality parameters selected by the strength of their correlation. RF was chosen because it can model complex interactions among the parameters and is resistant to overfitting. The 1D-CNN can capture local dependencies and spatial relationships between input features, whereas the MLP handles independent numerical features in the dataset. The dataset came from analyses of Puget Sound marine waters around Seattle, King County, USA, at three sampling locations on the Duwamish River. Learning curve analysis indicated that the dataset was large enough for stable training and generalization, and five-fold cross-validation confirmed consistent model performance across data splits. The effects of wet and dry seasons on parameter levels were examined, and their impact on RF model accuracy was assessed. Parameters were also selected by Gini importance ranking to evaluate the effect on RF model accuracy. RF regression gave the most accurate TOC predictions, with a coefficient of determination (R²) of 0.732; the 1D-CNN and MLP had R² values of 0.714 and 0.638, respectively. The mean absolute errors (MAE) for RF, 1D-CNN, and MLP were 0.120 mg/L, 0.244 mg/L, and 0.270 mg/L, respectively. We concluded that the RF algorithm would be more feasible for predicting TOC in river water than the two deep learning methods.
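
    The correlation-based parameter selection and the two reported metrics (R² and MAE) can be illustrated with a minimal sketch. It is not the authors' pipeline: the parameter names (`turbidity`, `noise`), the toy values, and the 0.5 threshold are hypothetical, and Pearson correlation is assumed as the correlation measure. The sketch only shows how such a ranking and the standard metric definitions are computed.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def select_by_correlation(features, target, threshold):
    """Return {name: r} for features whose |r| with the target meets the threshold."""
    scores = {name: pearson_r(col, target) for name, col in features.items()}
    return {name: r for name, r in scores.items() if abs(r) >= threshold}


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data (not taken from the study's dataset).
toc = [1.0, 2.0, 3.0, 4.0]              # observed TOC, mg/L
features = {
    "turbidity": [2.0, 4.0, 6.0, 8.0],  # hypothetical parameter, linear in TOC
    "noise": [1.0, -1.0, -1.0, 1.0],    # hypothetical parameter, uncorrelated
}
selected = select_by_correlation(features, toc, threshold=0.5)

toc_pred = [1.5, 2.0, 3.0, 3.5]         # hypothetical model predictions, mg/L
r2 = r2_score(toc, toc_pred)
mae = mean_absolute_error(toc, toc_pred)

print("selected parameters:", sorted(selected))
print("R2:", round(r2, 3), "MAE (mg/L):", round(mae, 3))
```

    R² is unitless and compares the model with a predict-the-mean baseline, whereas MAE is in the units of the target (mg/L here), which is why the abstract reports both.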

    Citation: Eric Kipkirui Kemei, Kristof Van Laerhoven, Nancy Wangeci Karuri, Robert Kimutai Tewo. Multivariate prediction of total organic carbon in river water using random forest and deep learning regression algorithms[J]. Applied Computing and Intelligence, 2025, 5(2): 264-285. doi: 10.3934/aci.2025015



  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
