1.
Introduction
River water is a vital surface water resource for households, agriculture, and industry activities. It also plays an important role in health and environmental issues. Thus, assessing and monitoring the quality and quantity of the water carried in the River is essential. A water quality index (WQI) is a single number that provides information about overall water quality. It is calculated using a parametric expression from several biological, physical, and chemical parameters measured from a water sample at a location and time. The parametric expression for WQI calculation involves the use of relative weights per involved parameter. According to the Water Quality Management Bureau, Pollution Control Department, Thailand, five water parameters including dissolved oxygen (DO), biological oxygen demand (BOD), total coliform bacteria (TCB), fecal coliform bacteria (FCB), and ammonia nitrogen (NH3-N) are employed for the estimation of WQI. Moreover, the WQI is often represented as five levels of water quality; very good (91-100), good (71-90), average (61-70), poor (31-60), and very poor (0-30).
Many researchers have been interested in examining factors that affect water quality, both man-made and environmental. However, changes in climate conditions such as the increase of water temperature, precipitation and evaporation patterns, and heavy rainfall and flooding mainly affect the biological and chemical properties of water quality. As a result, the relationship between climate factors and WQI has been studied, as well as finding an appropriate statistical model to represent the relationship. The WQI parameters were estimated for water streams in Finland with air temperature, rainfall run-off, and precipitation and characteristics of catchment areas as independent variables using artificial neural network (ANN) model [1]. The relationship between the WQI and climate variables on the Euphrates River within Karbala city, Iraq, has been investigated [2]. The non-linear regression and ANN models were employed to forecast the relationship between the WQI and temperature, relative humidity, rainfall depth, and sunshine duration. The non-linear regression model predicted the WQI better than the ANN models. The multiple linear regression (MLR) is used to study the relations between climate variables: temperature, relative humidity, and selected water quality parameters in Lake Manzala, Egypt [3]. They found a positive relation between studied variables. Recently, [4] used MLR and ANN models for predicting the stream water quality parameters on Green River watershed, Kentucky, USA, with independent variables precipitation, temperature, and land-use data.
Although MLR and ANN are widely used to model the relationship between the WQI parameters and climate variables, choosing the suitable regression model and dealing with restrictions in underlying assumptions in MLR modeling could be challenging. One of the difficulties in ANN modeling is choosing an appropriate number of hidden nodes that can affect its performance. The Gaussian process (GP) is a Bayesian machine learning method which has gained attention due to its flexibility in modeling. A GP can be applied in regression analysis, called Gaussian process regression (GPR). It has been applied and compared to estimating and forecasting models in many fields. For example, in health science, [5] used the GPR to estimate the child mortality rate in Iran in 1990 - 2013. They found that the GPR can efficiently estimate the mortality rate with relatively high fitting precision and flexibility. In earth science, [6] used the capability of GPR to predict the porosity and permeability of the southern basin of the South Yellow Sea. They compared the performance of the GPR to the back propagation neural network (BPNN), generalized regression neural network (GRNN), and radial basis function neural network (RBFNN). In food science, [7] applied the GPR to model the drying time of mosambi peel. The study showed that the GPR could be used as an alternative method because it provided a better estimation than the ANN and the response surface method. These findings mentioned above revealed that the GPR is one of the most powerful techniques and can be used in diverse fields.
In this work, we want to use the GPR, MLR, and ANN to predict the WQI using climate variables. The data used in the study is the WQI of the Ping River, the major River in the north of Thailand. The climate-related variables are temperature, humidity, rainfall, and evaporation. We also assess the performance of the prediction ability of the models.
2.
Materials and methods
2.1. Study area and data
The Ping River is located in the central part of northern Thailand (19∘48′45"N 98∘50′20"E). The Ping River originates in Chiang Dao district, Chiang Mai Province. It then flows through the provinces of Lamphun, Tak, Kamphaeng Phet, and Nakhon Sawan. Its estimated length is 658 km, with catchment areas of around 34,885 km2. The average discharge is about 265 m3/s. The Wang River is its main tributary. The Ping river joins the combined Nan and Yom rivers to form the Chao Phraya River, the major River in Thailand, formed in the center of the country and flows through Bangkok and then into the Gulf of Thailand. The Ping river basin is in the area, where is mainly influenced by the Southwest and Northeast monsoon between mid-May or June to mid-October.
The WQI of the rivers in Thailand is monitored by the Pollution Control Department, Water Quality Management Bureau, The Ministry of Natural Resources and Environment of Thailand. There are 14 water quality monitoring stations located in many areas along the basin, shown in Table 1. The spatial plot of the stations and areas of study is illustrated in Figure 1. The WQI is measured in three-month intervals, mostly in February, May, August, and November.
The climate data are daily provided by the Northern Meteorological Center, Meteorological Department, Ministry of Digital Economy and Society of Thailand. To related the WQI and climate data, we then use the monthly average temperature (∘C), monthly average humidity (%), monthly total rainfall (mm.), and monthly average evaporation (mm.) as climate variables in this work.
The WQI and climate data originate from different sources that unfortunately are collected in different locations and timespans. Therefore, data were filtered and only those collected at the same locations and timespans (January 2010-December 2019) were selected for this analysis. Consequently, the selected data used in modeling originate from eight water monitoring stations located in four areas as follows: (1) Mueang-KamphengPhet (PI04 and PI05), (2) Mueng-Tak (PI06, PI07, and PI08) (3) SamNgao-Tak (PI09), and (4) Mueang-Chiang Mai (PI12 and PI13). In each station, we expect 40 WQI observations, however, there are some missing values. Therefore, the selected data consist of 284 observations from eight stations.
2.2. Multiple linear regression (MLR)
Multiple linear regression (MLR) attempts to predict a dependent or response variable, y, on the basis of an assumed linear relationship with several independent or explanatory variables, x1,x2,...,xd. The MLR model can be expressed as
where f(.) is a transition function, mapping the relationship between the response and independent variables. The ϵ is a random error assumed to be independent and identical normally distributed with zero mean and constant variance, ϵ∼N(0,σ2ϵ). For n observations, the model can be expressed in the form of vector and matrix as
where y is n×1 vector of observed values of the response variable, X is a n×(d+1) matrix of independent variables, β is (d+1)×1 vector of regression coefficient parameters, and ϵ is n×1 vector of random errors. The parameter estimation for β and σ2ϵ can be performed using several approaches such as the least square and maximum likelihood estimation. After model fitting, the model adequacy and validation are investigated using some properties of the residuals, e=y−ˆy, where y is the observed responses vector and ˆy is the predicted response obtained from the fitted model. The assumptions of the MLR require that residuals are normal distributed with no systematic pattern, constant variance, and no outliers.
In this study, we use a stepwise method to select the independent variables. The variance inflation factor (VIF) is used to investigate the multicollinearity issue among independent variables. The residual assumptions are diagnosed by Breusch-Pagan (BP) test for testing the constant variance assumption, Shapiro-Wilk (SW) test for normality, and the Durbin-Watson (DW) test for autocorrelation. The Box-Cox transformation is employed to find the best transformation if the assumption(s) is violated.
2.3. Artificial neural network (ANN)
An artificial neural network (ANN) are multi-layer fully-connected neural nets that consist of three-layer processing units; an input layer, multiple hidden h layers, and an output layer. The model is inspired by the human brain that tries to find data structures and algorithms for learning and classifying data. The relation between the output y and the input (x1,x2,...,xd) can be written as follow
where wk,h with k=1,...,p;j=1,2,...,h is the connection weight, d and h are number of input vectors and number of hidden nodes, respectively. Moreover, g is a sigmoid transfer function, w0 and w0,j are weights from the bias terms, and ϵ is the error term. Each of the inputs is multiplied by a connection weight or synapse. A given node takes the weighted sum of its inputs and passes it through a non-linear activation function. The output of the node then becomes the input of another node in the next layer. The number of nodes of the input layer corresponds to the number of variables describing the attributes being tested. The weight parameters of the ANN model are estimated by optimization solution to minimize the sum of squared of residual. The package \texttt{neuralnet} in R programming [8], by training of neural networks using backpropagation network, is used in this study for obtaining the weights estimates and fitted responses. More details about ANN modeling can be found in [9].
2.4. Gaussian process regression (GPR)
A Gaussian process (GP) is a stochastic process involving random variables, any finite number of which have a joint Gaussian distribution. It is considered to be a non-parametric method for modeling data, as the model structure is determined from the data rather than through a parametric model. In the same way that a Gaussian random variable is characterized by its mean and variance, a GP is completely characterized by its mean function and covariance function, which are functions of the input vector. We define the input vector x as a collection of inputs, where x=(x1,...,xn)′. We use the following notation to denote that f(.) is a Gaussian process:
The m(x) is the mean function and k(x,x′) is the covariance function. The mean function has the effect of shifting a zero mean GP by the amount of m(x), often set to be zero. It is the covariance function, which largely determines the properties of samples from the GP model. In this study, we employ the squared exponential (SE) covariance, which is the most commonly used in GP modeling. If we consider the input X with d-dimensional, the covariance function has the form of a direct sum as follows
The SE covariance function is often controlled by parameters, called hyper-parameters: σ2gp controls the amplitude of variation of the sample functions and the correlation parameters lk,k=1,...,d, controls the smoothness of the samples. We define θ to be a set of hyper-parameters for a given mean and covariance function, θ=(σ2gp,{lk}dk=1). More details of the GP modeling can be found in [10].
In regression modeling, we can instead model the transition function f(x) using a GP as
Given a collection of inputs X, the vector f=f(X) has a multivariate Gaussian distribution,
where m(X) is the prior mean vector, and K=k(X,X′) is the covariance matrix or Gram matrix.
Consider regression model y=f(X)+ϵ where ϵ is the additive Gaussian noise N(0,σ2ϵ), the distribution of the output y is then y|f∼N(f,σ2ϵI), and y|x∼N(m(X),K+σ2ϵI), where I is the identity matrix.
Given a set of training input vectors and output, we are often interested in predicting test output y∗=(y∗1,...,y∗m) and latent function variables f∗=f(X∗), at test input X∗, where X∗=(x∗1,x∗2,...,x∗d). We can write the joint distribution of training output y and latent function variables f∗, and training output y and test output y∗ as follow
Bayes's rule is then applied to obtain the joint posterior distribution of training and test latent variables given the training outputs
The predictive distribution of f∗ at given test locations can be derived using conditional distribution of two Gaussian random variables, yields
The predictive distribution of the target outputs, y∗ is
The predictions are the mean vector μ∗, and variances can be obtained from the diagonal of the covariance matrix Σ∗+σ2ϵI. The package mlegp in R programming [11] is used to obtain the hyper-parameter estimates and the GPR predictive mean and variances, as shown in Equation (7).
2.5. Evaluation of prediction accuracy
The scores used to compare the predictive performance of the models are the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE), given by
where yi and ˆyi are the ith observed and the predicted response for i=1,...,n and n is the number of observations. These scores are negatively-oriented: lower values are better. Moreover, the plot between the observed and the predicted is illustrated along with the diagonal line. A perfect alignment with the line indicates no difference between observed and predicted, yields perfect prediction. The coefficient of determination (r2) is computed to measure how well the model predicts the data, given by
3.
Results
3.1. Exploratory data analysis
The descriptive statistics of the WQI and climate data of the selected stations and areas are shown in Table 2. The time series plots and boxplots of the WQI from 8 stations and the climate data from four areas are displayed in Figure 2 and 3, respectively.
The means and medians of the WQI are between 58-70. The minimum is 41, and the maximum is 89 at station PI13 and PI04, respectively. The standard deviations are similar in many stations, but at station PI04 and PI09, they are slightly higher than the others. The boxplots show that there are some outliers at stations PI04, PI05, PI06, and PI08. According to Thai Meteorological Department, Thailand has three seasons: summer (February-May), rainy season (June – September), winter (October – January). The statistics of WQI in the seasons are as follows: the means (standard deviation) are 66 (8.46), 62.7 (7.93), and 66.73 (8.39). This means that the WQI in the rainy season is, on average, lower than summer and winter season. However, there is no clear patterns on the variability of WQI by area, as seen in Figure 2.
For climate data, the monthly average temperature (AvgTemp), average humidity (AvgHumid), and average evaporation (AvgEvapor) have similar means and medians in every area. The means of the monthly total rainfall (TotalRainfall), particularly in KamphengPhet and Chiang Mai, are relatively high, whereas the medians are not much different. This indicates the extreme values in the total rainfall data. Moreover, we provide the scatterplot matrix of all data in Figure 4 to show the relationship among variables. The small values of the correlation coefficients indicate a weak relationship between the WQI and the climate variables. Although the plots suggest that the climate variables are moderately correlated, the VIFs indicate no multicollinearity issue.
3.2. Modeling
The dataset is divided into two sets, training and test set, to validate the model performance for seen and unseen data. The output data is the WQI. The input data are the 4 climate variables: monthly average temperature, monthly average humidity, monthly total rainfall, and monthly average evaporation. All data in year 2010-2018 is selected to be the training set with 256 observations or 90.14% of the total number of observations. All data in the year 2019 is used as the test set with 28 observations or 9.86% of the total number of observations. Hence, the dimension of inputs for training and test set are 256 rows × 4 columns and 28 rows × 4 columns, respectively.
We normalize the data before modeling using the standardization technique. Therefore, each variable will be centered around zero and have approximately unit variance. The training data is used in modeling with three methods; MLR, ANN, and GPR. For the MRL, initially, we consider the full model with all four climate variables denoted by MLR1. After that, we use the stepwise method for model selection, and we found that only TotalRainfall is included in the model denoted by MLR2. In fact, the MLR1 and MLR2 do not meet the normality and independent assumptions. Then we use the Box-Cox transformation to MRL1 and MLR2, and it suggests a transformation with λ=−0.5 denoted by MLR3 and MLR4, respectively. As a result, the residuals assumptions for MLR3 and MRL4 are acceptable. The number of nodes in the hidden layer, h, is varied. We use the same set of climate variables used in the MLR1 and MLR2 denoted as ANN1 and ANN2. We found that the best number of hidden nodes, according to the least RMSE for ANN1 and ANN2, are 4 nodes and 1 node. The underlying GP is set to have zero mean and SE covariance function with unknown nugget variance. We also use the same set of climate variables in the ANN models denoted by GPR1 and GPR2. The fitted results from 3 approaches with 8 models are shown in Tables 3-5
3.3. Comparison of models
To validate the prediction performance of the models, the predictive accuracy scores are presented in Table 6. The scores indicate that GPR models outperform the MLR and ANN models for both training and test sets, with the least values in RMSE, MAE, and MAPE. Besides, three scores from all models suggest that the methods perform better in the training set compared to the test set. The plots between the observed and predicted of the WQI of the training set (black dots) and the test set (red dots) are illustrated in Figure 5. We can observe that, for the training set, the results from the GPRs are more concentrated around the diagonal line than those from the MLRs and ANNs. This confirms that the GPRs give a better prediction, which corresponds to the highest values of r2. Nevertheless, it is essential to point out that the ANN and GPR models perform well only when the WQI is in the range of 50 to 70. This can be explained by the density plot of WQI in the training and test set, shown in Figure 6. The WQI in the training set is mostly in range of 50 to 70, while in the test set, the range is from 50 to 90. The models are unable to capture the information and predict when the data are higher than 70. Consequently, the accuracy of the prediction is relatively poor in the test set, particularly for the WQI over 70.
We also investigate the distribution of residuals produced from all models using boxplots shown in Figure 7. For the training set, all models generate residuals with the center location of zero. The deviation of residuals is small in the case of GPRs compared to MLRs and ANNs. For the test set, the mean and median of residuals produced from MLRs are most deviated from zero, followed by ANNs and GPRs. The standard deviations of the residuals from the test set are not significantly different, but they are still larger than those from the training set.
4.
Discussion
Water quality is influenced by climate variabilities such as precipitation, temperature, rainfall level, and wind patterns. In this study, we assess the impact of climate change on the water quality of Ping River, Thailand. We focus on investigating the association of the climate data; temperature, humidity, total rainfall, and evaporation, and the water quality index using statistical modeling. An important result of this study is to develop a predictive model to assess how climate variables affect the WQI. The result suggested that the amount of total rainfall is the only significant climate variable that impacts WQI. The limitation of the available data allows us to analyze the WQI and climate data on the different locations. Moreover, the WQI is not collected in the same period as the climate data. These could be the main reasons that affect our results.
Three statistical models, MLR, ANN, and GPR, are used to analyze the relationship between WQI and climate data. The MLR is often used to describes the relationship between inputs and outputs by specifying a functional form mapping from inputs to outputs. We often set the function to be some specific form, such as combinations of linear, cubic, and higher-order polynomial terms with unknown parameters. However, choosing the correct function can be difficult. The ANN relies on the optimal number of nodes to obtain a better estimation. Unlike MLR and ANN, the GPR can learn the data well without specifying the function between inputs and outputs. It is considered a more flexible way to model the data. However, the GPR can be computationally expensive when the sample size is large. Our results have shown that the GPR is more efficient than the traditional MLR and the ANN as it can achieve better prediction accuracy. According to the results, the prediction accuracy from the full model (MLR1, ANN1, and GPR1), where all climate variables are included, is not much different from that of the models (MLR2, ANN2, and GPR2) consisting of only total rainfall variable.
5.
Conclusions
This study demonstrated and compared the performance of the GPR, the MLR, and the ANN in the regression problem with the case study in predicting the WQI at Ping River basin, Thailand. The climate factors, temperature, humidity, total rainfall, and evaporation were used as the independent variables. We found that the total rainfall was the only significant variable to be included in the models. We also considered various models for each method with the same set of climate variables. In summary, the GPR models are superior to the MLR and the ANN models according to the prediction accuracy.
Acknowledgments
The authors gratefully acknowledge the financial support provided by the Faculty of Science, Chiang Mai University. We would also like to thank the Pollution Control Department, Water Quality Management Bureau, The Ministry of Natural Resources and Environment of Thailand and the Northern Meteorological Center, Meteorological Department, Ministry of Digital Economy and Society of Thailand for their assistance with the data.
Conflict of interest
The authors declare no conflict of interest.