
Citation: Beth Ravit, Frank Gallagher, James Doolittle, Richard Shaw, Edwin Muñiz, Richard Alomar, Wolfram Hoefer, Joe Berg, Terry Doss. Urban wetlands: restoration or designed rehabilitation?[J]. AIMS Environmental Science, 2017, 4(3): 458-483. doi: 10.3934/environsci.2017.3.458
Molecular biology research on the molecular basis of biological activity requires data. Biologists acquire these data using technologies such as microarrays and RNA sequencing (RNA-seq). Microarray technology detects nucleic acid sequences and measures thousands of gene transcripts from a sample simultaneously [1]. RNA-seq uses next-generation sequencing to reveal the presence and quantity of RNA in a biological sample. Both techniques produce a high-dimensional gene expression data matrix in which rows represent genes, columns represent experimental conditions, and each cell holds the expression of a gene under a condition. Gene expression data are very important for acquiring knowledge about cells, but they frequently contain missing values. These missing values are often caused by experimental errors such as hybridization failures in microarray datasets and missing read counts in RNA-seq datasets. However, further analysis of these datasets requires a complete data matrix. Therefore, missing-value imputation approaches that exploit the coherence of the data are needed.
Two well-known missing-value imputation methods are local least squares (LLS) and Bayesian principal component analysis (BPCA). BPCA estimates missing values in the target gene (the gene that contains missing values) by using a linear combination of principal components whose parameters are estimated with a Bayesian method. LLS estimates the missing values in the target gene by using a linear combination of genes similar to the target gene, and it uses clustering to measure gene similarity. In reality, genes are similar only under certain experimental conditions, so similarity should be measured over the related experimental conditions rather than over all conditions. This is why clustering should be performed on rows and columns simultaneously, which is called biclustering [2]. Biclustering aims to identify local patterns in genes and conditions at the same time. The output of the biclustering technique is a set of biclusters [3]. Using this technique in LLS gives a better estimation of the missing values. Biclustering groups genes and conditions based on a weighted distance and correlation, respectively. Then, a regression model is used for least squares-based missing-value estimation. An iterative framework is applied to improve the selection of coherent genes and correlated conditions. This method is called iterative bicluster-based least squares, or bi-iLS [4].
Bi-iLS uses the row average to fill in all of the missing values in the target gene to obtain a temporary complete matrix. However, the row average is flawed: it cannot reflect the real structure of the dataset because it only uses the information from an individual row. BPCA is considered better than the row average because it reflects the global covariance structure across all genes [5]. In this study, BPCA was used instead of the row average to obtain the temporary complete matrix in bi-iLS. This modification resulted in a new imputation method called bi-BPCA-iLS.
In this paper, we present the framework and implementation of the proposed bi-BPCA-iLS algorithm for missing-value imputation. The proposed method is applied to a microarray dataset of Saccharomyces cerevisiae and an RNA-seq dataset of Schizosaccharomyces pombe.
In theory, every data point has some probability of being missing. The process that governs this probability is called the missing-data mechanism or response mechanism, and models of this process are called missing-data models or response models [6,7,8,9]. Missing values can be categorized into three groups [10]. If the probability of being missing is the same for all data points, the missing values are called missing completely at random (MCAR). If the probability of being missing is the same only within groups defined by the observed data, the missing values are called missing at random (MAR). If the missing values are neither MCAR nor MAR, they are called missing not at random (MNAR) or not missing at random (NMAR). In other words, under MNAR the probability of being missing depends on the unobserved data [11].
Clustering is a technique that groups data points into several groups or clusters. In gene expression data, the purpose of clustering is to group genes into clusters where each cluster consists of genes that are similar to each other and dissimilar to genes from other clusters [12]. Biclustering in gene expression data is the simultaneous clustering of rows and columns [13]. The aim of biclustering is to find groups of similar genes based only on correlated experimental conditions. The output of biclustering is a bicluster. Genes are similar under certain experimental conditions, so biclustering is preferable to clustering. A comparison of biclustering and clustering in two-dimensional gene expression data matrices can be seen in Figure 1 [14]. Figure 1(a) indicates a clustering technique of genes based on all conditions, while Figure 1(b) shows a biclustering technique of genes based only on correlative experimental conditions.
LLS is a missing-value imputation method that identifies the coherent information in gene expression data. There are two steps in the LLS method. The first step is to select k similar genes using the Euclidean distance. The second step is to estimate the missing values [16,17]. This neighbor-based imputation method suits datasets that have a structure with dominant local similarities and high complexity [18].
Let the matrix E be the expression matrix consisting of m genes and n conditions. Assume that the target gene $g_1$ has p missing values in its first p conditions and k similar genes $g_{s_1}, g_{s_2}, \dots, g_{s_k}$ selected by the Euclidean distance. The target gene and its similar genes can then be written as
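$$\begin{pmatrix} g_1 \\ g_{s_1} \\ g_{s_2} \\ \vdots \\ g_{s_k} \end{pmatrix} = \begin{pmatrix} \alpha & w \\ B & A \end{pmatrix} = \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_p & w_1 & w_2 & \cdots & w_{n-p} \\ B_{1,1} & B_{1,2} & \cdots & B_{1,p} & A_{1,1} & A_{1,2} & \cdots & A_{1,n-p} \\ B_{2,1} & B_{2,2} & \cdots & B_{2,p} & A_{2,1} & A_{2,2} & \cdots & A_{2,n-p} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ B_{k,1} & B_{k,2} & \cdots & B_{k,p} & A_{k,1} & A_{k,2} & \cdots & A_{k,n-p} \end{pmatrix},$$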
where α is a 1×p vector consisting of the p missing values, w is a 1×(n−p) vector consisting of the non-missing values in the target gene, and the matrices B and A contain the columns of the k similar genes that correspond to α and w, respectively. The coefficient vector x is defined as the solution of the least squares problem with $A^T$ and $w^T$:
$$\min_x \lVert A^T x - w^T \rVert.$$
The solution of this least squares problem is
$$\hat{x} = (A A^T)^{-1} A w^T = (A^T)^{+} w^T,$$
where $(A^T)^{+}$ is the pseudoinverse of the matrix $A^T$. Hence, the missing values in the target gene $g_1$ can be estimated using
$$\hat{\alpha}^T = B^T \hat{x} = B^T (A^T)^{+} w^T.$$
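The LLS estimate above can be illustrated with a minimal pure-Python sketch. It is not the authors' MATLAB implementation: the function names `solve_linear` and `lls_estimate` and the toy data are hypothetical, and the sketch solves the normal equations directly, which assumes that $A A^T$ is invertible (the pseudoinverse form is more general).

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, row)) + [float(bi)] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def lls_estimate(B, A, w):
    """Estimate the p missing values of the target gene.

    B: k x p values of the similar genes at the missing conditions
    A: k x (n-p) values of the similar genes at the observed conditions
    w: observed values of the target gene (length n-p)
    """
    k = len(A)
    # Normal equations of min ||A^T x - w^T||: (A A^T) x = A w^T
    AAt = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(k)]
           for i in range(k)]
    Aw = [sum(a * wi for a, wi in zip(A[i], w)) for i in range(k)]
    x_hat = solve_linear(AAt, Aw)
    # alpha_hat^T = B^T x_hat
    p = len(B[0])
    return [sum(B[i][j] * x_hat[i] for i in range(k)) for j in range(p)]


# Toy data (assumption): the target gene equals 2*g_s1 + 1*g_s2,
# so its missing first value should be recovered as 2*1 + 1*2 = 4.
B = [[1.0], [2.0]]
A = [[2.0, 3.0, 4.0, 5.0], [1.0, 0.0, 1.0, 2.0]]
w = [5.0, 6.0, 9.0, 12.0]
estimate = lls_estimate(B, A, w)
```

Because the toy target gene is an exact linear combination of its two neighbours, the sketch recovers the hidden value up to floating-point error; real expression data would leave a nonzero residual.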
To choose the proper value of k, LLS uses a heuristic algorithm by applying artificial missing values to genes. These artificial missing values will be estimated using different values of k, then the value of k that produces the lowest estimation error will be chosen as the proper value of k [16].
Bi-iLS extends the LLS imputation method [16] in two respects: the use of biclustering and an iterative framework. Bi-iLS recognizes gene similarities only under certain correlated conditions (biclustering), while LLS takes into account all of the conditions in the data matrix. This makes bi-iLS preferable to LLS for gene expression data [4]. This imputation method suits data that have a dominant local similarity structure [18]. Two parameters need to be defined at the start of the process: k (the number of similar genes) and the threshold T0.
Let the matrix E be the expression matrix consisting of m genes and n conditions. A gene that has p missing values is called the target gene. Assuming that all p missing values are in the first p conditions without a loss of generality, the target gene is defined as
$$g_t^T = \begin{pmatrix} \alpha & w \end{pmatrix},$$
where α is a 1×p vector comprising the p missing values and w is a 1×(n−p) vector consisting of the non-missing values in the target gene. As in LLS, the first step of bi-iLS is to select the k genes most similar to the target gene by using the Euclidean distance. Measuring the Euclidean distance requires a complete matrix, so bi-iLS uses the row average to fill in all of the missing values and obtain a temporary complete matrix. The k selected similar genes are then defined as
$$\begin{pmatrix} g_{s_1}^T \\ \vdots \\ g_{s_k}^T \end{pmatrix} = \begin{pmatrix} B & A \end{pmatrix},$$
where $g_{s_1}^T, \dots, g_{s_k}^T$ denote the k similar genes, while the matrices B and A denote, respectively, the expression values of the selected similar genes for the first p conditions and for the remaining (n−p) conditions. Each condition correlates differently with the other conditions. To account for the correlation, or weight, of each condition when estimating the missing values, the matrix R is defined as
$$R = B^T A.$$
Matrix R, of size p×(n−p), represents the weighted correlations between the conditions where the target gene has missing values and the other conditions. The (j,v)th element of R is denoted by $r_j(v)$. The larger the value of $r_j(v)$, the larger the weight and the stronger the correlation between condition v and the condition of the jth missing value. Using R, the k similar genes are then reselected. The reselection uses the weighted Euclidean distance between the target gene $g_t$ and another gene $g_s$ for the jth missing value:
$$d_j(g_t, g_s) = \frac{\sqrt{\sum_{v=p+1}^{n} r_j(v-p)^2 \, [g_t(v) - g_s(v)]^2}}{\sqrt{\sum_{v=p+1}^{n} r_j(v-p)^2}},$$
where $g(v)$ denotes the vth element of $g_t$ or $g_s$. Then, when estimating the jth missing value of the target gene, uncorrelated conditions are removed from the least squares framework. Let
$$r_{j,\max} = \max_{v \in \{1, \dots, n-p\}} |r_j(v)|;$$
then the conditions are said to be related if
$$|r_j(v)| \ge T_0 \cdot r_{j,\max},$$
where T0 is a pre-defined threshold parameter, chosen with the same heuristic algorithm used to find the proper value of k. Removing the uncorrelated conditions redefines the matrices A and B and the vector w. Hence, we have
$$g_t^T = \begin{pmatrix} \alpha_j & w_j \end{pmatrix},$$
where $\alpha_j$ denotes the jth missing value and $w_j$ denotes the non-missing values of the correlated conditions. Also, we have
$$\begin{pmatrix} g_{s_1}^T \\ \vdots \\ g_{s_k}^T \end{pmatrix} = \begin{pmatrix} B_j & A_j \end{pmatrix},$$
where $B_j$ represents the jth column of B and $A_j$ denotes a matrix consisting of the correlated columns of the k similar genes. Similar to LLS, a regression model $\alpha_j = B_j^T x_j$ is needed to estimate the jth missing value, where $x_j$ contains the regression coefficients for the k similar genes. $x_j$ can be obtained by minimizing the least squares error, as follows:
$$\min_{x_j} \lVert A_j^T x_j - w_j^T \rVert.$$
Thus, the jth missing value in the target gene can be estimated by using
$$\hat{\alpha}_j = B_j^T \hat{x}_j = B_j^T (A_j^T)^{+} w_j^T,$$
where $(A_j^T)^{+}$ is the pseudoinverse of $A_j^T$.
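The condition-weighting steps described above (row j of R, the T0 threshold, and the weighted Euclidean distance) can be sketched in pure Python as follows. This is a hedged illustration, not the published implementation: the function names, the toy matrices and the threshold value 0.5 are assumptions, and the distance is computed over the n−p observed conditions using the same weight index in numerator and denominator.

```python
import math


def correlation_weights(B, A, j):
    """Row j of R = B^T A: one weight per observed condition."""
    k = len(A)
    n_obs = len(A[0])
    return [sum(B[s][j] * A[s][v] for s in range(k)) for v in range(n_obs)]


def related_conditions(r_j, t0):
    """Indices v with |r_j(v)| >= T0 * max_v |r_j(v)|."""
    r_max = max(abs(r) for r in r_j)
    return [v for v, r in enumerate(r_j) if abs(r) >= t0 * r_max]


def weighted_distance(w_t, w_s, r_j):
    """Weighted Euclidean distance between two genes over the observed
    conditions, with weights r_j (assumes at least one nonzero weight)."""
    num = math.sqrt(sum((r * (a - b)) ** 2 for r, a, b in zip(r_j, w_t, w_s)))
    den = math.sqrt(sum(r * r for r in r_j))
    return num / den


# Toy data (assumption): k = 2 similar genes, p = 1 missing condition,
# n - p = 4 observed conditions.
B = [[1.0], [2.0]]
A = [[2.0, 3.0, 4.0, 5.0], [1.0, 0.0, 1.0, 2.0]]
r_0 = correlation_weights(B, A, 0)   # [4.0, 3.0, 6.0, 9.0]
kept = related_conditions(r_0, 0.5)  # [2, 3]
# Restrict A to the related conditions to obtain A_j.
A_j = [[row[v] for v in kept] for row in A]
```

With the toy data only the last two observed conditions pass the threshold, so the least squares step for this missing value would use two columns instead of four.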
An iterative framework is applied to improve the selection of similar genes. A complete matrix output from the ith iteration will be the temporary complete matrix in the (i+1)th iteration. This iteration process will be repeated until it reaches the maximum iteration or a specific criterion. The complete framework of bi-iLS can be seen in Figure 2.
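The iterative framework can be sketched generically as a loop that feeds each complete matrix back in as the next temporary matrix. The sketch below is only an outline under stated assumptions: `refine` stands in for one full bi-iLS pass and is a hypothetical callback, `toy_refine` is a placeholder used purely for demonstration, and the stopping rule (largest change in any imputed entry below `tol`) is an assumed example of the "specific criterion" mentioned above.

```python
def iterate_imputation(initial, missing, refine, max_iter=5, tol=1e-6):
    """Repeat an imputation pass until the imputed entries stop changing.

    initial: temporary complete matrix (list of rows)
    missing: set of (row, col) positions that were originally missing
    refine:  callback returning a new complete matrix (hypothetical)
    max_iter must be at least 1.
    """
    current = [row[:] for row in initial]
    for iteration in range(1, max_iter + 1):
        updated = refine(current, missing)
        change = max(
            (abs(updated[i][j] - current[i][j]) for i, j in missing),
            default=0.0,
        )
        current = updated
        if change < tol:
            break
    return current, iteration


def toy_refine(matrix, missing):
    """Placeholder pass: replace each missing entry by the mean of the
    other entries in its row. A real pass would run the bi-iLS steps."""
    out = [row[:] for row in matrix]
    for i, j in missing:
        others = [v for c, v in enumerate(matrix[i]) if c != j]
        out[i][j] = sum(others) / len(others)
    return out


result, n_iter = iterate_imputation([[1.0, 0.0, 3.0]], {(0, 1)}, toy_refine)
```

Passing the imputation step as a callback keeps the loop independent of how the temporary complete matrix was first obtained (row average or BPCA).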
There are three main steps of the BPCA based imputation method: principal component (PC) regression, Bayesian estimation and application of an expectation-maximization (EM) repetitive algorithm [19]. In PC regression, Principal Component Analysis (PCA) represents the D-dimensional vector y as a linear combination of K principal axis vectors wl (1≤l≤K and K<D), as follows:
$$y = \sum_{l=1}^{K} x_l w_l + \epsilon,$$
where D is the number of columns in the data, $x_l$ is a factor score and $\epsilon$ is the residual error. Assuming that there are no missing values, PCA finds $w_l = \sqrt{\lambda_l}\, u_l$, where $\lambda_l$ and $u_l$ denote, respectively, the eigenvalues and eigenvectors of the covariance matrix of y. If missing values are present, the principal axis vectors are split into two parts, $W = (W^{obs}, W^{miss})$, where $W^{obs}$ and $W^{miss}$ denote the matrices with column vectors $w_1^{obs}, \dots, w_K^{obs}$ and $w_1^{miss}, \dots, w_K^{miss}$, respectively. The factor scores $x = (x_1, \dots, x_K)$ are obtained by minimizing the residual error of the observed part:
$$\lVert y^{obs} - W^{obs} x \rVert^2.$$
This is a simple least squares problem that can be solved easily. Hence, the missing part of y can be estimated as
$$y^{miss} = W^{miss} x.$$
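For reference, the least squares step above has a standard closed-form solution via the normal equations. The expression below is a worked restatement rather than a formula taken from the source, and it assumes that $W^{obs}$ has full column rank so that the inverse exists.

```latex
\hat{x} = \left( (W^{obs})^{T} W^{obs} \right)^{-1} (W^{obs})^{T} y^{obs},
\qquad
\hat{y}^{miss} = W^{miss} \hat{x}.
```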
However, the parameters of this model are still unknown. BPCA uses a probabilistic PCA model under the assumption that the residual error $\epsilon$ and the factor scores $x_l$ (1≤l≤K) obey normal distributions. The parameters W, μ and τ form the parameter set θ≡{W,μ,τ}. BPCA estimates these parameters with Bayesian estimation, which is used here because it can determine the appropriate dimension of the latent space. The estimation is carried out by applying the EM algorithm until convergence is reached. This imputation method is appropriate for data with lower-complexity structures [20].
The proposed bi-BPCA-iLS algorithm modifies the bi-iLS algorithm only in the step that obtains the temporary complete matrix; in all other steps, bi-BPCA-iLS and bi-iLS are the same. In bi-iLS, the row average is used to fill in all of the missing values for the target genes to obtain a temporary complete matrix. However, the use of the row average to fill in the missing values is considered unsatisfactory. Row averages cannot reflect the structure of the data because they only use the information of a single row or gene [21]. The row average is also ineffective when the target gene contains an outlier. Hence, BPCA is expected to give a better temporary complete matrix than the row average, because BPCA reflects the global covariance structure of all genes [5]. The main idea behind the proposed bi-BPCA-iLS method is to use BPCA instead of the row average to obtain the temporary complete matrix in the bi-iLS framework. As mentioned before, bi-iLS suits data that have a dominant local similarity structure and high complexity, while BPCA suits data with a structure of lower complexity. Combining BPCA with bi-iLS is intended to make bi-BPCA-iLS more robust for data with a lower-complexity structure. The complete framework of bi-BPCA-iLS can be seen in Figure 3. The differences among the LLS, bi-iLS and bi-BPCA-iLS methods are summarized in Table 1.
Property | LLS | Bi-iLS | Bi-BPCA-iLS |
Gene similarity | Clustering | Biclustering | Biclustering |
Temporary complete matrix | Row-average | Row-average | BPCA |
Parameters | k | k and T0 | k and T0 |
Process of iteration | No | Yes | Yes |
Authors | Kim et al.[16] | Cheng et al.[4] | Newly Proposed |
Table 1 shows the differences between the three least squares-based imputation algorithms. LLS uses clustering to measure gene similarity, while bi-iLS and bi-BPCA-iLS use biclustering, which, as mentioned before, is considered to have higher efficacy. The row average is used in LLS and bi-iLS to obtain the temporary complete matrix, while bi-BPCA-iLS uses BPCA. Only bi-iLS and bi-BPCA-iLS iterate the imputation process. Our proposed imputation algorithm is the newest among these.
The proposed method was implemented and evaluated on two gene expression datasets: a microarray dataset and an RNA-seq dataset [22]. Bi-iLS has been shown to perform well on the Spellman 1998 microarray dataset for Saccharomyces cerevisiae [4], so bi-BPCA-iLS was also applied to this dataset to allow a performance comparison. Both bi-BPCA-iLS and bi-iLS were also applied to an RNA-seq dataset to analyze their performance on a different type of gene expression data.
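The row-average step used by LLS and bi-iLS to build the temporary complete matrix can be sketched as below. The function name `row_average_fill`, the use of `None` to mark missing entries, and the fallback of 0.0 for a row with no observed values are assumptions made for this illustration.

```python
def row_average_fill(matrix):
    """Fill each missing entry (None) with the mean of the observed
    values in the same row. Rows with no observed value fall back to
    0.0, which is an arbitrary choice for this sketch."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


# A single outlier (9.0) pulls the filled value far from the typical
# level of the row, which is the weakness discussed in the text.
with_outlier = row_average_fill([[1.0, 1.2, None, 9.0]])
plain = row_average_fill([[1.0, None, 3.0]])
```

Because the fill value depends only on the gene's own row, it ignores every other gene in the matrix; BPCA replaces this step in bi-BPCA-iLS.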
The microarray dataset is a cell cycle expression dataset for the yeast Saccharomyces cerevisiae, synchronized using a CDC15 temperature-sensitive mutant [23]. According to Spellman et al., the samples of mRNA were taken every 10 minutes for 300 minutes. However, several time points are missing from the published data: samples are available every 20 minutes from 10 min to 70 min, every 10 minutes from 70 min to 250 min and every 20 minutes from 250 min to 290 min. Therefore, the CDC15 dataset contains the expression levels of 6178 genes at 24 different time points, which gives a matrix size of 6178 × 24. An example of the CDC15 dataset is shown in Table 2.
Gene | 10 min | 30 min | 50 min | 70 min | … | 290 min |
Gene 1 | −0.16 | 0.09 | −0.23 | 0.03 | … | −0.26 |
Gene 2 | NaN | NaN | NaN | −0.58 | … | NaN |
Gene 3 | −0.37 | −0.22 | −0.16 | 0.04 | … | −0.41 |
Gene 4 | NaN | NaN | NaN | −1.5 | … | NaN |
Gene 5 | −0.43 | −1.33 | −1.53 | −1.53 | … | 1.18 |
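The count of 24 time points follows from the sampling schedule described for the CDC15 dataset. A quick check, written as a small sketch with an assumed variable name `time_points`:

```python
# Sampling schedule in minutes, as described in the text.
time_points = (
    list(range(10, 70, 20))       # every 20 min: 10, 30, 50
    + list(range(70, 250, 10))    # every 10 min: 70, 80, ..., 240
    + list(range(250, 291, 20))   # every 20 min: 250, 270, 290
)
n_time_points = len(time_points)
```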
The CDC15 dataset had missing values, so genes that contained missing values were removed to obtain the ground truth. The ground truth was used to calculate the estimation error, measured as the normalized root mean square error (NRMSE), of each imputation method. After removing genes that contained missing values, the size of the matrix became 4381 × 24. In the experiments on this dataset, r% of the observed values were randomly set to be missing, where r = 1, 5, 10, 15, 20, 25 and 30. The estimation was repeated five times for each missing rate to generate the average result.
The RNA-seq dataset was gene expression data from Schizosaccharomyces pombe (GSE150544) [24]. RNA sequencing was used to identify the differences between the gene expression levels of four different INO80 mutant strains, each with two replicates; this resulted in eight samples for each gene. The four strains were wt (control), Nht1, Iec1 and Iec5. The length of each gene, which indicates how many nucleotides are in that gene, was also included. In this experiment, only coding genes were observed. This dataset contained the expression of 5137 genes under nine different conditions, i.e., the length, wt_rep1, wt_rep2, nht1_rep1, nht1_rep2, Iec1_rep1, Iec1_rep2, Iec5_rep1 and Iec5_rep2, resulting in a matrix size of 5137 × 9. The data were not normalized, so that the values reflect the real expression of each gene and remain non-negative. An example of the GSE150544 dataset is shown in Table 3.
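The evaluation protocol (hide a fraction of known values at random, impute, then score) can be sketched as follows. The source does not spell out its NRMSE normalization, so the definition used here, root mean square error over the hidden entries divided by the standard deviation of their true values, is an assumption; the function names and the fixed seed are also illustrative.

```python
import math
import random


def mcar_mask(n_rows, n_cols, rate, seed=0):
    """Pick round(rate * n_rows * n_cols) cells uniformly at random."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    return set(rng.sample(cells, round(rate * len(cells))))


def nrmse(truth, estimate, missing):
    """RMSE over the hidden cells divided by the (population) standard
    deviation of the true values at those cells. Undefined when all
    hidden true values are identical."""
    positions = sorted(missing)
    true_vals = [truth[i][j] for i, j in positions]
    est_vals = [estimate[i][j] for i, j in positions]
    n = len(true_vals)
    mse = sum((e - t) ** 2 for e, t in zip(est_vals, true_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse) / math.sqrt(var_true)


mask = mcar_mask(10, 24, 0.05)  # 5% of a 10 x 24 toy matrix

truth = [[1.0, 2.0], [3.0, 4.0]]
hidden = {(0, 0), (1, 1)}
# Guessing the mean of the hidden values (2.5) gives NRMSE = 1.
mean_guess = [[2.5, 2.0], [3.0, 2.5]]
score = nrmse(truth, mean_guess, hidden)
```

Under this definition a score of 0 means perfect recovery and a score of 1 is what a constant mean guess achieves, which gives a rough scale for reading the NRMSE tables.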
Gene | Length | wt_rep1 | wt_rep2 | nht1_rep1 | nht1_rep2 | Iec1_rep1 | Iec1_rep2 | Iec5_rep1 | Iec5_rep2 |
Gene 1 | 669 | 18 | 16 | 8 | 2 | 4 | 15 | 17 | 19 |
Gene 2 | 993 | 46 | 50 | 45 | 25 | 33 | 34 | 25 | 29 |
Gene 3 | 3227 | 1623 | 1474 | 1655 | 1268 | 994 | 1870 | 1476 | 1849 |
Gene 4 | 868 | 258 | 322 | 215 | 200 | 138 | 284 | 278 | 286 |
Gene 5 | 2250 | 87 | 79 | 119 | 121 | 87 | 209 | 88 | 102 |
… | … | … | … | … | … | … | … | … | … |
Gene 5137 | 546 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
A value of zero indicates that a gene was not detected because it was not expressed, or was only minimally expressed; therefore, a value of zero is not a missing value. Then, r% of the observed values were randomly set to be missing, where r = 1, 5, 10, 15, 20, 25 and 30. The estimation was repeated five times for each missing rate to generate the average result.
Our proposed imputation method was implemented in MATLAB. The parameters k and T0 were estimated automatically using the integrated function in our algorithm. The estimation process was iterated five times for each test, and five tests were carried out for each missing rate so that the results could be averaged. The imputation results of applying our proposed method to the microarray dataset can be seen in Table 4 and Figure 4, where mr denotes the missing rate in Tables 4 and 5.
NRMSE of bi-BPCA-iLS (CDC15) | mr 1% | mr 5% | mr 10% | mr 15% | mr 20% | mr 25% | mr 30% |
Test 1 | 0.1851 | 0.3934 | 0.4766 | 0.5485 | 0.5681 | 0.6102 | 0.6369 |
Test 2 | 0.1685 | 0.4610 | 0.5101 | 0.5634 | 0.5717 | 0.6093 | 0.6269 |
Test 3 | 0.2065 | 0.3882 | 0.4888 | 0.5359 | 0.5887 | 0.6044 | 0.6206 |
Test 4 | 0.2170 | 0.3741 | 0.4892 | 0.5476 | 0.5839 | 0.6034 | 0.6300 |
Test 5 | 0.2371 | 0.3934 | 0.4811 | 0.5458 | 0.5801 | 0.6000 | 0.6203 |
Average | 0.20284 | 0.40202 | 0.48916 | 0.54824 | 0.5785 | 0.60546 | 0.62694 |
NRMSE of bi-BPCA-iLS (GSE150544) | mr 1% | mr 5% | mr 10% | mr 15% | mr 20% | mr 25% | mr 30% |
Test 1 | 0.3020 | 0.2593 | 0.2226 | 0.2351 | 0.2470 | 0.2595 | 0.2612 |
Test 2 | 0.1317 | 0.1737 | 0.2684 | 0.2527 | 0.2518 | 0.2515 | 0.2493 |
Test 3 | 0.3011 | 0.2814 | 0.2577 | 0.2378 | 0.2364 | 0.2234 | 0.2641 |
Test 4 | 0.3976 | 0.2838 | 0.2273 | 0.2791 | 0.2872 | 0.2357 | 0.2457 |
Test 5 | 0.4162 | 0.1917 | 0.2571 | 0.2595 | 0.2589 | 0.2485 | 0.3046 |
Average | 0.30972 | 0.23798 | 0.24662 | 0.25284 | 0.25626 | 0.24372 | 0.26498 |
The imputation results of applying our proposed method to the RNA-seq dataset can be seen in Table 5 and Figure 5.
Based on Table 5 and Figure 5, the average value of the NRMSE was 0.30972 for a missing rate of 1%, 0.23798 for a missing rate of 5% and 0.24662 for a missing rate of 10%. The lowest estimation error was achieved when the missing rate was 5%; it was highest when the missing rate was 1%. The average NRMSE was below 0.3 at every missing rate except 1%, indicating that our proposed imputation method, bi-BPCA-iLS, performed well on the GSE150544 dataset.
Two existing methods, LLS and bi-iLS, were compared to our proposed imputation method. The comparison used the average NRMSE and the computational time obtained from five trials at every missing rate. The improvement of Method B relative to Method A is the difference between the NRMSE values of Method A and Method B divided by the NRMSE value of Method A. If the improvement value is positive, then Method B has a higher imputation accuracy than Method A. If the improvement value is negative, then Method B has a lower imputation accuracy than Method A.
The averages of the improvement values across all missing rates for the CDC15 dataset are shown in Table 6 below. Based on these values, the bi-iLS algorithm showed a significant overall improvement in NRMSE (10.07%) relative to the LLS algorithm. Our proposed method, bi-BPCA-iLS, showed an overall improvement in NRMSE of 10.612% relative to LLS and a smaller improvement of 0.582% relative to bi-iLS.
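The improvement measure defined above can be written as a one-line function. The sketch below uses a hypothetical function name `improvement` and checks it against the average NRMSE values reported for the CDC15 dataset (LLS versus bi-BPCA-iLS).

```python
def improvement(nrmse_a, nrmse_b):
    """Improvement of Method B relative to Method A, in percent.
    Positive means B has the lower NRMSE."""
    return (nrmse_a - nrmse_b) / nrmse_a * 100.0


# Average NRMSE on CDC15 at missing rates 1, 5, 10, 15, 20, 25, 30 %.
lls = [0.20938, 0.51722, 0.57694, 0.61448, 0.63408, 0.65518, 0.6708]
bi_bpca_ils = [0.20284, 0.40202, 0.48916, 0.54824, 0.5785, 0.60546, 0.62694]

per_rate = [improvement(a, b) for a, b in zip(lls, bi_bpca_ils)]
average_improvement = sum(per_rate) / len(per_rate)
```

The per-rate values start at about 3.12% for a missing rate of 1%, and their mean is about 10.61%, matching the reported average improvement of bi-BPCA-iLS relative to LLS.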
Average value of NRMSE | NRMSE from LLS | NRMSE from Bi-iLS | NRMSE from Bi-BPCA-iLS | Improvement of bi-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to bi-iLS |
Missing rate 1% | 0.20938 | 0.20792 | 0.20284 | 0.697296781% | 3.123507498% | 2.443247403% |
Missing rate 5% | 0.51722 | 0.40444 | 0.40202 | 21.80503461% | 22.27292061% | 0.598358224% |
Missing rate 10% | 0.57694 | 0.49252 | 0.48916 | 14.63237078% | 15.2147537% | 0.682205799% |
Missing rate 15% | 0.61448 | 0.5499 | 0.54824 | 10.50969926% | 10.77984637% | 0.301873068% |
Missing rate 20% | 0.63408 | 0.5799 | 0.5785 | 8.544663134% | 8.765455463% | 0.241420935% |
Missing rate 25% | 0.65518 | 0.60512 | 0.60546 | 7.640648371% | 7.588754235% | -0.05618720% |
Missing rate 30% | 0.6708 | 0.62610 | 0.62694 | 6.663685152% | 6.538461538% | -0.13416387% |
Average improvement | | | | 10.070% | 10.612% | 0.582% |
As shown in Table 6 and Figure 6, our proposed method, bi-BPCA-iLS, produced the lowest NRMSE on average across all missing rates for the CDC15 dataset, although bi-iLS was marginally better at missing rates of 25% and 30%.
After comparing the values of NRMSE, the computational times of the imputation methods were also compared in MATLAB. Based on Table 7, bi-iLS added an overall average of 320.134 seconds of computational time compared to LLS. Bi-BPCA-iLS added an overall average of 343.850 seconds of computational time relative to LLS and only 23.716 seconds relative to bi-iLS.
Average computational time | Computational time of LLS | Computational time of Bi-iLS | Computational time of Bi-BPCA-iLS | Additional time of bi-iLS relative to LLS | Additional time of bi-BPCA-iLS relative to LLS | Additional time of bi-BPCA-iLS relative to bi-iLS |
Missing rate 1% | 60.402 | 120.439 | 126.396 | 60.037 | 65.994 | 5.957 |
Missing rate 5% | 76.419 | 255.455 | 275.087 | 179.036 | 198.668 | 19.632 |
Missing rate 10% | 71.727 | 388.521 | 455.319 | 316.794 | 383.592 | 66.798 |
Missing rate 15% | 59.123 | 378.844 | 444.281 | 319.721 | 385.158 | 65.437 |
Missing rate 20% | 52.033 | 472.843 | 434.815 | 420.81 | 382.782 | -38.028 |
Missing rate 25% | 45.539 | 457.156 | 467.654 | 411.617 | 422.115 | 10.498 |
Missing rate 30% | 61.047 | 593.972 | 629.689 | 532.925 | 568.642 | 35.717 |
Average additional computational time in seconds | | | | 320.134 | 343.850 | 23.716 |
As shown in Figure 7, LLS displayed a roughly constant computational time at every missing rate, while the computational time of bi-iLS and bi-BPCA-iLS grew as the missing rate increased. LLS was the fastest imputation method, but it also produced the highest NRMSE of the three methods. Between bi-iLS and bi-BPCA-iLS, there was no significant difference in computational time. If the goal is to achieve a lower NRMSE, then one can use bi-BPCA-iLS instead of bi-iLS.
The average improvement values across all missing rates for the RNA-seq dataset (GSE150544) are shown in Table 8. We can see that the bi-iLS algorithm showed an overall improvement in NRMSE value of 5.12% relative to the LLS algorithm. Our proposed method, bi-BPCA-iLS, had an overall improvement in NRMSE value of 8.20% relative to LLS and 3.09% relative to bi-iLS.
Average value of NRMSE | NRMSE from LLS | NRMSE from Bi-iLS | NRMSE from Bi-BPCA-iLS | Improvement of bi-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to bi-iLS |
Missing rate 1% | 0.29318 | 0.32288 | 0.30972 | -10.1303% | -5.64159% | 4.075818% |
Missing rate 5% | 0.23604 | 0.25204 | 0.23798 | -6.77851% | -0.82189% | 5.57848% |
Missing rate 10% | 0.26032 | 0.25436 | 0.24662 | 2.28949% | 5.262754% | 3.042931% |
Missing rate 15% | 0.27980 | 0.26908 | 0.25284 | 3.831308% | 9.635454% | 6.03538% |
Missing rate 20% | 0.28308 | 0.25652 | 0.25626 | 9.382507% | 9.474354% | 0.101357% |
Missing rate 25% | 0.29156 | 0.24558 | 0.24372 | 15.77034% | 16.40829% | 0.757391% |
Missing rate 30% | 0.34442 | 0.27056 | 0.26498 | 21.44475% | 23.06486% | 2.062389% |
Average improvement | | | | 5.12% | 8.20% | 3.09% |
The performances of LLS, bi-iLS and bi-BPCA-iLS on the GSE150544 data can be seen in Table 8. Bi-BPCA-iLS and bi-iLS showed negative improvements relative to LLS when the missing rate was 1% and 5%, so LLS performed best at missing rates of 5% or lower in this dataset. When the missing rate was above 5%, however, bi-BPCA-iLS outperformed the other methods. As shown in Figure 8, the average NRMSE of bi-BPCA-iLS tended to be lower than those of the other methods.
After comparing the values of NRMSE, the computational times of the imputation methods were compared in MATLAB. Based on Table 9 and Figure 9, bi-iLS added an overall average of 117.200 seconds of computational time relative to LLS, while bi-BPCA-iLS added an overall average of 126.549 seconds relative to LLS. The difference in computational time between bi-BPCA-iLS and bi-iLS was not significant, at only 9.349 seconds.
Average computational time | Computational time of LLS | Computational time of Bi-iLS | Computational time of Bi-BPCA-iLS | Additional time of bi-iLS relative to LLS | Additional time of bi-BPCA-iLS relative to LLS | Additional time of bi-BPCA-iLS relative to bi-iLS |
Missingrate 1% | 28.418 | 79.678 | 82.420 | 51.26 | 54.002 | 2.742 |
Missingrate 5% | 22.080 | 107.653 | 109.090 | 85.573 | 87.01 | 1.437 |
Missingrate 10% | 18.678 | 130.136 | 142.0635 | 111.458 | 123.3855 | 11.9275 |
Missingrate 15% | 25.831 | 155.458 | 158.179 | 129.627 | 132.348 | 2.721 |
Missingrate 20% | 22.429 | 160.494 | 165.366 | 138.065 | 142.937 | 4.872 |
Missingrate 25% | 20.513 | 152.178 | 189.796 | 131.665 | 169.283 | 37.618 |
Missingrate 30% | 25.018 | 197.769 | 201.895 | 172.751 | 176.877 | 4.126 |
Average additional computational time in seconds | 117.200 | 126.549 | 9.349 |
Early approaches toward missing-value imputation tended to consider all experimental conditions in measuring gene similarity. However, genes are only similar under certain experimental conditions. This meant that an bi-iLS algorithm for imputing missing values has to be developed. This algorithm uses the row average to obtain a temporary complete matrix, which has become to be considered as a flawed approach. The row average cannot reflect the real structure of the dataset because it only leverages the information of an individual row. Thus, in this study, we used BPCA to obtain a temporary complete matrix instead of using row average. The proposed algorithm is called bi-BPCA-iLS. After finding the temporary complete matrix using BPCA, the required parameters can be found. Our proposed algorithm performs clustering on genes and conditions alternately to find biclusters that consist of a subset of genes that are similar under a subset of conditions. After the biclusters related to the target genes are found, least squares estimation of the missing values can be performed while considering only related genes and conditions. This estimation process can be iterated to improve the selection of similar genes and conditions in every iteration, which improves the accuracy of the missing-value imputation.
Experiments were conducted on two gene expression datasets: a microarray dataset for Saccharomyces cerevisiae (CDC15) and an RNA-seq dataset for Schizosaccharomyces pombe (GSE150544). Averaged over missing rates from 1% to 30%, the proposed method achieved a lower NRMSE than the earlier imputation methods LLS and bi-iLS on both the microarray and the RNA-seq dataset. Relative to LLS, bi-BPCA-iLS improved the average NRMSE by 10.612% for CDC15 and 8.20% for GSE150544, indicating the importance of biclustering and an iterative framework. Relative to bi-iLS, bi-BPCA-iLS improved the average NRMSE by 0.582% for CDC15 and 3.09% for GSE150544, indicating that the temporary complete matrix is better obtained with BPCA than with the row average. The average additional computational time of bi-BPCA-iLS compared to bi-iLS was only 23.716 seconds for CDC15 and 9.349 seconds for GSE150544, which we consider negligible. These experimental results show that, on average, the proposed method outperforms the two existing methods, and we expect it to be applicable to other datasets that fit our assumptions.
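The average improvements quoted above can be reproduced from the per-missing-rate NRMSE averages reported for CDC15. The sketch below assumes that improvement is the percentage reduction in NRMSE relative to the baseline method, which is consistent with the tabulated percentages; the function names are illustrative only.

```python
def relative_improvement(baseline_nrmse, candidate_nrmse):
    """Percentage reduction in NRMSE of the candidate relative to the baseline."""
    return 100.0 * (baseline_nrmse - candidate_nrmse) / baseline_nrmse

def average_improvement(baseline, candidate):
    """Mean of the per-missing-rate relative improvements."""
    gains = [relative_improvement(b, c) for b, c in zip(baseline, candidate)]
    return sum(gains) / len(gains)

# Average NRMSE on CDC15 at missing rates 1%, 5%, 10%, 15%, 20%, 25%, 30%.
lls = [0.20938, 0.51722, 0.57694, 0.61448, 0.63408, 0.65518, 0.6708]
bi_ils = [0.20792, 0.40444, 0.49252, 0.5499, 0.5799, 0.60512, 0.62610]
bi_bpca_ils = [0.20284, 0.40202, 0.48916, 0.54824, 0.5785, 0.60546, 0.62694]

vs_lls = average_improvement(lls, bi_bpca_ils)        # about 10.612
vs_bi_ils = average_improvement(bi_ils, bi_bpca_ils)  # about 0.582
print(f"{vs_lls:.3f}% relative to LLS, {vs_bi_ils:.3f}% relative to bi-iLS")
```

Averaging the per-rate percentages, rather than computing one percentage from the mean NRMSE values, is what matches the reported 10.612% and 0.582%.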
In terms of NRMSE, the missing-value imputation method bi-BPCA-iLS outperformed LLS and bi-iLS on the selected microarray and RNA-seq datasets. The improvement relative to LLS indicates the importance of biclustering and an iterative framework in the imputation, while the improvement relative to bi-iLS indicates that the temporary complete matrix is better obtained with BPCA than with the row average.
Universitas Indonesia funded this research with grant number NKB-030/UN2.F3/HKP.05.00/2021.
The authors declare that there is no conflict of interest.
[1] |
Chapin FS III, Zavaleta ES, Eviner VT, et al. (2000) Consequences of changing biodiversity. Nature 405: 234-242. doi: 10.1038/35012241
![]() |
[2] |
Anon (2016) Rise of the City. Science 352: 906-907. doi: 10.1126/science.352.6288.906
![]() |
[3] | WHO, World Health Organization, 2017. Available from: http://www.who.int/gho/urban_health/situation_trends/urban_population_growth/en/. |
[4] | World Bank, 2017. Available from: http://data.worldbank.org/indicator/SP.URB.TOTL.IN.ZS. |
[5] |
Lee SY, Dunn RJK, Young RA, et al. (2006) Impact of urbanization on coastal wetland structure and function. Austral Ecol 31: 149-163. doi: 10.1111/j.1442-9993.2006.01581.x
![]() |
[6] | Russi D, ten Brink P, Farmer A, et al. (2013) The Economics of Ecosystems and Biodiversity for Water and Wetlands. IEEP, London and Brussels, 2013. |
[7] |
Hettiarachchi M, Morrison TH, McAlpine C (2015) Forty-three years of Ramsar and urban wetlands. Global Environ Chang 32: 57-66. doi: 10.1016/j.gloenvcha.2015.02.009
![]() |
[8] | USEPA (2016) National Wetland Condition Assessment 2011: A Collaborative Survey of the Nation's Wetlands. U.S. Environmental Protection Agency, Office of Wetlands, Oceans and Watersheds, Office of Research and Development, Washington, DC 20460. EPA-843-R-15-005. Available from: https://www.epa.gov/national-aquatic-resource-surveys/nwca. |
[9] | Dahl TE (2011) Status and trends of wetlands in the coterminous United States 2004–2009. U.S. Department of the Interior; Fish and Wildlife Service. Washington, D.C. 108 pp. |
[10] | Brady SJ, Flather CJ (1994) Changes in wetlands on nonfederal rural land of the conterminous United States from 1982 to 1987. Environ Manage 18: 693-705. |
[11] | Kentula ME, Gwin SE, Pierson SM (2004) Tracking changes in wetlands with urbanization: sixteen years of experience in Portland, Oregon, USA. Wetlands 24: 734-743. |
[12] | Gurtzwiller KJ, Flather CH (2011) Wetland features and landscape context predict the risk of wetland habitat loss. Ecol Appl 21: 968-982. |
[13] | Sucik MT, Marks E (2015) The Status and Recent Trends of Wetlands in the United States. U.S. Department of Agriculture. 2015. Summary Report: 2012 National Resources Inventory, Natural Resources Conservation Service, Washington, DC, and Center for Survey Statistics and Methodology, Iowa State University, Ames, Iowa. Available from: http://www.nrcs.usda.gov/technical/nri/12summary. |
[14] | Mitsch WJ, Gosselink JG (2015) Wetlands, 5th Edition. John Wiley & Sons. Hoboken, NJ. |
[15] |
Bolund P, Hunhammar S (1999) Ecosystem services in urban areas. Ecol Econ 29: 293-301. doi: 10.1016/S0921-8009(99)00013-0
![]() |
[16] | Mitsch WJ, Gosselink JG (2000) The value of wetlands: importance of scale and landscape setting. Ecol Econ 35: 25-33. |
[17] |
Baldwin AH (2004) Restoring complex vegetation in urban settings: The case of tidal freshwater marshes. Urban Ecosys 7: 125-137. doi: 10.1023/B:UECO.0000036265.86125.34
![]() |
[18] | Kusler J (2004) Multi-objective wetland restoration in watershed contexts. Association of State Wetland Managers, Inc. Berne, NY. |
[19] |
Bengston DN, Fletcher JO, Nelson KC (2004) Public policies for managing urban growth and protecting open space: policy instruments and lessons learned in the United States. Landscape Urban Plan 69: 271-286. doi: 10.1016/j.landurbplan.2003.08.007
![]() |
[20] |
Gautam M, Achharya K, Shanahan SA (2014) Ongoing restoration and management of Las Vegas Wash: an evaluation of success criteria. Water Policy 16: 720-738. doi: 10.2166/wp.2014.035
![]() |
[21] | Boyer T, Polasky S (2004) Valuing urban wetlands: A review of non-market valuation studies. Wetlands 24: 744-755. |
[22] |
McKenney BA, Kiesecker (2010) Policy development for biodiversity offsets: A review of offset frameworks. Environ Manage 45: 165-176. doi: 10.1007/s00267-009-9396-3
![]() |
[23] | Spieles DJ (2005) Vegetation development in created, restored, and enhanced mitigation wetland banks of the United States. Wetlands 25: 51-63. |
[24] | Matthews JW, Endress AG (2008) Performance criteria, compliance success, and vegetation development in compensatory mitigation wetlands. Environ Manage 41: 130-141. |
[25] |
Booth DB, Hartley D, Jackson R (2002) Forest cover, impervious-surface area, and the mitigation of stormwater impacts. J Am Water Resour As 38: 835-8453. doi: 10.1111/j.1752-1688.2002.tb01000.x
![]() |
[26] |
Moreno-Mateos D, Power ME, Comin FA, et al. (2012) Structural ad functional loss in restored wetland ecosystems. PLoS Biology 10: e1001247. doi: 10.1371/journal.pbio.1001247
![]() |
[27] |
BenDor T, Brozovic N, Pallathucheril VG (2008) The social impacts of wetland mitigation policies in the United States. J Plan Literature 22: 341-357. doi: 10.1177/0885412207314011
![]() |
[28] | USEPA, Principles for the Ecological Restoration of Aquatic Resources. EPA841-F-00-003. Office of Water (4501F), United States Environmental Protection Agency, Washington, DC. 4 pp. 2000. Available from: https://www.epa.gov/wetlands/principles-wetland-restoration. |
[29] |
Ehrenfeld JG (2000) Evaluating wetlands within an urban context. Ecol Engi 15: 253-265. doi: 10.1016/S0925-8574(00)00080-X
![]() |
[30] |
Kentula ME (2000) Perspectives on setting success criteria for wetland restoration. Ecol Eng 15: 199-209. doi: 10.1016/S0925-8574(00)00076-8
![]() |
[31] | Faber-Langendoen D, Kudray G, Nordman C, et al. (2008) Ecological Performance Standards for Wetland Mitigation: An Approach Based on Ecological Integrity Assessments. NatureServe, Arlington, Virginia, 2008. |
[32] |
Euliss NH, Smith LM, Wilcox DA, et al. (2008) Linking ecosystem processes with wetland management goals: Charting a course for a sustainable future. Wetlands 28: 553-562. doi: 10.1672/07-154.1
![]() |
[33] |
Brinson MM, Rheinhardt R (1996) The role of reference wetlands in functional assessment and mitigation. Ecol Appl 6: 69-76. doi: 10.2307/2269553
![]() |
[34] | USACE (2015) Nontidal Wetland Mitigation Banking in Maryland Performance Standards & Monitoring Protocol. Available from: http://www.nab.usace.army.mil/Portals/63/docs/Regulatory/Mitigation/MDNTWLPERMITTEEPERSTMON4115.pdf. |
[35] | Kusler J (2006) Discussion Paper: Developing Performance Standards for the Mitigation and Restoration of Northern Forested Wetlands. Association of State Wetland Managers, Inc. Available from: https://www.aswm.org/pdf_lib/forested_wetlands_080106.pdf. |
[36] |
Stefanik KC, Mitcsch WJ (2012) Structural and functional vegetation development in created and restored wetland mitigation banks of different ages. Ecol Eng 39: 104-112. doi: 10.1016/j.ecoleng.2011.11.016
![]() |
[37] | Zedler JB, Doherty JM, Miller NA (2012) Shifting restoration policy to address landscape change, novel ecosystems, and monitoring. Ecol Soc 17: 36. |
[38] |
Grayson JE, Chapman MG, Underwood AJ (1999) The assessment of restoration of habitat in urban wetlands. Landscape Urban Plan 43: 227-236. doi: 10.1016/S0169-2046(98)00108-X
![]() |
[39] | Ravit B, Obropta C, Kallin P (2008) A baseline characterization approach to wetland enhancement in an urban watershed. Urban Habitats 5: 126-152. |
[40] | Palmer MA, Ambrose RF, Noff NL (1997) Ecological Theory and Community Restoration Ecology. Restor Ecol 5: 291-300. |
[41] |
Felson AJ, Pickett STA (2005) Designed experiments: new approaches to studying urban ecosystems. Front Ecol Environ 3: 549-556. doi: 10.1890/1540-9295(2005)003[0549:DENATS]2.0.CO;2
![]() |
[42] | Zedler JB (2007) Success: An unclear, subjective descriptor of restoration outcomes. Restor Ecol 25: 162-168. |
[43] | NRC (2001) National Research Council. Compensating for Wetland Losses under the Clean Water Act. Washington, D.C. National Academies Press. |
[44] |
Burns D, Vitvar T, McDonnell J, et al. (2005) Effects of suburban development on runoff generation in the Croton River basin, New York, USA. J Hydrol 311: 266-281. doi: 10.1016/j.jhydrol.2005.01.022
![]() |
[45] |
Sudduth EB, Meyer JL (2006) Effects of bioengineered streambank stabilization on bank habitat and macroinvertebrates in urban streams. Environ Manage 38: 218-226. doi: 10.1007/s00267-004-0381-6
![]() |
[46] |
Reinelt L, Horner R, Azous A (1998) Impacts of urbanization on palustrine (depressional freshwater) wetlands-research and management in the Puget Sound region. Urban Ecosys 2: 219-236. doi: 10.1023/A:1009532605918
![]() |
[47] | Hoefer W, Gallagher F, Hyslop R, et al. (2016) Unique landfill restoration designs increase opportunities to create urban open space. Environ Practice 18: 106-115. |
[48] | Magee TK, Kentula ME (2005) Response of wetland plant species to hydrologic conditions. Wetl Ecol Manag 13: 163-181. |
[49] | Hozapfel, Claus (2010) Restoration success assessment and plant community ecology research at the former 'Chromate Waste Site 15' in Liberty State Park. New Jersey Department of Environmental Protection, Monitoring Report. |
[50] |
Larson MA, Heintzman RL, Titus JE, et al. (2016) Urban wetland characterization in south-central New York State. Wetlands 36: 821-829. doi: 10.1007/s13157-016-0789-9
![]() |
[51] |
Walsh CJ, Roy AH, Feminella JW, et al. (2005) The urban stream syndrome: current knowledge and the search for a cure. J N Am Benthol Soc 24: 706-723. doi: 10.1899/04-028.1
![]() |
[52] | Gift DM, Groffman PM, Kaushal SJ, et al. (2010) Denitrification potential, root biomass, and organic matter in degraded and restored urban riparian zones. Restor Ecol 18: 113-120. |
[53] | Baart I, Gschöpf C, Blaschke AP, et al. (2010) Prediction of potential macrophyte development in response to restoration measures in an urban riverine wetland. Aquat Bot 93: 153-162. |
[54] |
Ehrenfeld JG (2008) Exotic invasive species in urban wetlands: environmental correlates and implications for wetland management. J Appl Ecol 45: 1160-1169. doi: 10.1111/j.1365-2664.2008.01476.x
![]() |
[55] |
Ravit B, Ehrenfeld JG, Häggblom MM, et al. (2007) The effects of drainage and nitrogen enrichment on Phragmites australis, Spartina alternaflora, and their root-associated microbial communities. Wetlands 27: 915-927. doi: 10.1672/0277-5212(2007)27[915:TEODAN]2.0.CO;2
![]() |
[56] |
Lee BH, Scholz M (2007) What is the role of Phragmites australis in experimental Constructed wetland filters treating urban runoff? Ecol Eng 29: 87-95. doi: 10.1016/j.ecoleng.2006.08.001
![]() |
[57] | Bragato C, Brix H, Malagoli M (2006) Accumulation of nutrients and heavy metals in Phragmites australis (Cav.) Trin. Ex Steudel and Bolboschoenus maritimus (L.) Palla in a constructed wetland of the Venice lagoon watershed. Environ Pollut 144: 967-975. |
[58] |
Weis JS, Weis P (2004) Metal uptake, transport and release by wetland plants: implications for phytoremediation and restoration. Environ Int 30: 685-700. doi: 10.1016/j.envint.2003.11.002
![]() |
[59] | Casagrande DG (1997) The human component of urban wetland restoration. In Restoration of an Urban Salt Marsh: An Interdisciplinary Approach (D.G. Casagrande, Ed.), 254-270. Bulletin Number 100, Yale School of Forestry and Environmental Studies, Yale University, New Haven, CT. |
[60] |
Groffman PM, Crawford MK (2003) Denitrification potential in urban riparian zones. J Environ Qual 32: 1144-1149. doi: 10.2134/jeq2003.1144
![]() |
[61] |
Kohler EA, Poole VL, Reicher ZJ, et al. (2004) Nutrient, metal, and pesticide removal during storm and nonstorm events by a constructed wetland on an urban golf course. Ecol Eng 23: 285-298. doi: 10.1016/j.ecoleng.2004.11.002
![]() |
[62] | Mahon BL, Polasky S, Adams RM (2000) Valuing urban wetlands: a property price approach. Land Econ 76: 100-113. |
[63] | Nassauer JI (2004) Monitoring the success of metropolitan wetland restorations: Cultural sustainability and ecological function. Wetlands 24: 756-765. |
[64] | Swanson WR, Lamie P. Urban fill characterization and risk-based management decisions-A practical guide. Proceedings of the Annual Internation Conference on Soils, Sediments, Water and Energy. 2010, 12, 9. Available from: http://scholarworks.umass.edu/soilsproceedings/vol12/iss1/9 |
[65] | Morio M, Schadler S, Finkel M (2013) Applying a multi-criteria genetic algorithm framework for brownfield reuse optimization: improving redevelopment options based on stakeholder preferences. J Environ Manage 130: 331-345. |
[66] | Doolittle JA, Brevik EC (2014) The use of electromagnetic induction techniques in soils studies. Geoderma 223-225: 33-45. |
[67] | Daniels JJ, Vendl M, Holt J, et al. (2003) Combining multiple geophysical data sets into a single 3D image. Sympposium on the Application of Geophysics to Engineering and Environmental Problems 2003: 299-306. |
[68] | Allred BJ, Ehsani RM, Daniels JJ (2008) General considerations for geophysical methods applied to agriculture. Handbook of Agricultural Geophysics. CRC Press, Taylor and Francis Group, Boca Raton, Florida, 3-16. |
[69] | Berg J, Underwood K, Regenerative stormwater conveyance (RSC) as an integrated approach to sustainable stormwater planning on linear projects. In Proceedings of the 2009 International Conference on Ecology and Transportation. Eds: Wagner, P.J., Nelson, D., Murray, E. Raleigh, NJ: Center for Transportation and the Environment, North Carolina State University. 2010. |
[70] |
Palmer MA, Filoso S, Fanelli RM (2014) From ecosystems to ecosystem services: stream restoration as ecological engineering. Ecol Eng 65: 62-70. doi: 10.1016/j.ecoleng.2013.07.059
![]() |
[71] | Contaminated Soil-Cleanup Costs and Standards, 2017. Available from: http://science.jrank.org/pages/1737/Contaminated-Soil-Cleanup-costs-standards.html. |
[72] | Final Engineering Evaluation/Cost Analysis. Index Shooting Range, Mt. Baker-Snoqualmie National Forest. Prepared by URS Consulting Engineers. Portland, Oregon USDA Forest Service, November 2011. |
[73] | NJDEP=HPCTF Final Report-V. Costs and Economic Impacts, 2017. Available from: http://www.nj.gov/dep/special/hpctf/final/costs.htm. |
[74] | In Situ Biological Treatment for Soil, Sediment, and Sludge. 2017. Available from: https:frtr.gov/matrix2/section3/sec3_int.html. |
[75] | Obropta C, Kallin P, Mak M, et al. (2008) Modeling urban wetland hydrology for the restoration of a forested riparian wetland ecosystem. Urban Habitat 5: 183-198. |
[76] | Gallagher FJ, Pechmann I, Bogden JD, et al. (2008) Soil metal concentrations and vegetative assemblage structure in an urban brownfield. Environ Pollut 153: 351-361. |
[77] | Salisbury AB, Long term stability of trace element concentrations in a spontaneously vegetated urban brownfield with anthropogenic soils. Soil Science. In Press. |
[78] | Wong THF, Somes NLG (1995) A Stochastic Approach to Designing Wetlands for Stormwater Pollution Control. Water Sci Technol 32: 145-151. |
[79] | Wagner M (2004) The roles of seed dispersal ability and seedling salt tolerance in community assembly of a severely degraded site. In: Temperton, V.M., Hobbs, R.J., Nuttle, T., Jalle, S. (Eds.), Assembly Rules and Restoration Ecology: Bridging the Gap Between Theory and Practice. Island Press, Washington, DC, p. 266. |
[80] | Belyea LR (2004) Beyond ecological filters: Feedback networks in the assembly and retroaction of community structure. In: Temperton, V.M., Hobbs, R.J., Nuttle, T.J., Halle, S. (Eds.), Assembly Rules and Restoration Ecology: Bridging the Gap Between Theory and Practice. Island Press, Washington, D.C., pp. 115e131. |
[81] | Kusler J, Parenteau P, Thomas EA (2007) "Significant Nexus" and Clean Water Act Jurisdiction. Discussion paper. Association of State Wetland Managers, Inc. Available from: https://www.aswm.org/pdf_lib/significant_nexus_paper_030507.pdf. |
[82] | Lockwood JL, Pimm SL (1999) When does restoration succeed? In: E Weiher and P.A. Keddy (eds) Ecological assembly rules: perspective, advances and retreats. Cambridge University Press. |
[83] | Clements FE, 1916. Plant Succession. Publication 242. Carnegie Institution, Washington, D.C. |
[84] | Gleason HA, 1926. The individualistic concept of plant association. Bulletin of the Torrey Botany Club 53, 7e26. |
[85] | Van der Maarel E, Sykes MT (1993) Small-scale plant species turnover in limestone grasslands: the carousel model and some comments on the niche concept. J Vegetative Science 4: 179e188. |
[86] | Hobbs JD, Norton DA, 2004. In: Templeton, V.M.R.J., Hobbs, T., Nuttle, T., Halle, S. (Eds.), Assembly Rules and Restoration Ecology. Island Press, p. 77. |
[87] |
Bendor T (2009) A dynamic analysis of the wetland mitigation process and its effects on no net loss policy. Landscape Urban Plan 89: 17-27. doi: 10.1016/j.landurbplan.2008.09.003
![]() |
[88] |
Palmer MA, Bernhardt ES, Allan JD, et al. (2005) Standards for ecologically successful river restoration. J Appl Ecol 42: 208-217. doi: 10.1111/j.1365-2664.2005.01004.x
![]() |
[89] | Soil Clean Up Criteria, NJDEP, 1999. Available from: http://www.nj.gov/dep/srp/guidance/scc/. |
[90] | Efroymson RA, Will ME, Suter II GW, et al. (1997) Toxicological Benchmarks for Screening Contaminants of Potential Concern for Effects on Terrestrial Plants: 1997 Revision. Oak Ridge National Laboratory, Oak Ridge, TN, 128 pp. |
[91] | United States Environmental Protection Agency, 2003. Guidance for Developing Ecological Soil Screening Levels. OSWER-Directive 9285.7-55. Available from: http://www.epa.gov/ecotox/ecossl/. |
Property | LLS | Bi-iLS | Bi-BPCA-iLS |
Gene similarity | Clustering | Biclustering | Biclustering |
Temporary complete matrix | Row average | Row average | BPCA |
Parameters | k | k and T0 | k and T0 |
Iterative | No | Yes | Yes |
Authors | Kim et al. [16] | Cheng et al. [4] | Newly proposed |
Gene | 10 min | 30 min | 50 min | 70 min | … | 290 min |
Gene 1 | −0.16 | 0.09 | −0.23 | 0.03 | … | −0.26 |
Gene 2 | NaN | NaN | NaN | −0.58 | … | NaN |
Gene 3 | −0.37 | −0.22 | −0.16 | 0.04 | … | −0.41 |
Gene 4 | NaN | NaN | NaN | −1.5 | … | NaN |
Gene 5 | −0.43 | −1.33 | −1.53 | −1.53 | … | 1.18 |
Gene | Length | wt_rep1 | wt_rep2 | nht1_rep1 | nht1_rep2 | Iec1_rep1 | Iec1_rep2 | Iec5_rep1 | Iec5_rep2 |
Gene 1 | 669 | 18 | 16 | 8 | 2 | 4 | 15 | 17 | 19 |
Gene 2 | 993 | 46 | 50 | 45 | 25 | 33 | 34 | 25 | 29 |
Gene 3 | 3227 | 1623 | 1474 | 1655 | 1268 | 994 | 1870 | 1476 | 1849 |
Gene 4 | 868 | 258 | 322 | 215 | 200 | 138 | 284 | 278 | 286 |
Gene 5 | 2250 | 87 | 79 | 119 | 121 | 87 | 209 | 88 | 102 |
… | … | … | … | … | … | … | … | … | … |
Gene 5137 | 546 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
NRMSE of bi-BPCA-iLS, CDC15 | Missing rate 1% | Missing rate 5% | Missing rate 10% | Missing rate 15% | Missing rate 20% | Missing rate 25% | Missing rate 30% |
Test 1 | 0.1851 | 0.3934 | 0.4766 | 0.5485 | 0.5681 | 0.6102 | 0.6369 |
Test 2 | 0.1685 | 0.4610 | 0.5101 | 0.5634 | 0.5717 | 0.6093 | 0.6269 |
Test 3 | 0.2065 | 0.3882 | 0.4888 | 0.5359 | 0.5887 | 0.6044 | 0.6206 |
Test 4 | 0.2170 | 0.3741 | 0.4892 | 0.5476 | 0.5839 | 0.6034 | 0.6300 |
Test 5 | 0.2371 | 0.3934 | 0.4811 | 0.5458 | 0.5801 | 0.6000 | 0.6203 |
Average | 0.20284 | 0.40202 | 0.48916 | 0.54824 | 0.5785 | 0.60546 | 0.62694 |
NRMSE of bi-BPCA-iLS, GSE150544 | Missing rate 1% | Missing rate 5% | Missing rate 10% | Missing rate 15% | Missing rate 20% | Missing rate 25% | Missing rate 30% |
Test 1 | 0.3020 | 0.2593 | 0.2226 | 0.2351 | 0.2470 | 0.2595 | 0.2612 |
Test 2 | 0.1317 | 0.1737 | 0.2684 | 0.2527 | 0.2518 | 0.2515 | 0.2493 |
Test 3 | 0.3011 | 0.2814 | 0.2577 | 0.2378 | 0.2364 | 0.2234 | 0.2641 |
Test 4 | 0.3976 | 0.2838 | 0.2273 | 0.2791 | 0.2872 | 0.2357 | 0.2457 |
Test 5 | 0.4162 | 0.1917 | 0.2571 | 0.2595 | 0.2589 | 0.2485 | 0.3046 |
Average | 0.30972 | 0.23798 | 0.24662 | 0.25284 | 0.25626 | 0.24372 | 0.26498 |
Average NRMSE, CDC15 | LLS | Bi-iLS | Bi-BPCA-iLS | Improvement of bi-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to bi-iLS |
Missing rate 1% | 0.20938 | 0.20792 | 0.20284 | 0.697296781% | 3.123507498% | 2.443247403% |
Missing rate 5% | 0.51722 | 0.40444 | 0.40202 | 21.80503461% | 22.27292061% | 0.598358224% |
Missing rate 10% | 0.57694 | 0.49252 | 0.48916 | 14.63237078% | 15.2147537% | 0.682205799% |
Missing rate 15% | 0.61448 | 0.5499 | 0.54824 | 10.50969926% | 10.77984637% | 0.301873068% |
Missing rate 20% | 0.63408 | 0.5799 | 0.5785 | 8.544663134% | 8.765455463% | 0.241420935% |
Missing rate 25% | 0.65518 | 0.60512 | 0.60546 | 7.640648371% | 7.588754235% | -0.05618720% |
Missing rate 30% | 0.6708 | 0.62610 | 0.62694 | 6.663685152% | 6.538461538% | -0.13416387% |
Average improvement | | | | 10.070% | 10.612% | 0.582% |
Average computational time, CDC15 (s) | LLS | Bi-iLS | Bi-BPCA-iLS | Additional time of bi-iLS relative to LLS | Additional time of bi-BPCA-iLS relative to LLS | Additional time of bi-BPCA-iLS relative to bi-iLS |
Missing rate 1% | 60.402 | 120.439 | 126.396 | 60.037 | 65.994 | 5.957 |
Missing rate 5% | 76.419 | 255.455 | 275.087 | 179.036 | 198.668 | 19.632 |
Missing rate 10% | 71.727 | 388.521 | 455.319 | 316.794 | 383.592 | 66.798 |
Missing rate 15% | 59.123 | 378.844 | 444.281 | 319.721 | 385.158 | 65.437 |
Missing rate 20% | 52.033 | 472.843 | 434.815 | 420.81 | 382.782 | −38.028 |
Missing rate 25% | 45.539 | 457.156 | 467.654 | 411.617 | 422.115 | 10.498 |
Missing rate 30% | 61.047 | 593.972 | 629.689 | 532.925 | 568.642 | 35.717 |
Average additional computational time (s) | | | | 320.134 | 343.850 | 23.716 |
Average NRMSE, GSE150544 | LLS | Bi-iLS | Bi-BPCA-iLS | Improvement of bi-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to LLS | Improvement of bi-BPCA-iLS relative to bi-iLS |
Missing rate 1% | 0.29318 | 0.32288 | 0.30972 | -10.1303% | -5.64159% | 4.075818% |
Missing rate 5% | 0.23604 | 0.25204 | 0.23798 | -6.77851% | -0.82189% | 5.57848% |
Missing rate 10% | 0.26032 | 0.25436 | 0.24662 | 2.28949% | 5.262754% | 3.042931% |
Missing rate 15% | 0.27980 | 0.26908 | 0.25284 | 3.831308% | 9.635454% | 6.03538% |
Missing rate 20% | 0.28308 | 0.25652 | 0.25626 | 9.382507% | 9.474354% | 0.101357% |
Missing rate 25% | 0.29156 | 0.24558 | 0.24372 | 15.77034% | 16.40829% | 0.757391% |
Missing rate 30% | 0.34442 | 0.27056 | 0.26498 | 21.44475% | 23.06486% | 2.062389% |
Average improvement | | | | 5.12% | 8.20% | 3.09% |