Processing math: 100%
Research article

Modeling nitrogen removal in membrane aerated biofilm reactors: the role of nitritation, denitritation, and anammox nitrogen removal

  • Received: 23 December 2024 Revised: 15 May 2025 Accepted: 06 June 2025 Published: 12 June 2025
  • This study develops a two-dimensional, multi-species biofilm model to investigate the influence of environmental factors, specifically temperature and concentrations of oxygen, acetate, and ammonium on nitrogen removal in membrane aerated biofilm reactors (MABRs). The resulting model is a highly nonlinear reaction-diffusion system, explored through computer simulations, and captures microbial interactions, substrate transport, and nitrogen transformations within a biofilm, incorporating the counter-diffusion mechanism. Three nitrogen removal pathways have been examined in this study: nitritation-denitritation (ND), partial nitrification-anammox (PN/A), and conventional nitrification-denitrification (CND). The simulation results show that temperature and concentrations of oxygen and acetate significantly affect nitrogen removal rates and contributions of each pathway. ND dominates under most conditions, while PN/A prevails in oxygen-limited scenarios (O=0.250.5gm3) and co-dominates with ND at moderate oxygen levels (O=0.51gm3). CND is significant only at higher oxygen concentrations (O=5gm3) with low ammonium (N1=515gm3) and acetate levels (A=6gm3). Moreover, it has been shown that temperature enhances nitrogen removal primarily by increasing the contribution of anammox. Effective removal rates (>0.1g/m2/d) occur at O1gm3 with low to moderate acetate levels (A=6gm3 to <100gm3). The simulations further indicate that MABRs can achieve a stable ND nitrogen removal efficiency with biofilm thickness exceeding approximately 0.8mm. In this scenario, ammonium-oxidizing bacteria (AOB) and ND denitrifiers outcompete aerobic heterotrophs and nitrite-oxidizing bacteria, resulting in a biofilm structure predominantly composed of AOB and ND denitrifiers. The findings of this study provide valuable insights for optimizing MABR design and operation to achieve energy efficient nitrogen removal.

    Citation: Maryam Ghasemi, Sheng Chang. Modeling nitrogen removal in membrane aerated biofilm reactors: the role of nitritation, denitritation, and anammox nitrogen removal[J]. Mathematics in Engineering, 2025, 7(3): 350-383. doi: 10.3934/mine.2025015

    Related Papers:

    [1] Seyed Hossein Hashemi, Abas Niknam, Amir Karimian Torghabeh, Nuno Pimentel . Thermodynamic and geochemical studies of formation water in Rag-e Sefid oil and gas field, Iran. AIMS Geosciences, 2023, 9(3): 578-594. doi: 10.3934/geosci.2023031
    [2] Pezhman Soltani Tehrani, Hamzeh Ghorbani, Sahar Lajmorak, Omid Molaei, Ahmed E Radwan, Saeed Parvizi Ghaleh . Laboratory study of polymer injection into heavy oil unconventional reservoirs to enhance oil recovery and determination of optimal injection concentration. AIMS Geosciences, 2022, 8(4): 579-592. doi: 10.3934/geosci.2022031
    [3] Paolo Dell'Aversana . Reservoir geophysical monitoring supported by artificial general intelligence and Q-Learning for oil production optimization. AIMS Geosciences, 2024, 10(3): 641-661. doi: 10.3934/geosci.2024033
    [4] Anthony I Nzekwu, Richardson M Abraham-A . Reservoir sands characterisation involving capacity prediction in NZ oil and gas field, offshore Niger Delta, Nigeria. AIMS Geosciences, 2022, 8(2): 159-174. doi: 10.3934/geosci.2022010
    [5] Paolo Dell'Aversana . Reservoir prescriptive management combining electric resistivity tomography and machine learning. AIMS Geosciences, 2021, 7(2): 138-161. doi: 10.3934/geosci.2021009
    [6] Maleyka Agha Ali Aghayeva . Joint analysis of seismic and well log data applied for prediction of oil presence in Maykop deposits in Naftalan area. AIMS Geosciences, 2021, 7(3): 331-337. doi: 10.3934/geosci.2021020
    [7] Aleksander S. Gundersen, Pasquale Carotenuto, Tom Lunne, Axel Walta, Per M. Sparrevik . Field verification tests of the newly developed flow cone tool—In-situ measurements of hydraulic soil properties. AIMS Geosciences, 2019, 5(4): 784-803. doi: 10.3934/geosci.2019.4.784
    [8] Elvis Ishimwe, Richard A. Coffman, Kyle M. Rollins . Predicting blast-induced liquefaction within the New Madrid Seismic Zone. AIMS Geosciences, 2020, 6(1): 71-91. doi: 10.3934/geosci.2020006
    [9] Muhammad Safdar, Tim Newson, Hamza Ahmad Qureshi . Shear strength of fibre reinforced cemented Toyoura sand. AIMS Geosciences, 2022, 8(1): 68-83. doi: 10.3934/geosci.2022005
    [10] Shuo Yang, Dong Wang, Zeguang Dong, Yingge Li, Dongxing Du . ANN prediction of the CO2 solubility in water and brine under reservoir conditions. AIMS Geosciences, 2025, 11(1): 201-227. doi: 10.3934/geosci.2025009
  • This study develops a two-dimensional, multi-species biofilm model to investigate the influence of environmental factors, specifically temperature and concentrations of oxygen, acetate, and ammonium on nitrogen removal in membrane aerated biofilm reactors (MABRs). The resulting model is a highly nonlinear reaction-diffusion system, explored through computer simulations, and captures microbial interactions, substrate transport, and nitrogen transformations within a biofilm, incorporating the counter-diffusion mechanism. Three nitrogen removal pathways have been examined in this study: nitritation-denitritation (ND), partial nitrification-anammox (PN/A), and conventional nitrification-denitrification (CND). The simulation results show that temperature and concentrations of oxygen and acetate significantly affect nitrogen removal rates and contributions of each pathway. ND dominates under most conditions, while PN/A prevails in oxygen-limited scenarios (O=0.250.5gm3) and co-dominates with ND at moderate oxygen levels (O=0.51gm3). CND is significant only at higher oxygen concentrations (O=5gm3) with low ammonium (N1=515gm3) and acetate levels (A=6gm3). Moreover, it has been shown that temperature enhances nitrogen removal primarily by increasing the contribution of anammox. Effective removal rates (>0.1g/m2/d) occur at O1gm3 with low to moderate acetate levels (A=6gm3 to <100gm3). The simulations further indicate that MABRs can achieve a stable ND nitrogen removal efficiency with biofilm thickness exceeding approximately 0.8mm. In this scenario, ammonium-oxidizing bacteria (AOB) and ND denitrifiers outcompete aerobic heterotrophs and nitrite-oxidizing bacteria, resulting in a biofilm structure predominantly composed of AOB and ND denitrifiers. The findings of this study provide valuable insights for optimizing MABR design and operation to achieve energy efficient nitrogen removal.



    RNA-seq data plays a crucial role in advancing our understanding of biological processes [1,2,3], disease mechanisms [4,5], and therapeutic interventions [6,7] by providing comprehensive insights into tissue-level gene expression patterns [8,9,10]. With the rise of single-cell RNA(scRNA-seq) technologies, direct genome-wide measurement of transcriptome in individual cells and characterization of their heterogeneity is enabled. However, scRNA-seq analysis is greatly susceptible to the high cost, limited high-quality samples, or technical biases [11,12,13]. Compared to scRNA-seq, bulk tissue RNA-seq is more cost-effective for large-scale studies involving numerous samples. It offers a balance between depth of coverage and cost, making it suitable for studies requiring analysis of gene expression across multiple conditions or time points. However, tissues are composed of diverse cell types, each with its own gene expression signature, so this complex nature of bulk tissue samples under investigation remains as a major obstacle. Bulk tissue RNA-seq, although it captures the cumulative gene expression profiles (GEP) from all cells within the sample, cannot distinguish individual cell types and contains no information cellular composition [14,15,16]. For example, features of Alzheimer's disease include amyloid-beta plaques and neurofibrillary tangles, along with neuronal loss and gliosis in the affected brain regions; thus, transcriptome-wide GEPs are different from the brain tissue of patients and neuropathologically normal controls. It is critical to recognize such differences in order to discover genes and biological pathways that are perturbed in and/or lead to Alzheimer's disease [8,10,17,18,19]. One of the important tools to unveil these differences is Differential expression (DE) analysis, which reveals novel insights into the genes and pathways, and is potentially helpful for drug targets therapeutics. However, the question whether disease-associated GEP changes in brain tissues are due to changes in cellular composition of tissue samples, or due to GEP changes in specific cells, e.g., central nervous system cells, remains a fundamental knowledge gap for DE.

    To leverage bulk-tissue data while obtaining information from specific cell types, deconvolution [6,17,19,20,21,22] of bulk-tissue RNA-seq data has emerged as a powerful computational approach in the field of transcriptomics, offering insight into the cellular composition of complex tissue samples. Deconvolution methods aim to computationally dissect bulk-tissue RNA-seq data and estimate the relative proportions of different cell types contributing to the observed expression patterns.

    In recent years, numerous deconvolution techniques have been developed and they can be classified from different perspectives. In terms of mathematical methodologies, they can be grouped as linear regression-based methods [23,24], non-negative matrix factorization (NMF) methods [25], Bayesian Approaches [26], or various machine learning approaches [27,28]. From whether other data sources are used, they can be classified as reference-based (partial deconvolution) [23,24,26,29] and reference-free methods (complete deconvolution) [25,30,31]. Further, different types of references can be used, such as marker gene [32], scRNA-seq [23], or gene expression variability [29]. Additionally, deconvolution methods may provide different outputs, some result in qualitative cellular fractions [23,24,26] while others only provide scores [27,28]. Some offer in silicon purification of sample-wise GEP [24,26,29], while the rest do not. These methods vary in their assumptions, computational complexity, and accuracy. A comprehensive review can be found in [33,34]. Although these models are successfully applied in many biological problems, the underlying mathematical challenges, especially in complete deconvolution, are still open problems. In our work, we focus on the non-negative matrix factorization (NMF) based complete deconvolution models.

    Many studies of NMF have been established for various types of data in other fields, such as spectral unmixing in analytical chemistry [35], remote sensing [36], image processing [37], or topic mining in machine learning [38]. This complete deconvolution problem [17,19,20,21,22] of bulk-tissue RNA-seq is illustrated in Figure 1 as the NMF model [35,39,40]. In this work, the expression of a gene in a sample tissue is assumed to be the linear combination of its expressions in the constituting cell types, with respect to the cell proportion, i.e.,

    gij=kγ=1ci,γpγ,j,1iN,1jn, (1.1)

    where gij and ci,γ are the GEPs of gene i in the j-th sample and γ-th cell type, respectively, while pγ,j is the proportion of the l-th cell type in the j-th sample. For the total number of genes N, total number of samples n, and the number of cell types k, we usually have Nn>k. In matrix form, Equation (1.1) is represented as G=CP with all matrix entries being non-negative. Given data G, both variables C and P are to be solved simultaneously. This is also termed as NMF-based reference-free complete deconvolution. Note that in many applications, matrix C is assumed to be known based on some available sources such as scRNA-seq data and only P is computed. This is called reference-based partial deconvolution, which is just a least square/linear regression problem and will not be termed as NMF model. Partial deconvolution is always well-posed, and it is generally more accurate, if the reference C is accurate. However, when reference is not consistent with the true value of C, the convolution result is not reliable.

    Figure 1.  Diagram of complete deconvolution of bulk tissue data.

    The NMF is strongly ill-posed, and solutions are generally not unique. Such non-uniqueness poses great challenges on solution interpretability: For RNA-seq data, different solutions represent various combinations of GEPs in each cell type and cell proportions in tissues. It is indispensable to obtain meaningful explanation of these biological quantities to the next step DE analysis. Although many NMF based bulk-tissue RNA data deconvolution algorithms have been developed recently [6,41,42,43,44,45,46,50,51] for different types of samples [48,49], the underlying mathematical challenges have never been addressed.

    There are a few guidelines to reduce such ill-posedness in a certain degree. As stated in [52], if the constituent matrices C and P satisfy sufficiently-scattered conditions (see Section 2 for details), it is possible for the NMF problem to have unique solutions, subjective to row/column scaling and permutation ambiguities of the solution. Fortunately, this identifiability condition on matrix C is usually satisfied due to the concept of marker genes in the RNA-seq data. Based on this fact, a geometric structure guided NMF model has been proposed to enhance the identifiability of solutions, by recognizing marker genes (finding structure) first and then enforcing the identifiability condition of matrix C (preserve the structure) accordingly. However, on the other side, the sufficiently scattered requirement on matrix P will almost certainly be violated for realistic bulk tissue data: In many cases, there are usually

    ● Rarely presented cell types across tissue samples, i.e., matrix P may contain a nearly-zero row vector;

    ● There also could be correlated cellular proportions in tissue samples, i.e., matrix P may have linearly dependent rows.

    These common biological features could cause severe instability and inaccuracy in deconvolution algorithms. In this work, we call the bulk-tissue data singular data, if its constituent cellular composition (matrix P) has above characteristics.

    Our objective of this current work is to develop reliable and robust deconvolution algorithms for the aforementioned singular data, based on the geometric structure guided NMF (GSNMF) model. The essential idea is to construct pseudo-bulk tissue data with regular cellular composition and then the GSNMF model is applied on the combined original singular data and regular data. The overall hybrid data is thus regular, since both constituent matrices C and P satisfy the identifiability conditions. Specifically, our approach is achieved from the following steps: (1) Connection between scRNA-seq and bulk tissue RNA-seq data is investigated, and a stochastic algorithm is developed to generate pseudo-bulk tissue data; (2) since the NMF is a non-convex optimization problem, and its solution heavily depends on the initial condition, the relation between initial guess of solution and its converged results is discussed, and an informative initial condition is suggested accordingly; (3) based on the suggested initial condition, it is possible to estimate the qualitative difference between converged solution and true solution and hence a posterior rescaling process may increase solution accuracy.

    Throughout the paper, a bold lower-case letter, such as x, represents a column vector with the appropriate dimension, and |x| represents its l2 norm. Vector 1 is a column vector of some dimension with all entries being one. For a matrix A, AF represents its Frobenius norm, A(i) and A(j) mean its i-th row and j-th column, respectively.

    The paper is organized as the follows: In section 2 we briefly review the NMF, its graphic interpretation, and its separability conditions. Section 3 presents the proposed pseudo-bulk tissue augmented geometric structure guided complete deconvolution model. Solution analysis and its insights in biological background are provided in Section 4, including dependence on informative initial conditions and solution estimates. As validations, in Section 5, we provide numerical results of the proposed model and algorithms for both synthetic and biological data. The paper ends with a conclusion in Section 6, where potential challenges of the work and possible future research directions are discussed. For convenience, some frequently used acronyms throughout the paper are listed in Table 1.

    Table 1.  Some frequently used acronyms.
    GEP Gene expression profile
    DE Differential expression
    NMF Nonnegative Matrix Factorization
    GS-NMF Geometric structure constrained NMF
    ADMM Alternating direction method of multipliers

     | Show Table
    DownLoad: CSV

    Let matrix GRN×n with entry gij be the expression of the i-th gene in the j-th sample; matrix CRN×k with entry cij be the reference expression of the i-th gene in the j-th cell type; and matrix PRk×n with entry pij be the proportion of the i-th cell type in the j-th sample. Assume dimensions Nmax(n,k) and the following linear relation:

    G=CP+ϵ, (2.1)

    where ϵ is noise. Complete deconvolution is the problem that given bulk tissue data GRN×n, solve for the constituent matrices C and P, i.e.,

    (C,P)=arg minCRN×k+,PRk×n+δ(CP,G) (2.2)

    where RN×k+ or Rk×n+ represent matrices with nonnegative entries and δ(,) is a cost function. Usually, the cost function is chosen based on prior knowledge about the probability distribution of the data noise and susceptibility to outliers. In actual applications, the minimization process is considered as a maximum likelihood estimator for additive Gaussian noise when the Frobenius norm is used. On the other hand, when noises are Poisson processes, the Expectation Maximization (EM) algorithm and maximum likelihood [53] leads to the I-divergence [54] cost function. Or the cost function can be taken as row-wise or column-wise l1 norms of the difference matrix [43] if noises follow Laplace distribution. As summarized in [55], other choices include Earth Mover's distance metric, α-divergence, β-divergence, γ-divergence, φ-divergence, Bregman divergence, and α-β-divergence. In this work, we focus on how to handle data singularity and structure, so we only use δ(CP,G)=12GCP2F for simplicity. The NMF is well-known to be ill-posed, and the solution is not unique or identifiable: if (C,P) is a local minimum to (2.2), then for any ΩRk×k, ˆC=CΩ and ˆP=Ω1P are also solutions as long as their non-negativity is satisfied.

    Figure 2 features some situations of solving an NMF problem by the classic Multiplicative Update (MU) rule. Figure 2(a), (b) show two local minimizers (blue) of matrix C, but none of them is even close to the ground truth (red). Figure 2(c) displays a case of NMF model applied in realistic RNA-seq data to compute cellular composition. The simulated results (blue) are highly correlated with ground truth (orange), or will win high score in correlation as the measure metric. However, these computations estimate significantly wrong cellular proportions for two types of cells. This is due to the fact that, even all assumptions are satisfied, the "unique" solutions of NMF are still subject to a permutation and rescaling matrix, see details in the following section. The permutation ambiguity might be resolved by biological knowledge, while the rescaling factors must be addressed computationally.

    Figure 2.  Ill-posedness of NMF and its multiple solutions.

    Different from image processing, the non-uniqueness of solution will significantly impact statistical analysis for decisions in biological implementation. Therefore, it is important to restrict searching space of variables to increase the identifiability of solutions for better interpretability.

    First, the unique solution of NMF is defined in the following sense [56]:

    Definition 2.1 (Uniqueness of NMF solution). The solution (C,P) of NMF (2.2) is unique, or identifiable, if and only if for any other solution (ˉC,ˉP), there exists a permutation matrix Π{0,1}k×k and a diagonal scaling matrix S with positive diagonal matrix such that

    ˉC=CΠS and ˉP=S1ΠP. (2.3)

    This uniqueness can be achieved under certain circumstances from nested cone theory [52,57,58,59]: By non-negativity of C and P, we have

    G(j)cone(C)RN+,1jn, (2.4)

    where the notation cone(C) denotes the convex cone generated by the columns of C, i.e.,

    cone{C}={xRN|x=Cθ, for some θRk,θ0}, (2.5)

    From Eq (2.4), one can further obtain cone(G)cone(C)RN+. This argument also applies to the transpose of the NMF problem, i.e., cone(GT)cone(PT)Rn+. Then, problem (2.2) can be interpreted as a finding nested cone problem: given data G, find the nested cones cone(C) and cone(PT), between cone(G) and RN+, and cone(GT) and Rn+, respectively. Therefore, in order to obtain unique solutions of C and P, it requires data G almost "full-fill" in the positive orthant RN+ and Rn+. However, this requirement about data G is on its original, high dimensional space (N and n are usually large). Instead, the solvability conditions of NMF problem can be interpreted on matrices C and P, in the intrinsic, low-dimensional space Rk+ with kmin{N,n}. First, one needs the following definitions:

    Definition 2.2. The matrix ARN×n+ is separable if cone(A)=RN+.

    Definition 2.3. The matrix ARN×n+ is sufficiently scattered if: (ⅰ) The second-order cone in RN+ is contained in cone(A), i.e., C={xRN+|exN1x2}cone(A); and (ⅱ) There does not exist any orthogonal matrix Q such that cone(A)cone(Q), except for permutation matrices.

    Then, the unique solution of the NMF problem can be achieved if a strong identifiability condition or its weakened version is satisfied [55,56]:

    Theorem 2.4 (Strong identifiability condition). Assuming k=rank(G), ϵ=0, if problem (2.2) admits a solution, for which both C and P are separable matrices, then the solution is unique.

    Theorem 2.5 (Weak identifiability condition). Assuming k=rank(G), ϵ=0, if both C and P are sufficiently scattered, then problem (2.2) admits a unique solution.

    The left panel in Figure 3 offers a graphic explanation of the two theorems. Both rows of C and columns of P are in the low dimensional space Rk, so it is convenient to represent them in a (k1)-simplex with k=3, especially columns of P have sum-to-one property. The red triangle is the positive orthant of R3 and ei,i=1,2,3 are the unit base. Blue dots, either circles or pentagons, represent rows of C or columns of P. Theorem 2.4 is quite rigorous, because being separable matrices it requires P and C must contain (scaled) extreme rays of the nonnegative orthant in the corresponding space, i.e., for every r=1,2,...k there exists a column index lr, such that P(lr)=αrerRk+, where αr is a scalar. In the simplex view, some of the blue dots have to coincide with the three red dots. In contrast, Theorem 2.5 is much more relaxed and it requires only some of the data points (shown as pentagons) being outside of the second-order cone, which is represented by the dashed, inscribed circle in the triangle.

    Figure 3.  Geometric interpretation of solvability condition (left) and constraints of the proposed NMF model (right).

    The right panel of Figure 3 displays the fundamental idea of the geometric structure guided NMF (GSNMF) developed in [60]. According to Theorem 2.5, a solvability constraint is imposed on the searching space of C. In this constraint, some rows of C are required to be close enough to the k fundamental basis, such that the convex hull of all rows can cover the second-order cone in Rk. The different groups of rows of C to each basis, represented by the red, blue, and green dots, are recognized by a clustering algorithm, and are corresponding to the biological marker genes. Within the same group, the manifold constraint is applied based on the local invariance assumption in manifold regularization [61,62,63]. Note that the Eisen cosine correlation distance is used in both constraints, instead of Euclidean distance. With these constraints, the GSNMF significantly improves the interpretability of solutions, but it suffers to obtain accurate solutions when cellular compositions of the bulk tissue data are singular. Recall in either strong or weak identifiability conditions, that these requirements are set on both matrices CT and P. The weak condition on CT is usually satisfied for a given dataset and it is reasonable to impose constraints on C(i) due to the biological features of marker genes, which can be recognized computationally or by prior knowledge. However, there is usually no information in advance about the cellular composition. The black star dots in the left panel of Figure 3 show such a case that cellular proportion of two cell types across all samples are highly correlated. In this scenario, the data points will distribute more or less along a line but not fully occupy the whole simplex.

    One major contribution of this work is to address the challenge of singular data, as mentioned in the introduction. To do so, the core idea is to augment the original NMF problem of true potentially singular sample data G by pseudo bulk tissue data ˆG with regular cellular proportions, i.e., the original NMF problem is augmented as

    (C,P)=arg minCRN×k+,PRk×(n+ˆn)+δ(CP,[G,ˆG]). (3.1)

    The key issue is how to construct regular pseudo bulk tissue data ˆG in Eq (3.1).

    Singe-cell RNA-seq data is just a spreadsheet recording the number counts of a set of N genes in multiple cells of different types and in various samples. In this section, we illustrate how to generate a vector of pseudo-bulk tissue date from the spread sheet. Let cγ,lRN be the GEP of all genes in the l-th cell of the γ-th cell type, where 1γk and total number of cells depend on a specific dataset. Then, the relation between single-cell data and the j-th sample in the pseudo-bulk tissue data is

    ˆG(j)=kγ=1m(γ,j)l=1cγ,l=kγ=1m(γ,j)ˉcγ, (3.2)

    where m(γ,j) means that there are m(γ,j) cells for the γ-th cell type, or cell composition in the j-th sample. Since the deconvolution problem does not distinguish individual cells, the above equation is further written in the form of averaged GEP ˉcγ in each cell type. If the cell type proportion is used, then let Mj=kγ=1m(γ,j) being the total number of cells in the j-th sample and Eq (3.2) becomes

    ˆG(j)=kγ=1m(γ,j)kγ=1m(γ,j)Mjˉcγ=kγ=1pγ,jMjˉcγ (3.3)

    with kγ=1pγ,j=1 being proportion.

    It is important to notice the difference between Eq (3.3) and the j-th column of the NMF model, i.e., G(j)=kγ=1pγ,jC(γ). Note that the matrix CRN×k only contains k columns C(γ) (cell type index 1γk) and no sample index j, so the NMF model for bulk tissue data deconvolution implies the assumption that cell type specific GEP should be homogeneous across tissue samples. This assumption also gives us a guidance when constructing pseudo-bulk tissue, that the total number of cells and average GEP, or Mjˉcγ, taken from all samples need to be as consistent as possible.

    ● Choose the total cell number M, as a parameter, for all pseudo-bulk tissue samples;

    ● Compute ˉcγ from the single-RNA-seq data sheet for 1γk;

    ● Choose a mechanism to generate random numbers pγ,j and use Eq (3.3) to generate the j-th pseudo-bulk sample.

    On the other hand, from the point view of linear regression, the matrix C estimated from the bulk-tissue data (pseudo or real) by the NMF model is actually the averaged value ˉcγ=<Mjˉcγ> across samples. This understanding raises two concerns about applying the NMF model: (1) The original data need to be "good" in the sense that the GEP among samples are homogeneous. By prior knowledge, the GEP varies significantly for sample to sample, then the NMF is not a suitable model. (2) Appropriate single-cell data usage: The scRNA-seq data could be directly used in NMF as a reference and it reduces the complete deconvolution to a regular linear regression problem. However, in realistic applications, the scRNA-seq data may not be consistent with the inherent GEP in the bulk tissue data, or they are even taken from other literature. This inconsistency will significantly reduce the computational accuracy of the GEP (matrix C) and hence cellular proportion (matrix P) in the original bulk tissue data.

    In order to generate pseudo-bulk tissue data, it is critical to utilize proper mechanism to construct pseudo proportion for each sample as in Eq (3.3). Denote the pseudo proportion matrix as ˆP, then we require the weak identification condition ˆP, i.e., it contains sufficient amount of columns ˆP(j) that are close enough to eγRk. To do so, we propose to use multivariable Dirichlet distribution:

    Consider the n column vectors in matrix ˆP as the n realizations of the random vector ˆpRk. With parameter α1,α2,...αk>0, the probability density function with respect to Lebesgue measure on the Euclidean space Rk1 is given as

    f(ˆp1,ˆp2,...ˆpk;α1,α2,...αk)=1S(α)Πkγ=1ˆpαγ1γ, where α=(α1,α2,...αk) (3.4)

    where {ˆpγ} belong to the standard k1 simplex (because of the sum-to-one condition) and the S(α) is the normalizing constant defined in terms of the gamma function, i.e.,

    S(α)=Πkγ=1Γ(αγ)Γ(kγ=1). (3.5)

    It is well known that the mean of the variable is E[ˆpγ]=αγ/α0, where α0=kγ=1αγ, and variance and covariances of the random vector are

    Var[ˆpγ]=αγ(α0αγ)α20(α0+1),and Cov[ˆpγ,ˆpγ]=αγαγα20(α0+1), (3.6)

    so by choosing small and equally valued αγ, one can obtain smaller value of α0, hence widely and evenly scattered vectors in the k1 simplex.

    The regularity of matrix ˆP can be investigated by the condition number of ˆPˆPT: if ˆP contains correlated or extremely small rows, matrix ˆPˆPT will admit large condition number. Since Var[ˆpγ]=E[ˆp2γ]E[ˆpγ]2 and Cov[ˆpγ,ˆpγ]=E[ˆpγˆpγ]E[ˆpγ]E[ˆpγ], it is easy to derive

    E[ˆp2γ]=αγ(αγ+1)α0(α0+1)and E[ˆpγˆpγ]=αγαγα0(α0+1). (3.7)

    Then the matrix of ˆPˆPT has the following entries when sample size is large enough:

    ˆPˆPT=c0[(α1α2αk)+(α1α2αk)(α1α2αk)] (3.8)

    where c0=n/α0(α0+1) and n is the sample size.

    Equation (3.8) serves as an indicator of how to choose parameters in the Dirichlet Distribution, in order to maintain a small condition number for ˆPˆPT. Actually, if α1=α2=...αk=α, then

    ˆPˆPT=c0α(Ik×k+α11T). (3.9)

    Notice that eigenvalues of 11T are (k,0,0,..0), hence κ(ˆPˆPT)=kα+1. Then, it is reasonable to choose a uniform vector α=(α,α,...α) with small value of α to maintain a small condition number. Further, we claim that for the mixed bulk tissue data (original plus pseudo bulk) in the proposed algorithm, the proportion matrix ˉP=[P,ˆP] is always regular. Indeed,

    ˉPˉPT=PPT+ˆPˆPT (3.10)

    has full rank, because there is a diagonal matrix in ˆPˆPT and the remaining matrix is positive semi-definite.

    In this section, we summarize and assemble all the components of the proposed method.

    According to Eq (3.6), pick small and equal values for parameters α, e.g., α1=α2=...αk=1.5 and use Dirichlet distribution (3.4) to generate n random vectors ˆpjRk,j=1,2,...n and belong to the standard k1 simplex. Assume that there are Mγ cells in the single cell data set for the γ-th cell type and let M=kγ=1|Mγ|. Denote ˆpγ,j as the γ-th entry of vector ˆpj, then we randomly choose Mˆpγ,j from the total Mγ (regardless which one is bigger) cells of the γ-th type with replacement. Then, the j-th pseudo bulk tissue sample, or column of ˆG, can be generated in Eq (3.2) as

    ˆG(j)=kγ=1Mˆpγ,jl=1gγ,l. (3.11)

    The subset of marker genes from the augmented data ˜G=[G,ˆG] can be identified by clustering rows of ˜G into k groups. The total N rows of ˜G are considered as the vertex set V={˜G(i)}Ni=1Rn in the similarity graph G=(V,E) in the spectral clustering method [64]. The non-negative weights ωij of edges E={eij} are calculated by a function Rn×RnR+ as,

    ωij=exp{deisen(˜G(i),˜G(j))2σ},1iN,1jN, (3.12)

    where

    deisen(˜G(i),˜G(j))=1<˜G(i),˜G(j)>|˜G(i)||˜G(j)| (3.13)

    is the Eisen cosine correlation distance quantifying the correlation between two vertices, and σ>0 is a parameter. With the adjacency matrix W=(ωij)RN×N and its degree matrix D=diag(d1,d2,...dN), where di=Nj=1ωij, different types of graph Laplacians (gL) of G can be defined, such as the unnormalized gL L=DW, symmetric normalized gL Lsym=ID12WD12, or random walk gL Lrw=ID1W. By examining the first a few eigenvectors of gL, rows of data ˜G will be clustered into k groups, and the set {Gr}kr=1 records row indices of ˜G in the corresponding clusters. The choice of different gLs depends on specific data applications [65]. For our problem, we use the normalized gL Lsym and the matrix W will be used in the next step. The other output for next step are disjoint sets Sγ,γ=1,2,...k, which represent the indices of marker genes for the γ-th cell type.

    minC0,P012[G,ˆG]C[P,ˆP]2F+F(C)+1T([P,ˆP]). (3.14)

    For model simplicity, we just use the Frobenius norm to measure the error between deconvoluted solutions and the given mixed data. Note that PRk×n corresponds to the original n samples while ˆPRk×n is for the pseudo bulk tissue. Both of them are subject to the sum-to-one conditions on columns, which is represented by the set T:={ZRk×n+,1Z=1} and the indicator function 1T as 1T(Z)=0 if ZT while 1T(Z)= otherwise.

    The regularization function F(C)=F1(C)+F2(C), where

    F1(C)=λ12kr=1iSrdeisen(C(i),er)2, (3.15)

    and

    F2(C)=λ22Nj=1Ni=1ωijdeisen(C(i),C(j))2, (3.16)

    with λ1 and λ2 being parameters. Recall that entry ωij>0 in the adjacency matrix W in (3.12) measure the correlations (larger value represents stronger correlation) between genes i and j in data G.

    The Alternating direction method of the multipliers (ADMM) [66] method is used to numerically solve this optimization problem, and there are two types of parameters involved in the the overall computational process. One set of parameters are in the ADMM algorithm itself: A detailed discussion can be found in our earlier work [60] and the same strategies of parameter tuning are used in this paper. On the other hand, there are two major parameters when generating and integrating the pseudo bulk tissue data. One is the vector α=(α1,α2,...αk) in the Dirichlet distribution for pseudo cellular proportions. As discussed in 5.1, small and equal values αl,l=1,2,...k contribute to both geometric structure requirement and computational stability. The other is the number n of pseudo bulk samples. It is obvious that if n is too small, there will not be enough information. But too large n is not necessary according to Definition 2.3, and it will also increase computational cost. The proper amount of pseudo bulk samples is not a trivial question: From Eq (3.1), we can see that the solution C is the "average" of the constituent GEP from the original bulk tissue samples and the reference scRNA. Thus, if the two sets of data are not consistent, the generated pseudo bulk samples will lead to inaccuracy of the actual GEP and hence the cellular proportion. In Section 5.3, we computationally address this issue with a specific dataset, while a systematic investigation will be our future work.

    In many studies and packages, the mostly used evaluation metrics is the row-wise (across samples) correlation of cell proportion matrices between computational results and ground ground truth. i.e., P(i) and P(i) for i=1,2,...k. This metric is problematic: One issue is that it only demonstrates the global similarity of the computational results with the ground truth for i-th cell type for all samples, but has little to tell about all cell types in a local specific sample. More importantly, according to the uniqueness of the NMF solution, each row of matrix P could be subject to a rescaling factor. It is easy to see that λiP(i) will have the same correlation coefficient with P(i) as P(i) does. Thus, even all corresponding rows in P and P are highly correlated, each column of P and P could be totally different, and this difference will lead to misinformation of cellular composition of each sample. In addition to the Pearson correlation, we also measure simulation accuracy (relative error) in terms of discrete L1 norm, or equivalently, the normalized mean absolute difference (mAD) and discrete L2 norm, or equivalently, the normalized root mean square deviation (RMSD). Similar metrics are also defined for columns of matrix C, i.e., GEP for each cell type.

    According to the theorem, the solution is subject to a rescaling factor, i.e., if C and P are the "true" solutions of the NMF problem, the best computational solution one can obtain is

    ˉC=CΩ1λ and ˉP=ΩλP, (4.1)

    where Ωλ=diag{λ1,λ2,...λk>0} is unknown. This unknown rescaling matrix may cause problematic estimations of cellular proportion. Another major contribution of this work is to perform a posterior error estimates of the NMF solution, which gives important insights on data preparation for realistic biological problems. We first try to explore the dependence of the solution of the non-convex problem on initial conditions, and then try to estimate the rescaling matrix. Note that the results in this section are base on the matrix decomposition form G=CP, where P is regular and without considering noises and nonnegativity constraint.

    First, we consider the converged solution.

    Theorem 4.1. Given initial condition P[0] with P[0]PT[0] being invertible and assume the true GEP and cell proportion as C and P, respectively, then the solutions of G=CP converge to C=CM1M21 and P=M2M11P, where M1=PPT[0] and M2=P[0]PT[0].

    Proof. With the simple iteration,

    C[0]=GPT[0](P[0]PT[0])1=CPPT[0](P[0]PT[0])1=CM1M12 (4.2)

    Then

    P[1]=(CT[0]C[0])1CT[0]G=(CT[0]C[0])1CT[0]CP=[(M12)TMT1CTCM1M12]1(M12)TMT1CTCP=M2M11P

    with such updated P[1], we compute

    C[1]=GPT[1](P[1]PT[1])1=CPPT(M11)T(M2M11PPT(M11)TMT2)1=CM1M12=C[0] (4.3)

    Based on Theorem 4.1, it is possible to combine the information of marker genes and properly choose initial conditions, so that the computational solutions can be estimated. At this moment, we assume the marker genes and their associated cell types are identified. Mathematically, set M is used to represent the indices of all marker genes and disjoint sets Sγ,γ=1,2,...k represent the indices of marker genes for the γ-th cell type. Note γSγ=M and denote Scγ=MSγ. Based on the property of marker genes, ideally we have c[Scγ],γ=0, where c[Scγ],γ means all entries of the γ-th column of C with row index in Scγ. Then for certain γ and γ, let i1,i2Sγ, and i3Sγ, it is easy to see

    G(i1)=ci1,γP(γ),G(i2)=ci2,γP(γ), and G(i3)=ci3,γP(γ) (4.4)

    From (4.4) we can conclude that rows of G (i1 and i2) from the marker genes of the same cell type will be linearly dependent since they are multiples of the same γ-th row of P. Further, for every 1γk, we take the mean of all rows G(i) over all iSγ, i.e., <G(i)>Sγ, and denote the result as the the γ-th row of a matrix ˜P, i.e.,

    ˜P(γ)=<G(i)>Sγ=<ci,γ>SγP(γ), or just˜P=ΩcP, (4.5)

    where Ωc=diag(A) and AT=[<ci,1>S1,...<ci,γ>Sγ,...<ci,k>Sk].

    If a good identification of marker gene sets can be made, the matrix ˜P can be computed directly from the original data G and this matrix is related to the true cellular proportion P with a diagonal rescaling factor Ωc. Although this factor unknown yet, the matrix ˜P can serve as an informative initial condition.

    First, if initial condition P[0] as the column normalization of ˜P with sum-to-on condition, i.e., P[0]=ΩcPΩ1a, where

    Ωa=diag(ATP(1),ATP(2),...,ATP(n)).

    Consequently, by Theorem 4.1, solutions converge to

    C=CM1M12=C(PΩ1aPT)(PΩ2aPT)1Ω1c. (4.6)

    and

    P=M2M11P=Ωc(PΩ2aPT)(PΩ1aPT)1P. (4.7)

    Theorem 4.2. Assume ideal data G has true solutions C and P, i.e., G=CP. Additionally, row index set Ω of C (also G) can be subdivided into k disjoint sets Sγ, i.e., Ω=γSγ and entry (of C) ci,γ=0 if iSγ. Given initial condition P[0]=ΩcPΩ1a and the solutions of G=CP converge to C and P, then the following estimations hold for each row of C

    β1C(i)2C(i)2βC(i)2 (4.8)

    and each column of P,

    β1P(j)2P(j)2βP(j)2, (4.9)

    where

    β=κ(PPT)(λcmaxλcmin)2, (4.10)

    κ(PPT) is the condition number of PPT and

    [λcmax,λcmin]=[max,min]{<ci,1>S1,...<ci,γ>Sγ,...<ci,k>Sk}.

    Proof. First, denote the matrix R=(PΩ1aPT)(PΩ2aPT)1Ω1c, hence C=CR and P=R1P. Then, we estimate R2 to see how it changes each row of the true GEP matrix C. Assuming matrix P has SVD P=UΣVT, where URk×k and VRn×n being unitary, while ΣRk×n and has minimum and maximum values as σmin and σmax, respectively, then

    R2=UΣVTΩ1aVΣTUT(UΣVTΩ2aVΣTUT)1Ω1c2=UΣVTΩ1aVΣTUTU(ΣT)1VTΩ2aVΣ1UTΩ1c2=UΣVTΩ1aV˜IVTΩ2aVΣ1UTΩ1c2

    Note that ˜IRn×n and ˜I=(Ik×k000) due to the definition of singular matrix Σ. Hence, using the property that unitary matrices preserve spectrum norm, we have the estimate

    R21λcminUΣVTΩ1aV˜IVTΩ2aVΣ1UT2 (4.11)
    =1λcminΣVTΩ1aV˜IVTΩ2aVΣ12 (4.12)
    σmaxσminλcminVTΩ1aV˜IVTΩ2aV2 (4.13)
    =σmaxσminλcminΩ1aV˜IVTΩ2a2 (4.14)

    Note that each entry of Ωa, i.e., ATP(j), is a convex linear combination of entries in A since P has sum-to-one columns. Then, exam the eigenvalues λa of Ωa, we have

    λcminλaminλamaxλcmax. (4.15)

    Actually, if P satisfies weak or strong separability conditions, we have λcmin=λamin and λcmax=λamax. Additionally, notice that cond(PPT)=σmax/σmin, we have the estimate

    R2=RT2cond(PPT)(λcmaxλcmin)2 (4.16)

    Finally, by definition of spectral norm of matrix,

    C(i)2=RTC(i)2C(i)2cond(PPT)(λcmaxλcmin)2, (4.17)

    and by the same argument, estimate in (4.16) is also true for R1. Hence, the other side inequality in (4.8) holds since C(i)2=(R1)TC(i)2.

    Estimates in Theorem 4.2 are based on noiseless situations, and the ideal case the marker genes can be easily recognized. Nevertherless, they shed light on some additional qualitative conditions for the constituent matrices C and P, about in what situation the numerical solutions are as close to the ground truth as possible.

    ● Quantity β in Eq (4.10) serves as an estimator of the scaling factor in the "unique" solution to how far the computational results are from the ground truth. It is reasonable that this quantity depends on the regularity of the constituent matrix P (condition number of PPT).

    ● Note that λcmin (λcmax) is the minimum (maximum) average of all marker gene expressions in the corresponding cell types, so a smaller ratio of λcmax/λcmin results in a smaller bound of R2.

    ● Based on the above information, we conclude that homogeneous average GEP in different cell types, and regular cellular composition in bulk samples are preferred. For the former, one can select subsets of marker genes in each tell type such that large differences in GEP average are avoided in practice. For the latter, it is the motivation for the idea of pseudo-bulk tissue data. Numerical simulation in the next section show that the solution accuracy is significantly enhanced, when the two strategies are taken.

    In this section, we present the numerical results of the proposed algorithms. First we use synthetic data to display the geometric structure of the Dirichlet distribution generated pseudo cellular proportion with different parameter setup. The properly designed data with weak solvability condition significantly enhance the solution quality of singular data. Then, the proposed algorithms are applied to three different datasets to validate their effectiveness and accuracy. At last, the resilience of the proposed algorithm is computationally analyzed. In contrast to the original GSNMF, the newly proposed pseudo-bulk tissue data augmented method is termed as GSNMF+.

    Figure 4 displays the geometric structures of some Dirichlet distributions generated cellular proportions with various parameters. For visualization convenience, the cell type dimension is taken as k=3, so that all data points can be shown in the 2-simplex (blue triangles). Six different sets of parameters α are used, including heterogeneous (first row) and homogeneous (second row) components. It can be concluded that the convex hull of data generated with heterogeneous parameters α can not include the second-order cone, which can be presented as the in-scripted circle (not shown) of the triangles, or, does not satisfy the sufficiently scattered condition. On the other hand, data generated from homogeneous parameters tend to distribute more evenly in the simplex. However, for α with larger values, data points will concentrate in the center of the simplex and thus their convex hull cannot cover the second-order cone (recall that larger value of α leads to larger condition number of PPT). Therefore, we will use homogeneous parameters with small to median value, such as |α|=0.2 or |α|=1 in simulations, where |α| is the l norm of α.

    Figure 4.  Geometric structures of simulated cellular proportions from Dirichlet distribution with various parameters.

    Figure 5 shows how the overall solvability condition is improved when the singular original cell proportions (red dots) are mixed with pseudo data (green dots). Figure 5(a) shows the data singularity that there is one cell type with extremely small proportion in samples, or a small row vector in P. As seen, all red dots are mostly distributed along one edge of the simplex while absent in one corner. The other singularity type is that two cell types may have correlated cellular proportions and the corresponding geometric structure is displayed in Figure 5(b). In this scenario, the data points align approximately in a line in the simplex, as shown by the red dots. In both cases, the geometric structures can be improved by mixing with the Dirichlet distribution generated pseudo data (green dots and with α=(1,1,1)).

    Figure 5.  Geometric structures of mixed data: Original singular cell proportions (red) and simulated regular cell proportions (green).

    Figure 6 displays the deconvolution results of cellular proportions in a set of synthetic data, in which the cellular composition have the aforementioned singularity. The first and second rows are for the original GSNMF and the proposed GSNMF+ methods, respectively. The synthetic bulk tissue data G is generated by multiplying a predefined matrix C and the above-mentioned pseudo cellular composition matrix P and adding some noises ϵ. There are in total one hundred synthetic samples. The ground truth of cellular composition is shown in orange triangle plots, while the deconvolution results are in blue stars. In the left panel, cell 2 rarely presents in all the samples, so its proportions are very small (below 0.1). The original GSNMF algorithm cannot obtain the accurate results for this cell type, the solution correlation is 0.74 and RMSD is 0.96. Moreover, the solution correlation for the corresponding column of matrix C is 0.52 and RMSD is 0.99. In contrast, the deconvolution accuracy is significantly enhanced by the GSNMF+ algorithm: for this cell type, the correlation is 0.96 and RMSD is 0.06. The accuracy for matrix C is also enhanced, with the correlation being 0.98 and RMSD being 0.13. Similar performance increase can also be found when data contains correlated cellular compositions, as shown in the right panel of Figure 6. In this case, cellular proportions for cell types 1 and 2 are designed to have a strong correlation. As consequences, deconvolution results for the two cell types by GSNMF are highly inaccurate. Solution correlation for cell 1 is 0.99, but RMSD is 2.21. For cell 2, errors in correlation and RMSD are both high, which are 0.56 and 2.22, respectively. Estimation for matrix C is totally wrong and the RMSD is 5.89. When GSNMF+ is used, solution accuracy for all cell types are increased, with correlations being all beyond 0.99, and the RMSD being 0.07, 0.05, and 0.12, respectively. Correspondingly, RMSD of computing matrix C in overall decreases to 0.19.

    Figure 6.  Deconvolution results of cellular proportion from singular data by GSNMF and GSNMF+.

    Table 2 displays the performance of GSNMF+ under various levels of noise-to-data ratios (NDRs), where RMSD of C, P, and relative residues GCPF/GF are displayed. As indicated by the table, errors in matrix C increase more obviously when more noises present (larger NDR). On the contrary, computation of matrix P seems less vulnerable to noise levels.

    Table 2.  Quantitative results of the GSNMF+ with different noise to data ratio (NDR).
    NDR RMSD of C RMSD of P Relative residue
    0.081 0.1802 0.0888 0.0793
    0.165 0.2007 0.094 0.1743
    0.346 0.2372 0.1045 0.4024

     | Show Table
    DownLoad: CSV

    We would like to point out that handling noises is a common challenge in all areas of data sciences, but it is not the major scope of our current work. Obtaining meaningful biological solutions from the ill-posed NMF model is still an open problem. Therefore, the noise levels we used in all synthetic data are low to medium, and only the common Gaussian noise is used.

    The first set of biological data used to validate the proposed algorithms is GSE19830 [67], which was obtained from tissue samples of the brain, liver, and lung of a single rat using expression arrays (Affymetrix). Homogenates of these three types of tissues were mixed together at the scRNA homogenate level with a known proportion, and then the gene expression pattern of every mixed sample was measured. The GSE19830 data set mimics the common scenario of heterogeneous biological samples. Both the relative frequency of the component and marker genes for each cell type are clearly recognizable, so it has been used as a preliminary test in many literature [25,31] to validate computational algorithms. For this dataset, we know cell type k=3 and tissue sample number n=33. After necessary data preprocessing to exclude obvious outliners (row norm, column norm, etc.), we take N=10,000 out of 12,000 total genes.

    Since the dimension k is low, it is easy to visualize the data features as shown in Figure 7. First, the bulk-tissue data can be clearly categorized into three clusters corresponding to the three cell types from distinct organs, as shown in Figure 7(a), and hence marker genes from each type can be easily identified. Second, the constituent (ground truth) cellular proportions are well designed to avoid correlation. As displayed in Figure 7(b), the proportion data is sufficiently scattered in the simplex.

    Figure 7.  Visualization of data features for GSE19830.

    Figure 8 shows such a case with orange triangle lines as ground truth of cellular proportion and blue stars as the corresponding deconvolution results. Our previous GSNMF package obtained a highly accurate solution as shown in Figure 8(a), as the solution correlations being 0.96,0.98 and 0.99, respectively. However, we also observed that from sample to sample, the simulations always overestimate the proportion for one cell type (brain) while underestimating another (liver). It is hypothesized that such a systematic error is due to the unknown rescaling factor, so the proposed GSNMF+ algorithm is applied to the same dataset. With the estimated rescaling factors and adjusting strategy, it can be concluded that the accuracy is further improved, as shown in Figure 8(b). It is important to note that this improvement, or rescaling, is in a global sense (across all samples), as the errors in samples 68 are not significantly reduced.

    Figure 8.  Comparisons of simulation results (cellular proportion) to ground truth for dataset GSE19830.

    Qualitative errors, or sample-wise l2 norm of errors in cellular proportions are displayed in Figure 9. The orange bars are for the original GSNMF algorithm while the blue bars are for the GSNMF+ algorithm developed in this paper. Computational errors are expressed in log form and the improvement in accuracy is obvious.

    Figure 9.  Qualitative errors of the deconvolution results in cellular proportions for dataset GSE19830.

    The second dataset is the brain tissue data GSE67835, in which samples consists of six cell types (k=6) including astrocytes, endothelial, microglia, neurons, and oligodendrocytes. There are two sets of samples, one of which consists of extremely small amount of the microglia cells (Case Ⅰ), while the other has small proportion of endothelial cells (Case Ⅱ). These singular data pose great challenges and significantly decrease computational accuracy. Figure 10 displays the comparison of deconvolution results for these two cases (top and bottom rows) from GSNMF and GSNMF+, respectively. The orange curves are ground truth while blue curves are deconvolution results. It can be seen that GSNMF significantly overestimates cellular proportion for microglia in Case Ⅰ while for endothelial cells in Case Ⅱ. In contrast, the computational results from GSNMF+ are more accurate.

    Figure 10.  Performance comparisons of GSNMF and GSNMF+ for GSE67835.

    It is reasonable to investigate the algorithm performance when different amounts of pseudo-bulk tissue data are augmented, or quantity of scRNA-seq data are perturbed. Figure 11 displays solution accuracy for two datasets, GSE67835 and GSE81608, when different amount of pseudo bulk samples are used in the GSNMF+ algorithm. In these settings, the original bulk tissue data consists of 50 singular samples (denoted as 50S, constituent cellular composition with extremely small variation in Dirichlet distribution), while different amount of regular samples constituent cellular composition with large variation) are augmented in the algorithms, ranging from 10L to 300L. For example, label "50S+20L" means the mixture of 50 original bulk tissue samples with constituent cellular proportions of small variance and 20 pseudo-bulk tissue samples with constituent cellular proportions of large variance. Solution accuracy is presented in terms of Pearson correlation, RMSD, and mAD. It can be concluded that "regular" samples can significantly enhance deconvolution accuracy. While with increased amount of "regular" samples, the solution accuracy fluctuates in a certain degree, but no obvious improvement or decay is observed. This phenomenon echoes the theoretical results about the weak identifiability condition that only a few columns of P are needed to be close to each fundamental basis, in order for cone(P) to cover the second-order cone in Rn+. Therefore, too many pseudo-bulk tissue samples do not provide additional benefits.

    Figure 11.  Algorithm accuracy for two biological datasets with different amount of pseudo-bulk data.

    Another scenario in real applications is that the reference scRNA-seq data used to construct pseudo-bulk tissues may not be consistent with constituent cell specific GEP in the bulk tissues. To investigate the impact of inconsistent scRNA-seq data in the proposed algorithm, we conduct various model resilience analyses by generating pseudo-bulk tissue samples using artificially modified reference scRNA-seq data. These scRNA-seq data modifications include mean shifting (MS), truncation (TR), and factoring (FA), which are specifically defined as the following:

    Mean Shifting (MS): Each single cell profile of a given type was augmented by adding 10%, 30%, 50%, or 70% of the average expression levels of that cell type.

    Truncation (TR): Single cells were selectively excluded from the bulk tissue data generation. We removed the top or bottom 10% of cells in each cell type.

    Factoring (FA): A scaling factor was applied to all gene expression levels, allowing for both upward and downward adjustments. The scaling factors used were 0.4, 0.8, 1.2, and 1.8.

    All manipulative techniques were applied to the reference set, which consisted of 200 larger samples. As shown in Figure 12, these manipulations did not significantly influence the final results. All the results are close to the original estimation, indicating that changes to the reference set are also effective.

    Figure 12.  Algorithm performance when reference scRNA-seq data is perturbed in pseudo-bulk tissue data.

    In Table 3, we list the comparison of GSNMF+ with some popular NMF related deconvolution algorithms: MuSic [23], CIBERSORTx [24], and LinSeed [25] on data GSE67835. Among these algorithms, MuSic and CIBERSORTx are reference (scRNA-seq) based methods, GSNMF+ only uses reference to reduce data singularity, while LinSeed does not use reference data at all.

    Table 3.  Comparison of methods with different reference data.
    Correlation
    Method MuSiC CIBERSORTx LinSeed GSNMF+
    Original 0.9943 0.9643 0.8777 0.8712
    snRNAseq 0.8538 0.7435 0.8888 0.8636
    Simulator 0.2727 0.035 0.8749 0.9859
    RMSD
    Original 0.0416 0.1328 0.2025 0.1967
    snRNAseq 0.2399 0.2486 0.1942 0.1946
    Simulator 0.3806 0.4237 0.2049 0.0606

     | Show Table
    DownLoad: CSV

    As mentioned earlier, reference based methods (partial deconvolution) provide higher accuracy than reference free methods (complete deconvolution), with the assumption that the reference is consistent with the constituent GEP in the bulk tissue. However, this assumption is usually violated in real applications. To test these scenarios, we take the references as the one from GSE67835 (the "Original"), a single-nucleus RNA-seq (the "snRNA-seq") dataset of the human brain from [68], and one generated from a commonly used, real scRNA-seq data based probability model, scDesign simulator [69] (the "Simulator"). The top and bottom halves of the table display the correlation and RMSD of cellular compositions obtained from those methods, respectively. It is obvious that MuSic and CIBERSORTx outperform the other reference-free methods in both error metrics, if the reference is consistent. This is expected because with good reference, the NMF method is just a least square/linear regression model. However, if the reference used is different from the constituent GEP in the bulk tissue data, such as the snRNAseq or Simulators, the solution accuracy significantly decreases. On the other hand, solution accuracy from Linseed does not have obvious changes since this method does not use reference at all. For our proposed GSNMF+, the reference is used for handling the data singularity, it is essentially still a NMF problem, so the solution accuracy does not decrease, or sometimes increases when the reference is changed.

    Complete deconvolution is a type of computational approaches to decompose experimentally more available, reliable, and relatively inexpensive bulk-tissue RNA-seq data into cell type specific GEP and cellular composition, which are crucial datasets to DE analysis in order to reveal novel insights into genes and pathways for human diseases. Mathematical and statistical methods have been developed to perform complete deconvolution, and nonnegative matrix factorization (NMF) is the fundamental model on the mathematical side. However, it is well-known that NMF is an ill-posed problem due to its non-separable solutions. Many efforts have been taken to improve solution accuracy and promising computational tools are proposed, but comprehensive mathematical investigation of the ill-posedness is under-developed.

    In this paper, we focused on the solvability conditions in NMF theory on the bulk-tissue data. Among the two constituent matrices, the cell specific GEP matrix C satisfy the weak identifiability conditions due to the existence of marker genes, so we can identify the geometric structures of marker genes from bulk-tissue data and preserve the structure when solving for C. On the other hand, the cellular abundance matrix P usually violates the identifiability conditions because of possible rare or correlated presented cell proportions. To address this issue, we used Dirichlet distribution to generate pseudo cellular proportion and then constructed the corresponding pseudo bulk tissue data with available single-cell RNA-seq data. As results, our developed GSNMF algorithms are applied to the hybrid bulk-tissue data (original plus pseudo), in which the two constituent matrices satisfy the weak identifiability condition. Even though the solvability conditions are met, the "unique" solution of NMF is still subject to an arbitrary rescaling matrix. We tackled this challenge by choosing a specific type of initial conditions and estimating the possible difference to the theoretical solutions. Although this investigation can only be established based on the ideal noise-free scenarios, it is the first time this solution ambiguity is studied, and it shed a light on qualitatively improving the deconvolution results. Our proposed algorithm pipeline has been tested on several different datasets, and significantly improves solution accuracy from bulk-tissue data with singular cellular composition.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by US National Institute Of General Medical Sciences of the National Institutes of Health under Award Number R01GM148971. The author NET is supported by RF AG051504, U01 AG046139, R01 AG061796, U19 AG074879 to NET and Alzheimer's Association Zenith Fellows Award (ZEN-22-969810).

    The authors declare there is no conflict of interest.



    [1] M. Ali, S. Okabe, Anammox-based technologies for nitrogen removal: advances in process start-up and remaining issues, Chemosphere, 141 (2015), 144–153. https://doi.org/10.1016/j.chemosphere.2015.06.094 doi: 10.1016/j.chemosphere.2015.06.094
    [2] H. E. Al-Hazmi, D. Grubba, J. Majtacz, A. Ziembińska-Buczyńska, J. Zhai, J. Mąkinia, Combined partial denitrification/anammox process for nitrogen removal in wastewater treatment, J. Environ. Chem. Eng., 11 (2023), 108978. https://doi.org/10.1016/j.jece.2022.108978 doi: 10.1016/j.jece.2022.108978
    [3] S. Jenni, S. E. Vlaeminck, E. Morgenroth, K. M. Udert, Successful application of nitritation/anammox to wastewater with elevated organic carbon to ammonia ratios, Water Res., 49 (2014), 316–326. https://doi.org/10.1016/j.watres.2013.10.073 doi: 10.1016/j.watres.2013.10.073
    [4] J. G. Kuenen, Anammox bacteria: from discovery to application, Nat. Rev. Microbiol., 6 (2008), 320–326. https://doi.org/10.1038/nrmicro1857 doi: 10.1038/nrmicro1857
    [5] H. Siegrist, D. Salzgeber, J. Eugster, A. Joss, Anammox brings WWTP closer to energy autarky due to increased biogas production and reduced aeration energy for N-removal, Water Sci. Technol., 57 (2008), 383–388. https://doi.org/10.2166/wst.2008.048 doi: 10.2166/wst.2008.048
    [6] Y. J. Feng, S. K. Tseng, T. H. Hsia, C. M. Ho, W. P. Chou, Partial nitrification of ammonium-rich wastewater as pretreatment for anaerobic ammonium oxidation (anammox) using membrane aeration Bioreactor, J. Biosci. Bioeng., 104 (2007), 182–187. https://doi.org/10.1263/jbb.104.182 doi: 10.1263/jbb.104.182
    [7] S. Lackner, H. Horn, Comparing the performance and operation stability of an SBR and MBBR for single-stage nitritation-anammox treating wastewater with high organic load, Environ. Technol., 34 (2013), 1319–1328. https://doi.org/10.1080/09593330.2012.746735 doi: 10.1080/09593330.2012.746735
    [8] M. Kornaros, S. N. Dokianakis, G. Lyberatos, Partial nitrification/denitrification can be attributed to the slow response of nitrite oxidizing bacteria to periodic anoxic disturbances, Environ. Sci. Technol., 44 (2010), 7245–7253. https://doi.org/10.1021/es100564j doi: 10.1021/es100564j
    [9] J. Li, M. Feng, S. Zheng, W. Zhao, X. Xu, X. Yu, The membrane aerated biofilm reactor for nitrogen removal of wastewater treatment: Principles, performances, and nitrous oxide emissions, Chem. Eng. J., 460 (2023), 141693. https://doi.org/10.1016/j.cej.2023.141693 doi: 10.1016/j.cej.2023.141693
    [10] M. Lan, P. Yang, L. Xie, Y. Li, J. Liu, P. Zhang, et al., Start-up and synergistic nitrogen removal of partial nitrification and anoxic/aerobic denitrification in membrane aerated biofilm reactor, Environ. Res., 214 (2022), 113901. https://doi.org/10.1016/j.envres.2022.113901 doi: 10.1016/j.envres.2022.113901
    [11] M. Li, C. Du, J. Liu, X. Quan, M. Lan, B. Li, Mathematical modeling on the nitrogen removal inside the membrane-aerated biofilm dominated by ammonia-oxidizing archaea (AOA): effects of temperature, aeration pressure and COD/N ratio, Chem. Eng. J., 338 (2018), 680–687. https://doi.org/10.1016/j.cej.2018.01.040 doi: 10.1016/j.cej.2018.01.040
    [12] S. Matsumoto, A. Terada, S. Tsuneda, Modeling of membrane-aerated biofilm: effects of C/N ratio, biofilm thickness and surface loading of oxygen on feasibility of simultaneous nitrification and denitrification, Biochem. Eng. J., 37 (2007), 98–107. https://doi.org/10.1016/j.bej.2007.03.013 doi: 10.1016/j.bej.2007.03.013
    [13] B. J. Ni, A. Joss, Z. Yuan, Modeling nitrogen removal with partial nitritation and anammox in one floc-based sequencing batch reactor, Water Res., 67 (2014), 321–329. https://doi.org/10.1016/j.watres.2014.09.028 doi: 10.1016/j.watres.2014.09.028
    [14] O. Wanner, H. H. Eberl, M. C. M. Van Loosdrecht, E. Morgenroth, D. R. Noguera, C. Picioreanu, et al., Mathematical modelling of biofilms, IWA Publishing, 2006. https://doi.org/10.2166/9781780402482 doi: 10.2166/9781780402482
    [15] A. G. Dorofeev, Yu. A. Nikolaev, M. N. Kozlov, M. V. Kevbrina, A. M. Agarev, A. Yu. Kallistova, et al., Modeling of anammox process with the biowin software suite, Appl. Biochem. Microbiol., 53 (2017), 78–84. https://doi.org/10.1134/S0003683817010100 doi: 10.1134/S0003683817010100
    [16] X. D. Hao, J. J. Heijnen, M. C. M. van Loosdrecht, Model-based evaluation of temperature and inflow variations on a partial nitrification–ANAMMOX biofilm process, Water Res., 36 (2002), 4839–4849. https://doi.org/10.1016/S0043-1354(02)00219-1 doi: 10.1016/S0043-1354(02)00219-1
    [17] Y. Liu, J. Sun, L. Peng, D. Wang, X. Dai, B. J. Ni, Assessment of heterotrophic growth supported by soluble microbial products in anammox biofilm using multidimensional modeling, Sci. Rep., 6 (2016), 27576. https://doi.org/10.1038/srep27576 doi: 10.1038/srep27576
    [18] T. Liu, J. Guo, S. Hu, Z. Yuan, Model-based investigation of membrane biofilm reactors coupling anammox with nitrite/nitrate-dependent anaerobic methane oxidation, Environ. Int., 137 (2020), 105501. https://doi.org/10.1016/j.envint.2020.105501 doi: 10.1016/j.envint.2020.105501
    [19] F. Russo, A. Tenore, M. R. Mattei, L. Frunzo, Multiscale modelling of the start-up process of anammox-based granular reactors, Math. Biosci. Eng., 19 (2022), 10374–10406. https://doi.org/10.3934/mbe.2022486 doi: 10.3934/mbe.2022486
    [20] H. J. Eberl, D. F. Parker, C. M. Van Loosdrecht, A new deterministic spatio-temporal continuum model for biofilm development, J. Theor. Med., 3 (2001), 161–175. https://doi.org/10.1080/10273660108833072 doi: 10.1080/10273660108833072
    [21] B. Emerenini, B. A. Hense, C. Kuttler, H. J. Eberl, A mathematical model of quorum sensing induced biofilm detachment, PlosOne, 10 (2015), e0132385. https://doi.org/10.1371/journal.pone.0132385 doi: 10.1371/journal.pone.0132385
    [22] M. Ghasemi, H. J. Eberl, Time adaptive numerical solution of a highly degenerate diffusion-reaction biofilm model based on regularisation, J. Sci. Comput., 74 (2018), 1060–1090. https://doi.org/10.1007/s10915-017-0483-y doi: 10.1007/s10915-017-0483-y
    [23] C. Pellicer-Nácher, B. F. Smets, Structure, composition, and strength of nitrifying membrane-aerated biofilms, Water Res., 57 (2014), 151–161. https://doi.org/10.1016/j.watres.2014.03.026 doi: 10.1016/j.watres.2014.03.026
    [24] I. Klapper, B. Szomolay, An exclusion principle and the importance of mobility for a class of biofilm models, Bull. Math. Biol., 73 (2011), 2213–2230. https://doi.org/10.1007/s11538-010-9621-5 doi: 10.1007/s11538-010-9621-5
    [25] M. Ghasemi, S. Chang, S. Sivaloganathan, Exploring aeration strategies for enhanced simultaneous nitrification and denitrification in membrane aerated bioreactors: a computational approach, Bull. Math. Biol., 86 (2024), 117. https://doi.org/10.1007/s11538-024-01343-8 doi: 10.1007/s11538-024-01343-8
    [26] M. Ghasemi, S. Chang, H. J. Eberl, S. Sivaloganathan, Simulation of composition and mass transfer behaviour of a membrane biofilm reactor using a two dimensional multi-species counter-diffusion model, J. Mem. Sci., 618 (2021), 118636. https://doi.org/10.1016/j.memsci.2020.118636 doi: 10.1016/j.memsci.2020.118636
    [27] H. J. Eberl, R. Sudarsan, Exposure of biofilms to slow flow fields: the convective contribution to growth and disinfection, J. Theor. Biol., 253 (2008), 788–807. https://doi.org/10.1016/j.jtbi.2008.04.013 doi: 10.1016/j.jtbi.2008.04.013
    [28] M. R. Frederick, C. Kuttler, B. A. Hense, H. J. Eberl, A mathematical model of quorum sensing regulated EPS production in biofilm communities, Theor. Biol. Med. Mod., 8 (2011), 8. https://doi.org/10.1186/1742-4682-8-8 doi: 10.1186/1742-4682-8-8
    [29] M. Ghasemi, S. Sonner, H. J. Eberl, Time adaptive numerical solution of a highly non-linear degenerate cross-diffusion system arising in multi-species biofilm modelling, Eur. J. Appl. Math., 29 (2018), 1035–1061. https://doi.org/10.1017/S0956792518000554 doi: 10.1017/S0956792518000554
    [30] M. A. Efendiev, S. Zelik, H. J. Eberl, Existence and longtime behavior of a biofilm model, Commun. Pure Appl. Anal., 8 (2009), 509–531. https://doi.org/10.3934/cpaa.2009.8.509 doi: 10.3934/cpaa.2009.8.509
    [31] H. J. Eberl, L. Demaret, A finite difference scheme for a degenerated diffusion equation arising in microbial ecology, Electron. J. Differ. Equ., 15 (2007), 77–95.
    [32] H. J. Eberl, H. Khassehkhan, L. Demaret, A mixed-culture model of a probiotic biofilm control system, Comput. Math. Meth. Med., 11 (2010), 99–118. https://doi.org/10.1080/17486700902789355 doi: 10.1080/17486700902789355
    [33] H. J. Eberl, M. S. Collinson, A modeling and simulation study of siderophore mediated antagonism in dual-species biofilms, Theor. Biol. Med. Model., 6 (2009), 30. https://doi.org/10.1186/1742-4682-6-30 doi: 10.1186/1742-4682-6-30
    [34] J. Rang, Improved traditional Rosenbrock-Wanner methods for stiff ODEs and DAEs, J. Comput. Appl. Math., 286 (2015), 128–144. https://doi.org/10.1016/j.cam.2015.03.010 doi: 10.1016/j.cam.2015.03.010
    [35] M. Ghasemi, H. J. Eberl, Extension of a regularization based time-adaptive numerical method for a degenerate diffusion-reaction biofilm growth model to systems involving quorum sensing, Proc. Comput. Sci., 108 (2017), 1893–1902. https://doi.org/10.1016/j.procs.2017.05.089 doi: 10.1016/j.procs.2017.05.089
    [36] M. Ghasemi, S. Chang, S. Sivaloganathan, Modeling and simulation study of simultaneous nitrification-denitrification in membrane aerated bioreactor, J. Member. Sci., 668 (2023), 121210. https://doi.org/10.1016/j.memsci.2022.121210 doi: 10.1016/j.memsci.2022.121210
    [37] M. Ghasemi, S. Chang, S. Sivaloganathan, Unraveling the role of inert biomass in membrane aerated biofilm reactors for simultaneous nitrification and denitrification, Math. Appl. Sci. Eng., 5 (2024), 120–148. https://doi.org/10.5206/mase/17134 doi: 10.5206/mase/17134
    [38] C. S. Laspidou, B. E. Rittmann, Modeling the development of biofilm density including active bacteria, inert biomass, and extracellular polymeric substances, Water Res., 38 (2004), 3349–3361. https://doi.org/10.1016/j.watres.2004.04.037 doi: 10.1016/j.watres.2004.04.037
    [39] M. Seifi, M. H. Fazaelipoor, Modeling simultaneous nitrification and denitrification (SND) in a fluidized bed biofilm reactor, Appl. Math. Model., 36 (2012), 5603–5613. https://doi.org/10.1016/j.apm.2012.01.004 doi: 10.1016/j.apm.2012.01.004
    [40] M. González-Brambilaa, O. Monroya, F. López-Isunzab, Experimental and theoretical study of membrane-aerated biofilm reactor behavior under different modes of oxygen supply for the treatment of synthetic wastewater, Chem. Eng. Sci., 61 (2005), 5268–5281. https://doi.org/10.1016/j.ces.2006.03.049 doi: 10.1016/j.ces.2006.03.049
    [41] Y. Liu, T. Zhu, S. Ren, T. Zhao, H. Chai, Y. Xu, et al., Contribution of nitrification and denitrification to nitrous oxide turnovers in membrane-aerated biofilm reactors (MABR): a model-based evaluation, Sci. Total Environ., 806 (2022), 151321. https://doi.org/10.1016/j.scitotenv.2021.151321 doi: 10.1016/j.scitotenv.2021.151321
    [42] J. Majtacz, D. Grubba, K. Czerwionka, Application of the anammox process for treatment of liquid phase digestate, Water, 12 (2020), 2965. https://doi.org/10.3390/w12112965 doi: 10.3390/w12112965
    [43] W. Lauterborn, H. Bolle, Experimental investigations of cavitation-bubble collapse in the neighbourhood of a solid boundary, J. Fluid Mech., 72 (1975), 391–399. https://doi.org/10.1017/S0022112075003448 doi: 10.1017/S0022112075003448
    [44] D. Lu, H. Bai, F. Kong, S. N. Liss, B. Liao, Recent advances in membrane aerated biofilm reactors, Crit. Rev. Environ. Sci. Technol., 51 (2020), 649–703. https://doi.org/10.1080/10643389.2020.1734432 doi: 10.1080/10643389.2020.1734432
    [45] E. Syron, H. Kelly, E. Casey, Studies on the effect of concentration of a self-inhibitory substrate on biofilm reaction rate under co-diffusion and counter-diffusion configurations, J. Member. Sci., 335 (2009), 76–82. https://doi.org/10.1016/j.memsci.2009.02.038 doi: 10.1016/j.memsci.2009.02.038
    [46] J. Drewnowski, A. Remiszewska-Skwarek, S. Duda, G. Łagód, Aeration process in bioreactors as the main energy consumer in a wastewater treatment plant. Review of solutions and methods of process optimization, Processes, 7 (2019), 311. https://doi.org/10.3390/pr7050311 doi: 10.3390/pr7050311
    [47] L. Zhang, Y. Narita, L. Gao, M. Ali, M. Oshiki, S. Okabe, Maximum specific growth rate of anammox bacteria revisited, Water Res., 116 (2017), 296–303. https://doi.org/10.1016/j.watres.2017.03.027 doi: 10.1016/j.watres.2017.03.027
    [48] J. Peeters, J. Moran, ZeeLung MABR simple, low-energy process intensification, Intensification of Resource Recovery (IR2), Water Environment Federation (WEF), Manhattan College, New York, 2017.
  • This article has been cited by:

    1. Ricardo Andrés Gómez-Moncada, Andrés Mora, Marcela Jaramillo, Mauricio Parra, Andrés Martínez, Henry Mayorga, Jorge Sandoval-Muñoz, Arcadio Cuy-Cipamocha, Davis Suárez, Jose Sandoval-Ruiz, Rolando Ramírez, Robert Márquez, Ricardo Bueno, Implications of groundwater flow on preservation of heavy and extra-heavy oil accumulations in Southern Llanos Basin, Colombia, 2022, 119, 08959811, 104013, 10.1016/j.jsames.2022.104013
    2. Saeed Khezerloo‐ye Aghdam, Alireza Kazemi, Mohammad Ahmadi, Performance evaluation of different types of surfactants to inhibit clay swelling during chemical enhanced oil recovery, 2023, 0008-4034, 10.1002/cjce.25028
    3. T. Koksalan, K. A. Khan, R. Alhadhrami, A. Salahuddin, E. Warsito, T. Ruble, 2024, Cracking the Code: High Oil Saturation Without Oil Production? A Game-Changing Novel Geochemical Methodology for Assessing Total, Mobile/Immobile and EOR Oil; A Case Study from Abu Dhabi, UAE, 10.2118/222868-MS
    4. T. Koksalan, K. A. Khan, R. Alhadhrami, A. Salahuddin, E. Warsito, 2025, Total, Mobile, and Non-producible Oil: A Game-Changing Geochemical Method for Optimum Perforation Interval Selection and Eor Strategies, 10.2118/224685-MS
  • Reader Comments
  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(208) PDF downloads(49) Cited by(0)

Figures and Tables

Figures(10)  /  Tables(5)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog