
RNA-seq data plays a crucial role in advancing our understanding of biological processes [1,2,3], disease mechanisms [4,5], and therapeutic interventions [6,7] by providing comprehensive insights into tissue-level gene expression patterns [8,9,10]. With the rise of single-cell RNA sequencing (scRNA-seq) technologies, direct genome-wide measurement of the transcriptome in individual cells, and characterization of their heterogeneity, is now possible. However, scRNA-seq analysis is greatly hampered by high cost, limited availability of high-quality samples, and technical biases [11,12,13]. Compared to scRNA-seq, bulk tissue RNA-seq is more cost-effective for large-scale studies involving numerous samples. It offers a balance between depth of coverage and cost, making it suitable for studies requiring analysis of gene expression across multiple conditions or time points. However, tissues are composed of diverse cell types, each with its own gene expression signature, and this complex nature of bulk tissue samples remains a major obstacle. Bulk tissue RNA-seq, although it captures the cumulative gene expression profiles (GEP) of all cells within a sample, cannot distinguish individual cell types and contains no information about cellular composition [14,15,16]. For example, features of Alzheimer's disease include amyloid-beta plaques and neurofibrillary tangles, along with neuronal loss and gliosis in the affected brain regions; thus, transcriptome-wide GEPs differ between brain tissue from patients and from neuropathologically normal controls. It is critical to recognize such differences in order to discover genes and biological pathways that are perturbed in and/or lead to Alzheimer's disease [8,10,17,18,19]. One of the important tools to unveil these differences is differential expression (DE) analysis, which reveals novel insights into the genes and pathways involved and is potentially helpful for identifying drug targets and therapeutics.
However, whether disease-associated GEP changes in brain tissues are due to changes in the cellular composition of tissue samples, or due to GEP changes in specific cells, e.g., central nervous system cells, remains a fundamental knowledge gap for DE analysis.
To leverage bulk-tissue data while obtaining information from specific cell types, deconvolution [6,17,19,20,21,22] of bulk-tissue RNA-seq data has emerged as a powerful computational approach in the field of transcriptomics, offering insight into the cellular composition of complex tissue samples. Deconvolution methods aim to computationally dissect bulk-tissue RNA-seq data and estimate the relative proportions of different cell types contributing to the observed expression patterns.
In recent years, numerous deconvolution techniques have been developed, and they can be classified from different perspectives. In terms of mathematical methodology, they can be grouped as linear regression-based methods [23,24], non-negative matrix factorization (NMF) methods [25], Bayesian approaches [26], or various machine learning approaches [27,28]. Depending on whether other data sources are used, they can be classified as reference-based (partial deconvolution) [23,24,26,29] or reference-free (complete deconvolution) [25,30,31] methods. Further, different types of references can be used, such as marker genes [32], scRNA-seq [23], or gene expression variability [29]. Additionally, deconvolution methods may provide different outputs: some produce quantitative cellular fractions [23,24,26] while others only provide scores [27,28]. Some offer in silico purification of sample-wise GEP [24,26,29], while the rest do not. These methods vary in their assumptions, computational complexity, and accuracy. Comprehensive reviews can be found in [33,34]. Although these models have been successfully applied to many biological problems, the underlying mathematical challenges, especially in complete deconvolution, are still open problems. In our work, we focus on NMF-based complete deconvolution models.
NMF has been studied extensively for various types of data in other fields, such as spectral unmixing in analytical chemistry [35], remote sensing [36], image processing [37], and topic mining in machine learning [38]. The complete deconvolution problem [17,19,20,21,22] of bulk-tissue RNA-seq is illustrated in Figure 1 as the NMF model [35,39,40]. In this work, the expression of a gene in a tissue sample is assumed to be the linear combination of its expressions in the constituting cell types, weighted by the cell proportions, i.e.,
\[
g_{ij} = \sum_{\gamma=1}^{k} c_{i,\gamma}\, p_{\gamma,j}, \qquad 1 \le i \le N,\ 1 \le j \le n, \tag{1.1}
\]
where g_{ij} and c_{i,γ} are the GEPs of gene i in the j-th sample and in the γ-th cell type, respectively, while p_{γ,j} is the proportion of the γ-th cell type in the j-th sample. For the total number of genes N, total number of samples n, and number of cell types k, we usually have N ≫ n > k. In matrix form, Equation (1.1) is written as G = CP with all matrix entries being non-negative. Given data G, both variables C and P are to be solved for simultaneously. This is termed NMF-based reference-free complete deconvolution. Note that in many applications, matrix C is assumed to be known from available sources such as scRNA-seq data, and only P is computed. This is called reference-based partial deconvolution, which is just a least squares/linear regression problem and will not be termed an NMF model. Partial deconvolution is always well-posed, and it is generally more accurate, provided the reference C is accurate. However, when the reference is not consistent with the true value of C, the deconvolution result is not reliable.
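As a concrete illustration of model (1.1), and of reference-based partial deconvolution as a least-squares problem, the following sketch builds a synthetic noise-free mixture and recovers P by column-wise non-negative least squares (all sizes and data are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
N, n, k = 200, 10, 3                        # genes, samples, cell types

C = rng.gamma(2.0, 1.0, size=(N, k))        # known reference GEP (e.g., from scRNA-seq)
P = rng.dirichlet(np.ones(k), size=n).T     # true proportions, columns sum to one
G = C @ P                                   # bulk mixture, Eq. (1.1)

# Partial deconvolution: solve min ||C p - g||_2 s.t. p >= 0, column by column
P_hat = np.column_stack([nnls(C, G[:, j])[0] for j in range(n)])
P_hat /= P_hat.sum(axis=0)                  # renormalize to sum-to-one

print(np.max(np.abs(P_hat - P)))            # essentially zero in this noise-free setting
```

In the noise-free case with N ≫ k, the recovery is exact; the ill-posedness discussed next arises only when C is unknown as well.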
The NMF is strongly ill-posed, and solutions are generally not unique. Such non-uniqueness poses great challenges to solution interpretability: for RNA-seq data, different solutions represent different combinations of cell-type GEPs and cellular proportions in tissues, and a meaningful interpretation of these biological quantities is indispensable for the subsequent DE analysis. Although many NMF-based bulk-tissue RNA data deconvolution algorithms have been developed recently [6,41,42,43,44,45,46,50,51] for different types of samples [48,49], the underlying mathematical challenges have never been addressed.
There are a few guidelines to reduce such ill-posedness to a certain degree. As stated in [52], if the constituent matrices C and P satisfy sufficiently-scattered conditions (see Section 2 for details), it is possible for the NMF problem to have unique solutions, subject to row/column scaling and permutation ambiguities. Fortunately, this identifiability condition on matrix C is usually satisfied, due to the existence of marker genes in RNA-seq data. Based on this fact, a geometric structure guided NMF model has been proposed to enhance the identifiability of solutions, by recognizing marker genes first (finding the structure) and then enforcing the identifiability condition on matrix C accordingly (preserving the structure). On the other hand, however, the sufficiently-scattered requirement on matrix P will almost certainly be violated for realistic bulk tissue data: in many cases, there are usually
● Rarely present cell types across tissue samples, i.e., matrix P may contain a nearly-zero row vector;
● There could also be correlated cellular proportions in tissue samples, i.e., matrix P may have linearly dependent rows.
These common biological features can cause severe instability and inaccuracy in deconvolution algorithms. In this work, we call bulk-tissue data singular data if its constituent cellular composition (matrix P) has the above characteristics.
The objective of the current work is to develop reliable and robust deconvolution algorithms for the aforementioned singular data, based on the geometric structure guided NMF (GSNMF) model. The essential idea is to construct pseudo-bulk tissue data with regular cellular composition and then apply the GSNMF model to the combination of the original singular data and the regular data. The overall hybrid data is thus regular, since both constituent matrices C and P satisfy the identifiability conditions. Specifically, our approach consists of the following steps: (1) The connection between scRNA-seq and bulk tissue RNA-seq data is investigated, and a stochastic algorithm is developed to generate pseudo-bulk tissue data; (2) since NMF is a non-convex optimization problem whose solution heavily depends on the initial condition, the relation between the initial guess and the converged result is discussed, and an informative initial condition is suggested accordingly; (3) based on the suggested initial condition, it is possible to estimate the qualitative difference between the converged solution and the true solution, and hence a posterior rescaling process may increase solution accuracy.
Throughout the paper, a bold lower-case letter, such as x, represents a column vector of appropriate dimension, and |x| represents its l2 norm. Vector 1 is a column vector of appropriate dimension with all entries equal to one. For a matrix A, ‖A‖_F represents its Frobenius norm, while A_{(i)} and A^{(j)} denote its i-th row and j-th column, respectively.
The paper is organized as follows: In Section 2 we briefly review the NMF, its graphic interpretation, and its separability conditions. Section 3 presents the proposed pseudo-bulk tissue augmented geometric structure guided complete deconvolution model. Solution analysis and its biological insights are provided in Section 4, including the dependence on informative initial conditions and solution estimates. As validation, in Section 5, we provide numerical results of the proposed model and algorithms for both synthetic and biological data. The paper ends with a conclusion in Section 6, where potential challenges of the work and possible future research directions are discussed. For convenience, some frequently used acronyms are listed in Table 1.
GEP | Gene expression profile |
DE | Differential expression |
NMF | Nonnegative Matrix Factorization |
GSNMF | Geometric structure guided NMF |
ADMM | Alternating direction method of multipliers |
Let matrix G ∈ R^{N×n} with entries g_{ij} be the expression of the i-th gene in the j-th sample; matrix C ∈ R^{N×k} with entries c_{ij} be the reference expression of the i-th gene in the j-th cell type; and matrix P ∈ R^{k×n} with entries p_{ij} be the proportion of the i-th cell type in the j-th sample. Assume the dimensions satisfy N ≫ max(n,k) and the following linear relation:
\[
G = CP + \epsilon, \tag{2.1}
\]
where ϵ is noise. Complete deconvolution is the problem that given bulk tissue data G∈RN×n, solve for the constituent matrices C and P, i.e.,
\[
(C^*, P^*) = \operatorname*{arg\,min}_{C \in \mathbb{R}^{N\times k}_+,\ P \in \mathbb{R}^{k\times n}_+} \delta(CP, G), \tag{2.2}
\]
where R^{N×k}_+ and R^{k×n}_+ denote matrices with nonnegative entries and δ(⋅,⋅) is a cost function. Usually, the cost function is chosen based on prior knowledge about the probability distribution of the data noise and the susceptibility to outliers. When the Frobenius norm is used, the minimization is a maximum likelihood estimator for additive Gaussian noise. On the other hand, when the noise follows a Poisson distribution, the Expectation Maximization (EM) algorithm and maximum likelihood [53] lead to the I-divergence [54] cost function. The cost function can also be taken as row-wise or column-wise l1 norms of the difference matrix [43] if the noise follows a Laplace distribution. As summarized in [55], other choices include the Earth Mover's distance metric, α-divergence, β-divergence, γ-divergence, φ-divergence, Bregman divergence, and α-β-divergence. In this work, we focus on how to handle data singularity and structure, so we only use δ(CP,G) = (1/2)‖G−CP‖²_F for simplicity. The NMF is well known to be ill-posed, and the solution is not unique or identifiable: if (C*, P*) is a local minimum of (2.2), then for any invertible Ω ∈ R^{k×k}, Ĉ = C*Ω and P̂ = Ω^{-1}P* are also solutions as long as their non-negativity is satisfied.
Figure 2 features some situations arising when solving an NMF problem by the classic Multiplicative Update (MU) rule. Figure 2(a), (b) show two local minimizers (blue) of matrix C, but neither of them is even close to the ground truth (red). Figure 2(c) displays a case of the NMF model applied to realistic RNA-seq data to compute cellular composition. The simulated results (blue) are highly correlated with the ground truth (orange), and would score highly if correlation were used as the evaluation metric. However, these computations estimate significantly wrong cellular proportions for two of the cell types. This is due to the fact that, even when all assumptions are satisfied, the "unique" solutions of NMF are still subject to a permutation and rescaling matrix; see details in the following section. The permutation ambiguity might be resolved by biological knowledge, while the rescaling factors must be addressed computationally.
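For readers who want to reproduce this phenomenon, the following is a minimal sketch of the classic Lee–Seung multiplicative updates (not the GSNMF algorithm of this paper, and with illustrative synthetic data): two random initializations fit the data comparably well while the recovered factors themselves differ.

```python
import numpy as np

def nmf_mu(G, k, iters=2000, seed=0):
    """Lee-Seung multiplicative updates for min ||G - CP||_F^2."""
    rng = np.random.default_rng(seed)
    N, n = G.shape
    C = rng.random((N, k)) + 0.1
    P = rng.random((k, n)) + 0.1
    eps = 1e-12                       # guard against division by zero
    for _ in range(iters):
        C *= (G @ P.T) / (C @ P @ P.T + eps)
        P *= (C.T @ G) / (C.T @ C @ P + eps)
    return C, P

rng = np.random.default_rng(42)
G = rng.random((50, 3)) @ rng.random((3, 20))   # exact rank-3 nonnegative data

C1, P1 = nmf_mu(G, 3, seed=1)
C2, P2 = nmf_mu(G, 3, seed=2)
# Both runs fit G well ...
print(np.linalg.norm(G - C1 @ P1), np.linalg.norm(G - C2 @ P2))
# ... yet the factors can differ substantially (non-uniqueness of NMF).
print(np.linalg.norm(C1 - C2))
```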
Unlike in image processing, the non-uniqueness of the solution significantly impacts the statistical analyses that inform decisions in biological applications. Therefore, it is important to restrict the search space of the variables to increase the identifiability of solutions for better interpretability.
First, the unique solution of NMF is defined in the following sense [56]:
Definition 2.1 (Uniqueness of NMF solution). The solution (C*, P*) of NMF (2.2) is unique, or identifiable, if and only if for any other solution (C̄, P̄), there exists a permutation matrix Π ∈ {0,1}^{k×k} and a diagonal scaling matrix S with positive diagonal entries such that
\[
\bar{C} = C^* \Pi S \quad \text{and} \quad \bar{P} = S^{-1} \Pi^\top P^*. \tag{2.3}
\]
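The ambiguity in Definition 2.1 is easy to verify numerically: transforming a factorization by any permutation Π and positive diagonal S as in Eq (2.3) leaves the product unchanged. A small sketch with synthetic matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
C = rng.random((100, k))                 # synthetic GEP matrix
P = rng.dirichlet(np.ones(k), size=12).T # synthetic proportions, columns sum to one

Pi = np.eye(k)[:, [2, 0, 1]]             # a permutation matrix
S = np.diag([0.5, 2.0, 3.0])             # positive diagonal scaling

C_bar = C @ Pi @ S                        # Eq. (2.3)
P_bar = np.linalg.inv(S) @ Pi.T @ P

# The transformed pair is an equally valid nonnegative factorization:
print(np.allclose(C_bar @ P_bar, C @ P))  # True
print((P_bar >= 0).all())                 # True: non-negativity preserved
```

Note that P_bar no longer has sum-to-one columns unless S is the identity, which is exactly why the rescaling factors matter for interpreting proportions.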
This uniqueness can be achieved under certain circumstances from nested cone theory [52,57,58,59]: By non-negativity of C and P, we have
\[
G^{(j)} \in \operatorname{cone}(C) \subseteq \mathbb{R}^N_+, \qquad 1 \le j \le n, \tag{2.4}
\]
where the notation cone(C) denotes the convex cone generated by the columns of C, i.e.,
\[
\operatorname{cone}(C) = \left\{ x \in \mathbb{R}^N \mid x = C\theta, \text{ for some } \theta \in \mathbb{R}^k,\ \theta \ge 0 \right\}. \tag{2.5}
\]
From Eq (2.4), one can further obtain cone(G) ⊆ cone(C) ⊆ R^N_+. The same argument applies to the transpose of the NMF problem, i.e., cone(G^T) ⊆ cone(P^T) ⊆ R^n_+. Then, problem (2.2) can be interpreted as a nested-cone-finding problem: given data G, find the nested cones cone(C) and cone(P^T), between cone(G) and R^N_+, and between cone(G^T) and R^n_+, respectively. Therefore, in order to obtain unique solutions for C and P, the data G must almost "fill" the positive orthants R^N_+ and R^n_+. However, this requirement on G is posed in the original, high-dimensional spaces (N and n are usually large). Instead, the solvability conditions of the NMF problem can be interpreted on matrices C and P in the intrinsic, low-dimensional space R^k_+ with k ≪ min{N,n}. First, one needs the following definitions:
Definition 2.2. The matrix A ∈ R^{N×n}_+ is separable if cone(A) = R^N_+.
Definition 2.3. The matrix A ∈ R^{N×n}_+ is sufficiently scattered if: (ⅰ) the second-order cone in R^N_+ is contained in cone(A), i.e., \(\mathcal{C} = \{x \in \mathbb{R}^N_+ \mid e^\top x \ge \sqrt{N-1}\,\|x\|_2\} \subseteq \operatorname{cone}(A)\); and (ⅱ) there does not exist any orthogonal matrix Q such that cone(A) ⊆ cone(Q), except for permutation matrices.
Then, the unique solution of the NMF problem can be achieved if a strong identifiability condition or its weakened version is satisfied [55,56]:
Theorem 2.4 (Strong identifiability condition). Assuming k=rank(G), ϵ=0, if problem (2.2) admits a solution, for which both C⊤ and P are separable matrices, then the solution is unique.
Theorem 2.5 (Weak identifiability condition). Assuming k=rank(G), ϵ=0, if both C⊤ and P are sufficiently scattered, then problem (2.2) admits a unique solution.
The left panel in Figure 3 offers a graphic explanation of the two theorems. Both the rows of C and the columns of P lie in the low-dimensional space R^k, so it is convenient to represent them in a (k−1)-simplex with k = 3, especially since the columns of P have the sum-to-one property. The red triangle is the positive orthant of R^3 and e_i, i = 1,2,3, are the unit basis vectors. Blue dots, either circles or pentagons, represent rows of C or columns of P. Theorem 2.4 is quite restrictive, because separability requires that P and C^⊤ contain (scaled) extreme rays of the nonnegative orthant in the corresponding space, i.e., for every r = 1,2,...,k there exists a column index l_r such that P^{(l_r)} = α_r e_r ∈ R^k_+, where α_r is a scalar. In the simplex view, some of the blue dots have to coincide with the three red dots. In contrast, Theorem 2.5 is much more relaxed: it requires only that some of the data points (shown as pentagons) lie outside the second-order cone, which is represented by the dashed, inscribed circle in the triangle.
The right panel of Figure 3 displays the fundamental idea of the geometric structure guided NMF (GSNMF) developed in [60]. According to Theorem 2.5, a solvability constraint is imposed on the search space of C. In this constraint, some rows of C are required to be close enough to the k fundamental basis vectors, such that the convex hull of all rows can cover the second-order cone in R^k. The different groups of rows of C associated with each basis vector, represented by the red, blue, and green dots, are recognized by a clustering algorithm and correspond to the biological marker genes. Within the same group, a manifold constraint is applied based on the local invariance assumption in manifold regularization [61,62,63]. Note that the Eisen cosine correlation distance is used in both constraints, instead of the Euclidean distance. With these constraints, the GSNMF significantly improves the interpretability of solutions, but it struggles to obtain accurate solutions when the cellular composition of the bulk tissue data is singular. Recall that both the strong and the weak identifiability conditions impose requirements on both matrices C^T and P. The weak condition on C^T is usually satisfied for a given dataset, and it is reasonable to impose constraints on C_{(i)} due to the biological features of marker genes, which can be recognized computationally or by prior knowledge. However, there is usually no advance information about the cellular composition. The black star dots in the left panel of Figure 3 show such a case, in which the cellular proportions of two cell types are highly correlated across all samples. In this scenario, the data points will distribute more or less along a line and not fully occupy the whole simplex.
One major contribution of this work is to address the challenge of singular data mentioned in the introduction. The core idea is to augment the original NMF problem for the true, potentially singular, sample data G with pseudo-bulk tissue data Ĝ having regular cellular proportions, i.e., the original NMF problem is augmented as
\[
(C^*, P^*) = \operatorname*{arg\,min}_{C \in \mathbb{R}^{N\times k}_+,\ P \in \mathbb{R}^{k\times(n+\hat{n})}_+} \delta(CP, [G, \hat{G}]). \tag{3.1}
\]
The key issue is how to construct regular pseudo bulk tissue data ˆG in Eq (3.1).
Single-cell RNA-seq data is essentially a spreadsheet recording the read counts of a set of N genes in multiple cells of different types and from various samples. In this section, we illustrate how to generate a vector of pseudo-bulk tissue data from this spreadsheet. Let c_{γ,l} ∈ R^N be the GEP of all genes in the l-th cell of the γ-th cell type, where 1 ≤ γ ≤ k and the total number of cells depends on the specific dataset. Then, the relation between the single-cell data and the j-th sample of the pseudo-bulk tissue data is
\[
\hat{G}^{(j)} = \sum_{\gamma=1}^{k} \sum_{l=1}^{m(\gamma,j)} c_{\gamma,l} = \sum_{\gamma=1}^{k} m(\gamma,j)\, \bar{c}_\gamma, \tag{3.2}
\]
where m(γ,j) is the number of cells of the γ-th cell type in the j-th sample, i.e., the cell composition. Since the deconvolution problem does not distinguish individual cells, the above equation is further written in terms of the averaged GEP c̄_γ of each cell type. If cell-type proportions are used, then let M_j = ∑_{γ=1}^k m(γ,j) be the total number of cells in the j-th sample, and Eq (3.2) becomes
\[
\hat{G}^{(j)} = \sum_{\gamma=1}^{k} \frac{m(\gamma,j)}{\sum_{\gamma'=1}^{k} m(\gamma',j)}\, M_j\, \bar{c}_\gamma = \sum_{\gamma=1}^{k} p_{\gamma,j}\, M_j\, \bar{c}_\gamma, \tag{3.3}
\]
with ∑_{γ=1}^{k} p_{γ,j} = 1 being the proportions.
It is important to notice the difference between Eq (3.3) and the j-th column of the NMF model, i.e., G^{(j)} = ∑_{γ=1}^k p_{γ,j} C^{(γ)}. Note that the matrix C ∈ R^{N×k} only contains k columns C^{(γ)} (cell type index 1 ≤ γ ≤ k) and carries no sample index j, so the NMF model for bulk tissue data deconvolution implicitly assumes that the cell-type-specific GEP is homogeneous across tissue samples. This assumption also provides guidance for constructing pseudo-bulk tissue: the total number of cells times the average GEP, i.e., M_j c̄_γ, needs to be as consistent as possible across all samples. This suggests the following procedure:
● Choose the total cell number M, as a parameter, for all pseudo-bulk tissue samples;
● Compute c̄_γ from the scRNA-seq data sheet for 1 ≤ γ ≤ k;
● Choose a mechanism to generate random numbers pγ,j and use Eq (3.3) to generate the j-th pseudo-bulk sample.
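The three steps above can be sketched as follows (the averaged GEPs c̄_γ are synthetic here; in practice they would be computed from the scRNA-seq spreadsheet, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 500, 4          # genes, cell types
M = 1000               # total cell count per pseudo-bulk sample (a chosen parameter)
n_pseudo = 30          # number of pseudo-bulk samples

# Step 2: averaged GEP per cell type (random here; from scRNA-seq in practice)
c_bar = rng.gamma(2.0, 1.0, size=(N, k))

# Step 3: Dirichlet pseudo-proportions and mixing via Eq. (3.3)
P_hat = rng.dirichlet(0.5 * np.ones(k), size=n_pseudo).T   # k x n_pseudo, columns sum to one
G_hat = c_bar @ (M * P_hat)                                # N x n_pseudo pseudo-bulk matrix

print(G_hat.shape)     # (500, 30)
```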
On the other hand, from the point of view of linear regression, the matrix C estimated from the bulk-tissue data (pseudo or real) by the NMF model is actually the averaged value c̄_γ = ⟨M_j c̄_γ⟩ across samples. This understanding raises two concerns about applying the NMF model: (1) The original data need to be "good" in the sense that the GEPs among samples are homogeneous; if prior knowledge indicates that the GEP varies significantly from sample to sample, then NMF is not a suitable model. (2) Appropriate single-cell data usage: the scRNA-seq data could be used directly in NMF as a reference, which reduces complete deconvolution to a regular linear regression problem. However, in realistic applications, the scRNA-seq data may not be consistent with the inherent GEP in the bulk tissue data, or may even be taken from other literature. This inconsistency will significantly reduce the computational accuracy of the GEP (matrix C) and hence of the cellular proportions (matrix P) in the original bulk tissue data.
In order to generate pseudo-bulk tissue data, it is critical to use a proper mechanism to construct the pseudo proportions of each sample in Eq (3.3). Denote the pseudo proportion matrix by P̂; we then require the weak identifiability condition on P̂, i.e., it must contain a sufficient number of columns P̂^{(j)} that are close enough to e_γ ∈ R^k. To this end, we propose to use the multivariate Dirichlet distribution:
Consider the n column vectors of matrix P̂ as n realizations of the random vector p̂ ∈ R^k. With parameters α_1, α_2, ..., α_k > 0, the probability density function with respect to the Lebesgue measure on the Euclidean space R^{k−1} is given as
\[
f(\hat{p}_1, \hat{p}_2, \dots, \hat{p}_k; \alpha_1, \alpha_2, \dots, \alpha_k) = \frac{1}{S(\alpha)} \prod_{\gamma=1}^{k} \hat{p}_\gamma^{\alpha_\gamma - 1}, \quad \text{where } \alpha = (\alpha_1, \alpha_2, \dots, \alpha_k), \tag{3.4}
\]
where {p̂_γ} belongs to the standard (k−1)-simplex (because of the sum-to-one condition) and S(α) is the normalizing constant, defined in terms of the gamma function as
\[
S(\alpha) = \frac{\prod_{\gamma=1}^{k} \Gamma(\alpha_\gamma)}{\Gamma\!\left(\sum_{\gamma=1}^{k} \alpha_\gamma\right)}. \tag{3.5}
\]
It is well known that the mean is E[p̂_γ] = α_γ/α_0, where α_0 = ∑_{γ=1}^k α_γ, and that the variances and covariances of the random vector are
\[
\operatorname{Var}[\hat{p}_\gamma] = \frac{\alpha_\gamma(\alpha_0 - \alpha_\gamma)}{\alpha_0^2(\alpha_0 + 1)}, \quad \text{and} \quad \operatorname{Cov}[\hat{p}_\gamma, \hat{p}_{\gamma'}] = \frac{-\alpha_\gamma \alpha_{\gamma'}}{\alpha_0^2(\alpha_0 + 1)}, \tag{3.6}
\]
so by choosing small and equal values of α_γ, one obtains a smaller value of α_0 and hence widely and evenly scattered vectors in the (k−1)-simplex.
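The role of the parameters can be checked empirically: small, equal α_γ spread the Dirichlet samples toward the simplex vertices, while large α_γ concentrate them at the barycenter. A quick sketch comparing the empirical variance with the closed form in Eq (3.6):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 5000
near_vertex = {}

for alpha in (0.2, 5.0):
    P = rng.dirichlet(alpha * np.ones(k), size=n)       # n samples on the 2-simplex
    a0 = k * alpha
    var_theory = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))   # Eq. (3.6)
    print(alpha, "empirical var:", round(P.var(axis=0).mean(), 5),
          "theory:", round(var_theory, 5))
    # Fraction of samples lying near a vertex (one coordinate > 0.9)
    near_vertex[alpha] = (P.max(axis=1) > 0.9).mean()

print("near-vertex fractions:", near_vertex)
```

For α = 0.2 a substantial fraction of samples sits near the vertices (helping satisfy the sufficiently-scattered condition), while for α = 5.0 essentially none do.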
The regularity of matrix P̂ can be investigated through the condition number of P̂P̂^T: if P̂ contains correlated or extremely small rows, the matrix P̂P̂^T will have a large condition number. Since Var[p̂_γ] = E[p̂_γ²] − E[p̂_γ]² and Cov[p̂_γ, p̂_{γ′}] = E[p̂_γ p̂_{γ′}] − E[p̂_γ]E[p̂_{γ′}], it is easy to derive
\[
E[\hat{p}_\gamma^2] = \frac{\alpha_\gamma(\alpha_\gamma + 1)}{\alpha_0(\alpha_0 + 1)} \quad \text{and} \quad E[\hat{p}_\gamma \hat{p}_{\gamma'}] = \frac{\alpha_\gamma \alpha_{\gamma'}}{\alpha_0(\alpha_0 + 1)}. \tag{3.7}
\]
Then the matrix P̂P̂^T has the following entries when the sample size is large enough:
\[
\hat{P}\hat{P}^T = c_0 \left[ \begin{pmatrix} \alpha_1 & & & \\ & \alpha_2 & & \\ & & \ddots & \\ & & & \alpha_k \end{pmatrix} + \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_k \end{pmatrix} \begin{pmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_k \end{pmatrix} \right], \tag{3.8}
\]
where c_0 = n/(α_0(α_0+1)) and n is the sample size.
Equation (3.8) indicates how to choose the parameters of the Dirichlet distribution in order to maintain a small condition number for P̂P̂^T. Indeed, if α_1 = α_2 = ... = α_k = α, then
\[
\hat{P}\hat{P}^T = c_0\, \alpha \left( I_{k\times k} + \alpha\, \mathbf{1}\mathbf{1}^T \right). \tag{3.9}
\]
Notice that the eigenvalues of 11^T are (k, 0, 0, ..., 0); hence κ(P̂P̂^T) = kα + 1. It is therefore reasonable to choose a uniform vector α = (α, α, ..., α) with a small value of α to maintain a small condition number. Further, we claim that for the mixed bulk tissue data (original plus pseudo-bulk) in the proposed algorithm, the proportion matrix P̄ = [P, P̂] is always regular. Indeed,
\[
\bar{P}\bar{P}^T = PP^T + \hat{P}\hat{P}^T \tag{3.10}
\]
has full rank, because P̂P̂^T contains a positive diagonal matrix and the remaining terms are positive semi-definite.
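Both the closed-form condition number κ(P̂P̂^T) = kα + 1 from Eq (3.9) and the regularizing effect of the augmentation (3.10) can be verified numerically; the "singular" proportion matrix below, with two almost identical rows, is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 50
alpha = 0.5

# Expected Gram matrix up to the factor c0*alpha, Eq. (3.9): I + alpha * 1 1^T
M = np.eye(k) + alpha * np.ones((k, k))
print(np.linalg.cond(M), k * alpha + 1)      # both equal k*alpha + 1 = 3.0

# A "singular" proportion matrix: two highly correlated rows
base = rng.dirichlet(np.ones(3), size=n).T   # 3 independent rows
P = np.vstack([base[0], 0.98 * base[0] + 0.02 * base[1], base[1], base[2]])
P /= P.sum(axis=0)                           # restore sum-to-one columns
print("cond(P P^T) =", np.linalg.cond(P @ P.T))            # very large

# Augment with regular Dirichlet pseudo-proportions, Eq. (3.10)
P_hat = rng.dirichlet(alpha * np.ones(k), size=n).T
P_bar = np.hstack([P, P_hat])
print("cond(P_bar P_bar^T) =", np.linalg.cond(P_bar @ P_bar.T))  # moderate
```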
In this section, we summarize and assemble all the components of the proposed method.
According to Eq (3.6), pick small and equal values of the parameters, e.g., α_1 = α_2 = ... = α_k = 1.5, and use the Dirichlet distribution (3.4) to generate n random vectors p̂_j ∈ R^k, j = 1, 2, ..., n, belonging to the standard (k−1)-simplex. Assume that there are M_γ cells of the γ-th cell type in the single-cell dataset and let M = ∑_{γ=1}^k M_γ. Denote by p̂_{γ,j} the γ-th entry of vector p̂_j; then we randomly choose ⌈M p̂_{γ,j}⌉ cells, with replacement, from the M_γ cells of the γ-th type (regardless of which number is larger). Then, the j-th pseudo-bulk tissue sample, or column of Ĝ, is generated as in Eq (3.2):
\[
\hat{G}^{(j)} = \sum_{\gamma=1}^{k} \sum_{l=1}^{\lceil M \hat{p}_{\gamma,j} \rceil} g_{\gamma,l}. \tag{3.11}
\]
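A sketch of the resampling construction (3.11) on a synthetic single-cell spreadsheet (the cell counts and the Poisson model for read counts are illustrative assumptions, not choices made in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 300, 3
M_gamma = [120, 80, 200]                   # cells per type in the scRNA-seq set
M = sum(M_gamma)

# Synthetic single-cell GEPs: one (cells x genes) array per cell type
cells = [rng.poisson(5.0, size=(m, N)) for m in M_gamma]

def pseudo_bulk_sample(p_hat):
    """One column of G_hat via Eq. (3.11): resample ceil(M * p) cells per type."""
    g = np.zeros(N)
    for gamma in range(k):
        n_cells = int(np.ceil(M * p_hat[gamma]))
        idx = rng.integers(0, M_gamma[gamma], size=n_cells)  # with replacement
        g += cells[gamma][idx].sum(axis=0)
    return g

p_hat = rng.dirichlet(1.5 * np.ones(k))    # alpha = 1.5 as suggested in the text
G_j = pseudo_bulk_sample(p_hat)
print(G_j.shape)                            # (300,)
```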
The subset of marker genes in the augmented data G̃ = [G, Ĝ] can be identified by clustering the rows of G̃ into k groups. The N rows of G̃ are taken as the vertex set V = {G̃_{(i)}}_{i=1}^N ⊂ R^n of the similarity graph 𝒢 = (V, E) in the spectral clustering method [64]. The non-negative weights ω_{ij} of the edges E = {e_{ij}} are calculated by a function R^n × R^n → R_+ as
\[
\omega_{ij} = \exp\left\{ -\frac{d_{\mathrm{eisen}}(\tilde{G}_{(i)}, \tilde{G}_{(j)})^2}{\sigma} \right\}, \qquad 1 \le i \le N,\ 1 \le j \le N, \tag{3.12}
\]
where
\[
d_{\mathrm{eisen}}(\tilde{G}_{(i)}, \tilde{G}_{(j)}) = 1 - \frac{\langle \tilde{G}_{(i)}, \tilde{G}_{(j)} \rangle}{|\tilde{G}_{(i)}|\,|\tilde{G}_{(j)}|} \tag{3.13}
\]
is the Eisen cosine correlation distance quantifying the correlation between two vertices, and σ > 0 is a parameter. With the adjacency matrix W = (ω_{ij}) ∈ R^{N×N} and its degree matrix D = diag(d_1, d_2, ..., d_N), where d_i = ∑_{j=1}^N ω_{ij}, different types of graph Laplacians (gL) of 𝒢 can be defined, such as the unnormalized gL L = D − W, the symmetric normalized gL L_sym = I − D^{−1/2}WD^{−1/2}, or the random walk gL L_rw = I − D^{−1}W. By examining the first few eigenvectors of the gL, the rows of G̃ are clustered into k groups, and the sets {G_r}_{r=1}^k record the row indices of G̃ in the corresponding clusters. The choice among the gLs depends on the specific data application [65]. For our problem, we use the normalized gL L_sym, and the matrix W will be reused in the next step. The other outputs for the next step are the disjoint sets S_γ, γ = 1, 2, ..., k, which contain the indices of the marker genes of the γ-th cell type.
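The clustering step might be sketched as follows, with the Eisen distance (3.13), the Gaussian-type weights (3.12), and the normalized graph Laplacian L_sym; the toy rows are built on disjoint "marker" blocks so the true grouping is known (all names and parameter values are illustrative):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def eisen_dist(X):
    """Pairwise Eisen cosine correlation distance between rows of X, Eq. (3.13)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos = (X @ X.T) / (norms @ norms.T + 1e-12)
    return 1.0 - cos

def spectral_cluster_rows(G_tilde, k, sigma=0.5, seed=0):
    D2 = eisen_dist(G_tilde) ** 2
    W = np.exp(-D2 / sigma)                          # weights, Eq. (3.12)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Spectral embedding: first k eigenvectors of the normalized graph Laplacian
    _, U = eigh(L_sym, subset_by_index=[0, k - 1])
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    _, labels = kmeans2(U, k, seed=seed, minit="++")
    return labels

# Toy data: 3 groups of 20 rows, each supported on its own block of 10 genes
rng = np.random.default_rng(0)
blocks = np.kron(np.eye(3), np.ones((1, 10)))        # 3 x 30, disjoint supports
G_tilde = np.vstack([blocks[i] + 0.05 * rng.random(30)
                     for i in range(3) for _ in range(20)])
labels = spectral_cluster_rows(G_tilde, k=3)
print(np.unique(labels))
```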
\[
\min_{C \ge 0,\ P \ge 0} \frac{1}{2}\left\| [G, \hat{G}] - C[P, \hat{P}] \right\|_F^2 + F(C) + \mathbb{1}_T([P, \hat{P}]). \tag{3.14}
\]
For model simplicity, we just use the Frobenius norm to measure the misfit between the deconvoluted solution and the given mixed data. Note that P ∈ R^{k×n} corresponds to the original n samples while P̂ ∈ R^{k×n′} is for the pseudo-bulk tissue. Both are subject to the sum-to-one condition on columns, which is represented by the set T := {Z ∈ R^{k×n}_+ : 1^⊤Z = 1^⊤} and the indicator function 1_T, with 1_T(Z) = 0 if Z ∈ T and 1_T(Z) = ∞ otherwise.
The regularization function is F(C) = F_1(C) + F_2(C), where
\[
F_1(C) = \frac{\lambda_1}{2} \sum_{r=1}^{k} \sum_{i \in S_r} d_{\mathrm{eisen}}(C_{(i)}, e_r^\top)^2, \tag{3.15}
\]
and
\[
F_2(C) = \frac{\lambda_2}{2} \sum_{j=1}^{N} \sum_{i=1}^{N} \omega_{ij}\, d_{\mathrm{eisen}}(C_{(i)}, C_{(j)})^2, \tag{3.16}
\]
with λ_1 and λ_2 being parameters. Recall that the entry ω_{ij} > 0 of the adjacency matrix W in (3.12) measures the correlation (a larger value represents a stronger correlation) between genes i and j in the data G.
The alternating direction method of multipliers (ADMM) [66] is used to numerically solve this optimization problem, and two types of parameters are involved in the overall computational process. One set of parameters belongs to the ADMM algorithm itself: a detailed discussion can be found in our earlier work [60], and the same parameter-tuning strategies are used in this paper. On the other hand, there are two major parameters in generating and integrating the pseudo-bulk tissue data. One is the vector α = (α_1, α_2, ..., α_k) of the Dirichlet distribution for the pseudo cellular proportions. As discussed in Section 5.1, small and equal values α_l, l = 1, 2, ..., k, contribute to both the geometric structure requirement and computational stability. The other is the number n′ of pseudo-bulk samples. Obviously, if n′ is too small, there will not be enough information; but a very large n′ is unnecessary according to Definition 2.3, and it also increases the computational cost. The proper amount of pseudo-bulk samples is not a trivial question: from Eq (3.1), we can see that the solution C* is the "average" of the constituent GEPs from the original bulk tissue samples and the reference scRNA-seq data. Thus, if the two sets of data are not consistent, the generated pseudo-bulk samples will degrade the accuracy of the estimated GEP and hence of the cellular proportions. In Section 5.3, we computationally address this issue with a specific dataset, while a systematic investigation is left for future work.
In many studies and software packages, the most commonly used evaluation metric is the row-wise (across samples) correlation of the cell proportion matrices between computational results and the ground truth, i.e., between P_{(i)} and P*_{(i)} for i = 1, 2, ..., k. This metric is problematic: one issue is that it only demonstrates the global similarity between the computational result and the ground truth for the i-th cell type across all samples, but says little about all cell types in a specific sample. More importantly, by the uniqueness properties of the NMF solution, each row of matrix P can be subject to a rescaling factor. It is easy to see that λ_i P_{(i)} has the same correlation coefficient with P*_{(i)} as P_{(i)} does. Thus, even if all corresponding rows of P and P* are highly correlated, each column of P and P* could be totally different, and this difference leads to misinformation about the cellular composition of each sample. In addition to the Pearson correlation, we therefore also measure simulation accuracy (relative error) in terms of the discrete L1 norm, or equivalently the normalized mean absolute difference (mAD), and the discrete L2 norm, or equivalently the normalized root mean square deviation (RMSD). Similar metrics are also defined for the columns of matrix C, i.e., the GEP of each cell type.
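The point about correlation being insensitive to row rescaling can be made concrete. In the sketch below (the particular normalizations of mAD and RMSD are one possible choice, not necessarily the paper's), an estimate whose rows are scaled by 2, 0.5, and 1 still achieves perfect per-row correlation:

```python
import numpy as np

def row_metrics(P_est, P_true):
    """Per-cell-type Pearson r, normalized mAD, and normalized RMSD."""
    out = []
    for p, q in zip(P_est, P_true):            # iterate over rows (cell types)
        r = np.corrcoef(p, q)[0, 1]
        mad = np.mean(np.abs(p - q)) / np.mean(np.abs(q))
        rmsd = np.sqrt(np.mean((p - q) ** 2)) / np.sqrt(np.mean(q ** 2))
        out.append((r, mad, rmsd))
    return out

rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(3), size=50).T
P_est = np.diag([2.0, 0.5, 1.0]) @ P_true       # row-rescaled "solution"

metrics = row_metrics(P_est, P_true)
for r, mad, rmsd in metrics:
    print(f"r = {r:.3f}, mAD = {mad:.3f}, RMSD = {rmsd:.3f}")
# Correlation is 1.000 for every row, while mAD/RMSD expose the
# factor-of-2 errors in the first two cell types.
```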
According to the theorem, the solution is subject to a rescaling factor, i.e., if C and P are the "true" solutions of the NMF problem, the best computational solution one can obtain is
$$\bar{C} = C\,\Omega_\lambda^{-1} \quad \text{and} \quad \bar{P} = \Omega_\lambda P, \tag{4.1}$$
where $\Omega_\lambda = \mathrm{diag}\{\lambda_1,\lambda_2,\dots,\lambda_k\}$, $\lambda_l>0$, is unknown. This unknown rescaling matrix may cause problematic estimations of the cellular proportions. Another major contribution of this work is to perform a posteriori error estimation of the NMF solution, which gives important insights into data preparation for realistic biological problems. We first explore the dependence of the solution of the non-convex problem on the initial condition, and then estimate the rescaling matrix. Note that the results in this section are based on the matrix decomposition form $G=CP$, where $P$ is regular, without considering noise or the nonnegativity constraint.
First, we consider the converged solution.
Theorem 4.1. Given an initial condition $P_{[0]}$ with $P_{[0]}P_{[0]}^T$ invertible, and with the true GEP and cell proportion matrices denoted by $C$ and $P$, respectively, the iterative solutions of $G=CP$ converge to $C^* = CM_1M_2^{-1}$ and $P^* = M_2M_1^{-1}P$, where $M_1 = PP_{[0]}^T$ and $M_2 = P_{[0]}P_{[0]}^T$.
Proof. With the simple (alternating least-squares) iteration,
$$C_{[0]} = G P_{[0]}^T (P_{[0]}P_{[0]}^T)^{-1} = C P P_{[0]}^T (P_{[0]}P_{[0]}^T)^{-1} = C M_1 M_2^{-1}. \tag{4.2}$$
Then
$$P_{[1]} = (C_{[0]}^T C_{[0]})^{-1} C_{[0]}^T G = (C_{[0]}^T C_{[0]})^{-1} C_{[0]}^T C P = \left[(M_2^{-1})^T M_1^T C^T C M_1 M_2^{-1}\right]^{-1} (M_2^{-1})^T M_1^T C^T C P = M_2 M_1^{-1} P.$$
With the updated $P_{[1]}$, we compute
$$C_{[1]} = G P_{[1]}^T (P_{[1]}P_{[1]}^T)^{-1} = C P P^T (M_1^{-1})^T M_2^T \left(M_2 M_1^{-1} P P^T (M_1^{-1})^T M_2^T\right)^{-1} = C M_1 M_2^{-1} = C_{[0]}. \tag{4.3}$$
Hence the iteration is stationary after one sweep. $\square$
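The two half-steps of this proof can be verified numerically. The following sketch (noise-free, nonnegativity ignored, variable names ours) checks Eqs (4.2) and (4.3) for a random instance:

```python
import numpy as np

rng = np.random.default_rng(2)
k, N, n = 3, 40, 25
C = np.abs(rng.normal(size=(N, k)))                 # true GEP
P = rng.dirichlet(np.ones(k), size=n).T             # true proportions (k, n)
G = C @ P

P0 = rng.dirichlet(np.ones(k), size=n).T            # arbitrary initial condition
M1 = P @ P0.T
M2 = P0 @ P0.T

# one alternating least-squares sweep (no nonnegativity projection)
C0 = G @ P0.T @ np.linalg.inv(P0 @ P0.T)            # Eq (4.2): C0 = C M1 M2^{-1}
P1 = np.linalg.inv(C0.T @ C0) @ C0.T @ G            # P1 = M2 M1^{-1} P
C1 = G @ P1.T @ np.linalg.inv(P1 @ P1.T)            # Eq (4.3): C1 = C0
```

Up to floating-point error, `C0`, `P1`, and `C1` match the closed forms in the theorem, confirming that the iteration stalls at $CM_1M_2^{-1}$ and $M_2M_1^{-1}P$.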
Based on Theorem 4.1, it is possible to combine the information of marker genes with a proper choice of initial condition so that the computational solutions can be estimated. From now on, we assume the marker genes and their associated cell types have been identified. Mathematically, the set $M$ denotes the indices of all marker genes, and the disjoint sets $S_\gamma$, $\gamma=1,2,\dots,k$, denote the indices of the marker genes of the $\gamma$-th cell type. Note $\bigcup_\gamma S_\gamma = M$, and denote $S_\gamma^c = M\setminus S_\gamma$. By the defining property of marker genes, ideally $c_{[S_\gamma^c],\gamma}=0$, where $c_{[S_\gamma^c],\gamma}$ denotes the entries of the $\gamma$-th column of $C$ with row indices in $S_\gamma^c$. Then, for given $\gamma$ and $\gamma'$, let $i_1,i_2\in S_\gamma$ and $i_3\in S_{\gamma'}$; it is easy to see that
$$G^{(i_1)} = c_{i_1,\gamma} P^{(\gamma)}, \qquad G^{(i_2)} = c_{i_2,\gamma} P^{(\gamma)}, \qquad G^{(i_3)} = c_{i_3,\gamma'} P^{(\gamma')}. \tag{4.4}$$
From (4.4), we conclude that the rows of $G$ indexed by $i_1$ and $i_2$, i.e., by marker genes of the same cell type, are linearly dependent, since both are multiples of the same $\gamma$-th row of $P$. Further, for every $1\le\gamma\le k$, we take the mean of the rows $G^{(i)}$ over all $i\in S_\gamma$, denoted $\langle G^{(i)}\rangle_{S_\gamma}$, and let the result be the $\gamma$-th row of a matrix $\tilde{P}$, i.e.,
$$\tilde{P}^{(\gamma)} = \langle G^{(i)}\rangle_{S_\gamma} = \langle c_{i,\gamma}\rangle_{S_\gamma} P^{(\gamma)}, \quad \text{or simply} \quad \tilde{P} = \Omega_c P, \tag{4.5}$$
where $\Omega_c = \mathrm{diag}(A)$ and $A^T = \left[\langle c_{i,1}\rangle_{S_1}, \dots, \langle c_{i,\gamma}\rangle_{S_\gamma}, \dots, \langle c_{i,k}\rangle_{S_k}\right]$.
If a good identification of the marker gene sets can be made, the matrix $\tilde{P}$ can be computed directly from the original data $G$, and it is related to the true cellular proportion $P$ by the diagonal rescaling factor $\Omega_c$. Although this factor is still unknown, the matrix $\tilde{P}$ can serve as an informative initial condition.
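The construction in Eq (4.5) is a simple row-averaging operation. A minimal sketch (function name `marker_initial_P` and the toy marker structure are ours) with a verification against $\tilde{P}=\Omega_c P$:

```python
import numpy as np

def marker_initial_P(G, marker_sets):
    """Average the bulk rows of each marker-gene set, Eq (4.5).

    marker_sets : list of index arrays S_1, ..., S_k (one per cell type).
    Returns the k x n matrix  P~ = Omega_c P.
    """
    return np.vstack([G[idx].mean(axis=0) for idx in marker_sets])

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(2), size=8).T            # true proportions (2, 8)
C = np.zeros((6, 2))
C[:3, 0] = [1.0, 2.0, 3.0]                         # markers of cell type 1
C[3:, 1] = [4.0, 5.0, 6.0]                         # markers of cell type 2
G = C @ P
P_tilde = marker_initial_P(G, [np.arange(3), np.arange(3, 6)])
# here Omega_c = diag(2, 5), the mean marker expression per type
```

In this toy example the recovered `P_tilde` equals `diag(2, 5) @ P` exactly, matching Eq (4.5).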
Specifically, take the initial condition $P_{[0]}$ to be the column normalization of $\tilde{P}$ under the sum-to-one condition, i.e., $P_{[0]} = \Omega_c P \Omega_a^{-1}$, where
$$\Omega_a = \mathrm{diag}\left(A^T P_{(1)}, A^T P_{(2)}, \dots, A^T P_{(n)}\right).$$
Consequently, by Theorem 4.1, the solutions converge to
$$C^* = C M_1 M_2^{-1} = C\,(P\Omega_a^{-1}P^T)(P\Omega_a^{-2}P^T)^{-1}\Omega_c^{-1} \tag{4.6}$$
and
$$P^* = M_2 M_1^{-1} P = \Omega_c\,(P\Omega_a^{-2}P^T)(P\Omega_a^{-1}P^T)^{-1} P. \tag{4.7}$$
Theorem 4.2. Assume the ideal data $G$ has true solutions $C$ and $P$, i.e., $G=CP$. Additionally, assume the row index set $\Omega$ of $C$ (and of $G$) can be subdivided into $k$ disjoint sets $S_\gamma$, i.e., $\Omega = \bigcup_\gamma S_\gamma$, and the entries of $C$ satisfy $c_{i,\gamma}=0$ if $i\notin S_\gamma$. Given the initial condition $P_{[0]} = \Omega_c P \Omega_a^{-1}$, if the solutions of $G=CP$ converge to $C^*$ and $P^*$, then the following estimates hold for each row of $C^*$:
$$\beta^{-1}\,\|C^{(i)}\|_2 \le \|C^{*(i)}\|_2 \le \beta\,\|C^{(i)}\|_2, \tag{4.8}$$
and for each column of $P^*$:
$$\beta^{-1}\,\|P_{(j)}\|_2 \le \|P^*_{(j)}\|_2 \le \beta\,\|P_{(j)}\|_2, \tag{4.9}$$
where
$$\beta = \sqrt{\kappa(PP^T)}\left(\frac{\lambda^c_{\max}}{\lambda^c_{\min}}\right)^2, \tag{4.10}$$
$\kappa(PP^T)$ is the condition number of $PP^T$, and
$$[\lambda^c_{\max}, \lambda^c_{\min}] = [\max, \min]\left\{\langle c_{i,1}\rangle_{S_1}, \dots, \langle c_{i,\gamma}\rangle_{S_\gamma}, \dots, \langle c_{i,k}\rangle_{S_k}\right\}.$$
Proof. First, denote $R = (P\Omega_a^{-1}P^T)(P\Omega_a^{-2}P^T)^{-1}\Omega_c^{-1}$, so that $C^* = CR$ and $P^* = R^{-1}P$. We then estimate $\|R\|_2$ to see how it changes each row of the true GEP matrix $C$. Assume $P$ has the SVD $P = U\Sigma V^T$, with $U\in\mathbb{R}^{k\times k}$ and $V\in\mathbb{R}^{n\times n}$ unitary, and $\Sigma\in\mathbb{R}^{k\times n}$ having minimum and maximum singular values $\sigma_{\min}$ and $\sigma_{\max}$, respectively. Then
$$\begin{aligned}
\|R\|_2 &= \left\|U\Sigma V^T\Omega_a^{-1}V\Sigma^T U^T\left(U\Sigma V^T\Omega_a^{-2}V\Sigma^T U^T\right)^{-1}\Omega_c^{-1}\right\|_2 \\
&= \left\|U\Sigma V^T\Omega_a^{-1}V\Sigma^T U^T U(\Sigma^T)^{-1}V^T\Omega_a^{2}V\Sigma^{-1}U^T\Omega_c^{-1}\right\|_2 \\
&= \left\|U\Sigma V^T\Omega_a^{-1}V\tilde{I}V^T\Omega_a^{2}V\Sigma^{-1}U^T\Omega_c^{-1}\right\|_2.
\end{aligned}$$
Note that $\tilde{I}\in\mathbb{R}^{n\times n}$ and $\tilde{I} = \begin{pmatrix} I_{k\times k} & 0\\ 0 & 0\end{pmatrix}$ by the definition of the singular value matrix $\Sigma$. Hence, using the fact that unitary matrices preserve the spectral norm, we have the estimate
$$\begin{aligned}
\|R\|_2 &\le \frac{1}{\lambda^c_{\min}}\left\|U\Sigma V^T\Omega_a^{-1}V\tilde{I}V^T\Omega_a^{2}V\Sigma^{-1}U^T\right\|_2 && (4.11)\\
&= \frac{1}{\lambda^c_{\min}}\left\|\Sigma V^T\Omega_a^{-1}V\tilde{I}V^T\Omega_a^{2}V\Sigma^{-1}\right\|_2 && (4.12)\\
&\le \frac{\sigma_{\max}}{\sigma_{\min}\lambda^c_{\min}}\left\|V^T\Omega_a^{-1}V\tilde{I}V^T\Omega_a^{2}V\right\|_2 && (4.13)\\
&= \frac{\sigma_{\max}}{\sigma_{\min}\lambda^c_{\min}}\left\|\Omega_a^{-1}V\tilde{I}V^T\Omega_a^{2}\right\|_2. && (4.14)
\end{aligned}$$
Note that each entry of $\Omega_a$, i.e., $A^TP_{(j)}$, is a convex combination of the entries of $A$, since the columns of $P$ sum to one. Examining the eigenvalues $\lambda^a$ of $\Omega_a$, we thus have
$$\lambda^c_{\min} \le \lambda^a_{\min} \le \lambda^a_{\max} \le \lambda^c_{\max}. \tag{4.15}$$
In fact, if $P$ satisfies the weak or strong separability condition, then $\lambda^c_{\min}=\lambda^a_{\min}$ and $\lambda^c_{\max}=\lambda^a_{\max}$. Additionally, noticing that $\sqrt{\kappa(PP^T)} = \sigma_{\max}/\sigma_{\min}$, we obtain the estimate
$$\|R\|_2 = \|R^T\|_2 \le \sqrt{\kappa(PP^T)}\left(\frac{\lambda^c_{\max}}{\lambda^c_{\min}}\right)^2. \tag{4.16}$$
Finally, by the definition of the spectral norm of a matrix,
$$\|C^{*(i)}\|_2 = \|R^T C^{(i)}\|_2 \le \|C^{(i)}\|_2\,\sqrt{\kappa(PP^T)}\left(\frac{\lambda^c_{\max}}{\lambda^c_{\min}}\right)^2, \tag{4.17}$$
and by the same argument, the estimate (4.16) also holds for $R^{-1}$. Hence, the other inequality in (4.8) follows since $\|C^{(i)}\|_2 = \|(R^{-1})^T C^{*(i)}\|_2$. $\square$
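Theorem 4.2 can be checked numerically by forming the converged solution of Eq (4.6) directly, without iterating. The sketch below (variable names ours) builds a $C$ in which every gene is a marker, as the theorem assumes, and verifies the two-sided row-norm bound with $\beta$ from Eq (4.10):

```python
import numpy as np

rng = np.random.default_rng(4)
k, n = 3, 30
P = rng.dirichlet(np.ones(k), size=n).T               # true proportions, sum-to-one columns
C = np.zeros((9, k))
for g in range(k):                                    # block structure: c_{i,g}=0 off-block
    C[3 * g:3 * g + 3, g] = rng.uniform(1.0, 4.0, 3)

A = np.array([C[3 * g:3 * g + 3, g].mean() for g in range(k)])
Omega_c = np.diag(A)
Omega_a = np.diag(A @ P)                              # entries A^T P_(j)

# converged solution for P_[0] = Omega_c P Omega_a^{-1}: C* = C R, Eq (4.6)
Oa_inv = np.linalg.inv(Omega_a)
R = (P @ Oa_inv @ P.T) @ np.linalg.inv(P @ Oa_inv @ Oa_inv @ P.T) @ np.linalg.inv(Omega_c)
C_star = C @ R

beta = np.sqrt(np.linalg.cond(P @ P.T)) * (A.max() / A.min()) ** 2
row = np.linalg.norm(C, axis=1)
row_star = np.linalg.norm(C_star, axis=1)
```

For this random instance, every row norm of `C_star` lies within a factor `beta` of the corresponding row norm of `C`, as Eq (4.8) predicts.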
The estimates in Theorem 4.2 are based on the noiseless situation and the ideal case in which marker genes can be easily recognized. Nevertheless, they shed light on additional qualitative conditions on the constituent matrices C and P, i.e., in what situations the numerical solutions can be as close to the ground truth as possible.
● The quantity $\beta$ in Eq (4.10) serves as an estimator of the scaling factor in the "unique" solution, i.e., of how far the computational results may be from the ground truth. It is reasonable that this quantity depends on the regularity of the constituent matrix $P$ (the condition number of $PP^T$).
● Note that $\lambda^c_{\min}$ ($\lambda^c_{\max}$) is the minimum (maximum) average marker gene expression over the corresponding cell types, so a smaller ratio $\lambda^c_{\max}/\lambda^c_{\min}$ results in a smaller bound on $\|R\|_2$.
● Based on the above, we conclude that homogeneous average GEPs across different cell types and regular cellular compositions in bulk samples are preferred. For the former, one can in practice select subsets of marker genes in each cell type such that large differences in the GEP averages are avoided. The latter motivates the idea of pseudo-bulk tissue data. Numerical simulations in the next section show that the solution accuracy is significantly enhanced when the two strategies are adopted.
In this section, we present the numerical results of the proposed algorithms. First, we use synthetic data to display the geometric structure of pseudo cellular proportions generated from Dirichlet distributions with different parameter setups. Properly designed data satisfying the weak solvability condition significantly enhance the solution quality on singular data. Then, the proposed algorithms are applied to three different datasets to validate their effectiveness and accuracy. Finally, the resilience of the proposed algorithm is analyzed computationally. In contrast to the original GSNMF, the newly proposed pseudo-bulk tissue data augmented method is termed GSNMF+.
Figure 4 displays the geometric structures of cellular proportions generated from Dirichlet distributions with various parameters. For visualization convenience, the cell type dimension is taken as $k=3$ so that all data points can be shown in the 2-simplex (blue triangles). Six different parameter sets $\alpha$ are used, with heterogeneous (first row) and homogeneous (second row) components. It can be seen that the convex hull of data generated with heterogeneous parameters $\alpha$ cannot contain the second-order cone, which can be represented as the inscribed circle (not shown) of the triangles; that is, the data do not satisfy the sufficiently scattered condition. On the other hand, data generated from homogeneous parameters tend to be distributed more evenly in the simplex. However, for $\alpha$ with larger values, the data points concentrate at the center of the simplex, and their convex hull again cannot cover the second-order cone (recall that a larger value of $\alpha$ leads to a larger condition number of $PP^T$). Therefore, we use homogeneous parameters with small to medium values, such as $|\alpha|=0.2$ or $|\alpha|=1$, in simulations, where $|\alpha|$ is the $l^\infty$ norm of $\alpha$.
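The link between the concentration parameter and the conditioning of $PP^T$ is easy to reproduce. The following sketch (the helper `proportion_condition` is ours) compares a small and a large homogeneous $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(5)

def proportion_condition(alpha, k=3, n=500):
    """Condition number of P P^T for P drawn from a symmetric Dirichlet(alpha)."""
    P = rng.dirichlet(np.full(k, alpha), size=n).T   # columns are proportions
    return np.linalg.cond(P @ P.T)

# small alpha: samples spread toward the simplex corners, well-conditioned P P^T;
# large alpha: samples concentrate at the simplex center, P P^T nearly rank one
```

With, say, `alpha=0.2` versus `alpha=50`, the second condition number is larger by orders of magnitude, consistent with the preference for small-to-medium $|\alpha|$ stated above.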
Figure 5 shows how the overall solvability condition is improved when singular original cell proportions (red dots) are mixed with pseudo data (green dots). Figure 5(a) shows the singularity in which one cell type has an extremely small proportion in all samples, i.e., a small row vector in $P$. As seen, the red dots are mostly distributed along one edge of the simplex and absent from one corner. The other singularity type occurs when two cell types have correlated cellular proportions; the corresponding geometric structure is displayed in Figure 5(b), where the data points align approximately along a line in the simplex (red dots). In both cases, the geometric structures are improved by mixing in Dirichlet-generated pseudo data (green dots, with $\alpha=(1,1,1)$).
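The first singularity type and its repair can be reproduced in a few lines. This sketch (names and the specific concentration values are ours) mixes 50 singular samples, in which one cell type is nearly absent, with 20 pseudo samples drawn with $\alpha=(1,1,1)$:

```python
import numpy as np

rng = np.random.default_rng(6)
k, n = 3, 50

# singular proportions: cell type 3 is nearly absent in every sample
P_sing = rng.dirichlet([1.0, 1.0, 0.01], size=n).T        # (k, n)
# pseudo proportions with homogeneous alpha = (1, 1, 1)
P_pseudo = rng.dirichlet(np.ones(k), size=20).T           # (k, 20)
P_mixed = np.hstack([P_sing, P_pseudo])                   # augmented data
```

The condition number of $P_{\text{mixed}}P_{\text{mixed}}^T$ drops dramatically relative to that of $P_{\text{sing}}P_{\text{sing}}^T$, which is the quantitative counterpart of the geometric improvement in Figure 5.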
Figure 6 displays the deconvolution results of cellular proportions for a set of synthetic data in which the cellular composition has the aforementioned singularities. The first and second rows correspond to the original GSNMF and the proposed GSNMF+ methods, respectively. The synthetic bulk tissue data G is generated by multiplying a predefined matrix C by the above-mentioned pseudo cellular composition matrix P and adding noise ϵ. There are in total one hundred synthetic samples. The ground truth cellular composition is shown as orange triangle plots, while the deconvolution results are shown as blue stars. In the left panel, cell type 2 is rarely present in the samples, so its proportions are very small (below 0.1). The original GSNMF algorithm cannot obtain accurate results for this cell type: the solution correlation is 0.74 and the RMSD is 0.96. Moreover, for the corresponding column of matrix C the solution correlation is 0.52 and the RMSD is 0.99. In contrast, the deconvolution accuracy is significantly enhanced by the GSNMF+ algorithm: for this cell type, the correlation is 0.96 and the RMSD is 0.06. The accuracy for matrix C is also enhanced, with a correlation of 0.98 and an RMSD of 0.13. A similar performance increase is found when the data contain correlated cellular compositions, as shown in the right panel of Figure 6. In this case, the cellular proportions of cell types 1 and 2 are designed to have a strong correlation. As a consequence, the deconvolution results for these two cell types by GSNMF are highly inaccurate: the solution correlation for cell type 1 is 0.99, but the RMSD is 2.21; for cell type 2, both metrics are poor, with a correlation of 0.56 and an RMSD of 2.22. The estimate of matrix C is entirely wrong, with an RMSD of 5.89. When GSNMF+ is used, the solution accuracy for all cell types is increased, with all correlations beyond 0.99 and RMSDs of 0.07, 0.05, and 0.12, respectively. Correspondingly, the overall RMSD of matrix C decreases to 0.19.
Table 2 displays the performance of GSNMF+ under various noise-to-data ratios (NDRs), reporting the RMSD of C, the RMSD of P, and the relative residue ‖G−CP‖F/‖G‖F. As indicated by the table, the errors in matrix C increase noticeably as more noise is present (larger NDR). In contrast, the computation of matrix P appears less vulnerable to the noise level.
| NDR | RMSD of C | RMSD of P | Relative residue |
|---|---|---|---|
| 0.081 | 0.1802 | 0.0888 | 0.0793 |
| 0.165 | 0.2007 | 0.0940 | 0.1743 |
| 0.346 | 0.2372 | 0.1045 | 0.4024 |
We would like to point out that handling noise is a common challenge in all areas of data science, but it is not the major scope of the current work; obtaining meaningful biological solutions from the ill-posed NMF model is itself still an open problem. Therefore, the noise levels used in all synthetic data are low to medium, and only common Gaussian noise is used.
The first set of biological data used to validate the proposed algorithms is GSE19830 [67], which was obtained from brain, liver, and lung tissue samples of a single rat using expression arrays (Affymetrix). Homogenates of these three tissue types were mixed at known proportions, and the gene expression pattern of each mixed sample was measured. The GSE19830 dataset mimics the common scenario of heterogeneous biological samples. Both the relative frequency of each component and the marker genes of each cell type are clearly recognizable, so it has been used as a preliminary test in many studies [25,31] to validate computational algorithms. For this dataset, the cell type number is k=3 and the tissue sample number is n=33. After necessary data preprocessing to exclude obvious outliers (by row norm, column norm, etc.), we keep N=10,000 of the ∼12,000 total genes.
Since the dimension k is low, it is easy to visualize the data features as shown in Figure 7. First, the bulk-tissue data can be clearly categorized into three clusters corresponding to the three cell types from distinct organs, as shown in Figure 7(a), and hence marker genes from each type can be easily identified. Second, the constituent (ground truth) cellular proportions are well designed to avoid correlation. As displayed in Figure 7(b), the proportion data is sufficiently scattered in the simplex.
Figure 8 shows such a case, with orange triangle lines as the ground truth cellular proportions and blue stars as the corresponding deconvolution results. Our previous GSNMF package obtained a highly accurate solution, as shown in Figure 8(a), with solution correlations of 0.96, 0.98, and 0.99, respectively. However, we also observed that, from sample to sample, the simulations consistently overestimate the proportion of one cell type (brain) while underestimating another (liver). We hypothesize that such a systematic error is due to the unknown rescaling factor, so the proposed GSNMF+ algorithm is applied to the same dataset. With the estimated rescaling factors and the adjustment strategy, the accuracy is further improved, as shown in Figure 8(b). It is important to note that this improvement, or rescaling, is global (across all samples): the errors in samples 6–8 are not significantly reduced.
Quantitative errors, i.e., the sample-wise l2 norms of the cellular proportion errors, are displayed in Figure 9. The orange bars are for the original GSNMF algorithm, while the blue bars are for the GSNMF+ algorithm developed in this paper. Computational errors are shown on a logarithmic scale, and the improvement in accuracy is evident.
The second dataset is the brain tissue data GSE67835, in which the samples consist of six cell types (k=6), including astrocytes, endothelial cells, microglia, neurons, and oligodendrocytes. There are two sets of samples: one contains an extremely small amount of microglia (Case Ⅰ), while the other has a small proportion of endothelial cells (Case Ⅱ). Such singular data pose great challenges and significantly decrease computational accuracy. Figure 10 compares the deconvolution results for these two cases (top and bottom rows) from GSNMF and GSNMF+, respectively. The orange curves are the ground truth, while the blue curves are the deconvolution results. GSNMF significantly overestimates the cellular proportion of microglia in Case Ⅰ and of endothelial cells in Case Ⅱ. In contrast, the computational results from GSNMF+ are more accurate.
It is reasonable to investigate the algorithm performance when different amounts of pseudo-bulk tissue data are augmented, or when the reference scRNA-seq data are perturbed. Figure 11 displays the solution accuracy for two datasets, GSE67835 and GSE81608, when different amounts of pseudo bulk samples are used in the GSNMF+ algorithm. In these settings, the original bulk tissue data consist of 50 singular samples (denoted 50S; constituent cellular compositions with extremely small variation in the Dirichlet distribution), while different amounts of regular samples (constituent cellular compositions with large variation) are augmented, ranging from 10L to 300L. For example, the label "50S+20L" means a mixture of 50 original bulk tissue samples whose constituent cellular proportions have small variance and 20 pseudo-bulk tissue samples whose constituent cellular proportions have large variance. Solution accuracy is presented in terms of Pearson correlation, RMSD, and mAD. It can be concluded that "regular" samples significantly enhance deconvolution accuracy. As the number of "regular" samples increases, the solution accuracy fluctuates to a certain degree, but no obvious improvement or decay is observed. This phenomenon echoes the theoretical result on the weak identifiability condition: only a few columns of P need to be close to each fundamental basis vector in order for cone(P) to cover the second-order cone in Rn+. Therefore, too many pseudo-bulk tissue samples do not provide additional benefits.
Another scenario in real applications is that the reference scRNA-seq data used to construct pseudo-bulk tissues may not be consistent with the constituent cell specific GEP in the bulk tissues. To investigate the impact of inconsistent scRNA-seq data on the proposed algorithm, we conduct various model resilience analyses by generating pseudo-bulk tissue samples from artificially modified reference scRNA-seq data. These modifications include mean shifting (MS), truncation (TR), and factoring (FA), which are specifically defined as follows:
● Mean Shifting (MS): Each single cell profile of a given type was augmented by adding 10%, 30%, 50%, or 70% of the average expression levels of that cell type.
● Truncation (TR): Single cells were selectively excluded from the bulk tissue data generation. We removed the top or bottom 10% of cells in each cell type.
● Factoring (FA): A scaling factor was applied to all gene expression levels, allowing for both upward and downward adjustments. The scaling factors used were 0.4, 0.8, 1.2, and 1.8.
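The three perturbations above can be sketched as follows. This is an illustrative implementation, not the authors' code; in particular, the criterion used to rank cells for truncation (total expression) is our assumption, since the text does not specify it:

```python
import numpy as np

def mean_shift(cells, frac):
    """MS: add frac x (cell-type average expression) to every single-cell profile.

    cells : (genes, n_cells) expression matrix for one cell type.
    """
    return cells + frac * cells.mean(axis=1, keepdims=True)

def truncate(cells, frac=0.1, top=True):
    """TR: drop the top (or bottom) frac of cells, ranked here by total expression."""
    order = np.argsort(cells.sum(axis=0))          # ascending total expression
    n_drop = int(frac * cells.shape[1])
    keep = order[:-n_drop] if top else order[n_drop:]
    return cells[:, keep]

def factoring(cells, scale):
    """FA: multiply all gene expression levels by a common scaling factor."""
    return scale * cells
```

Each modified reference set would then be fed into the pseudo-bulk generation step in place of the original scRNA-seq data.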
All of these manipulations were applied to the reference set, which consisted of 200 large-variance (L) samples. As shown in Figure 12, the manipulations did not significantly influence the final results: all results are close to the original estimates, indicating that the algorithm is robust to such changes in the reference set.
In Table 3, we compare GSNMF+ with some popular NMF-related deconvolution algorithms, MuSiC [23], CIBERSORTx [24], and LinSeed [25], on dataset GSE67835. Among these algorithms, MuSiC and CIBERSORTx are reference-based (scRNA-seq) methods, GSNMF+ uses the reference only to reduce data singularity, and LinSeed does not use reference data at all.
| Method | MuSiC | CIBERSORTx | LinSeed | GSNMF+ |
|---|---|---|---|---|
| **Correlation** | | | | |
| Original | 0.9943 | 0.9643 | 0.8777 | 0.8712 |
| snRNAseq | 0.8538 | 0.7435 | 0.8888 | 0.8636 |
| Simulator | 0.2727 | 0.035 | 0.8749 | 0.9859 |
| **RMSD** | | | | |
| Original | 0.0416 | 0.1328 | 0.2025 | 0.1967 |
| snRNAseq | 0.2399 | 0.2486 | 0.1942 | 0.1946 |
| Simulator | 0.3806 | 0.4237 | 0.2049 | 0.0606 |
As mentioned earlier, reference-based methods (partial deconvolution) provide higher accuracy than reference-free methods (complete deconvolution) under the assumption that the reference is consistent with the constituent GEP in the bulk tissue. However, this assumption is often violated in real applications. To test these scenarios, we take the reference to be the one from GSE67835 (the "Original"), a single-nucleus RNA-seq dataset of the human brain from [68] (the "snRNAseq"), or one generated from scDesign [69], a widely used probability model built on real scRNA-seq data (the "Simulator"). The top and bottom halves of the table display the correlation and RMSD, respectively, of the cellular compositions obtained by each method. It is clear that MuSiC and CIBERSORTx outperform the reference-free methods in both error metrics when the reference is consistent. This is expected, because with a good reference the NMF method reduces to a least-squares/linear-regression model. However, when the reference differs from the constituent GEP in the bulk tissue data, as with the snRNAseq or Simulator references, the solution accuracy decreases significantly. On the other hand, the solution accuracy of LinSeed shows no obvious change, since this method does not use a reference at all. For our proposed GSNMF+, the reference is used only to handle data singularity and the problem remains essentially an NMF problem, so the solution accuracy does not decrease, and sometimes even increases, when the reference is changed.
Complete deconvolution is a class of computational approaches that decompose the experimentally more available, reliable, and relatively inexpensive bulk-tissue RNA-seq data into cell type specific GEPs and cellular compositions, which are crucial inputs to DE analysis for revealing novel insights into genes and pathways in human diseases. Mathematical and statistical methods have been developed to perform complete deconvolution, and nonnegative matrix factorization (NMF) is the fundamental model on the mathematical side. However, it is well known that NMF is an ill-posed problem due to its non-unique solutions. Many efforts have been made to improve solution accuracy, and promising computational tools have been proposed, but a comprehensive mathematical investigation of the ill-posedness remains under-developed.
In this paper, we focused on the solvability conditions of NMF theory for bulk-tissue data. Of the two constituent matrices, the cell specific GEP matrix C satisfies the weak identifiability conditions due to the existence of marker genes, so we can identify the geometric structures of marker genes from the bulk-tissue data and preserve these structures when solving for C. On the other hand, the cellular abundance matrix P usually violates the identifiability conditions because of possibly rare or correlated cell proportions. To address this issue, we used the Dirichlet distribution to generate pseudo cellular proportions and then constructed the corresponding pseudo bulk tissue data with available single-cell RNA-seq data. As a result, our GSNMF algorithms are applied to the hybrid bulk-tissue data (original plus pseudo), in which both constituent matrices satisfy the weak identifiability condition. Even when the solvability conditions are met, the "unique" solution of NMF is still subject to an arbitrary rescaling matrix. We tackled this challenge by choosing a specific type of initial condition and estimating the possible deviation from the theoretical solution. Although this investigation could only be established in the ideal noise-free scenario, it is the first time this solution ambiguity has been studied, and it sheds light on qualitatively improving deconvolution results. Our proposed algorithm pipeline has been tested on several datasets and significantly improves solution accuracy for bulk-tissue data with singular cellular compositions.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by US National Institute Of General Medical Sciences of the National Institutes of Health under Award Number R01GM148971. The author NET is supported by RF AG051504, U01 AG046139, R01 AG061796, U19 AG074879 to NET and Alzheimer's Association Zenith Fellows Award (ZEN-22-969810).
The authors declare there is no conflict of interest.
GEP | Gene expression profile |
DE | Differential expression |
NMF | Nonnegative Matrix Factorization |
GS-NMF | Geometric structure constrained NMF |
ADMM | Alternating direction method of multipliers |