A simple and bounded model of population dynamics for mutualistic networks

  • Dynamic population models are based on the Verhulst's equation (logisitic equation), where the classic Malthusian growth rate is damped by intraspecific competition terms. Mainstream population models for mutualism are modifications of the logistic equation with additional terms to account for the benefits produced by the interspecies interactions. These models have shortcomings as the population divergence under some conditions (May's equations) or a mathematical complexity that difficults their analytical treatment (Wright's type II models). In this work, we introduce a model for the population dynamics in mutualism inspired by the logistic equation but cured of divergences. The model is also mathematically more simple than the type II. We use numerical simulations to study the model stability in more general interaction scenarios. Despite its simplicity, our results suggest that the model dynamics are rich and may be used to gain further insights in the dynamics of mutualistic interactions.

    Citation: Juan Manuel Pastor, Javier García-Algarra, Javier Galeano, José María Iriondo, José J. Ramasco. A simple and bounded model of population dynamics for mutualistic networks[J]. Networks and Heterogeneous Media, 2015, 10(1): 53-70. doi: 10.3934/nhm.2015.10.53

    Related Papers:

    [1] Antonio Gagliano, Salvatore Giuffrida, Francesco Nocera, Maurizio Detommaso . Energy efficient measure to upgrade a multistory residential in a nZEB. AIMS Energy, 2017, 5(4): 601-624. doi: 10.3934/energy.2017.4.601
    [2] Afamia Elnakat, Juan D. Gomez, Martha Wright . A measure to manage approach to characterizing the energy impact of residential building stocks. AIMS Energy, 2016, 4(4): 574-588. doi: 10.3934/energy.2016.4.574
    [3] Hamza El Hafdaoui, Ahmed Khallaayoun, Kamar Ouazzani . Activity and efficiency of the building sector in Morocco: A review of status and measures in Ifrane. AIMS Energy, 2023, 11(3): 454-485. doi: 10.3934/energy.2023024
    [4] Hossam A. Gabbar, Ahmed Eldessouky, Jason Runge . Evaluation of renewable energy deployment scenarios for building energy management. AIMS Energy, 2016, 4(5): 742-761. doi: 10.3934/energy.2016.5.742
    [5] Sergio Copiello . Building energy efficiency: New challenges for incentive policies and sustainable business models. AIMS Energy, 2024, 12(2): 481-483. doi: 10.3934/energy.2024022
    [6] Theocharis Tsoutsos, Stavroula Tournaki, Maria Frangou, Marianna Tsitoura . Creating paradigms for nearly zero energy hotels in South Europe. AIMS Energy, 2018, 6(1): 1-18. doi: 10.3934/energy.2018.1.1
    [7] Fiona Bénard-Sora, Jean-Philippe Praene, Yatina Calixte . Assess the local electricity consumption: the case of Reunion island through a GIS based method. AIMS Energy, 2018, 6(3): 436-452. doi: 10.3934/energy.2018.3.436
    [8] Abanda F.Henry, Nkeng G.Elambo, Tah J.H.M., Ohandja E.N.Fabrice, Manjia M.Blanche . Embodied Energy and CO2 Analyses of Mud-brick and Cement-block Houses. AIMS Energy, 2014, 2(1): 18-40. doi: 10.3934/energy.2014.1.18
    [9] Zhongjiao Ma, Zichun Yan, Mingfei He, Haikuan Zhao, Jialin Song . A review of the influencing factors of building energy consumption and the prediction and optimization of energy consumption. AIMS Energy, 2025, 13(1): 35-85. doi: 10.3934/energy.2025003
    [10] Lamya Lairgi, Rachid Lagtayi, Yassir Lairgi, Abdelmajid Daya, Rabie Elotmani, Ahmed Khouya, Mohammed Touzani . Optimization of tertiary building passive parameters by forecasting energy consumption based on artificial intelligence models and using ANOVA variance analysis method. AIMS Energy, 2023, 11(5): 795-809. doi: 10.3934/energy.2023039
  • Dynamic population models are based on the Verhulst's equation (logisitic equation), where the classic Malthusian growth rate is damped by intraspecific competition terms. Mainstream population models for mutualism are modifications of the logistic equation with additional terms to account for the benefits produced by the interspecies interactions. These models have shortcomings as the population divergence under some conditions (May's equations) or a mathematical complexity that difficults their analytical treatment (Wright's type II models). In this work, we introduce a model for the population dynamics in mutualism inspired by the logistic equation but cured of divergences. The model is also mathematically more simple than the type II. We use numerical simulations to study the model stability in more general interaction scenarios. Despite its simplicity, our results suggest that the model dynamics are rich and may be used to gain further insights in the dynamics of mutualistic interactions.


    Over the past decade, RNA-sequencing (RNA-seq) has been replacing microarray technology as the primary high-throughput method used to measure gene expression [1]. In a biological sample, the amount of messenger RNA (transcript abundance) derived from a gene is known as the gene's expression level in that sample. For each gene, RNA-seq is a count positively correlated with the gene's transcript abundance. A diploid genome has two sets of chromosomes, one from each parent, so every gene has two copies. RNA-seq can be used to measure the expression of each gene copy separately when the two gene copies exhibit sequence differences. These two separate measures of expression for a single gene are known as allele-specific expression (ASE) measures, which can be obtained using single nucleotide polymorphism (SNPs) that makes it possible to distinguish the expression of the two alleles [2]. The study of ASE may provide some explanation for so-called heterosis effects. In plant breeding, phenotypic heterosis occurs when hybrid lines show improvements in several phenotype traits compared with their inbred parent lines [3]. Heterozygous hybrid varieties might take advantage of having two alleles with different genotypes in order to adapt to environmental conditions by promoting the selection of the superior allele. The uneven expression of alleles might be related to the superior adaptation of hybrids, so it might be related to the occurrence of gene heterosis [4,5]. Other biological questions where ASE is relevant may include identifying imprinting or parent-of-origin effects, which occurs in genes where only one parental allele is expressed, the distinction between cis-acting and trans-acting regulation DNA relies on ASE since cis-acting is associated with differentially expressed alleles while trans-acting has effects both alleles [2].

    Several modelling strategies has been proposed to analyze ASE data. Given the total ASE, i.e., the sum of counts in both alleles, the so-called reference allele count can be modeled as binomially distributed [5], or use Beta-binomial distribution which includes gene-specific overdispersion [6,7,8,9]. Instead of modeling ASE counts based on a binomial distribution, it is possible to adapt models originally designed for dealing with total RNA-seq transcript abundance counts, Poisson [4], generalized Poisson [10,11] and negative binomial distributions [12] has been proposed. [13] provide an extensive review of the methods to detect differential expression for total RNA-seq data. Differentially expressed genes can be obtained applying a binomial test for each gene and adjusting p-values to control false discovery rate (FDR). Total RNA-seq expression and ASE can be combined to distinguish factors that affect the gene expression in an allele-specific manner (cis-QTL) from factors that affect the gene expression of the two alleles at the same time (trans-QTL). A likelihood ratio test distinguishes cis and trans regulation by combining ASE beta-binomial model with a model for the total RNA-seq counts [2]. The model is extended in [14] to incorporate isoform-specific information and haplotype modeling.

    In this paper, a hierarchical overdispersed count regression model is proposed to study allele-specific expression. This modeling framework allows easy generalization to include additional genotypes, tissue types, and additional alleles. These more complex models will allow researches to address more complex biological questions. The method is applicable whenever heterozygous genotype expression is available. In cases with uncertainty about the genotype, an initial stage is needed to determine sites from heterozygous genotype before inference about ASE is possible.

    The hierarchical aspect of the approach is very important, we learn the gene-specific parameters hierarchical distribution from data, i.e. perform full Bayesian inference. The proposed model is able to capture the key features of ASE data such as reference allele bias, and is more flexible in the modelling of biological sample random effects. In addition, fully Bayesian inference allow to detect relevant genes based on summaries of the posterior probability of being differentially expressed between alleles with no need of multiplicity corrections. The main problem to obtain full Bayesian inference in this problem is computational, we use GPU-accelerated algorithm to obtain posterior samples [15].

    The next Section describe the main characteristic ASE data different from the total RNA-seq expression. Section 3 describes the statistical model we propose to analyze ASE data. Sections 4 and 5 presents results from a simulation study and a real ASE data set analysis, respectively. Finally, 6 presents a summary of the main findings and comments on the next steps in this line of research.

    ASE counts from RNA-seq data are typically obtained by first mapping the short RNA-seq reads to a reference genome and then assigning those reads to a particular allele using known single nucleotide polymorphisms (SNPs). If no SNPs are known in a particular short read, then there is no ASE information for that read and that read cannot be assigned to a particular allele. The proportion of genes having some ASE information available depends on genomic similarity and RNA-seq read length [2].

    Assume ASE counts are available for a single variety $ BM $ whose parents are the varieties $ B $ and $ M $. The ASE counts are formed by two transcript abundance counts per gene each sample. In plant breeding experiments, it is common that the parental varieties, are inbred lines and thus haplotypes are known. While we focus on a plant breeding experiment where the parents are inbred lines with haplotypes known, the methodology could be utilized whenever sample are taken from a single variety and there are two possible alleles for each gene corresponding to two different varieties.

    To set some ideas, we perform an initial data exploration letting $ m_{ga} $ be the mean expression level for allele $ a \in \left\{B, M \right\} $ of gene $ g \in \left\{1, \ldots, G \right\} $ over all available samples. Then let $ A_g = m_{gB}+m_{gM} $ be the average gene abundance and $ R_g = m_{gB}/m_{gM} $ be the allele ratio. Figure 1, based on plant breeding experiment more fully described in Section 5, illustrates some characteristics usually present in ASE gene transcript abundance data using the summary measures $ A_g $ and $ R_g $.

    Figure 1.  (Left panel) Histogram of the allele ratio with best fitting normal density (blue line) for comparison. (Right Panel) Two-dimensional histogram for abundance (x-axis) and allele ratio (y-axis) with zero (black line) and mean allele ratio (red line) indicated.

    The left panel in Figure 1 presents a histogram of $ R_g $ with a reference Gaussian density constructed using the sample mean and sample variance. The empirical distribution of $ R_g $ is more concentrated around 0 and has heavier tails than a normal distribution. This characteristic is not exclusive to ASE counts, differential expression measures in total RNA-seq counts and microarrays are typically similar.

    The right panel in Figure 1 shows a two-dimensional histogram of the $ (A_g, R_g) $ pairs. The most frequent cells are close to zero allele difference (black line) for any level of average ASE suggesting that most of the genes have small differences in the ASE counts. In addition, more genes fall above zero allele difference than below and the gene-wide average ratio (red line) is also positive suggesting that allele $ B $ has higher ASE counts than allele $ M $ on average.

    While there could be some biological reason to observe one of the alleles more expressed than the other one on average, it is known the ASE process can result in increased counts, on average, for the reference genome [16].

    The reference genome is (almost) fully known, and many times it is not possible to distinguish mismatches due to errors from genuine mismatches due to the read corresponding to a non-reference genome. A read that truly matches the reference genome is more likely to be counted than a read matching the non-reference genome, creating a bias towards reference allele counts. Alternative ASE processes can be implemented to eliminate this reference genome bias [17,18,19]. Alternatively, a conservative analysis would be to only consider those genes with significant allele imbalance against the reference allele [4]. In the following section, we develop methodology in the modeling stage that allows the analysis to adjust for this bias.

    We introduce a hierarchical overdispersed count regression model for the allele-specific counts for each sample and borrows information across genes to learn about gene-specific parameters. To estimate parameters we will utilize a Markov chain Monte Carlo (MCMC) approach utilizing an overall Gibbs sampling structure with slice sampling for intractable conditional distributions. To ameliorate the computational difficulties of sampling from the high dimensional posterior, we utilize an algorithm constructed to run on a graphics processing unit (GPU) which provides within-iteration acceleration of the algorithm [15].

    Each hybrid sample in the experiment has two sub-samples: one with RNA-seq counts for allele $ B $ and one with RNA-seq counts for allele $ M $.

    Let $ Y_{gn} $ be the allele-specific RNA-seq count of gene $ g $ in sub-sample $ n $ for $ g = \{1, 2, \ldots, G\} $ and $ n = \{1, 2, \ldots, N\} $. Let $ {\bf X}{} $ by an $ N\times P $ model matrix that contains information regarding which allele each sub-sample is associated with as well as any additional relevant experimental conditions and let $ x_n $ be the $ n $th row of $ {\bf X}{} $. We follow the approach of [15] and assume

    $ YgnindPo(hnexnβg+ϵgn),ϵgnindN(0,γ2g)
    $
    (3.1)

    where $ h_{n} $ is a measure of the library size for sub-sample $ n $ and $ \beta_g $ and $ \gamma_g $ are gene-specific model parameters.

    There might be many columns in the model matrix $ {\bf X} $ specific to particular applications, for instance to represent blocking factors or relevant covariate effects. However, there are three columns that should be present in models dealing with ASE counts. We assume the first column corresponds to an intercept term and denote its associated coefficient as $ \beta_{g, 1} $. Moreover, we assume the second model matrix column take value 1 for observed counts from the reference allele and the value -1 for observed counts of the non-reference allele. Then, the regression coefficient associated with the second column, $ \beta_{g, 2} $, represents the half difference of gene ASE, genes with $ \beta_{g, 2} = 0 $ providing evidence of equally expressed alleles. Lastly, if there is more than one biological replicate (which is usually the case), a third column should be included to represent the grouping effect of the allele-specific sub-sampling. We assume this effect it corresponds to the last column in $ X $, its associated coefficient is $ \beta_{g, P} $.

    The $ \epsilon_{gn} $ terms provide gene-specific overdispersion through a normal hierarchical distribution with mean 0 and variance $ \gamma_g^2 $. This effect implies a quadratic mean-variance relationship that could differ across genes, and admits the partition of the total gene variability into technical and biological components similar to the Poisson-gamma mixture [20].

    As we have many genes, but generally few biological samples, we wish to borrow information across the genes about the gene-specific parameters $ \beta_g $ and $ \gamma_g $. One feature common to RNA-seq and ASE counts is that, in many cases, the large effect of interest are only present for a small group of genes while the remaining genes have small to negligible effects as demonstrated in the left panel of Figure 1. This pattern can be modeled using shrinkage distributions, i.e. distributions that have more mass around the location parameter but with heavier tails. Several of these distributions can be written as a scale mixture of normals, i.e. $ \beta_{gp} \stackrel{\text{ind}}{\sim} N(\theta_p, \sigma_p^2\xi_{gp}) $ and $ \xi_{gp} \stackrel{\text{ind}}{\sim} p(\xi) $ where the marginal distribution for $ \beta_{gp} $ can be normal, Student-t, Laplace [21], or horseshoe distribution [22] by assuming a point-mass, inverse-gamma, exponential, and half-Cauchy distribution for the $ \xi_{gp} $, respectively.

    A second set of gene-specific parameters are the normal variances that control overdispersion, $ \gamma_g $. We model these variances as independent inverse-gamma distributions conditional on $ \nu $ and $ \tau $, and independent from the regression coefficients $ \beta_g $, i.e. $ \gamma_g \stackrel{\text{ind}}{\sim} IG(\nu/2, \nu\tau/2) $. With this parametrization, we have $ E(1/\gamma_g) = 1/\tau $ and the coefficient of variation is $ CV(\gamma_g) = \sqrt{2/(\nu-4)} $, and therefore $ \tau $ is related to the location of the distribution while $ \nu $ controls the amount of shrinkage around that location.

    Prior distribution for the hyperparameters of regression coefficients are set as normals for the means, $ \theta_k \stackrel{\text{ind}}{\sim} \text{N}(0, c_{k}) $, and uniform for the standard deviations $ \sigma_k \stackrel{\text{ind}}{\sim} \text{Unif}(0, s_{k}) $ [23]. Parameters controlling overdispersion effect have uniform prior, $ \nu \sim \text{Unif}(0, d) $, and gamma prior, $ \tau \sim Ga(a, b) $.

    Normal prior for location parameters $ \theta_k $ is widely used choice [24], it can be weakly informative maintaining conditional conjugacy. Similarly the gamma prior for a location-related parameter $ \tau $ represents a good balance between computation convenience and being weakly informative. The prior parameters, indicated with Roman letters, are set to obtain diffuse distributions. As the number of genes is very large, there is a lot of information about the hyperparameters in the data and thus, these diffuse priors will not have a large impact on the hyperparameter estimation [25].

    Models where gene-specific parameters have fully specified distributions, i.e. non-hierarchical models, can be estimated using MCMC methods [26]. However, fully Bayesian inference of high-dimensional hierarchical models is computationally demanding since the number of groups (or genes) is large. Usually, approximations like empirical Bayes [27] or integrated nested Laplace approximation [28] are used to obtain inference results. We instead utilize the ${\tt fbseq }$ package [29] which uses graphics processing units (GPUs) to take advantage of the embarrassingly parallel MCMC steps and parallel reductions in each iteration of the MCMC algorithm, convergence is assessed using potential scale reduction factor statistic. For computational reasons, ${\tt fbseq }$ provides posterior means and standard deviations for gene-specific parameters and full MCMC samples for all hyperparameters and few gene-specific parameters.

    An important characteristic present in some ASE data is to observe a higher transcription for one of the alleles on average across all genes, due to the positive bias towards the reference allele mentioned in Section 2 and observed in Figure 1. These systematic difference among alleles are not of interest as the goal is to identify genes showing differences among allele larger than what is explained by systematic factors.

    We consider the overall mean across all genes, $ \theta_2 $, as a measure of the systematic difference among alleles commonly due to bias towards the reference allele. Then, we define the allele effect to be the difference between alleles that is not due to bias, i.e. $ \Delta_g = \beta_{g2}-\theta_2 $. Since this is a function of a gene-specific parameter and a hyperparameter, we are not able to obtain the posterior distribution directly from the $\texttt{fbseq}$ output.

    In order to obtain inference about the gene-specific regression parameters, the posterior mean and variance from the MCMC samples can be used to create a normal approximation of its posterior distribution [15]. A similar strategy could be used to obtain credible intervals for the allele effect, $ \Delta_g $. In this case the posterior mean and variance of $ \Delta_g $ are

    $ E(Δg|y)=E(βg2|y)E(θ2|y)Var(Δg|y)=Var(βg2|y)+Var(θ2|y)2Cov(βg2,θ2|y)
    $

    As we mentioned before, ${\tt fbseq}$ does not compute this covariance. However, the variability of the hyperparameters is negligible compared to the variability of the gene-specific parameters, i.e. $ Var(\Delta_g\vert y) \approx Var(\beta_{g2} \vert y) $ since $ Var(\theta_2\vert y)-2Cov(\beta_{g2}, \theta_2 \vert y) < < Var(\beta_{g2} \vert y) $. Therefore, a normal approximation for $ \Delta_g $ has mean $ E(\Delta_g \vert y) $ and variance $ Var(\beta_{g2} \vert y) $. We show this approximation is reasonable in Figure 6 in supplemental material.

    The main goal of the proposed model is to identify genes with differentially expressed alleles (DEA). We say a gene has DEA when $ |\Delta_g| \geq c $ where $ c > 0 $ represents a threshold that must be adapted to specific applications or experiments. Here we follow [30] in setting as the DEA threshold a 25% increase in the expression level, i.e., $ c = log(1.25)/2 $.

    Unfortunately, $ P(|\Delta_g| \geq c\vert y) $ will be large for genes with large posterior uncertainty for $ \Delta_g $ even when $ E(\Delta_g \vert y) \approx 0 $. To avoid this issue, we use the statistic $ p_g = \mbox{min}\left\{ P(\Delta_g < c), P(\Delta_g > -c)\right\} $ where small $ p_g $ provide evidence in favor of DEA [28].

    A Bayesian false discovery rate correction has been proposed [28,31]. However, since we use a fully hierarchical Bayesian model and the null hypothesis has a positive probability multiplicity corrections are not needed [32,33]. Alternatively, we could minimize the following expected loss

    $ E[L(d, y)] = q \sum\limits_g d_g(1-p_g) + \sum\limits_g (1-d_g)p_g = qFD + FN $

    where $ d_g $ is an indicator that gene $ g $ has DEA, $ FD $ and $ FN $ are the posterior expected false discoveries and false negatives respectively, and $ q $ is the relative cost associated to $ FD $. [34] shows the optimal rule that minimizes $ E[L(d, y)] $ is

    $ d_g = I\left(p_g \leq \frac{1}{q+1}\right). $

    Setting $ q = 19 $ we would declare as DEA every gene with $ p_g $ lower than 5%.

    A simulation study is performed to explore how the model captures several characteristics of interest in the data, and evaluate model performance in finding genes where the allele effect is present. In this Section, we describe the data sets simulation scenarios, the analysis of each simulated data, and present simulation study results.

    In order to obtain simulated data sets close to the real data we have, we fit an initial model and use it to simulate new data. We obtain point estimates of the gene-specific regression coefficients and gene-specific overdispersion parameters using ${\tt edgeR}$ [12]. In addition, we obtain normalization factors $ h_{n}^* $ based on the method proposed by [35]. These point estimates and normalization values are used to obtain the simulated data sets. This corresponds to a negative binomial model for the ASE counts, $ Y_{gn} \stackrel{\text{ind}}{\sim} NB(h_{n}^*e^{x_n^{\top}\beta_g}, \phi_{g}) $, where $ h_{gn}^* $ are normalization factors and $ \phi_g $ control the overdispersion.

    The specific {data set} we use later in Section 5 as an application example, has 8 allele-specific observations per gene, corresponding to 4 biological replicates of a single hybrid genotype distributed in two blocks. Let $ y_g = (y_{g1}, \ldots, y_{g8}) $ such that $ y_{g1} $ and $ y_{g2} $ represent the ASE counts for the two alleles from the first sample, $ y_{g3} $ and $ y_{g4} $ represent the ASE counts for the two alleles from the second sample, etc. In order to obtain approximate independence among regression coefficients (which is an assumption in hierarchical distributions), we use a zero-sum parametrization for $ X $ as shown in Eq (4.1).

    $ X=[1111011110111101111011101111011110111101]
    $
    (4.1)

    This particular choice of $ X $ matrix implies $ \beta_{g1} $ corresponds to the intercept and $ \beta_{g2} $ to the half allele difference, as in the general model presented previously. Here we include a column to capture the difference between the two blocks, associated with coefficient $ \beta_{g3} $. Also, two columns for block and replicate interaction, $ (\beta_{4g}, \beta_{5g}) $ are included, which represent the half difference between replicates within each block. Note that usually the set of effects related with grouping factors as biological samples, share a common variance, while the model proposed here allows $ \sigma_4 \neq \sigma_5 $ which encompasses the common variance case.

    A simulation scenario is defined by four simulation design parameters: sparsity ($ w $) and strength of the allele effect ($ s $), bias toward reference allele ($ p $), and overdispersion effects ($ T $). Table 1 shows the design parameters values, in total there are 24 scenarios as a full factorial combination of the design parameters values. Each scenario is formed by the simulated ASE count for $ G = 5000 $ genes with 8 observations per gene, and is replicated 2 times.

    Table 1.  Simulation study design parameter values.
    Description Sparsity Strength Bias Overdispersion
    Parameter $ w $ $ s $ $ p $ $ T $
    Values $.5, .95 $ $ 1, 1.8 $ $ 1, .5 $ $ 0.25, 1, 4 $

     | Show Table
    DownLoad: CSV

    The estimates $ (h_{n}^*, \hat\beta_{g1}, \hat\beta_{g2}, \hat\phi_g) $ from NB model described above are used to obtain simulated data set. We construct two groups of genes depending on whether the estimated allele difference is larger than a threshold, $ |\hat\beta_{g2} | > c $ or not, and we obtain a stratified random sample with $ (1-w) $ proportion of genes with large allele difference. With the selected genes, we obtain 8 counts per gene as follows:

    $ YgnindNB(ˉheEgn,Tˆϕg)Egn=ˆβg1+ψ(s)xaˆβg2+log(p)I(xa=1)
    $

    where $ \bar h $ is the average of normalization factors, $ \psi(s) = s $ when $ |\hat\beta_{g2}| > c $ and $ \psi(s) = 1 $ in other case, and $ x_a $ takes value 1 for the reference allele and $ -1 $ in the non-reference allele. The design parameter $ s $ controls the signal strength, we set $ s = (1, 1.8) $ as weak and strong signal cases respectively. Lastly, overdispersion effects are computed as $ T\hat\phi_g $, three overdispersion scenarios are determined by the value of $ T = (.25, 1, 4) $.

    Reference allele bias is created by the $ \log(p) $ factor, to better understand why this might match with the biologic characteristic of this effect consider an intermediate step where $ Y_{gn}^* $ is a simulated count without bias (i.e. $ \log(p) = 0 $) and then $ Y_{gn} \sim Bin(Y_{gn}^*, p) $ when $ x_a = -1 $, integrating out $ Y_{gn}^* $ we can recover the negative binomial distribution. Design parameter $ p $ is the probability of actually assigning one short read to the non-reference allele, so on average $ (1-p) $ non-reference reads are lost. This implies that the mean of $ \beta_{g2} $ coefficients, $ \theta_2 $, should be close to $ -\log(p)/2 $ since $ \beta_{g2} $ captures the gene-specific half difference among the two alleles and $ \log(p) $ represent the allele between alleles averaging all genes. But not necessarily for each individual count, since the initial negative binomial simulation is independent for the two alleles within a gene.

    Every simulated data set is analyzed using data model (3.1) with five hierarchical distributions with normal scale mixture described in 3.2. We use normal, Laplace, Student-$ t_6 $, Cauchy (Student-$ t_1 $), and horseshoe. The main reason for this is to assess the impact of the hierarchical distribution of the regression coefficients on the posterior inference. We ran three MCMC chains with $ 40,000 $ iterations with thinning value of 5 in Cauchy and horseshoe cases. Still, horseshoe distribution shows lack of convergence in many scenarios for $ \theta_k $ parameters, therefore we do not present horseshoe distribution simulation results nor consider it for the real data analysis. It have beem recently pointed out that horseshoe distribution may have poor mixing in high-dimensional problems, and propose to use an elliptical slice sampler to improve it [36].

    Additionally, we fit a non-hierarchical counterpart for each version of the proposed model, i.e., fixing hyperparameters values so distribution for gene-specific parameter is no longer learned from data. In non-hierarchical Bayesian models are set with values $ \theta_k = 0 $, $ \sigma_k^2 = 3^2 $, $ \tau = .1 $, $ \nu = 1 $, and inference is obtained with 3 MCMC chains with $ 20000 $ iterations and no thinning.

    Figure 2 presents receiver operating characteristic (ROC) curves for only one replicate in simulation scenarios where only 5% of genes are truly DEA and reference allele bias is present. ROC curves are computed with ${\tt plotROC }$ package [37]. Statistic $ p_g $ is used is used as a continuous score to compute the ROC curves, we set DEA threshold in $ c = \log(1.25)/2 $, as in [30]. Results indicates that increasing the signal strength and decreasing the overdispersion level produce better detection rates for all methods. The non-hierarchical models fail into account the bias towards reference allele. Among hierarchical models, Figure 2 suggests a Cauchy distribution for gene-specific regression parameters has slightly better detection rates (other simulation scenarios present similar patterns).

    Figure 2.  ROC curves for scenarios where only 5% of genes are truly DEA and reference allele bias is present (only one replicate). Column facets correspond to overdispersion level and row facets correspond signal strength. Hierarchical models are plotted with continuous lines and dashed lines correspond to non-hierarchical models. Line color indicates the hierarchical distribution.

    Several alternative methods are used to analyze each simulated data set for performance comparison with model (3.1) results. First, we use a Poisson generalized linear model with overdispersion and biological sample effects, for each gene, the model is estimated via maximum likelihood. Second, a beta-binomial distribution is used to perform a likelihood ratio test for each gene. Finally, negative binomial model is estimated via empirical Bayes methods using ${\tt edgeR}$ package [12]. In all these three methods a p-value is obtained from testing if each gene shows evidence of DEA and then a false discovery rate correction is applied, using the method proposed in [38].

    ROC curves can be summarize computing the area under the ROC curve (AUC), a perfect detection rate would have AUC value of 1. Figure 3 shows AUC measure results only for hierarchical Bayesian models and the three alternative methods just described. The facets combine the signal strength level and sparsity level (columns) with the presence of reference allele bias (rows), overdispersion level is represented by the the color and type of the lines. Each line corresponds to one simulation scenario.

    Figure 3.  Partial area under ROC curve (AUC), over region with false positive rate lower than 10%. Facets represent signal strength and sparsity, color and line type indicates overdispersion level.

    Similarly to Figure 2 overdispersion level and signal strength have the largest impact on the signal detection performance measured with AUC across all models. There might be some interaction among the simulation design factors, for instance, signal strength shows almost no effect in AUC when overdispersion level is high.

    Figure 3 shows that Cauchy as hierarchical distribution for the regression parameters have the largest AUC measure in most simulation scenarios, particularly when 95% of gene having true effect lower than the threshold. Cauchy accommodates a lot of probability mass close to zero and its heavy tails can capture the genes with real effects. Performance of Laplace and t distributions appear to be slightly worst than Cauchy models. This might suggest degrees of freedom parameter in student-t distribution impact how the model borrows information across genes. Next, using a normal distribution for regression coefficients has the poorest AUC results among Bayesian hierarchical models, and its lower than edgeR method for most cases. The empirical Bayes method (edgeR) shows equal (or slightly better) AUCs than hierarchical models in many scenarios while is somewhat worse in highly sparse and weak signal cases. Finally, the two frequentist methods shows lowest AUC measures in every simulation scenario, and are specially affected by the presence of reference allele bias.

    We finish this Section showing how the proposed model captures the bias towards reference allele. Above we mentioned that parameter $ \theta_2 $ should capture half of the bias in log scale, i.e., we expect i.e. $ E(\theta_2\vert y) \approx -\log(p)/2 $. Figure 4 shows a scatter plot of $ E(\theta_2 \vert Y) $ against $ \log(p) $. The plot suggests posterior expectation of $ \theta_2 $ captures the bias towards the reference allele, is possible to use it as an estimate of the bias and remove it when making inference about the allele effect.

    Figure 4.  Scatter plot of $ \theta_2 $ posterior mean against $ \log(p) $ parameter. Row facets represents sparsity and column facets the overdispersion level. The line corresponds to $ y = -\frac{x}{2} $ line.

    In this Section we apply the methods described in Section 3 to a RNA-seq data set with allele-specific counts from maize that constitute a portion of the experimental data obtained by [4]. Data set includes four replicate samples of a hybrid genotype (B73xMo17) distributed in 2 flow cell blocks and two allele count measures per sample. RNA-seq transcript abundance count information for 39656 genes is obtained using Iliumna® technology and B73 genome as the reference allele [39]. However, many of them have little or no ASE information. To avoid genes with extremely low observed expression, only use genes were the average of allele-specific counts is bigger than 1. The resulting data set corresponds to the ASE counts of 16380 genes, which is 41% of the total.

    An initial exploration of this data, based on expression averages per gene, was presented in Section 2 to illustrate the main features of an ASE data set, here we use the expression for each replicate without averaging. The specific model matrix we use is the same presented in (4.1) used to create a model to simulate data. Gene-specific intercept is normally distributed while the rest of regression parameters are Cauchy distributed. The choice of $ \beta_{gk} \sim Ca(\theta_{k}, \sigma_{k}) $ is based on the results from simulation study, the models using Cauchy hierarchical distribution results in better partial AUC measures in particular in sparse cases or cases with large overdispersion levels.

    We present the main results from the analysis, we start with some remarks about the posterior inference relative to the hyperparameters of the model, and after that we focus on the results relative to gene-specific allele effects. We present credible intervals for $ \Delta_g $ and identify genes differentially expressed between alleles.

    Table 2 presents posterior summaries for all hyperparameter in the model. Posterior means and credible interval for $ (\nu, \tau) $ suggest most genes show very little or none overdispersion present, but there are a few genes with large overdispersion effects. Posterior mean of $ \theta_2 $ is positive representing the bias towards reads from B73 allele. The results suggest that expression from Mo17 allele is only 78% of the expression count from allele B73 on average across all genes. In other words, 1 out of 5 reads from Mo17 is lost presumably because is compared with a different genome. Finally, results suggest the variances of common biological sample effects are different with $ \sigma^2_4 > \sigma^2_5 $ by a factor of $ 100 $.

    Table 2.  Hyperparameter posterior summaries (B73xMo17 data).
    Parameter Posterior Mean Credible Interval (95%)
    ν 3.6 (3,4.3)
    τ 0.0023 (0.0019, 0.0028)
    θ1 2.4 (2.4, 2.4)
    θ2 0.12 (0.12, 0.13)
    θ3 -0.025 (-0.029, -0.021)
    θ4 -0.026 (-0.029, -0.024)
    θ5 0.002 (0.00015, 0.0038)
    $\sigma _1^2$ 1.7 (1.7, 1.8)
    $\sigma _2^2$ 0.012 (0.011, 0.013)
    $\sigma _3^2$ 0.013 (0.013, 0.014)
    $\sigma _4^2$ 0.0011 (0.00094, 0.0012)
    $\sigma _5^2$ 0.000015 (0.0000095, 0.000023)

     | Show Table
    DownLoad: CSV

    Figure 5 shows allele effect posterior inference results for each gene. Left panel presents 95% credible intervals of allele effects against expected gene expression with color highlighting genes with differentially expressed alleles. The expected gene expression (in logs) is computed as $ \beta_{g1} + h_n $, i.e. the posterior mean of the gene-specific intercept plus the offset value. Genes with large expected expression show smaller allele effects and shorter credible intervals than genes with low expression. There are some genes flagged as differentially expressed among alleles with very low expression level.

    Figure 5.  Allele effects for ASE counts of B73$ \times $Mo17 hybrid data. Left: Right: 95% credible intervals of allele effect against overall gene expression. Color indicates if the gene is declared as differentially expressed or not. Right: Scatter plot of observed allele ratio ($ R_g $) against the evidence of DEA ($ 1-p_g $), color of the points represents the total ASE observed count.

    The right panel of Figure 5 shows the observed allele ratio ($ R_g $) against the evidence of DEA measured as $ 1-p_g $, color of the points represents the total ASE observed count. This figure relates model results with observed data, it suggest the model results are reasonable given the observed data. Genes with large absolute value of the allele ratio presents larger probabilities of having DEA, there also some genes with relatively small allele effect but with large probability, this occurs when the total expression level is high (darker points). Additionally Figure 10 (in supplemental material) shows the allele effects estimated by the model are highly correlated with the observed allele ratio with some shrinkage towards zero value, this behavior is expected for hierarchical models.

    Figure 10.  Observed effects against allele effect from the model.

    Results indicate that 17% of the genes shows allele differential expression. When the observed allele ratio in logs is negative (favor the non-reference allele), the list of genes with DEA is contained within the list of genes previously flagged by [4]. There are many genes flagged in [4] that are discarded by our proposed model. This makes sense because we define a region of non-differential expression instead of a point value, so this smaller proportion of DEA is reasonable since the null region is larger. On the other hand, genes where the observed allele ratio favor the reference allele were previously discarded, our analysis flagged is capable to find genes high probability of DEA since correct the reference bias from the allele effect.

    Allele-specific expression refers to a transcript abundance count associated with each gene copy (allele). We propose a hierarchical overdispersed count regression model for ASE data from heterozygous genotypes to detect genes with differential expression between alleles. This model address the main characteristics of ASE data. A measure of allele effect corrected for reference allele bias and a method to obtain credible intervals for this measure are described. The proposed statistical method can be applied to multi-allelic scenarios or situations that require to model total and allele expression simultaneously. Model inference is performed in a fully Bayesian fashion. The specific MCMC algorithm is embarrassingly parallel when updating the gene-specific parameters. A parallel strategy computing is then used for computational efficiency.

    Simulation experiments suggested there are performance gains in learning the hierarchical distributions of gene-specific parameter from data. Hierarchical Bayesian models show slightly better performance than empirical Bayes approach and much better results than frequentist and non-hierarchical methods. Non-hierarchical Bayesian models performance is more heavily affected by to overdispersion and sparsity level and cannot accommodate the reference allele bias. Caution is needed for the comparison with frequentist methods since they are calibrated for p-values distribution under a point mass null hypothesis, but the simulated data had small but different from zero effects in non-DEA genes. Among hierarchical models, the better performance results in terms of signals detection were showed by Cauchy distribution. Cauchy is informative enough to produce information sharing across genes and at the same time is flexible enough (due to the heavy tails) to accommodate large true signals.

    A real data set is analyzed with the Poisson hierarchical model, using Cauchy as hierarchical distribution for gene specific regression parameters. The application consist in ASE data from four maize hybrid plants. We found evidence of DEA for 17% of the genes, results are consistent with previous analysis of the same data, our method allow to reference bias thus some of the genes were not consider before, otherwise our is somewhat more conservative. The model suggest variances of biological samples are different for each sample, the usual corrections with random effects restrict these variances to be equal. Could be relevant to explore the consequences of this restriction in the model results.

    As we mentioned earlier, the method proposed in the paper could serve as the base for a more general model. Some relevant generalizations are straightforward to incorporate. A careful construction of the design matrix (4.1) the only piece needed to include more varieties (other genomes) and total RNA-seq expression or to deal with multiple alleles data. These generalizations allow to study more relevant contrast other than differential expression among alleles. For instance, in plant breeding applications, we could study allelic imbalance, i.e. compare the allele expression ratio in hybrid with the total expression ratio among parental lines, and the relationship among hybrid vigor and allelic imbalance. Other generalizations maybe harder to incorporate, in order to work under uncertainty about the genotype, a new model stage is needed. It could be possible to use a finite mixture of Poisson distribution in (3.1), where the mixing probabilities corresponds to the probability of each genotype.

    Horseshoe distribution results showed lack-of-convergence problems so it was excluded from the proposed models. Recently, it was pointed out the poor mixing of a horseshoe implementation based on a scale mixture of normals (which is the one used in this work) and propose to use an elliptical slice sampler instead in [36]. We would like to continue working analyzing the effect of the elliptical sample for the horseshoe distribution in the proposed models.

    This research was supported by National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health and the joint National Science Foundation / NIGMS Mathematical Biology Program under award number R01GM109458. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

    The authors declare there is no conflict of interest.

    Allele effect variance approximation

    In Section 3 we stated that $ Var(\Delta_g\vert y) \approx Var(\beta_{g2} \vert y) $, we can test the approximation from a few genes having the complete MCMC samples. Figure 6 presents scatter plots of the variance of the allele effect against the variance of the regression coefficient $ \beta_{g2} $, the facets represents the hierarchical distribution used in the model and color of points represent the overdispersion level. There is a close relationship among the two plotted variances, suggesting the approximation $ Var(\Delta_g\vert y) \approx Var(\beta_{g2} \vert y) $ is reasonable.

    Figure 6.  Scatter plot of variance of allele effect against variance of regression coefficient. Facets correspond to the hierarchical distribution and color indicates the overdispersion level.

    ROC curves for complementary scenarios

    Figures 7, 8 and 9 shows ROC curves for simulation scenarios that complete the scenarios presented in the main text (Figure 2). In all three figures, row facets corresponds to overdispersion level, while column facets combine signal strength and bias. Hierarchical models are plotted with continuous lines and dashed lines correspond to non-hierarchical models. Line color indicates the hierarchical. distribution.

    Figure 7.  ROC curves for scenarios with low sparse allele effects and reference allele bias present.
    Figure 8.  ROC curves for scenarios with high sparse allele effects and no reference bias present.
    Figure 9.  ROC curves for scenarios with low sparse allele effects and no reference bias present.

    Allele effects in real data analysis

    Figure 10 illustrates the relationship among estimated allele effects ($ \Delta_g $) and the observed data.

    [1] D. Balcan, V. Colizza, B. Gonçalves, H. Hu, J. J. Ramasco and A. Vespignani, Multiscale mobility networks and the spatial spreading of infectious diseases, Proceedings of the National Academy of Sciences USA, 106 (2009), 21484-21489. doi: 10.1073/pnas.0906910106
    [2] J. Bascompte and P. Jordano, Plant-animal mutualistic networks: The architecture of biodiversity, The Annual Review of Ecology, Evolution, and Systematics, 38 (2007), 567-593. doi: 10.1146/annurev.ecolsys.38.091206.095818
    [3] U. Bastolla, M. A. Fortuna, A. Pascual-García, A. Ferrera, B. Luque and J. Bascompte, The architecture of mutualistic networks minimizes competition and increases biodiversity, Nature, 458 (2009), 1018-1020. doi: 10.1038/nature07950
    [4] U. Bastolla, M. Lässig, S. C. Manrubia and A. Valleriani, Biodiversity in model ecosystems, II: Species assembly and food web structure, Journal of Theoretical Biology, 235 (2005), 531-539. doi: 10.1016/j.jtbi.2005.02.006
    [5] W. Feller, On the logistic law of growth and its empirical verifications in biology, Acta Biotheoretica, 5 (1940), 51-66. doi: 10.1007/BF01602862
    [6] J. P. Gabriel, F. Saucy and L. F. Bersier, Paradoxes in the logistic equation?, Ecological Modelling, 185 (2005), 147-151. doi: 10.1016/j.ecolmodel.2004.10.009
    [7] J. R. Groff, Exploring dynamical systems and chaos using the logistic map model of population change, American Journal of Physics, 81 (2013), 725-732. doi: 10.1119/1.4813114
    [8] L. Gustafsson and M. Sternad, Bringing consistency to simulation of population models-Poisson simulation as a bridge between micro and macro simulation, Mathematical Biosciences, 209 (2007), 361-385. doi: 10.1016/j.mbs.2007.02.004
    [9] C. A. Johnson and P. Amarasekare, Competitionfor benefits can promote the persistence of mutualistics interactions, Journal of Theoretical Biology, 328 (2013), 54-64. doi: 10.1016/j.jtbi.2013.03.016
    [10] E. Kuno, Some strange properties of the logistic equation defined with r and k: Inherent defects or artifacts?, Researches on population ecology, 14 (1991), 33-39.
    [11] T. R. Malthus, An Essay on the Principle of Population or a View of Its Past and Present Effects on Human Happiness; with an Inquiry into Our Prospects Respecting the Future Removal on Mitigation of the Evils which It Occasions, 1st edition, Roger Chew Weightman, Washington, 1798. Available from: http://opac.newsbank.com/select/shaw/17975.
    [12] R. May, Models for two interacting populations, in Theoretical Ecology. Principles and Applications, $2^{nd}$ edition (ed. R. May), 1981, 78-104.
    [13] J. D. Murray, Mathematical Biology I: An Introduction, $3^{rd}$ edition, Springer-Verlag, New York, 2002.
    [14] R. Pearl, The biology of population growth, Zeitschrift für Induktive Abstammungs- und Vererbungslehre, 49 (1929), 336-338. doi: 10.1007/BF01847581
    [15] E. Stokstad, Will malthus continue to be wrong?, Science, 309 (2005), p102. doi: 10.1126/science.309.5731.102
    [16] P. F. Verhulst, Recherches mathematiques sur la loi d'accroissement de la population [Mathematical researches into the law of population growth increase], Nouveaux Memoires de l'Academie Royale des Sciences et Belles-Lettres de Bruxelles, 18 (1845), 1-42.
    [17] V. Volterra, Fluctuations in the abundance of a species considered mathematically, Nature, 118 (1926), 558-560. doi: 10.1038/118558a0
    [18] D. H. Wright, A simple, stable model of mutualism incorporating handling time, The American Naturalist, 134 (1989), 664-667. doi: 10.1086/285003
  • Reader Comments
  • © 2015 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(8273) PDF downloads(77) Cited by(1)

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog