Survey on the application of deep learning in algorithmic trading

Yongfeng Wang; Guofeng Yan

doi:10.3934/DSFE.2021019

Data Science in Finance and Economics

2021, Volume 1, Issue 4: 345-361. doi: 10.3934/DSFE.2021019


Review


School of Computer Science and Cyber Engineering, Guangzhou University, China

Received: 08 December 2021 Accepted: 18 December 2021 Published: 27 December 2021
JEL Codes: G15, C63

Algorithmic trading is one of the most actively studied directions in financial applications. Compared with traditional trading strategies, algorithmic trading applications perform forecasting and arbitrage with higher efficiency and more stable performance. Numerous studies have applied deep learning to algorithmic trading models for forecasting and analysis. In this article, we first summarize several deep learning methods that have shown good performance in algorithmic trading applications and briefly introduce some of those applications. We then provide a snapshot of the latest applications of deep learning in algorithmic trading and show the different implementations of the developed algorithmic trading models. Finally, we suggest some possible research issues for future work. The prime objectives of this paper are to provide a comprehensive account of research progress on deep learning applications in algorithmic trading and to benefit subsequent research on computer program trading systems.


Citation: Yongfeng Wang, Guofeng Yan. Survey on the application of deep learning in algorithmic trading[J]. Data Science in Finance and Economics, 2021, 1(4): 345-361. doi: 10.3934/DSFE.2021019

Related Papers:

[1]	Thanh Nam Nguyen, Jean Clairambault, Thierry Jaffredo, Benoît Perthame, Delphine Salort . Adaptive dynamics of hematopoietic stem cells and their supporting stroma: a model and mathematical analysis. Mathematical Biosciences and Engineering, 2019, 16(5): 4818-4845. doi: 10.3934/mbe.2019243
[2]	H. Thomas Banks, W. Clayton Thompson, Cristina Peligero, Sandra Giest, Jordi Argilaguet, Andreas Meyerhans . A division-dependent compartmental model for computing cell numbers in CFSE-based lymphocyte proliferation assays. Mathematical Biosciences and Engineering, 2012, 9(4): 699-736. doi: 10.3934/mbe.2012.9.699
[3]	J. Ignacio Tello . On a mathematical model of tumor growth based on cancer stem cells. Mathematical Biosciences and Engineering, 2013, 10(1): 263-278. doi: 10.3934/mbe.2013.10.263
[4]	Rui-zhe Zheng, Jin Xing, Qiong Huang, Xi-tao Yang, Chang-yi Zhao, Xin-yuan Li . Integration of single-cell and bulk RNA sequencing data reveals key cell types and regulators in traumatic brain injury. Mathematical Biosciences and Engineering, 2021, 18(2): 1201-1214. doi: 10.3934/mbe.2021065
[5]	Xiao Tu, Qinran Zhang, Wei Zhang, Xiufen Zou . Single-cell data-driven mathematical model reveals possible molecular mechanisms of embryonic stem-cell differentiation. Mathematical Biosciences and Engineering, 2019, 16(5): 5877-5896. doi: 10.3934/mbe.2019294
[6]	Eminugroho Ratna Sari, Fajar Adi-Kusumo, Lina Aryati . Mathematical analysis of a SIPC age-structured model of cervical cancer. Mathematical Biosciences and Engineering, 2022, 19(6): 6013-6039. doi: 10.3934/mbe.2022281
[7]	Mingshuai Chen, Xin Zhang, Ying Ju, Qing Liu, Yijie Ding . iPseU-TWSVM: Identification of RNA pseudouridine sites based on TWSVM. Mathematical Biosciences and Engineering, 2022, 19(12): 13829-13850. doi: 10.3934/mbe.2022644
[8]	Azmy S. Ackleh, Jeremy J. Thibodeaux . Parameter estimation in a structured erythropoiesis model. Mathematical Biosciences and Engineering, 2008, 5(4): 601-616. doi: 10.3934/mbe.2008.5.601
[9]	Linqian Guo, Qingrong Meng, Wenqi Lin, Kaiyuan Weng . Identification of immune subtypes of melanoma based on single-cell and bulk RNA sequencing data. Mathematical Biosciences and Engineering, 2023, 20(2): 2920-2936. doi: 10.3934/mbe.2023138
[10]	Tin Phan, Changhan He, Alejandro Martinez, Yang Kuang . Dynamics and implications of models for intermittent androgen suppression therapy. Mathematical Biosciences and Engineering, 2019, 16(1): 187-204. doi: 10.3934/mbe.2019010


1. Introduction

The ability to apply genome sequencing methods to single cells has revolutionized biology^[1]. Technologies enabling single-cell sequencing are advancing rapidly, and datasets as large as hundreds of thousands of cells are now common^[2]. RNA sequencing is currently the most prevalent form of single-cell genomic analysis^[1]. Sequencing RNA at the cellular level enables the interrogation of gene transcription, which serves as a high-dimensional fingerprint characterizing the identity of the cell. For this reason, single-cell RNA-sequencing (scRNA-seq) data have been used as a tool to study cell identity and state-transitions at the cellular level.

The most frequently studied cell state-transition is cellular differentiation: the process by which a cell and its progeny come to perform specialized tasks through transformation from a less differentiated, stem-like state to a more differentiated state. ScRNA-seq is used to identify cells in various states of differentiation primarily through one or both of two methods: 1) clustering of cells with similar features^[3], or 2) trajectory inference (TI)^[4]. Clustering analysis relies on the definition of a similarity metric, and may rely on a pre-defined number of clusters (e.g., k-means), or may use optimization methods to identify clusters (e.g., Leiden). There is a wide variety of clustering methods and similarity metrics to choose from, which may give drastically different results^[5]. Similarly, trajectory inference methods may use pre-defined relationships between cells, or may use optimization methods to identify these relationships, to construct graphs which are then used to infer paths, or trajectories, between cell states. In addition, various approaches aim to characterize the cell fate landscape, for instance, by a parameterized landscape based on bifurcation analysis ^[6,7], by using a measure of entropy of cell states (SCENT ^[8] and scEpath ^[9]), or by mapping cells to a landscape on optimized parameters (HopLand ^[10] and Topslam ^[11]).

A significant limitation of these approaches arises when the graph structure and the underlying relationships between the cells are unknown. As shown in a comprehensive review of trajectory inference methods by Saelens et al. (2019) ^[4], most TI algorithms have difficulty inferring even simple graphs that include cycles or disconnected subgraphs. Because of the limitations of clustering and trajectory inference in analysing these data, we suggest that a hypothesis-driven, mathematical approach to the analysis of scRNA-seq data for studying cell state transitions is warranted.

Moreover, single-cell genomic sequencing suffers from a number of challenges in analysis. Beyond the several choices to be made for even simple analyses such as clustering or visualization, the data may be sparse and incomplete. Gene "drop outs" and background signal (noise) can complicate differential expression and clustering analyses. For this reason, analysis of these data has remained fairly superficial despite the wealth of information contained in these high-dimensional datasets. Moreover, results obtained from single-cell sequencing datasets may be very sensitive to the choice of analysis method and algorithm parameters. To date, single-cell sequencing data have not been effectively leveraged as inputs into mathematical models.

Here we compare two cell state geometries for modeling cell state-transitions with scRNA-seq data. Building on our prior work ^[12], we model cell differentiation as a continuous process. To elaborate on this concept: when cell type-A becomes cell type-B, the cell states during the transition are often classified into finer steps such as type-A $1/2$ , or types-A $1/4$ , A $2/4$ , A $3/4$ . The continuous cell states can be considered as a limit of these states. We develop phenotype-structured cell state models assuming continuous cell states using reaction-diffusion-advection partial differential equations (PDEs) solved on: (i) an abstracted graph and (ii) a multi-dimensional continuum space. We compare and contrast these two cell state geometries with hematopoiesis as a model system. This manuscript is structured as follows: first we present the PDE model on a graph and in continuous space, then we apply the model to two datasets ^[13,14]. We examine the impact of various graph construction and trajectory inference methods on the geometry of the cell state space, and solve the model on these geometries. We then use the model to study the effects of perturbing 1) the graph structure, 2) the expression of select subsets of genes, and 3) the cell state transition dynamics, by perturbing flow on the graph or by modifying the dynamics in the continuous space. We predict novel dynamics of leukemia pathogenesis by perturbing normal hematopoiesis and conclude with a comparative analysis of our approach and a description of future directions for mathematical modeling with single-cell genomic sequencing data.
A summary of our workflow is shown in Figure 1.

Figure 1. Step-by-step illustration of our modeling process. 1. Processed single-cell sequencing expression matrices are represented in a reduced dimension space through one of many dimension reduction techniques such as diffusion mapping, PCA, t-SNE, or UMAP. 2. Cell clusters are inferred to construct the cell state geometry, either the graph or the multi-dimensional continuum of cell states. 3. From these representations, models are calibrated to the transport of the cell distribution along the graph or in the cell state space. 4. The models can then be used to perturb genes and cell states. The calibrated models predict cell state-transitions and the emergence of novel cell states.


2. Materials and method

2.1. Modeling cell state-transitions in a continuous cell state-space

In this section we develop PDE models of cell dynamics in the continuous phenotype space identified by dimension reduction techniques.

For a given single-cell genomic sequencing dataset

${\{ \mathit{\boldsymbol{g}}^i \}_{i = 1}^N, \qquad {\mathit{\boldsymbol{g}}}^i = (g_1^i, g_2^i, ..., g_G^i) \in {{\mathcal{R}}}^G, }$

where $N$ is the number of cells and ${\mathit{\boldsymbol{g}}}^i$ is a $G$ -dimensional vector of gene expression of the $i$ -th cell, the dimension reduction method can be written as an operator $\mathcal{P}: {{\mathcal{R}}}^G \rightarrow \Gamma \subset {{\mathcal{R}}}^n$ where the reduced dimensional space is truncated at the $n$ -th dimension and $n \ll G$ . We denote the reduced space variable as

$\begin{equation} \theta = \mathcal{P}(g) \in \Gamma \subset {{\mathcal{R}}}^n, \qquad \theta = (\theta_1, \, \theta_2, \, ..., \theta_n), \end{equation}$

(2.1)

and the $i$ -th single-cell data point is mapped into the reduced space as $\mathcal{P}(g^i) = \theta^i = (\theta_1^i, \, \theta_2^i, \, ..., \theta_n^i)$ . Various dimension reduction techniques exist to construct a mapping $\mathcal{P}$ , including principal component analysis (PCA), diffusion mapping, and t-distributed stochastic neighbor embedding (t-SNE). While different techniques provide different shapes and differentiation spaces, we choose diffusion mapping for its ability to capture the non-linear structure of high-dimensional data and to reproduce the global trajectory of the data well ^[15]. We comment that if there is no clear low dimension at which to truncate the reduced space, one can consider a small set of marker genes chosen according to the cell state of interest, and a semi-supervised learning approach can be applied to obtain the low-dimensional reduced space. We also comment that it is common to remove the effect of the cell cycle from the gene expression data, so that cell states are not distinguished by their position along the cell cycle ^[16].

2.2. PDE model of cell state-transitions on a multi-dimensional reduced component space

We first develop a cell state model that describes the dynamics of the cell distribution $u(t, \theta)$ on the reduced component space $\Gamma$ , where $\theta \in \Gamma$ is the variable that represents the continuous cell state. Three distinct dynamic regimes of the cell states are considered, namely, directed cell transition, birth-death process, and random phenotypic instability. Such a model can be written as an advection-reaction-diffusion PDE that governs the cell distribution $u(t, \theta)$ as

$\begin{equation} \partial_t u(t, \theta) = -\nabla \cdot \left( V(t, \theta) u(t, \theta) \right) + R( \theta, u(t, \theta)) + \nabla \cdot \left( D(\theta) \nabla u(t, \theta) \right), \end{equation}$

(2.2)

with zero Dirichlet boundary condition. The three terms in our equation that involve parameterized functions $V$ , $R$ , and $D$ , represent cell differentiation, population growth, and phenotypic instability, respectively.

Let us first describe the advection term $V \in {{\mathcal{R}}}^n$ that represents directed cell differentiation, for which we propose two candidates, denoted as ${\bf{v}}_1$ and ${\bf{v}}_2$ . The first candidate ${\bf{v}}_1$ assumes an attractor cell state of homeostasis. Assuming that the phenotypic instability has constant magnitude $\nu$ , that is, $D(\theta) = \nu$ , one can compute the advection term ${\bf{v}}_1$ as

$\begin{eqnarray} {\bf{v}}_1(\theta) = \nu \nabla_\theta U(\theta), \end{eqnarray}$

(2.3)

where $U(\theta)$ is computed from the homeostasis distribution $u_s(\theta)$ , which can be regarded as the cell landscape that the hematopoiesis system tends to maintain. As in the Boltzmann-like distribution from equilibrium statistical mechanics ^[17], we write the homeostasis distribution in exponential form, $u_s(\theta) = e^{-U(\theta)}$ , in other words, $U(\theta) = - \ln(u_s(\theta))$ . There are multiple methods to compute the cell landscape, the so-called quasi-potential, that focus on the relative stability of multiple attractors and model cell differentiation as transitions between the attractor states ^[6,7,8,9,10,11]. Here, we compute the cell landscape empirically by assuming that the single-cell data are a representative sample of the entire hematopoiesis system, and by using density approximation methods. In particular, we use kernel density estimation ^[18] on the projected single-cell data $\theta^i \in \Gamma$ , i.e., $u_s(\theta) = \frac{1}{N} \sum_{i = 1}^N K_h(\theta - \theta^i)$ , where we choose the standard normal density function as the kernel function $K_h$ with bandwidth parameter $h > 0$ .
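As a minimal sketch of this step, the kernel density estimate $u_s$ and the landscape $U = -\ln(u_s)$ can be evaluated as below. The function names, the toy cell coordinates, the bandwidth value, and the small floor used to avoid taking the logarithm of zero are illustrative assumptions, not part of the original method description.

```python
import math

def gaussian_kde(points, h):
    """Return u_s(theta) = (1/N) * sum_i K_h(theta - theta_i), Gaussian kernel, bandwidth h."""
    n_dim = len(points[0])
    # Normalization of an isotropic n-dimensional Gaussian with standard deviation h.
    norm = (2.0 * math.pi * h * h) ** (-n_dim / 2.0)

    def u_s(theta):
        total = 0.0
        for p in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(theta, p))
            total += math.exp(-sq_dist / (2.0 * h * h))
        return norm * total / len(points)

    return u_s

def potential(u_s, theta, floor=1e-300):
    """U(theta) = -ln(u_s(theta)); the floor (an assumption) guards against log(0)."""
    return -math.log(max(u_s(theta), floor))

# Hypothetical projected cells theta^i in a two-dimensional reduced space (two clusters).
cells = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (2.0, 2.0), (2.1, 1.9)]
u_s = gaussian_kde(cells, h=0.3)
```

In this toy setting the density is high, and the landscape $U$ correspondingly low, near the clusters of cells; the gradient of $U$ could then be approximated by finite differences.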

The second candidate ${\bf{v}}_2$ models the dynamics of cell state transition. We model this term using a mechanistic approach that describes the symmetric and asymmetric cell division of stem cells to more differentiated cells. In particular, we consider the following form

$\begin{eqnarray} {\bf{v}}_2(t, \theta) = {\bf{c}}(\theta) \left[ 2 ( 1 - a(\theta) ) r(\theta) \right] s(t), \end{eqnarray}$

(2.4)

that is parameterized by the proliferation rate $r(\theta)$ and the self-renewal rate $a(\theta)$ ^[19]. ${\bf{c}}(\theta) \in {{\mathcal{R}}}^n$ represents the direction and magnitude of differentiation on the phenotype space, which can be estimated with either temporal data or pseudotime inference methods ^[4]. We note that the self-renewal rate $a(\theta)$ is the proportion of cells that remain in cell state $\theta$ , while the remaining proportion $1-a(\theta)$ differentiates further into more mature states. These proportions can be counted from symmetric and asymmetric stem cell divisions. In addition, we assume a signal parameter $s(t)$ that controls the active differentiation term, where $s(t) = 1/(1+k m(t))$ and $m(t)$ is the number of matured cells. Finally, we comment that the directed cell transition is simulated as $V = {\bf{v}}_1 + {\bf{v}}_2$ , that is, the sum of the cell transition toward homeostasis and active cell differentiation.
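To make the structure of Eq (2.4) concrete, the following sketch evaluates ${\bf{v}}_2$ at a single cell state. The function names and the numerical values of ${\bf{c}}$ , $a$ , $r$ , $k$ , and $m$ used in the comments are hypothetical and chosen only for illustration.

```python
def signal(m, k):
    """Feedback signal s = 1 / (1 + k * m), where m is the number of matured cells."""
    return 1.0 / (1.0 + k * m)

def v2(c, a, r, m, k):
    """Active differentiation velocity of Eq (2.4) at one state theta.

    c : direction vector c(theta) in the reduced space
    a : self-renewal rate a(theta), between 0 and 1
    r : proliferation rate r(theta)
    """
    scale = 2.0 * (1.0 - a) * r * signal(m, k)
    return tuple(scale * c_i for c_i in c)

# With full self-renewal (a = 1) no cells leave the state, so the velocity vanishes.
# With many matured cells (large m) the signal s damps the differentiation speed.
```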

The reaction term represents the growth rate, which consists of proliferation and apoptosis. We comment that calibrating this term uniquely requires data in addition to scRNA-seq, in particular population-level growth data. It has been reported that the cell dynamics cannot be uniquely determined without imposing the reaction term ^[20]. More recently, there have been efforts to estimate the proliferation rate directly from scRNA-seq data by cellular barcoding techniques ^[21]. In our simulations, we cluster the single-cell data into biologically well-known cell types (for instance, in the case of hematopoiesis, myeloid progenitors, lymphoid progenitors, and macrophages) and obtain the proliferation rate and self-renewal rate from the literature. We consider the logistic growth term as follows

$\begin{eqnarray} R(\theta, u) = r(\theta) (1 - d(\theta, u)) u, \end{eqnarray}$

(2.5)

where $r(\theta)$ is the proliferation rate and $d(\theta, u)$ is the apoptosis term, taken in logistic form as $d(\theta, u) = \min \{ \frac{u}{u_s(\theta)}, \, \bar{d} \}$ , where $\bar{d}$ is the maximum magnitude of the apoptosis rate.
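The behaviour of the reaction term in Eq (2.5) can be checked with a short sketch. The function names are hypothetical, and the sketch assumes $u_s(\theta) > 0$ at the evaluated state.

```python
def apoptosis(u, u_s, d_max):
    """d(theta, u) = min(u / u_s(theta), d_max); assumes u_s > 0."""
    return min(u / u_s, d_max)

def reaction(u, u_s, r, d_max):
    """Logistic growth term R(theta, u) = r * (1 - d(theta, u)) * u of Eq (2.5)."""
    return r * (1.0 - apoptosis(u, u_s, d_max)) * u

# Below the homeostatic density (u < u_s) the term is positive (net growth),
# at u = u_s it vanishes provided d_max >= 1, and above u_s it is negative,
# with the per-cell death rate capped through d_max.
```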

The second-order diffusion term represents the instability on the phenotypic landscape of the cells that should be taken into account when modeling the macroscopic cell density. We simply consider a constant term $D(\theta) = \nu$ . Assuming that the cell state trajectory is subject to Gaussian white noise, the diffusion coefficient can be estimated as the variance of the cell trajectory $\theta(t)$ on the reduced space, $\nu = \text{Var}(\theta(t))/4$ . However, since we do not have the data of cell trajectories, one can estimate the value as a limit of random walk as $\nu = (\Delta x)^2 / (4\Delta t)$ , assuming that $\Delta x$ is the step size of the phenotypic fluctuation in $\Delta t$ time ^[22]. See Appendix A for the detail of the model.
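The random-walk estimate can be illustrated numerically. The sketch below simulates unbiased lattice walkers in two dimensions and recovers $\nu$ from the mean squared displacement, using the standard relation that the mean squared displacement of two-dimensional diffusion grows as $4 \nu t$ . The two-dimensional setting, the lattice moves, and all parameter values are assumptions made for illustration.

```python
import random

def estimate_nu(n_walkers, n_steps, dx, dt, seed=0):
    """Estimate nu from the mean squared displacement of 2-D lattice random walks."""
    rng = random.Random(seed)
    moves = [(dx, 0.0), (-dx, 0.0), (0.0, dx), (0.0, -dx)]
    total_sq = 0.0
    for _walker in range(n_walkers):
        x = 0.0
        y = 0.0
        for _step in range(n_steps):
            move_x, move_y = rng.choice(moves)
            x += move_x
            y += move_y
        total_sq += x * x + y * y
    mean_sq_disp = total_sq / n_walkers
    # In two dimensions the mean squared displacement is 4 * nu * t.
    return mean_sq_disp / (4.0 * n_steps * dt)

nu_estimate = estimate_nu(n_walkers=2000, n_steps=100, dx=0.1, dt=0.01)
nu_formula = 0.1 ** 2 / (4.0 * 0.01)  # (dx)^2 / (4 dt)
```

Up to sampling error, `nu_estimate` should agree with `nu_formula`.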

2.3. PDE model of cell state-transitions solved on a graph

Although the continuum-based multi-dimensional model provides a framework to study cell states, it is not always straightforward to map back locations in the space to novel or otherwise unknown cell states. Moreover, a central feature of contemporary analysis of scRNA-seq data is clustering and inferring relationships between clusters of known cell types ^[4]. Therefore, we develop a model that can describe cell state-transition dynamics on a graph that represents relationships between known cell types identified with clusters, extended from our previous work in ^[12]. An immediate advantage of this cell state geometry is that it is convenient to employ biological insights from well-known classical discrete cell states.

The continuum of differentiation cell states is assumed to be on the graph obtained from the scRNA-seq data, for instance, using the partition-based graph abstraction (PAGA) algorithm ^[23]. We project the graph on the reduced component space, and denote the nodes as $\{ v_k \}_{k = 1}^{n_v}$ and the edges as $e_{ij}$ , connecting in the direction from the $i$ -th to the $j$ -th node. For convenience of notation, the edges are also denoted as $\{ e_{k} \}_{k = 1}^{n_e}$ with the index mapping $I: \mathcal{J} \rightarrow \{1, ..., n_e\}$ on the set of nontrivial edges $(i, j)\in\mathcal{J}$ . The end points in the direction of cell transition are $\{ a_k \}_{k = 1}^{n_e}$ and $\{ b_k \}_{k = 1}^{n_e}$ , where $\bigcup_{k = 1}^{n_e} \{ a_k, \, b_k \} = \{ v_k \}_{k = 1}^{n_v}$ .

The model follows the dynamics of the cell distribution on the graph, $u(x, t)$ , where $x \in e_k$ is a one-dimensional variable that parameterizes the differentiation continuum space location along the edges. We annotate the cell distribution on each edge $e_k$ as $u_k(x, t)$ such that $u(x, t) = \{ u_k(x, t) \}_{k = 1}^{n_e}$ , and model the cell density by an advection-reaction-diffusion equation ^[24] as

$\begin{eqnarray} \frac{\partial u_k}{\partial t} & = & - \frac{\partial }{\partial x} \left( V_k(x) u_k \right) + R_k(x) u_k + \frac{\partial }{\partial x} ( D_k(x) \frac{\partial u_k}{\partial x} ), \quad x \in e_k = \overline{a_k\, b_k}. \end{eqnarray}$

(2.6)

The three terms are similarly modeled as the multi-dimensional model in Eq (2.2), representing cell differentiation, population growth, and phenotypic instability. To summarize once more, the advection coefficient $V_k(x)$ models the cell differentiation and the transition between the nodes, that is, different cell types. We model the advection term in two parts as in the reduced component space model, $V_k(x) = \text{v}_{k, 1}(x)+ \text{v}_{k, 2}(x)$ , and compute them as

$\begin{eqnarray} \text{v}_{k, 1}(x) = \nu \partial_x U_k(x), \quad \text{v}_{k, 2}(t, x) = \left[ 2 ( 1 - a_k(x) ) r_k(x) \right] s(t). \end{eqnarray}$

(2.7)

Here, $u_{s, k}(x) = e^{-U_k(x)}$ is the homeostasis cell distribution on the $k$ -th edge, $\nu$ is the magnitude of phenotypic instability, $r_k(x)$ is the proliferation rate, $a_k(x)$ is the self-renewal rate, and $s(t)$ is the signal parameter. Cell proliferation and apoptosis can be modeled by the reaction coefficient $R_k(x)$ as

$\begin{eqnarray} R_k(x, u) = r_k(x) (1 - d_k(x, u)) u. \end{eqnarray}$

(2.8)

Finally, the diffusion term that represents phenotypic fluctuation is taken as $D_k(x) = \nu$ .
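For intuition, the edge equation can be advanced in time with a simple explicit finite-volume scheme. The sketch below treats a single edge with zero-flux conditions at both ends, first-order upwinding for advection, and constant diffusion $\nu$ ; the discretization, the function names, and the parameter values are assumptions chosen for illustration and are not the solver used for the reported simulations.

```python
def step_edge(u, V, nu, dx, dt, growth=None):
    """One explicit step of u_t = -(V u)_x + growth + (nu u_x)_x on a single edge.

    u, V   : cell-centered density and advection speed (lists of equal length)
    growth : optional callable (i, u_i) -> reaction source in cell i
    The flux is set to zero at both end points (no inflow or outflow).
    """
    m = len(u)
    flux = [0.0] * (m + 1)  # flux[j] crosses the face between cells j-1 and j
    for j in range(1, m):
        v_face = 0.5 * (V[j - 1] + V[j])
        upwind = u[j - 1] if v_face >= 0.0 else u[j]
        flux[j] = v_face * upwind - nu * (u[j] - u[j - 1]) / dx
    new_u = []
    for i in range(m):
        source = growth(i, u[i]) if growth is not None else 0.0
        new_u.append(u[i] + dt * (source - (flux[i + 1] - flux[i]) / dx))
    return new_u

def total_mass(u, dx):
    return dx * sum(u)

def center_of_mass(u, dx):
    return sum((i + 0.5) * dx * u_i for i, u_i in enumerate(u)) / sum(u)

# Toy run: cells start near the left end point and are transported to the right.
m, dx, dt, nu = 50, 0.02, 0.005, 0.01   # dt respects dt * (2 nu / dx^2 + |V| / dx) <= 1
V = [0.5] * m
u0 = [1.0 if i < 5 else 0.0 for i in range(m)]
u = list(u0)
for _ in range(200):
    u = step_edge(u, V, nu, dx, dt)
```

Without a growth term the scheme conserves the total number of cells on the edge, which is a convenient sanity check before the node conditions and the reaction term are added.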

In addition to the governing equation on the edges, the boundary conditions at the nodes are critical when describing the dynamics on the graph. The nodes of the cell fate PDE model can be classified into three types: the initial nodes that do not have inflow, $\text{N}_I \doteq \{ v_k : v_k \notin \bigcup_{j = 1}^{n_e} \{ b_j \}, \, k = 1, ..., n_v \}$ , e.g., stem cells; the final nodes without outflow, $\text{N}_F \doteq \{ v_k : v_k \notin \bigcup_{j = 1}^{n_e} \{ a_j \}, \, k = 1, ..., n_v \}$ , e.g., the most differentiated cells; and the intermediate nodes, $\text{N}_T \doteq \{\bigcup_{j = 1}^{n_e} \{ a_j \} \} \bigcap \, \{ \bigcup_{j = 1}^{n_e} \{ b_j \}\}$ . On the intermediate nodes $v_n \in \text{N}_T$ , a mixed boundary condition is imposed for continuity of the density and flow as follows,

$\begin{eqnarray} \sum\limits_{(i, n)\in\mathcal{J}} \mathcal{B}_{I[i, n]}( u, b_{I[i, n]} ) = \sum\limits_{(n, j)\in\mathcal{J}} \mathcal{B}_{I[n, j]}( u, a_{I[n, j]} ), \\ u(b_{I[i, n]}) = u(a_{I[n, j]}), \quad \text{ for all } (i, n)\in\mathcal{J}, \, (n, j)\in\mathcal{J}, \end{eqnarray}$

(2.9)

where $\mathcal{B}_{I[i, j]}(u, x) \doteq \left. V_{I[i, j]}(x) u(x) - D_{I[i, j]}(x) \frac{\partial }{\partial x} u(x) \right|_{x_{I[i, j]}}$ , $b_{I[i, n]}$ is the right end point of the edge between nodes $i$ and $n$ , and $a_{I[n, j]}$ is the left end point of the edge between nodes $n$ and $j$ . The cell outflow boundary conditions on the final nodes, $v_n \in \text{N}_F$ , are imposed as reflecting boundary conditions

$\begin{equation*} \sum\limits_{(i, n)\in\mathcal{J}} \mathcal{B}_{I[i, n]}( u, b_{I[i, n]} ) = 0, \end{equation*}$

and $u(b_{I[i, n]}) = u(b_{I[j, n]})$ for all $(i, n)$ and $(j, n)$ in $\mathcal{J}$ , and similarly on the initial nodes $v_n \in \text{N}_I$ .
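The classification of nodes into $\text{N}_I$ , $\text{N}_F$ , and $\text{N}_T$ follows directly from the directed edge list. A minimal sketch is given below; the function name and the toy hierarchy are hypothetical, and every node is assumed to be an end point of at least one edge.

```python
def classify_nodes(n_nodes, edges):
    """Split nodes into initial (no inflow), final (no outflow), and intermediate sets.

    edges : list of directed pairs (i, j), meaning cells transition from node i to node j.
    """
    starts = {i for i, _ in edges}  # nodes that are a start point a_k of some edge
    ends = {j for _, j in edges}    # nodes that are an end point b_k of some edge
    nodes = set(range(n_nodes))
    initial = nodes - ends
    final = nodes - starts
    intermediate = starts & ends
    return initial, final, intermediate

# Toy hierarchy (illustrative only): 0 -> 1, then 1 -> 2 -> 4 and 1 -> 3 -> 5.
toy_edges = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5)]
initial, final, intermediate = classify_nodes(6, toy_edges)
```

Here node 0 plays the role of the stem cell state, nodes 4 and 5 are the most differentiated states, and the flux and continuity conditions of Eq (2.9) would be imposed at nodes 1, 2, and 3.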

2.4. Quantification of cell state-transition dynamics

Let us define some useful quantities to interpret model predictions in the multi-dimensional cell state-space and on a graph. The total number of cells from the cell distribution on either a graph or a continuous manifold can be computed as

$\begin{equation} \rho(t) \doteq \sum\limits_k \int_{e_k} u_k(t, x) dx, \qquad \rho(t) \doteq \int_\Gamma u(t, \theta) d \theta, \end{equation}$

(2.10)

respectively. In addition, we compute the number of cells of a specific cell type by assigning a weight, $w_k$ , that corresponds to cells in the $k$ -th cluster as

$\begin{equation} \rho_k(t) \doteq \int_\Gamma u(t, \theta) w_k(\theta) d \theta, \end{equation}$

(2.11)

with $\sum_k w_k(\theta) = 1$ . We assign weights based on the relative cell density of each cluster estimated with kernel density estimation. In the graph model, we assign the cell states along each edge to the cell type of the closest node, so that the number of cells of the $k$ -th node cell type can be computed as $\rho_k(t) \doteq \sum_{m = I(k, j)} \int_{a_m}^{\frac{a_m+b_m}{2}} u_m(t, x) dx + \sum_{m = I(i, k)} \int_{\frac{a_m+b_m}{2}}^{b_m} u_m(t, x) dx.$
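The half-edge assignment in the graph model can be sketched as follows, using the trapezoidal rule on a uniform grid. The function names, the requirement of an odd number of grid points per edge (so that the midpoint is a grid point), and the toy densities are assumptions made for illustration.

```python
def trapezoid(values, dx):
    """Trapezoidal rule for samples on a uniform grid with spacing dx."""
    return dx * (sum(values) - 0.5 * (values[0] + values[-1]))

def node_counts(edges, densities, dx):
    """Assign each half of every edge to its nearest node and integrate the density.

    edges     : list of directed pairs (i, j)
    densities : densities[k] holds samples of u_k from a_k to b_k on a uniform grid
    """
    counts = {}
    for (i, j), u in zip(edges, densities):
        if len(u) % 2 == 0:
            raise ValueError("use an odd number of grid points so the midpoint is sampled")
        mid = len(u) // 2
        counts[i] = counts.get(i, 0.0) + trapezoid(u[: mid + 1], dx)  # first half -> node i
        counts[j] = counts.get(j, 0.0) + trapezoid(u[mid:], dx)       # second half -> node j
    return counts

# Two edges of unit length with uniform density 1: node 1 collects half of each edge.
toy_edges = [(0, 1), (1, 2)]
toy_densities = [[1.0] * 11, [1.0] * 11]
counts = node_counts(toy_edges, toy_densities, dx=0.1)
```

The per-node counts sum to the total number of cells on the graph, consistent with the first expression in Eq (2.10).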

Although we can understand the continuum of cell states by mapping cells in intermediate states back to known discrete cell types, we also wish to interpret the continuous cell states by their location, without reference to the canonical cell identities. For this purpose, we characterize cell states by identifying genes that are strongly correlated with a location in the space or with movement in a direction toward a cell state. This extends the practice of finding genes that are correlated with specific reduced space components in order to analyze the reduced cell state space ^[25]. First, to characterize the cell state $\theta^*$ in the reduced space, we use a function $f_{\theta^*}(\theta)$ centered at $\theta^*$ as $f_{\theta^*}(\theta) = \frac{1}{\sqrt{2\pi}\sigma } \exp\left[- {\|\theta -\theta^*\|^2}/{2\sigma^2} \right]$ , and compute the correlation between the function values and the $j$ -th gene expression levels as

$\begin{equation} r_{f, j} \doteq \text{corr}({\mathit{\boldsymbol{f}}}, {\mathit{\boldsymbol{g}}}_j), \end{equation}$

(2.12)

where ${\mathit{\boldsymbol{f}}}$ represents the vector of function evaluation at each single-cell data point $\theta^i$ , that is, ${\mathit{\boldsymbol{f}}} = \{f_{\theta^*}(\theta^i)\}_{i = 1}^N$ , and ${\mathit{\boldsymbol{g}}}_j = \{g_j^i\}_{i = 1}^N$ . An alternate quantity to examine is the genes that are related to a certain direction in the reduced component space. For instance, the correlation between the $j$ -th gene and the $k$ -th reduced component $\theta_k = \{\theta_k^i\}_{i = 1}^N$ and to a certain vector $\mathit{\boldsymbol{v}} = \{v_k\}_{k = 1}^n$ can be computed as

$\begin{equation} r_{k, j} \doteq \text{corr}(\theta_k, {\mathit{\boldsymbol{g}}}_j), \qquad r_{{\mathit{\boldsymbol{v}}}, j} \doteq \sum\limits_k v_k \, r_{k, j}, \end{equation}$

(2.13)

respectively. Regarding the quantities in Eq (2.13) as global, we can also compute the local correlation on a subdomain of the reduced space $\Omega_d$ by collecting the cell indices that lie within the subdomain, $\Gamma_d = \{ i \, |\, \theta^i \in \Omega_d \}$ , that is, $r_{k, j}|_{\Gamma_d} \doteq \text{corr}((\theta_k, {\mathit{\boldsymbol{g}}}_j)|_{i\in \Gamma_d})$ and $r_{{\mathit{\boldsymbol{v}}}, j}|_{\Gamma_d} \doteq \sum_k v_k \, r_{k, j}|_{\Gamma_d}$ . Although these metrics may provide candidate genes that are related to the cell state being analyzed, we emphasize that they need to be verified experimentally by observing the change in cell state after perturbing the genes. See Section 4.2 for the limitations of this approach.
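The correlation measures in Eqs (2.12) and (2.13) can be sketched with a plain Pearson correlation. The function names and the toy data are hypothetical; the sketch assumes that neither input vector is constant, so the correlation is well defined.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences (neither may be constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def state_correlation(thetas, gene, theta_star, sigma):
    """r_{f,j} of Eq (2.12): correlation between f_{theta*}(theta^i) and gene expression g_j^i."""
    f = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(t, theta_star)) / (2.0 * sigma ** 2))
        / (math.sqrt(2.0 * math.pi) * sigma)
        for t in thetas
    ]
    return pearson(f, gene)

def direction_correlation(thetas, gene, v):
    """r_{v,j} of Eq (2.13): sum over components k of v_k * corr(theta_k, g_j)."""
    return sum(
        v_k * pearson([t[k] for t in thetas], gene) for k, v_k in enumerate(v)
    )

# Toy cells on a small grid; the hypothetical gene increases along the first component only.
thetas = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
gene = [t[0] for t in thetas]
```

A local version restricted to a subdomain would apply the same functions to the subset of cells whose coordinates fall inside that subdomain.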

3. Simulation of continuous cell state models on multi-dimensional space versus graph

In this section, we apply the framework developed in the previous section to the mouse hematopoiesis cell data from Nestorowa et al. (2016) ^[13] and Paul et al. (2015) ^[14]. We obtain the graph and multi-dimensional space models of the hematopoiesis cell states and focus on comparing the strengths and weaknesses of the two models.

The hematopoiesis single-cell data from ^[13,14] projected on the first two diffusion component space are shown in Figure 2A, where distinct cell types classified in the original papers are illustrated with different colors, including lymphoid-primed multipotent progenitors (Lymph); common myeloid progenitors (CMP); megakaryocyte-erythroid progenitors (MEP); granulocyte-macrophage progenitors (GMP); erythrocytes (Ery); neutrophils (Neu); monocytes (Mo); megakaryocytes (Mk); and basophils (Baso). We truncate the diffusion components at two since the reduced two-dimensional space describes the dynamics of interest, that is, from strong to weak stemness. The first two diffusion components $\theta_1$ and $\theta_2$ represent cell maturation in both data sets. In the Nestorowa data, the first diffusion component separates the stem cells from the myeloid lineages, particularly MEP cells, and the second diffusion component describes GMP cells and the lymphoid lineages. In the Paul data, the first and second reduced components represent the MEP and GMP lineages, respectively. We remark that the most stem-like cells in the Paul data are CMPs, which are more mature than those in the Nestorowa data, which include the long-term and short-term HSCs. In addition to the single-cell data, Figure 2B presents the abstracted graphs obtained using PAGA ^[23]. Further refinement of the graph eventually recovers the full single-cell data, where each single cell is counted as a distinct cell state, and depicts the hierarchy of cell states (see Figure A5).

Figure 2. From discrete to continuum cell states. A) Single-cell data from Nestorowa et al. (2016) and Paul et al. (2015) projected on the first two diffusion component space. B) Graph obtained by PAGA algorithm projected on the diffusion component space. Distinct cell types classified in the original paper are either illustrated with different colors (A) or annotated on the graph nodes (B). C, D) Multi-dimensional continuum cell state distribution on the diffusion component space computed by kernel density estimation. They are used as homeostasis distribution.


The homeostatic cell distribution $u_s$ on the reduced dimensional space is computed by kernel density approximation ^[18,26]. The computed cell landscapes viewed from different angles are shown in Figure 2C, D. The cell distribution on the graph can be similarly obtained after reallocating the cells to the nodes, that is, the center of each cluster. The cell distribution on the continuum space provides an intuitive way to compare the relative concentration of different cell lineages, including the intermediate cell states. We observe a high concentration of MEP and Ery cells localized at the far right (large $\theta_1$ ) in both data sets. The Nestorowa data contain more diverse cells, including the common lymphoid progenitors that are visible on the left (small $\theta_1$ and intermediate $\theta_2$ ), while the Paul data have cell states evenly distributed among the most stem-like cells (CMP) and the two different lineages.

Let us summarize the properties of the graph and multi-dimensional space models before we present simulation results. The strength of the graph model is that the cell lineages between the known cell states can be identified more easily than in the multi-dimensional space model. The cell concentrations moving along different edges can be clearly distinguished as cell lineages leading to different cell states. Accordingly, counting the number of cells in each discrete cell state is more straightforward, for instance, by integrating the cell distribution along each edge up to its midpoint. Although the multi-dimensional space model is ambiguous in classifying cells into discrete cell types, the number of cells in each discrete cell type can be computed by assigning weights in the integral as in Eq (2.11). Moreover, we emphasize that the advantage of clearly defined cell states in the graph model is also its limitation, since it restricts the model to studying only the known cell types and lineages. The advantage of the multi-dimensional space model is its potential for exploring novel cell states that deviate from known cell types. While the graph model cannot explore cell states that are not already included in the graph structure, the multi-dimensional space model can directly study abnormal trajectories and the emergence of cells at any location in the space. We will show later in our simulations that hypotheses about genetic alterations can be studied directly in the multi-dimensional space model, without projecting them onto the graph structure. Moreover, the multi-dimensional space model is more sensitive to genetic variations than the graph model, although when the variation is large and a considerably distinct cell state arises, the graph model can append another cluster node. See Table 1 for a summary.

Table 1. Comparison of the cell state model on graph versus multi-dimensional space in $n$ dimensions. The computational cost is estimated by denoting $M$ as the number of discretized grid points in one dimension. We comment that the computational cost of a PDE solver can be up to a third power of the degree of freedom.

	Graph model	Multi-dimensional model
Cell state interpretation	Comprehensible as intermediate cell states	Difficult to interpret
Cell state exploration	Limited to graph structure	Freedom to explore novel and unconventional cell states
Computational cost (degree of freedom)	O( $M$ )	O( $M^n$ )
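To make the scaling in the last row concrete, the short sketch below counts unknowns for both geometries. The function names, the grid size, and the number of edges are illustrative assumptions.

```python
def graph_dof(M, num_edges):
    """Unknowns on a graph: M grid points per edge, so O(M) for a fixed graph."""
    return num_edges * M

def space_dof(M, n):
    """Unknowns on an n-dimensional tensor grid: M points per dimension, so O(M^n)."""
    return M ** n

M = 100  # assumed grid points per dimension
dof_graph = graph_dof(M, num_edges=8)  # hypothetical graph with 8 edges
dof_2d = space_dof(M, n=2)
dof_3d = space_dof(M, n=3)
```

With these assumed values, moving from two to three reduced dimensions multiplies the number of unknowns by another factor of $M$ , while adding an edge to the graph adds only $M$ unknowns.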


In the following sections, we consider two main application problems as examples of our modeling approach: normal hematopoiesis, and abnormal hematopoietic differentiation resulting in myeloid leukemia.

3.1. Calibrating the mathematical models to normal hematopoiesis

We demonstrate that normal hematopoiesis can be visualized by both models, on the graph and on the space of two-dimensional diffusion components (see Figures A1 and A2 for the advection and reaction terms used in the multi-dimensional space model). In both models, a cluster of stem cells differentiates into the entire range of cell states on the graph and on the reduced space, using Nestorowa data ^[13] and Paul data ^[14]. The initial condition is imposed as approximately 10% of the cell capacity in normal condition, mostly composed of stem cells. In both the graph and multi-dimensional space models, a nontrivial amount of the most mature cell states, particularly Ery and Neu/Mo cells, arises around pseudotime $t = 30$ , and the full landscape is recovered after approximately $t = 50$ . In particular, the observation that the mature cells quickly proliferate to fill in the space agrees in both simulations from Nestorowa and Paul data, while the effect is more significant in the Paul data due to the shorter distances of the mature cell states from the initially administered cells.

Figure 3. Dynamics of cell distribution during normal hematopoiesis. A) Evolution of cell state densities $u(t, x)$ on the graph with 8 to 9 nodes, and $u(t, \theta)$ on the diffusion component space during normal hematopoiesis using single-cell data from Nestorowa et al. (2016) and Paul et al. (2015). The dynamics are shown in pseudotime $t$ . B) The pseudotime dynamics of the number of cells in each cell cluster, where the number is normalized so that the total cell number in the equilibrium state is one. The initial stem cells differentiate to progenitors and more mature cell states and recover the entire cell landscape. C) Numbers of cells in each type/cluster using the multi-dimensional space model and the graph model are successfully calibrated to the observed data so that at $t = 100$ each model predicts the correct cell ratios to within $\pm 5\%$ . D) The continuous cell states of hematopoiesis are also depicted in the FACS data set collected from normal mouse bone marrow. Bone marrow cells were gated for myeloid progenitor cell markers (lineage-negative, Sca1-negative, cKit-positive). Conventionally, the expression levels of CD16/32 and CD34 are used to distinguish CMP, GMP, and MEP cell types within the myeloid progenitor compartment; however, the continuity of marker expression agrees with our graph abstraction and multi-dimensional cell state geometries.


The advantage of the graph model is apparent: we can observe distinct cell states as a mass of cells shifting from a node to distinct edges toward different cell states. For instance, the cells differentiating from the MPP cluster to either the Neu/Mo lineage or the Ery lineage can be clearly separated in the graph models, while they can be ambiguous in the two-dimensional distribution. Still, we can compute the number of cells in each cell type in both models. We observe that the number of cells reaches the maximal capacity at later times around $t = 100$ , with the intended ratio of cell numbers in each discrete cell type approximating the given data ^[13]. The recovery is more rapid for larger values of $\nu$ and a larger number of initial stem cells $\rho(0)$ (see Figure A6). We remark that the continuous cell states of hematopoiesis are also depicted in conventional flow cytometry, which is typically used to identify distinct cell populations based on expression of cell surface markers. We performed Fluorescence Activated Cell Sorting (FACS) analysis of bone marrow cells isolated from normal C57Bl/6 mice (age 6-8 weeks). Distinct myeloid progenitor types (CMP, GMP and MEP) are typically differentiated by the expression of CD16/32 and CD34 markers within the myeloid lineage progenitor cell compartment. Figure 3D shows representative FACS data with respect to CD16/32 and CD34 expression that is used to identify the CMP, GMP, and MEP cell types within the normal myeloid progenitor compartment. Although the FACS data is conventionally clustered and gated into three cell types, a continuity of CD16/32 and CD34 expression can be observed that agrees with our graph abstraction and multi-dimensional cell state geometries.

3.2. Using the model framework to simulate acute myeloid leukemia (AML) pathogenesis and progression

In this section, we once more compare the graph and multi-dimensional space models with an application to abnormal differentiation under leukemia pathogenesis and progression. We first consider an AML model in a context similar to the previous section, involving known progenitor cells, which exemplifies the advantage of graph models. We then show how the aberrant differentiation and phenotypic plasticity of leukemia pathogenesis motivate the spatial model.

AML results from aberrant differentiation and proliferation of transformed leukemia-initiating cells and abnormal progenitor cells. We model AML pathogenesis based on known behavior of a genetic Cbfb-MYH11 (CM) knock-in mouse model that recapitulates somatic acquisition of a chromosomal rearrangement, inv(16)(p13q22)^[28,29], commonly found in approximately 12 percent of AML cases. The inv(16) rearrangement results in expression of a leukemogenic fusion protein CBF $\beta$ -SMMHC, which impairs differentiation of multiple hematopoietic lineages at various stages ^[30,31,32]. Most notable in such leukemia pathogenesis and progression is the increase in abnormal myeloid progenitors, which have an MEP-like immunophenotype and a CMP-like differentiation potential ^[31]. Experimental studies ^[27,33] show that such an MEP population attains a predominant increase in the pre-megakaryocyte/erythroid (Pre-Meg/Ery) population (ranging from 5 to 12 fold), accompanied by impaired erythroid lineage differentiation as shown in Figure 4E. This refined phenotypic Pre-Meg/Ery population consists partly of the CMP and MEP populations which are identified using conventional markers ^[13,34].

Figure 4. Predicting abnormal cell differentiation during leukemia progression. A, B, C) Cell distributions during leukemia pathogenesis and progression are shown on the graph model (A, B) and the multi-dimensional space model (C). Plot (B) shows an alternative way to plot the graph based model solution by stacking the cell distribution on each edge horizontally. The numbers on the left and right are the node numbers shown in Figure 2B. The plots show the effect of over-proliferation and differentiation block in the myeloid lineages. In particular, we observe a rapid expansion of cell states near MEP and Ery in both Nestorowa and Paul data, after the initiation of AML at $t = 10$ . D) The number of leukemic Ery cells shows an increase within ten days in both graph and multi-dimensional space models. E) Experimental result reproduced from ^[27] that shows a rapid expansion of the pre-Meg/E (MEP) population in leukemic mouse compared to normal mouse (Control).


In our model, abnormal leukemic progenitors are regarded as intermediate cell states along the edges connecting CMP (or MPP) and MEP (and Ery) in the graph model, and the corresponding locations in the multi-dimensional space model. We assume a 10-fold increase on average in those populations by lowering $d(\theta)$ and $d_k(x)$ in Eqs (2.5) and (2.8), which control the local cell capacity. In addition to over-proliferation, another aspect of leukemia pathogenesis of interest is the impaired differentiation of the erythroid lineage, which can be modeled by blocking the cell differentiation $V(\theta)$ in Eq (2.6) and $V_k(x)$ in Eq (2.2) toward Ery.
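The link between a lowered death-type coefficient and a larger local capacity can be illustrated with a toy model. Eqs (2.5) and (2.8) are not reproduced in this section, so the quadratic death term below is an assumed stand-in, chosen only because its equilibrium $r/d$ makes the 10-fold effect easy to see; the function name and parameter values are hypothetical.

```python
def simulate_logistic(u0, r, d, dt=0.01, steps=5000):
    """Forward Euler for the toy ODE du/dt = r*u - d*u**2 (assumed form).

    Its nonzero equilibrium is u* = r / d, so d plays the role of a
    capacity-controlling coefficient in this sketch.
    """
    u = u0
    for _ in range(steps):
        u += dt * (r * u - d * u * u)
    return u

normal = simulate_logistic(0.1, r=1.0, d=1.0)    # capacity r/d = 1
leukemic = simulate_logistic(0.1, r=1.0, d=0.1)  # d lowered 10-fold, capacity 10
fold_increase = leukemic / normal
```

In this toy setting, lowering $d$ by a factor of ten raises the equilibrium population by the same factor, which mirrors the 10-fold increase assumed in the text.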

In the corresponding simulations, we switch the model to leukemia progression at $t = 10$ . The cell distribution changes from the normal cell landscape at $t = 10$ to an increased population of Ery (MEP) and nearby cells at $t = 20$ in both graph and continuum models. We observe a 10-fold increase in the Ery (MEP) population, which includes the abnormal myeloid progenitors, in both the graph and multi-dimensional space models across the data sets, as shown in Figure 4D, E. The rapid emergence of AML occurs within a two-week period, corresponding to the expansion of the leukemic cell phenotype. We observe a rise in the MPP or CMP cluster in the results from Nestorowa data, and similarly in Paul data. Leukemic cells make up 50–60% of the total population.

In the leukemia pathogenesis simulation in this section, we focus on studying the leukemic cells as a variation of cell states classified using conventional markers. In this case, the graph model can interpret and include the dynamics of such cells as well as the multi-dimensional space model can. While the simulation outcomes of the graph and multi-dimensional space models are similar, the graph model is computationally more efficient due to its smaller number of unknowns compared to the two-dimensional space model. However, to study abnormal cell states that may appear far away from the known or existing landscape, we will show in the following section that the multi-dimensional space model has more freedom to include those new cell states and disrupted trajectories. We will study the impact of gene perturbation in the graph and multi-dimensional space models, particularly focusing on alterations of genes known to be involved in leukemia pathogenesis.

3.3. In silico experiments of gene expression perturbation

In this section we investigate the sensitivity to altering specific genes in a prescribed manner and the impact of this perturbation in the graph and multi-dimensional space models. We keep our focus on leukemia pathogenesis and progression and consider alterations of 38 genes that are reported to be related to leukemia stem cells ^[35,36], although we emphasize that these genes serve simply as examples, and are not intended to model the precise biological process of AML pathogenesis. The $j$ -th gene expression level of the $i$ -th cell, $g_j^i$ , is modified as

$\begin{equation} {\widetilde{g}}_j^i = 2^{\gamma_j} \, g_j^i, \qquad 0\leq \log_2({\widetilde{g}}_j^i+1) \leq 16, \tag{3.1} \end{equation}$

where $\gamma_j$ is the log $_2$ -fold change compared to normal cells. The full list of altered genes and magnitudes $\gamma_j$ from ^[35,36] is shown in A1 (Details of the model equation and parameters), and the log $_2$ -fold change is in the range of $\gamma_j \in [-3.5, \, 2.7]$ . In addition, we consider the extreme case of gene alteration as the maximum level $\log_2({\widetilde{g}}_j^i+1) = 16$ for up-regulated genes and $\log_2({\widetilde{g}}_j^i+1) = 0$ for down-regulated genes. Figure 5 shows examples of the gene expression levels in log scale that we modify, including the up-regulated genes, GPR56, GATA2, and MZB1, and the down-regulated genes, LGALS3, LY86, and ANXA5. The given single-cell data in normal condition is plotted together with the hypothetically altered levels of gene expression in regular leukemia pathogenesis and at extreme levels of alteration. Although the case of extreme alteration is unrealistic, we consider it to illustrate an example where the graph abstraction and dimension reduction algorithms clearly distinguish the leukemic cells.
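The perturbation in Eq (3.1) can be sketched in a few lines. The sketch assumes that the bound $0\leq \log_2({\widetilde{g}}_j^i+1) \leq 16$ is enforced by clipping, which is an interpretation rather than something stated in the text; the function names are hypothetical.

```python
import math

LOG_MAX = 16.0  # upper bound on log2(g + 1) in Eq (3.1)

def perturb_expression(g, gamma):
    """Apply g_tilde = 2**gamma * g, then clip so that 0 <= log2(g_tilde + 1) <= 16.

    Clipping is an assumed way of enforcing the stated bound.
    """
    g_tilde = (2.0 ** gamma) * g
    upper = 2.0 ** LOG_MAX - 1.0
    return min(max(g_tilde, 0.0), upper)

def extreme_expression(up_regulated):
    """Extreme case: log2(g_tilde + 1) = 16 if up-regulated, 0 if down-regulated."""
    return 2.0 ** LOG_MAX - 1.0 if up_regulated else 0.0

# Made-up expression values, not data from the paper.
up = perturb_expression(3.0, 2.0)      # 2**2 * 3
down = perturb_expression(8.0, -3.0)   # 2**-3 * 8
capped = perturb_expression(2.0 ** 20, 2.7)
```

A value of $\gamma_j = 2$ therefore quadruples the raw expression level, while the extreme case ignores the original level entirely.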

Figure 5. Perturbing genes associated with leukemia stem cells. Examples of expression levels of genes in log $_2$ scale that are associated with leukemia stem cells and pathogenesis, including up-regulated GPR56, GATA2, and MZB1, and down-regulated LGALS3, LY86, and ANXA5. We show this subset of genes simply to illustrate the process. The normal single-cell data $\log_2(g_j^i+1)$ (blue circle) and modified gene expression $\log_2({\widetilde{g}}_j^i+1)$ (red square) computed as in Eq (3.1) are shown together with the case of extreme levels of either $16$ or $0$ (purple diamond).


3.4. Effects of gene perturbation on graph abstraction and multi-dimensional reduced component space

We first study the sensitivity of the reduced component space using diffusion mapping ^[15], by projecting the altered leukemic single-cell data ${\widetilde{g}}^i$ on the normal reduced space $(\theta_1, \theta_2)$ . The left-most column shows the projected leukemic single-cell data $\mathcal{P}({\widetilde{g}}^i)$ in the normal reduced space, where the leukemic cells are located toward the bottom left compared to the normal cells in Nestorowa data, and upwards in Paul data. The effect of gene modification is shown more clearly in the presented vector field $\mathcal{P}({\widetilde{g}}^i) - \mathcal{P}(g^i)$ .
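The displacement vector $\mathcal{P}({\widetilde{g}}^i) - \mathcal{P}(g^i)$ can be illustrated with a toy projection. The sketch below uses a fixed linear map on log-transformed expression as a stand-in for $\mathcal{P}$ ; the actual projection onto diffusion components is not linear in general, and the weights and expression values here are made up.

```python
import math

def log_transform(g):
    """Map raw expression levels to log2(g + 1)."""
    return [math.log2(x + 1.0) for x in g]

def project(weights, g):
    """Hypothetical linear stand-in for the projection P onto (theta_1, theta_2)."""
    x = log_transform(g)
    return tuple(sum(w * xi for w, xi in zip(row, x)) for row in weights)

def displacement(weights, g, g_tilde):
    """Directional vector P(g_tilde) - P(g) for one cell."""
    p0 = project(weights, g)
    p1 = project(weights, g_tilde)
    return (p1[0] - p0[0], p1[1] - p0[1])

# Made-up weights and a three-gene cell; not values from the paper.
W = [[0.1, 0.0, -0.05], [0.0, 0.2, 0.0]]
g = [3.0, 7.0, 15.0]        # log2(g + 1) = [2, 3, 4]
g_tilde = [15.0, 7.0, 3.0]  # log2(g_tilde + 1) = [4, 3, 2]
v = displacement(W, g, g_tilde)
```

Computing such a vector for every cell yields a vector field on the reduced space of the kind described above.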

Figure 6. Effects of perturbing genes on cell state-space. A, B) Projection of perturbed leukemic single-cell data on the normal diffusion component space, and the directional vectors $\mathcal{P}({\widetilde{g}}^i) - \mathcal{P}(g^i)$ representing the altered cell state by leukemic perturbation. The top figures are computed with Nestorowa data ^[13] (A) and the bottom figures with Paul data ^[14] (B). C, D) Graph computed from perturbed leukemic single-cell data and their cluster information. The annotation shows that the graph abstraction algorithm does not distinguish the leukemic cells perturbed at regular magnitude from the normal cells, so that the perturbed information is lost (C). However, when single-cell data is modified to the extreme values of gene expression level, the algorithm distinguishes the leukemic cells, although the data is unrealistic (D).


Similarly, we study the impact of leukemia-associated gene perturbation in graph abstraction using PAGA ^[23]. Figure 6C, D shows the clustered cell types and the corresponding graph using perturbed leukemic scRNA-seq data. The presented results are computed with Nestorowa data. The clustered cell types and leukemic cells are annotated to depict the cluster properties. In Figure 6C, which is the case of leukemia progression with single-cell data altered in regular magnitude, we observe that there is no cluster that separates the leukemic cells. Thus, the information of leukemic gene alteration is lost within the clustering algorithm, and the model on such abstracted graph is not capable of studying the perturbed cells. On the other hand, when the gene levels are modified to their extreme levels, the perturbed leukemic single-cell data are clustered into separate nodes as shown in Figure 6D. In this case, although the graph model is able to study the perturbed cells as separate nodes, we comment that this level of perturbation is an unrealistic scenario due to the extreme levels of gene expression.

A strength of the multi-dimensional cell state model is its capability of interpreting the perturbation of gene expression levels or new incoming cell data regardless of its relation to the primary data (Figure 6). As shown in our results, the leukemic alteration is successfully projected in the reduced space, while the abstracted graph lost the information. Although the projected directions in the reduced space can be once more projected on the graph, it does not make sense to do so when the direction is orthogonal to the edges. The multi-dimensional space model has its advantage especially in this case, where the projected direction of cell states can be directly implemented.

3.5. Simulating AML pathogenesis by perturbing known leukemia-associated genes

In this section, we incorporate the perturbed leukemia-associated gene data in the AML simulation using the multi-dimensional space model. In particular, we are interested in studying the impact of leukemia-associated gene alteration on the cell distribution during AML progression. We compute the abnormal differentiation of leukemic cells by projecting the altered single-cell data of MEP and Ery cells to the normal diffusion component space as $\mathcal{P}({\widetilde{g}}^i)$ . The aberrant differentiation vector ${\bf{v}}_{aml}^1 = \mathcal{P}({\widetilde{g}}^i) - \mathcal{P}(g^i)$ shows the shift of cells toward a region that is not occupied by any cell data. Therefore, in addition to modifying the advection term according to the altered gene data, we assume the emergence of a new abnormal cell state. In particular, we take the cell state location at $\theta^* = (0.610, \, 0.215)$ in Nestorowa data, and at $\theta^* = (0.6, \, 1.0)$ in Paul data, and use Gaussian functions centered at $\theta^*$ to obtain ${\bf{v}}_{aml}^2$ . The corresponding vector fields are also shown in Figure 7.

Figure 7. Modeling AML pathogenesis and progression by perturbing cell states directly in the cell state-space. The direction of abnormal cell differentiation ${\bf{v}}_{aml}^1$ (black) is computed from the projection of altered leukemic MEP and Ery cells ( $\times$ ) to the normal diffusion component space as $\mathcal{P}({\widetilde{g}}) - \mathcal{P}(g)$ . Alternatively, we assume a source of abnormal cell state ( $\odot$ ) at $\theta^* = (0.610, \, 0.215)$ in Nestorowa data (A) and at $\theta^* = (0.6, \, 1)$ in Paul data (B) to model ${\bf{v}}_{aml}^2$ (blue).


For AML progression, the advection term is modeled with the prescribed vector field as $V = {\bf{v}}_1 + c_{aml} {\bf{v}}_{aml}^1$ or $V = {\bf{v}}_1 + c_{aml} {\bf{v}}_{aml}^2$ , where $c_{aml}$ parameterizes the perturbation magnitude. We further perturb the model by increasing the proliferation of the new leukemic cells at $\theta^*$ by appending $r_{aml}\, f_{\theta^*}(\theta)$ to $R$ , where $r_{aml}$ parameterizes the over-proliferation. The cell distributions $u(t, \theta)$ computed with ${\bf{v}}_{aml}^1$ and ${\bf{v}}_{aml}^2$ for $c_{aml} = 1, 2, 10$ show abnormal cell states emerging during leukemic progression after $t = 10$ , and a larger magnitude of $c_{aml}$ results in a more disrupted cell landscape. In the cell landscape with ${\bf{v}}_{aml}^1$ , we observe increased MEP cells and abnormal progenitors arising toward the bottom left, especially for large values of $c_{aml}$ . In the model with ${\bf{v}}_{aml}^2$ , a new cell state further down in the cell space emerges and dominates the population. With the model $V = {\bf{v}}_1 + 2 {\bf{v}}_{aml}^2$ , new abnormal cells appear around $t = 10$ and dominate the population at $t = 30$ . The total number of cells shows the effects of the parameters $c_{aml}$ and $r_{aml}$ more clearly. The total number of cells increases to more than 10 times the initial size after $t = 30$ when $c_{aml} = 10$ and $r_{aml} = 0$ . When the over-proliferation term is appended as $r_{aml} = 1$ , the total number of cells increases more rapidly, for example, up to 100 times the initial size, and the number of cells in most of the myeloid lineage increases.

Our simulation results agree with the experimental data, where unconventional cell states emerge during leukemia progression and eventually overtake the entire progenitor population, as observed by FACS analysis of bone marrow progenitor cells isolated from CM knock-in preleukemic and leukemic mice (Figure 8C). The predominant population observed in leukemic bone marrow does not fall within the typical gates in conventional cell clustering based on data from normal control mice (Figure 4E). Although this novel population would have been classified as MEP, pre-Meg/E, Pre-GM, and GMP cells in the graph model (Figure 4A, B), we emphasize that it is a distinct population and that the multi-dimensional model is capable of incorporating novel cell states. We note, however, that in the graph model a new cell type can be included by adding a new node to the original graph.
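The way the AML terms enter the model, $V = {\bf{v}}_1 + c_{aml} {\bf{v}}_{aml}$ and $R + r_{aml}\, f_{\theta^*}(\theta)$ , can be sketched as simple function composition. The helper names and the constant placeholder fields below are assumptions for illustration; the actual fields come from the calibrated model and the projected data.

```python
def aml_advection(v1, v_aml, c_aml):
    """Return V(theta) = v1(theta) + c_aml * v_aml(theta) for 2D vector fields."""
    def V(theta):
        a, b = v1(theta), v_aml(theta)
        return (a[0] + c_aml * b[0], a[1] + c_aml * b[1])
    return V

def aml_reaction(R, f_star, r_aml):
    """Return theta -> R(theta) + r_aml * f_star(theta) (over-proliferation term)."""
    def R_aml(theta):
        return R(theta) + r_aml * f_star(theta)
    return R_aml

# Placeholder fields with made-up constant values.
v1 = lambda theta: (1.0, 0.0)
v_aml = lambda theta: (0.0, -1.0)
V = aml_advection(v1, v_aml, c_aml=2.0)
R_aml = aml_reaction(lambda theta: 0.5, lambda theta: 3.0, r_aml=1.0)

velocity = V((0.5, 0.5))
growth = R_aml((0.5, 0.5))
```

Setting `c_aml` or `r_aml` to zero recovers the unperturbed advection or reaction term, which makes the two parameters easy to vary independently.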

Figure 8. Cell state-transition dynamics during leukemia pathogenesis and progression. A) The evolution of the cell state distribution $u(t, \theta)$ with $c_{aml} = 2$ . B) The total number of cells in AML condition is computed using model $V = {\bf{v}}_1 + c_{aml} {\bf{v}}_{aml}^2$ . More rapid progression of AML in terms of cell number is observed for larger values of $c_{aml}$ and $r_{aml}$ . C) FACS analysis for CD34 and CD16/32 expression in the myeloid progenitor compartment of control (left), preleukemic (center) and leukemic (right) CM knock-in mouse shows emergence of unconventional cell states during leukemic progression that eventually dominate the entire progenitor population. Our multi-dimensional cell state model is capable of incorporating those novel cell states.


3.6. Interpretation of new cell states in the multi-dimensional model

The remaining question is how to interpret the new cell states in the multi-dimensional space model that may arise far away from the cell states identified by conventional markers. Hence, we propose some measures in Eqs (2.12), (2.13) to guide the interpretation. As an example, we compute the rescaled correlation quantities $r_{f, j}$ and $r_{{\mathit{\boldsymbol{v}}}, j}$ with Nestorowa data. The first row of the corresponding figure shows results of the correlation $r_{{\mathit{\boldsymbol{v}}}, j}$ to the average leukemic directional vector ${\mathit{\boldsymbol{v}}} = (-0.068, \, -0.206)$ . The gene expression levels of genes that have large values of $r_{{\mathit{\boldsymbol{v}}}, j}$ are depicted in the figure, namely, PLAC8 and CAR2. We remark that those genes have strong local correlation $r_{{\mathit{\boldsymbol{v}}}, j}|_{\Gamma_d}$ on $\Gamma_d = \{ 0.3 \leq \theta_1 \leq 0.9, \, 0.3 \leq \theta_2 \leq 0.5 \}$ as well. The correlation is computed for all 3991 genes, where the red bars highlight the leukemia related genes we modify (A1. Details of the model equation and parameters), and we observe large magnitudes in some of the genes. The second row shows the correlation $r_{f, j}$ to a specific cell state in the reduced space, where we choose $\theta^* = (0.5, 0.35)$ , which is approximately an intermediate location between MEP and CMP cells, and $f_{\theta^*}(\theta) = \frac{1}{2\pi \cdot 0.05} \exp\left[- {\|\theta -\theta^*\|^2}/{0.1} \right]$ . APOE and CLEC12a genes show the largest magnitude of $r_{f, j}$ , and similarly, we can identify the leukemia related genes that show strong correlation to cell state $\theta^*$ . Although a more careful and rigorous approach should be developed to characterize the newly arising cell states, $r_{f, j}$ and $r_{{\mathit{\boldsymbol{v}}}, j}$ defined in Eqs (2.12), (2.13) provide an efficient method for initial screening of possibly related genes.
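A screening step of this kind can be sketched as follows. Eqs (2.12) and (2.13) are not reproduced in this section, so the plain Pearson correlation below is an assumed stand-in for the rescaled quantity $r_{f, j}$ ; only the Gaussian $f_{\theta^*}$ follows the formula given in the text. The cell coordinates and gene values are synthetic.

```python
import math

def f_star(theta, center=(0.5, 0.35), var=0.05):
    """Gaussian weight from the text: exp(-|theta - center|^2 / 0.1) / (2*pi*0.05)."""
    d2 = (theta[0] - center[0]) ** 2 + (theta[1] - center[1]) ** 2
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)

def pearson(x, y):
    """Plain Pearson correlation (stand-in for the rescaled correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Synthetic cell locations in the reduced space; not data from the paper.
thetas = [(0.5, 0.35), (0.6, 0.4), (0.9, 0.3), (0.1, 0.8)]
weights = [f_star(t) for t in thetas]

# Two synthetic genes: one rising and one falling with proximity to theta*.
gene_up = [2.0 * w + 1.0 for w in weights]
gene_down = [-w for w in weights]
r_up = pearson(gene_up, weights)
r_down = pearson(gene_down, weights)
```

Ranking all genes by the magnitude of such a correlation gives a first list of candidates associated with the chosen cell state, which can then be examined more carefully.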

Figure 9. Interpretation and mapping of model-predicted novel cell states. In order to identify novel cell states predicted by the mathematical model, we show gene expression levels $\log_2(g_j^i+1)$ that are strongly correlated to the direction of leukemic alteration ${\mathit{\boldsymbol{v}}} = (-0.068, \, -0.206)$ (A), and to the reduced space location $\theta^* = (0.5, 0.35)$ (C). The rescaled correlation $r_{f, j}$ (B) and $r_{{\mathit{\boldsymbol{v}}}, j}$ (D) computed for all the genes in Nestorowa data are shown, and the leukemia related genes are marked in red bars.


4. Discussion

We have shown how to construct mathematical models of cell state-transitions using scRNA-seq data. We compare two cell state geometries: solving equations on graphs and solving equations on a multi-dimensional cell state-space. Each cell state geometry has its strengths and limitations. Selecting a model for a given application or dataset will depend on the type of biological data and the nature of the scientific question.

When the modeling application and quantity of interest include well-known cell lineages and relations between the conventional cell states, the graph model is more appropriate due to its ability to distinguish distinct cell lineages more clearly than the multi-dimensional space model. Dynamics of cell numbers in specific cell states, alteration of proliferation and apoptosis in a particular cell state, differentiation block, and emergence of intermediate cell states can be quantified and studied in a straightforward manner. However, to explore cell states beyond known cell lineages, the continuum space model is more advantageous since it includes all intermediate and pathological cell states, rather than confining the model to presumed cell lineages. Moreover, the continuum model can incorporate relatively small genetic and epigenetic alterations that the graph abstraction may not recognize, and study abnormal trajectories that yield unconventional cell states.

We selected and perturbed genes to simulate AML based on genes known to be associated with leukemia pathogenesis. We do not intend for this to be an accurate model of the biological process, rather, as an illustration of how one may select sets of genes and perturb them in a prescribed fashion in order to study the effect on cell state-transition dynamics. This approach assumes that AML pathogenesis originates from changes in gene expression in specific cell subsets, which is limited by our identification of these genes based on published literature. We acknowledge this is a limitation of the modeling approach, although we also note that our model predictions are consistent with known features of AML progression.

4.1. Comparison to other approaches

Although at the time of this work there are relatively few mathematical models published which utilize single-cell sequencing data, there are a few notable exceptions. Of particular note are works which use modeling and simulation to generate synthetic in silico gene expression datasets ^[37]. These important approaches to mechanism-based mathematical modeling may also be used to study and predict the effects of perturbations on cell state distributions. They may also be used as computational controls to benchmark analysis tools and potentially to benchmark and compare mathematical models, although using a model to benchmark other models can lead to consistent but incorrect circular reasoning and caution is warranted. Another example is Ferrall-Fairbanks and Papalexi et al, who use mathematical analysis to generate novel quantifications of cell heterogeneity in cancer or immune cell subsets respectively ^[38,39]. These methods may be used to map and interpret novel cell states predicted by mathematical models or similarly as a method to interpret model-predicted changes in cell heterogeneity following a perturbation.

Schiebinger et al compute and predict differentiation trajectories in cell development using optimal transport (OT) ^[40,41]. This approach considers the optimal transport of cells as a mass flowing along differentiation trajectories, and is conceptually the most similar to our approach. As presented, Schiebinger et al do not use the OT framework to examine perturbations of cell states or genes along the differentiation trajectory, although this is possible with an OT model. Setty et al present a method to compute cell fate probabilities ^[42], which may also be achieved by inferring cell state-transition dynamics with lineage trees ^[43]. Fischer et al have demonstrated a method for inferring population dynamics from single-cell sequencing data ^[44], where the model equation is identical to our graph based model developed in ^[12]. Jiang et al develop a dynamic inference approach to derive a Fokker-Planck type PDE on a graph considering an energy landscape and optimal transport ^[45]. Sharma et al use longitudinal sequencing to study drug-induced infidelity in the stem cell hierarchy ^[46], and Karaayvaz et al show how to use single-cell sequencing to examine drug resistance in breast cancer ^[47]. These approaches and analysis methods may be used to inform and potentially calibrate mathematical models of cell population dynamics or response to treatment-induced perturbations.

Recently, vector fields derived from RNA velocity ^[48] have been used to infer potential energy or fitness landscapes for cell state-transitions. These approaches may be used to inform the computational domain for the mathematical models that we present here; however, an important limitation of the RNA-velocity approach is extrapolation of the vector field outside of the data range. This underscores the need for hypothesis-based and model-guided approaches to inform the shape of these fields. This limitation also applies to the rapidly growing field of deep learning approaches ^[49] to analyze single-cell sequencing data, namely, whether the learning algorithm can effectively make predictions for datasets which are not sufficiently similar to those upon which it has been trained. We believe that the future likely involves a merger of mathematical modeling with machine learning, in which mathematical models are used to inform learning approaches and impute sparse data, as has been recently shown by Gaw and Rockne et al ^[50,51]. Among the recent works that align with this direction, the PRESCIENT algorithm aims to learn the underlying differentiation landscape from time-series scRNA-seq data ^[21]. Moreover, the dynamo framework improves RNA velocity using kinetic models to reconstruct continuous vector fields that predict cell fates ^[52].

4.2. Opportunities and limitations of modeling with single-cell sequencing data

There are pros and cons, opportunities and limitations to mathematical modeling with single-cell sequencing data. The advantages and potential opportunities include: a wealth of available data, richness and complexity of each data set, a focus on the cell level, opportunity to study dynamics in hierarchically structured state-based relationships between cells, and an ability to perturb individual cells and/or genes within cells to predict dynamics of state-change at cellular level. The most significant strength of mathematical modeling is the ability to use and generate hypotheses that may not be directly evident from the data; for example, extrapolation of RNA velocity fields beyond the dataset boundaries or to interpret and predict novel cell states which may not otherwise be clearly identified with known canonical cell state markers. Another advantage of our approach is the use of pseudo-time analysis of data collected at a single timepoint to calibrate the models, however, the models can also be calibrated directly to time-sequential single-cell datasets, which we expect to become more commonly available as single-cell sequencing continues to be used as a tool to study cell dynamics.

The disadvantages and limitations include: the potential for misleading or incorrect inference due to poor data quality including drop-outs, small non-representative samples of large heterogeneous populations, batch effects, no physical or micro-environmental context, no direct or physical interactions between cells, and the possibility of model predictions to be sensitive to methods of dimension reduction, graph abstraction, state-space construction, and potentially sequencing platform. Sensitivity of the modeling to experimental and computational methods may be directly studied and mitigated as we have shown in this work, however this remains potentially a significant source of uncertainty and variability in the modeling calibration and predictions. Studying the sensitivity of our modeling framework regarding different noise scenarios and applying noise reduction methods is our future work ^[53].

In terms of computational cost, the graph model is more efficient, since its cost is a multiple of the cost of a one-dimensional problem, while the cost of the continuum model increases exponentially with the dimension of the reduced space. In our simulation, the computational cost to simulate up to time $t = 50$ with step size $\Delta t = 10^{-3}$ and $O(100^2)$ degrees of freedom in one-dimension is around 25 seconds in the graph model with 8 nodes, while it takes around 230 seconds in the continuum model with two dimensions. In short, the continuum model runs approximately 10 times longer than the graph model with 8 nodes in our example. Therefore, unless the numerical method is carefully implemented, the multi-dimensional cell state geometry is practical only when the reduced component space can be truncated at two or three dimensions. We emphasize that the graph model is more advantageous than the continuum model in terms of computational cost, especially when a higher-dimensional reduced space is necessary.

4.3. Future work and applications

Future applications of this approach include exploring hypotheses at the resolution of single-cell genomics and studying altered and novel cell states with genetic and epigenetic alterations in various biological systems and pathogenesis. We look forward to comparing the model predictions to sampling/sequencing of perturbed biological systems, for instance, to examine scRNA-seq data from leukemic progenitor cells. Moreover, we anticipate incorporating the effects of external perturbations, such as therapy, in future studies.

Our model can be further enhanced by improving the description of cell landscape dynamics so that cell transition pathways in the reduced component space, for instance minimum action paths ^[6] and bifurcations ^[7,54], are estimated accurately. The model can also be improved by obtaining parameter functions or mappings of biological quantities directly from single-cell sequencing data, for example, by inferring the proliferation rate function more precisely. In addition, developing methodologies to obtain a reduced component space that captures the desired characteristics of cell states ^[55] will help us explore our approach in other biological settings where cell states are less clearly characterized. Moreover, we propose to develop quantities in the phenotype space, such as an index of critical state transitions ^[54,56], that could be used to predict forthcoming major alterations in development and disease. We also expect to be able to infer the potential landscape directly from the RNA velocity vector field ^[48,52].

5. Conclusions

In summary, despite the explosion of computational tools to analyze single-cell sequencing data, relatively few mathematical models have been developed that utilize this data. Here we begin to explore the possibilities, and the limitations, of dynamical modeling with single-cell RNA-seq data. We hope this work paves the way for the development of mathematical models to guide the interpretation of these complicated datasets as they begin to be collected after biological perturbations (e.g., cancer, treatment, altered developmental processes), sequentially over time, or sampled spatially within biological tissues.

Acknowledgments

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under award numbers U01CA250046 (Y.H.K., R.C.R.), R01CA178387 (Y.H.K.), R01CA205247 (Y.H.K.), P30CA033572 and included work performed in the Analytical Cytometry and Biostatistics and Mathematical Oncology Cores. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of interest

The authors declare that there is no conflict of interest.

Code availability

The simulation codes are available from https://github.com/heyrim/Mathematical-modeling-with-single-cell-sequencing-data

Appendix

A1. Details of the model equation and parameters

The model terms require interpolation of single-cell data to the continuum cell state space. We use clustering to identify cell types and their cell type properties to assign those to each single-cell. The following are the values we take for cluster properties.

By denoting $\bar{r}_i$ as the assigned proliferation rate of the $i$ -th cluster, we compute the intermediate level of proliferation in the graph model by linear interpolation as

$\begin{equation} r_k(x) = r_{I(i, j)}(x) = \bar{r}_i + (\bar{r}_j - \bar{r}_i) x, \quad x \in [0, \, 1], \end{equation}$

(A.1)

Table A1. Summary of the required parameters. The following table summarizes the parameters in our model terms, $V$ , $R$ , $D$ , $V_k$ , $R_k$ , and $D_k$ , and their biological meaning with the range. The ranges are taken from the literature ^[19,57,58,59,60,61] that experimentally measures the cell cycle and self-renewal rate of the well-known hematopoietic cell types.

Parameters	Biological meaning	Range	Source
$r(\theta), \, r_k(x)$	proliferation rate	$[0.00215, \, 1]$	^{[57,58,59,60]}
$a(\theta), \, a_k(x)$	self-renewal rate	$[0.1, 0.8]$	^[19,61]
${\boldsymbol c}(\theta)$	differentiation vector	$[0, 1]$	estimated
$\nu$	phenotypic fluctuation	$[0, 0.0027]$	estimated
$\bar{d}$	apoptosis rate	0.6925	^[61]


Table A2. Parameter values of proliferation and self-renewal rate. The following values are taken for each single-cell, $\bar{r}^i$ and $\bar{a}^i$ , based on their clustered cell types ^[19,57,58,59,60,61], and then used for computing $r(\theta)$ , $r_k(x)$ , $a(\theta)$ , and $a_k(x)$ .

	hematopoietic stem cells (HSC) $\leftrightarrow$ progenitor cells (HPC)
cell type	HSC	MPP, LMPP, CMP	MEP, GMP	Neu/Mo, Ery
proliferation	0.01125	0.05658	0.1612	0.6931
	(8.8 weeks)	(12.25 days)	(4.3 days)	(1 day)
self-renewal	0.77	0.7689	0.7359	0.66


The linear interpolation in Eq (A.1) assumes that the overall proliferation of intermediate cell states changes gradually. In the multi-dimensional model, we compute the interpolation based on local means as

$\begin{equation} r(\theta) = \frac{1}{|I_\theta|} \sum\limits_{i\in I_\theta} \bar{r}^i, \quad I_\theta = \{ i \, |\, \|\theta^i -\theta \| < \bar{\theta} \} , \end{equation}$

(A.2)

where we take $\bar{\theta} = 0.04$ . The self-renewal rate functions $a_k(x)$ and $a(\theta)$ are computed similarly. See Figure A1 for $r(\theta)$ and $a(\theta)$ computed for Nestorowa data.
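The two interpolation rules, Eq (A.1) on a graph edge and Eq (A.2) in the reduced component space, can be sketched in a few lines. This is a minimal illustration and not the authors' released code: the function names and array layout are hypothetical. The helper `rate_from_doubling_time` rests on an assumption, namely that the proliferation rates in Table A2 equal $\ln 2$ divided by the listed times in days; the tabulated values are consistent with that relation, but the text does not state it.

```python
import numpy as np

def rate_from_doubling_time(days):
    # Assumption: rate = ln(2) / doubling time (in days).
    # The Table A2 entries are consistent with this relation.
    return np.log(2.0) / days

def edge_rate(r_i, r_j, x):
    # Eq (A.1): linear interpolation along an edge, x in [0, 1],
    # between the rates assigned to clusters i and j.
    return r_i + (r_j - r_i) * x

def local_mean_rate(theta, cell_coords, cell_rates, radius=0.04):
    # Eq (A.2): average the per-cell rates over all cells whose
    # reduced-space coordinates lie within `radius` of theta.
    # cell_coords has shape (N, d); cell_rates has shape (N,).
    dist = np.linalg.norm(cell_coords - np.asarray(theta), axis=1)
    mask = dist < radius
    if not mask.any():
        # No cell in the neighborhood: the local mean is undefined.
        return np.nan
    return cell_rates[mask].mean()
```

Returning `nan` for an empty neighborhood is a design choice of this sketch; how grid points far from every cell are handled in the actual simulation is not specified in the text.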

To compute the multi-dimensional function on the continuum space from the single-cell data, we employ the kernel density method ^[18,26], which is a non-parametric way to estimate the density function from a finite data sample. Using the single-cell samples in the reduced component space, $\{ \theta^i \}_{i = 1}^N$ , the method approximates the density function as

$u_s(\theta) = \frac{1}{N h} \sum\limits_{i = 1}^N K\left( \frac{(\theta - \theta^i)}{h} \right),$

where $K$ is the kernel smoothing function, which we take to be a Gaussian function, and $h$ is the bandwidth. The optimal bandwidth to estimate normal densities can be computed by $({4 \hat{\sigma}^5}/{3 N})^{1/5}$ , where $\hat{\sigma}$ is the standard deviation and $N$ is the sample size. The optimal bandwidth for our data is computed as $0.0383$ to $0.0456$ ; however, in our simulation we choose a slightly smaller value, $h = 0.03$ , to reveal more features of multiple modes. (, ) shows the results of computing $u_s(\theta)$ using Nestorowa and Paul data, and Figure A2(a) shows the corresponding ${\bf{v}}_1(\theta)$ .
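The kernel density estimate and the bandwidth rule above can be illustrated with a short sketch. It is written for one-dimensional samples so that it matches the formula exactly as stated, with the $1/(N h)$ normalization; the function names are hypothetical, and this is not the authors' implementation, which works in the multi-dimensional reduced component space.

```python
import numpy as np

def normal_reference_bandwidth(samples):
    # Bandwidth (4 * sigma^5 / (3 * N))^(1/5) quoted in the text,
    # using the sample standard deviation.
    n = samples.size
    sigma = samples.std(ddof=1)
    return (4.0 * sigma**5 / (3.0 * n)) ** 0.2

def gaussian_kde_1d(grid, samples, h):
    # u_s(theta) = 1 / (N h) * sum_i K((theta - theta^i) / h),
    # with K the standard normal density.
    z = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * h)
```

Choosing `h` below the value returned by `normal_reference_bandwidth`, as the text does with $h = 0.03$, trades smoothness for the ability to resolve several modes.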

In the diffusion term, we explore the parameter $\nu$ so that the phenotypic instability does not dominate the cell maturation. We compute the parameter in the range of $\nu \leq (L/T_d)^2 / 4$ , where the distance in the diffusion space is $L = 1$ and the time for an HSC to differentiate to the progenitors is $T_d = 5\sim30$ days, that is, $\nu \leq 0.0027 \sim 0.01$ , and we consider $\nu = 0.001$ . Quantifying the local phenotypic instability in the reduced component space and justifying this term are left for future work.

To compute the reduced component space using dimension reduction approaches, we employ diffusion mapping. See ^[15,62] for the details of the algorithm. We take the cosine distance, $k(x^i, \, x^j) = 1 - corr(x^i, \, x^j)$ for the Nestorowa data and the Gaussian distance $k(x^i, \, x^j) = \exp \left(-\frac{\|x^i-x^j\|^2 }{2 \sigma^2} \right)$ for Paul data with $\sigma = 50$ . From $L(i, \, j) = k(x^i, \, x^j)$ , the diffusion mapping uses the parameter $\alpha$ to tune the influence of the density of the data points as

$L^{(\alpha)} = D^{-\alpha} L D^{-\alpha}, \quad M = (D^{(\alpha)})^{-1} L^{(\alpha)},$

where $D(i, \, i) = \sum_j L(i, \, j)$ and $D^{(\alpha)}(i, \, i) = \sum_j L^{(\alpha)}(i, \, j)$ , and we choose $\alpha = 0.5$ . From the eigen-decomposition $M \phi = \lambda \phi$ with ordered eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \lambda_2 \geq \cdots$ , the corresponding right eigenvectors $\phi_1$ , $\phi_2$ , $\cdots$ are the diffusion components. We truncate the reduced space at the second diffusion component, where the eigenvalues are $\lambda_1 = 0.1039$ , $\lambda_2 = 0.0326$ , $\lambda_3 = 0.0167$ , $\lambda_4 = 0.0135$ for Nestorowa data, and $\lambda_1 = 2.4653 \times 10^{-3}$ , $\lambda_2 = 5.8338 \times 10^{-4}$ , $\lambda_3 = 9.7792 \times 10^{-5}$ , $\lambda_4 = 7.0364 \times 10^{-5}$ for Paul data. For a comparison of the diffusion mapping to two-dimensional reduced component spaces computed with other dimension reduction algorithms, see Figure A3.
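The normalization and eigen-decomposition steps can be sketched as follows for the Gaussian kernel. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name and return layout are hypothetical, $D$ is taken to be the diagonal matrix of row sums of $L$ (the usual diffusion-map convention), and the eigenvectors of $M$ are obtained through the symmetric matrix $(D^{(\alpha)})^{-1/2} L^{(\alpha)} (D^{(\alpha)})^{-1/2}$, which has the same eigenvalues as $M$.

```python
import numpy as np

def diffusion_map(X, sigma, alpha=0.5, n_components=2):
    # X has shape (N, n_genes). Build the Gaussian kernel matrix L.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    L = np.exp(-d2 / (2.0 * sigma**2))

    # Density normalization: L_alpha = D^{-alpha} L D^{-alpha}.
    D = L.sum(axis=1)
    L_alpha = L / np.outer(D**alpha, D**alpha)

    # M = D_alpha^{-1} L_alpha is similar to the symmetric matrix S,
    # so the eigenvalues agree and eigh can be used.
    D_alpha = L_alpha.sum(axis=1)
    S = L_alpha / np.sqrt(np.outer(D_alpha, D_alpha))
    eigvals, U = np.linalg.eigh(S)

    # Sort in decreasing order so that lambda_0 = 1 comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]

    # Right eigenvectors of M: phi = D_alpha^{-1/2} u.
    phi = U / np.sqrt(D_alpha)[:, None]

    # Return lambda_0..lambda_n and phi_0..phi_n; column 0 is the
    # trivial constant component, columns 1.. are diffusion components.
    keep = n_components + 1
    return eigvals[:keep], phi[:, :keep]
```

For the correlation-based kernel used with the Nestorowa data, only the construction of `L` would change; the normalization and decomposition steps stay the same.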

For the pseudotime inference, we use the algorithm developed in ^[63]. The diffusion distance between two cells is computed as

$D_t^2(x^i, \, x^j) = \sum\limits_{k = 1}^n \lambda_k^{2t} (\phi_k^i - \phi_k^j)^2,$

and the pseudotime distances are computed based on this distance. We choose three extreme points, one in each of the three clusters of stem cell, Ery, and Neu cell types, that are the furthest apart in the diffusion component space, and infer the lineage between the extreme cells. After computing the pseudotime of each single cell, we compute the local average direction to the neighboring cells that are at a later pseudotime, similarly to Eq (A.2). The computed results are shown in Figure A2(b), with the interpolated vector at the grid points shown in Figure A2(c).
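The diffusion distance defined above is a weighted squared difference of diffusion components and can be computed directly once the eigenvalues and eigenvectors are available. The sketch below only illustrates that formula; the function name and array layout are hypothetical, and the pseudotime algorithm of ^[63] itself is not reproduced here.

```python
import numpy as np

def diffusion_distance_sq(eigvals, phi, i, j, t=1):
    # D_t^2(x^i, x^j) = sum_k lambda_k^(2t) * (phi_k^i - phi_k^j)^2.
    # eigvals: shape (n,), the eigenvalues lambda_1..lambda_n
    #          (the trivial lambda_0 = 1 is excluded, as in the sum).
    # phi:     shape (N, n), row i holds phi_1^i..phi_n^i for cell i.
    diff = phi[i] - phi[j]
    return float(np.sum(eigvals ** (2 * t) * diff**2))
```

Because the eigenvalues are smaller than one and decreasing, larger `t` suppresses the higher components, so the distance is dominated by the leading diffusion components.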

Table A3. Gene alterations in leukemic stem cells. From the genes that are reported in ^[35,36], we find all the genes that are in the Nestorowa and Paul data. The following table lists the genes and their altered magnitudes. See ^[35] Extended Data Table 1 for the 17 genes and ^[36] Supplemental Table S4 for approximately 80 genes.

Up-regulated gene	log2-fold change	Down-regulated gene	log2-fold change
CD34	2.1500	LGALS3	$-3.4901$
LAPTM4B	1.8000	CYBB	$-2.9546$
MMRN1	1.3600	CD36	$-2.7661$
SOCS2	1.2400	ANXA5	$-2.6349$
CDK6	1.2300	LY86	$-2.5564$
CPXM1	1.2000	IRF8	$-2.4982$
EMP1	1.0100	SAMHD1	$-2.4580$
GPR56	2.7004	GRN	$-2.3659$
GATA2	1.8875	RNASE6	$-2.3585$
LPIN1	1.6323	FCER1G	$-2.2934$
MZB1	1.4854	S100A9	$-2.2447$
ZSCAN18	1.3219	TLR4	$-2.1078$
GUCY1A3	1.2630	FCGRT	$-2.1016$
SPNS2	1.2016	S100A8	$-2.0116$
PTK7	1.2016	CLEC12A	$-1.8730$
ABCC1	1.1375	MNDA	$-1.8417$
SYTL1	1.0704	IL13RA1	$-1.7515$
MAGED1	1.0704	SGK1	$-1.7418$
ARHGAP25	1.0704
SLA2	1.0000


Figure A1. Cell proliferation rate $r(\theta)$ and self-renewal rate $a(\theta)$ computed from the single-cell data. The black dots are the rates of the data.

Figure A2. Pseudotime dynamics. The homeostasis cell differentiation vector ${\bf{v}}_1$ (a), the direction of active cell differentiation obtained from diffusion pseudotime analysis (b), and that interpolated at the grid points, ${\bf{v}}_2$ (c), are presented. We remark that ${\bf{v}}_2$ corresponds to the cell differentiation along the edges in the graph model.

Figure A3. Comparison of dimension reduction methods. Dimension reduction algorithms that focus on preserving local structure are not appropriate to infer global trajectory. Compare the following figures to diffusion component space. They are computed with Nestorowa (top) and Paul (bottom) data and projected on the reduced component space of ForceAtlas2 (FA) ^[64] and t-stochastic neighbor embedding (tSNE).


Figure A4. Abnormal cell state transitions during leukemia pathogenesis and progression. The distribution of cell states $u(t, \theta)$ shows abnormal cell states emerging during leukemic progression after $t = 10$ , modeled in the advection term as $V = c_{aml} {\bf{v}}_{aml}^1$ (top) and $V = {\bf{v}}_1 + c_{aml} {\bf{v}}_{aml}^2$ (bottom), with various levels of $c_{aml} = 1, 2, 10$ . A larger magnitude of $c_{aml}$ results in a more disrupted cell landscape.

Figure A5. From discrete to continuum cell states. The hierarchy of graphs using partition-based graph abstraction ^[23] and single cell data from Nestorowa et al. (2016) (a) and Paul et al. (2015) (b). The single-cell data can be regarded as the most refined graph. The simulation of normal hematopoiesis on a graph with 19 nodes (c), which is comparable to Figure 3A, illustrates the hierarchy of cell distribution toward the entire reduced space.

Figure A6. Model sensitivity to parameters. Using single cell data from Nestorowa et al. (2016), the number of cells and its dynamics in each cluster up to $t = 50$ for different values of $\nu$ and initial stem cell numbers $\rho(0)$ are shown in (a--c). The dynamics of cells in each cluster for $\nu = 10^{-3}$ with $\rho(0) = 0.1$ (a), $\nu = 10^{-2}$ with $\rho(0) = 0.1$ (b), and $\nu = 10^{-2}$ with a larger initial number of cells, $\rho(0) = 0.5$ (c), show that the recovery is more rapid for larger values of $\nu$ and a larger number of initial stem cells $\rho(0)$ . The cell distributions $u(\theta, t)$ at the intermediate time $t = 14$ for advection terms ${\bf{v}}_1$ and ${\bf{v}}_1+{\bf{v}}_2$ , and $\nu = 10^{-4}$ or $\nu = 10^{-3}$ , are shown in (d). The distributions are distinct: larger values of $\nu$ increase the overall rate of differentiation, while adding ${\bf{v}}_2$ prioritizes recovery of the most matured cells.

References

[1]	Baek Y, Kim H (2018) ModAugNet: A new forecasting framework for stock market index value with an overfitting prevention LSTM module and a prediction LSTM module. Expert Syst Appl 113: 457–480. doi: 10.1016/j.eswa.2018.07.019
[2]	Boehmer E, Fong K, Wu J (2012) International evidence on algorithmic trading. In AFA 2013 San Diego Meetings Paper.
[3]	Chen C, Zhang P, Liu Y, et al. (2020) Financial quantitative investment using convolutional neural network and deep learning technology. Neurocomputing 390: 384–390. doi: 10.1016/j.neucom.2019.09.092
[4]	Chen J, Chen W, Huang C, et al. (2016) Financial Time-Series Data Analysis Using Deep Convolutional Neural Networks. In 2016 7th International Conference on Cloud Computing and Big Data (CCBD), 87–92.
[5]	Chen K, Zhou Y, Dai F (2015) A LSTM-based method for stock returns prediction: A case study of China stock market. In 2015 IEEE International Conference on Big Data (Big Data), 2823–2824.
[6]	Chen S, He H (2018) Stock Prediction Using Convolutional Neural Network. In IOP Conference Series: Materials Science and Engineering, 435: 012026.
[7]	Chen Y, Chen W, Huang S (2018) Developing Arbitrage Strategy in High-frequency Pairs Trading with Filterbank CNN Algorithm. In 2018 IEEE International Conference on Agents (ICA), 113–116.
[8]	Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals Syst 2: 303–314. doi: 10.1007/BF02551274
[9]	Day M, Lee C (2016) Deep learning for financial sentiment analysis on finance news providers. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 1127–1134.
[10]	Deng Y, Bao F, Kong Y, et al. (2017) Deep Direct Reinforcement Learning for Financial Signal Representation and Trading. IEEE Trans Neural Networks Learn Syst 28: 653–664. doi: 10.1109/TNNLS.2016.2522401
[11]	Dixon M, Klabjan D, Bang J (2017) Classification-based financial markets prediction using deep neural networks. Algorithmic Financ 6: 67–77. doi: 10.3233/AF-170176
[12]	Doering J, Fairbank M, Markose S (2017) Convolutional neural networks applied to high-frequency market microstructure forecasting. In 2017 9th Computer Science and Electronic Engineering (CEEC), 31–36.
[13]	Fang Y, Chen J, Xue Z (2019) Research on quantitative investment strategies based on deep learning. Algorithms 12: 35. doi: 10.3390/a12020035
[14]	Gudelek M, Boluk S, Ozbayoglu A (2017) A deep learning based stock trading model with 2-D CNN trend detection. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), 1–8.
[15]	Gunduz H, Yaslan Y, Cataltepe Z (2017) Intraday prediction of borsa Istanbul using convolutional neural networks and feature correlations. Knowl Based Syst 137: 138–148. doi: 10.1016/j.knosys.2017.09.023
[16]	Hendershott T, Jones C, Menkveld A (2011) Does algorithmic trading improve liquidity? J Financ 66: 1–33. doi: 10.1111/j.1540-6261.2010.01624.x
[17]	Hendershott T, Riordan R (2009) Algorithmic trading and information. University of California, Berkeley.
[18]	Hinton G, Salakhutdinov R (2006) Reducing the Dimensionality of Data with Neural Networks. Science 313: 504–507. doi: 10.1126/science.1127647
[19]	Hochreiter S, Schmidhuber J (1997) Long Short-Term Memory. Neural Comput 9: 1735–1780. doi: 10.1162/neco.1997.9.8.1735
[20]	Lee H, Grosse R, Ranganath R, et al. (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, 609–616.
[21]	Hoseinzade E, Haratizadeh S (2019) CNNpred: CNN-based stock market prediction using a diverse set of variables. Expert Syst Appl 129: 273–285. doi: 10.1016/j.eswa.2019.03.029
[22]	Hossain M, Karim R, Thulasiram R, et al. (2018) Hybrid Deep Learning Model for Stock Price Prediction. In 2018 IEEE Symposium Series on Computational Intelligence (SSCI), 1837–1844.
[23]	Jeong G, Kim H (2019) Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Syst Appl 117: 125–138. doi: 10.1016/j.eswa.2018.09.036
[24]	Ji S, Kim J, Im H (2019) A Comparative Study of Bitcoin Price Prediction Using Deep Learning. Mathematics 7: 898. doi: 10.3390/math7100898
[25]	Kalman B, Kwasny S (1992) Why tanh: choosing a sigmoidal function. In [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, 4: 578–581.
[26]	Kim S, Kang M (2019) Financial series prediction using Attention LSTM. arXiv preprint arXiv: 1902.10877.
[27]	Krauss C, Do X, Huck N (2017) Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S & P 500. Eur J Oper Res 259: 689–702. doi: 10.1016/j.ejor.2016.10.031
[28]	LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521: 436-44. doi: 10.1038/nature14539
[29]	Li Y, Zheng W, Zheng Z (2019) Deep Robust Reinforcement Learning for Practical Algorithmic Trading. IEEE Access 7: 108014–108022. doi: 10.1109/ACCESS.2019.2932789
[30]	Lin B, Chu W, Wang C (2018) Application of Stock Analysis Using Deep Learning. In 2018 7th International Congress on Advanced Applied Informatics (ⅡAI-AAI), 612–617.
[31]	Liu S, Zhang C, Ma J (2017) CNN-LSTM neural network model for quantitative strategy analysis in stock markets. In international conference on neural information processing, Springer, 198–206.
[32]	Lu W, Li J, Li Y, et al. (2020) A CNN-LSTM-Based Model to Forecast Stock Prices. Complexity 2020: 6622927.
[33]	Luo S, Lin X, Zheng Z (2019) A novel CNN-DDPG based AI-trader: Performance and roles in business operations. Transp Res Part E Logist Transp Rev 131: 68–79. doi: 10.1016/j.tre.2019.09.013
[34]	Lv D, Yuan S, Li M, et al. (2019) An Empirical Study of Machine Learning Algorithms for Stock Daily Trading Strategy. Math Probl Eng 2019: 7816154.
[35]	Mikolov T, Karafiát M, Burget L, et al. (2010) Recurrent neural network based language model. In Interspeech, Makuhari, 1045–1048.
[36]	Mudassir M, Bennbaia S, Unal D, et al. (2020) Time-series forecasting of Bitcoin prices using high-dimensional features: a machine learning approach. Neural Comput Appl 2020: 1–15.
[37]	Nair V, Hinton G (2010) Rectified linear units improve restricted boltzmann machines. In Icml.
[38]	Nelson D, Pereira A, de Oliveira R (2017) Stock market's price movement prediction with LSTM neural networks. In 2017 International Joint Conference on Neural Networks (IJCNN), 1419–1426.
[39]	Nuti G, Mirghaemi M, Treleaven P, et al. (2011) Algorithmic Trading. Computer 44: 61–69. doi: 10.1109/MC.2011.31
[40]	Ozbayoglu A, Gudelek M, Sezer O (2020) Deep learning for financial applications: A survey. Appl Soft Comput 93: 106384. doi: 10.1016/j.asoc.2020.106384
[41]	Selvin S, Vinayakumar R, Gopalakrishnan E, et al. (2017) Stock price prediction using LSTM, RNN and CNN-sliding window model. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 1643–1647.
[42]	Serrano W (2018) Fintech Model: The Random Neural Network with Genetic Algorithm. Proced Comput Sci 126: 537–546. doi: 10.1016/j.procs.2018.07.288
[43]	Sezer O, Ozbayoglu M, Dogdu E (2017) A Deep Neural-Network Based Stock Trading System Based on Evolutionary Optimized Technical Analysis Parameters. Proced Comput Sci 114: 473–480. doi: 10.1016/j.procs.2017.09.031
[44]	Sezer O, Ozbayoglu A (2018) Algorithmic financial trading with deep convolutional neural networks: Time series to image conversion approach. Appl Soft Comput 70: 525–538. doi: 10.1016/j.asoc.2018.04.024
[45]	Sezer O, Ozbayoglu A (2019) Financial trading model with stock bar chart image time series with deep convolutional neural networks. arXiv preprint arXiv: 1903.04610.
[46]	Shah D, Campbell W, Zulkernine F (2018) A Comparative Study of LSTM and DNN for Stock Market Forecasting. In 2018 IEEE International Conference on Big Data (Big Data), 4148–4155.
[47]	Singh R, Srivastava S (2017) Stock prediction using deep learning. Multimed Tools Appl 76: 18569–18584. doi: 10.1007/s11042-016-4159-7
[48]	Sirignano J, Cont R (2019) Universal features of price formation in financial markets: perspectives from deep learning. Quant Financ 19: 1449–1459. doi: 10.1080/14697688.2019.1622295
[49]	Sohangir S, Wang D, Pomeranets A, et al. (2018) Big Data: Deep Learning for financial sentiment analysis. J Big Data 5: 1–25. doi: 10.1186/s40537-017-0111-6
[50]	Sutskever I, Hinton G, Taylor G (2009) The recurrent temporal restricted boltzmann machine. In Advances in neural information processing systems, 1601–1608.
[51]	Théate T, Ernst D (2021) An application of deep reinforcement learning to algorithmic trading. Expert Syst Appl 173: 114632. doi: 10.1016/j.eswa.2021.114632
[52]	Treleaven P, Galas M, Lalchand V (2013) Algorithmic trading review. Commun ACM 56: 76–85. doi: 10.1145/2500117
[53]	Troiano L, Villa E, Loia V (2018) Replicating a Trading Strategy by Means of LSTM for Financial Industry Applications. IEEE Trans Ind Inf 14: 3226–3234. doi: 10.1109/TII.2018.2811377
[54]	Wang Z, Lu W, Zhang K, et al. (2021) MCTG: Multi-frequency continuous-share trading algorithm with GARCH based on deep reinforcement learning. arXiv preprint arXiv: 2105.03625.
[55]	Xie M, Li H, Zhao Y (2020) Blockchain financial investment based on deep learning network algorithm. J Comput Appl Math 372: 112723. doi: 10.1016/j.cam.2020.112723
[56]	Zhao Z, Rao R, Tu S, et al. (2017) Time-Weighted LSTM Model with Redefined Labeling for Stock Trend Prediction. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), 1210–1217.
[57]	Zou Z, Qu Z (2020) Using LSTM in Stock prediction and Quantitative Trading. CS230: Deep Learning, Winter.

This article has been cited by:

1.	Christian Düll, Piotr Gwiazda, Anna Marciniak-Czochra, Jakub Skrzeczkowski, Structured population models on Polish spaces: A unified approach including graphs, Riemannian manifolds and measure spaces to describe dynamics of heterogeneous populations, 2024, 34, 0218-2025, 109, 10.1142/S0218202524400037
2.	MeiLu McDermott, Riddhee Mehta, Evanthia T. Roussos Torres, Adam L. MacLean, Modeling the dynamics of EMT reveals genes associated with pan-cancer intermediate states and plasticity, 2025, 11, 2056-7189, 10.1038/s41540-025-00512-2

© 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
