Research article

DL-CNV: A deep learning method for identifying copy number variations based on next generation target sequencing

  • Received: 28 April 2019 Accepted: 17 September 2019 Published: 30 September 2019
  • Copy number variations (CNVs) play an important role in many types of cancer. With the rapid development of next generation sequencing (NGS) techniques, many methods for detecting CNVs in a single sample have emerged. However, these methods typically (i) require genome-wide data from both case and control samples, (ii) depend on sequencing depth and GC-content correction algorithms, or (iii) rely on statistical models built on CNV-positive and CNV-negative sample datasets, which makes them costly in data analysis and ineffective on targeted sequencing data. In this study, we developed a novel alignment-free method called DL-CNV to call CNVs from the targeted sequencing data of a single sample. Specifically, we collected two sets of samples. The first set consists of 1301 samples, of which 272 have CNVs in ERBB2, and the second set is composed of 1148 samples, of which 63 contain CNVs in MET. We obtained a testing AUC of 0.9454 for ERBB2 and 0.9220 for MET. Furthermore, we hope to make CNV detection more accurate by incorporating clinical "gold standard" (e.g., FISH) information and to provide a new research direction that can serve as a supplement to existing NGS methods.

    Citation: Yunxiang Zhang, Lvcheng Jin, Bo Wang, Dehong Hu, Leqiang Wang, Pan Li, Junling Zhang, Kai Han, Geng Tian, Dawei Yuan, Jialiang Yang, Wei Tan, Xiaoming Xing, Jidong Lang. DL-CNV: A deep learning method for identifying copy number variations based on next generation target sequencing[J]. Mathematical Biosciences and Engineering, 2020, 17(1): 202-215. doi: 10.3934/mbe.2020011



    Diabetes is a metabolic disorder caused by insufficient or disordered insulin secretion [1]. Its main manifestation is hyperglycemia. Long-term exposure of organs to hyperglycemia damages physiological systems, leading to chronic progressive lesions and failure of tissues and organs such as the eyes, kidneys, nerves, heart and blood vessels [2]. Diabetes mellitus is currently divided into type 1 diabetes mellitus (T1DM) and type 2 diabetes mellitus (T2DM); T2DM is the most common type, accounting for about 95% of diabetic patients [3]. The main factors leading to T2DM are environmental factors and unhealthy living habits; in addition, age, overnutrition and insufficient exercise are also triggers of diabetes [4]. The development from health to T2DM usually passes through three stages: health, pre-diabetes and type 2 diabetes [5]. Once T2DM is diagnosed, the blood glucose level of patients continues to rise and is difficult to reverse with drug treatment [6,7]. However, patients in the pre-diabetes stage can maintain stable blood glucose and even restore health through intervention. Many studies have shown that early diagnosis and treatment are the most effective way to prevent and control T2DM. Therefore, early detection and timely adjustment of lifestyle are the key to the treatment of T2DM [8].

    With economic and cultural development, people pay more and more attention to physical examinations [9,10]. Finding valuable information related to diabetes in physical examination data and identifying how the disease changes across its stages is of great importance for the prevention and treatment of diabetes.

    In recent years, many algorithms have been used to predict diabetes. For example, Zou et al. used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to screen risk factors, and utilized decision tree (DT), random forest (RF) and neural network (NN) models to predict diabetes [11]. By using mutual information (MI) and Gini impurity (GI) to screen diabetes-related risk factors in physical examination data, Yang et al. established a cascaded diabetes risk prediction system [12]. The invasive risk assessment model HCL predicted diabetes by using invasive characteristics and referring to the Harvard Cancer Risk Index [13].

    Machine learning algorithms have been widely used in medicine because of their powerful performance [14,15,16,17]. Therefore, based on real-world physical examination data, this study used XGBoost, RF, logistic regression (LR), and a fully connected neural network (FCN) to predict diabetes and analyzed the impact of the examination indicators at each stage of T2DM.

    The physical examination data were collected from the Beijing Physical Examination Center from January 2006 to December 2017. In this study, the fasting plasma glucose (FPG) index in the physical examination data was used as the standard to classify the samples in the dataset. FPG reflects the function of islet β cells and generally indicates the secretion of basal insulin; it is the most commonly used indicator for diabetes [18]. Clinical use of FPG is conducive to the early diagnosis and prevention of T2DM. According to the WHO (1999) diagnostic criteria for diabetes, the population was divided into three groups: normal FPG (NFG, FPG < 6.1 mmol/L), impaired FPG (IFG, 6.1 mmol/L ≤ FPG < 7.0 mmol/L), and T2DM (FPG ≥ 7.0 mmol/L) [19]. Finally, the benchmark data included 1,221,598 NFG samples, 285,965 IFG samples, and 387,076 T2DM samples.
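    As a minimal illustration of this grouping rule (not code from the original study), the sketch below assigns one of the three labels from an FPG value in mmol/L; the column name "FPG" is a hypothetical placeholder.

```python
import pandas as pd

def label_by_fpg(fpg_mmol_per_l: float) -> str:
    """Assign a group label from fasting plasma glucose (WHO 1999 cut-offs)."""
    if fpg_mmol_per_l < 6.1:
        return "NFG"    # normal fasting glucose
    elif fpg_mmol_per_l < 7.0:
        return "IFG"    # impaired fasting glucose (pre-diabetes)
    else:
        return "T2DM"   # fasting glucose in the diabetic range

# Example: label a small hypothetical table with an 'FPG' column.
df = pd.DataFrame({"FPG": [5.2, 6.4, 7.8]})
df["group"] = df["FPG"].apply(label_by_fpg)
print(df)
```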

    There are 17 initial features in the physical examination data: waistline, age, systolic pressure (SP), gender, blood uric acid (BUA), serum creatinine (SC), triglyceride, diastolic pressure (DP), glutamic oxalacetic transaminase (GOT), hipline, high-density lipoprotein (HDL), glutamic-pyruvic transaminase (GPT), height, blood urea nitrogen (BUN), weight, total cholesterol (TC), and low-density lipoprotein (LDL). Because height and waist circumference alone cannot directly evaluate a person's obesity, we added the waist-to-height ratio (WHtR) to reflect whether a person has visceral fat accumulation. As a result, a total of 18 features were used for further analysis and model construction.
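    The WHtR feature is simply waist circumference divided by height in the same units. A small sketch with hypothetical column names is shown below.

```python
import pandas as pd

# Hypothetical column names; the study's exact field names are not given.
df = pd.DataFrame({
    "waistline_cm": [82.0, 95.5],
    "height_cm":    [170.0, 165.0],
})

# Waist-to-height ratio: a simple proxy for central/visceral fat accumulation.
df["WHtR"] = df["waistline_cm"] / df["height_cm"]
print(df)
```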

    To facilitate performance evaluation, we divided the dataset into a training set and a test set at a ratio of 7:3. Thus, the benchmark dataset can be formulated as

    $$\begin{cases} S_{train} = S_{train}^{1} \cup S_{train}^{2} \cup S_{train}^{3} \\ S_{test} = S_{test}^{1} \cup S_{test}^{2} \cup S_{test}^{3} \end{cases} \quad (1)$$

    where the superscripts 1, 2 and 3 denote NFG, IFG and T2DM, respectively, and the subscripts "train" and "test" denote the training data and the test data, respectively.
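    A minimal sketch of such a 7:3 holdout split is shown below, using synthetic data with 18 features and roughly the class proportions reported above; the use of stratification and the random seed are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))                       # 18 features per sample
y = rng.choice(["NFG", "IFG", "T2DM"], size=1000,
               p=[0.645, 0.151, 0.204])               # rough class proportions

# 70/30 holdout split; stratification keeps the class proportions of Eq. (1)
# similar in S_train and S_test (an assumption here, not stated in the paper).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```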

    In this study, the eXtreme Gradient Boosting (XGBoost), random forest (RF), logistic regression (LR), and fully connected neural network (FCN) algorithms were used as classifiers. The details are as follows.

    XGBoost is based on the gradient boosting algorithm [20,21,22]. In the modeling process, features are split by continuously adding trees; each time, a tree is added to learn a new function that fits the residual of the previous prediction. After training, a gradient boosting model of K trees is obtained. The ultimate goal of XGBoost is to make the predicted value of the tree ensemble as close to the true value as possible while keeping the generalization ability as strong as possible.

    The objective function of XGBoost is:

    $$L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k) \quad (2)$$

    where $\hat{y}_i$ is the output of the entire cumulative model, $y_i$ is the true value, and the regularization term $\sum_k \Omega(f_k)$ represents the complexity of the trees. The smaller this term, the lower the complexity and the stronger the generalization ability of the model.
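    A hedged sketch of training a multi-class XGBoost model of this kind is given below; the hyperparameter values are illustrative rather than those of the original study, and the data are synthetic stand-ins.

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in data; the real study uses 18 physical-examination features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))
y = rng.integers(0, 3, size=1000)        # 0 = NFG, 1 = IFG, 2 = T2DM

# Hyperparameters are illustrative. reg_lambda contributes to the
# Omega(f_k) regularization term of Eq. (2).
model = XGBClassifier(
    n_estimators=200,      # K trees, added one by one to fit residuals
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,
    objective="multi:softprob",
    eval_metric="mlogloss",
)
model.fit(X, y)
proba = model.predict_proba(X)           # class probabilities for each sample
```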

    In this study, Gini impurity (GI) is used to evaluate the contribution of features to the model. In the tree model, better splitting conditions can be selected by comparing GI values; each division of a tree node should make the GI as low as possible. Compared with entropy-based criteria, GI also reduces computational complexity. It is defined as:

    $$Gini(t) = 1 - \sum_{i=0}^{c-1} p(i|t)^2 \quad (3)$$

    where $t$ represents a given node, $i$ represents any label category, and $p(i|t)$ represents the proportion of label category $i$ at node $t$.
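    The following short function is a straightforward implementation of Eq. (3), not code from the study; it computes the Gini impurity of a node from its list of labels.

```python
import numpy as np

def gini_impurity(labels) -> float:
    """Gini impurity of Eq. (3): 1 - sum_i p(i|t)^2 over the classes at node t."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

# A pure node has impurity 0; a 50/50 node has impurity 0.5.
print(gini_impurity(["NFG", "NFG", "NFG"]))           # 0.0
print(gini_impurity(["NFG", "T2DM", "NFG", "T2DM"]))  # 0.5
```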

    RF is also a tree-based ensemble classifier and is a representative model of the bagging method. The core idea of bagging is to construct multiple independent estimators and then determine the final prediction by averaging or majority voting [23,24].

    LR is a generalized linear regression analysis algorithm that is often used in disease diagnosis [25,26]. It is a variation of linear regression and is widely used for both regression and classification. LR constructs a mapping from $X$ to $\hat{y}$ and estimates the parameters of the model formulated as

    $$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \quad (4)$$

    The process is as follows. First, a loss function is defined; then the parameter vector is obtained by minimizing the loss function. Finally, LR uses the sigmoid function to map the output into the range between 0 and 1:

    $$g(z) = \frac{1}{1 + e^{-z}} \quad (5)$$

    The sigmoid function maps the value of $g(z)$ into the interval between 0 and 1. When $g(z)$ approaches 0, the sample is assigned to category 0, and when $g(z)$ is close to 1, the sample is assigned to category 1. In this way, a classification model is obtained.
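    A minimal sketch of this prediction step, Eq. (4) followed by Eq. (5), is shown below with hypothetical (unfitted) coefficients.

```python
import numpy as np

def sigmoid(z):
    """Eq. (5): squashes the linear score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta0, theta):
    """Eq. (4) followed by Eq. (5): linear score, then sigmoid."""
    z = theta0 + X @ theta
    return sigmoid(z)

# Toy example with hypothetical coefficients (not fitted to the study's data).
X = np.array([[0.2, 1.5], [-1.0, 0.3]])
theta0, theta = -0.5, np.array([0.8, 1.2])
p = predict_proba(X, theta0, theta)
labels = (p >= 0.5).astype(int)        # threshold at 0.5 for a binary label
print(p, labels)
```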

    An FCN generally consists of three parts: an input layer, hidden layers and an output layer [27,28]. Each layer takes the output of the previous layer as input and passes its own output to the next layer. The most basic unit of a neural network is the neuron: each neuron receives multiple inputs and produces one output, and neurons are connected to each other to form the network. A fully connected neural network generates nonlinear outputs through activation functions; commonly used activation functions are ReLU, Sigmoid and Tanh. FCN training consists of two processes: forward propagation and backward propagation. Forward propagation fits the features and then uses the loss function to measure the gap between the model output and the target value. Backpropagation uses gradient descent to update the parameters of each layer according to the loss computed in forward propagation, thereby optimizing the parameters.

    We established a three-layer fully connected neural network. The input layer has 18 neurons; the first hidden layer has 7 neurons and the second hidden layer has 4 neurons, both using the ReLU activation function, and the network is trained with the RMSprop optimizer. The output layer has three neurons with the Softmax activation function.
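    A possible Keras sketch of this architecture is given below; the loss function and the commented-out training call are assumptions, since the paper does not state them.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of the described architecture: 18 inputs, two hidden
# layers of 7 and 4 ReLU neurons, and a 3-way Softmax output.
model = keras.Sequential([
    keras.Input(shape=(18,)),
    layers.Dense(7, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",   # assumed loss for integer labels
    metrics=["accuracy"],
)
model.summary()
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)
```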

    In this study, accuracy, precision, recall, F1 and AUC were used to evaluate the performance of the proposed models [29]; they are calculated as follows:

    $$\begin{cases} Accuracy = \dfrac{TP + TN}{TP + TN + FN + FP} \\ Precision = \dfrac{TP}{TP + FP} \\ Recall = TPR = \dfrac{TP}{TP + FN} \\ FPR = \dfrac{FP}{FP + TN} \\ \dfrac{2}{F1} = \dfrac{1}{Precision} + \dfrac{1}{Recall} \end{cases} \quad (6)$$

    where TP represents true positives, the number of correctly predicted positive samples; FP denotes false positives, the number of negative samples predicted as positive; FN indicates false negatives, the number of positive samples classified as negative; and TN denotes true negatives, the number of samples correctly predicted as negative. Accuracy is the ratio of correctly predicted samples to the total number of samples.
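    The metrics of Eq. (6) can be computed directly from the confusion-matrix counts, as in the sketch below (the example counts are hypothetical).

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Evaluation indexes of Eq. (6) computed from a binary confusion matrix."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)             # also the true positive rate (TPR)
    fpr       = fp / (fp + tn)             # false positive rate
    f1        = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "FPR": fpr, "F1": f1}

# Hypothetical counts, for illustration only.
print(binary_metrics(tp=80, tn=90, fp=10, fn=20))
```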

    The receiver operating characteristic (ROC) curve is often used to measure the predictive power of the current method across the entire range of algorithm decision value [30]. The ROC can reveal the relationship between true positive rate (TPR) and false positive rate (FPR). We used the area under the ROC curve, referred to as area under curve (AUC), to evaluate the performance of the model.

    Generally, there are three methods for model verification: Holdout test, K-Fold cross-validation test and Leave-One-Out (LOO) test [31,32].

    The holdout test divides the samples into two mutually exclusive parts: one part is used as the training set and the other as the test set. The model is trained on the training set and examined on the test set, and all evaluation indexes are calculated on the test set. K-fold cross-validation divides the dataset into K mutually exclusive subsets; each time, one subset is used as the test set and all the others as the training set, and the K subsets are traversed in turn. Finally, the average values of the evaluation indexes are used as the final results. The stability of K-fold cross-validation is closely related to the value of K: if K is too small, the evaluation is not stable enough, and if K is too large, the modeling cost increases. Generally, K is 5 or 10. LOO is a special case of K-fold cross-validation in which K equals the number of samples in the dataset. The results obtained by this method are closest to the expectation of training on the entire dataset, but the computational cost is very large.

    In this article, we use Holdout test for model verification.
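    The sketch below contrasts the two strategies on synthetic data: a single 70/30 holdout evaluation versus 5-fold cross-validation. The classifier and data are placeholders, not the study's models.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 18))
y = rng.integers(0, 3, size=500)
clf = LogisticRegression(max_iter=1000)

# Holdout test (used in this article): one 70/30 split, evaluated once.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
holdout_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: average accuracy over 5 held-out folds.
cv_acc = cross_val_score(clf, X, y, cv=5).mean()
print(holdout_acc, cv_acc)
```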

    In this study, four machine learning methods, namely XGBoost, RF, LR and FCN, were used as classifiers. The following two experiments were performed.

    In the first experiment, four three-class models based on the above methods were established to distinguish NFG, IFG and T2DM. We used $S_{train}^1$, $S_{train}^2$ and $S_{train}^3$ to train the four machine learning methods, and $S_{test}^1$, $S_{test}^2$ and $S_{test}^3$ to investigate the performance of the models in predicting NFG, IFG and T2DM. The results are recorded in Table 1 and shown in Figure 1. Table 1 displays the six evaluation indexes of the four models on the test data. From the table, we notice that XGBoost produces the best results, with an AUC (macro) of 0.7874 and an AUC (micro) of 0.8633. It is worth noting that the prediction results of FCN are the worst, suggesting that FCN is not well suited to this kind of health data analysis; this is consistent with the fact that neural networks are not well suited to samples with few features. Figure 1 shows the ROC curves of the four classifiers on the test set. For each algorithm, we drew the micro-average ROC curve, the macro-average ROC curve and the ROC curve of each class. According to Figure 1(a), the AUCs of XGBoost for identifying NFG, IFG and T2DM from the entire population are 0.79, 0.70 and 0.84, respectively.
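    For reference, micro- and macro-averaged AUCs of the kind reported in Table 1 can be computed from one-vs-rest class probabilities as in the sketch below; the labels and scores here are random stand-ins, not the study's predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# y_true: integer class labels (0=NFG, 1=IFG, 2=T2DM);
# y_score: predicted class probabilities, shape (n_samples, 3).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=300)
y_score = rng.dirichlet(alpha=[1, 1, 1], size=300)

y_bin = label_binarize(y_true, classes=[0, 1, 2])

# Macro AUC: one-vs-rest AUC per class, then an unweighted average.
auc_macro = roc_auc_score(y_bin, y_score, average="macro")
# Micro AUC: pool all one-vs-rest decisions before computing a single AUC.
auc_micro = roc_auc_score(y_bin, y_score, average="micro")
print(auc_macro, auc_micro)
```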

    Table 1.  The results for the prediction of NFG, IFG and T2DM.

    Algorithm | Accuracy | Precision (weighted) | F1-score (weighted) | Recall (weighted) | AUC (micro) | AUC (macro)
    XGBoost   | 0.6871   | 0.8192               | 0.7367              | 0.6871            | 0.8633      | 0.7874
    RF        | 0.6590   | 0.8260               | 0.7185              | 0.6590            | 0.8233      | 0.7842
    LR        | 0.6540   | 0.8334               | 0.7159              | 0.6540            | 0.8068      | 0.7841
    FCN       | 0.5593   | 0.5601               | 0.5560              | 0.5593            | 0.7607      | 0.7472

    Figure 1.  The results for the prediction of NFG, IFG and T2DM. (a) The ROC curves of XGBoost, (b) The ROC curves of RF, (c) The ROC curves of LR, (d) The ROC curves of FCN, (e) The feature importance using GI, (f) The IFS curve for feature importance using XGBoost.
    Table 2.  The results for the discrimination between any two classes using XGBoost.

    Dataset     | Recall | Accuracy | Precision | F1-score | AUC
    NFG vs IFG  | 0.6732 | 0.7220   | 0.2047    | 0.3140   | 0.7808
    NFG vs T2DM | 0.7611 | 0.8039   | 0.2194    | 0.3409   | 0.8687
    IFG vs T2DM | 0.7960 | 0.5891   | 0.4983    | 0.6129   | 0.7067


    Subsequently, we performed a feature analysis; the results are shown in Figure 1(e) and (f). Figure 1(e) shows the feature importance of XGBoost on the benchmark dataset. Waist circumference ranked first, indicating that obesity is the most important risk factor for diabetes, and age ranked second: the older the age, the greater the risk of diabetes. Figure 1(f) shows the incremental feature selection (IFS) curve. It can be seen that when the first 7 features (Waistline, Age, SP, Gender, BUA, SC, Triglyceride) are used for modeling, the model achieves the highest AUC, and adding further features does not improve the overall results. We therefore believe that these 7 features are important risk factors for distinguishing NFG, IFG and T2DM.
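    A hedged sketch of such an IFS procedure is shown below: features are ranked by importance, models are trained on the top-k features for increasing k, and the k giving the highest test AUC is kept. The data, binary setting and importance source are simplifications of the paper's GI-based, multi-class analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Placeholder data; the study ranks 18 physical-examination features by GI.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))
y = rng.integers(0, 2, size=1000)                 # binary case for simplicity
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Rank features by importance on the training set (most important first).
ranking = np.argsort(
    XGBClassifier(n_estimators=100).fit(X_tr, y_tr).feature_importances_
)[::-1]

# Train on the top-1, top-2, ... features and record the test AUC each time.
aucs = []
for k in range(1, len(ranking) + 1):
    cols = ranking[:k]
    clf = XGBClassifier(n_estimators=100).fit(X_tr[:, cols], y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1]))

best_k = int(np.argmax(aucs)) + 1                 # size of the optimal subset
print(best_k, max(aucs))
```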

    Figure 2.  The results for discriminating NFG from IFG. (a) ROC curve, (b) The feature importance, (c) The IFS curve for feature selection.
    Figure 3.  The results for discriminating NFG from T2DM. (a) ROC curve, (b) The feature importance, (c) The IFS curve for feature selection.

    On the basis of the benchmark dataset, three binary models were established to distinguish NFG from IFG, NFG from T2DM, and IFG from T2DM. The importance of the features in each model was assessed using GI, and incremental feature selection (IFS) was used to find the optimal feature subset. Because of its good performance and wide usage on health data, only XGBoost was used to construct the three models. The results are recorded in Table 2.

    First, we built a model for discriminating NFG from IFG. The ROC curve and feature ranking of the model are drawn in Figure 2. The results show that the AUC is 0.7808; there is little difference between NFG and IFG. Although blood glucose is elevated in the pre-diabetes stage, the pancreatic islets have not been completely impaired and irreversible damage to the body has not yet occurred. From Figure 2b and c, it can be observed that the most important features in this case are Waistline, Age, WHtR, Gender and SP, indicating that the risk factors for the early-stage population are obesity, age and hypertension.

    Subsequently, we focused on the discrimination between NFG and T2DM. From Table 2 and Figure 3a, the XGBoost-based model achieves an AUC of 0.8687; a model built on physical examination indicators can quite accurately distinguish normal people from diabetic people. The order of feature importance is Age, Waistline, Triglyceride, WHtR, SP, Gender and SC (Figure 3b). In identifying diabetic patients, some molecular markers, such as triglycerides, play an important role, reflecting the physiological state of diabetic patients. At present, the diagnosis rate of diabetes in China is less than 50%, so it is of great significance to screen for diabetes through physical examination indicators, especially in free physical examinations in rural China.

    The third binary model was built to distinguish IFG from T2DM based on XGBoost. From the results in Table 2 and Figure 4a, the model achieves an AUC of 0.7067 on the test dataset, the lowest among the three binary classification models. This is mainly because many physical indicators of pre-diabetes and diabetes are very similar; patients with pre-diabetes are not easily controlled and treated and are easily converted to diabetic patients. In this classification problem, both the IFG and T2DM populations are exposed to hyperglycemia, which affects various physical indicators. Figure 4b and c show that the most important features are Gender, SC, Triglyceride, Age, BUA, Waistline, GOT, WHtR and GPT. Some particular features, such as SC and GOT, may indicate that the renal and liver function of the T2DM population is impaired compared with the IFG population.

    Figure 4.  The results for discriminating IFG from T2DM. (a) ROC curve of XGBoost, (b) The feature importance, (c) The IFS curve for feature selection.

    Diabetes is a metabolic disease whose development from health generally passes through three stages: health, pre-diabetes and type 2 diabetes. How to use machine learning methods for early prediction and diagnosis of the disease is worth studying. In the three-class experiment distinguishing NFG, IFG and T2DM, comparing the results of the four classifiers XGBoost, RF, LR and FCN shows that there is little difference between them; XGBoost is slightly better than the other classifiers, with an AUC (macro) of 0.7874 and an AUC (micro) of 0.8633. We then chose XGBoost as the basic classifier and constructed three binary classification models to distinguish NFG from IFG, NFG from T2DM, and IFG from T2DM. The AUCs of these models on the test dataset are 0.7808, 0.8687 and 0.7067, respectively. We used the GI index to evaluate and rank the importance of the features and mined relevant risk factors by combining the ranking with the IFS strategy. Overall, Age, Triglyceride, WHtR and SP are important risk factors. In particular, it was found that T2DM patients may have liver and kidney damage.

    Through this work, we hope to explore the possibility of early prediction of diabetes with physical examination data. We also hope to mine valuable information related to diabetes from physical examination data and other omics data [33] and to discover the changes at each stage of diabetes, so as to provide clues for early prevention and treatment. In the future, we hope to clarify the causal relationships between the various risk factors and diabetes through cohort studies and Mendelian randomization studies, and to explore effective intervention schemes on this basis.

    The study was supported by grants from the National Key R & D Program of China (2020YFC2003403), Capital's Funds for Health Improvement and Research (2018-2-2242) and the National Natural Science Foundation of China (82130112).

    The authors declare that there is no conflict of interest.



    [1] S. A. McCarroll and D. M. Altshuler, Copy-number variation and association studies of human disease, Nat. Genet., 39(2007), S37-42.
    [2] P. Liu, C. M. Carvalho, P. J. Hastings, et al., Mechanisms for recurrent and complex human genomic rearrangements, Curr. Opin. Genet. Dev., 22(2012), 211-220.
    [3] A. P. de Koning, W. Gu, T. A. Castoe, et al., Repetitive elements may comprise over two-thirds of the human genome, Plos Genet.,7 (2011), e1002384. doi: 10.1371/journal.pgen.1002384
    [4] M. Zarrei, J. R. MacDonald, D. Merico, et al., A copy number variation map of the human genome, Nat. Rev. Genet., 16(2015), 172-183. doi: 10.1038/nrg3871
    [5] J. L. Freeman, G. H. Perry, L. Feuk, et al., Copy number variation: new insights in genome diversity, Genome Res., 16(2006), 949-961.
    [6] S. F. Chin, A. E. Teschendorff, J. C. Marioni, et al., High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer, Genome Biol., 8(2007), R215.
    [7] D. He, N. Furlotte and E. Eskin, Detection and reconstruction of tandemly organized de novo copy number variations, BMC Bioinf., 11(2010), S12.
    [8] G. Klambauer, K. Schwarzbauer, A. Mayr, et al., cn.MOPS: Mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate, Nucleic Acids Res., 40 (2012), e69. doi: 10.1093/nar/gks003
    [9] E. Talevich, A. H. Shain, T. Botton, et al., CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA Sequencing, PLoS Comput. Biol., 12(2016), e1004873.
    [10] A. Abyzov, A. E. Urban, M. Snyder, et al., CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing, Genome Res., 21(2011), 974-984.
    [11] V. Boeva, T. Popova, K. Bleakley, et al., Control-FREEC: A tool for assessing copy number and allelic content using next-generation sequencing data, Bioinf., 28 (2012), 423-425.
    [12] G. Onsongo, L. B. Baughn, M. Bower, et al., CNV-RF Is a Random Forest-Based Copy Number Variation Detection Method Using Next-Generation Sequencing, J. Mol. Diagn., 18(2016), 872-881.
    [13] C. Wang, J. M. Evans, A. V. Bhagwate, et al., PatternCNV: A versatile tool for detecting copy number changes from exome sequencing data, Bioinf., 30 (2014), 2678-2680.
    [14] J. Budczies, N. Pfarr, A. Stenzinger, et al., Ioncopy: A novel method for calling copy number alterations in amplicon sequencing data including significance assessment, Oncotarget, 7 (2016), 13236-13247.
    [15] Y. L. Cun, L. Bottou, Y. Bengio, et al., Gradient-Based Learning Applied to Document Recognition, Proc. IEEE., 86 (1998), 2278-2324.
    [16] J. Zhou and O. G. Troyanskaya, Predicting effects of noncoding variants with deep learning-based sequence model, Nat. Methods,12 (2015), 931-934.
    [17] R. Poplin, D. Newburger, J. Dijamco, et al., Creating a universal SNP and small indel variant caller with deep neural networks, BioRxiv, (2016), 092890.
    [18] M. X. Sliwkowski, J. A. Lofgren, G. D. Lewis, et al., Nonclinical studies addressing the mechanism of action of trastuzumab (Herceptin), Semin Oncol., 26(1999), 60-70.
    [19] S. Ahn, M. Hong, M. Van Vrancken, et al., A nCounter CNV Assay to Detect HER2 Amplification: A Correlation Study with Immunohistochemistry and In Situ Hybridization in Advanced Gastric Cancer, Mol. Diagn. Ther., 20(2016), 375-383.
    [20] F. Sircoulomb, I. Bekhouche, P. Finetti, et al., Genome profiling of ERBB2-amplified breast cancers, BMC Cancer,10 (2010), 539.
    [21] S. Kim, T. M. Kim, D. W. Kim, et al., Acquired Resistance of MET-Amplified Non-small Cell Lung Cancer Cells to the MET Inhibitor Capmatinib, Cancer Res. Treat., 5 (2019), 951-962.
    [22] N. Pfarr, R. Penzel, F. Klauschen, et al., Copy number changes of clinically actionable genes in melanoma, non-small cell lung cancer and colorectal cancer-A survey across 822 routine diagnostic cases, Genes Chromosomes Cancer, 55 (2016), 821-833.
    [23] F. Zare, M. Dow, N. Monteleone, et al., An evaluation of copy number variation detection tools for cancer using whole exome sequencing data, BMC Bioinf., 18(2017), 286.
    [24] H. Li and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform, Bioinf., 25(2009), 1754-1760.
    [25] H. Li, B. Handsaker, A. Wysoker, et al., The Sequence Alignment/Map format and SAMtools, Bioinf., 25(2009), 2078-2079.
    [26] M. Abadi, P. Barham, J. Chen, et al., TensorFlow: A system for large-scale machine learning, 12th Symposium on Operating Systems Design and Implementation (OSDI), 2016, 265-283. Available from: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.
  • © 2020 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)