iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree

Yunyun Liang; Shengli Zhang; Huijuan Qiao; Yinan Cheng

doi:10.3934/mbe.2021434

Mathematical Biosciences and Engineering

2021, Volume 18, Issue 6: 8797-8814. doi: 10.3934/mbe.2021434

Research article

1. School of Science, Xi'an Polytechnic University, Xi'an 710048, China
2. School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
3. Department of Statistics, University of California at Davis, Davis, CA 95616, USA

Received: 20 August 2021; Accepted: 30 September 2021; Published: 14 October 2021

Abstract: An enhancer is a non-coding DNA fragment that can be bound by proteins to activate transcription of a gene, and hence plays an important role in regulating gene expression. Enhancer identification is very challenging and more complicated than identification of other genetic factors because enhancers vary in position and are freely scattered. In addition, it has been shown that genetic variation in enhancers is related to human diseases. Therefore, identification of enhancers and their strength has important biological meaning. In this paper, a novel model named iEnhancer-MFGBDT is developed to identify enhancers and their strength by fusing multiple features and using a gradient boosting decision tree (GBDT). The features include k-mer and reverse complement k-mer nucleotide composition based on the DNA sequence, and second-order moving average, normalized Moreau-Broto auto-cross correlation and Moran auto-cross correlation based on the dinucleotide physical structural property matrix. GBDT is then used first to select features and then to perform classification. The accuracies reach 78.67% and 66.04% for identifying enhancers and their strength on the benchmark dataset, respectively. Comparison with other models shows that our model is a useful and effective tool for identifying enhancers and their strength. The datasets and source code are available at https://github.com/shengli0201/iEnhancer-MFGBDT1.
Keywords: identification; enhancers; multiple features; gradient boosting decision tree
Citation: Yunyun Liang, Shengli Zhang, Huijuan Qiao, Yinan Cheng. iEnhancer-MFGBDT: Identifying enhancers and their strength by fusing multiple features and gradient boosting decision tree[J]. Mathematical Biosciences and Engineering, 2021, 18(6): 8797-8814. doi: 10.3934/mbe.2021434
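
The k-mer and reverse complement k-mer nucleotide composition features named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalization by the number of valid windows, and the choice of the lexicographically smaller of a k-mer and its reverse complement as the merged representative are all assumptions made for illustration.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_composition(seq, k):
    """Frequencies of all 4**k k-mers, in lexicographic order (assumed normalization)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def rc_kmer_composition(seq, k):
    """Like kmer_composition, but each k-mer is merged with its reverse complement."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Assumed convention: represent each pair by its lexicographically smaller member.
    canonical = sorted({min(w, reverse_complement(w)) for w in kmers})
    counts = dict.fromkeys(canonical, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(base in "ACGT" for base in word):
            counts[min(word, reverse_complement(word))] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in canonical]


# Hypothetical toy sequence, not taken from the benchmark dataset.
features = kmer_composition("ACGTACGT", 2) + rc_kmer_composition("ACGTACGT", 2)
```

For k = 2 this gives 16 k-mer features and 10 reverse complement k-mer features per sequence. Feature vectors of this kind, concatenated with the other descriptors, would then be passed to a gradient boosting classifier for feature selection and classification.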



© 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)