Dimension reduction methods for microarray data: a review

Rabia Aziz; C.K. Verma; Namita Srivastava

doi:10.3934/bioeng.2017.1.179

AIMS Bioengineering

2017, Volume 4, Issue 1: 179-197. doi: 10.3934/bioeng.2017.1.179

Review

Department of Mathematics & Computer Application, Maulana Azad National Institute of Technology, Bhopal-462003 (M.P.), India

Received: 27 November 2016 Accepted: 01 March 2017 Published: 07 March 2017

Dimension reduction has become an indispensable pre-processing step for high-dimensional data. Gene expression microarray data is a typical instance: it measures the expression of a very large number of genes (features) simultaneously at the molecular level, but for only a very small number of samples. All of these genes are usually supplied to a learning algorithm in the hope of producing a complete characterization of the classification task. However, most of the time the majority of the genes are irrelevant or redundant to the learning task. Their presence degrades learning accuracy and training speed and leads to overfitting. Thus, dimension reduction of microarray data is a crucial preprocessing step for the prediction and classification of disease. Various feature selection and feature extraction techniques have been proposed in the literature to identify the genes that have a direct impact on the machine learning algorithms used for classification and to eliminate the remaining ones. This paper describes the taxonomy of dimension reduction methods together with their characteristics, evaluation criteria, advantages and disadvantages. It also reviews numerous dimension reduction approaches for microarray data, mainly those proposed over the past few years.
Keywords: DNA microarrays; dimension reduction; classification; prediction
Citation: Rabia Aziz, C.K. Verma, Namita Srivastava. Dimension reduction methods for microarray data: a review[J]. AIMS Bioengineering, 2017, 4(1): 179-197. doi: 10.3934/bioeng.2017.1.179
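
The idea of feature selection described in the abstract can be illustrated with a minimal sketch. It is not a method taken from the paper: the signal-to-noise ratio score, the synthetic data generator, and every function and parameter name below (`make_toy_data`, `snr_score`, `select_top_genes`, `shift`, `k`) are assumptions chosen for illustration. The sketch builds a toy data set with many genes and few samples, scores each gene independently of any classifier (a filter-style approach), and keeps the highest-scoring genes.

```python
import random
import statistics


def make_toy_data(n_samples=20, n_genes=500, n_informative=5, shift=4.0, seed=0):
    """Synthetic two-class data with few samples and many genes (hypothetical).

    Only the first `n_informative` genes differ between the classes;
    all other genes are pure Gaussian noise.
    """
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    data = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += shift * y  # shift the informative genes in class 1
        data.append(row)
    return data, labels


def snr_score(values, labels):
    """Signal-to-noise ratio |mean1 - mean0| / (sd1 + sd0) for a single gene.

    Assumes at least two samples per class and non-zero spread.
    """
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.stdev(g0) + statistics.stdev(g1)
    return abs(statistics.mean(g1) - statistics.mean(g0)) / spread


def select_top_genes(data, labels, k):
    """Rank genes by their score and return the indices of the k best."""
    n_genes = len(data[0])
    scores = [snr_score([row[g] for row in data], labels) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return sorted(ranked[:k])


data, labels = make_toy_data()
selected = select_top_genes(data, labels, k=5)
# Reduced data set: same samples, only the selected gene columns.
reduced = [[row[g] for g in selected] for row in data]
print(len(data[0]), "->", len(reduced[0]), "genes; kept indices:", selected)
```

On data like this the shifted genes are expected to dominate the ranking, but with so few samples a noise gene can occasionally score highly by chance, which is the overfitting risk the abstract points to. Feature extraction methods differ in that they build new features from combinations of genes rather than keeping a subset of the original ones.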

© 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)