1. School of Mathematics and Information Sciences, Guangzhou University, Guangzhou 510006, China
2. Clearpier Inc., 1300-121 Richmond St. W., Toronto, Ontario M5H 2K1, Canada

A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supportive experiments are presented.

Keywords: association, categorical data, feature selection, statistical reliability.

Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A CATEGORY-BASED PROBABILISTIC APPROACH TO FEATURE SELECTION. Big Data and Information Analytics, 2018, 3(1): 14-21. doi: 10.3934/bdia.2017020





© 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
