A CATEGORY-BASED PROBABILISTIC APPROACH TO FEATURE SELECTION

  • A high-dimensional, large-sample categorical data set with a response variable may have many noninformative or redundant categories in its explanatory variables. Identifying and removing these categories usually not only improves the association but also gives rise to significantly higher statistical reliability of the selected features. A category-based probabilistic approach is proposed to achieve this goal. Supporting experiments are presented.

    Citation: Jianguo Dai, Wenxue Huang, Yuanyi Pan. A CATEGORY-BASED PROBABILISTIC APPROACH TO FEATURE SELECTION[J]. Big Data and Information Analytics, 2018, 3(1): 14-21. doi: 10.3934/bdia.2017020

  • © 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
