Citation: Wenxue Huang, Qitian Qiu. Forward Supervised Discretization for Multivariate with Categorical Responses[J]. Big Data and Information Analytics, 2016, 1(2): 217-225. doi: 10.3934/bdia.2016005
In data analysis and mining, discretization algorithms turn continuous variables into a finite number of levels. For instance, income, as a continuous variable, needs to be leveled as low, medium, or high for executable user profiling or for descriptive analysis. Many data mining technologies or methods also require categorical explanatory variables. For example, Bayesian classification [21] assumes that the explanatory variables are all categorical or discrete; a decision tree [22] consists of nodes, each describing the condition that leads to the next node, and the variables involved in each node have to be either categorical or described as a combination of intervals. Real data sets contain many continuous variables, such as assets, income, debt, age, and numerical measures of risk.
A natural method of grouping the distinct values of a continuous variable is to seek data-based cutting points that divide the whole range of the data into intervals. There are two ways to identify the intervals: with or without a given response variable. An algorithm that discretizes continuous variables with a response variable (under a given criterion or objective function) is called supervised discretization; one with no link to a response variable is called unsupervised discretization [5]. The latter is often used in industrial data mining projects to deal with multiple response variables at the same time, for convenience, unification, or to save production cost. One may use a normal-distribution-based unsupervised discretization to process macro-level numerical consumption data because of the central limit theorem. Other unsupervised discretization methods include, but are not limited to, equal-frequency intervals and equal-width intervals [5] (a minimal sketch of these two is given below), or more sophisticated ones associated with certain criterion measures (or objective functions), say, the information-theoretic entropy [4,6]. The idea is to minimize or maximize the measure within each interval by adjusting the boundaries. Various clustering algorithms can also be used to accomplish unsupervised discretization [7] in multi-dimensional cases, compartmentalizing a continuous multidimensional space into a given finite number of parts.
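As an illustration of the two simplest unsupervised schemes just mentioned, the sketch below bins a numeric variable into equal-width and equal-frequency intervals. The function names and the use of NumPy are our own choices for the example, not part of any method discussed in this article.

```python
import numpy as np

def equal_width_bins(x, n_bins):
    """Split the observed range of x into n_bins intervals of equal length."""
    edges = np.linspace(np.min(x), np.max(x), n_bins + 1)
    # np.digitize assigns each value to the interval containing it (0 .. n_bins-1).
    labels = np.clip(np.digitize(x, edges[1:-1], right=True), 0, n_bins - 1)
    return labels, edges

def equal_freq_bins(x, n_bins):
    """Place boundaries at empirical quantiles so each interval holds roughly equal counts."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    labels = np.clip(np.digitize(x, edges[1:-1], right=True), 0, n_bins - 1)
    return labels, edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    income = rng.lognormal(mean=10, sigma=0.5, size=1000)  # a skewed, income-like variable
    ew_labels, _ = equal_width_bins(income, 3)
    ef_labels, _ = equal_freq_bins(income, 3)
    print("equal-width counts    :", np.bincount(ew_labels, minlength=3))
    print("equal-frequency counts:", np.bincount(ef_labels, minlength=3))
```

On a skewed variable such as income, the two schemes behave very differently: equal-width bins leave the upper levels nearly empty, while equal-frequency bins keep the levels balanced.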
In contrast to the unsupervised case, supervised discretization algorithms seek the boundaries by optimizing the intervals' coherence [16] with respect to a target variable under an optimization criterion. In other words, an evaluation function is applied to measure the quality of the discretization. Typical measures include the conditional entropy, the conditional Gini concentration, and the Chi-square statistic. The Chi-square based methods include ChiMerge [15], Chi2 [17], Khiops [1], etc. One may refer to [2,3,10,20] for detailed discussions of the (conditional) entropy based methods. The conditional Gini concentration based methods can be found in [12,13]. The choice of discretization method depends on the computational complexity, the desired accuracy, the interpretability [16], and/or how easily the result can be put into action. Usually the unsupervised discretization methods are faster than the supervised ones but yield lower accuracy in predicting the target variable. Dougherty et al. [5] show that (conditional) entropy-based discretization methods perform quite well overall regarding the proportional association. Holte showed in [10] that even a one-dimensional supervised discretization system can yield classification results similar to those of a multi-dimensional one if the system is carefully tuned.
Rather than using a (conditional) entropy-based discretization method, Huang et al. propose two alternatives in [12,13]: proportional-association-based and modal-association-based independently (or individually) supervised discretization. Besides the expected higher association with the response, obtained with only a limited increase in computational complexity over an unsupervised method, they also argue that their measures are more interpretable than the conditional entropy ones.
Suppose we work with a categorical response variable and a number of continuous explanatory variables. We propose a forward supervised discretization algorithm that captures a stronger association with the response variable than the independently supervised discretization algorithm proposed by Huang et al. Experiments in the latter part of this article show remarkable improvements, with an acceptable increase in computational complexity, with respect to the same associations proposed in [12] and [13].
This article is organized as follows. Section 2 recalls the concept of categorical variable association measures, in particular the GK-tau and the GK-lambda. We also briefly recall Huang et al.'s independently supervised discretization algorithm in this section. The forward supervised discretization algorithm is presented in Section 3. Although the algorithms are introduced with the GK-tau, it can be replaced by any other association measure, including the GK-lambda.
Section 4 presents supporting evidence from experiments with the GK-tau and the GK-lambda. The major issues are summarized and commented on in the last section.
Let $Y$ be a categorical response variable with $n_Y$ categories and $X$ a categorical explanatory variable with $n_X$ categories.

According to [18, p.70], an association measure (or degree) of $Y$ on $X$ can be defined as
$$ \delta_{Y|X}=\frac{V(Y)-E_X\{E_Y[V(Y|X)]\}}{V(Y)}, $$
where $V(\cdot)$ stands for a variation measure of the (conditional) distribution of $Y$.
The proportional association degree of $Y$ on $X$, denoted by $\tau_{Y|X}$, is defined as
$$ \tau_{Y|X}=\frac{\sum_{j=1}^{n_Y}\sum_{i=1}^{n_X}\dfrac{p(X=i,Y=j)^2}{p(X=i)}-\sum_{j=1}^{n_Y}p(Y=j)^2}{1-\sum_{j=1}^{n_Y}p(Y=j)^2}, $$
where $p(X=i,Y=j)$ is the joint probability that $X$ takes its $i$-th category and $Y$ its $j$-th category, and $p(X=i)$ and $p(Y=j)$ are the corresponding marginal probabilities.
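To make the formula concrete, the following sketch computes $\tau_{Y|X}$ from a joint probability (or contingency) table. The array layout (rows indexed by the categories of $X$, columns by those of $Y$) and the function name are our own choices for illustration.

```python
import numpy as np

def gk_tau(joint):
    """Goodman-Kruskal tau of Y on X.

    joint[i, j] holds the joint probability (or count; counts are
    normalized below) of X in category i and Y in category j.
    """
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()                       # normalize counts to probabilities
    p_x = p.sum(axis=1)                   # marginal distribution of X
    p_y = p.sum(axis=0)                   # marginal distribution of Y
    # sum_i sum_j p(i,j)^2 / p(i), skipping empty X categories
    cond = np.sum((p ** 2).sum(axis=1)[p_x > 0] / p_x[p_x > 0])
    base = np.sum(p_y ** 2)
    return (cond - base) / (1.0 - base)

# Example: a 2x3 table in which X is informative about Y
print(gk_tau([[30, 5, 5], [5, 25, 30]]))
```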
The optimal (or modal) association degree of $Y$ on $X$, denoted by $\lambda_{Y|X}$, is defined as
$$ \lambda_{Y|X}=\frac{\sum_{i=1}^{n_X}\rho_{im}-\rho_{\cdot m}}{1-\rho_{\cdot m}}, $$
where
$$ \rho_{im}=\max_{j\in\{1,2,\ldots,n_Y\}}p(X=i,Y=j) \quad\text{and}\quad \rho_{\cdot m}=\max_{j\in\{1,2,\ldots,n_Y\}}p(Y=j). $$
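A corresponding sketch for $\lambda_{Y|X}$, again computed from a joint probability table under the same layout assumption:

```python
import numpy as np

def gk_lambda(joint):
    """Goodman-Kruskal lambda of Y on X: the proportional reduction in the
    error of the modal (optimal) prediction of Y when X is known."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    rho_im = p.max(axis=1)            # rho_{i m}: max_j p(X=i, Y=j) for each i
    rho_dot_m = p.sum(axis=0).max()   # rho_{. m}: max_j p(Y=j)
    return (rho_im.sum() - rho_dot_m) / (1.0 - rho_dot_m)

print(gk_lambda([[30, 5, 5], [5, 25, 30]]))
```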
Recall [12] again that, for a continuous explanatory variable $X$ and a set of cutting points $B_k=\{b_1<b_2<\cdots<b_k\}$, the proportional association of $Y$ on the discretized $X$ is
$$ \tau_{Y|X(B_k)}=\frac{\sum_{j=1}^{n_Y}\sum_{i=1}^{k+1}\dfrac{p(b_{i-1}<X\le b_i,\,Y=j)^2}{p(b_{i-1}<X\le b_i)}-\sum_{j=1}^{n_Y}p(Y=j)^2}{1-\sum_{j=1}^{n_Y}p(Y=j)^2}, $$
where $b_0=-\infty$ and $b_{k+1}=+\infty$.
Huang et al. proposed an optimal splitting searching scheme in [12] to find these cutting points. This scheme can be briefly described as follows (a code sketch is given after the steps):
Step 1. The first cutting point, denoted $r_1^*(X)$, is
$$ r_1^*(X)=\mathop{\arg\max}_{\min(X)<r<\max(X)}\ \tau_{Y|((-\infty,r],\,(r,\infty))}; $$

Step 2. The second cutting point, denoted $r_2^*(X)$, is
$$ r_2^*(X)=\mathop{\arg\max}_{\min(X)<r<\max(X),\ r\neq r_1^*(X)}\ \tau_{Y|((-\infty,\min(r_1^*,r)],\,(\min(r_1^*,r),\max(r_1^*,r)],\,(\max(r_1^*,r),\infty))}; $$

Step 3. Continue the steps until a predefined number, say $k$, of cutting points is reached.
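Below is a minimal sketch of this greedy search with the GK-tau as the criterion. We assume the candidate cutting points are the midpoints between consecutive distinct observed values; that choice, the function names, and the NumPy implementation are ours, not the authors' code. The `measure` argument can be swapped for any other association, as noted in the text.

```python
import numpy as np

def interval_tau(x, y, cuts):
    """GK-tau of categorical y on x discretized by the cutting points `cuts`."""
    labels = np.digitize(x, np.sort(cuts))            # interval index of each x value
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(cuts) + 1, len(classes)))
    np.add.at(joint, (labels, y_idx), 1.0)            # contingency table of intervals vs y
    p = joint / joint.sum()
    p_x, p_y = p.sum(axis=1), p.sum(axis=0)
    cond = np.sum((p ** 2).sum(axis=1)[p_x > 0] / p_x[p_x > 0])
    base = np.sum(p_y ** 2)
    return (cond - base) / (1.0 - base)

def greedy_cut_points(x, y, k, measure=interval_tau):
    """Add cutting points one at a time, each maximizing the chosen association."""
    candidates = (np.unique(x)[:-1] + np.unique(x)[1:]) / 2.0   # midpoints of distinct values
    cuts = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in cuts]
        best = max(remaining, key=lambda c: measure(x, y, cuts + [c]))
        cuts.append(best)
    return sorted(cuts)

if __name__ == "__main__":
    # Example: y depends (noisily) on whether x is below 0, between 0 and 1, or above 1.
    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    y = np.digitize(x, [0.0, 1.0]) + (rng.random(500) < 0.1)
    print(greedy_cut_points(x, y, k=2))
```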
Please note that the number of distinct values of a given variable in a given data set is always finite, even if the variable is defined as continuous. Thus the aforementioned maximizations are searches over finitely many candidate cutting points and are always attained.
Please also note that the previous searching scheme also applies to the case of the GK-lambda, with
$$ \lambda_{Y|X(B_k)}=\frac{\sum_{i=1}^{k+1}\max_{j\in\{1,2,\ldots,n_Y\}}p(b_{i-1}<X\le b_i,\,Y=j)-\max_{j\in\{1,2,\ldots,n_Y\}}p(Y=j)}{1-\max_{j\in\{1,2,\ldots,n_Y\}}p(Y=j)}, $$
where $b_0=-\infty$ and $b_{k+1}=+\infty$.
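For completeness, here is an interval-based GK-lambda measure in the same style as the sketch above; it can be passed as the `measure` argument of `greedy_cut_points`. The names are again our own.

```python
import numpy as np

def interval_lambda(x, y, cuts):
    """GK-lambda of categorical y on x discretized by the cutting points `cuts`."""
    labels = np.digitize(x, np.sort(cuts))
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(cuts) + 1, len(classes)))
    np.add.at(joint, (labels, y_idx), 1.0)
    p = joint / joint.sum()
    rho_dot_m = p.sum(axis=0).max()           # max_j p(Y=j)
    return (p.max(axis=1).sum() - rho_dot_m) / (1.0 - rho_dot_m)
```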
For a data set with a categorical response variable $Y$ and $m$ continuous explanatory variables $X_1,X_2,\ldots,X_m$, the forward supervised discretization treats the variables one at a time, each time conditioning on the variables already handled.

We assume without loss of generality that $X_1$, after the independent supervised discretization (denoted $\mathrm{id}X_1$), has the highest association with $Y$. For each remaining variable $X_i$ ($i\ge 2$), the cutting points are searched as follows:
Step 1. The first cutting point, denoted $fd_1r_1^*(X_i)$, is
$$ fd_1r_1^*(X_i)=\mathop{\arg\max}_{\min(X_i)<r<\max(X_i)}\ \tau_{Y|(\mathrm{id}X_1,\,((-\infty,r],(r,\infty)))}; $$

Step 2. The second cutting point, denoted $fd_1r_2^*(X_i)$, is
$$ fd_1r_2^*(X_i)=\mathop{\arg\max}_{\min(X_i)<r<\max(X_i),\ r\neq fd_1r_1^*(X_i)}\ \tau_{Y|(\mathrm{id}X_1,\,fd_1r_2(X_i))}, $$
where
$$ fd_1r_2(X_i)=\{(-\infty,\min(fd_1r_1^*,r)],\ (\min(fd_1r_1^*,r),\max(fd_1r_1^*,r)],\ (\max(fd_1r_1^*,r),\infty)\}; $$

Step 3. Continue the process in this fashion until a predefined number, say $k_i$, of cutting points is reached.
When all the remaining variables have been discretized by this procedure, denoted as $fd_1(X_i)$ for $i=2,\ldots,m$, the one with the highest joint association with $Y$ together with $\mathrm{id}X_1$ is selected as the second variable.
We further assume, without loss of generality, that $fd_1(X_2)$ is the selected one. Each remaining variable $X_i$ ($i\ge 3$) is then discretized against the pair $(\mathrm{id}X_1, fd_1(X_2))$ as follows (a code sketch is given after the steps):
Step 1. The first cutting point, denoted $fd_2r_1^*(X_i)$, is
$$ fd_2r_1^*(X_i)=\mathop{\arg\max}_{\min(X_i)<r<\max(X_i)}\ \tau_{Y|((\mathrm{id}X_1,fd_1(X_2)),\,((-\infty,r],(r,\infty)))}; $$

Step 2. The second cutting point, denoted $fd_2r_2^*(X_i)$, is
$$ fd_2r_2^*(X_i)=\mathop{\arg\max}_{\min(X_i)<r<\max(X_i),\ r\neq fd_2r_1^*(X_i)}\ \tau_{Y|((\mathrm{id}X_1,fd_1(X_2)),\,fd_2r_2(X_i))}, $$
where
$$ fd_2r_2(X_i)=\{(-\infty,\min(fd_2r_1^*,r)],\ (\min(fd_2r_1^*,r),\max(fd_2r_1^*,r)],\ (\max(fd_2r_1^*,r),\infty)\}. $$

Step 3. Continue the process until a predefined number, say $k_i$, of cutting points is reached.
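The sketch below illustrates one forward step under the same assumptions as the earlier sketches: the joint association is taken as the GK-tau of $Y$ on the cross-classification of the already-fixed (discretized) columns with a candidate discretization of $X_i$. All function and variable names here are our own illustration, not the authors' implementation.

```python
import numpy as np

def gk_tau_table(joint):
    """GK-tau of Y on a (possibly joint) categorical X from a contingency table."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()
    p_x, p_y = p.sum(axis=1), p.sum(axis=0)
    cond = np.sum((p ** 2).sum(axis=1)[p_x > 0] / p_x[p_x > 0])
    base = np.sum(p_y ** 2)
    return (cond - base) / (1.0 - base)

def joint_tau(fixed_labels, new_labels, y):
    """Association of y on the cross-classification of the already-discretized
    columns (fixed_labels, one integer array per column) with a candidate column."""
    cols = list(fixed_labels) + [new_labels]
    # Encode each distinct combination of levels as one joint category.
    combo = np.unique(np.stack(cols, axis=1), axis=0, return_inverse=True)[1].reshape(-1)
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((combo.max() + 1, len(classes)))
    np.add.at(joint, (combo, y_idx), 1.0)
    return gk_tau_table(joint)

def forward_cut_points(x_new, y, fixed_labels, k):
    """Greedily choose k cutting points for x_new, conditioning on fixed_labels."""
    candidates = (np.unique(x_new)[:-1] + np.unique(x_new)[1:]) / 2.0
    cuts = []
    for _ in range(k):
        remaining = [c for c in candidates if c not in cuts]
        best = max(remaining,
                   key=lambda c: joint_tau(fixed_labels,
                                           np.digitize(x_new, sorted(cuts + [c])), y))
        cuts.append(best)
    return sorted(cuts)
```

Running `forward_cut_points` for each remaining variable and keeping the one with the largest resulting `joint_tau` reproduces, in outline, the variable-by-variable loop described above.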
We acknowledge that the forward discretizing scheme in this article is a variation of the stepwise feature selection procedure [9]. When to stop the discretizing loop depends on the condition for stopping the search for the next variable. The conditions include, but are not limited to, reaching the maximum joint association or reaching a predefined maximum number of variables.
Naturally, the computational expense of the forward supervised discretization is significantly higher than that of the individual one. The difference is still acceptable, however, since our forward scheme builds on the individual one, and the individual one ultimately retains very few predictors. Given that the goal of this article is to recommend an alternative discretization procedure that increases the association between the independent variables and the target, we believe the extra cost is worthwhile.
Experiments. The purpose of the experiments in this article is to show how the forward supervised discretization method improves the variable association in the multivariate case for both the GK-tau and the GK-lambda. Not only the associations but also the joint domain sizes of the independent variables are evaluated under different circumstances, as an approximate indicator of statistical reliability for each chosen variable set (a small sketch of one way to compute such a joint domain size is given below). In general, a variable set with a smaller domain size has higher confidence power; when two variable sets have the same association, the one with the smaller domain size is preferred by most feature selection methods.
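One common convention, which we assume here purely for illustration, is to read $|Dmn(Y, X_1, \ldots)|$ as the number of distinct level combinations observed in the data for the listed (discretized) variables; the helper below computes that count.

```python
import numpy as np

def observed_domain_size(*columns):
    """Number of distinct observed level combinations of the given categorical columns."""
    stacked = np.stack([np.asarray(c).astype(str) for c in columns], axis=1)
    return len(np.unique(stacked, axis=0))

# e.g. observed_domain_size(y, id_x1, fd1_x2)
```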
The data set in our experiment is the Survey of Family Expenditures conducted by Statistics Canada in 1996 (Famex96) [23]. It has
Since the approach in this article is inspired by and based on the independently supervised discretization algorithm, we compare the two in different scenarios in both experiments. Please note that
After the independent supervised discretization by
variable | Boundary1 | Boundary2 | | | |
Inc-btax | | | | ||
Tot-expn | | | | ||
Hh-incbt | | | |||
Hh-suply | | | | ||
Age | | | |
Thus the best variable after the independent supervised discretization is Age with
The difference between the independently supervised discretization and the forward supervised discretization begins at the
X | Bndry1 | Bndry2 | | | |
fd | | | | ||
fd | | | | ||
fd | | | |||
fd | | |
Table 2 shows the best 2-variable group as (idAge, fd
Continuing with the same process, we obtain the discretization result for the third variable, denoted by
X | Bndry1 | Bndry2 | | | |
| | | | ||
| | | | ||
| | |
The best 3-variable group then is (Age, Hh-incbt, Inc-btax) with
After being individually discretized by
Bndry1 | Bndry2 | | | | |
| | | | ||
Tot-Expn | | | | ||
| | | |||
Hh-Supply | | | |
The best variable regarding
Using idInc-Btax as the leading variable in the subsequent forward supervised discretization process, we apply the second round of discretization to the other independent variables, denoted as (
Variable | Bndry1 | Bndry2 | | | |
fd | | | | ||
fd | | | |||
fd | | | |
From Table 5, we see that (Inc-Btax, fd
$$ \phi_{Y|(\mathrm{id}X_1,\,fd_1X_2)}\ \ge\ \phi_{Y|(\mathrm{id}X_1,\,\mathrm{id}X_2)} $$
with the additional advantage that
$$ |Dmn(Y,\mathrm{id}X_1,fd_1X_2)|\ <\ |Dmn(Y,\mathrm{id}X_1,\mathrm{id}X_2)|. $$
Table 6 shows the result of discretizing the third variable based on (idInc-Btax, fd
X | Bndry1 | Bndry2 | |||
| | | | ||
| | | |
The 3-variable group with the highest accuracy rate then goes to (idInc-Btax, fd
$$ \phi_{Y|(\mathrm{id}X_1,\,fd_1X_2,\,fd_2X_3)}=0.438\ \ge\ \phi_{Y|(\mathrm{id}X_1,\,\mathrm{id}X_2,\,\mathrm{id}X_3)}=0.4369 $$
while
$$ |Dmn(Y,\mathrm{id}X_1,fd_1X_2,fd_2X_3)|=112\ <\ |Dmn(Y,\mathrm{id}X_1,\mathrm{id}X_2,\mathrm{id}X_3)|=132. $$
We may continue if the number of continuous independent variables is greater than 3 and the sample size is large enough to ensure the reliability of the information (or the confidence power of the data). Up to this point, both experiments show improved performance of the forward supervised discretization over the individual one.
In this article, we propose a forward supervised discretization method for multi-dimensional data sets based on a categorical variable association, e.g., the GK-tau or the GK-lambda. The method is inspired by and builds on the individually supervised discretization proposed in [12,13]. We demonstrate the new approach's advantage with two experiments, one based on the GK-tau and the other on the GK-lambda. The experiments also use different target variables to show the approach's robustness. Admittedly, the new approach takes more computational time, but we believe the cost is acceptable given the improved performance, including the higher association.
Although both the individual and the forward schemes discretize a single variable at a time, with the latter using the information from the variables previously discretized by the same approach, it is natural to extend the idea to compartmentalizing a multi-dimensional space directly. A popular compartmentalization technology is clustering. Another interesting research direction is to work out the exact computational cost of the new approach.
[1] M. Boulle, Khiops: A statistical discretization method of continuous attributes, Machine Learning, 55 (2004), 53-69.
[2] J. Catlett, On changing continuous attributes into ordered discrete attributes, In: Machine Learning: EWSL-91, 482 (1991), 164-178.
[3] D. Chiu, B. Cheung and A. Wong, Information synthesis based on hierarchical maximum entropy discretization, Journal of Experimental and Theoretical Artificial Intelligence, 2 (1989), 117-129.
[4] M. Chmielewski and J. Grzymala-Busse, Global discretization of continuous attributes as preprocessing for machine learning, International Journal of Approximate Reasoning, 15 (1996), 319-331.
[5] J. Dougherty, R. Kohavi and M. Sahami, Supervised and unsupervised discretization of continuous features, In: Machine Learning: International Workshop, Morgan Kaufmann Publishers, 2 (1995), 194-202.
[6] U. Fayyad and K. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, Proceedings of the International Joint Conference on Artificial Intelligence, 2 (1993), 1022-1027.
[7] G. Gan, C. Ma and J. Wu, Data Clustering: Theory, Algorithms, and Applications (ASA-SIAM Series on Statistics and Applied Probability), Society for Industrial and Applied Mathematics, 20 (2007), xxii+466 pp.
[8] L. Goodman and W. Kruskal, Measures of association for cross classifications, Journal of the American Statistical Association, 49 (1954), 732-764.
[9] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, 3 (2003), 1157-1182.
[10] R. Holte, Very simple classification rules perform well on most commonly used datasets, Machine Learning, 11 (1993), 63-90.
[11] W. Huang and Y. Pan, On balancing between optimal and proportional predictions, Big Data and Information Analytics, 1 (2016), 129-137.
[12] W. Huang, Y. Pan and J. Wu, Supervised discretization with GK-τ, Procedia Computer Science, 17 (2013), 114-120.
[13] W. Huang, Y. Pan and J. Wu, Supervised discretization with GK-λ, Procedia Computer Science, 30 (2014), 75-80.
[14] W. Huang, Y. Shi and X. Wang, A nominal association matrix with feature selection for categorical data, Communications in Statistics - Theory and Methods, to appear.
[15] R. Kerber, ChiMerge: Discretization of numeric attributes, In: Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, (1992), 123-128.
[16] S. Kotsiantis and D. Kanellopoulos, Discretization techniques: A recent survey, GESTS International Transactions on Computer Science and Engineering, 32 (2006), 47-58.
[17] H. Liu and R. Setiono, Chi2: Feature selection and discretization of numeric attributes, In: Proceedings of the Seventh International Conference on Tools with Artificial Intelligence, (1995), 388-391.
[18] C. Lloyd, Statistical Analysis with Missing Data, John Wiley & Sons, Inc., New York, 1987.
[19] J. MacQueen, Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1 (1967), 281-297.
[20] D. Olson and Y. Shi, Introduction to Business Data Mining, McGraw-Hill/Irwin, 2007.
[21] I. Rish, An empirical study of the naive Bayes classifier, IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, (2001), 41-46.
[22] S. Safavian and D. Landgrebe, A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man and Cybernetics, 21 (1991), 660-674.
[23] Statistics Canada (STATCAN), Survey of Family Expenditures, 1996.
[24] K. Ting, Discretization of Continuous-Valued Attributes and Instance-Based Learning, Basser Department of Computer Science, University of Sydney, 1994.