Citation: Wenxue Huang, Yuanyi Pan. On Balancing between Optimal and Proportional categorical predictions[J]. Big Data and Information Analytics, 2016, 1(1): 129-137. doi: 10.3934/bdia.2016.1.129
A bias-variance dilemma in categorical data mining and analysis is the fact that a prediction method can aim either at maximizing the overall point-hit accuracy without constraint, or at maximizing it under the constraint of minimizing the distribution bias, but can hardly achieve both at the same time. The dilemma was identified, analyzed and illustrated by S. Geman et al. [9] in 1992. The origin of this dilemma is that a machine learning algorithm claiming to be distribution-unbiased has to pay the price of high variance: for the predicted distribution to be as close as possible to the real target's distribution, one has to accept a high point-to-point prediction error, and vice versa.
This issue has also been widely discussed from practical, technical points of view since then, sometimes under different terminology. Yaniv and Foster [23] examined an "accuracy-informativeness" trade-off in three judgment estimation studies and proposed a trade-off model with a trade-off parameter describing the penalty for a lack of informativeness. Friedman [8] describes the analogous problem in classification, distribution bias versus variance, suggesting that a lower bias tends to increase variance and that there is therefore always a "bias-variance trade-off". It has been noticed that the Monte Carlo method, which is usually considered distribution-unbiased, has a problem in point-hit accuracy compared to optimal estimation; Mark and Baram [21] therefore suggested an improvement that increases the accuracy at a loss of unbiasedness. Yu et al. [24] extended the trade-off to a bias-variance-complexity trade-off framework and proposed a complex-system modeling approach that optimizes a model selection criterion combining bias, variance and complexity. Zhou et al. [25] detailed a solution to the dilemma in a recommender system: they linearly combine two methods, one favoring diversity and the other accuracy, and suggest defining a utility function over the combinations and optimizing it by tuning the combination coefficient.
Most of the discussions above study the bias and variance of a numerical response variable. R. Tibshirani discussed their categorical counterparts in [22]. A generalized bias-variance formulation is discussed in [5] and [16].
To illustrate this dilemma, we only consider the purely categorical data situation in this article: both the explanatory and the response variables are of (nominal) categorical type. We also consider a data set consisting of only two categorical variables, an explanatory variable $X$ and a response variable $Y$.
However, it should be noted that a data set in the practice of big data mining is usually high dimensional with mixed data types. It can nevertheless be viewed as two categorical variables after a few proper preprocessing steps. A numerical target variable, for example, can be discretized into a categorical one [15].
To estimate the unknown values of $Y$ from the observed values of $X$, one can compare the optimal (conditional-mode) prediction with the proportional prediction. It can be shown that
$$\sum_{i=1}^{n}\max_{j\in\{1,2,\ldots,k\}} p(X=x_i,\,Y=y_j)\;-\;\sum_{i=1}^{n}\sum_{j=1}^{k} p(X=x_i,\,Y=y_j)\,p(Y=y_j\mid X=x_i)\;\ge\;0,$$
where the equality holds if and only if, for each $x_i$, the conditional distribution $p(Y\mid X=x_i)$ is uniform over its support (in particular, when it is concentrated on a single category); that is, the expected point-hit accuracy of the optimal prediction is never below that of the proportional prediction.
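To see this inequality term by term, divide the $i$-th contribution by $p(x_i)$: it then reads $\max_j p(y_j\mid x_i)\ge\sum_j p(y_j\mid x_i)^2$. For a hypothetical conditional distribution $(0.6,\,0.3,\,0.1)$ at some $x_i$, for instance,
$$\max_j p(y_j\mid x_i)=0.6\ \ge\ 0.6^2+0.3^2+0.1^2=0.46.$$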
A very simple example of this dilemma can be described as follows. A table with 90 rows and 2 columns, A and B, has 10 unknown values in B to be estimated (or predicted), shown in Table 1. Please note that
A | B | Tot. |
30 | ||
10 | ||
15 | ||
25 | ||
16 | ||
16 |
For simplicity, we assume that the unknown part has exactly the same conditional distribution as the known part, i.e., the proportion of each category of B within each category of A is the same in the known and unknown parts.
In general, for a given conditional distribution $p(Y\mid X)$, the expected point-hit accuracy rate of the prediction by conditional modes (the optimal prediction) is
$$\sum_{i=1}^{n} p(x_i)\,p(y_M\mid x_i)=\sum_{i=1}^{n} p(x_i,\,y_M), \tag{1}$$
where, for each $x_i$, $y_M$ denotes the conditional mode of $Y$ given $X=x_i$. This accuracy rate is equivalent, up to a monotone transformation, to the Goodman-Kruskal $\lambda$ [10]:
$$\lambda_{Y|X}=\frac{\sum_{i=1}^{n}\rho_{im}-\rho_{\cdot m}}{1-\rho_{\cdot m}},$$
where
$$\rho_{im}=\max_{j\in\{1,2,\ldots,k\}}\{p(X=x_i,\,Y=y_j)\}$$
and
$$\rho_{\cdot m}=\max_{1\le j\le k}\{p(Y=y_j)\}.$$
On the other hand, the least distribution-biased prediction, or the prediction with the maximum expectation (or the proportional prediction [10, Section 9]), is to predict $Y$ for each $X=x_i$ by randomly drawing a category according to the conditional distribution $p(Y\mid X=x_i)$. Its expected point-hit accuracy rate is
$$\omega_{Y|X}:=\sum_{i=1}^{n}\sum_{j=1}^{k} p(Y=y_j\mid X=x_i)\,p(X=x_i,\,Y=y_j). \tag{2}$$
The accuracy rate is linked to the Goodman-Kruskal tau [10, Section 9] (or the GK-tau, denoted by $\tau_{Y|X}$) as follows:
$$\omega_{Y|X}=\Bigl(1-\sum_{j} p(Y=y_j)^2\Bigr)\,\tau_{Y|X}+\sum_{j} p(Y=y_j)^2,$$
where
$$\tau_{Y|X}=\frac{\sum_{i=1}^{n}\sum_{j=1}^{k} p(Y=y_j\mid X=x_i)\,p(X=x_i,\,Y=y_j)-\sum_{j=1}^{k} p(Y=y_j)^2}{1-\sum_{j=1}^{k} p(Y=y_j)^2}.$$
It can be proven ([14]) that $0\le\tau_{Y|X}\le 1$, and hence $\sum_{j} p(Y=y_j)^2\le\omega_{Y|X}\le 1$.
More details and discussions about $\lambda_{Y|X}$, $\tau_{Y|X}$ and $\omega_{Y|X}$ can be found in [10] and [14].
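As an illustration only, the association measures above can be computed from a cross-tabulation of counts as in the following Python sketch (the toy table and the variable names are ours, not from the paper):

```python
import numpy as np

# Toy joint count table: rows are categories of X, columns are categories of Y.
counts = np.array([[30, 10,  5],
                   [ 8, 25,  7],
                   [ 4,  6, 15]], dtype=float)

p_xy = counts / counts.sum()            # joint distribution p(X = x_i, Y = y_j)
p_x = p_xy.sum(axis=1)                  # marginal p(X = x_i)
p_y = p_xy.sum(axis=0)                  # marginal p(Y = y_j)
p_y_given_x = p_xy / p_x[:, None]       # conditional p(Y = y_j | X = x_i)

# Expected accuracy of the optimal (conditional-mode) prediction: sum_i max_j p(x_i, y_j)
acc_optimal = p_xy.max(axis=1).sum()

# Expected accuracy of the proportional prediction, omega (equation (2))
omega = (p_y_given_x * p_xy).sum()

# Goodman-Kruskal lambda and tau
rho_dot_m = p_y.max()
gk_lambda = (acc_optimal - rho_dot_m) / (1.0 - rho_dot_m)
gk_tau = (omega - (p_y ** 2).sum()) / (1.0 - (p_y ** 2).sum())

print(f"optimal accuracy = {acc_optimal:.3f}, omega = {omega:.3f}")
print(f"GK lambda = {gk_lambda:.3f}, GK tau = {gk_tau:.3f}")
```

Running the sketch on any joint table also verifies numerically that the optimal accuracy is never below $\omega_{Y|X}$.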
Thus, if all variables are categorical and the samples are representative (and the sample size is large enough), either the highest observable (realistic) point-hit accuracy rate with the lowest distribution bias, or the highest point-hit accuracy with no regard for the response distribution bias, can be achieved via an appropriate feature selection based on the corresponding association measures discussed above. In general, however, the two optimizations cannot be realized at the same time, and hence one may want to achieve a certain balance between them. This is exactly what we propose in this article: a scheme to balance the goal of maximizing the prediction accuracy against that of minimizing the prediction bias. Experiments with the real data set Famex96 are conducted to demonstrate the characteristics of this scheme, and its basic mathematical properties are also discussed. Please note that these experiments are designed to estimate the unknown values of a response variable in a table. To focus on this subject, we ignore all other important issues in high dimensional, mixed-type data prediction such as discretization, feature selection and model selection.
The definition of this scheme is described in Section 2 along with the prediction strategy. The experiments are discussed in Section 3, where the relationship between the parameter introduced in this scheme and the prediction performance is also studied. The last section contains the concluding remarks and future work.
Our discussion is about a framework that balances the expected point-hit accuracy under distribution faithfulness against the likely maximum point-hit accuracy. The variable with unknown values is considered the response (or dependent) variable, while the others are the explanatory (or independent) variables. The data set is divided into two parts according to the values of the response variable: all rows with known values of the response variable go to the learning part, and the others go to the prediction part. The response variable in the prediction part is then predicted using the values of its explanatory variables and the information learned from the learning part. Please note that all variables in both parts are treated as categorical.
Assume that the response variable $Y$ takes values in $\{y_1, y_2, \ldots, y_k\}$ and the explanatory variable $X$ takes values in $\{x_1, x_2, \ldots, x_n\}$. We define a threshold
$$\theta=\alpha\,\rho_m+(1-\alpha)\,\rho_M, \tag{3}$$
where
$$\rho_m=0.5\times\min_{1\le i\le n}\,\max_{1\le j\le k}\, p(Y=y_j\mid X=x_i),$$
$$\rho_M=\max_{1\le i\le n}\,\max_{1\le j\le k}\, p(Y=y_j\mid X=x_i)$$
and $\alpha\in[0,1]$ is the balancing rate.
Apparently, $\theta$ is a point between half of the minimum maximum conditional probability and the maximum maximum conditional probability. The prediction method for a given $x_i$ is then determined by comparing $\max_{1\le j\le k} p(Y=y_j\mid X=x_i)$ with the threshold $\theta$.
Our prediction to increase the point-hit accuracy is to predict the unknowns by the conditional mode whenever this maximum conditional probability exceeds $\theta$. Otherwise, Monte Carlo simulation is used to lower the bias: a category of $Y$ is randomly picked according to the conditional distribution $p(Y\mid X=x_i)$.
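The following is a minimal Python sketch of this balancing scheme (the function and variable names are ours, and we assume the conditional distribution $p(Y\mid X)$ has been estimated from the learning part; the rule "conditional mode if the maximum conditional probability exceeds $\theta$, proportional draw otherwise" is our reading of the scheme, under which $\alpha=1$ recovers the purely modal prediction and $\alpha=0$ the purely proportional one):

```python
import numpy as np

rng = np.random.default_rng(0)

def balanced_predict(p_y_given_x, x_values, alpha):
    """Predict Y for each row in x_values with balancing rate alpha in [0, 1].

    p_y_given_x: (n, k) array, row i is the estimated p(Y = y_j | X = x_i).
    x_values:    1-D array of X category indices (0..n-1) in the prediction part.
    """
    max_cond = p_y_given_x.max(axis=1)             # max_j p(y_j | x_i) for each x_i
    rho_m = 0.5 * max_cond.min()                   # half of the minimum maximum
    rho_M = max_cond.max()                         # the maximum maximum
    theta = alpha * rho_m + (1.0 - alpha) * rho_M  # threshold, equation (3)

    k = p_y_given_x.shape[1]
    preds = np.empty(len(x_values), dtype=int)
    for r, xi in enumerate(x_values):
        if max_cond[xi] > theta:
            # confident enough: predict by the conditional mode (optimal prediction)
            preds[r] = p_y_given_x[xi].argmax()
        else:
            # otherwise: Monte Carlo draw from p(Y | X = x_i) (proportional prediction)
            preds[r] = rng.choice(k, p=p_y_given_x[xi])
    return preds
```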
The data set that we use in this experiment is the Survey of Family Expenditures conducted by Statistics Canada in 1996 (Famex96) [6]. This data set has
It should be mentioned that there are various types of unknown (or missing) values. Three types were introduced in [3,18]: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR). [1] classifies missing values into four types: missing by definition of the subpopulation, missing completely at random (MCAR), missing at random (MAR), and nonignorable (NI) missing values. Each type usually requires a different processing method. In our experiments, the missing values are generated completely at random for the sake of simplicity.
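As a small illustration of this setup, MCAR missingness at a given rate can be injected into the response column as in the following sketch (we assume a pandas DataFrame `df` with a categorical response column; the names are ours):

```python
import numpy as np
import pandas as pd

def make_mcar(df, column, missing_rate, seed=0):
    """Return a copy of df in which roughly `missing_rate` of the values in
    `column` are removed completely at random (MCAR)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out[column] = out[column].astype("object")   # allow missing markers in a categorical column
    n_missing = int(round(missing_rate * len(out)))
    rows = rng.choice(len(out), size=n_missing, replace=False)
    out.iloc[rows, out.columns.get_loc(column)] = None
    return out

# Usage sketch: df_mcar = make_mcar(df, "Y", 0.05)
# learning part:   df_mcar[df_mcar["Y"].notna()]
# prediction part: df_mcar[df_mcar["Y"].isna()]
```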
The first experiment uses type of dwelling (
X | Y | Count | p(Y|X) | X | Y | Count | p(Y|X) | X | Y | Count | p(Y|X)
1 | 1 | 755 | 0.13 | 3 | 1 | 98 | 0.20 | 5 | 1 | 1209 | 0.51
1 | 2 | 1543 | 0.26 | 3 | 2 | 84 | 0.175 | 5 | 2 | 430 | 0.18 |
1 | 3 | 2552 | 0.432 | 3 | 3 | 152 | 0.32 | 5 | 3 | 229 | 0.1 |
1 | 4 | 401 | 0.07 | 3 | 4 | 20 | 0.04 | 5 | 4 | 36 | 0.02 |
1 | 5 | 328 | 0.06 | 3 | 5 | 83 | 0.17 | 5 | 5 | 251 | 0.11 |
1 | 6 | 203 | 0.03 | 3 | 6 | 22 | 0.05 | 5 | 6 | 101 | 0.04 |
1 | 7 | 130 | 0.02 | 3 | 7 | 22 | 0.05 | 5 | 7 | 125 | 0.05 |
2 | 1 | 56 | 0.17 | 4 | 1 | 112 | 0.23 | 6 | 1 | 89 | 0.29 |
2 | 2 | 69 | 0.21 | 4 | 2 | 104 | 0.21 | 6 | 2 | 73 | 0.23 |
2 | 3 | 118 | 0.37 | 4 | 3 | 143 | 0.29 | 6 | 3 | 77 | 0.25 |
2 | 4 | 17 | 0.05 | 4 | 4 | 21 | 0.04 | 6 | 4 | 7 | 0.02 |
2 | 5 | 32 | 0.1 | 4 | 5 | 69 | 0.14 | 6 | 5 | 36 | 0.12 |
2 | 6 | 16 | 0.05 | 4 | 6 | 18 | 0.04 | 6 | 6 | 16 | 0.05 |
2 | 7 | 14 | 0.04 | 4 | 7 | 24 | 0.05 | 6 | 7 | 14 | 0.045 |
Observe that the bold entries in this table represent the maximal conditional probabilities, which will be the result of a prediction by (conditional) mode(s). Table 3 and Table 4 are the prediction results for two different values of the balancing rate $\alpha$: the predictions in Table 3 are spread across all seven categories (proportional-style prediction), while those in Table 4 are concentrated on the conditional modes, categories 1 and 3 (purely modal prediction).
Actual \ Predicted | 1 | 2 | 3 | 4 | 5 | 6 | 7 | SUM
1 | 37 | 29 | 29 | 4 | 9 | 7 | 5 | 120 |
2 | 33 | 28 | 44 | 6 | 11 | 3 | 6 | 131 |
3 | 24 | 40 | 81 | 6 | 12 | 6 | 4 | 173 |
4 | 3 | 8 | 11 | 4 | 1 | 0 | 1 | 28 |
5 | 10 | 6 | 8 | 3 | 4 | 2 | 3 | 36 |
6 | 6 | 3 | 6 | 3 | 0 | 0 | 0 | 18 |
7 | 5 | 3 | 3 | 0 | 1 | 0 | 0 | 12 |
SUM | 118 | 117 | 182 | 26 | 38 | 18 | 19 | 518 |
Actual \ Predicted | 1 | 3 | SUM
1 | 62 | 58 | 120 |
2 | 34 | 97 | 131 |
3 | 22 | 151 | 173 |
4 | 28 | 28 | |
5 | 16 | 20 | 36 |
6 | 5 | 13 | 18 |
7 | 5 | 7 | 12 |
SUM | 144 | 374 | 518 |
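For reference, a conditional-probability table of the kind shown in Table 2, together with the conditional modes used by the purely modal prediction, can be computed from the learning part with a few lines of pandas (a sketch with our own column names, not the Famex96 variable codes):

```python
import pandas as pd

def conditional_table(learn: pd.DataFrame):
    """learn: the learning part, with categorical columns "X" (explanatory) and "Y" (response)."""
    counts = pd.crosstab(learn["X"], learn["Y"])        # cell counts, as in Table 2
    cond = counts.div(counts.sum(axis=1), axis=0)       # p(Y = y_j | X = x_i), row-normalized
    modes = cond.idxmax(axis=1)                         # conditional mode of Y for each x_i
    return counts, cond, modes
```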
The simple match rate is used to measure the point-hit accuracy, which gives us an accuracy rate of
$$d(\hat{Y}\mid Y)=\sum_{j=1}^{k} p(y_j)\,\bigl|p(y_j)-p(\hat{y}_j)\bigr| \tag{4}$$
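Equation (4) can be read as a $p(y_j)$-weighted $L_1$ distance between the actual and the predicted marginal distributions of $Y$. A minimal sketch of the two evaluation measures, assuming `y_true` and `y_pred` are arrays of the actual and predicted categories on the prediction part (the names are ours):

```python
import numpy as np

def simple_match_rate(y_true, y_pred):
    """Point-hit accuracy: the fraction of exact matches."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def distribution_bias(y_true, y_pred):
    """Equation (4): p(y_j)-weighted L1 distance between the actual and
    predicted marginal distributions of Y."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cats = np.unique(np.concatenate([y_true, y_pred]))
    p_true = np.array([(y_true == c).mean() for c in cats])
    p_pred = np.array([(y_pred == c).mean() for c in cats])
    return float(np.sum(p_true * np.abs(p_true - p_pred)))
```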
As in the
When
Finally, Figure 3 shows that the effect of the missing rate on the prediction performance is negligible.
In conclusion, sacrifices in maximizing the point-hit accuracy have to be made in order to achieve the least bias in prediction, and vice versa. To address this trade-off, we define a balancing scheme so that the prediction accuracy can be reduced to a certain level in order to tune down the prediction distribution bias. We introduce a balancing rate, a parameter $\alpha$, to control this trade-off.
[1] | A. C. Acock, Working with missing values, Journal of Marriage and Family, 67 (2005), 1012-1028.
[2] | E. Acuna and C. Rodriguez, The treatment of missing values and its effect in the classifier accuracy, in Classification, Clustering and Data Mining Applications, (2004), 639-647.
[3] | G. E. Batista and M. C. Monard, An analysis of four missing data treatment methods for supervised learning, Applied Artificial Intelligence, 17 (2003), 519-533.
[4] | J. Doak, An Evaluation of Feature Selection Methods and Their Application to Computer Security, UC Davis Department of Computer Science, 1992.
[5] | P. Domingos, A unified bias-variance decomposition, in Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, Stanford, CA, 2000, 231-238.
[6] | Survey of Family Expenditures - 1996, STATCAN, 1998.
[7] | A. Farhangfar, L. Kurgan and J. Dy, Impact of imputation of missing values on classification error for discrete data, Pattern Recognition, 41 (2008), 3692-3705.
[8] | J. H. Friedman, On bias, variance, 0/1-loss, and the curse-of-dimensionality, Data Mining and Knowledge Discovery, 1 (1997), 55-77.
[9] | S. Geman, E. Bienenstock and R. Doursat, Neural networks and the bias/variance dilemma, Neural Computation, 4 (1992), 1-58.
[10] | L. A. Goodman and W. H. Kruskal, Measures of association for cross classifications, J. American Statistical Association, 49 (1954), 732-764.
[11] | I. Guyon and A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res., 3 (2003), 1157-1182.
[12] | L. Himmelspach and S. Conrad, Clustering approaches for data with missing values: Comparison and evaluation, in Digital Information Management (ICDIM), 2010 Fifth International Conference on, IEEE, 2010, 19-28.
[13] | P. T. V. Hippel, Regression with missing Ys: An improved strategy for analyzing multiply imputed data, Sociological Methodology, 37 (2007), 83-117.
[14] | W. Huang, Y. Shi and X. Wang, A nominal association matrix with feature selection for categorical data, Communications in Statistics - Theory and Methods, to appear, 2015.
[15] | W. Huang, Y. Pan and J. Wu, Supervised discretization for optimal prediction, Procedia Computer Science, 30 (2014), 75-80.
[16] | G. James and T. Hastie, Generalizations of the Bias/Variance Decomposition for Prediction Error, Dept. Statistics, Stanford Univ., Stanford, CA, Tech. Rep., 1997.
[17] | S. Kullback and R. A. Leibler, On information and sufficiency, Annals of Mathematical Statistics, 22 (1951), 79-86.
[18] | R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, Inc., New York, NY, USA, 1987.
[19] | H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers, Norwell, MA, USA, 1998.
[20] | J. Luengo, S. García and F. Herrera, On the choice of the best imputation methods for missing values considering three groups of classification methods, Knowledge and Information Systems, 32 (2012), 77-108.
[21] | Z. Mark and Y. Baram, The bias-variance dilemma of the Monte Carlo method, Artificial Neural Networks - ICANN, 2130 (2001), 141-147.
[22] | R. Tibshirani, Bias, Variance and Prediction Error for Classification Rules, Citeseer, 1996.
[23] | I. Yaniv and D. P. Foster, Graininess of judgment under uncertainty: An accuracy-informativeness trade-off, Journal of Experimental Psychology: General, 124 (1995), 424-432.
[24] | L. Yu, K. K. Lai, S. Wang and W. Huang, A bias-variance-complexity trade-off framework for complex system modeling, in Computational Science and Its Applications - ICCSA 2006, Springer, 3980 (2006), 518-527.
[25] | T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling and Y. Zhang, Solving the apparent diversity-accuracy dilemma of recommender systems, Proceedings of the National Academy of Sciences, 107 (2010), 4511-4515.