[1] Wenxue Huang, Xiaofeng Li, Yuanyi Pan. Increase statistical reliability without losing predictive power by merging classes and adding variables. Big Data and Information Analytics, 2016, 1(4): 341-348. doi: 10.3934/bdia.2016014
[2] Jian-Bing Zhang, Yi-Xin Sun, De-Chuan Zhan. Multiple-instance learning for text categorization based on semantic representation. Big Data and Information Analytics, 2017, 2(1): 69-75. doi: 10.3934/bdia.2017009
[3] Dongyang Yang, Wei Xu. Statistical modeling on human microbiome sequencing data. Big Data and Information Analytics, 2019, 4(1): 1-12. doi: 10.3934/bdia.2019001
[4] Wenxue Huang, Qitian Qiu. Forward supervised discretization for multivariate with categorical responses. Big Data and Information Analytics, 2016, 1(2): 217-225. doi: 10.3934/bdia.2016005
[5] Jinyuan Zhang, Aimin Zhou, Guixu Zhang, Hu Zhang. A clustering based mate selection for evolutionary optimization. Big Data and Information Analytics, 2017, 2(1): 77-85. doi: 10.3934/bdia.2017010
[6] Ricky Fok, Agnieszka Lasek, Jiye Li, Aijun An. Modeling daily guest count prediction. Big Data and Information Analytics, 2016, 1(4): 299-308. doi: 10.3934/bdia.2016012
[7] Minlong Lin, Ke Tang. Selective further learning of hybrid ensemble for class imbalanced increment learning. Big Data and Information Analytics, 2017, 2(1): 1-21. doi: 10.3934/bdia.2017005
[8] David E. Bernholdt, Mark R. Cianciosa, David L. Green, Kody J.H. Law, Alexander Litvinenko, Jin M. Park. Comparing theory based and higher-order reduced models for fusion simulation data. Big Data and Information Analytics, 2018, 3(2): 41-53. doi: 10.3934/bdia.2018006
[9] Cai-Tong Yue, Jing Liang, Bo-Fei Lang, Bo-Yang Qu. Two-hidden-layer extreme learning machine based wrist vein recognition system. Big Data and Information Analytics, 2017, 2(1): 59-68. doi: 10.3934/bdia.2017008
[10] Yaru Cheng, Yuanjie Zheng. Frequency filtering prompt tuning for medical image semantic segmentation with missing modalities. Big Data and Information Analytics, 2024, 8(0): 109-128. doi: 10.3934/bdia.2024006
In high-dimensional, large-sample categorical data analysis, feature selection or dimension reduction is usually involved. Existing feature selection procedures either operate on the original variables or rely on linear models (or generalized linear models), and linear models are constrained by distributional assumptions on the data. Some categories of the original categorical explanatory variables may be insufficiently informative, redundant, or even irrelevant to the response variable. Moreover, a regular feature selection procedure may jeopardize statistical reliability if it picks up variables with large domains. We therefore propose a category-based probabilistic approach to feature selection.
One can refer to [10,8] for more introductions to various data types and algorithms in feature selection. Reliability was measured by variance in [9,3] and by the proportion of categories in [5]. The reliability measure used here was proposed in [6] and is denoted as $E(\mathrm{Gini})$; it is defined in Section 2.
As in [6], we propose a category-based feature selection method in this article to improve the statistical reliability and to increase the overall point-hit accuracy by merging or removing the less informative or redundant categories of the categorical explanatory variables. Unlike [6], we first transform each original categorical explanatory variable into multiple dummy variables, then select the more informative ones by a stepwise forward feature selection approach, and finally merge the unselected categories. The merging process in [6], by contrast, finds and merges the less informative categories within pre-selected original explanatory variables. Our approach can therefore compare categories not only within one explanatory variable but also across different explanatory variables. Introductions to and applications of dummy variables can be found in [1,2].
The rest of this article is organized as follows. Section 2 introduces the association measures and the reliability measure; Section 3 presents the dummy variable approach, proves two propositions, and describes the detailed feature selection steps; two experiments are reported in Section 4; the results are briefly summarized in the last section.
Assume we are given a data set with one categorical explanatory variable $X$, with domain $\{1, 2, \cdots, n_X\}$, and one categorical response variable $Y$, with domain $\{1, 2, \cdots, n_Y\}$.
The Goodman–Kruskal lambda (GK-lambda, denoted as $\lambda$) for predicting $Y$ from $X$ is defined as
$$\lambda = \frac{\sum_x \rho_{xm} - \rho_{\cdot m}}{1 - \rho_{\cdot m}},$$
where
$$\rho_{\cdot m} = \max_y \rho_{\cdot y} = \max_y p(Y=y), \qquad \rho_{xm} = \max_y \rho_{xy} = \max_y p(X=x;\, Y=y).$$
Please note that $0 \le \lambda \le 1$. The Goodman–Kruskal tau (GK-tau, denoted as $\tau$) for predicting $Y$ from $X$ is defined as
$$\tau = \frac{\sum_x \sum_y \rho_{xy}^2 / \rho_{x\cdot} - \sum_y \rho_{\cdot y}^2}{1 - \sum_y \rho_{\cdot y}^2},$$
where
$$\rho_{x\cdot} = p(X=x).$$
Both $\lambda$ and $\tau$ take values in $[0, 1]$ and measure the proportional reduction in the error of predicting $Y$ given $X$: the larger the value, the stronger the association.
Given a categorical data set with two variables $X$ and $Y$, the expected conditional Gini concentration of $X$ given $Y$ is defined as
$$E(\mathrm{Gini}(X \mid Y)) = 1 - \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} p(X=i \mid Y=j)^2\, p(Y=j).$$
Notice that
$$0 \le E(\mathrm{Gini}(X \mid Y)) \le 1 - \frac{1}{|\mathrm{Domain}(X)|} \le 1 - \frac{1}{|\mathrm{Domain}(X, Y)|};$$
and the smaller $E(\mathrm{Gini}(X \mid Y))$ is, the higher the statistical reliability of the inference about $X$ given $Y$.
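The following is a minimal sketch of how the three measures above can be computed from two categorical columns; the function names and the pandas-based implementation are illustrative choices rather than the procedure used in the experiments.

```python
# Illustrative helpers (not the code used in the experiments) for the three
# measures above, computed from two categorical pandas Series.
import pandas as pd

def gk_lambda(x: pd.Series, y: pd.Series) -> float:
    """Goodman-Kruskal lambda for predicting y from x."""
    joint = pd.crosstab(x, y, normalize=True)       # rho_{xy} = p(X=x, Y=y)
    rho_dot_m = joint.sum(axis=0).max()             # max_y p(Y=y)
    rho_xm_sum = joint.max(axis=1).sum()            # sum_x max_y p(X=x, Y=y)
    return (rho_xm_sum - rho_dot_m) / (1.0 - rho_dot_m)

def gk_tau(x: pd.Series, y: pd.Series) -> float:
    """Goodman-Kruskal tau for predicting y from x."""
    joint = pd.crosstab(x, y, normalize=True)
    rho_x = joint.sum(axis=1)                       # p(X=x)
    omega = joint.pow(2).div(rho_x, axis=0).to_numpy().sum()
    base = (joint.sum(axis=0) ** 2).sum()           # sum_y p(Y=y)^2
    return (omega - base) / (1.0 - base)

def expected_gini(x: pd.Series, y: pd.Series) -> float:
    """E(Gini(X|Y)); smaller values mean higher statistical reliability."""
    joint = pd.crosstab(x, y, normalize=True)
    p_y = joint.sum(axis=0)                         # p(Y=j)
    cond = joint.div(p_y, axis=1)                   # p(X=i | Y=j)
    return 1.0 - float((cond.pow(2) * p_y).to_numpy().sum())
```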
We transform each original categorical explanatory variable $X$, with domain $\{1, 2, \cdots, n_X\}$, into $n_X$ binary dummy variables $X_1, X_2, \cdots, X_{n_X}$, where $X_i = 1$ if $X = i$ and $X_i = 0$ otherwise.
Proposition 3.1. The dummy variable transformation preserves the association with $Y$, that is,
$$\tau(Y \mid X_1, X_2, \cdots, X_{n_X}) = \tau(Y \mid X).$$
Proof. Note that $\tau(Y \mid X)$ is a strictly increasing function of $\omega_{Y \mid X}$, where
$$\omega_{Y \mid X} = \sum_{i,s} p(Y=s \mid X=i)^2\, p(X=i).$$
Thus we only need to prove that
$$\omega_{Y \mid X_1, X_2, \cdots, X_{n_X}} = \omega_{Y \mid X}.$$
Since
$$\omega_{Y \mid X_1, X_2, \cdots, X_{n_X}} = \sum_{j=1}^{n_Y} \sum_{i_1=0}^{1} \sum_{i_2=0}^{1} \cdots \sum_{i_{n_X}=0}^{1} \frac{p(X_1=i_1, X_2=i_2, \cdots, X_{n_X}=i_{n_X}, Y=j)^2}{p(X_1=i_1, X_2=i_2, \cdots, X_{n_X}=i_{n_X})}$$
and
$X_i = 1$ when and only when $X_s = 0$ for all $s \ne i$, $s = 1, 2, \cdots, n_X$,
we have
$$p(X_1=0, X_2=0, \cdots, X_i=1, \cdots, X_{n_X}=0, Y=j) = p(X=i, Y=j), \quad j = 1, 2, \cdots, n_Y.$$
So
$$\omega_{Y \mid X_1, X_2, \cdots, X_{n_X}} = \sum_{j=1}^{n_Y} \sum_{i=1}^{n_X} \frac{p(X_1=0, X_2=0, \cdots, X_i=1, \cdots, X_{n_X}=0, Y=j)^2}{p(X_1=0, X_2=0, \cdots, X_i=1, \cdots, X_{n_X}=0)} = \sum_{j=1}^{n_Y} \sum_{i=1}^{n_X} \frac{p(X=i, Y=j)^2}{p(X=i)} = \omega_{Y \mid X},$$
that is,
$$\omega_{Y \mid X_1, \cdots, X_{n_X}} = \omega_{Y \mid X} \;\Longleftrightarrow\; \tau(Y \mid X_1, \cdots, X_{n_X}) = \tau(Y \mid X).$$
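Proposition 3.1 can also be checked numerically, as in the sketch below (synthetic data, with the illustrative `gk_tau` helper from the earlier sketch): the value of $\tau$ computed from the joint state of the dummy variables coincides with the value computed from $X$ itself.

```python
# Numerical sanity check of Proposition 3.1 on synthetic data (illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.integers(1, 5, size=2000), name="X")            # 4 categories
y = pd.Series(rng.integers(1, 4, size=2000) + (x % 2), name="Y")

dummies = pd.get_dummies(x, prefix="X")                            # X_1, ..., X_4
joint_dummy_state = dummies.astype(str).agg("".join, axis=1)       # one composite key

print(gk_tau(x, y))                   # gk_tau from the earlier sketch
print(gk_tau(joint_dummy_state, y))   # identical, as Proposition 3.1 states
```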
It is of interest to introduce the following notion.
Definition 3.1. For a given categorical response variable
The next proposition says that merging two categories of $X$ can never increase the association between $X$ and $Y$.
Proposition 3.2. If two categories, $s$ and $t$, of $X$ are merged into one new category $m$, then
$$\tau(Y \mid X') \le \tau(Y \mid X),$$
where $X'$ denotes the variable obtained from $X$ after merging $s$ and $t$.
Proof. Notice that this proposition is equivalent to the following inequality.
$$\omega_{Y \mid X'} \le \omega_{Y \mid X}.$$
Let
$$\frac{p(X=s;\, Y=j)}{p(X=s)} = b_s, \qquad \frac{p(X=t;\, Y=j)}{p(X=t)} = b_t, \quad \text{for } j = 1, 2, \cdots, n_Y.$$
We have
$$p(X=s;\, Y=j) = p(X=s)\, b_s, \qquad p(X=t;\, Y=j) = p(X=t)\, b_t, \quad \text{and} \quad \omega_{Y \mid X'} = \sum_{j=1}^{n_Y} \sum_{i \ne s,t} \frac{p(X=i;\, Y=j)^2}{p(X=i)} + \sum_{j=1}^{n_Y} \frac{p(X'=m;\, Y=j)^2}{p(X'=m)},$$
where $m$ is the new category formed by merging $s$ and $t$. Note that
$$\sum_{j=1}^{n_Y} \frac{p(X'=m;\, Y=j)^2}{p(X'=m)} = \sum_{j=1}^{n_Y} \frac{\big(p(X=s;\, Y=j) + p(X=t;\, Y=j)\big)^2}{p(X=s) + p(X=t)} = \sum_{j=1}^{n_Y} \frac{\big(p(X=s)\, b_s + p(X=t)\, b_t\big)^2}{p(X=s) + p(X=t)}, \tag{1}$$
and
$$\sum_{j=1}^{n_Y} \left( \frac{p(X=s;\, Y=j)^2}{p(X=s)} + \frac{p(X=t;\, Y=j)^2}{p(X=t)} \right) = \sum_{j=1}^{n_Y} \left( b_s^2\, p(X=s) + b_t^2\, p(X=t) \right). \tag{2}$$
Multiplying both (1) and (2) by $p(X=s) + p(X=t)$ gives
$$(1) \times \big(p(X=s) + p(X=t)\big) = \sum_{j=1}^{n_Y} \left( b_s^2\, p(X=s)^2 + b_t^2\, p(X=t)^2 + 2 b_s b_t\, p(X=s)\, p(X=t) \right),$$
$$(2) \times \big(p(X=s) + p(X=t)\big) = \sum_{j=1}^{n_Y} \left( b_s^2\, p(X=s)^2 + b_t^2\, p(X=t)^2 + (b_s^2 + b_t^2)\, p(X=s)\, p(X=t) \right).$$
Since $2 b_s b_t \le b_s^2 + b_t^2$ for every $j$, we obtain $(1) \le (2)$; therefore
$$\omega_{Y \mid X'} \le \omega_{Y \mid X}, \quad \text{equivalently} \quad \tau(Y \mid X') \le \tau(Y \mid X);$$
and the equality holds if and only if $b_s = b_t$ for every $j$, that is, if and only if the two merged categories have identical conditional distributions of $Y$.
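Proposition 3.2 can be illustrated in the same spirit (again a sketch on synthetic data, reusing the illustrative `gk_tau` helper): after merging two categories, $\tau$ never exceeds its value before the merge.

```python
# Numerical sanity check of Proposition 3.2 on synthetic data (illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = pd.Series(rng.integers(1, 7, size=5000), name="X")              # 6 categories
y = pd.Series((x + rng.integers(0, 3, size=5000)) % 3, name="Y")

tau_before = gk_tau(x, y)
x_merged = x.replace({5: 56, 6: 56})        # merge categories 5 and 6 into one
tau_after = gk_tau(x_merged, y)

# Equality would require identical conditional distributions of Y (b_s = b_t).
assert tau_after <= tau_before + 1e-12
print(tau_before, tau_after)
```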
In actual high-dimensional data analysis projects, there are usually categories in some explanatory variables that can be merged so that the decrease in the association degree is negligible while the merge significantly raises the selected features' statistical reliability. This is especially the case when the data set is high dimensional and many explanatory variables have many categories. Two experiments are conducted in the next section to support this claim by showing that merging categories can significantly improve the statistical reliability while hardly reducing the association degree.
A feature selection procedure usually follows a stepwise forward variable selection scheme, in which explanatory variables are selected one by one until a pre-assigned threshold is reached. A reasonable stopping threshold is set by an acceptable association degree and statistical reliability. Specifically, for a given set of explanatory variables $\mathbf{X} = \{X_1, X_2, \cdots, X_n\}$, a response variable $Y$, and a set $D_0$ of already selected variables (initially empty), each iteration proceeds as follows:
1. identify the subset of candidate variables, denoted as $D_1$, whose members yield the highest association when added to $D_0$:
$$D_1 = \left\{ X_h \in \mathbf{X} \;\middle|\; \tau(Y \mid \{X_h\} \cup D_0) = \max_{X_i \in \mathbf{X} \setminus D_0} \tau(Y \mid \{X_i\} \cup D_0) \right\},$$
where $D_0$ is the set of variables selected in the previous iterations ($D_0 = \emptyset$ at the first iteration);
2. select the one in $D_1$ with the best reliability, i.e., the smallest expected conditional Gini concentration:
$$X_{i_1} = \left\{ X_k \;\middle|\; E(\mathrm{Gini}(\{X_k\} \cup D_0 \mid Y)) = \min_{X_h \in D_1} E(\mathrm{Gini}(\{X_h\} \cup D_0 \mid Y)) \right\};$$
3. define the new set of selected variables as follows.
$$D_2 = \{X_{i_1}\} \cup D_0;$$
4. repeat the previous steps until the stopping criterion is met.
Thus the idea of this general feature selection process is to select, at each step, the variable set with the highest association degree and, among the candidates tied on association, the one with the best statistical reliability. More detailed explanations and similar procedures can be found in [8].
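A minimal sketch of this stepwise forward procedure is given below. It is an illustrative implementation of the rules stated above (not the code used for the experiments); it relies on `gk_tau` and `expected_gini` from the earlier sketch and represents a set of variables by concatenating its columns into one composite categorical key.

```python
import pandas as pd

def combine(X: pd.DataFrame, cols) -> pd.Series:
    """Treat several categorical columns as one joint categorical variable."""
    return X[list(cols)].astype(str).agg("|".join, axis=1)

def forward_select(X: pd.DataFrame, y: pd.Series,
                   max_features: int = 5, tau_stop: float = 1.0) -> list:
    """Stepwise forward selection by tau, breaking ties by E(Gini)."""
    selected: list = []
    while len(selected) < max_features:
        candidates = [c for c in X.columns if c not in selected]
        if not candidates:
            break
        # Step 1: candidates maximizing tau(Y | D0 ∪ {Xi}).
        taus = {c: gk_tau(combine(X, selected + [c]), y) for c in candidates}
        best_tau = max(taus.values())
        tied = [c for c, t in taus.items() if t >= best_tau - 1e-12]
        # Step 2: among those, the best (smallest) E(Gini(D0 ∪ {Xi} | Y)).
        chosen = min(tied,
                     key=lambda c: expected_gini(combine(X, selected + [c]), y))
        # Step 3: enlarge the set of selected variables.
        selected.append(chosen)
        # Step 4: stop once the association threshold is reached.
        if best_tau >= tau_stop:
            break
    return selected
```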
The category-based version of the previous procedure transforms all the original (non-binary categorical) explanatory variables into dummy variables before the general steps are applied. The unselected categories are then merged into a new category within each original variable, as described below.
1. Transform each original variable $X_i$, with $n_i$ categories, into $n_i$ binary dummy variables, one per category;
2. Follow the steps in Section 3.2, using the dummy variables as the candidate features, to select the more informative categories;
3. Merge, within each original variable, the remaining (unselected) categories into a single new category, as sketched below.
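The two extra steps can be sketched as follows (again illustrative; the function names and the label used for the merged category are assumptions). Step 2 is simply the forward selection from the previous sketch applied to the dummy columns.

```python
import pandas as pd

def to_dummies(X: pd.DataFrame) -> pd.DataFrame:
    """Step 1: turn every original categorical column into 0/1 dummy columns."""
    return pd.get_dummies(X.astype(str), prefix=list(X.columns), prefix_sep="=")

def merge_unselected(X: pd.DataFrame, selected_dummies: list) -> pd.DataFrame:
    """Step 3: within each original variable, merge every category whose dummy
    variable was not selected into one new category labelled 'other'."""
    merged = {}
    for col in X.columns:
        kept_values = {d.split("=", 1)[1] for d in selected_dummies
                       if d.startswith(f"{col}=")}
        col_as_str = X[col].astype(str)
        merged[col] = col_as_str.where(col_as_str.isin(kept_values), other="other")
    return pd.DataFrame(merged)
```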
Notice that, despite its genuine advantages, the category-based forward selection process has a higher time cost than the corresponding original-variable-based approach: since it has more features to scan, it must go through more loops to reach the same target. A complexity analysis generally depends on the specific algorithm design and implementation, and a detailed treatment is beyond the scope of this article, so only a brief discussion follows.
Assume that the time cost of evaluating one candidate variable set's association is a constant $c$. Selecting $k$ features out of $n$ candidates by the forward scheme then requires on the order of $c\,k\,n$ evaluations; in the category-based version, $n$ is replaced by the total number of dummy variables, i.e., the total number of categories, which is usually considerably larger than the number of original variables.
The experiments' purpose is to evaluate the differences in association and reliability between the category-based and the original-variable-based feature selection processes. The first experiment uses the mushroom data set from the UCI Machine Learning Repository [13]. It has 8124 instances, each described by the mushroom's type (edible or poisonous) together with categorical attributes such as cap shape, odor and gill color.
The mushroom's type is chosen as the response variable, while the other 21 variables are the explanatory ones. We compare the feature selection result obtained from the original variables with that obtained from the transformed dummy variables; the results are summarized in Table 1.
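The comparison could be run along the following lines: a hedged usage sketch building on the earlier helper functions, where the download URL and the generic column names are assumptions and the call does not reproduce the exact settings behind Table 1.

```python
import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "mushroom/agaricus-lepiota.data")   # assumed location of the data file

df = pd.read_csv(URL, header=None)
df.columns = ["type"] + [f"attr{i}" for i in range(1, df.shape[1])]

y = df["type"]
X = df.drop(columns=["type"])

# Selection over the original variables (forward_select from the earlier sketch).
print(forward_select(X, y, max_features=4))

# Category-based selection over the dummy variables, then merging the rest.
dummies = to_dummies(X)
selected_cats = forward_select(dummies, y, max_features=6)
print(selected_cats)
print(merge_unselected(X, selected_cats).nunique())   # reduced domain sizes
```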
**Table 1.** Feature selection on the mushroom data: original variables vs. merged categories.

| Number of selected features | Domain size | τ | λ | E(Gini) |
|---|---|---|---|---|
| **Original features** | | | | |
| 1 | 18 | 0.9429 | 0.9693 | 0.4797 |
| 2 | 46 | 0.9782 | 0.9877 | 0.7718 |
| 3 | 108 | 0.9907 | 0.9939 | 0.9076 |
| 4 | 192 | 1 | 1 | 0.9490 |
| **Merged features** | | | | |
| 4 | 16 | 0.9445 | 0.9693 | 0.2098 |
| 4 | 24 | 0.9908 | 0.9939 | 0.2143 |
| 5 | 30 | 0.9962 | 0.9979 | 0.4669 |
| 6 | 38 | 1 | 1 | 0.6638 |
As Table 1 shows, feature selection over the original variables stops after four variables, at which point the association reaches its (data-based) maximum of $\tau = \lambda = 1$; by then the combined domain contains 192 categories and $E(\mathrm{Gini})$ has risen to 0.9490.
The category-based feature selection always gives rise to remarkably better reliability (a smaller $E(\mathrm{Gini})$) at a comparable association level: the merged features reach $\tau = \lambda = 1$ with a combined domain of only 38 categories and $E(\mathrm{Gini}) = 0.6638$, versus 192 categories and $E(\mathrm{Gini}) = 0.9490$ for the original variables.
It can also be seen from the two tables that, in both experiments, the categories yield a higher association than the original variables at the same reliability level. In Table 1, for an almost equal reliability, say $E(\mathrm{Gini}) \approx 0.47$, the merged features attain $\tau = 0.9962$ while the original variables attain only $\tau = 0.9429$.
In the second experiment, the variable HouseholdType is chosen as the response variable and all other categorical variables serve as the explanatory ones; the results are reported in Table 2.
**Table 2.** Feature selection in the second experiment: original variables vs. merged categories.

| Number of selected features | Domain size | τ | λ | E(Gini) |
|---|---|---|---|---|
| **Original features** | | | | |
| 1 | 66 | 0.3005 | 0.3444 | 0.8201 |
| 2 | 252 | 0.3948 | 0.4391 | 0.9046 |
| 3 | 1830 | 0.4383 | 0.4648 | 0.9833 |
| **Merged features** | | | | |
| 2 | 24 | 0.3242 | 0.3934 | 0.5491 |
| 2 | 36 | 0.3573 | 0.4165 | 0.6242 |
| 2 | 48 | 0.3751 | 0.4234 | 0.6388 |
| 3 | 96 | 0.3901 | 0.4234 | 0.7035 |
| 4 | 186 | 0.4017 | 0.4269 | 0.7774 |
| 4 | 282 | 0.4121 | 0.4317 | 0.8066 |
| 5 | 558 | 0.4221 | 0.4548 | 0.8782 |
| 6 | 966 | 0.4314 | 0.4768 | 0.8968 |
| 7 | 1716 | 0.4436 | 0.4856 | 0.9135 |
One can see from Table 2 that the category-based approach produces an association degree of $\tau = 0.4436$ with a combined domain of 1716 categories and $E(\mathrm{Gini}) = 0.9135$, compared with $\tau = 0.4383$, 1830 categories and $E(\mathrm{Gini}) = 0.9833$ for the original variables: a slightly higher association with a smaller domain and better reliability. The advantage is clearer at smaller domain sizes; for example, with a combined domain of only 96 categories the merged features already reach $\tau = 0.3901$, close to the $\tau = 0.3948$ that the original variables need 252 categories to attain.
By transforming the categorical explanatory variables into their dummy forms and applying the feature selection procedure to the transformed variables, we can select the informative categories and merge the less informative or redundant categories of the explanatory variables, in order to increase the association degree and raise the statistical reliability.