Research article

Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification

  • Received: 04 September 2022 Revised: 17 November 2022 Accepted: 22 November 2022 Published: 05 December 2022
  • MSC : 62H30, 62R07

  • Because the majority of model-free feature screening methods concentrate on individual predictors, they are unable to accommodate structured predictors, such as grouped variables. In this study, we propose a model-free and direct extension of the original sure independence screening approach to group screening, using the Gini impurity, for classification models. Compared to current feature screening approaches, the proposed method performs better in terms of screening efficiency and classification accuracy. We establish that the proposed group screening procedure exhibits the sure screening property and the ranking consistency property under specific regularity conditions. We use simulation studies and a real data analysis to illustrate the finite sample performance of the proposed technique.

    Citation: Zhongzheng Wang, Guangming Deng, Haiyun Xu. Group feature screening based on Gini impurity for ultrahigh-dimensional multi-classification[J]. AIMS Mathematics, 2023, 8(2): 4342-4362. doi: 10.3934/math.2023216




    Ultrahigh-dimensional data are commonly available for a wide range of scientific research and applications. Feature screening plays an essential role in the analysis of ultrahigh-dimensional data, where Fan and Lv [5] first proposed sure independence screening (SIS) in their seminal paper. For linear regressions, they showed that the approach based on Pearson correlation learning possesses a screening property: even if the number of predictors P grows much faster than the number of observations n, with log P = O(n^α) for some α ∈ (0, 1/2), all relevant predictors can be selected with a probability tending to one [6].

    To address ultrahigh-dimensional feature screening in the classification problem, Mai and Zou [11] applied a Kolmogorov filter to ultrahigh-dimensional binary classification, and Cui et al. [4] proposed a screening procedure using empirical conditional distribution functions; both of these methods assume that the covariates are continuous. For categorical covariates, Huang et al. [8] constructed a model-free discrete feature screening method based on Pearson Chi-square statistics and showed that it satisfies the sure screening property of Fan et al. [6] when all covariates are binary. Ni and Fang [12] proposed a model-free feature screening procedure based on information entropy theory for multi-class classification, and Ni et al. [13] further proposed a feature screening procedure based on the adjusted Pearson Chi-square statistic for multi-class classification. Sheng and Wang [17] proposed a new model-free feature screening method based on the classification accuracy of marginal classifiers for ultrahigh-dimensional classification. However, covariates often occur in groups, especially discrete and categorical covariates arising in microarrays, genomics, brain imaging and quantitative measurements. A fair number of grouped variable selection methods arise from individual variable selection and yield a sparse solution at the group level, or even at the within-group level; examples include the group LASSO [22], group SCAD [21], group MCP [2], group hierarchical LASSO [23], group bridge [9] and group exponential LASSO [1]. When the regularization parameter is set for non-sparse estimation, some grouped variable selection algorithms may fail to converge, causing non-identifiability and near-singularity problems. Even if the algorithm converges in the setting of large groups and a small sample size n, the estimated coefficients are not likely to be globally optimal solutions. Therefore, there is a need for new screening methods that reduce the number of groups before selecting important groups and the variables within them. For ultrahigh-dimensional data with grouping structures, Niu et al. [15] applied working independence in linear models to propose a group screening approach. Song and Xie [18] further used F-test statistics to construct a group screening approach that improves marginal methods by reducing the burden of multiple testing and aggregating individual effects. For ultrahigh-dimensional grouped data in the linear model, Qiu and Ahn [16] proposed group sure independence screening (gSIS), the group high-dimensional ordinary least-squares projector (gHOLP) and group-wise adjusted R-squared screening (gAR2). He and Deng [7] applied joint information entropy to screen for important grouped covariates.

    In this study, we propose a model-free group feature screening method for ultrahigh-dimensional multi-classification with categorical responses. Our proposed group screening method is based on the Gini impurity to evaluate the predictive power of grouped covariates. The Gini impurity is a non-purity attribute splitting index, which was proposed by Breiman et al. [3] and has been widely used in decision tree algorithms such as CART and SPRINT. For categorical covariate screening, we can apply the index of purity gain, which plays the same role as the information gain [12]. As in Ni and Fang [12], continuous covariates can be sliced using standard normal quantiles. The proposed grouped feature screening procedure is based on the purity gain and is referred to as GP-SIS. Theoretically, GP-SIS is rigorously proven to enjoy the sure screening property proposed by Fan and Lv [5], which ensures that all important features can be retained. Practically, as shown by the simulation results, compared with existing group feature screening and single-covariate feature screening methods, GP-SIS has better performance.

    The remainder of this paper is organized as follows. Section 2 describes the proposed GP-SIS method in detail. In Section 3, the screening property is established. In Section 4, numerical simulations and an example of real data analysis are presented to assess the performance of the proposed method. Some concluding remarks are given in Section 5, and all proofs are provided in the Appendix.

    We first introduce the Gini impurity and purity gain and then propose a screening procedure based on purity gain.

    Each grouped covariate can be regarded as a whole. Suppose that Y is a categorical response, and that the covariate matrix X is an n×P matrix consisting of G grouped covariates, which can be represented as in Table 1.

    Table 1.  Definition of X and Y.
    Y      X_{11} … X_{1p_1}     …   X_{g1} … X_{gp_g}     …   X_{G1} … X_{Gp_G}
    Y_1    x_{1,11} … x_{1,1p_1}   …   x_{1,g1} … x_{1,gp_g}   …   x_{1,G1} … x_{1,Gp_G}
    ⋮      ⋮                        ⋮                          ⋮
    Y_n    x_{n,11} … x_{n,1p_1}   …   x_{n,g1} … x_{n,gp_g}   …   x_{n,G1} … x_{n,Gp_G}


    Here, X_g = (X_{g1}, X_{g2}, …, X_{gp_g}) represents the g-th grouped covariate, p_g represents the number of covariates in the g-th group, and P = \sum_{g=1}^{G} p_g. To introduce the Gini impurity and purity gain, assume that all covariate components of the matrix X are categorical with J categories {1, …, J}. Since every element of X_g takes values in {1, …, J}, J^{p_g} category combinations are formed. J_g denotes the last of these combinations in the g-th group, J_g = (J, J, …, J). Here, j_g = (j_1, …, j_{p_g}) indexes a combination of covariate categories in the g-th group, and (1, 1, …, 1) is the first combination.

    Let p_r = P(Y = r) denote the probability function of the response variable, w_{j_g} = w_{(j_1,…,j_{p_g})} = P(X_{g1} = j_1, …, X_{gp_g} = j_{p_g}) denote the probability function of the grouped covariate, and p_{j_g r} = p_{(j_1,…,j_{p_g}) r} = P(Y = r | X_{g1} = j_1, …, X_{gp_g} = j_{p_g}) denote the conditional probability function of the response given the grouped covariate, where g ∈ {1, …, G}, (j_1, …, j_{p_g}) ∈ {(1,1,…,1), (2,1,…,1), …, (J,J,…,J)} and r ∈ {1, …, R}. The marginal Gini impurity of Y is defined as

    Gini(Y) = 1 - \sum_{r=1}^{R} p_r^2. (2.1)

    Conditional Gini impurity is defined as

    Gini(Y|X_g) = \sum_{j_g=(1,1,…,1)}^{J_g} w_{j_g} \Big(1 - \sum_{r=1}^{R} p_{j_g r}^2\Big). (2.2)

    Similar to the information gain, the purity gain is defined as

    GP(Y|X_g) = Gini(Y) - Gini(Y|X_g) = 1 - \sum_{r=1}^{R} p_r^2 - \sum_{j_g=(1,1,…,1)}^{J_g} w_{j_g} \Big(1 - \sum_{r=1}^{R} p_{j_g r}^2\Big). (2.3)

    In Eq (2.1), Gini(Y) is non-negative and attains its maximum 1 - 1/R if and only if p_1 = … = p_R = 1/R, by Jensen's inequality. Gini(Y|X_g) in Eq (2.2) is the conditional Gini impurity of Y given X_{g1} = j_1, …, X_{gp_g} = j_{p_g}, averaged over all category combinations. Further support is provided by the following proposition.

    Proposition 2.1. When X_g is a categorical covariate, GP(Y|X_g) ≥ 0, and X_g and Y are independent if and only if GP(Y|X_g) = 0.
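    To make the definitions concrete, the following is a minimal sketch (not the authors' code) of Eqs (2.1)–(2.3) for a categorical grouped covariate; the helper names and array layout are our own assumptions.

```python
import numpy as np

def gini_impurity(y):
    """Marginal Gini impurity, Eq (2.1): 1 - sum_r p_r^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def purity_gain(y, xg):
    """Purity gain GP(Y|Xg), Eq (2.3), for a categorical grouped covariate.

    y  : length-n response vector
    xg : (n, p_g) array; each distinct row is one category combination j_g
    """
    y, xg = np.asarray(y), np.asarray(xg)
    n = len(y)
    _, combo = np.unique(xg, axis=0, return_inverse=True)
    combo = combo.ravel()                      # one combination label per row
    cond = 0.0
    for j in np.unique(combo):
        mask = combo == j
        w_j = mask.sum() / n                   # weight w_{j_g} of this combination
        cond += w_j * gini_impurity(y[mask])   # term of the conditional impurity, Eq (2.2)
    return gini_impurity(y) - cond             # Gini(Y) - Gini(Y|Xg)
```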

    For continuous X_g, the conditional Gini impurity cannot be calculated directly, so we compute the purity gain by slicing X_g into several categories. For a fixed integer J ≥ 2, let q_{(j)} be the j/J-th percentile of X_g, j = 1, …, J - 1, with q_{(0)} = -∞ and q_{(J)} = +∞. We replace w_{j_g} and p_{j_g r} in Eq (2.3), respectively, by

    w_{j_g} = w_{(j_1,…,j_{p_g})} = P(X_{g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, X_{gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]), (2.4)
    p_{j_g r} = p_{(j_1,…,j_{p_g}) r} = P(Y = r | X_{g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, X_{gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]). (2.5)

    We then define the conditional Gini impurity and the purity gain based on continuous covariates:

    Gini_J(Y|X_g) = \sum_{j_g=(1,1,…,1)}^{J_g} w_{j_g} \Big(1 - \sum_{r=1}^{R} p_{j_g r}^2\Big), (2.6)
    GP_J(Y|X_g) = \Big(1 - \sum_{r=1}^{R} p_r^2\Big) - Gini_J(Y|X_g). (2.7)

    Proposition 2.2. When X_g is a continuous covariate, GP_J(Y|X_g) ≥ 0, and X_g and Y are independent if and only if GP_J(Y|X_g) = 0.
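    For continuous grouped covariates, a hedged sketch of the slicing step is given below: each column of X_g is cut at its j/J sample quantiles into J categories, after which the categorical purity gain above applies unchanged. The function name is ours.

```python
import numpy as np

def slice_continuous(xg, J=4):
    """Discretize each column of xg into J categories using the intervals
    (q_{(j-1)}, q_{(j)}] defined by its j/J sample quantiles, j = 1, ..., J-1."""
    xg = np.asarray(xg, dtype=float)
    sliced = np.empty(xg.shape, dtype=int)
    for col in range(xg.shape[1]):
        cuts = np.quantile(xg[:, col], np.arange(1, J) / J)
        # side='left' places a value x with cuts[j-1] < x <= cuts[j] into slice j
        sliced[:, col] = np.searchsorted(cuts, xg[:, col], side="left") + 1
    return sliced
```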

    First, we select a medium-scale simplified model that almost fully contains D, where D = {g : F(Y|x) functionally depends on X_g for some Y = r}, using an adjusted purity gain index for each pair (Y, X_g) defined as follows:

    e_g = \Big[\Big(1 - \sum_{r=1}^{R} p_r^2\Big) - \sum_{j_g=(1,1,…,1)}^{J_g} w_{j_g} \Big(1 - \sum_{r=1}^{R} p_{j_g r}^2\Big)\Big] \Big/ \log N_g. (2.8)

    Here, p_r = P(Y = r), and when X_g is a categorical group, w_{j_g} = w_{(j_1,…,j_{p_g})} = P(X_{g1} = j_1, …, X_{gp_g} = j_{p_g}), N_g represents the number of group category combinations of X_g, and p_{j_g r} = p_{(j_1,…,j_{p_g}) r} = P(Y = r | X_{g1} = j_1, …, X_{gp_g} = j_{p_g}).

    When X_g consists of continuous group covariates, p_{j_g r} = P(Y = r | X_{g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, X_{gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]), where q_{g1,(j)} represents the j/J-th percentile of X_{g1} and J represents the number of slices applied to X_g. In this case, N_g = J^{p_g}.

    Under the original definition in Eq (2.3), grouped covariates with more categories tend to be associated with a larger purity gain, regardless of whether they are important, especially when the number of categories differs across grouped covariates. Therefore, Ni and Fang [12] used log J_k to construct an information gain ratio to address this problem when the numbers of categories of the X_k are the same. Similarly, when the numbers of categories of the X_g are the same, we apply log N_g in Eq (2.8) to build an adjusted purity gain index, which also applies to continuous X_g. However, when the numbers of categories of the X_g differ, 1 - \sum_{j_g=(1,1,…,1)}^{J_g} w_{j_g}^2 is used as the adjustment factor, motivated by the way the Decision Tree algorithm splits X_g into several categories.

    For grouped sample data {x_{i,g1}, …, x_{i,gp_g}, y_i}, i = 1, …, n, e_g can easily be estimated by

    \hat{e}_g = \Big[\Big(1 - \sum_{r=1}^{R} \hat{p}_r^2\Big) - \sum_{j_g=(1,1,…,1)}^{J_g} \hat{w}_{j_g} \Big(1 - \sum_{r=1}^{R} \hat{p}_{j_g r}^2\Big)\Big] \Big/ \log N_g. (2.9)

    When Xg is categorical,

    \hat{w}_{j_g} = \frac{1}{n} \sum_{i=1}^{n} I\{x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}, (2.10)
    \hat{p}_{j_g r} = \frac{\sum_{i=1}^{n} I\{y_i = r, x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}{\sum_{i=1}^{n} I\{x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}. (2.11)

    When Xg is continuous,

    \hat{w}_{j_g} = \frac{1}{n} \sum_{i=1}^{n} I\{x_{i,g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, x_{i,gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]\}, (2.12)
    \hat{p}_{j_g r} = \frac{\sum_{i=1}^{n} I\{y_i = r, x_{i,g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, x_{i,gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]\}}{\sum_{i=1}^{n} I\{x_{i,g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, x_{i,gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]\}}. (2.13)

    Here, q_{g1,(j)} is the j/J-th sample percentile of {x_{1,g1}, …, x_{n,g1}}. In either case, \hat{p}_r = \frac{1}{n} \sum_{i=1}^{n} I\{y_i = r\}.

    We suggest selecting a sub-model \hat{D} = {g : \hat{e}_g ≥ c n^{-τ}, 1 ≤ g ≤ G}, where both c and τ are predetermined thresholds specified via Condition (C2) in Section 3. In practice, we can choose the model \hat{D} = {g : \hat{e}_g is among the d largest of all \hat{e}_g}, where d = [n / log n].
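    Putting the pieces together, the following sketch computes the adjusted purity gain \hat{e}_g for every group and keeps the top d = [n/log n] groups. It reuses the purity_gain and slice_continuous helpers sketched above; the groups argument (a list of column-index lists) is a hypothetical way of encoding the group structure, and the division by log N_g follows our reading of Eq (2.8).

```python
import math
import numpy as np

def gp_sis(X, y, groups, J=4, continuous=True):
    """Rank grouped covariates by the adjusted purity gain and keep the top d."""
    n = len(y)
    scores = []
    for cols in groups:
        xg = X[:, cols]
        if continuous:
            xg = slice_continuous(xg, J)
            Ng = J ** len(cols)                 # N_g = J^{p_g} for sliced groups
        else:
            Ng = len(np.unique(xg, axis=0))     # observed category combinations
        scores.append(purity_gain(y, xg) / math.log(Ng))   # assumes N_g >= 2
    scores = np.asarray(scores)
    d = int(n / math.log(n))                    # retained model size [n / log n]
    selected = np.argsort(scores)[::-1][:d]     # indices of the d top-ranked groups
    return selected, scores
```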

    In this section, we establish the screening properties of GP-SIS. Based on the sure independence screening theory developed by Ni and Fang [12] and He and Deng [7], the following conditions are assumed:

    Condition 1 (C1). There exist two positive constants c_1 and c_2 such that c_1/R ≤ p_r ≤ c_2/R, c_1 + c_2 ≤ R, c_1/R ≤ p_{j_g r} ≤ c_2/R and c_1/N_g ≤ w_{j_g} ≤ c_2/N_g for every 1 ≤ g ≤ G, 1 ≤ r ≤ R and j_g ∈ {(1,1,…,1), (2,1,…,1), …, (J,J,…,J)}.

    Condition 2 (C2). There exist a positive constant c > 0 and 0 ≤ τ < 1/2 such that \min_{g∈D} e_g ≥ 2c n^{-τ}.

    Condition 3 (C3). R = O(n^{ε}) and \max_{1≤g≤G} N_g = O(n^{κ}), where ε ≥ 0, κ ≥ 0 and 2τ + 2ε + 2κ < 1.

    Condition 4 (C4). There exists a positive constant c_3 such that 0 < f_g(x | Y = r) < c_3 for any 1 ≤ r ≤ R and any x in the domain of X_g, where f_g(x | Y = r) is the Lebesgue density function of X_g conditional on Y = r.

    Condition 5 (C5). There exist a positive constant c_4 and 0 ≤ ρ < 1/2 such that f_g(x) ≥ c_4 n^{-ρ} for any 1 ≤ g ≤ G and any x in the domain of X_g, where f_g(x) is the Lebesgue density function of X_g. Furthermore, f_g(x) is continuous in the domain of X_g.

    Condition 6 (C6). R = O(n^{ε}) and \max_{1≤g≤G} N_g = O(n^{κ}), where 2τ + 2ε + 2κ + 2ρ < 1 and ε ≥ 0, κ ≥ 0.

    Condition 7 (C7). \liminf_{p→∞} \{\min_{g∈D} e_g - \max_{g∈I} e_g\} ≥ δ, where δ > 0 is a constant.

    Condition (C1) guarantees that the proportion of each class of variables can be neither extremely small nor extremely large; a similar assumption was made in condition (C1) of Huang et al. [8] and Cui et al. [4]. Following Fan and Lv [5] and Cui et al. [4], Condition (C2) allows the minimum true signal to vanish at the order n^{-τ} as the sample size goes to infinity. Following Ni and Fang [12] and He and Deng [7], Condition (C3) allows the number of group categories and the number of response classes to diverge at certain rates, and Condition (C6) slightly modifies Condition (C3). To ensure that the sample percentiles are close to the true percentiles, Condition (C4) rules out the extreme case in which some X_g places a heavy mass in a small range, and Condition (C5) requires c_4 n^{-ρ} as a lower bound for the density. Cui et al. [4] and Zhu et al. [24] proposed the ranking consistency property; with the inactive covariate subset I = {1, …, G} \ D, Condition (C7) is the corresponding assumption, and a similar assumption was also made by Ni and Fang [12] and He and Deng [7].

    Theorem 3.1 (Sure screening property). Under conditions (C1) to (C3), if all the covariates are categorical, we obtain:

    P(D ⊆ \hat{D}) ≥ 1 - O\big(p \exp\{-b n^{1-(2τ+2ε+2κ)} + (ε+κ)\log n\}\big),

    where b denotes a positive constant. If log p = O(n^{α}) and α < 1 - (2τ + 2ε + 2κ), GP-SIS exhibits the sure screening property.

    Theorem 3.2 (Sure screening property). Under conditions (C4)–(C6), when the covariates comprise continuous and categorical variables, we obtain

    P(D ⊆ \hat{D}) ≥ 1 - O\big(p \exp\{-b n^{1-(2τ+2ε+2κ+2ρ)} + (ε+κ)\log n\}\big),

    where b denotes a positive constant. If log p = O(n^{α}) and α < 1 - (2τ + 2ε + 2κ + 2ρ), GP-SIS exhibits the sure screening property.

    Theorem 3.3 (Ranking consistency property). Under Conditions (C1), (C4), (C5) and (C7), if \frac{\log(R N_g)}{\log n} = O(1) and \frac{\max\{\log p, \log n\} R^4 N_g^4}{n^{1-2ρ}} = O(1), then \liminf_{n→∞} \{\min_{g∈D} \hat{e}_g - \max_{g∈I} \hat{e}_g\} > 0, a.s.

    Theorem 3.3 shows that the proposed screening index can effectively separate active and inactive covariates at the sample level.

    In this subsection, we conduct four simulation studies to demonstrate the finite sample performance of the group screening method described in Section 2. We compared GP-SIS with IG-SIS [12] and GIG-SIS [7] using the following evaluation criteria. Within each replication, we ranked the features according to each screening criterion and recorded the minimum model size (MMS) required to include all active features; the 5, 25, 50, 75 and 95% quantiles of the MMS over 100 replications summarize the screening performance. Following Shao and Zhang [20], we also report CPa, the proportion of replications in which all active features are retained in the selected model; a CPa close to 1 is evidence of sure screening for a procedure. In addition, we consider the inclusion proportions CP1, CP2 and CP3, which represent the coverage probabilities, for model sizes of [n/log n], 2[n/log n] and 3[n/log n] respectively, of including the active covariates. This allows us to further investigate which active predictors are easier to recover.
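    As an illustration (our own helper code, not from the paper), the minimum model size and an all-active coverage indicator for a given model size can be computed from one replication's screening scores as follows.

```python
import numpy as np

def minimum_model_size(scores, active):
    """Smallest number of top-ranked groups needed to contain all active groups."""
    order = np.argsort(scores)[::-1]            # best group first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return int(ranks[list(active)].max())

def covers_all(scores, active, model_size):
    """1 if every active group is among the model_size top-ranked groups."""
    top = set(np.argsort(scores)[::-1][:model_size].tolist())
    return int(set(active).issubset(top))
```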

    Model 1: categorical covariates and binary response

    First, we consider response variables with different category structures. Following Ni and Fang [12] and He and Deng [7], we assume a model in which the response y_i is binary, with R = 2, and all covariates are categorical. We consider two distributions for y_i:

    (1) Balanced: p_r = P(y_i = r) = 1/2;

    (2) Unbalanced: p_r = 2[1 + (R - r)/(R - 1)]/(3R), with \max_{1≤r≤R} p_r = 2 \min_{1≤r≤R} p_r.
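    As a quick numerical check of the unbalanced specification (our own illustration), the probabilities p_r = 2[1 + (R - r)/(R - 1)]/(3R) sum to one and the largest class probability is exactly twice the smallest:

```python
R = 2                                              # binary response of Model 1
p = [2 * (1 + (R - r) / (R - 1)) / (3 * R) for r in range(1, R + 1)]
print(p)                # [0.666..., 0.333...]
print(sum(p))           # 1.0
print(max(p) / min(p))  # 2.0
```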

    The true model is defined as D = {1, …, 9} with d_0 = 9, the group size is 3, and the number of active groups is d_{0G} = 3. Conditional on y_i = r, the latent variable is generated as z_i = (z_{i,1}, …, z_{i,P}), where z_{i,k} ∼ N(u_{rk}, 1), 1 ≤ k ≤ P. Subsequently, we construct the active covariates:

    (1) If k > d_0, then u_{rk} = 0;

    (2) If k ≤ d_0 and r = 1, then u_{rk} = 0.5;

    (3) If k ≤ d_0 and r = 2, then u_{rk} = -0.5.

    Next, we apply the quantile of the standard normal distribution to generate the covariates. The specific approach is as follows.

    (1) When k is an odd number, x_{i,k} = I(z_{i,k} > z_{(1/2)}) + 1;

    (2) When k is an even number, x_{i,k} = \sum_{j=1}^{4} I(z_{i,k} > z_{(j/5)}) + 1;

    where z_{(α)} denotes the α-th quantile of the standard normal distribution.

    Thus, among all P covariates, the two-category and five-category covariates each account for half. In this model, we consider P = 1500 and n = 80, 100, 120.
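    The following sketch reproduces this data-generating process under our reading of the specification above (the signs of the latent means and the 2-/5-category cuts are our reconstruction); the function and variable names are ours.

```python
import numpy as np
from scipy.stats import norm

def generate_model1(n=100, P=1500, d0=9, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.integers(1, 3, size=n)                     # balanced binary response in {1, 2}
    mu = np.zeros((n, P))
    mu[:, :d0] = np.where(y[:, None] == 1, 0.5, -0.5)  # only the active covariates are shifted
    z = rng.standard_normal((n, P)) + mu               # latent z_{i,k} ~ N(u_{rk}, 1)
    x = np.empty((n, P), dtype=int)
    for k in range(P):
        J = 2 if (k + 1) % 2 == 1 else 5               # odd columns: 2 classes, even: 5
        cuts = norm.ppf(np.arange(1, J) / J)           # standard normal quantiles z_(j/J)
        x[:, k] = np.searchsorted(cuts, z[:, k]) + 1   # x = sum_j I(z > z_(j/J)) + 1
    return x, y
```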

    Table 2 shows the evaluation criteria over 100 simulations for Model 1. The results show that the proposed GP-SIS works well. As the sample size n increases, the MMS of GP-SIS stays close to d_{0G} = 3, and the coverage probabilities increase to 1. Comparing the responses of different structures, the MMS performance under the unbalanced response is better than under the balanced response. Moreover, GP-SIS is more robust than the other two methods because the fluctuation range of its MMS is small.

    Table 2.  Simulation results for example 1.
    Condition MMS CP
    5% 25% 50% 75% 95% CP1 CP2 CP3 CPa
    Balanced Y, n=80, p=1500
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS 89.8 120.0 162.5 193.0 231.1 0.72 0.79 0.80 0.00
    Balanced Y, n=100, p=1500
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 17.0 22.0 26.0 31.0 40.1 0.89 0.97 1.00 1.00
    Balanced Y, n=120, p=1500
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 10.0 11.0 13.0 15.0 1.00 1.00 1.00 1.00
    UnBalanced Y, n=80, p=1500
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 4.0 5.0 1.00 1.00 1.00 1.00
    IG-SIS 130.9 197.0 226.0 282.5 373.4 0.66 0.79 0.84 0.00
    UnBalanced Y, n=100, p=1500
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 4.0 0.89 1.00 1.00 1.00
    IG-SIS 11.0 15.0 17.0 20.0 27.0 0.90 0.99 1.00 1.00
    UnBalanced Y, n=120, p=1500
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 10.0 10.0 12.0 1.00 1.00 1.00 1.00


    Model 2: categorical covariates and multi-class response

    We now consider covariates with more categories, and the response y_i is multi-class with R = 10. We consider two distributions for y_i:

    (1) Balanced: p_r = P(y_i = r) = 1/R;

    (2) Unbalanced: p_r = 2[1 + (R - r)/(R - 1)]/(3R), with \max_{1≤r≤R} p_r = 2 \min_{1≤r≤R} p_r.

    The true model is defined as D = {1, …, 9} with d_0 = 9, the group size is 3, and the number of active groups is d_{0G} = 3. Conditional on y_i, the latent variable is generated as z_i = (z_{i,1}, …, z_{i,P}); for covariate X_k, x_{i,k} = f_k(ε_{i,k} + u_{i,k}), where ε_{i,k} ∼ t(4) and f_k(·) is a discretization function defined through quantiles of the standard normal distribution. We then construct the active covariates by defining u_{rk}:

    (1) If k > d_0, then u_{rk} = 0;

    (2) If k ≤ d_0, then u_{rk} = 1.5 × (0.9)^r.

    Next, we apply f_k(·) to generate the covariates and consider P = 2000 and n = 100, 150, 200 in this model. The specific approach is as follows:

    (1) For k ≤ 400, f_k(ε_{i,k} + u_{i,k}) = I(z_{i,k} > z_{(1/2)}) + 1;

    (2) For 401 ≤ k ≤ 800, f_k(ε_{i,k} + u_{i,k}) = \sum_{j=1}^{3} I(z_{i,k} > z_{(j/4)}) + 1;

    (3) For 801 ≤ k ≤ 1200, f_k(ε_{i,k} + u_{i,k}) = \sum_{j=1}^{5} I(z_{i,k} > z_{(j/6)}) + 1;

    (4) For 1201 ≤ k ≤ 1600, f_k(ε_{i,k} + u_{i,k}) = \sum_{j=1}^{7} I(z_{i,k} > z_{(j/8)}) + 1;

    (5) For 1601 ≤ k, f_k(ε_{i,k} + u_{i,k}) = \sum_{j=1}^{9} I(z_{i,k} > z_{(j/10)}) + 1;

    where z_{i,k} = ε_{i,k} + u_{i,k} and z_{(α)} denotes the α-th quantile of the standard normal distribution.

    Thus, among all the P covariates, the covariates of two, four, six, eight, and ten categories accounted for one-fifth each.

    Table 3 shows the evaluation criteria over 100 simulations for Model 2. The performance of the methods under Model 1 is worse than under Model 2. When the model is more intricate, GP-SIS performs better than IG-SIS; in particular, GP-SIS and GIG-SIS attain a small MMS even for small sample sizes n. As the sample size n increases, the MMS of GP-SIS stays close to d_{0G} = 3, and the coverage probabilities increase to 1. Comparing the responses of different structures, the MMS performance under the unbalanced response is better than under the balanced response. Furthermore, GP-SIS is more robust in performance because the fluctuation range of its MMS is small.

    Table 3.  Simulation results for example 2.
    Condition MMS CP
    5% 25% 50% 75% 95% CP1 CP2 CP3 CPa
    Balanced Y, n=100, p=2000
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 9.0 10.0 45.1 0.99 0.99 0.99 0.95
    Balanced Y, n=150, p=2000
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 9.0 9.0 9.0 1.00 1.00 1.00 1.00
    Balanced Y, n=200, p=2000
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 9.0 9.0 9.0 1.00 1.00 1.00 1.00
    UnBalanced Y, n=100, p=2000
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 10.0 10.0 12.0 0.66 0.79 0.84 0.00
    UnBalanced Y, n=150, p=2000
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 9.0 9.0 9.0 1.00 1.00 1.00 1.00
    UnBalanced Y, n=200, p=2000
    GP-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    GIG-SIS 3.0 3.0 3.0 3.0 3.0 1.00 1.00 1.00 1.00
    IG-SIS 9.0 9.0 9.0 9.0 9.0 1.00 1.00 1.00 1.00


    Model 3: continuous and categorical covariates

    Finally, we assume a more complex example in which the covariates are both continuous and categorical and the response y_i is multi-class with R = 4. We consider two distributions for y_i:

    (1) Balanced: p_r = P(y_i = r) = 1/R;

    (2) Unbalanced: p_r = 2[1 + (R - r)/(R - 1)]/(3R), with \max_{1≤r≤R} p_r = 2 \min_{1≤r≤R} p_r.

    In this model, we consider P = 3000 and n = 180, 220, 260. The true model is defined as D = {1, 2, 3, 751, 752, 753, 1501, 1502, 1503, 1504, 1505, 1506} with d_0 = 12, the group size is 3, and the number of active groups is d_{0G} = 4. Conditional on y_i, the latent variable is generated as z_i = (z_{i,1}, …, z_{i,P}). For covariate X_k, z_{i,k} ∼ N(u_{ik}, 1), 1 ≤ k ≤ P, where u_i = (u_{i1}, …, u_{iP})^T with u_{ik} = (-1)^r θ_{rk} when y_i = r and k ∈ D. According to He and Deng [7] and Ni and Fang [12], θ_{rk} is listed in Table 4, and u_{ik} = 0 when k ∉ D. To generate X_k:

    Table 4.  Parameter specification of Model 3.
    θrk K
    1 2 3 4 5 6 7 8 9 10 11 12
    r=1 0.2 0.8 0.7 0.2 0.2 0.9 0.1 0.1 0.7 0.7 0.3 0.5
    r=2 0.9 0.3 0.3 0.7 0.8 0.4 0.7 0.6 0.4 0.4 0.8 0.2
    r=3 0.1 0.9 0.9 0.1 0.3 0.1 0.4 0.3 0.6 0.6 0.4 0.7
    r=4 0.7 0.2 0.2 0.6 0.7 0.6 0.8 0.9 0.1 0.1 0.8 0.6

     | Show Table
    DownLoad: CSV

    For k ≤ 750, x_{ik} = j if z_{ik} ∈ (z_{(j-1)/4}, z_{j/4}];

    For 750 < k ≤ 1500, x_{ik} = j if z_{ik} ∈ (z_{(j-1)/10}, z_{j/10}];

    For 1501 ≤ k, x_{ik} = z_{ik}.

    Thus, among all P covariates, the four-category covariates and the ten-category covariates each account for one quarter, and the remaining covariates are continuous. Among the active covariates, three have four categories, three have ten categories, and the remaining six are continuous, accounting for half. For the continuous covariates, we applied different numbers of slices, J = 4, 8, 10; the corresponding approaches are denoted GP-SIS-4, IG-SIS-4, GP-SIS-8, IG-SIS-8, GP-SIS-10 and IG-SIS-10. For grouped covariates, He and Deng [7] proposed a grouped feature screening algorithm that uses joint information entropy to screen important grouped covariates; we denote its versions as GIG-SIS-4, GIG-SIS-8 and GIG-SIS-10.

    Tables 5 and 6 present the simulation results over 100 simulations for the balanced and unbalanced cases, respectively. As the sample size n increases, the MMS of GP-SIS stays close to d_{0G} = 4, and the coverage probabilities increase to 1. The coverage probabilities of GP-SIS are close to those of GIG-SIS across the five indices, which demonstrates that GP-SIS has the characteristics of group feature screening. Comparing the responses of different structures, the MMS performance under the unbalanced response is better than under the balanced response.

    Table 5.  Simulation results for example 3: balanced Y.
    Condition MMS CP
    5% 25% 50% 75% 95% CP1 CP2 CP3 CPa
    Balanced Y, n=180, p=3000
    GP-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-4 15.0 22.0 32.5 52.8 92.1 0.94 0.98 0.99 0.88
    GP-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-8 13.0 16.0 20.0 31.3 79.1 0.97 0.99 0.99 0.93
    GP-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-10 14.0 17.0 22.0 39.3 123.1 0.96 0.99 0.99 0.87
    Balanced Y, n=220, p=3000
    GP-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-4 13.0 14.0 16.0 19.0 26.0 1.00 1.00 1.00 1.00
    GP-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-8 13.0 15.0 18.0 21.0 29.0 1.00 1.00 1.00 1.00
    GP-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-10 14.0 17.0 22.0 26.0 37.4 0.99 1.00 1.00 1.00
    Balanced Y, n=260, p=3000
    GP-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-4 12.0 12.0 14.0 15.0 18.1 1.00 1.00 1.00 1.00
    GP-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-8 12.0 13.0 14.0 17.0 21.1 1.00 1.00 1.00 1.00
    GP-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-10 12.0 14.0 16.0 19.0 28.1 1.00 1.00 1.00 1.00

    Table 6.  Simulation results for example 3: unbalanced Y.
    Condition MMS CP
    5% 25% 50% 75% 95% CP1 CP2 CP3 CPa
    UnBalanced Y, n=180, p=3000
    GP-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-4 39.0 43.0 46.0 50.3 55.1 0.84 0.97 1.00 1.00
    GP-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-8 22.0 26.0 29.0 32.0 35.1 0.93 1.00 1.00 1.00
    GP-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-10 21.0 23.0 26.0 29.0 33.0 0.95 1.00 1.00 1.00
    UnBalanced Y, n=220, p=3000
    GP-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-4 16.0 18.0 20.0 22.0 25.0 1.00 1.00 1.00 1.00
    GP-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-8 13.0 14.0 16.0 17.0 18.0 1.00 1.00 1.00 1.00
    GP-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-10 15.0 17.0 19.0 20.0 22.1 1.00 1.00 1.00 1.00
    UnBalanced Y, n=260, p=3000
    GP-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-4 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-4 12.0 12.0 13.0 13.0 14.0 1.00 1.00 1.00 1.00
    GP-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-8 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-8 12.0 12.0 13.0 13.0 14.1 1.00 1.00 1.00 1.00
    GP-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    GIG-SIS-10 4.0 4.0 4.0 4.0 4.0 1.00 1.00 1.00 1.00
    IG-SIS-10 12.0 13.0 13.0 14.0 15.0 1.00 1.00 1.00 1.00


    Furthermore, GP-SIS and GIG-SIS are robust in performance, because the fluctuation range of the MMS is small for both types of responses. When different numbers of slices are applied to the continuous covariates, GP-SIS and GIG-SIS remain better in terms of the coverage probabilities and MMS. Therefore, the performance of the three methods does not depend on the number of slices.

    Model 4. Computational time complexity analysis

    Model 4 is similar to Model 1; however, for the distribution of y_i, we consider balanced data, that is, P(y_i = r) = 1/2. The true model is defined as D = {1, …, 9}, with d_0 = 9, d_{0G} = 3 and group size 3. The active and irrelevant covariates are generated in the same way as in Model 1. Similarly, half of the P-dimensional covariates have two categories, while the other half have five categories. Model 4 fixes the sample size at n = 150 and lets the covariate dimension range from 1500 to 10500 in equal steps of 1000. The running time of each method is recorded for each experiment, and the median running time over 100 replicate experiments is taken as the running time index of the three methods. The trends of the three methods as the covariate dimension increases are then compared to assess their computational complexity.
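    A minimal timing harness in the spirit of this experiment (our own sketch; screen_fn stands for any of the screening functions, such as the gp_sis sketch above) records the median wall-clock time of repeated runs:

```python
import time
import numpy as np

def median_runtime(screen_fn, X, y, groups, reps=100):
    """Median wall-clock time (in seconds) over repeated screening runs."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        screen_fn(X, y, groups)
        times.append(time.perf_counter() - t0)
    return float(np.median(times))
```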

    The median run times in Table 7 show a linear trend for the three methods as the covariate dimension varies linearly. It can be observed that the running time of GP-SIS is not much different from that of GIG-SIS. In ultrahigh-dimensional feature screening, both GP-SIS and GIG-SIS are robust, but the computational time of GP-SIS is shorter than that of GIG-SIS. For grouped variable feature screening, our method is therefore superior to GIG-SIS in computational cost.

    Table 7.  Simulation results for Model 4 (Note: Running time in seconds).
    Screening Methods P
    1500 2500 3500 4500 5500 6500 7500 8500 9500 10500
    GP-SIS 2.636 4.202 6.173 6.617 8.124 9.275 10.705 12.135 17.139 15.757
    IG-SIS 3.145 5.127 7.512 7.923 9.696 10.967 12.629 14.296 19.841 18.478
    GIG-SIS 3.677 6.283 8.933 9.393 11.532 13.085 15.079 17.103 24.702 22.291


    In this subsection, we analyze a real data set from the feature selection database at Arizona State University (http://featureselection.asu.edu/). The lung biological data include 203 samples and 3312 features and are unbalanced with respect to the response variable: the class sizes are 139, 17, 21, 20 and 6. The covariates are not only continuous but also exhibit group correlations. We randomly divided the data into two parts, with 90% of the data serving as training data and 10% as test data; the sample sizes of the training and test data are n = 182 and n = 21, respectively, and the dimension of both is P = 3312.

    We utilized ten-fold cross-validation to assess the performance of various classification algorithms and to eliminate accuracy issues caused by the particular choice of training data. Active covariates were chosen by GP-SIS-10, IG-SIS-10 and GIG-SIS-10 based on the training data. Using the selected active covariates, we classified the samples with a variety of techniques, including the Support Vector Machine (SVM) [19], Random Forest (RF) and Decision Tree (DT) [10]. The G-mean and F-measure are the evaluation indices employed; for unbalanced high-dimensional data, higher G-means and F-measures indicate better feature screening performance [7]. Table 8 shows the G-mean and F-measure on the training and test data for the three screening methods GP-SIS-10, IG-SIS-10 and GIG-SIS-10. Across the classification methods, GP-SIS exhibited the best performance, with F-measures closer to 1 than those of the other two methods. However, for some response classes the test F-measure is small, close to 0 for all classification methods. Overall, the proposed GP-SIS method performed better.
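    A hedged sketch of this evaluation step follows; the classifier choice and cross-validation mirror the description above, while the helper names and the G-mean definition as the geometric mean of per-class recalls are our own assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.svm import SVC
from sklearn.metrics import f1_score, recall_score

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def evaluate_screened(X_screened, y, folds=10):
    """Cross-validated G-mean and per-class F-measure with an SVM classifier."""
    # plain (non-stratified) K-fold so that very small classes do not break the split
    cv = KFold(n_splits=folds, shuffle=True, random_state=0)
    pred = cross_val_predict(SVC(), X_screened, y, cv=cv)
    return g_mean(y, pred), f1_score(y, pred, average=None, zero_division=0)
```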

    Table 8.  Analysis results for real data example.
    screening method response
    1 2 3 4 5
    classification method SVM
    G-mean(train data) GP-SIS 0.9986 0.7995 0.8108 0.8131 0.7801
    GIG-SIS 0.9979 0.7903 0.8095 0.8111 0.7768
    IG-SIS 0.9992 0.7914 0.7865 0.8084 0.7738
    G-mean(test data) GP-SIS 0.9941 0.9790 0.9825 0.9973 0.9888
    GIG-SIS 0.9954 0.9773 0.9937 1.0000 0.9947
    IG-SIS 0.9959 0.9841 0.9758 1.0000 0.9943
    F-measure(train data) GP-SIS 0.9762 0.8080 0.8518 0.8576 0.6432
    GIG-SIS 0.9635 0.6843 0.7873 0.7942 0.5259
    IG-SIS 0.9480 0.6295 0.5900 0.7255 0.4425
    F-measure(test data) GP-SIS 0.8946 0.3352 0.4203 0.4828 0.0900
    GIG-SIS 0.9208 0.3433 0.5683 0.5446 0.2000
    IG-SIS 0.9116 0.3555 0.3422 0.5502 0.2400
    classification method DT
    G-mean(train data) GP-SIS 0.9944 0.7861 0.7976 0.7988 0.7511
    GIG-SIS 0.9931 0.7715 0.7829 0.7914 0.7449
    IG-SIS 0.9952 0.7821 0.7777 0.8013 0.7471
    G-mean(test data) GP-SIS 0.9907 0.9896 0.9849 0.9949 0.9829
    GIG-SIS 0.9927 0.9646 0.9831 0.9891 0.9816
    IG-SIS 0.9917 0.9840 0.9751 1.0000 0.9829
    F-measure(train data) GP-SIS 0.9190 0.5290 0.5988 0.6085 0.0000
    GIG-SIS 0.89029 0.3551 0.4698 0.5202 0.0343
    IG-SIS 0.9057 0.4776 0.4432 0.5929 0.0000
    F-measure(test data) GP-SIS 0.8836 0.4005 0.4233 0.4779 0.0000
    GIG-SIS 0.8459 0.1708 0.3441 0.3847 0.0000
    IG-SIS 0.8745 0.3195 0.2819 0.4683 0.0000
    classification method RF
    G-mean(train data) GP-SIS 1.0000 0.8099 0.8188 0.8166 0.7848
    GIG-SIS 1.0000 0.8099 0.8187 0.8166 0.7847
    IG-SIS 1.0000 0.8045 0.8098 0.8143 0.7818
    G-mean(test data) GP-SIS 0.9963 0.9819 0.9884 0.9975 0.9834
    GIG-SIS 0.9978 0.9598 0.9819 0.9913 0.9859
    IG-SIS 0.9958 0.9816 0.9761 1.0000 0.9975
    F-measure(train data) GP-SIS 1.0000 1.0000 1.0000 1.0000 1.0000
    GIG-SIS 1.0000 1.0000 1.0000 1.0000 1.0000
    IG-SIS 0.9846 0.8772 0.8925 0.9025 0.7340
    F-measure(test data) GP-SIS 0.9036 0.3533 0.5067 0.5138 0.0000
    GIG-SIS 0.8856 0.1300 0.3968 0.4885 0.0286
    IG-SIS 0.9137 0.3605 0.3722 0.5303 0.2567


    In practice, data often contain continuous and categorical grouped covariates together with a categorical response, yet the applicable screening methods are limited. We propose the GP-SIS procedure, based on the Gini impurity, to effectively screen grouped covariates. GP-SIS is model-free and, theoretically, possesses the sure screening property and the ranking consistency property. Whether the numbers of categories of the grouped covariates are all the same or different, GP-SIS performs quite similarly to GIG-SIS, as shown in the simulations. Practically, as shown by the simulation results, compared with existing group feature screening and single-covariate feature screening methods, GP-SIS has better performance.

    Group feature screening remains difficult in the presence of missing data. In the future, based on the classification model, we intend to propose a new group feature screening method for data with missing covariates or a missing response.

    The work was supported by National Natural Science Foundation of China [grant number 71963008]. The authors are grateful to the editor and anonymous referee for their constructive comments that led to significant improvements in the paper.

    All authors declare no conflicts of interest in this paper.

    The lung biological data that support the findings of this study are available from the feature selection database of Arizona State University (http://featureselection.asu.edu/).

    Proof of Proposition 2.1. To prove Proposition 2.1, we define f(x) = x^2; the proof is close to that of Ni and Fang [12]. By Jensen's inequality,

    \sum_{i=1}^{p_g} \sum_{j_i=1}^{J} w_{(j_1,…,j_{p_g})} \sum_{r=1}^{R} p_{(j_1,…,j_{p_g})r}^2 = \sum_{r=1}^{R} \Big[\sum_{i=1}^{p_g} \sum_{j_i=1}^{J} w_{(j_1,…,j_{p_g})} f(p_{(j_1,…,j_{p_g})r})\Big] ≥ \sum_{r=1}^{R} f\Big(\sum_{i=1}^{p_g} \sum_{j_i=1}^{J} w_{(j_1,…,j_{p_g})} p_{(j_1,…,j_{p_g})r}\Big) = \sum_{r=1}^{R} f\Big(\sum_{i=1}^{p_g} \sum_{j_i=1}^{J} P(X_{gi} = j_i) P(Y = r | X_{gi} = j_i)\Big) = \sum_{r=1}^{R} p_r^2,

    and then

    GP(Y|X_g) = \Big(1 - \sum_{r=1}^{R} p_r^2\Big) - \sum_{j_g}^{J_g} w_{j_g} \Big(1 - \sum_{r=1}^{R} p_{j_g r}^2\Big) = 1 - \sum_{r=1}^{R} p_r^2 - \sum_{j_g}^{J_g} w_{j_g} + \sum_{j_g}^{J_g} w_{j_g} \sum_{r=1}^{R} p_{j_g r}^2 = \sum_{j_g}^{J_g} w_{j_g} \sum_{r=1}^{R} p_{j_g r}^2 - \sum_{r=1}^{R} p_r^2 ≥ 0.

    The equality above holds if and only if p_{(j_1,…,j_{p_g})r} = p_{(j_1',…,j_{p_g}')r} for any 1 ≤ r ≤ R, 1 ≤ i ≤ p_g and 1 ≤ j_i, j_i' ≤ J. That is, X_g and Y are independent.

    Proof of Proposition 2.2. By the same argument as in the proof of Proposition 2.1, GP_J(Y|X_g) ≥ 0 holds, with equality if and only if p_{(j_1,…,j_{p_g})r} = p_{(j_1',…,j_{p_g}')r}. So, when X_g and Y are independent, GP_J(Y|X_g) = 0. Indeed, by Bayes' theorem,

    P(X_{g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, X_{gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}] | Y = r) = P(Y = r | X_{g1} ∈ (q_{g1,(j_1-1)}, q_{g1,(j_1)}], …, X_{gp_g} ∈ (q_{gp_g,(j_{p_g}-1)}, q_{gp_g,(j_{p_g})}]) (1/J)^{p_g} / P(Y = r).

    By \sum_{j=1}^{J} P(X_{g1} ∈ (q_{g1,(j-1)}, q_{g1,(j)}], …, X_{gp_g} ∈ (q_{gp_g,(j-1)}, q_{gp_g,(j)}]) = 1, we have P(X_{g1} ∈ (q_{g1,(j-1)}, q_{g1,(j)}], …, X_{gp_g} ∈ (q_{gp_g,(j-1)}, q_{gp_g,(j)}]) = (1/J)^{p_g} and P(X_{g1} ≤ q_{g1,(j)}, …, X_{gp_g} ≤ q_{gp_g,(j)} | Y = r) = (j/J)^{p_g} if the covariates have a similar distribution.

    Lemma 1 (Bernstein inequality). If Z_1, …, Z_n are independent random variables with mean 0 and support bounded in [-M, M], then P(|\sum_{i=1}^{n} Z_i| > t) ≤ 2\exp\Big\{-\frac{t^2}{2(v + Mt/3)}\Big\}, where v ≥ Var(\sum_{i=1}^{n} Z_i).

    Lemma 2. For discrete grouped covariates X_g and a discrete response Y, we have the following three inequalities:

    (a) P(|\hat{p}_r - p_r| > t) ≤ 2\exp\Big\{-\frac{6nt^2}{3+4t}\Big\};

    (b) P(|\hat{w}_{j_g} - w_{j_g}| > t) ≤ 2\exp\Big\{-\frac{6nt^2}{3+4t}\Big\};

    (c) P(|\hat{p}_{j_g r} - p_{j_g r}| > t) ≤ 2\exp\Big\{-\frac{6nt^2}{3+4t}\Big\}.

    Proof of Lemma 2. The proofs of the three inequalities are similar; inequalities (a) and (b) have been given in Ni [14] and He and Deng [7], respectively. The following is the proof of inequality (c).

    \hat{p}_{j_g r} = \frac{\sum_{i=1}^{n} I\{y_i = r, x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}{\sum_{i=1}^{n} I\{x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}.

    The expectation of \hat{p}_{j_g r} is

    E(\hat{p}_{j_g r}) = E\Big(\frac{\sum_{i=1}^{n} I\{y_i = r, x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}{\sum_{i=1}^{n} I\{x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}\Big) = E\Big(\frac{I\{y_i = r, x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}{I\{x_{i,g1} = j_1, …, x_{i,gp_g} = j_{p_g}\}}\Big) = p_{j_g r}.

    Let Z_i = I\{y_i = r | X_{i,g1} = j_1, …, X_{i,gp_g} = j_{p_g}\} - p_{j_g r}; then Var(\sum_{i=1}^{n} Z_i) = n Var(Z_i) = n p_{j_g r}(1 - p_{j_g r}) ≤ n/4, and

    P(|\hat{p}_{j_g r} - p_{j_g r}| > t) = P\Big(\Big|n^{-1}\sum_{i=1}^{n} Z_i\Big| > t\Big) = P\Big(\Big|\sum_{i=1}^{n} Z_i\Big| > nt\Big) ≤ 2\exp\Big\{-\frac{n^2 t^2}{2(\frac{n}{4} + \frac{nt}{3})}\Big\} = 2\exp\Big\{-\frac{6nt^2}{3+4t}\Big\}.

    The inequality follows from the Bernstein inequality (Lemma 1).

    Lemma 3. For discrete grouped covariates X_g and a discrete response Y, for any 0 < ε < 1 and under Condition (C1), we have P(|\hat{e}_g - e_g| > 2ε) ≤ O(RJ^3)\exp\Big\{-\frac{c_5 n ε^2}{R^2 J^6}\Big\}, where c_5 represents a positive constant.

    Proof of Lemma 3. By the definitions of e_g and \hat{e}_g in Section 2.2, we have

    \log N_g (\hat{e}_g - e_g) = \Big[\Big(1 - \sum_{r=1}^{R} \hat{p}_r^2\Big) - \sum_{j_g}^{J_g} \hat{w}_{j_g}\Big(1 - \sum_{r=1}^{R} \hat{p}_{j_g r}^2\Big)\Big] - \Big[\Big(1 - \sum_{r=1}^{R} p_r^2\Big) - \sum_{j_g}^{J_g} w_{j_g}\Big(1 - \sum_{r=1}^{R} p_{j_g r}^2\Big)\Big] = \Big(\sum_{r=1}^{R} p_r^2 - \sum_{r=1}^{R} \hat{p}_r^2\Big) + \Big(\sum_{j_g}^{J_g} w_{j_g} - \sum_{j_g}^{J_g} \hat{w}_{j_g}\Big) + \Big(\sum_{j_g}^{J_g} \hat{w}_{j_g}\sum_{r=1}^{R} \hat{p}_{j_g r}^2 - \sum_{j_g}^{J_g} w_{j_g}\sum_{r=1}^{R} p_{j_g r}^2\Big) = \sum_{r=1}^{R}(p_r^2 - \hat{p}_r^2) + \sum_{j_g}^{J_g}(w_{j_g} - \hat{w}_{j_g}) + \sum_{j_g}^{J_g}\sum_{r=1}^{R}(\hat{w}_{j_g}\hat{p}_{j_g r}^2 - w_{j_g} p_{j_g r}^2) = \sum_{r=1}^{R}(p_r - \hat{p}_r)(p_r + \hat{p}_r) + \sum_{j_g}^{J_g}(w_{j_g} - \hat{w}_{j_g}) + \sum_{j_g}^{J_g}\sum_{r=1}^{R}\big[(\hat{w}_{j_g}\hat{p}_{j_g r} + w_{j_g} p_{j_g r})(\hat{p}_{j_g r} - p_{j_g r}) + \hat{p}_{j_g r} p_{j_g r}(\hat{w}_{j_g} - w_{j_g})\big] := I_1 + I_2 + I_3.

    Since \log J ≥ \log 2 ≥ 0.5, we have

    P(|\hat{e}_g - e_g| > 2ε) ≤ P(|I_1| > ε/3) + P(|I_2| > ε/3) + P(|I_3| > ε/3).

    For I1, we have

    P(|I_1| > ε/3) ≤ \sum_{r=1}^{R} P\Big(|(p_r - \hat{p}_r)(p_r + \hat{p}_r)| > \frac{ε}{3R}\Big) ≤ \sum_{r=1}^{R} P\Big(|p_r - \hat{p}_r| > \frac{c_1 ε}{3RJ^3}\Big) ≤ RJ^3 \cdot 2\exp\Bigg\{-\frac{6n\big(\frac{c_1 ε}{3RJ^3}\big)^2}{3 + 4\big(\frac{c_1 ε}{3RJ^3}\big)}\Bigg\}.

    For I2, we have

    P(|I_2| > ε/3) ≤ \sum_{j_g}^{J_g} P\Big(|\hat{w}_{j_g} - w_{j_g}| > \frac{c_1 ε}{3J^3}\Big) ≤ J^3 \cdot 2\exp\Bigg\{-\frac{6n\big(\frac{c_1 ε}{3J^3}\big)^2}{3 + 4\big(\frac{c_1 ε}{3J^3}\big)}\Bigg\}.

    For I3, we have

    I_3 = \sum_{j_g}^{J_g}\sum_{r=1}^{R}\big[(\hat{w}_{j_g}\hat{p}_{j_g r} + w_{j_g} p_{j_g r})(\hat{p}_{j_g r} - p_{j_g r}) + \hat{p}_{j_g r} p_{j_g r}(\hat{w}_{j_g} - w_{j_g})\big] = \sum_{j_g}^{J_g}\sum_{r=1}^{R}\big[(\hat{w}_{j_g}\hat{p}_{j_g r} + w_{j_g} p_{j_g r})(\hat{p}_{j_g r} - p_{j_g r})\big] + \sum_{j_g}^{J_g}\sum_{r=1}^{R}\hat{p}_{j_g r} p_{j_g r}(\hat{w}_{j_g} - w_{j_g}) := I_{31} + I_{32}.

    For I31 and I32, we have

    P(|I_3| > ε/3) ≤ P(|I_{31}| > ε/6) + P(|I_{32}| > ε/6),
    P(|I_{31}| > ε/6) ≤ \sum_{j_g}^{J_g}\sum_{r=1}^{R} P\Big(|(\hat{w}_{j_g}\hat{p}_{j_g r} + w_{j_g} p_{j_g r})(\hat{p}_{j_g r} - p_{j_g r})| > \frac{ε}{6}\Big) ≤ \sum_{j_g}^{J_g}\sum_{r=1}^{R} P\Big(|\hat{p}_{j_g r} - p_{j_g r}| > \frac{c_1 ε}{6RJ^3}\Big) ≤ RJ^3 \cdot 2\exp\Bigg\{-\frac{6n\big(\frac{c_1 ε}{6RJ^3}\big)^2}{3 + 4\big(\frac{c_1 ε}{6RJ^3}\big)}\Bigg\},
    P(|I_{32}| > ε/6) ≤ \sum_{j_g}^{J_g}\sum_{r=1}^{R} P\Big(|\hat{p}_{j_g r} p_{j_g r}(\hat{w}_{j_g} - w_{j_g})| > \frac{ε}{6}\Big) ≤ \sum_{j_g}^{J_g}\sum_{r=1}^{R} P\Big(|\hat{w}_{j_g} - w_{j_g}| > \frac{c_1 ε}{6RJ^3}\Big) ≤ RJ^3 \cdot 2\exp\Bigg\{-\frac{6n\big(\frac{c_1 ε}{6RJ^3}\big)^2}{3 + 4\big(\frac{c_1 ε}{6RJ^3}\big)}\Bigg\}.

    Combining the bounds above, we obtain the inequality

    P(|\hat{e}_g - e_g| > 2ε) ≤ O(RJ^3)\exp\Big\{-\frac{c_5 n ε^2}{R^2 J^6}\Big\},

    where c5 represents a positive constant.

    Proof of Theorem 3.1. By Conditions (C1) to (C3) and Lemma 3, we can get

    P(D ⊆ \hat{D}) ≥ P(|\hat{e}_g - e_g| ≤ c n^{-τ}, ∀ g ∈ D) ≥ P\Big(\max_{1≤g≤G}|\hat{e}_g - e_g| ≤ c n^{-τ}\Big) ≥ 1 - \sum_{g=1}^{G} P\big(|\hat{e}_g - e_g| > c n^{-τ}\big) ≥ 1 - O(RJ^3)\, p \exp\Big\{-\frac{c_5 c^2 n^{1-2τ}}{R^2 J^6}\Big\} ≥ 1 - O\big(p\exp\{-b n^{1-2τ-2ε-2κ} + (ε+κ)\log n\}\big),

    where b is a positive constant.

    Lemma 4 (Lemma A.5 of [7]). Under Conditions (C1), (C4) and (C5), for any 0 < ε < 1 and continuous X_g, we have P(|\hat{e}_g - e_g| > 2ε) ≤ O(RN_g)\exp\Big\{-\frac{c_6 n^{1-2ρ} ε^2}{R^4 N_g^4}\Big\}, where c_6 is a positive constant.

    Proof of Theorem 3.2. According to Lemma 4, the proof of Theorem 3.2 is the same as Theorem 3.1 and hence is omitted.

    Proof of Theorem 3.3. According to Lemmas 3 and 4, and under Conditions (C1), (C4), (C5) and (C7), we get

    P\Big(\min_{g∈D}\hat{e}_g - \max_{g∈I}\hat{e}_g < \frac{δ}{2}\Big) ≤ P\Big((\min_{g∈D}\hat{e}_g - \max_{g∈I}\hat{e}_g) - (\min_{g∈D} e_g - \max_{g∈I} e_g) < -\frac{δ}{2}\Big) ≤ P\Big(\big|(\min_{g∈D}\hat{e}_g - \max_{g∈I}\hat{e}_g) - (\min_{g∈D} e_g - \max_{g∈I} e_g)\big| > \frac{δ}{2}\Big) ≤ P\Big(\max_{1≤g≤G}|\hat{e}_g - e_g| > \frac{δ}{4}\Big) ≤ O(RN_g)\, p\exp\Big\{-\frac{c_7 n^{1-2ρ}}{R^4 N_g^4}\Big\} = O\Big(\exp\Big\{\log(RN_g) + \log p - \frac{c_7 n^{1-2ρ}}{R^4 N_g^4}\Big\}\Big),

    where c_7 = \min\{c_5, c_6\}\,\frac{δ^2}{4}. Since \frac{\log(RN_g)}{\log n} = O(1), there exists a positive constant c_8 such that \log(RN_g) ≤ c_8 \log n. Also, \frac{\max\{\log p, \log n\} R^4 N_g^4}{n^{1-2ρ}} = O(1) implies that \log p ≤ \frac{1}{2} c_7 \frac{n^{1-2ρ}}{R^4 N_g^4} and \frac{1}{2} c_7 \frac{n^{1-2ρ}}{R^4 N_g^4} ≥ (c_8 + 2)\log n for large n. Hence, there exists a constant n_0 such that \sum_{n=n_0}^{∞} \exp\Big\{\log(RN_g) + \log p - c_7 \frac{n^{1-2ρ}}{R^4 N_g^4}\Big\} ≤ \sum_{n=n_0}^{∞} \exp\Big\{c_8 \log n - \frac{1}{2} c_7 \frac{n^{1-2ρ}}{R^4 N_g^4}\Big\} ≤ \sum_{n=n_0}^{∞} \exp\{c_8 \log n - (c_8 + 2)\log n\} = \sum_{n=n_0}^{∞} n^{-2} < ∞. According to Ni and Fang [12] and by the Borel–Cantelli lemma, we obtain \liminf_{n→∞}\{\min_{g∈D}\hat{e}_g - \max_{g∈I}\hat{e}_g\} ≥ \frac{δ}{2} > 0, a.s.



    [1] P. Breheny, The group exponential lasso for bi-level variable selection, Biometrika, 71 (2015), 731–740. https://doi.org/10.1111/biom.12300 doi: 10.1111/biom.12300
    [2] P. Breheny, J. Huang, Penalized methods for bi-level variable selection, Stat. Interface., 2 (2009), 369–380. https://doi.org/10.4310/SII.2009.v2.n3.a10 doi: 10.4310/SII.2009.v2.n3.a10
    [3] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and regression trees, Belmont CA: Wadsworth International Group, 1984. https://doi.org/10.1201/9781315139470
    [4] H. Cui, R. Li, W. Zhong, Model-free feature screening for ultrahigh dimensional discriminant analysis, J. Am. Stat. Assoc., 110 (2015), 630–641. https://doi.org/10.1080/01621459.2014.920256 doi: 10.1080/01621459.2014.920256
    [5] J. Fan, J. Lv, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B, 70 (2008), 849–911. https://doi.org/10.1111/j.1467-9868.2008.00674.x doi: 10.1111/j.1467-9868.2008.00674.x
    [6] J. Fan, R. Samworth, Y. Wu, Ultrahigh dimensional feature selection: beyond the linear model, J. Mach. Learn. Res., 10 (2009), 2013–2038. http://arXiv.org/abs/0812.3201
    [7] H. He, G. Deng, Grouped feature screening for ultra-high dimensional data for the classification model, J. Stat. Comput. Simul., 92 (2022), 972–997. https://doi.org/10.1080/00949655.2021.1981901 doi: 10.1080/00949655.2021.1981901
    [8] D. Huang, R. Li, H. Wang, Feature screening for ultrahigh dimensional categorical data with applications, J. Bus. Econ. Stat., 32 (2014), 237–244. https://doi.org/10.1080/07350015.2013.863158 doi: 10.1080/07350015.2013.863158
    [9] J. Huang, S. Ma, H. Xie, C. Zhang, A group bridge approach for variable selection, Biometrika, 96 (2009), 339–355. https://doi.org/10.1093/biomet/asp020 doi: 10.1093/biomet/asp020
    [10] B. Lantz, Machine learning with R: expert techniques for predictive modeling, 2nd ed., Birmingham: Packt Publishing, 2019.
    [11] Q. Mai, H. Zou, The Kolmogorov filter for variable screening in high-dimensional binary classification, Biometrika, 100 (2013), 229–234. https://doi.org/10.1093/biomet/ass062 doi: 10.1093/biomet/ass062
    [12] L. Ni, F. Fang, Entropy-based model-free feature screening for ultrahigh-dimensional multiclass classification, J. Nonparametr. Stat., 28 (2016), 515–530. https://doi.org/10.1080/10485252.2016.1167206 doi: 10.1080/10485252.2016.1167206
    [13] L. Ni, F. Fang, F. Wan, Adjusted pearson Chi-Square feature screening for multi-classification with ultrahigh dimensional data, Metrika, 80 (2017), 805–828. https://doi.org/10.1007/s00184-017-0629-9 doi: 10.1007/s00184-017-0629-9
    [14] L. Ni, Variable screening methods for ultra-high dimensional categorical covariates, Shanghai: East China Normal University, 2019.
    [15] Y. Niu, R. Zhang, J. Liu, H. Li, Group screening for ultra-high-dimensional feature under linear model, Stat. Theor. Relat. Field., 4 (2020), 43–54. https://doi.org/10.1080/24754269.2019.1633763 doi: 10.1080/24754269.2019.1633763
    [16] D. Qiu, J. Ahn, Grouped variable screening for ultra-high dimensional data for linear model, Comput. Stat. Data Anal., 144 (2020), 1–11. https://doi.org/10.1016/j.csda.2019.106894 doi: 10.1016/j.csda.2019.106894
    [17] Y. Sheng, Q. Wang, Model-free feature screening for ultrahigh dimensional classification, J. Multivar. Anal., 178 (2020), 1–15. https://doi.org/10.1016/j.jmva.2020.104618 doi: 10.1016/j.jmva.2020.104618
    [18] W. Song, J. Xie, Group feature screening via the F statistic, Commun. Stat. Simul. Comput., 48 (2019), 1921–1931. https://doi.org/10.1080/03610918.2019.1691223 doi: 10.1080/03610918.2019.1691223
    [19] J. A. K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett., 9 (1999), 293–300. https://doi.org/10.1023/A:1018628609742 doi: 10.1023/A:1018628609742
    [20] X. Shao, J. Zhang, Martingale difference correlation and its use in high-dimensional variable screening, J. Am. Stat. Assoc., 109 (2014), 1302–1318. https://doi.org/10.1080/01621459.2014.887012 doi: 10.1080/01621459.2014.887012
    [21] L. Wang, G. Chen, H. Li, Group SCAD regression analysis for microarray time course gene expression data, Bioinformatics, 23 (2007), 1486–1494. https://doi.org/10.1093/bioinformatics/btm125 doi: 10.1093/bioinformatics/btm125
    [22] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B, 68 (2006), 49–67. https://doi.org/10.1111/j.1467-9868.2005.00532.x doi: 10.1111/j.1467-9868.2005.00532.x
    [23] N. Zhou, J. Zhu, Group variable selection via a hierarchical lasso and its oracle property, Stat. Interface., 3 (2010), 557–574. https://doi.org/10.48550/arXiv.1006.2871 doi: 10.48550/arXiv.1006.2871
    [24] L. Zhu, L. Li, R. Li, L. Zhu, Model-free feature screening for ultrahigh-dimensional data, J. Am. Stat. Assoc., 106 (2011), 1464–1475. https://doi.org/10.1198/jasa.2011.tm10563 doi: 10.1198/jasa.2011.tm10563
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)