1. Introduction
With the development of information technology and networks, big data whose main feature is "high dimensionality" are widely appearing in medicine, economics, engineering, and other fields. In general, when the dimension p of the covariates grows exponentially with the sample size n, such data are called ultrahigh-dimensional data [1]. Intuitively, the covariate dimension of ultrahigh-dimensional data far exceeds the sample size, and traditional penalized variable selection methods suffer from high computational complexity, poor statistical accuracy, and weak algorithmic stability; therefore, it is urgent to develop variable screening methods that are suitable for ultrahigh-dimensional data. Missing data are another major problem in data analysis, and they also arise in ultrahigh-dimensional data. Even a small number of missing observations for a covariate can make the significance of that variable impossible to compute, which leads to the omission of important information and to misjudgment in subsequent analysis. Therefore, determining how to directly identify important variables in ultrahigh-dimensional data with randomly missing values plays an important role in efficient data analysis.
To address the statistical modeling of ultrahigh-dimensional data, J. Fan and J. Lv [2] first proposed sure independence screening (SIS), which measures the importance of each covariate by the Pearson correlation coefficient between the response variable and that single covariate, thereby reducing the dimension of the covariates to a suitable range. To extend SIS to more general assumptions, P. Hall and H. Miller [3] introduced generalized correlation coefficients, and G. Li et al. [4] proposed the robust rank correlation coefficient screening method for transformed regression models. X. Y. Wang and C. L. Leng [5] proposed a feature screening method based on high-dimensional least-squares projection to further improve screening performance, considering that SIS relies heavily on significant covariates having large marginal correlations with the response. For the model-free setting, which is better suited to the era of big data, L. P. Zhu et al. [6] proposed a covariance-based feature screening method, namely, sure independent ranking screening. Because it makes no model assumption, this method can be used for linear models, generalized linear models, partially linear models, single index models, and other models. R. Li et al. [7] proposed a feature screening method based on the distance correlation coefficient, which uses distance covariance as an index of whether two arbitrary random variables are independent. On this basis, X. Shao and J. Zhang [8] proposed the martingale difference correlation coefficient screening method to measure the departure from conditional mean independence between two random variables. On the topic of feature screening for ultrahigh-dimensional discrete data, Q. Mai and H. Zou [9] focused on binary response variables, introduced Kolmogorov-Smirnov statistics into the feature screening framework, and proposed a variable selection method based on the Kolmogorov filter. D. Huang et al. [10] proposed a feature screening method based on Pearson chi-square statistics that can handle ultrahigh-dimensional discrete covariates as well as continuous data with multiclass response variables. L. Ni et al. [11] further considered adjusted Pearson chi-square SIS (APC-SIS) to analyze multiclass response data. P. Lai et al. [12] introduced Fisher linear projection and marginal score tests to construct a linear projection feature screening method for linear discriminant modeling. With the emergence of group variables, the above feature screening methods, which apply to single variables, are no longer applicable. W. C. Song and J. Xie [13] proposed a feature screening method for group data based on F statistics. Under linear model assumptions, D. Qiu and J. Ahn [14] proposed three methods: group SIS, group high-dimensional least-squares projection, and group adjusted R-square screening. H. J. He and G. M. Deng [15] focused on the feature screening of discrete group data and constructed a group feature screening method based on joint information entropy. Z. Z. Wang et al. [16,17] further extended the role of information theory in group feature screening by using the information gain ratio and the Gini coefficient. Y. L. Sang and X. Dang [18] used the Gini distance coefficient to measure the independence between discrete response variables and continuous covariates and constructed a Gini distance group feature screening method.
Missing data are a widespread problem in data analysis, and ultrahigh-dimensional data are no exception. P. Lai et al. [19] applied the Kolmogorov filter to screen important covariates for constructing the propensity score function, proposed a feature screening method for ultrahigh-dimensional data with the response missing at random, and used inverse probability weighting to build a marginal feature screening procedure. Q. H. Wang and Y. J. Li [20] proposed missing indicator imputation screening and Venn diagram-based feature screening methods that use the missing indicator information of the response variable. X. X. Li et al. [21] identified key covariates by using the marginal Spearman rank correlation coefficient under conditional estimation. L. Y. Zou et al. [22] used imputation to handle the distribution function of the missing response and adopted the distance correlation between the distribution functions of the response and the covariates as the feature screening index. L. Ni et al. [23] proposed two-stage feature screening with covariates missing at random (IMCSIS) and proved its theoretical properties based on adjusted Pearson chi-square feature screening.
Feature screening for ultrahigh-dimensional data with the response or the covariates missing at random has been discussed extensively. Considering that group data are common, the group structure also needs to be taken into account in the feature screening framework. We extend group feature screening methods that are typically applied to complete data to the case of covariates missing at random, thereby expanding the application scope of existing group feature screening methods. In addition, the ultrahigh-dimensional data considered in this paper are discrete, and the application scenarios are mostly classification tasks. Therefore, we adopt the adjusted Pearson chi-square statistic and a two-stage screening procedure to improve the effectiveness of multiclass classification in practical problems.
In this paper, we construct an ultrahigh-dimensional group feature screening method for data with randomly missing covariates and extend feature screening methods that are generally suited to classification models. First, we define indicator variables for the missing covariates, assuming that any missing variables occur with a group structure. Second, a two-stage group feature screening method with covariates missing at random (GIMCSIS) is proposed by adopting the adjusted Pearson chi-square statistic as the basic screening index. We show that GIMCSIS enjoys the sure screening property. Furthermore, the performance of GIMCSIS is demonstrated via numerical simulation and empirical analysis. Specifically, compared with IMCSIS, GIMCSIS can be applied to group data and improves feature screening performance for ultrahigh-dimensional data with covariates missing at random. In the empirical analysis, we focus on a classification model, and GIMCSIS outperforms IMCSIS in terms of various classification indices.
This paper is organized as follows. Section 2 introduces two-stage group feature screening based on adjusted Pearson chi-square statistics. Then, we establish a group sure screening property. The simulation studies are given in Section 3. Section 4 provides a classification analysis, and the paper concludes with a discussion in Section 5.
2. Theory and method
2.1. Symbols and definitions
First, we define the group structure of the data. When covariates are missing at random, group covariates exist among both the fully observed and the partially observed covariates; therefore, we need to define new symbols and concepts. Suppose that Y is a multiclass response variable with R classes and that the covariate matrix X consists of G group covariates, each of which contains one or more covariates. Considering randomly missing values in the covariate data, the set of fully observed group covariates is defined as $U=(u_1,\ldots,u_{G_1})^T$, and the set of partially observed group covariates is defined as $V=(v_1,\ldots,v_{G_2})^T$. The covariate matrix X is represented by

$$X=(U^T,V^T)^T=(u_1^T,\ldots,u_{G_1}^T,v_1^T,\ldots,v_{G_2}^T)^T,$$

where $1\le k\le G_1$, $1\le l\le G_2$, and $G_1$ and $G_2$ are the numbers of groups of fully and partially observed covariates, respectively. Here, $p$ denotes the total dimension of the covariates, $p_k$ denotes the dimension of the $k$th fully observed covariate group, and $q_l$ denotes the dimension of the $l$th partially observed covariate group. For the partially observed covariates, $\delta_l$ is used to represent the missing indicator variable. The missing status of a single variable takes only the values 1 or 0, whereas the missing status of a group of variables is more complicated. In this paper, the group missing indicator variable $\delta^*_l$ equals 1 only when all covariates in the $l$th group are observed:

$$\delta^*_l=\begin{cases}1, & \text{if all covariates in } v_l \text{ are observed},\\ 0, & \text{otherwise}.\end{cases}$$
Second, it is assumed that every covariate component of the covariate matrix X takes J classes. Let $J_g$ denote the total number of class combinations of the covariates in the $g$th covariate group, and let $j_g$ index these class combinations. If $j_g=1$, it denotes the first covariate-class combination, and so on, up to the $J_g$th combination; each combination corresponds to a classification vector $(j_1,\ldots,j_{p_g})$.
Let the probability function of the response variable be $p_r=P(Y=r)$. The joint probability function of the covariates within a group is $w_{j_k}=w_{(j_1,\ldots,j_{p_k})}=P(u_{k1}=j_1,\ldots,u_{kz}=j_z,\ldots,u_{kp_k}=j_{p_k})$ for a fully observed group and $w_{j_l}=w_{(j_1,\ldots,j_{p_l})}=P(v_{l1}=j_1,\ldots,v_{lz}=j_z,\ldots,v_{lp_l}=j_{p_l})$ for a partially observed group; $w_{j_k}$ is the joint probability function for the complete data, and $w_{j_l}$ is its counterpart for the partially observed covariates. Similarly, the joint probability functions of the response variable and a group covariate are $p_{J_k r}=p_{(j_1,\ldots,j_{p_k})r}=P(Y=r,u_{k1}=j_1,\ldots,u_{kz}=j_z,\ldots,u_{kp_k}=j_{p_k})$ and $p_{J_l r}=p_{(j_1,\ldots,j_{p_l})r}=P(Y=r,v_{l1}=j_1,\ldots,v_{lz}=j_z,\ldots,v_{lp_l}=j_{p_l})$, where $1\le k\le G_1$, $1\le l\le G_2$, $r=1,\ldots,R$, $z=1,\ldots,p_k$ or $1,\ldots,p_l$, and $j_1,\ldots,j_{p_g}\in\{1,\ldots,J\}$.
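To make the class-combination index $j_g$ concrete, the covariates of a group can be collapsed into a single categorical variable whose levels are the observed combinations $(j_1,\ldots,j_{p_g})$. The following is a minimal R sketch of this encoding; the function name and the toy data are illustrative and not from the paper.

```r
# Collapse a group of categorical covariates into one categorical variable
# whose levels are the class combinations (j_1, ..., j_{p_g}).
group_combo <- function(x_group) {
  interaction(as.data.frame(x_group), drop = TRUE)  # drop unobserved combinations
}

# Toy group with p_g = 2 covariates, each with J = 2 classes (at most J_g = 4 combinations)
x_group <- cbind(c(1, 1, 2, 2, 1), c(1, 2, 1, 2, 2))
jg <- group_combo(x_group)
levels(jg)  # the observed class combinations
table(jg)   # their frequencies
```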
2.2. Adjusted Pearson chi-square statistic
Following the two-stage feature screening procedure of [23], we propose a two-stage feature screening method for group-structured data. Specifically, we use APC-SIS to construct the group feature screening procedure.
The APC-SIS method uses the adjusted Pearson chi-square statistic as the feature screening index [11]. For a single covariate, the adjusted Pearson chi-square statistic is given as follows:

$$\Delta_k=(\log J)^{-1}\sum_{r=1}^{R}\sum_{j=1}^{J}\frac{\{\pi^{(k)}_{r,j}-p_r w^{(k)}_j\}^2}{p_r w^{(k)}_j},$$

where $p_r=P(Y=r)$, $w^{(k)}_j=P(X_k=j)$ and $\pi^{(k)}_{r,j}=P(Y=r,X_k=j)$, with $r=1,2,\ldots,R$, $j=1,2,\ldots,J$ and $k=1,2,\ldots,K$. When the response variable Y is independent of the covariate $X_k$, the product of the marginal probabilities equals the joint probability, so $p_r w^{(k)}_j=\pi^{(k)}_{r,j}$. When Y and $X_k$ are not independent, the product of the marginal probabilities is not equal to the joint probability; the larger the difference, the stronger the dependence between Y and $X_k$. Therefore, two properties of $\Delta_k$ follow easily: $\Delta_k\ge 0$, and $\Delta_k=0$ if and only if Y is independent of $X_k$.
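As an illustration, the sample version of this statistic can be computed as below. This is a minimal sketch, not the authors' code; the function name is ours, and a covariate group is handled by collapsing it into a single categorical variable as in Section 2.1.

```r
# Minimal sketch: sample adjusted Pearson chi-square statistic between a
# categorical response y and a categorical covariate (or covariate group) x.
apc_stat <- function(y, x) {
  if (is.matrix(x) || is.data.frame(x)) {
    x <- interaction(as.data.frame(x), drop = TRUE)   # collapse a group into one factor
  }
  x <- as.factor(x); y <- as.factor(y)
  n   <- length(y)
  tab <- table(y, x) / n        # joint probabilities pi_{r,j}
  p_r <- rowSums(tab)           # marginal probabilities of Y
  w_j <- colSums(tab)           # marginal probabilities of X (or of the group combinations)
  expected <- outer(p_r, w_j)   # p_r * w_j, the joint probabilities under independence
  J <- nlevels(x)
  sum((tab - expected)^2 / expected) / log(J)   # chi-square distance adjusted by log(J)
}

# Toy usage: larger values indicate stronger dependence between y and x
set.seed(1)
y <- sample(1:2, 100, replace = TRUE)
x <- ifelse(y == 1, sample(1:3, 100, TRUE, c(0.6, 0.2, 0.2)),
                    sample(1:3, 100, TRUE, c(0.2, 0.2, 0.6)))
apc_stat(y, x)
```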
By applying the above definition, we can obtain the adjusted Pearson chi-square statistics of group covariates under the random missing mechanism. For the fully observed covariate data, we can directly construct the adjusted Pearson chi-square statistic of the response variable Y and the fully observed group covariate $U_k$:

$$\mathrm{APC}_g(Y,U_k)=(\log J_k)^{-1}\sum_{j=1}^{J_k}\sum_{r=1}^{R}\frac{(p_{J_k r}-p_r w_{j_k})^2}{p_r w_{j_k}}. \qquad (2.2)$$
For the partially observed covariate data, the adjusted Pearson chi-square statistic of the response variable Y and the partially observed group covariate $V_l$ is calculated analogously:

$$\mathrm{APC}_g(Y,V_l)=(\log J_l)^{-1}\sum_{j=1}^{J_l}\sum_{r=1}^{R}\frac{(p_{J_l r}-p_r w_{j_l})^2}{p_r w_{j_l}}.$$
Because $w_{j_l}=P(v_{l1}=j_1,\ldots,v_{lz}=j_z,\ldots,v_{lp_l}=j_{p_l})=\sum_{r=1}^{R}P(Y=r,v_{l1}=j_1,\ldots,v_{lz}=j_z,\ldots,v_{lp_l}=j_{p_l})=\sum_{r=1}^{R}p_{J_l r}$, when estimating $\mathrm{APC}_g(Y,V_l)$, compared with Eq (2.2), we only need to estimate $p_{J_l r}$. Then, under the missing at random assumption,

$$p_{J_l r}=\sum_{u}\frac{P(v_{l1}=j_1,\ldots,v_{lp_l}=j_{p_l},\,Y=r,\,U_{M_{lg}}=u,\,\delta^*_l=1)}{P(Y=r,\,U_{M_{lg}}=u,\,\delta^*_l=1)}\,P(Y=r,\,U_{M_{lg}}=u), \qquad (2.4)$$
where $U_{M_{lg}}$ is used to link the fully observed covariates with the partially observed covariates, and the set of important fully observed covariates $M_{lg}$ can be obtained by computing $\mathrm{APC}_g(\delta^*_l,U_k)$; that is, the fully observed covariates associated with the missing information are used to replace the partially observed covariates. Therefore, $\hat p_{J_l r}$ can be obtained from $\hat M_{lg}$ and Eq (2.4).
2.3. Two-stage group feature screening method
The two-stage group feature screening with covariates missing at random (GIMCSIS) uses the missing indicator variable as a bridge between the partially observed covariate and the response variable. Feature screening of the fully observed covariates and partially observed covariates is carried out, and the fully observed covariate information is used to replace the partially observed covariate information to realize group feature screening of the partially observed covariates. The screening process is divided into two steps:
Step 1: To map the information of the partially observed variables onto the fully observed covariates, the fully observed covariates associated with the missing indicator variable are considered, and the partially observed covariates are replaced by the information of the fully observed covariates. Specifically, for the missing indicator variable of each partially observed covariate group, the adjusted Pearson chi-square statistic is calculated as follows:

$$\mathrm{APC}_g(\delta^*_l,U_k)=(\log J_k)^{-1}\sum_{j=1}^{J_k}\sum_{r=0}^{1}\frac{\{P(\delta^*_l=r,U_k=j)-P(\delta^*_l=r)\,w_{j_k}\}^2}{P(\delta^*_l=r)\,w_{j_k}},$$
where $w_{j_k}=w_{(j_1,\ldots,j_{p_k})}=P(u_{k1}=j_1,\ldots,u_{kz}=j_z,\ldots,u_{kp_k}=j_{p_k})$ and $r=0,1$. The active fully observed covariates are estimated by applying the following threshold:

$$\hat M_{lg}=\{k:\widehat{\mathrm{APC}}_g(\delta^*_l,U_k)\ge c_{\delta^*_l}n^{-\tau_{\delta^*_l}},\ 1\le k\le G_1\},$$
where $c_{\delta^*_l}$ and $\tau_{\delta^*_l}$ are predetermined constants that are defined in Condition (4) in Section 2.4.
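A minimal R sketch of Step 1 is given below, reusing the apc_stat() helper sketched in Section 2.2. Because the paper's practical cutoff is not reproduced in this excerpt, the sketch keeps the top-ranked groups, a common convention in the screening literature; the function name and the default cutoff are illustrative assumptions.

```r
# Step 1 sketch: for the l-th partially observed group, screen the fully
# observed groups against the group missing indicator delta*_l.
# U_groups: list of matrices, one matrix per fully observed covariate group
# delta_l : 0/1 vector, 1 if all covariates of the l-th group are observed
screen_step1 <- function(U_groups, delta_l,
                         d_top = ceiling(length(delta_l) / log(length(delta_l)))) {
  stats <- sapply(U_groups, function(Uk) apc_stat(delta_l, Uk))
  # A hard threshold of the form c * n^(-tau) is replaced here by keeping the
  # top d_top groups ranked by the statistic (illustrative choice).
  order(stats, decreasing = TRUE)[seq_len(min(d_top, length(stats)))]
}
```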
Step 2: Having obtained $\hat M_{lg}$, we use $U_{\hat M_{lg}}$ to replace the partially observed covariates, and the adjusted Pearson chi-square statistic of the response variable and each partially observed covariate group is estimated as follows:

$$\widehat{\mathrm{APC}}_g(Y,V_l)=(\log J_l)^{-1}\sum_{j=1}^{J_l}\sum_{r=1}^{R}\frac{(\hat p_{j_l r}-\hat p_r\hat w_{j_l})^2}{\hat p_r\hat w_{j_l}},$$
where

$$\hat p_{j_l r}=\sum_{u}\frac{\sum_{i=1}^{n}I(v_{l1,i}=j_1,\ldots,v_{lp_l,i}=j_{p_l},\,y_i=r,\,u_{\hat M_{lg},i}=u,\,\delta^*_{i,l}=1)}{\sum_{i=1}^{n}I(y_i=r,\,u_{\hat M_{lg},i}=u,\,\delta^*_{i,l}=1)}\cdot\frac{1}{n}\sum_{i=1}^{n}I(y_i=r,\,u_{\hat M_{lg},i}=u),$$

$\hat w_{j_l}=\sum_{r=1}^{R}\hat p_{j_l r}$ and $\hat p_r=n^{-1}\sum_{i=1}^{n}I(y_i=r)$. The summation over $u$ ranges over all possible values of $U_{\hat M_l}$. In practice, for a given value of $u$, the summation term $\sum_{i=1}^{n}I(y_i=r,u_{\hat M_{lg},i}=u,\delta^*_{i,l}=1)$ may equal 0 when the number of covariates in $\hat M_l$ is large enough. Thus, $\log J_l$ is used to adjust the Pearson chi-square statistic, and the estimates satisfy $\sum_{r=1}^{R}\sum_{j=1}^{J_l}\hat p_{j_l r}=1$.
For the fully observed covariates, we obtain the active covariate directly by using the adjusted Pearson chi-square statistic. For the partially observed covariates, we obtain the active covariate according to Steps 1 and 2. Therefore, the active covariates in the dataset can be estimated as follows:
where c and τ are predetermined constants.
In practice, we replace $\hat M_{lg}$ with
and replace $(U,V)_{\hat D}$ with the following method:
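The following R sketch summarizes Step 2 under the missing at random assumption: it estimates $\hat p_{j_l r}$ by conditioning on the fully observed groups selected in Step 1 and then forms the estimated adjusted Pearson chi-square statistic between Y and the partially observed group. All names are illustrative, and the handling of empty cells is an assumption rather than the authors' exact rule.

```r
# Step 2 sketch: estimated adjusted Pearson chi-square statistic between Y and
# a partially observed group V_l, using the fully observed groups in \hat{M}_lg.
# y     : response vector of length n
# V_l   : matrix of the l-th partially observed group (rows with NA are missing)
# U_sel : matrix binding the fully observed groups selected in Step 1
apc_step2 <- function(y, V_l, U_sel) {
  n <- length(y)
  delta_l <- as.integer(stats::complete.cases(V_l))     # group missing indicator
  y  <- as.factor(y)
  u  <- interaction(as.data.frame(U_sel), drop = TRUE)  # combinations of U_{\hat M_lg}
  jl <- interaction(as.data.frame(V_l),  drop = TRUE)   # combinations of V_l (NA when missing)

  p_hat <- matrix(0, nrow = nlevels(y), ncol = nlevels(jl),
                  dimnames = list(levels(y), levels(jl)))
  for (r in levels(y)) for (uu in levels(u)) {
    obs <- (y == r) & (u == uu) & (delta_l == 1)         # complete cases in the (r, u) cell
    if (!any(obs)) next                                  # empty denominator: skip this cell
    cond <- table(jl[obs]) / sum(obs)                    # P(V_l = j | Y = r, U = u, delta = 1)
    p_hat[r, ] <- p_hat[r, ] + as.numeric(cond) * mean(y == r & u == uu)
  }
  p_r <- as.numeric(table(y)) / n                        # hat p_r = n^{-1} sum I(y_i = r)
  w_j <- colSums(p_hat)                                  # hat w_{j_l} = sum_r hat p_{j_l r}
  expected <- outer(p_r, w_j)
  keep <- expected > 0
  sum((p_hat[keep] - expected[keep])^2 / expected[keep]) / log(nlevels(jl))
}
```

In the spirit of Section 2.3, each fully observed group can then be ranked by apc_stat(y, U_k) and each partially observed group by apc_step2(y, V_l, U_sel), and the groups with the largest statistics (or those exceeding a threshold of the form $cn^{-\tau}$) are retained.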
2.4. Sure screening property
Next, we establish the theoretical property of the proposed GIMCSIS. For feature screening, the sure screening property is essential; it was proposed by J. Fan and J. Lv [2]. It states that, with probability tending to 1, all of the important variables survive the feature screening procedure. It is important to identify the conditions under which the sure screening property holds, i.e.,

$$P(M_*\subseteq M_\gamma)\to 1 \quad \text{as } n\to\infty,$$

where $M_\gamma$ is the final model after feature screening and $M_*$ is the true model.
Therefore, to explore the sure screening property of GIMCSIS, the following regularity conditions are assumed.
(1) There are two positive constants $c_1$ and $c_2$, such that $c_1/R\le p_r\le c_2/R$, $0\le w^{U_k}_{j_g}\le c_2/J$ and $0\le w^{V_l}_{j_g}\le c_2/J$, for $r=1,\ldots,R$, $j=1,\ldots,J_{U_k}$, $l=1,\ldots,q$, $k=1,\ldots,p$, and $J=\max_{1\le k\le p,\,1\le l\le q}\{J_{U_k},J_{V_l}\}$.
(2) There are two constants $c>0$ and $0<\tau<1/2$, such that
(3) There are two positive constants $c_3$ and $c_4$, such that $0<c_3\le P(\delta^*_l=r)\le c_4<1$, for $r=0,1$ and $l=1,\ldots,q$.
(4) For each $1\le l\le q$, there are two constants $c_{\delta^*_l}>0$ and $0<\tau_{\delta^*_l}<1/2$, such that
where $\mathrm{APC}_g(\delta^*_l,U_k)=(\log J_k)^{-1}\sum_{j=1}^{J_k}\sum_{r=0}^{1}\{P(\delta^*_l=r,U_k=j)-P(\delta^*_l=r)P(U_k=j)\}^2/\{P(\delta^*_l=r)P(U_k=j)\}$.
(5) There are two positive constants $c_5$ and $c_6$, such that $\frac{c_5}{2RJ^{\bar m}}\le P(Y=r,U_{\bar M_l}=u,\delta^*_l=1)\le\frac{c_6}{2RJ^{\bar m}}$, where $\bar M_l=\{k:\mathrm{APC}_g(\delta^*_l,U_k)>\frac{1}{2}c_{\delta^*_l}n^{-\tau_{\delta^*_l}},\,1\le k\le G_1\}$, $\bar m=\max_{1\le l\le q}\|U_{\bar M_l}\|_0$, $r=1,\ldots,R$, and $l=1,\ldots,q$.
(6) $R=O(n^\xi)$ and $J=O(n^\kappa)$, where $\xi\ge 0$, $\kappa\ge 0$, $1-2\tau-6\xi-18\kappa>0$, $1-2\tau_\delta-18\kappa>0$, $1-2\tau-10\xi-(18\bar m+18)\kappa>0$, and $\tau_\delta=\max_{1\le l\le G_2}\tau_{\delta^*_l}$.
Conditions (1)–(5) are commonly used in the feature screening literature [2,7,10,11,23]. Condition (1) ensures that the proportion of each class of the response variable and the covariates is neither too small nor too large, i.e., that the classes are balanced. Condition (2) requires that the smallest true signal converge to zero at rate $n^{-\tau}$ as the sample size goes to infinity. Condition (3) requires that the missing proportion be bounded away from 0 and 1. Condition (4) ensures the sure screening property of APC-SIS in the first screening step. Condition (5) ensures that the denominator of Eq (2.4) is not 0, and such a condition can be satisfied when the cardinality of $\bar M_l$ is small. Condition (6) requires that the divergence rates of $R$ and $J$ be much smaller than the growth rate of $n$.
Theorem 1. (Sure screening property) Under Conditions (1)–(6), if $\log p=O(n^\alpha)$, $\log q=O(n^\beta)$, $\alpha<1-2\tau-6\xi-18\kappa$, $\beta<1-2\tau-10\xi-(18\bar m+18)\kappa$ and $\alpha+\beta<1-2\tau_\delta-18\kappa$, then
where $b_1$, $b_2$ and $b_3$ are constants. Therefore, GIMCSIS has the sure screening property.
Remark 1. When exploring feature screening for ultrahigh-dimensional data with missing covariates and a group structure, it is desirable to make better use of the information carried by covariates that are missing at random. Inspired by L. Ni et al. [23], we propose group feature screening for ultrahigh-dimensional data with categorical covariates missing at random (GIMCSIS). GIMCSIS expands the scope of IMCSIS and further improves the performance of classification learning. In terms of screening theory, compared with IMCSIS, GIMCSIS has a higher probability of screening out the important variables (see Theorem 1). Theorem 1 is also confirmed empirically in Section 3.
3. Simulation studies
To verify the feature screening performance of GIMCSIS, we generated a series of simulated datasets for the relevant experiments. Simulations 1 and 2 compare GIMCSIS with IMCSIS [23] from two perspectives: binary response variables and multiclass response variables. The computer configuration is as follows: CPU, Intel i5-3230M (2.6 GHz); memory, 16 GB; and operating system, Windows 10. The feature screening was implemented in R version 4.2.2 using the RStudio interactive programming interface.
The metrics used to evaluate the performance of feature screening are as follows:
3.1. Simulation 1: Binary responses
On the basis of the complete covariates, 40% of the covariates were defined as partially observed covariates. A simple model in which all covariates are multiclass and the response variable is binary is defined. The settings for the response variables $y_i$, latent variables $z_i$ and covariates $x_i$ are as follows:
where $d_0$ is the number of active covariates, and the first to tenth covariates are the active covariates. In the IMCSIS method, the number of active covariates is $d_0=10$. In the GIMCSIS method, the number of variables in each group is 3, and the number of active covariate groups is $d_{0G}=4$. When $\mu_{rk}=-0.5$ or $\mu_{rk}=0.5$, the $k$th covariate is active. $z(\alpha)$ is the $\alpha$ quantile of the standard normal distribution, and it is used to discretize the covariates $x_i$.
Therefore, all of the $p$ covariates are categorical. The ratio of fully observed to partially observed covariates is set to 6:4; the first 60% of the covariates are fully observed, and the remaining covariates contain missing values. The random missing proportions $mp$ were set to 10%, 25% and 40%, and the missing indicator variable $\delta_{i,l}$ was generated from the Bernoulli distribution $\delta_{i,l}\sim B(1,1-mp)$. The covariate dimension was set to 1000, with 600 fully observed covariates and 400 partially observed covariates, and the sample sizes were set to 100, 120 and 150 (see Table 1).
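Because the data-generating equations for $y_i$, $z_i$ and $x_i$ appear above only in summary form, the following R sketch illustrates only how the random missingness described here can be imposed; the covariate values are placeholders, and all names are ours.

```r
# Sketch of the missingness mechanism of Simulation 1: the first 60% of the
# covariates are fully observed, the remaining 40% are partially observed in
# groups of 3, and each group receives a Bernoulli(1 - mp) missing indicator.
set.seed(2023)
n <- 100; p <- 1000; group_size <- 3; mp <- 0.25

# Placeholder categorical covariates (the paper's latent-variable construction
# of x_i and y_i is not reproduced here).
x <- matrix(sample(1:2, n * p, replace = TRUE), nrow = n, ncol = p)

p_full <- 600                                   # fully observed covariates
part_groups <- split((p_full + 1):p,            # partially observed covariates, grouped
                     ceiling(seq_len(p - p_full) / group_size))

for (idx in part_groups) {
  delta <- rbinom(n, 1, 1 - mp)                 # delta_{i,l} ~ B(1, 1 - mp)
  x[delta == 0, idx] <- NA                      # the whole group is missing when delta = 0
}
mean(is.na(x[, (p_full + 1):p]))                # empirical missing proportion, close to mp
```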
Table 1 reports the values of each index over 50 replications of the simulation with normally distributed latent variables. The number of active covariate groups for GIMCSIS is 4, and the number of active covariates for IMCSIS is 10. In the first stage, regardless of the missing proportion, GIMCSIS outperforms IMCSIS. In the second stage, the covariates are affected by the random missingness, with the following results:
(1) Comparison of different sample sizes: As the sample size increases, the minimum model size (MMS) of GIMCSIS approaches the number of active groups $d_{0G}=4$, and that of IMCSIS approaches the number of active covariates $d_0=10$. However, GIMCSIS converges to the set of active covariates faster, and its MMS quantiles are almost equal to 4 when n = 150, while the MMS quantiles of IMCSIS are significantly larger than the size of the active covariate set. For both methods, the four coverage probability indicators tend to 1. However, the coverage probability of IMCSIS is much lower than that of GIMCSIS. GIMCSIS achieves good coverage probability when n = 100, whereas IMCSIS requires n = 150 to do so.
(2) Comparison of different response variables: The structure of the response variable considers both balanced and unbalanced data and is used to compare the anti-interference capacities of the methods. In general, the finite sample performance under balanced data is better than that under unbalanced data. GIMCSIS with unbalanced data can achieve excellent MMS and coverage probability when n = 50, while IMCSIS achieves better screening performance only when n = 150. Furthermore, GIMCSIS has stronger anti-interference capacity than IMCSIS.
(3) Comparison of different missing proportions: As the missing data proportion increases, the MMS quantiles of both methods increase. The variation in MMS for IMCSIS is larger than that for GIMCSIS. Moreover, the coverage probability of IMCSIS decreases significantly, while the coverage probability of GIMCSIS remains at 1. The above results indicate that the performance of IMCSIS deteriorates rapidly, while that of GIMCSIS remains relatively stable as the missing data proportion increases.
In summary, GIMCSIS has a better ability than IMCSIS to screen active covariates in ultrahigh-dimensional data with binary response variables and covariates missing at random. The screening performance of IMCSIS decreases significantly in the second screening stage, and the smaller the sample size, the worse its screening performance. The performance of GIMCSIS in the second screening stage is similar to that in the first stage, and it maintains a high coverage probability. For unbalanced data, the performance of both GIMCSIS and IMCSIS decreases, but GIMCSIS is significantly more robust.
3.2. Simulation 2: Multiclass responses
Consider a complex model with more covariate classes and a four-class response variable. Two distributions of $y_i$, balanced and unbalanced, are considered, the same as in Simulation 1.
The first to 11th covariates are the active covariates. In the IMCSIS method, the number of active covariates is $d_0=11$; in the GIMCSIS method, with 3 variables in each group, the number of active covariate groups is $d_{0G}=4$, which is the same screening target as in Simulation 1, although the composition of the group variables is different. Given $y_i$, the generation of the latent variables and active covariates in Simulation 2 is the same as in Simulation 1.
Therefore, the $p$-dimensional covariates are evenly divided into two-class and five-class categorical variables. The ratio of fully observed to partially observed covariates is set to 6:4; the first 60% of the covariates are fully observed, and the remaining covariates contain missing values. The random missing proportions $mp$ are set to 10%, 25% and 40%, and the missing indicator variable $\delta_{i,l}$ is generated from the Bernoulli distribution $\delta_{i,l}\sim B(1,1-mp)$. The covariate dimension is 1000, with 600 fully observed covariates and 400 partially observed covariates, and the sample sizes are 50, 80 and 100 (see Table 2).
Table 2 reports the values of each index over 50 replications of the simulation with normally distributed latent variables. The number of active covariate groups for GIMCSIS is 4, and the number of active covariates for IMCSIS is 11. In the first stage, GIMCSIS outperforms IMCSIS regardless of the missing proportion. In the second stage, the covariates are affected by the random missingness, with the following results:
(1) Comparison of different sample sizes: As the sample size increases, the MMS of GIMCSIS approaches the number of active groups $d_{0G}=4$, and the MMS of IMCSIS approaches the number of active covariates $d_0=11$. However, GIMCSIS converges to the set of active covariates faster, and its MMS quantiles are almost equal to 4 when n = 100, while the MMS quantiles of IMCSIS are significantly larger than the size of the active covariate set. In contrast to the simulation results for binary responses, the coverage probabilities of IMCSIS are much lower than in the binary case, the four indices converge to 1 much more slowly, and the screened active variables exhibit an obvious trailing phenomenon. Although GIMCSIS also exhibits a similar phenomenon, its four coverage probability indicators are almost 1 when n = 80.
(2) Comparison of different response variables: The structure of the response variable considers both balanced and unbalanced data and is used to compare the anti-interference capacities of the methods. In general, the finite sample performance under balanced data is better than that under unbalanced data. GIMCSIS with unbalanced data can achieve good MMS and coverage probability when n = 80, while the coverage probability of IMCSIS is far from adequate even when n = 100. Furthermore, GIMCSIS has stronger anti-interference capacity than IMCSIS.
(3) Comparison of different missing proportions: As the missing data proportion increases, the MMS quantiles of both methods increase. The variation in MMS for IMCSIS is larger than that for GIMCSIS, and the 75% and 95% quantiles for IMCSIS are far beyond the size of the active covariate set. Moreover, the coverage probability of IMCSIS decreases significantly, while the coverage probability of GIMCSIS remains at 1. The above results indicate that the performance of IMCSIS deteriorates rapidly, while that of GIMCSIS remains relatively stable as the missing data proportion increases. Therefore, GIMCSIS obtains stable screening results more effectively.
(4) Comparison with binary response variables: In Simulation 1, although the performance of IMCSIS is not as good as that of GIMCSIS, IMCSIS can achieve better performance when the sample size is large. However, in Simulation 2, when the response variable is merely increased from two to four categories, IMCSIS can no longer achieve the screening goal. This shows that GIMCSIS has unique advantages in the more general multiclass response case.
In summary, GIMCSIS also exhibits better screening performance than IMCSIS for ultrahigh-dimensional data with a multiclass response and covariates missing at random. The screening performance of GIMCSIS remains robust. Compared with IMCSIS, GIMCSIS has advantages in the case of a small sample size, an unbalanced response and a high missing proportion.
4. Empirical analysis
Section 3 illustrates the performance of GIMCSIS on simulated data. In practical applications, whether the important variables obtained via GIMCSIS are useful in data analysis is an important issue that needs further verification. In this section, we apply GIMCSIS in the preprocessing stage of imbalanced data classification to test whether the selected important variables can improve classification performance.
The empirical data were obtained from the Arizona State University feature selection database (http://featureselection.asu.edu/), which contains colon cancer data consisting of 62 instances and 2000 covariates. Forty samples were negative for colon cancer, and the other 22 samples were positive, an imbalance ratio of about 1.82:1. The 2000 covariates are gene expression levels, with the 2000 genes selected from 6500 genes. Thus, the response is binary, and the covariates are continuous. There are group correlations among the gene expression levels.
To evaluate the effectiveness of the various feature screening methods for classification, the evaluation indicators were averaged over fivefold cross-validation. First, the 62 samples were randomly divided into two groups at a ratio of 4:1: 80% of the samples were used as training data, and the rest were used as test data. The sample size of the training data was 50, the sample size of the test data was 12, and the covariate dimension of both datasets was 2000. The feature screening methods used were GIMCSIS and IMCSIS, where the number of active covariates retained in the univariate feature screening was $d_0=22$ and the number of active groups retained in the group feature screening was $d_{0G}=8$. Considering the effect of the covariates on classification, we chose three classification models: support vector machine (SVM) [24], decision tree (DT) [25] and k-nearest neighbor (KNN) [26].
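A sketch of the classification step on the screened covariates is shown below. The paper does not name its R implementation, so the widely used packages e1071 (SVM), rpart (DT) and class (KNN) are assumed here, and the choice k = 5 for KNN is illustrative.

```r
# Sketch: one 4:1 train/test split and the three classifiers fitted on the
# screened covariates. Package choices and tuning values are assumptions.
library(e1071)   # svm()
library(rpart)   # rpart()
library(class)   # knn()

fit_classifiers <- function(x_screened, y) {
  y <- as.factor(y)
  idx_tr <- sample(seq_along(y), size = round(0.8 * length(y)))   # 80% training data
  x_tr <- x_screened[idx_tr, , drop = FALSE];  y_tr <- y[idx_tr]
  x_te <- x_screened[-idx_tr, , drop = FALSE]; y_te <- y[-idx_tr]

  pred_svm <- predict(svm(x_tr, y_tr), x_te)
  pred_dt  <- predict(rpart(y ~ ., data = data.frame(x_tr, y = y_tr)),
                      newdata = data.frame(x_te), type = "class")
  pred_knn <- knn(train = x_tr, test = x_te, cl = y_tr, k = 5)    # k = 5 is illustrative

  list(truth = y_te, svm = pred_svm, dt = pred_dt, knn = pred_knn)
}
```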
The evaluation indices for the classification effect are derived from the confusion matrix, and the details are as follows. The confusion matrix cross-tabulates the actual and predicted classes:

Actual positive: TP (predicted positive), FN (predicted negative)
Actual negative: FP (predicted positive), TN (predicted negative)

where TP is the true positive count, FN is the false negative count, FP is the false positive count and TN is the true negative count. Table 3 shows all of the evaluation indices based on the confusion matrix.
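Since Table 3 is not reproduced in this excerpt, the following R helper assumes the standard definitions of the indices mentioned below (accuracy, precision, recall, specificity, G-mean and F-measure) computed from the confusion matrix; the function name is ours.

```r
# Evaluation indices computed from a binary confusion matrix (standard
# definitions assumed; Table 3 is not reproduced here).
classification_metrics <- function(truth, pred, positive) {
  tp <- sum(truth == positive & pred == positive)   # true positives
  fn <- sum(truth == positive & pred != positive)   # false negatives
  fp <- sum(truth != positive & pred == positive)   # false positives
  tn <- sum(truth != positive & pred != positive)   # true negatives

  accuracy    <- (tp + tn) / (tp + tn + fp + fn)
  recall      <- tp / (tp + fn)                     # sensitivity, true positive rate
  precision   <- tp / (tp + fp)
  specificity <- tn / (tn + fp)
  g_mean      <- sqrt(recall * specificity)         # geometric mean of the two class rates
  f_measure   <- 2 * precision * recall / (precision + recall)

  c(accuracy = accuracy, recall = recall, precision = precision,
    specificity = specificity, g_mean = g_mean, f_measure = f_measure)
}

# Example with the output of fit_classifiers() sketched above:
# res <- fit_classifiers(x_screened, y)
# classification_metrics(res$truth, res$knn, positive = "positive")
```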
Table 4 reports the classification performance of the various feature screening methods on the training and test data. Overall, the classification effect of GIMCSIS was better than that of IMCSIS, with outstanding performance in terms of the recall, G-mean and F-measure. The classification method with the best classification effect was KNN. The average accuracy on the training set based on GIMCSIS was greater than 98%, and the average accuracy on the test set was greater than 88%. On the test data, the G-mean and F-measure based on GIMCSIS were superior to those based on IMCSIS. For the SVM, the average evaluation index based on GIMCSIS was 1.64% higher than that based on IMCSIS on the training data and 5.84% higher on the test data, indicating that the classification performance based on GIMCSIS was better than that based on IMCSIS. For the DT, the average evaluation index based on GIMCSIS was only 0.22% higher than that based on IMCSIS on the training data, but it was 9.55% higher on the test data, indicating that IMCSIS exacerbated the overfitting phenomenon. For KNN, the average evaluation index based on GIMCSIS was 3.14% higher than that based on IMCSIS on the training data and 1.65% higher on the test data, indicating that GIMCSIS mitigated, to a certain extent, the underfitting of the KNN model on these data.
5. Conclusions
The existing group feature screening methods mainly focus on continuous data, discrete response variables, discrete covariates, and other settings, but group feature screening with covariates missing at random has not been discussed. Considering missing values in ultrahigh-dimensional data, this paper extends two-stage feature screening under a random missing mechanism to ultrahigh-dimensional data with a group structure and presents a two-stage group feature screening method with covariates missing at random. In the first stage, group feature screening based on adjusted Pearson chi-square statistics is used to find the fully observed covariates that are dependent on the missing indicator variables. In the second stage, the information of the partially observed covariates is replaced by that of the fully observed covariates selected in the first stage, so that the partially observed covariates that are dependent on the response variable can be found. Finally, the important features are selected by comparing the dependence between the covariates (with the partially observed covariates represented through the fully observed ones) and the response variable. Compared with existing methods, GIMCSIS can efficiently extract important variables from ultrahigh-dimensional group data with covariates missing at random. In practice, the variables selected by GIMCSIS can improve the classification performance for imbalanced data, which plays an important role in expanding the paths of imbalanced data analysis.
Specifically, GIMCSIS does not require model assumptions and enjoys the sure screening property. According to our numerical simulations, the finite sample performance of GIMCSIS is better than that of IMCSIS, for both binary and multiclass response variables. The computational complexity of GIMCSIS is similar to that of IMCSIS, and the simulation running times are similar for the same sample sizes. In the empirical analysis, we apply GIMCSIS to a classification model to improve the classification of ultrahigh-dimensional data with randomly missing values. The results show that GIMCSIS can identify more important covariates and that it yields better classification performance than IMCSIS.
Group feature screening of ultrahigh-dimensional data under other missing data mechanisms still needs to be investigated. In addition, group feature screening when both the response variable and the covariates are subject to missingness is one of the more challenging problems. In terms of empirical research, discretizing continuous data to meet the requirements of discrete feature screening is the key to popularizing group feature screening for ultrahigh-dimensional discrete data.
Use of AI tools declaration
We have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
The authors are grateful to the editor and anonymous referee for their constructive comments that led to significant improvements in the paper.
The National Natural Science Foundation of China [grant number 71963008] funded this research.
Conflict of interest
The authors declare that there is no conflict of interest in the publication of this paper.
Supplementary
Lemma 1. Similar to the derivation of Lemma 1 in [23], for the categorical response Y and the categorical, fully observed covariate $U_k$, under Condition (1), we have
where e1 is a constant.
Corollary 1. For the missing indicator variable $\delta^*_l$ and the fully observed discrete covariate $U_k$, under Conditions (1) and (3), we have
where e2 is a constant.
Lemma 2. Under Conditions (1), (3), (4) and (5), we have the following two inequalities
where e3 is a positive constant.
Proof. $M_{lg}$ and $\hat M_{lg}$ have been defined in Section 2.1 and Condition (5), respectively:
Under Conditions (1), (3) and (4), it is easy to obtain the following:
Lemma 3. For the discrete response variable Y and the random missing covariate Vl, under Conditions (1), (3), (4) and (5), we have
where e3 and e5 are positive constants.
Proof. Section 2.2 gives the joint probability of group data with randomly missing data:
Then it is easy to get
In Lemma 2, neither of the last two terms of the above formula is greater than $O(G_1\cdot J^3)\exp\{-e_3 n^{1-2\tau_{\delta^*_l}}/J^{18}\}$, so we only need to deal with the first inequality. Before doing so, for ease of presentation, we introduce the following notation:
Let $\phi_{r,u}=P(Y=r,U_{\hat M_{lg}}=u)$, $\varphi_{r,u}=P(Y=r,U_{\hat M_{lg}}=u,\delta^*_l=1)$, and $\gamma_{j_l,r,u}=P(V_{l1}=j_1,\ldots,V_{lp_l}=j_{p_l},Y=r,U_{\hat M_{lg}}=u,\delta^*_l=1)$.
The corresponding estimators are as follows:
Because $M_{lg}\subseteq\hat M_{lg}\subseteq\bar M_{lg}$, we have
For ease of writing, the following formula omits the conditioning event, such that $P_M(\cdot)$ is used instead of $P(\cdot\mid M_{lg}\subseteq\hat M_{lg}\subseteq\bar M_{lg})$; hence
Considering $I_{41}$,
Similarly,
Hence,
where e3 and e5 are constants.
Corollary 2. For the discrete response variable Y and the random missing covariate Vl, under Conditions (1), (3), (4) and (5), we have
where e3 and e5 are constants.
Proof. Section 2.2 gives $\hat w_{j_l}=\sum_{r=1}^{R}\hat p_{j_l r}$. Using the result in Lemma 3, we can see that
Lemma 4. For the discrete response variable Y and the random missing covariate Vl, under Conditions (1), (3), (4) and (5), we have
where e3 and e5 are positive constants.
Proof. The proof process is similar to Lemma 1, so it is omitted here.
Theorem 1. Under Conditions (1)–(6), we have
where $b_1$, $b_2$ and $b_3$ are constants. If $\log p=O(n^\alpha)$, $\log q=O(n^\beta)$, $\alpha<1-2\tau-6\xi-18\kappa$, $\beta<1-2\tau-10\xi-(18\bar m+18)\kappa$ and $\alpha+\beta<1-2\tau_\delta-18\kappa$, then GIMCSIS has the sure screening property.
Proof. Define four covariate sets as follows:
It is obvious that $(U,V)_D=U_D\cap V_D$ and $(U,V)_{\hat D}=U_{\hat D}\cap V_{\hat D}$.
According to Lemmas 1 and 4, we have
where $\tau_\delta=\max_{1\le l\le G_2}\tau_{\delta^*_l}$, and $b_1$, $b_2$ and $b_3$ are constants.