1. Introduction
Missing data are a common challenge across various research fields, including health studies, clinical trials, and longitudinal investigations. Such missingness can arise from multiple factors, such as the high cost of measuring critical response variables or logistical constraints that hinder data collection due to demographic or economic limitations. Missing values not only reduce the effective sample size, which in turn diminishes estimation efficiency, but also compromise the validity of standard complete-data analysis methods. To mitigate this issue, several statistical techniques have been developed, including multiple imputation, inverse probability weighting, and augmented inverse probability weighting (AIPW), all aimed at improving the robustness of estimation. Particularly when response data are missing at random (MAR), significant research has focused on developing methods to enhance both bias correction and estimation efficiency. References [1] and [2] offer comprehensive reviews of the theoretical advancements and practical applications in this area.
Regression-based prediction methods have become fundamental tools for improving accuracy in various application domains. Related efforts in predictive modeling have demonstrated success using deep regression frameworks for thermal control in photovoltaic systems [3], as well as hybrid associative classifiers incorporating support vector machines to enhance classification accuracy [4]. Theoretically competitive with these machine learning approaches, model averaging is a well-established statistical technique that improves prediction performance by combining multiple candidate models. By assigning weights based on model quality, model averaging aims to produce more reliable and robust estimates. In recent years, frequentist model averaging has gained considerable traction, with an expanding body of research exploring a variety of methodologies. Notable contributions include Mallows model averaging (MMA, [5]), jackknife model averaging (JMA, [6]), the heteroskedasticity-robust Cp criterion [7], and Kullback-Leibler loss model averaging [8]. In the context of missing data, model averaging has garnered significant attention. Schomaker et al. [9] introduced two model averaging approaches specifically tailored to handle missing data, while Dardanoni et al. [10] applied model averaging to navigate the bias-precision trade-off in linear regression models with missing covariates. Zhang [11] extended the MMA framework to cases where covariates are entirely missing at random, and Fang et al. [12] proposed a novel model averaging framework for fragmentary data. Liang and Wang [13] further contributed by developing a robust model averaging method for partially linear models with responses missing at random. Liang and Zhou [14] then proposed a new model averaging method based on the weighted generalized method of moments for missing response problems. More recently, Liang et al. [15] addressed optimal model averaging for partially linear models with responses missing at random and measurement error in some covariates.
In the realm of high-dimensional data analysis, significant progress has been made in extending model averaging techniques to accommodate the challenges posed by a growing number of predictors. Lu and Su [16] expanded the JMA criterion of [6] to quantile regression models, allowing the number of predictors to scale with the sample size. Zhang et al. [17] introduced a novel criterion for selecting weights, enabling the development of parsimonious model averaging estimators. In scenarios where the number of predictors increases exponentially with the sample size, Ando and Li [18] introduced a two-step model averaging method that combines marginal screening with JMA for ultrahigh-dimensional linear regression models. This approach was later expanded by [19] to accommodate generalized linear models. Cheng and Hansen [20] further explored model averaging procedures for ultrahigh-dimensional factor-augmented linear regression models using principal component analysis, while Chen et al. [21] investigated semiparametric model averaging methods for nonlinear dynamic time series regression models in similar settings. Lan et al. [22] introduced the sequential screening approach into model averaging for high-dimensional linear regression models. Despite these advancements, much of the research on model averaging for ultrahigh-dimensional data has been limited to complete data settings, leaving the complexities of missing data in high-dimensional contexts largely unaddressed.
This paper introduces a novel iterative model averaging (IMA) method that integrates an iterative screening procedure with model averaging to handle ultrahigh-dimensional regression models when some of the response data are subject to MAR. To address the missing responses, we develop a nonparametric multiple imputation procedure, which offers greater flexibility compared to the parametric assumptions of the MAR mechanism studied in [23]. The imputed response values are iteratively updated based on the residuals from prior iterations. This iterative screening process reduces the dominance of highly weighted predictors in earlier stages, thereby allowing additional relevant predictors to incrementally contribute to parameter estimation. The candidate model weights are determined using the Bayesian information criterion (BIC), enabling larger weights to be assigned to more influential predictors. This process also ensures that the method remains computationally feasible, even in ultrahigh-dimensional settings, by limiting each step to candidate models of size one. Under regularity conditions, we demonstrate that the proposed IMA method produces consistent estimators of regression coefficients and exhibits strong model-fitting performance. Our theoretical results are supported by several numerical studies, confirming the efficacy of the method in practice.
The paper is organized as follows: In Section 2, we outline the model setup and introduce the IMA procedure designed for regression models with missing responses. Section 3 presents the theoretical properties of the IMA procedure. Sections 4 and 5 provide a comprehensive evaluation of the method through extensive simulation studies and an application to a real-world dataset. Finally, Section 6 concludes the paper, and Section 7 discusses limitations and potential directions for future research. All technical proofs can be found in the Appendix.
2. Methodologies
2.1. Missing response imputation using multiple imputation
Let {(yi,Xi)⊤}ni=1 represent a set of n independent and identically distributed (i.i.d.) random samples, where yi denotes the response variable, and Xi=(xi1,…,xipn)⊤∈Rpn denotes the ultrahigh-dimensional covariates. Without loss of generality, it is assumed that E(xij)=0, and we consider the following linear regression relationship:
yi=X⊤iβ0+εi, i=1,…,n,
where β0=(β01,…,β0pn)⊤ is a pn-dimensional vector of regression coefficients, and εi are i.i.d. random errors with mean zero and finite variance σ2. Throughout this paper, the number of covariates pn is permitted to diverge with the sample size n, satisfying pn≫n, i.e., log(pn)=o(nα) for some constant α∈(0,1). In this framework, the response variables yi may be missing, while the covariates Xi are assumed to be fully observed. Thus, the dataset comprises the observations {(yi,Xi,δi)⊤}ni=1, where δi=1 indicates that yi is observed, and δi=0 otherwise. Let X=(X1,…,Xn)⊤=(x1,…,xpn)∈Rn×pn denote the design matrix, where xj represents the j-th column of covariates.
We assume that the missing data mechanism adheres to the MAR assumption, implying that the missingness indicator δi is conditionally independent of the response yi, given the covariates Xi. Formally, this can be expressed as Pr(δi=1|Xi,yi)=Pr(δi=1|Xi)≜π(Xi), where π(⋅) is the selection probability function that models the selection bias underlying the missingness pattern.
To overcome the "curse of dimensionality" frequently encountered in ultrahigh-dimensional data with missing responses, we assume that y⊥δ|xj for each j=1,…,pn (see [24]), where ⊥ stands for statistical independence. This assumption enables us to impute the missing values of yi by leveraging information from individual covariates xj, rather than relying on the full covariate vector X. The multiple imputation procedure for estimating missing responses is then defined as follows:
ˆyi=δiyi+(1−δi)∑pnj=1∑Kk=1˜y(j)ik/(pnK), i=1,…,n,
where K is the number of multiple imputations, and {˜y(j)ik}Kk=1 are K independent imputations for the missing value of yi, drawn from the estimated conditional distribution ˆF(˜y|xij). This conditional distribution ˆF(˜y|xij)=∑nl=1κ(j)ilI(yl≤˜y) is a kernel estimator of the true distribution F(˜y|xij), where κ(j)il=δlKh(xlj−xij)/∑nk=1δkKh(xkj−xij) is the kernel weight. Here, Kh(u)=K(u/h) is the kernel function with bandwidth h=hn, a positive smoothing parameter that tends to zero as n grows, and I(⋅) is the indicator function. It is important to note that the imputed values ˜y(j)ik have a discrete distribution, in which yl (when δl=1) is selected with probability κ(j)il. This structure ensures that the imputation procedure integrates information across covariates while maintaining computational efficiency in ultrahigh-dimensional settings.
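To make the imputation step concrete, the following is a minimal R sketch of the kernel-assisted multiple imputation for a single covariate, assuming a Gaussian kernel and a common bandwidth h; the function names and looping structure are illustrative rather than a verbatim transcription of our implementation.

```r
# Minimal sketch of the kernel-assisted multiple imputation step.
# delta[l] = 1 if y[l] is observed; a Gaussian kernel is assumed.
impute_one <- function(i, xj, y, delta, h, K = 30) {
  w <- delta * dnorm((xj - xj[i]) / h)               # kernel weights over observed responses
  kappa <- w / sum(w)                                # selection probabilities kappa_il^(j)
  sample(y, size = K, replace = TRUE, prob = kappa)  # K draws from F_hat(y | x_ij)
}

impute_responses <- function(X, y, delta, h, K = 30) {
  y_hat <- y
  for (i in which(delta == 0)) {
    draws <- sapply(seq_len(ncol(X)),
                    function(j) impute_one(i, X[, j], y, delta, h, K))
    y_hat[i] <- mean(draws)                          # average over the p_n * K imputations
  }
  y_hat                                              # working response vector Y_hat
}
```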
Remark 2.1. An alternative imputation method commonly used for handling missing response data is the AIPW method. In this approach, the estimated response value for the i-th individual is given by
As shown in Eq (2.1), the AIPW method involves imputing both the missing and the observed responses. Within the empirical likelihood framework, Tang and Qin [25] established the semiparametric efficiency of the AIPW estimator. In the setting of robust quantile regression, Chen et al. [26] further demonstrated that quantile estimators based on both ˆy∗i and ˆyi can achieve the semiparametric efficiency bound, provided that the error distributions are identical. Nevertheless, since the AIPW approach requires repeated imputation even for observed responses, it may lead to increased risk of overfitting, particularly in ultrahigh-dimensional scenarios where p≫n. In contrast, the proposed imputation method is computationally more efficient, as it avoids redundant imputation for observed values. By averaging over the imputed values, it offers a direct and practical solution for handling missing response data. In addition, the proposed multiple imputation method, facilitated by kernel-assisted imputation, is less sensitive to model misspecification and tends to be more robust in high-dimensional contexts.
2.2. Univariate model averaging with missing response
Let AF={1,…,pn} represent the full predictor index set, and let A={j1,…,jq}⊆AF be some nonempty subset of cardinality |A|=q≥1. The complement of A is denoted by Ac=AF∖A. Define XA=(xj:j∈A)∈Rn×|A| and β0A as the sub-matrix of the design matrix X and the sub-vector of β0 indexed by A, respectively. Similarly, we define XAc and β0Ac. The candidate model A with fully observed response data is given by
where XiA=(xij:j∈A)∈R|A| represents the covariates in Xi corresponding to the predictors in the candidate model A. A model averaging estimator ˆβ∗A of β0A in model (2.2) can be expressed as
where ˆβA is an estimator of βA, and ˆω∗A represents the weight assigned to model A, which can be estimated using the commonly applied BIC criterion.
The classical BIC-based model averaging procedures become computationally prohibitive in high-dimensional settings due to the cardinality of candidate models scaling exponentially with predictor dimensionality. To alleviate this issue, we propose a simplified BIC model averaging procedure that focuses on univariate candidate models (|A|=1), reducing the number of candidate models to pn. Without loss of generality, let the design matrix X be standardized such that for each j=1,…,pn, the j-th column of X satisfies n−1‖xj‖2=n−1x⊤jxj=1. For the j-th model, the BIC is defined as
where ˆY=(ˆy1,…,ˆyn)⊤ denotes the imputed response vector and ˆβj=x⊤jˆY/n is the estimated coefficient for the j-th model. The corresponding BIC-based model averaging estimator is then given by ˆβU=(ˆωU1ˆβ1,…,ˆωUpnˆβpn)⊤, where the BIC weight for the j-th model is defined as
with BIC0=nlog(‖ˆY‖2).
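For illustration, the univariate averaging step can be sketched in R as follows. Since only BIC0 is reproduced above, the sketch assumes the standard form BICj=nlog(‖ˆY−xjˆβj‖2)+log(n) for the one-parameter models, with weights proportional to exp(−BICj/2); these specific choices are assumptions made for the example.

```r
# Sketch of the univariate BIC model averaging step; the columns of X are
# assumed standardized so that sum(x_j^2) = n.
uma_bic <- function(X, y_hat) {
  n    <- nrow(X)
  beta <- drop(crossprod(X, y_hat)) / n                 # beta_hat_j = x_j' Y_hat / n
  rss  <- sum(y_hat^2) - n * beta^2                     # ||Y_hat - x_j * beta_hat_j||^2
  bic  <- n * log(rss) + log(n)                         # assumed BIC of model j
  bic0 <- n * log(sum(y_hat^2))                         # BIC of the null model
  w    <- exp(-0.5 * (c(bic0, bic) - min(bic0, bic)))   # stabilized BIC weights
  w    <- w / sum(w)                                    # normalize to sum to one
  list(w0 = w[1], wj = w[-1], beta = beta,
       beta_avg = w[-1] * beta)                         # entries omega_j * beta_hat_j
}
```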
2.3. Iterative model averaging with missing response
Univariate model averaging may lead to inaccurate predictions due to its reliance on a limited subset of the information pool, thereby ignoring the majority of potentially explanatory data features. A natural extension to improve prediction accuracy is to increase the candidate model size from 1 to 2, but this approach can become computationally expensive, particularly in high-dimensional settings, where the number of candidate models grows on the order of p2n. Motivated by the forward regression approach, we propose an iterative method that reduces the impact of heavily weighted predictors by updating the response at each iteration based on the residuals. This enables enhanced integration of auxiliary predictors into the inferential process, thereby improving the efficiency of parameter estimation. We designate this methodology as IMA. The detailed procedure is outlined below.
Let ˆY(1)=ˆY represent the initial response vector, and ˆY(m)=(ˆy(m)1,…,ˆy(m)n)⊤ represent the response vector at the m-th iteration. In the m-th iteration, we fit pn univariate models using the current response ˆY(m) and the covariates xj, j=1,…,pn. The corresponding estimator is given by ˆβmj=x⊤jˆY(m)/n. The BIC for the j-th model is calculated as
Although the covariates may still explain part of the response, so that the weight assigned to the null model is less than 1, the null model is nevertheless also fitted, giving BICm0=nlog(‖ˆY(m)‖2). The corresponding BIC model averaging weights for each candidate model are denoted as
Subsequently, the response vector is updated for the (m+1)-th iteration as ˆY(m+1)=ˆY(m)−Xˆβ(m),
where ˆβ(m)=(ˆωm1ˆβm1,…,ˆωmpnˆβmpn)⊤∈Rpn. Without loss of generality, this iterative process is assumed to cease after M iterations. The final iterative model averaging estimator is then given by ˆβM=∑Mm=1ˆβ(m).
The convergence criteria for this algorithm will be discussed later in this work. Let (y∗,X∗) be an independent copy of (yi,Xi) for some 1≤i≤n, and we can predict the value of y∗ by X⊤∗ˆβM.
Remark 2.2. Commonly used kernel functions include: (1) uniform kernel K(u)=(1/2)I(|u|≤1), (2) logistic kernel K(u)=e−u/(1+e−u)2, (3) Epanechnikov kernel K(u)=(3/4)(1−u2)I(|u|≤1), (4) triangular kernel K(u)=(1−|u|)I(|u|≤1), (5) biweight kernel K(u)=(15/16)(1−u2)2I(|u|≤1), and (6) Gaussian kernel K(u)=(2π)−1/2e−u2/2. Through a sensitivity analysis of kernel function choices in the numerical experiments presented in Section 4, we find that the proposed kernel-assisted IMA procedure exhibits robustness to the choice of kernel function. Therefore, for practical applications, we recommend employing the Gaussian kernel for multiple imputation, primarily due to its computational simplicity and widespread applicability.
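For completeness, the six kernels listed above can be written directly in base R:

```r
# Kernel functions from Remark 2.2, as functions of u.
kern_uniform      <- function(u) 0.5 * (abs(u) <= 1)
kern_logistic     <- function(u) exp(-u) / (1 + exp(-u))^2
kern_epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
kern_triangular   <- function(u) (1 - abs(u)) * (abs(u) <= 1)
kern_biweight     <- function(u) (15 / 16) * (1 - u^2)^2 * (abs(u) <= 1)
kern_gaussian     <- function(u) dnorm(u)  # (2*pi)^(-1/2) * exp(-u^2/2)
```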
2.4. Algorithmic steps of the IMA procedure
The IMA algorithm with missing response is designed to estimate regression coefficients in high-dimensional settings where the response variable may be partially missing. The procedure begins with a training dataset {(yi,Xi,δi)⊤}ni=1, where δi indicates whether yi is observed. For observations with missing responses (δi=0), the method employs a kernel-based nonparametric imputation strategy. Specifically, for each covariate xij, imputed response values {˜y(j)ik}Kk=1 are sampled from an estimated conditional distribution ˆF(˜y|xij), constructed using a kernel smoothing technique over observed data.
The IMA procedure then proceeds, starting from the initialization of the iteration index m=0 and the weight ˆω00=0. In each iteration, the null model's raw weight is computed based on the norm of the current response vector ˆY(m). For each covariate xj, a univariate regression is fitted, yielding the coefficient estimate ˆβmj, and the associated raw weight ˆωmj is calculated based on the BIC, which accounts for model complexity.
All raw weights, including that of the null model, are normalized to ensure they sum to one. The iteration produces a weighted average estimator ˆβ(m), which is then used to update the response vector for the subsequent iteration via residual adjustment: ˆY(m+1)=ˆY(m)−Xˆβ(m). This iterative process continues until the change in the null model's weight across successive iterations falls below a pre-specified threshold ϵ, ensuring convergence.
Finally, the cumulative model averaging estimator is obtained as ˆβM=∑Mm=1ˆβ(m), where M denotes the total number of iterations. The IMA algorithm effectively integrates imputation with adaptive model averaging to address challenges arising from missing data in high-dimensional regression problems. The detailed algorithmic steps are outlined in Algorithm 1.
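The core loop of Algorithm 1 can be sketched in R as follows, reusing the univariate averaging step uma_bic() sketched after Section 2.2; the hyperparameter names (eps, max_iter) and the bookkeeping details are illustrative rather than a verbatim transcription of Algorithm 1.

```r
# Sketch of the IMA iterations: residual updating plus cumulative averaging.
ima_fit <- function(X, y_hat, eps = 1e-4, max_iter = 100) {
  beta_M <- rep(0, ncol(X))                 # cumulative IMA estimator
  y_cur  <- y_hat                           # Y^(1) = Y_hat
  w0_old <- 0
  for (m in seq_len(max_iter)) {
    step   <- uma_bic(X, y_cur)             # univariate BIC averaging at step m
    beta_m <- step$beta_avg                 # beta^(m) = (w_m1 * b_m1, ..., w_mp * b_mp)
    beta_M <- beta_M + beta_m
    y_cur  <- y_cur - drop(X %*% beta_m)    # Y^(m+1) = Y^(m) - X beta^(m)
    # stop once the null-model weight stabilizes across successive iterations
    if (m > 1 && (step$w0 - w0_old) / w0_old < eps) break
    w0_old <- step$w0
  }
  beta_M
}
```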
As demonstrated in Algorithm 1, the proposed IMA procedure is fundamentally data-driven and can be effectively implemented using the observed dataset {(yi,Xi,δi)⊤}ni=1. Nevertheless, several hyperparameters, which are recommended based on our empirical findings in Section 4, play a crucial role in enhancing computational efficiency. These are summarized in Table 1.
3. Theoretical properties
To establish the theoretical properties of the proposed IMA procedure, we introduce the following notations and conditions. Let β2(1)≥β2(2)≥…≥β2(pn) and ˆωU(1)≥ˆωU(2)≥…≥ˆωU(pn) be the ordered statistics of {β20j:1≤j≤pn} and {ˆωUj:1≤j≤pn}, respectively.
Condition 1: The missing propensity function π(X)=Pr(δ=1|X) and the probability density function fj(x) of xj (j=1,…,pn) both have continuous and bounded second-order derivatives over the support Xj of xj. Additionally, there exists a constant C0>0 such that infxπ(x)≥C0.
Condition 2: The kernel function K(⋅) is a probability density function satisfying: (ⅰ) It is bounded and has compact support; (ⅱ) It is symmetric with ∫t2K(t)dt<∞; (ⅲ) There exists a constant C1>0 such that K(t)≥C1 in a closed interval centered at zero. Furthermore, the smoothing bandwidth h satisfies nh→∞ and √nh2→0 as n→∞.
Condition 3: The random variables xj, y, and xjy satisfy a sub-exponential tail probability uniformly in pn. That is, there exists a constant u0>0 such that for all 0<u≤2u0,
Condition 4: There exist positive constants C2 and C3, independent of n and pn, such that maxj|β0j|≤C2, and max{maxj1≠j2|σj1j2|,maxj|corr(xj,y)|}≤C3<1.
Condition 5: Let β2m(1)≥β2m(2)≥…≥β2m(pn) be the ordered statistics of {β20mj:1≤j≤pn} for any m=1,…,M. There exist positive constants C5 and C6, independent of n and pn, such that max1≤m≤M|βm(1)|≤C5 and max1≤m≤Mmaxj|corr(xj,y(m))|≤C6<1.
Condition 6: There exists a sequence An such that (i) |An|→∞ and |An|/n→0 as n→∞; (ii) |Acn|‖β0Ac‖2→0 as M,n→∞; (iii) |Acn|maxm>1∑j∈Acnˆω2mj→0 as n→∞.
Condition 7: There exist positive constants C8 and C9 such that the eigenvalues of the covariance matrix Cov(X) satisfy C8≤λmin[Cov(X)]≤λmax[Cov(X)]≤C9, where λmin(A) and λmax(A) represent the smallest and largest eigenvalues of the matrix A, respectively.
Conditions 1 and 2 impose standard assumptions on the missing data propensity function π(X) and the probability density functions fj(x), which are generally satisfied by commonly used distributions. Condition 3 holds under widely employed distributions, such as the multivariate Gaussian. Similar conditions have been discussed in the work of [27]. Condition 4 places restrictions on the parameters to prevent perfect correlation between any predictor and the response, which aligns with the assumptions made in [28] and [22]. Condition 5 extends these assumptions to each iterative step. Condition 6, adapted from [22], is essential for establishing the asymptotic properties of ˆβM as M,n→∞. Lastly, Condition 7 ensures that the design matrix behaves well, similar to the assumptions made by [29].
Theorem 3.1. Under Conditions 1–4, suppose that β2(1)−β2(2)>C4 for some positive constant C4. Then we have ˆωU(1)→p1 as n→∞, where ˆωU(1) is the weight assigned to the univariate model averaging of the covariate with the largest absolute marginal regression coefficient.
This theorem establishes that the information used for estimation and prediction is dominated by the covariate most strongly correlated with the response. Moreover, Theorem 3.1 can easily be extended to assign nearly equal weights to the two predictors most strongly correlated with the response, specifically when β2(1)=β2(2)+o(1)>β2(3), while the contributions from the other predictors remain negligible.
Theorem 3.2. For m≥1, we have
The first part of Theorem 3.2 ensures that the proposed IMA procedure can progressively approximate the true regression relationship within a limited number of iterations. In particular, the residual sum of squares ‖ˆY(m)‖2 decreases monotonically with each iteration, which reflects the method's strong fitting capability. The second part of Theorem 3.2 indicates that the IMA framework effectively mitigates the risk of overfitting. When no informative predictors are selected during the m-th iteration, the weight ˆωm0 approaches 1, implying that the procedure primarily relies on the baseline model and avoids spurious fitting. Conversely, if significant predictors are present, the largest coefficient ˆβ2m(1) and its corresponding weight ˆωm(1) become dominant, while ˆωm0 shrinks toward 0. Consequently, the upper bound in part (ⅱ) of Theorem 3.2 demonstrates that the IMA procedure not only maintains fitting accuracy but also suppresses excessive model complexity, thus effectively alleviating overfitting.
Theorem 3.3. Under Conditions 1–3 and 5, for those values of m satisfying β2m(1)−β2m(2)>C7 with some positive constant C7, we have ˆωm(1)→p1 as n→∞, where ˆωm(1)≥…≥ˆωm(pn) denote the ordered statistics of {ˆωmj:1≤j≤pn}. In addition, for those values of m satisfying β2m(1)=O(n−γ), γ>0, we have ˆωm0→p1 as n→∞.
The first part of this theorem indicates that the largest weight in each iterative step will be assigned to the most significant coefficient. The second part of this theorem implies that the weight ˆωm0 increases with m. Therefore, the IMA algorithm can be stopped if {ˆω(m+1)0−ˆωm0}/ˆωm0<ϵ for some small ϵ>0. In our simulations, we set ϵ=0.0001.
Theorem 3.4. Under Conditions 1–3 and 6–7, we have ‖ˆβM−β0‖→p0 as n→∞.
This theorem demonstrates that the IMA procedure provides a consistent estimator of β0. Accordingly, ˆy∗=X⊤∗ˆβM is a consistent estimator of X⊤∗β0 for the given X∗. This implies that E{‖X⊤∗ˆβM−X⊤∗β0‖2}→0 as n→∞.
Remark 3.1. In fields such as genomics and epidemiology, high-dimensional datasets typically consist of thousands of feature variables, while the response variable may be subject to MAR. The proposed iterative model averaging procedure effectively mitigates prediction risks in such ultrahigh-dimensional settings, as demonstrated in Theorem 3.4. This makes the proposed approach highly applicable in the analysis of genomic data and disease prediction.
4. Simulation studies
In this section, several simulation studies are conducted to evaluate the finite-sample performance of the proposed estimation procedure, referred to hereafter as the IMA method. The simulation is based on 200 independent replications, each employing a sample size of n=100 and two feature dimensions, pn=1000 and 3000. For the t-th replication, the data are represented as {Y[t],X[t],δ[t]}, where Y[t]∈Rn, X[t]∈Rn×pn, and δ[t]∈Rn. Based on this dataset, the corresponding IMA estimator ˆβM[t] is computed. To evaluate the estimation accuracy of our proposed method and compare it with several alternative approaches, we employ the following mean squared error (MSE) criterion:
MSE[t]=(1/100)∑100i=1(X⊤∗iˆβM[t]−X⊤∗iβ0)2,
where β0∈Rpn denotes the true parameter vector, and {X∗i}100i=1 represents an independent test dataset. The overall performance metric is obtained by calculating the median of the replication-specific MSE values, MMSE=median{MSE[1],…,MSE[200]}.
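In code, the evaluation metric for a single replication amounts to the following small helper (X_star denotes the test design matrix; the name is illustrative), with the median over the 200 replication-specific values reported as the MMSE.

```r
# Squared prediction error on the test design for one replication.
mse_rep <- function(X_star, beta_hat, beta0) mean((X_star %*% (beta_hat - beta0))^2)
```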
This comprehensive evaluation framework ensures a rigorous assessment of estimation accuracy, where smaller MSE values indicate superior performance of the estimation method.
In addition to the proposed IMA method, we consider several alternative methods for comparison. These include the sequential model averaging procedure for the complete-case setting (denoted as the "CC" method, [22]), the sequential model averaging procedure without adjustments for missing data (denoted as the "FULL" method, [22]), weighted model averaging with the cross-validation (CV) method, without the constraint ∑Ss=1ωs=1 (denoted as "WMCV", [23]), and the weighted model averaging with CV method, incorporating the constraint ∑Ss=1ωs=1 (denoted as "WMCV1", [23]). Here, S denotes the number of candidate models. Additionally, we consider the BIC model averaging procedure applied to complete-case data (denoted as "MBIC").
To implement the last three methods, the predictors are grouped according to their marginal multiple-imputation sure independence screening utility [23], retaining the top 100 predictors. Subsequently, we set S=10, resulting in a set of 10 candidate models, each with 10 predictors. The performance of these methods is then evaluated by replacing X⊤∗iˆβM[t] in Eq (4.1) with the weighted sum ∑Ss=1ˆωs[t]X⊤∗iˆβs[t], where ˆωs[t] and ˆβs[t] denote the estimators for the s-th candidate model in the t-th replication.
Experiment 4.1. This experiment is adapted from [18], where the true number of regressors, d, is set to 50. The considered linear regression model is as follows:
yi=X⊤iβ0+εi, i=1,…,n,
where β0=(β01,…,β0pn)⊤ is the vector of true regression coefficients, and Xi=(xi1,…,xipn)⊤ represents the pn vector of predictors. The noise terms, εi, are independent of the predictors. The covariate vector Xi is generated from a multivariate normal distribution with mean 0, and the covariance between xj1 and xj2 is given by Cov(xj1,xj2)=ρ|j1−j2|, 1≤j1,j2≤pn, where ρ represents the correlation parameter and takes values of 0 and 0.5, corresponding to low and moderate correlation scenarios, respectively. The true regressors xj are spaced evenly across the predictor vector, with j=pn(k−1)/d+1, for k=1,…,d. The nonzero entries of β0 are generated from a normal distribution with a mean of 0 and a standard deviation of 0.5. Two error distributions are considered: (i) the standard normal distribution and (ii) the Student's t distribution with 3 degrees of freedom. It is assumed that the predictor values Xi are fully observed, but the response values yi are subject to missingness. To simulate missing data for yi, the missingness indicator δi is generated from a Bernoulli distribution with a probability π(Xi)=π(xi1)=Pr(δi=1|xi1). The following three missingness mechanisms are considered:
● M1: π(xi1)=(0.3+0.175|xi1|)I(|xi1|<4)+I(|xi1|≥4). This mechanism induces a missingness pattern that depends on the absolute value of xi1, with truncation at |xi1|≥4.
● M2: π(xi1)=Φ(0.5+3xi1), where Φ(⋅) denotes the cumulative distribution function of the standard normal distribution. This mechanism introduces a monotonic relationship between the probability of missingness and the value of xi1.
● M3: logit(π(xi1))=0.5+3xi1, where the logit function is applied to model the probability of missingness. This mechanism results in a linear relationship between xi1 and the log-odds of missingness.
The average proportions of missing data for the three mechanisms are approximately 56%, 44%, and 44%, respectively. For each of these settings, the number of multiple imputations is set to K=30 for each missing yi.
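A small R sketch of the three missingness mechanisms (base R only; x1 denotes the first covariate column) is given below.

```r
# Generate the missingness indicators delta_i under mechanisms M1-M3;
# delta_i = 1 means y_i is observed.
gen_delta <- function(x1, mechanism = c("M1", "M2", "M3")) {
  mechanism <- match.arg(mechanism)
  pi_x <- switch(mechanism,
    M1 = (0.3 + 0.175 * abs(x1)) * (abs(x1) < 4) + (abs(x1) >= 4),
    M2 = pnorm(0.5 + 3 * x1),    # probit form
    M3 = plogis(0.5 + 3 * x1))   # logit form
  rbinom(length(x1), size = 1, prob = pi_x)
}
```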
Experiment 4.2. This experiment aims to investigate the impact of the proposed method on different sparsity structures within a regression model, following a framework modified from [30]. Specifically, the covariate vector Xi=(xi1,…,xipn)⊤ is generated from a multivariate normal distribution with mean vector 0 and covariance matrix Σ, where the entries of Σ are defined as Cov(xj1,xj2)=0.5|j1−j2|, 1≤j1,j2≤pn. The true regression coefficients, β0=(β01,…,β0pn)⊤, are specified such that β0j=(−1)j×0.5, 1≤j≤d, and β0j=0 for j>d, where d represents the number of true nonzero coefficients. The error terms, εi, are generated from the standard normal distribution with a mean of 0 and a standard deviation of 1, and are independent of the covariates xij for j=1,…,pn. In alignment with Experiment 4.1, the covariate vectors Xi are assumed to be fully observed, while the response values yi are subject to missingness. The missingness indicator δi is generated from a Bernoulli distribution with probability π(Xi)=π(xi1)=Pr(δi=1|xi1), where π(xi1) follows the same specifications as in Experiment 4.1. The three considered missingness mechanisms yield average missing proportions of approximately 56%, 44%, and 44%, respectively. The simulation results for d=20 are presented in this experiment.
The Gaussian kernel function is employed in the imputation procedure, and the optimal bandwidth for kernel density estimation is selected via the cross-validation method implemented in the kedd package in R. The results from the two experiments with pn=1000 and 3000 are presented in Figures 1–4. The black dash-dotted line indicates the median MSE value (MMSE) of the proposed method and is included for direct comparison. Additionally, simulation results for Experiment 4.2 with pn=1000 using the Epanechnikov and biweight kernels are presented in Figure 5.
A closer examination of Figures 1–5 yields the following observations: (ⅰ) The proposed IMA procedure consistently outperforms other model averaging methods, including CC, WMCV, WMCV1, and MBIC, across most settings. This outcome underscores the effectiveness of the proposed multiple imputation techniques. (ⅱ) As anticipated, the "FULL" sequential model averaging method of [22] performs best across all scenarios, as it relies on completely observed data. (ⅲ) The IMA procedure exhibits robust performance across various missing data mechanisms, indicating its relative insensitivity to changes in the missingness pattern. (ⅳ) The proposed method shows reduced sensitivity to the sparsity of the regression model coefficients. (ⅴ) The proposed kernel-assisted IMA approach exhibits robustness to the choice of kernel function. In conclusion, the IMA procedure consistently outperforms the competing methods, yielding the smallest median MSE values.
To further assess the impact of the covariate vector distribution on prediction accuracy, we consider the multivariate skew-normal distribution as studied in [31]. Specifically, the covariate vector Xi=(xi1,…,xi,pn)⊤ is generated from the multivariate skew-normal distribution SNpn(Σ,α), where the entries of Σ are defined by Cov(xj1,xj2)=0.5|j1−j2|, for 1≤j1,j2≤pn, and α is the shape parameter controlling the skewness of the distribution. When α=0, the multivariate skew-normal distribution SNpn(Σ,α) reduces to the multivariate normal distribution Npn(0,Σ). In contrast, when α≠0, the distribution remains skew-normal. The simulated MSE values from Experiment 4.2, where α=(1,−1,1,−1,…,1,−1)⊤ and the Gaussian kernel function is used, are presented in Figure 6. A careful examination of Figure 6 reveals that the proposed method consistently outperforms WMCV, WMCV1, and MBIC across all scenarios. However, for the cases where pn=1000 and the missingness mechanisms are M2 and M3, the CC method appears to be comparable to the proposed IMA method. Nevertheless, for all other scenarios, the proposed estimation procedure outperforms the CC method. These findings suggest that the performance of the CC method is particularly sensitive to the missingness rate when the skew-normal distribution is considered, as mechanisms M2 and M3 exhibit relatively lower missing rates compared to M1. This further supports the efficiency and robustness of the proposed IMA method based on multiple imputation.
All numerical experiments are conducted on a personal computer equipped with an Intel(R) Core(TM) i7-10875H CPU running at 2.30 GHz, featuring 8 physical cores and 16 threads. The machine is configured with 16.0 GB of RAM. The computations are performed using R software, version 4.4.1. This hardware and software configuration provides adequate computational power for implementing and evaluating the proposed algorithms, especially in the context of high-dimensional simulation and iterative model fitting. The average computation times for the six methods in Experiment 4.2 are presented in Table 2.
As shown in Table 2, the average computation time for the IMA method is 47.04 seconds for pn=1000, increasing to 236.40 seconds for pn=3000. While the computational time increases with pn, these results highlight the robustness and scalability of the IMA method in high-dimensional modeling scenarios. Notably, when compared with other methods such as CC, WMCV, WMCV1, and MBIC, the IMA method achieves an optimal balance between computational efficiency and model performance, making it particularly suitable for ultrahigh-dimensional settings. Given its robustness, scalability, and competitive performance, the IMA method is recommended for applications that require precise model averaging and prediction in high-dimensional contexts.
Experiment 4.3. The primary aim of this experiment is to evaluate the effectiveness of the proposed stopping rule in ensuring robust predictive performance. For illustration, we focus on the setup from Experiment 4.2, where n=100, pn=1000, and d=20. Let M=100, and define the stopping criterion at the m-th iteration as epsm={ˆω(m+1)0−ˆωm0}/ˆωm0, for m=1,…,M. Additionally, we examine the evolution of the coefficients by plotting maxj|ˆβ(m)j| as a function of the iteration number, m=1,…,M, to determine whether the estimated coefficients ˆβ(m)j become negligible as m increases. The corresponding plots of epsm and maxj|ˆβ(m)j| are presented in Figure 7 for the missingness mechanisms M1–M3.
Examination of Figure 7 reveals the following key observations: (i) The upper three panels demonstrate that the values of epsm tend to stabilize for m>20 across all three missingness mechanisms. From this, we conservatively conclude that updating the IMA procedure for up to 20 steps is sufficient to achieve reliable predictive performance for all mechanisms (M1–M3) in this experiment. (ii) The lower three panels show that maxj|ˆβ(m)j| decreases rapidly toward zero for all missingness mechanisms, indicating that the influence of predictors diminishes significantly after approximately 20 steps.
In summary, these findings provide strong evidence that the proposed stopping rule ensures stable and accurate predictions after a relatively small number of iterations, thereby confirming the efficiency of the procedure.
5. Real data analysis
In this section, we illustrate the application of the proposed IMA procedure using a gene expression dataset related to Bardet-Biedl Syndrome [32]. This dataset comprises 120 twelve-week-old male rats, each characterized by 31,042 distinct probe sets, selected for tissue harvesting from the eyes and subsequent microarray analysis. Following the approaches of [32], 18,976 probes were deemed "sufficiently variable" based on expression quantitative trait locus (eQTL) mapping, exhibiting at least two-fold variation. For our analysis, all 18,976 probes were standardized to have zero mean and unit variance. Chiang et al. [33] identified the gene TRIM32 at probe 1389163_at as critical to Bardet-Biedl Syndrome. In this study, the response variable y corresponds to probe 1389163_at, while the predictors xj represent the remaining probes. The dataset presents a challenging scenario, as the sample size (n=120) is small compared to the dimensionality (pn= 18,975).
The primary goal of this analysis is to evaluate the prediction performance of the proposed IMA procedure in the presence of missing response data. To achieve this, we artificially introduce missingness under a MAR mechanism defined as
where γ=(γ0,γ1)⊤, and γ0 is the intercept term. The covariate x6329 is selected by ranking all 18,975 probes according to the absolute value of their marginal correlations with the expression of probe 1389163_at, retaining the top-ranked probe as the covariate for the missingness mechanism. The true parameter values are set as γ=(−0.5,0.8)⊤, resulting in an approximate missing proportion of 42%.
To assess the predictive performance of various methods, the complete dataset is randomly partitioned into a training set of size 80 and a validation set of size 40, with this process repeated 100 times. For implementing the proposed method, we adopt the Gaussian kernel function K(u)=exp(−u2/2)/(2π)1/2 and perform K=30 multiple imputations for each missing response. The bandwidth parameter is determined using the kedd package in R, following the approach taken in Experiment 4.1. The proposed IMA method is then applied to the training set, and the predictive performance of the resulting model is evaluated by calculating the prediction error (PE) on the validation set for each of the 100 replications:
PE=(1/n1)∑i∈Vδi(yi−ˆμi)2,
where ˆμi denotes the estimated mean of the response variable yi, n1=∑i∈Vδi, and V represents the set of indices corresponding to the observations in the validation set. A detailed summary of the dataset characteristics and experimental design is provided in Table 3.
In this artificially generated missing-response scenario, we include an evaluation of four alternative methods, namely WMCV, WMCV1, MBIC, and CC, all of which were introduced and analyzed in the previous simulation studies. For the last three methods, genes are first ranked according to the marginal screening utility proposed by [23]. The top 200 genes are retained, and S=20 candidate models are subsequently constructed, each containing 10 genes. The prediction errors on the validation data are illustrated in Figure 8. Examination of Figure 8 clearly demonstrates that the proposed IMA procedure achieves superior predictive efficiency compared to the other four methods, as evidenced by the smallest PE values.
6. Conclusions
This paper investigated the prediction problem in ultrahigh-dimensional linear regression models with missing responses under the assumption of missing at random. To address the missing response issue, we proposed an effective multiple-imputation procedure for handling the missing data. Additionally, we introduced an IMA procedure that combines iterative screening and model averaging techniques to enhance prediction accuracy. The proposed approach alleviates overfitting and provides consistent estimators for the regression coefficients, ensuring reliable and accurate predictions. Through simulation studies and a real data example, we validated the performance of the proposed IMA procedure. Our results demonstrated that the proposed method outperforms existing approaches, including conventional model averaging techniques, in terms of predictive accuracy and robustness.
7. Discussion and future work
While the current model averaging method assumes that missing responses follow a MAR mechanism, many real-world applications involve missing data mechanisms where the missingness is dependent on the unobserved values themselves, leading to a missing not at random (MNAR) scenario. In practical applications, testing the assumption of MAR versus MNAR is crucial to ensure the applicability of our method. In ultrahigh-dimensional settings, the Pearson Chi-square test statistic developed by [34] can be employed to identify key features in the missingness data models. Subsequently, the score test techniques proposed by [35] can be utilized to assess the type of missingness mechanism. If the missingness mechanism is identified as MAR, the proposed multiple imputation and iterative model averaging approach can be applied to achieve robust statistical predictions. However, when the missingness mechanism is MNAR, it presents significant challenges for valid inference, including issues related to the identification of population parameters and the development of appropriate multiple imputation methods. Additionally, Condition 3 in Section 3 imposes a sub-exponential tail assumption on covariates to guarantee model stability and prevent overfitting. Relaxing this condition remains an important avenue for future research to enhance the generalizability of the IMA method to data with heavier tails. These considerations suggest potential directions for enhancing the current framework to better accommodate more complex missing data mechanisms and distributional settings.
Author contributions
Xianwen Ding: Writing–original draft, Formal analysis, Investigation, Methodology, Visualization, Software; Tong Su: Conceptualization, Writing–original draft, Formal analysis, Methodology, Project administration; Yunqi Zhang: Conceptualization, Funding acquisition, Validation, Resources, Writing–review & editing. All authors have read and agreed to the published version of the manuscript.
Use of Generative-AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 12001244, 12426666, 12426668, 12426409), the Major Basic Research Project of the Natural Science Foundation of the Jiangsu Higher Education Institutions (grant No. 19KJB110007), the Yunnan Fundamental Research Projects (grant No. 202401AU070212), the Youth Project of Yunnan Xingdian Talent Program, and the Zhongwu Young Teachers Program for the Innovative Talents of Jiangsu University of Technology.
Conflict of interest
All authors declare no conflicts of interest in this paper.
Appendix
Lemma A.1. Under Conditions 1–3, for any ν∈(0,1/2), constant c1>0, and 1≤j≤pn, there exist some positive constants c2 and c3 such that
Proof of Lemma A.1. Let hj(x)=E(xjy|xj=x), and ˆhj(x)=∑nl=1δlKh(x−xlj)xyl/∑nl=1δlKh(x−xlj) for each j=1,…,pn. Then
where
Under the assumption y⊥δ|xj for each j=1,…,pn, it follows from the proof of Lemma A.1 in [36] that
Also,
where ηj(x)=π(x)fj(x) and ˆηj(x)=∑nl=1δlKh(xlj−x)/n. Then
According to Lemma S3 of [27], under Condition 2, we have
where C′>0 is any positive constant, and c4 and c5 are some positive constants. Then, under Condition 1, by taking C′=c6n(1−2ν)/3 for some positive constant c6, we obtain
where c7, c8, and c9 are some positive constants, and the last inequality holds due to Hoeffding's inequality. By some similar arguments, it can be shown that J2 follows a similar bound. Hence, we complete the proof of this lemma. □
Lemma A.2. Under Conditions 1–4, for any ν∈(0,1/2) and constant c′1>0, there exist positive constants c′2 and c′3 such that
Proof of Lemma A.2. For any ν∈(0,1/2) and constant c′4, since maxj|β0j|≤C2, there exist positive constants c′5 and c′6 such that
where the last inequality holds due to Lemma A.1. This shows that maxj|ˆβj| is bounded in probability. For each j=1,…,pn, notice that
For the first term, we have
where c′7 and c′8 are positive constants. We next deal with the second term:
where c′9 and c′10 are some positive constants. Let c′3=c′7+c′9 and c′2=min{2c′8,2c′10}. Hence,
This completes the proof. □
Proof of Theorem 3.1. By the definition of ˆωj, we have
It can be shown that ˆωj is a monotone increasing function of ˆβ2j. For the sake of convenience, let BIC(1) be the BIC score associated with ˆβ2(1). Then we have
To show ˆωU(1)→p1, it suffices to show
First, for Eq (A.1), notice that
Furthermore, by condition β2(1)−β2(2)>C4, it is noteworthy that
By Lemma A.2, we know that Pr(maxj|ˆβ2j−β20j|≥c′1n−ν)→p0. This, together with Eq (A.3), implies that with probability tending to one, we have ˆβ2(1)−ˆβ2(2)>C4/2. From the proof of Lemma A.2, we know maxj|ˆβj| is bounded in probability. Under Condition 2 and the MAR assumption, for any C″>0, there exist positive constants b1 and b2 such that
where ˆm(X)=∑pnj=1∑Kk=1˜y(j)ik/(pnK), and the last inequality holds due to the Markov inequality. Then ‖ˆY‖2/n is bounded in probability. Thus, under Condition 4, it follows that nˆβ2(1)/‖ˆY‖2<C3<1. Combining Eq (A.3) and Condition 4, we have
Analogously, we can prove that Eq (A.2) holds. As a result, ˆωU(1)→p1. □
Proof of Theorem 3.2. Let Hj=xjx⊤j/n. Notice that
where ˆβ(m)=(ˆωm1ˆβm1,…,ˆωmpnˆβmpn)⊤. Then we have
Note that |x⊤j1xj2|≤n for any 1≤j1,j2≤pn due to the fact that ‖xj‖2=n. This suggests that
Therefore, we have
Now, by the definition of ˆY(m+1), we have
Notice that ˆβ2mj≤ˆβ2m(1) for each j=1,…,pn. Then
□
Lemma A.3. Under Conditions 1–4, for every j∈{1,2,…,pn}, we have maxj|ˆβ2mj−β2mj|→p0 for any m=1,…,M.
Proof of Lemma A.3. We prove this lemma for general m by induction. For simplicity, we only show that the result holds for m=2, assuming it holds for m=1; note that the case m=1 can be directly derived from Lemma A.2. Recall that β(1)0=(ω11β11,…,ω1pnβ1pn)⊤∈Rpn and ˆβ(1)=(ˆω11ˆβ11,…,ˆω1pnˆβ1pn)⊤∈Rpn. From the definition of ˆβ2j, we have
From the proof of Lemma A.1, to prove ˆβ2j→pβ2j for every j∈{1,2,…,pn}, it suffices to show the following result:
By Hölder's inequality, we have
where ˆσj1j2=x⊤j1xj2/(‖xj1‖‖xj2‖). By the results of Proposition 1 in [37], under Condition 2, we obtain that maxj1,j2|ˆσj1j2−σj1j2|→p0. We have maxj1,j2|ˆσj1j2|=Op(1). Let ω1(1)≥ω1(2)≥…≥ω1(pn) be the ordered statistics of {ω1j,1≤j≤pn} and ˆω1(j) be the corresponding estimators for j=1,2,…,pn. By the result of Theorem 3.1, there exists a positive constant ξ<1 such that
Thus, we have
By the result of Lemma A.1, we have |β(1)0−ˆβ(1)|1→p0. Hence, ˆβ2j→pβ2j. The remaining steps are similar to those of Lemma A.2. We omit them here. This completes the proof of Lemma A.3.□
Proof of Theorem 3.3. From the proof of Theorem 3.1, it also can be shown that ˆωmj is a monotone increasing function of ˆβ2mj. Then
Then, to demonstrate ˆωm(1)→p1, it suffices to show
Following the proof of Theorem 3.1, and under Conditions 1–5, the first part of Theorem 3.3 is established.
We now show the second part of the theorem. Notice that
Similar to the proof of Theorem 3.1, it is straightforward to show ‖ˆY(m)‖2/n is bounded in probability. Thus, by assuming β2m(1)=O(n−γ), γ>0, and using the results of Lemma A.3, we have (1−nˆβ2mj/‖ˆY(m)‖2)−n/2=Op(1). As a result, (√npn)−1∑pnj=1(1−nˆβ2mj/‖ˆY(m)‖2)−n/2→p0, which implies that ˆωm0→p1. Hence, we have completed the proof. □
Proof of Theorem 3.4. Let Λ=β0−ˆβM, and we have ‖Λ‖2=‖ΛAn‖2+‖ΛAcn‖2. To prove ‖ˆβM−β0‖→p0, it suffices to show ‖ΛAn‖→p0 and ‖ΛAcn‖→p0. Notice that
Furthermore, by Condition 6(ⅲ) and the proof of Theorem 3.1, it is noteworthy that
By Condition 6(ⅱ), together with Eq (A.4), it follows that ‖ΛAcn‖→p0. Next we will show ‖ΛAn‖→p0. By Condition 7 and Lemma 3 in [22], we have
where the second inequality holds due to Hölder's inequality. This leads to
Notice that ˆY(M+1)=ˆY(M)−Xˆβ(M)=ˆY−Xβ0+XΛ. Then, by the triangle inequality, we have
For the first term of the right-hand side of the above inequality, we have
By the Bonferroni inequality and Lemma A.3 in [38], we obtain I1→p0. From the proof of Lemma A.1, for each j=1,…,pn, we have
where 0<r<1/2. Then, by Condition 6(ⅰ),
As a result, the term |An|1/2maxj{n−1|x⊤j(ˆY−Xβ0)|}→p0. Similar to the proof of Theorem 5 in [22], by Conditions 1–3, we can show that maxj(n−1|x⊤jˆY(M+1)|)→p0. This together with Lemma A.1 in [36] proves that x⊤jˆY(M+1)/√n=Op(1) for each j=1,…,pn. Then we have |An|1/2maxj(n−1|x⊤jˆY(M+1)|)→p0. Accordingly,
On the other hand, by the Cauchy-Schwarz inequality and ‖ΛAcn‖→p0, we obtain
This in conjunction with Eq (A.6) implies that |An|1/2maxj(n−1|x⊤jXAnΛAn|)→p0. Then we obtain that Eq (A.5) →p0. Therefore, we have that ‖Λ‖=‖β0−ˆβM‖→p0, which completes the proof. □