Research article

An efficient iterative model averaging framework for ultrahigh-dimensional linear regression models with missing data

  • Received: 16 April 2025 Revised: 02 June 2025 Accepted: 06 June 2025 Published: 16 June 2025
  • MSC : 62D10

  • This paper addresses the prediction problem in linear regression models with ultrahigh-dimensional covariates and missing response data. Assuming a missing at random mechanism, we introduced a novel nonparametric multiple imputation method to handle missing response values. Based on these imputed responses, we proposed an efficient iterative model averaging method that integrates an iterative screening process within a model averaging framework. The weights for the candidate models were determined using the Bayesian information criterion, ensuring an optimal balance between model fit and complexity. The computational feasibility of the proposed approach stems from its iterative structure, which significantly reduces the computational burden compared to conventional methods. Under certain regularity conditions, we demonstrated that the proposed method effectively mitigates the risk of overfitting and yields consistent estimators for the regression coefficients. Simulation studies and a real-world data application illustrate the practical efficacy of the proposed approach, showing its superior performance in terms of predictive accuracy and flexibility when compared to several competing approaches.

    Citation: Xianwen Ding, Tong Su, Yunqi Zhang. An efficient iterative model averaging framework for ultrahigh-dimensional linear regression models with missing data[J]. AIMS Mathematics, 2025, 10(6): 13795-13824. doi: 10.3934/math.2025621




    Missing data are a common challenge across various research fields, including health studies, clinical trials, and longitudinal investigations. Such missingness can arise from multiple factors, such as the high cost of measuring critical response variables or logistical constraints that hinder data collection due to demographic or economic limitations. Missing values not only reduce the effective sample size, which in turn diminishes estimation efficiency, but also compromise the validity of standard complete-data analysis methods. To mitigate this issue, several statistical techniques have been developed, including multiple imputation, inverse probability weighting, and augmented inverse probability weighting (AIPW), all aimed at improving the robustness of estimation. Particularly when response data are missing at random (MAR), significant research has focused on developing methods to enhance both bias correction and estimation efficiency. References [1] and [2] offer comprehensive reviews of the theoretical advancements and practical applications in this area.

Regression-based prediction methods have become fundamental tools for improving accuracy in various application domains. Related efforts in predictive modeling have demonstrated success using deep regression frameworks for thermal control in photovoltaic systems [3], as well as hybrid associative classifiers incorporating support vector machines to enhance classification accuracy [4]. Alongside these machine learning approaches, model averaging is a well-established statistical technique designed to further improve prediction performance by combining multiple candidate models. By assigning weights based on model quality, model averaging aims to produce more reliable and robust estimates. In recent years, frequentist model averaging has gained considerable traction, with an expanding body of research exploring a variety of methodologies. Notable contributions include Mallows model averaging (MMA, [5]), jackknife model averaging (JMA, [6]), the heteroskedasticity-robust Cp criterion [7], and Kullback-Leibler loss model averaging [8]. In the context of missing data, model averaging has garnered significant attention. Schomaker et al. [9] introduced two model averaging approaches specifically tailored to handle missing data, while Dardanoni et al. [10] applied model averaging to navigate the bias-precision trade-off in linear regression models with missing covariates. Zhang [11] extended the MMA framework to cases where covariates are entirely missing at random, and Fang et al. [12] proposed a novel model averaging framework for fragmentary data. Liang and Wang [13] further contributed by developing a robust model averaging method for partially linear models with responses missing at random. Liang and Zhou [14] then proposed a new model averaging method based on the weighted generalized method of moments for missing response problems. More recently, Liang et al. [15] addressed optimal model averaging for partially linear models with responses missing at random and measurement error in some covariates.

    In the realm of high-dimensional data analysis, significant progress has been made in extending model averaging techniques to accommodate the challenges posed by a growing number of predictors. Lu and Su [16] expanded the JMA criterion of [6] to quantile regression models, allowing the number of predictors to scale with the sample size. Zhang et al. [17] introduced a novel criterion for selecting weights, enabling the development of parsimonious model averaging estimators. In scenarios where the number of predictors increases exponentially with the sample size, Ando and Li [18] introduced a two-step model averaging method that combines marginal screening with JMA for ultrahigh-dimensional linear regression models. This approach was later expanded by [19] to accommodate generalized linear models. Cheng and Hansen [20] further explored model averaging procedures for ultrahigh-dimensional factor-augmented linear regression models using principal component analysis, while Chen et al. [21] investigated semiparametric model averaging methods for nonlinear dynamic time series regression models in similar settings. Lan et al. [22] introduced the sequential screening approach into model averaging for high-dimensional linear regression models. Despite these advancements, much of the research on model averaging for ultrahigh-dimensional data has been limited to complete data settings, leaving the complexities of missing data in high-dimensional contexts largely unaddressed.

    This paper introduces a novel iterative model averaging (IMA) method that integrates an iterative screening procedure with model averaging to handle ultrahigh-dimensional regression models when some of the response data are subject to MAR. To address the missing responses, we develop a nonparametric multiple imputation procedure, which offers greater flexibility compared to the parametric assumptions of the MAR mechanism studied in [23]. The imputed response values are iteratively updated based on the residuals from prior iterations. This iterative screening process reduces the dominance of highly weighted predictors in earlier stages, thereby allowing additional relevant predictors to incrementally contribute to parameter estimation. The candidate model weights are determined using the Bayesian information criterion (BIC), enabling larger weights to be assigned to more influential predictors. This process also ensures that the method remains computationally feasible, even in ultrahigh-dimensional settings, by limiting each step to candidate models of size one. Under regularity conditions, we demonstrate that the proposed IMA method produces consistent estimators of regression coefficients and exhibits strong model-fitting performance. Our theoretical results are supported by several numerical studies, confirming the efficacy of the method in practice.

    The paper is organized as follows: In Section 2, we outline the model setup and introduce the IMA procedure designed for regression models with missing responses. Section 3 presents the theoretical properties of the IMA procedure. Sections 4 and 5 provide a comprehensive evaluation of the method through extensive simulation studies and its application to a real-world dataset. Finally, Section 6 provides the conclusion of the paper, and Section 7 addresses the limitations and potential directions for future research. All technical proofs can be found in the Appendix.

Let $\{(y_i,X_i)\}_{i=1}^{n}$ represent a set of $n$ independent and identically distributed (i.i.d.) random samples, where $y_i$ denotes the response variable and $X_i=(x_{i1},\ldots,x_{ip_n})^{\top}\in\mathbb{R}^{p_n}$ denotes the ultrahigh-dimensional covariate vector. Without loss of generality, assume that $E(x_{ij})=0$, and consider the following linear regression relationship:

$$y_i=X_i^{\top}\beta_0+\varepsilon_i,\quad i=1,\ldots,n,$$

where $\beta_0=(\beta_{01},\ldots,\beta_{0p_n})^{\top}$ is a $p_n$-dimensional vector of regression coefficients, and the $\varepsilon_i$ are i.i.d. random errors with mean zero and finite variance $\sigma^2$. Throughout this paper, the number of covariates $p_n$ is permitted to diverge with the sample size $n$, satisfying $p_n\gg n$, i.e., $\log(p_n)=o(n^{\alpha})$ for some constant $\alpha\in(0,1)$. In this framework, the response variables $y_i$ may be missing, while the covariates $X_i$ are assumed to be fully observed. Thus, the dataset comprises the observations $\{(y_i,X_i,\delta_i)\}_{i=1}^{n}$, where $\delta_i=1$ indicates that $y_i$ is observed, and $\delta_i=0$ otherwise. Let $\mathbf{X}=(X_1,\ldots,X_n)^{\top}=(\mathbf{x}_1,\ldots,\mathbf{x}_{p_n})\in\mathbb{R}^{n\times p_n}$ denote the design matrix, where $\mathbf{x}_j$ represents the $j$-th column of covariates.

We assume that the missing data mechanism adheres to the MAR assumption, implying that the missingness indicator $\delta_i$ is conditionally independent of the response $y_i$, given the covariates $X_i$. Formally, this can be expressed as $\Pr(\delta_i=1\mid X_i,y_i)=\Pr(\delta_i=1\mid X_i)\equiv\pi(X_i)$, where $\pi(\cdot)$ is the selection probability function that models the selection bias underlying the missingness pattern.

To overcome the "curse of dimensionality" frequently encountered in ultrahigh-dimensional data with missing responses, we assume that $y\perp\delta\mid x_j$ for each $j=1,\ldots,p_n$ (see [24]), where $\perp$ stands for statistical independence. This assumption enables us to impute the missing values of $y_i$ by leveraging information from individual covariates $x_j$, rather than relying on the full covariate vector $X$. The multiple imputation procedure for estimating missing responses is then defined as follows:

$$\hat{y}_i=\delta_iy_i+(1-\delta_i)\frac{1}{p_nK}\sum_{j=1}^{p_n}\sum_{k=1}^{K}\tilde{y}^{(j)}_{ik},\quad i=1,\ldots,n,$$

where $K$ is the number of multiple imputations, and $\{\tilde{y}^{(j)}_{ik}\}_{k=1}^{K}$ are $K$ independent imputations for the missing value of $y_i$, drawn from the estimated conditional distribution $\hat{F}(\tilde{y}\mid x_{ij})$. This conditional distribution $\hat{F}(\tilde{y}\mid x_{ij})=\sum_{l=1}^{n}\kappa^{(j)}_{il}I(y_l\leq\tilde{y})$ is a kernel estimator of the true distribution $F(\tilde{y}\mid x_{ij})$, where $\kappa^{(j)}_{il}=\delta_lK_h(x_{lj}-x_{ij})\big/\sum_{k=1}^{n}\delta_kK_h(x_{kj}-x_{ij})$ is the kernel weight. Here, $K_h(u)=K(u/h)$ is the kernel function with bandwidth $h=h_n$, a positive smoothing parameter that tends to zero as $n$ grows, and $I(\cdot)$ is the indicator function. It is important to note that the imputed values $\tilde{y}^{(j)}_{ik}$ have a discrete distribution, where the probability of selecting $y_l$ (when $\delta_l=1$) is given by $\kappa^{(j)}_{il}$. This structure ensures that the imputation procedure integrates information across covariates while maintaining computational efficiency in ultrahigh-dimensional settings.
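A minimal R sketch of this kernel-assisted multiple imputation step is given below, assuming a Gaussian kernel and a user-supplied bandwidth; the function name `impute_response` and its arguments are illustrative rather than the published implementation.

```r
## Sketch of the kernel-assisted multiple imputation of Section 2, assuming a
## Gaussian kernel and a fixed bandwidth h (both choices are illustrative).
## X: n x p covariate matrix; y: response with NAs; delta: 1 = observed.
impute_response <- function(X, y, delta, K = 30, h = 0.5) {
  p <- ncol(X)
  y_hat <- y
  obs <- which(delta == 1)
  for (i in which(delta == 0)) {
    draws <- matrix(NA_real_, nrow = p, ncol = K)
    for (j in seq_len(p)) {
      # kernel weights kappa_il^(j) over the observed responses
      w <- dnorm((X[obs, j] - X[i, j]) / h)
      if (sum(w) == 0) w <- rep(1, length(obs))
      w <- w / sum(w)
      # K imputations drawn from the estimated conditional distribution F-hat
      draws[j, ] <- sample(y[obs], size = K, replace = TRUE, prob = w)
    }
    y_hat[i] <- mean(draws)   # average over the p_n * K imputed values
  }
  y_hat
}
```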

    Remark 2.1. An alternative imputation method commonly used for handling missing response data is the AIPW method. In this approach, the estimated response value for the i-th individual is given by

$$\hat{y}^{*}_i=\frac{\delta_iy_i}{\pi(X_i)}+\left(1-\frac{\delta_i}{\pi(X_i)}\right)\frac{1}{p_nK}\sum_{j=1}^{p_n}\sum_{k=1}^{K}\tilde{y}^{(j)}_{ik},\quad i=1,\ldots,n.\tag{2.1}$$

As shown in Eq (2.1), the AIPW method involves imputing both the missing and the observed responses. Within the empirical likelihood framework, Tang and Qin [25] established the semiparametric efficiency of the AIPW estimator. In the setting of robust quantile regression, Chen et al. [26] further demonstrated that quantile estimators based on both $\hat{y}_i$ and $\hat{y}^{*}_i$ can achieve the semiparametric efficiency bound, provided that the error distributions are identical. Nevertheless, since the AIPW approach requires repeated imputation even for observed responses, it may lead to an increased risk of overfitting, particularly in ultrahigh-dimensional scenarios where $p_n\to\infty$. In contrast, the proposed imputation method is computationally more efficient, as it avoids redundant imputation for observed values. By averaging over the imputed values, it offers a direct and practical solution for handling missing response data. In addition, the proposed multiple imputation method, facilitated by kernel-assisted imputation, is less sensitive to model misspecification and tends to be more robust in high-dimensional contexts.

Let $\mathcal{A}_F=\{1,\ldots,p_n\}$ represent the full predictor index set, and let $\mathcal{A}=\{j_1,\ldots,j_q\}\subseteq\mathcal{A}_F$ be some nonempty subset of cardinality $|\mathcal{A}|=q\geq1$. The complement of $\mathcal{A}$ is denoted by $\mathcal{A}^c=\mathcal{A}_F\setminus\mathcal{A}$. Define $\mathbf{X}_{\mathcal{A}}=(\mathbf{x}_j:j\in\mathcal{A})\in\mathbb{R}^{n\times|\mathcal{A}|}$ and $\beta_{0\mathcal{A}}$ as the sub-matrix of the design matrix $\mathbf{X}$ and the sub-vector of $\beta_0$ indexed by $\mathcal{A}$, respectively. Similarly, we define $\mathbf{X}_{\mathcal{A}^c}$ and $\beta_{0\mathcal{A}^c}$. The candidate model $\mathcal{A}$ with fully observed response data is given by

$$y_i=X_{i\mathcal{A}}^{\top}\beta_{0\mathcal{A}}+\varepsilon_i,\quad i=1,\ldots,n,\tag{2.2}$$

where $X_{i\mathcal{A}}=(x_{ij}:j\in\mathcal{A})\in\mathbb{R}^{|\mathcal{A}|}$ represents the covariates in $X_i$ corresponding to the predictors in the candidate model $\mathcal{A}$. A model averaging estimator $\hat{\beta}$ of $\beta_0$ based on the candidate models (2.2) can be expressed as

$$\hat{\beta}=\sum_{\mathcal{A}\subseteq\mathcal{A}_F}\hat{\omega}_{\mathcal{A}}\hat{\beta}_{\mathcal{A}},$$

where $\hat{\beta}_{\mathcal{A}}$ is an estimator of $\beta_{0\mathcal{A}}$, and $\hat{\omega}_{\mathcal{A}}$ represents the weight assigned to model $\mathcal{A}$, which can be estimated using the commonly applied BIC criterion.

Classical BIC-based model averaging procedures become computationally prohibitive in high-dimensional settings because the number of candidate models scales exponentially with the predictor dimensionality. To alleviate this issue, we propose a simplified BIC model averaging procedure that focuses on univariate candidate models ($|\mathcal{A}|=1$), reducing the number of candidate models to $p_n$. Without loss of generality, let the design matrix $\mathbf{X}$ be standardized such that, for each $j=1,\ldots,p_n$, the $j$-th column of $\mathbf{X}$ satisfies $n^{-1}\|\mathbf{x}_j\|^2=n^{-1}\mathbf{x}_j^{\top}\mathbf{x}_j=1$. For the $j$-th model, the BIC is defined as

$$\mathrm{BIC}_j=n\log\big(\|\hat{Y}-\mathbf{x}_j\hat{\beta}_j\|^2\big)+\log(n)+2\log(p_n),\quad j=1,\ldots,p_n,$$

where $\hat{Y}=(\hat{y}_1,\ldots,\hat{y}_n)^{\top}$ is the imputed response vector and $\hat{\beta}_j=\mathbf{x}_j^{\top}\hat{Y}/n$ is the estimated coefficient for the $j$-th model. The corresponding BIC-based model averaging estimator is then given by $\hat{\beta}^{U}=(\hat{\omega}^{U}_1\hat{\beta}_1,\ldots,\hat{\omega}^{U}_{p_n}\hat{\beta}_{p_n})^{\top}$, where the BIC weight for the $j$-th model is defined as

$$\hat{\omega}^{U}_j=\frac{\exp(-\mathrm{BIC}_j/2)}{\sum_{l=0}^{p_n}\exp(-\mathrm{BIC}_l/2)},\quad j=0,1,\ldots,p_n,$$

with $\mathrm{BIC}_0=n\log(\|\hat{Y}\|^2)$.
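The univariate BIC weights can be computed directly from the imputed response; the following R sketch illustrates the calculation under the standardization $n^{-1}\|\mathbf{x}_j\|^2=1$, with the helper name `bic_weights` chosen for illustration only.

```r
## Sketch of the univariate BIC model averaging weights, assuming standardized
## columns of X and an imputed response vector y_hat.
bic_weights <- function(X, y_hat) {
  n <- nrow(X); p <- ncol(X)
  beta <- as.vector(crossprod(X, y_hat)) / n            # beta_j = x_j' y_hat / n
  rss  <- vapply(seq_len(p),
                 function(j) sum((y_hat - X[, j] * beta[j])^2), numeric(1))
  bic  <- c(n * log(sum(y_hat^2)),                      # null model, BIC_0
            n * log(rss) + log(n) + 2 * log(p))         # univariate models
  w <- exp(-(bic - min(bic)) / 2)                       # shift by min(bic) for stability
  list(weights = w / sum(w), beta = beta)               # weights[1] is the null-model weight
}
```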

Univariate model averaging may lead to inaccurate predictions due to its reliance on a limited subset of the information pool, thereby ignoring the majority of potentially explanatory data features. A natural extension to improve prediction accuracy is to increase the candidate model size from 1 to 2, but this approach can become computationally expensive, particularly in high-dimensional settings, where the number of candidate models scales to $p_n^2$. Motivated by the forward regression approach, we propose an iterative method that reduces the impact of heavily weighted predictors by updating the response at each iteration based on the residuals. This enables enhanced integration of auxiliary predictors into the inferential process, thereby improving the efficiency of parameter estimation. We designate this methodology as IMA. The detailed procedure is outlined below.

Let $\hat{Y}^{(1)}=\hat{Y}$ represent the initial response vector, and let $\hat{Y}^{(m)}=(\hat{y}^{(m)}_1,\ldots,\hat{y}^{(m)}_n)^{\top}$ represent the response vector at the $m$-th iteration. In the $m$-th iteration, we fit $p_n$ univariate models using the current response $\hat{Y}^{(m)}$ and the covariates $\mathbf{x}_j$, $j=1,\ldots,p_n$. The corresponding estimator is given by $\hat{\beta}_{mj}=\mathbf{x}_j^{\top}\hat{Y}^{(m)}/n$. The BIC for the $j$-th model is calculated as

$$\mathrm{BIC}_{mj}=n\log\big(\|\hat{Y}^{(m)}-\mathbf{x}_j\hat{\beta}_{mj}\|^2\big)+\log(n)+2\log(p_n).$$

Although the weighting coefficient assigned to the null model is less than 1, the covariates can still explain only part of the information in the response. Therefore, the null model is also fitted, resulting in $\mathrm{BIC}_{m0}=n\log(\|\hat{Y}^{(m)}\|^2)$. The corresponding BIC model averaging weights for the candidate models are denoted as

$$\hat{\omega}_{mj}=\frac{\exp(-\mathrm{BIC}_{mj}/2)}{\sum_{l=0}^{p_n}\exp(-\mathrm{BIC}_{ml}/2)},\quad j=0,1,\ldots,p_n.$$

    Subsequently, the response vector is updated for the (m+1)-th iteration as

$$\hat{Y}^{(m+1)}=\hat{Y}^{(m)}-\mathbf{X}\hat{\beta}^{(m)},$$

where $\hat{\beta}^{(m)}=(\hat{\omega}_{m1}\hat{\beta}_{m1},\ldots,\hat{\omega}_{mp_n}\hat{\beta}_{mp_n})^{\top}\in\mathbb{R}^{p_n}$. Without loss of generality, this iterative process is assumed to stop after $M$ iterations. The final iterative model averaging estimator is then given by

$$\hat{\beta}_M=\sum_{m=1}^{M}\hat{\beta}^{(m)}.$$

The convergence criterion for this algorithm will be discussed later in this work. Let $(y^{*},X^{*})$ be an independent copy of $(y_i,X_i)$ for some $1\leq i\leq n$; we can then predict the value of $y^{*}$ by $X^{*\top}\hat{\beta}_M$.
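The full iterative scheme can be sketched in R as follows, combining the univariate BIC weights with the residual update and the stopping rule based on the null-model weight; the function name `ima_fit` and its defaults are illustrative, and standardized columns of $\mathbf{X}$ together with an imputed response `y_hat` are assumed.

```r
## Sketch of the IMA updates, assuming standardized columns of X (||x_j||^2 = n)
## and an imputed response y_hat (e.g., from a kernel-assisted imputation step).
ima_fit <- function(X, y_hat, M = 20, eps = 1e-4) {
  n <- nrow(X); p <- ncol(X)
  beta_M <- rep(0, p)                             # cumulative IMA estimator
  Y_m <- y_hat                                    # current (residual) response
  w0_old <- NA_real_
  for (m in seq_len(M)) {
    beta_m <- as.vector(crossprod(X, Y_m)) / n    # univariate estimates at step m
    rss <- colSums((Y_m - sweep(X, 2, beta_m, "*"))^2)
    bic <- c(n * log(sum(Y_m^2)),                 # null model, BIC_m0
             n * log(rss) + log(n) + 2 * log(p))
    w <- exp(-(bic - min(bic)) / 2); w <- w / sum(w)
    beta_step <- w[-1] * beta_m                   # omega_mj * beta_mj
    beta_M <- beta_M + beta_step
    Y_m <- Y_m - drop(X %*% beta_step)            # residual update
    # stop once the relative change of the null-model weight drops below eps
    if (!is.na(w0_old) && (w[1] - w0_old) / w0_old < eps) break
    w0_old <- w[1]
  }
  beta_M
}
```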

Remark 2.2. Commonly used kernel functions include: (1) uniform kernel $K(u)=(1/2)I(|u|\leq1)$, (2) logistic kernel $K(u)=e^{-u}/(1+e^{-u})^2$, (3) Epanechnikov kernel $K(u)=(3/4)(1-u^2)I(|u|\leq1)$, (4) triangular kernel $K(u)=(1-|u|)I(|u|\leq1)$, (5) biweight kernel $K(u)=(15/16)(1-u^2)^2I(|u|\leq1)$, and (6) Gaussian kernel $K(u)=(2\pi)^{-1/2}e^{-u^2/2}$. Through a sensitivity analysis of kernel function choices in the numerical experiments presented in Section 4, we find that the proposed kernel-assisted IMA procedure exhibits robustness to the choice of kernel function. Therefore, for practical applications, we recommend employing the Gaussian kernel for multiple imputation, primarily due to its computational simplicity and widespread applicability.

The IMA algorithm with missing response is designed to estimate regression coefficients in high-dimensional settings where the response variable may be partially missing. The procedure begins with a training dataset $\{(y_i,X_i,\delta_i)\}_{i=1}^{n}$, where $\delta_i$ indicates whether $y_i$ is observed. For observations with missing responses ($\delta_i=0$), the method employs a kernel-based nonparametric imputation strategy. Specifically, for each covariate $x_{ij}$, imputed response values $\{\tilde{y}^{(j)}_{ik}\}_{k=1}^{K}$ are sampled from an estimated conditional distribution $\hat{F}(\tilde{y}\mid x_{ij})$, constructed using a kernel smoothing technique over the observed data.

The IMA procedure then proceeds, starting with the initialization of the iteration index $m=0$ and the weight $\hat{\omega}_{00}=0$. In each iteration, the null model's raw weight is computed based on the norm of the current response vector $\hat{Y}^{(m)}$. For each covariate $\mathbf{x}_j$, a univariate regression is fitted, yielding coefficient estimates $\hat{\beta}_{mj}$, and the associated raw weights $\hat{\omega}_{mj}$ are calculated based on the BIC, which accounts for model complexity.

All raw weights, including that of the null model, are normalized to ensure they sum to one. The iteration produces a weighted average estimator $\hat{\beta}^{(m)}$, which is then used to update the response vector for the subsequent iteration via residual adjustment: $\hat{Y}^{(m+1)}=\hat{Y}^{(m)}-\mathbf{X}\hat{\beta}^{(m)}$. This iterative process continues until the change in the null model's weight across successive iterations falls below a pre-specified threshold $\epsilon$, ensuring convergence.

Finally, the cumulative model averaging estimator is obtained as $\hat{\beta}_M=\sum_{m=1}^{M}\hat{\beta}^{(m)}$, where $M$ denotes the total number of iterations. The IMA algorithm effectively integrates imputation with adaptive model averaging to address challenges arising from missing data in high-dimensional regression problems. The detailed algorithmic steps are outlined in Algorithm 1.

Algorithm 1: Iterative model averaging with missing response
Input: Training dataset $\{(y_i,X_i,\delta_i)\}_{i=1}^{n}$.
Output: Iterative model averaging estimator $\hat{\beta}_M$.
Return $\hat{\beta}_M=\sum_{m=1}^{M}\hat{\beta}^{(m)}$.

As demonstrated in Algorithm 1, the proposed IMA procedure is fundamentally data-driven and can be effectively implemented using the observed dataset $\{(y_i,X_i,\delta_i)\}_{i=1}^{n}$. Nevertheless, several hyperparameters, which are recommended based on our empirical findings in Section 4, play a crucial role in enhancing computational efficiency. These are summarized in Table 1.

Table 1.  Recommended hyperparameters in the IMA procedure.
Hyperparameter Description Value
$K$ Number of multiple imputations 30
$K_h(u)=K(u/h)$ Kernel function Gaussian kernel function
$h=h_n$ Bandwidth of the given kernel function $K(\cdot)$ Selected via the data-driven procedure
$\epsilon$ Stopping threshold for the IMA procedure 0.0001
$M$ Number of iterative steps 20


To establish the theoretical properties of the proposed IMA procedure, we introduce the following notations and conditions. Let $\beta^2_{(1)}\geq\beta^2_{(2)}\geq\cdots\geq\beta^2_{(p_n)}$ and $\hat{\omega}^{U}_{(1)}\geq\hat{\omega}^{U}_{(2)}\geq\cdots\geq\hat{\omega}^{U}_{(p_n)}$ be the ordered statistics of $\{\beta^2_{0j}:1\leq j\leq p_n\}$ and $\{\hat{\omega}^{U}_j:1\leq j\leq p_n\}$, respectively.

Condition 1: The missing propensity function $\pi(X)=\Pr(\delta=1\mid X)$ and the probability density function $f_j(x)$ of $x_j$ ($j=1,\ldots,p_n$) both have continuous and bounded second-order derivatives over the support $\mathcal{X}_j$ of $x_j$. Additionally, there exists a constant $C_0>0$ such that $\inf_x\pi(x)\geq C_0$.

Condition 2: The kernel function $K(\cdot)$ is a probability density function satisfying: (ⅰ) It is bounded and has compact support; (ⅱ) It is symmetric with $\int t^2K(t)\,dt<\infty$; (ⅲ) There exists a constant $C_1>0$ such that $K(t)\geq C_1$ in a closed interval centered at zero. Furthermore, the smoothing bandwidth $h$ satisfies $nh\to\infty$ and $nh^2\to0$ as $n\to\infty$.

Condition 3: The random variables $x_j$, $y$, and $x_jy$ satisfy a sub-exponential tail probability uniformly in $p_n$. That is, there exists a constant $u_0>0$ such that for all $0<u\leq2u_0$,

$$\max_{1\leq j\leq p_n}E\{\exp(2ux_j^2)\}<\infty,\quad\max_{1\leq j\leq p_n}E\{\exp(2ux_jy)\}<\infty,\quad E\{\exp(2uy^2)\}<\infty.$$

Condition 4: There exist positive constants $C_2$ and $C_3$, independent of $n$ and $p_n$, such that $\max_j|\beta_{0j}|\leq C_2$ and $\max\{\max_{j_1\neq j_2}|\sigma_{j_1j_2}|,\max_j|\mathrm{corr}(x_j,y)|\}\leq C_3<1$.

Condition 5: Let $\beta^2_{m(1)}\geq\beta^2_{m(2)}\geq\cdots\geq\beta^2_{m(p_n)}$ be the ordered statistics of $\{\beta^2_{0mj}:1\leq j\leq p_n\}$ for any $m=1,\ldots,M$. There exist positive constants $C_5$ and $C_6$, independent of $n$ and $p_n$, such that $\max_{1\leq m\leq M}|\beta_{m(1)}|\leq C_5$ and $\max_{1\leq m\leq M}\max_j|\mathrm{corr}(x_j,y^{(m)})|\leq C_6<1$.

Condition 6: There exists a sequence of index sets $\mathcal{A}_n$ such that (i) $|\mathcal{A}_n|\to\infty$ and $|\mathcal{A}_n|/n\to0$ as $n\to\infty$; (ii) $|\mathcal{A}_n^c|\,\|\beta_{0\mathcal{A}_n^c}\|^2\to0$ as $M,n\to\infty$; (iii) $|\mathcal{A}_n^c|\max_{m>1}\sum_{j\in\mathcal{A}_n^c}\hat{\omega}^2_{mj}\to0$ as $n\to\infty$.

Condition 7: There exist positive constants $C_8$ and $C_9$ such that the eigenvalues of the covariance matrix $\mathrm{Cov}(X)$ satisfy $C_8\leq\lambda_{\min}[\mathrm{Cov}(X)]\leq\lambda_{\max}[\mathrm{Cov}(X)]\leq C_9$, where $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ represent the smallest and largest eigenvalues of the matrix $A$, respectively.

Conditions 1 and 2 impose standard assumptions on the missing data propensity function $\pi(X)$ and the probability density functions $f_j(x)$, which are generally satisfied by commonly used distributions. Condition 3 holds under widely employed distributions, such as the multivariate Gaussian. Similar conditions have been discussed in the work of [27]. Condition 4 places restrictions on the parameters to prevent perfect correlation between any predictor and the response, which aligns with the assumptions made in [28] and [22]. Condition 5 extends these assumptions to each iterative step. Condition 6, adapted from [22], is essential for establishing the asymptotic properties of $\hat{\beta}_M$ as $M,n\to\infty$. Lastly, Condition 7 ensures that the design matrix behaves well, similar to the assumptions made by [29].

Theorem 3.1. Under Conditions 1–4, suppose that $\beta^2_{(1)}-\beta^2_{(2)}>C_4$ for some positive constant $C_4$. Then we have $\hat{\omega}^{U}_{(1)}\xrightarrow{p}1$ as $n\to\infty$, where $\hat{\omega}^{U}_{(1)}$ is the weight assigned to the univariate candidate model whose covariate has the largest absolute marginal regression coefficient.

This theorem establishes that the information used for estimation and prediction is dominantly governed by the covariate most strongly correlated with the response. Moreover, Theorem 3.1 can easily be extended to assign nearly equal weights to the two predictors most strongly correlated with the response, specifically when $\beta^2_{(1)}=\beta^2_{(2)}+o(1)>\beta^2_{(3)}$, while the contributions from other predictors are negligible.

Theorem 3.2. For $m\geq1$, we have

$$(\mathrm{i})\ \|\hat{Y}^{(m)}\|^2-\|\hat{Y}^{(m+1)}\|^2\geq\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj},\qquad(\mathrm{ii})\ \|\hat{Y}^{(m)}\|^2-\|\hat{Y}^{(m+1)}\|^2\leq2n(1-\hat{\omega}_{m0})\hat{\beta}^2_{m(1)}.$$

The first part of Theorem 3.2 ensures that the proposed IMA procedure can progressively approximate the true regression relationship within a limited number of iterations. In particular, the residual sum of squares $\|\hat{Y}^{(m)}\|^2$ decreases monotonically with each iteration, which reflects the method's strong fitting capability. The second part of Theorem 3.2 indicates that the IMA framework effectively mitigates the risk of overfitting. When no informative predictors are selected during the $m$-th iteration, the weight $\hat{\omega}_{m0}$ approaches 1, implying that the procedure primarily relies on the baseline model and avoids spurious fitting. Conversely, if significant predictors are present, the largest coefficient $\hat{\beta}^2_{m(1)}$ and its corresponding weight $\hat{\omega}_{m(1)}$ become dominant, while $\hat{\omega}_{m0}$ shrinks toward 0. Consequently, the upper bound in part (ⅱ) of Theorem 3.2 demonstrates that the IMA procedure not only maintains fitting accuracy but also suppresses excessive model complexity, thus effectively alleviating overfitting.

Theorem 3.3. Under Conditions 1–3 and 5, for those values of $m$ satisfying $\beta^2_{m(1)}-\beta^2_{m(2)}>C_7$ with some positive constant $C_7$, we have $\hat{\omega}_{m(1)}\xrightarrow{p}1$ as $n\to\infty$, where $\hat{\omega}_{m(1)}\geq\cdots\geq\hat{\omega}_{m(p_n)}$ denote the ordered statistics of $\{\hat{\omega}_{mj}:1\leq j\leq p_n\}$. In addition, for those values of $m$ satisfying $\beta^2_{m(1)}=O(n^{-\gamma})$, $\gamma>0$, we have $\hat{\omega}_{m0}\xrightarrow{p}1$ as $n\to\infty$.

The first part of this theorem indicates that the largest weight in each iterative step will be assigned to the most significant coefficient. The second part of this theorem implies that the weight $\hat{\omega}_{m0}$ increases with $m$. Therefore, the IMA algorithm can be stopped if $\{\hat{\omega}_{(m+1)0}-\hat{\omega}_{m0}\}/\hat{\omega}_{m0}<\epsilon$ for some small $\epsilon>0$. In our simulations, we set $\epsilon=0.0001$.

Theorem 3.4. Under Conditions 1–3 and 6–7, we have

$$\|\hat{\beta}_M-\beta_0\|\xrightarrow{p}0,\quad\text{as }\min\{M,n\}\to\infty.$$

This theorem demonstrates that the IMA procedure provides a consistent estimator of $\beta_0$. Accordingly, $X^{*\top}\hat{\beta}_M$ is a consistent estimator of $X^{*\top}\beta_0$ for the given $X^{*}$. This implies that $E\{(X^{*\top}\hat{\beta}_M-X^{*\top}\beta_0)^2\}\to0$ as $n\to\infty$.

    Remark 3.1. In fields such as genomics and epidemiology, high-dimensional datasets typically consist of thousands of feature variables, while the response variable may be subject to MAR. The proposed iterative model averaging procedure effectively mitigates prediction risks in such ultrahigh-dimensional settings, as demonstrated in Theorem 3.4. This makes the proposed approach highly applicable in the analysis of genomic data and disease prediction.

In this section, several simulation studies are conducted to evaluate the finite-sample performance of the proposed estimation procedure, referred to hereafter as the IMA method. The simulation is based on 200 independent replications, each employing a sample size of $n=100$ and two feature dimensions, $p_n=1000$ and $3000$. For the $t$-th replication, the data are represented as $\{Y^{[t]},\mathbf{X}^{[t]},\delta^{[t]}\}$, where $Y^{[t]}\in\mathbb{R}^{n}$, $\mathbf{X}^{[t]}\in\mathbb{R}^{n\times p_n}$, and $\delta^{[t]}\in\mathbb{R}^{n}$. Based on this dataset, the corresponding IMA estimator $\hat{\beta}_M^{[t]}$ is computed. To evaluate the estimation accuracy of our proposed method and compare it with several alternative approaches, we employ the following mean squared error (MSE) criterion:

$$\mathrm{MSE}^{(t)}=\frac{1}{100}\sum_{i=1}^{100}\big(X_i^{\top}\beta_0-X_i^{\top}\hat{\beta}_M^{[t]}\big)^2,\tag{4.1}$$

where $\beta_0\in\mathbb{R}^{p_n}$ denotes the true parameter vector, and $\{X_i\}_{i=1}^{100}$ represents an independent test dataset. The overall performance metric is obtained by calculating the median of the replication-specific MSE values:

$$\mathrm{MMSE}=\mathrm{median}\{\mathrm{MSE}^{(1)},\ldots,\mathrm{MSE}^{(200)}\}.$$

    This comprehensive evaluation framework ensures a rigorous assessment of estimation accuracy, where smaller MSE values indicate superior performance of the estimation method.
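For concreteness, the evaluation metric can be coded as follows; `X_test`, `beta0`, and `beta_hat_list` are illustrative names for the test design, the true coefficient vector, and the list of replication-specific IMA estimates.

```r
## Sketch of the MSE criterion in Eq (4.1) and the MMSE summary.
mse_rep <- function(X_test, beta0, beta_hat) {
  mean((X_test %*% beta0 - X_test %*% beta_hat)^2)   # average over the 100 test points
}
mmse <- median(vapply(beta_hat_list, mse_rep, numeric(1),
                      X_test = X_test, beta0 = beta0))
```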

In addition to the proposed IMA method, we consider several alternative methods for comparison. These include the sequential model averaging procedure for the complete-case setting (denoted as the "CC" method, [22]), the sequential model averaging procedure without adjustments for missing data (denoted as the "FULL" method, [22]), weighted model averaging with the cross-validation (CV) method, without the constraint $\sum_{s=1}^{S}\omega_s=1$ (denoted as "WMCV", [23]), and the weighted model averaging with CV method, incorporating the constraint $\sum_{s=1}^{S}\omega_s=1$ (denoted as "WMCV1", [23]). Here, $S$ denotes the number of candidate models. Additionally, we consider the BIC model averaging procedure applied to complete-case data (denoted as "MBIC").

To implement the last three methods, the predictors are grouped according to their marginal multiple-imputation sure independence screening utility [23], retaining the top 100 predictors. Subsequently, we set $S=10$, resulting in a set of 10 candidate models, each with 10 predictors. The performance of these methods is then evaluated by replacing $X_i^{\top}\hat{\beta}_M^{[t]}$ in Eq (4.1) with the weighted sum $\sum_{s=1}^{S}\hat{\omega}_s^{[t]}X_i^{\top}\hat{\beta}_s^{[t]}$, where $\hat{\omega}_s^{[t]}$ and $\hat{\beta}_s^{[t]}$ denote the estimators for the $s$-th candidate model in the $t$-th replication.
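A sketch of the competitor predictions substituted into Eq (4.1) is given below; `beta_s_list` and `w_s` are illustrative names for the $S$ candidate-model estimates (zero-padded to length $p_n$) and their weights.

```r
## Sketch of the weighted-sum prediction used for the grouped candidate models.
pred_wm <- function(X_test, beta_s_list, w_s) {
  preds <- vapply(beta_s_list, function(b) drop(X_test %*% b),
                  numeric(nrow(X_test)))   # one column of predictions per candidate model
  drop(preds %*% w_s)                      # weighted sum over the S candidate models
}
```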

Experiment 4.1. This experiment is adapted from [18], where the true number of regressors, $d$, is set to 50. The considered linear regression model is as follows:

$$y_i=X_i^{\top}\beta_0+\varepsilon_i,\quad i=1,\ldots,n,$$

where $\beta_0=(\beta_{01},\ldots,\beta_{0p_n})^{\top}$ is the vector of true regression coefficients, and $X_i=(x_{i1},\ldots,x_{ip_n})^{\top}$ represents the $p_n$-dimensional vector of predictors. The noise terms $\varepsilon_i$ are independent of the predictors. The covariate vector $X_i$ is generated from a multivariate normal distribution with mean $\mathbf{0}$, and the covariance between $x_{j_1}$ and $x_{j_2}$ is given by $\mathrm{Cov}(x_{j_1},x_{j_2})=\rho^{|j_1-j_2|}$, $1\leq j_1,j_2\leq p_n$, where $\rho$ represents the correlation parameter and takes values of 0 and 0.5, corresponding to low and moderate correlation scenarios, respectively. The true regressors $x_j$ are spaced evenly across the predictor vector, with $j=p_n(k-1)/d+1$ for $k=1,\ldots,d$. The nonzero entries of $\beta_0$ are generated from a normal distribution with mean 0 and standard deviation 0.5. Two error distributions are considered: (i) the standard normal distribution and (ii) the Student's $t$ distribution with 3 degrees of freedom. It is assumed that the predictor values $X_i$ are fully observed, but the response values $y_i$ are subject to missingness. To simulate missing data for $y_i$, the missingness indicator $\delta_i$ is generated from a Bernoulli distribution with probability $\pi(X_i)=\pi(x_{i1})=\Pr(\delta_i=1\mid x_{i1})$. The following three missingness mechanisms are considered:

M1: $\pi(x_{i1})=(0.3+0.175|x_{i1}|)I(|x_{i1}|<4)+I(|x_{i1}|\geq4)$. This mechanism induces a missingness pattern that depends on the absolute value of $x_{i1}$, with truncation at $|x_{i1}|\geq4$.

M2: $\pi(x_{i1})=\Phi(0.5+3x_{i1})$, where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. This mechanism introduces a monotonic relationship between the probability of missingness and the value of $x_{i1}$.

M3: $\mathrm{logit}(\pi(x_{i1}))=0.5+3x_{i1}$, where the logit function is applied to model the probability of missingness. This mechanism results in a linear relationship between $x_{i1}$ and the log-odds of missingness.

The average proportions of missing data for the three mechanisms are approximately 56%, 44%, and 44%, respectively. For each of these settings, the number of multiple imputations is set to $K=30$ for each missing $y_i$.
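The following R sketch illustrates this data-generating design under one particular configuration ($p_n=1000$, $d=50$, $\rho=0.5$, standard normal errors, mechanism M2); it is a simplified illustration rather than the exact simulation code.

```r
## Sketch of the Experiment 4.1 design under one illustrative configuration.
set.seed(1)
n <- 100; p <- 1000; d <- 50; rho <- 0.5
R <- chol(rho^abs(outer(1:p, 1:p, "-")))        # Cov(x_j1, x_j2) = rho^|j1 - j2|
X <- matrix(rnorm(n * p), n, p) %*% R
beta0 <- numeric(p)
idx <- p * (0:(d - 1)) / d + 1                  # evenly spaced true regressors
beta0[idx] <- rnorm(d, mean = 0, sd = 0.5)
y <- drop(X %*% beta0) + rnorm(n)               # standard normal errors
delta <- rbinom(n, 1, pnorm(0.5 + 3 * X[, 1]))  # mechanism M2: pi = Phi(0.5 + 3 x_1)
y_obs <- ifelse(delta == 1, y, NA)              # responses missing at random
```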

Experiment 4.2. This experiment aims to investigate the impact of the proposed method on different sparsity structures within a regression model, following a framework modified from [30]. Specifically, the covariate vector $X_i=(x_{i1},\ldots,x_{ip_n})^{\top}$ is generated from a multivariate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $\Sigma$, where the entries of $\Sigma$ are defined as $\mathrm{Cov}(x_{j_1},x_{j_2})=0.5^{|j_1-j_2|}$, $1\leq j_1,j_2\leq p_n$. The true regression coefficients $\beta_0=(\beta_{01},\ldots,\beta_{0p_n})^{\top}$ are specified such that $\beta_{0j}=(-1)^{j}\times0.5$ for $1\leq j\leq d$, and $\beta_{0j}=0$ for $j>d$, where $d$ represents the number of true nonzero coefficients. The error terms $\varepsilon_i$ are generated from the standard normal distribution with mean 0 and standard deviation 1, and are independent of the covariates $x_{ij}$ for $j=1,\ldots,p_n$. In alignment with Experiment 4.1, the covariate vectors $X_i$ are assumed to be fully observed, while the response values $y_i$ are subject to missingness. The missingness indicator $\delta_i$ is generated from a Bernoulli distribution with probability $\pi(X_i)=\pi(x_{i1})=\Pr(\delta_i=1\mid x_{i1})$, where $\pi(x_{i1})$ follows the same specifications as in Experiment 4.1. The three considered missingness mechanisms yield average missing proportions of approximately 56%, 44%, and 44%, respectively. The simulation results for $d=20$ are presented in this experiment.

The Gaussian kernel function is employed in the imputation procedure, and the optimal bandwidth for kernel density estimation is selected via the cross-validation method implemented in the kedd package in R. The results from the two experiments with $p_n=1000$ and $3000$ are presented in Figures 1–4. The black dash-dotted line indicates the median MSE value (MMSE) of the proposed method and is included for direct comparison. Additionally, simulation results for Experiment 4.2 with $p_n=1000$ using the Epanechnikov and biweight kernels are presented in Figure 5.

    Figure 1.  MSE values of six different methods in Experiment 4.1 under M1.
    Figure 2.  MSE values of six different methods in Experiment 4.1 under M2.
    Figure 3.  MSE values of six different methods in Experiment 4.1 under M3.
    Figure 4.  MSE values of six different methods in Experiment 4.2 (Gaussian kernel).
    Figure 5.  MSE values of six different methods in Experiment 4.2 (Epanechnikov kernel and biweight kernel).

A closer examination of Figures 1–5 yields the following observations: (ⅰ) The proposed IMA procedure consistently outperforms other model averaging methods, including CC, WMCV, WMCV1, and MBIC, across most settings. This outcome underscores the effectiveness of the proposed multiple imputation techniques. (ⅱ) As anticipated, the sequential model averaging method "FULL" of [22] performs best across all scenarios, as it relies on completely observed data. (ⅲ) The IMA procedure exhibits robust performance across various missing data mechanisms, indicating its relative insensitivity to changes in the missingness pattern. (ⅳ) The proposed method shows reduced sensitivity to the sparsity of the regression model coefficients. (ⅴ) The proposed kernel-assisted IMA approach exhibits robustness to the choice of kernel function. In conclusion, the IMA procedure consistently outperforms the competing methods, demonstrating the smallest median distribution of MSE values.

To further assess the impact of the covariate vector distribution on prediction accuracy, we consider the multivariate skew-normal distribution as studied in [31]. Specifically, the covariate vector $X_i=(x_{i1},\ldots,x_{ip_n})^{\top}$ is generated from the multivariate skew-normal distribution $SN_{p_n}(\Sigma,\alpha)$, where the entries of $\Sigma$ are defined by $\mathrm{Cov}(x_{j_1},x_{j_2})=0.5^{|j_1-j_2|}$ for $1\leq j_1,j_2\leq p_n$, and $\alpha$ is the shape parameter controlling the skewness of the distribution. When $\alpha=\mathbf{0}$, the multivariate skew-normal distribution $SN_{p_n}(\Sigma,\alpha)$ reduces to the multivariate normal distribution $N_{p_n}(\mathbf{0},\Sigma)$. In contrast, when $\alpha\neq\mathbf{0}$, the distribution remains skew-normal. The simulated MSE values from Experiment 4.2, where $\alpha=(1,-1,1,-1,\ldots,1,-1)^{\top}$ and the Gaussian kernel function is used, are presented in Figure 6. A careful examination of Figure 6 reveals that the proposed method consistently outperforms WMCV, WMCV1, and MBIC across all scenarios. However, for the cases where $p_n=1000$ and the missingness mechanisms are M2 and M3, the CC method appears to be comparable to the proposed IMA method. Nevertheless, for all other scenarios, the proposed estimation procedure outperforms the CC method. These findings suggest that the performance of the CC method is particularly sensitive to the missingness rate when the skew-normal distribution is considered, as mechanisms M2 and M3 exhibit relatively lower missing rates compared to M1. This further supports the efficiency and robustness of the proposed IMA method based on multiple imputation.

Figure 6.  MSE values of six different methods in Experiment 4.2 ($X_i\sim SN_{p_n}(\Sigma,\alpha)$).

    All numerical experiments are conducted on a personal computer equipped with an Intel(R) Core(TM) i7-10875H CPU running at 2.30 GHz, featuring 8 physical cores and 16 threads. The machine is configured with 16.0 GB of RAM. The computations are performed using R software, version 4.4.1. This hardware and software configuration provides adequate computational power for implementing and evaluating the proposed algorithms, especially in the context of high-dimensional simulation and iterative model fitting. The average computation times for the six methods in Experiment 4.2 are presented in Table 2.

Table 2.  Running times of six methods in Experiment 4.2 (in seconds).
$p_n=1000$: IMA 47.04, CC 4.61, FULL 8.10, WMCV 22.75, WMCV1 22.44, MBIC 27.92.
$p_n=3000$: IMA 236.40, CC 107.04, FULL 126.21, WMCV 154.47, WMCV1 152.75, MBIC 154.82.


As shown in Table 2, the average computation time for the IMA method is 47.04 seconds for $p_n=1000$, increasing to 236.40 seconds for $p_n=3000$. While the computational time increases with $p_n$, these results highlight the robustness and scalability of the IMA method in high-dimensional modeling scenarios. Notably, when compared with other methods such as CC, WMCV, WMCV1, and MBIC, the IMA method achieves an optimal balance between computational efficiency and model performance, making it particularly suitable for ultrahigh-dimensional settings. Given its robustness, scalability, and competitive performance, the IMA method is recommended for applications that require precise model averaging and prediction in high-dimensional contexts.

Experiment 4.3. The primary aim of this experiment is to evaluate the effectiveness of the proposed stopping rule in ensuring robust predictive performance. For illustration, we focus on the setup from Experiment 4.2, where $n=100$, $p_n=1000$, and $d=20$. Let $M=100$, and define the stopping criterion at the $m$-th iteration as $\mathrm{eps}_m=\{\hat{\omega}_{(m+1)0}-\hat{\omega}_{m0}\}/\hat{\omega}_{m0}$, for $m=1,\ldots,M$. Additionally, we examine the evolution of the coefficients by plotting $\max_j|\hat{\beta}^{(m)}_j|$ as a function of the iteration number $m=1,\ldots,M$, to determine whether the estimated coefficients $\hat{\beta}^{(m)}_j$ become negligible as $m$ increases. The corresponding plots of $\mathrm{eps}_m$ and $\max_j|\hat{\beta}^{(m)}_j|$ are presented in Figure 7 for the missingness mechanisms M1–M3.

Figure 7.  Values of $\mathrm{eps}_m$ and $\max_j|\hat{\beta}^{(m)}_j|$ versus the iteration number $m$ for the three missingness mechanisms in Experiment 4.3.

Examination of Figure 7 reveals the following key observations: (i) The upper three panels demonstrate that the values of $\mathrm{eps}_m$ tend to stabilize for $m>20$ across all three missingness mechanisms. From this, we conservatively conclude that updating the IMA procedure for up to 20 steps is sufficient to achieve reliable predictive performance for all mechanisms (M1–M3) in this experiment. (ii) The lower three panels show that $\max_j|\hat{\beta}^{(m)}_j|$ decreases rapidly toward zero for all missingness mechanisms, indicating that the influence of additional predictors diminishes significantly after approximately 20 steps.

    In summary, these findings provide strong evidence that the proposed stopping rule ensures stable and accurate predictions after a relatively small number of iterations, thereby confirming the efficiency of the procedure.

In this section, we illustrate the application of the proposed IMA procedure using a gene expression dataset related to Bardet-Biedl Syndrome [32]. This dataset comprises 120 twelve-week-old male rats, each characterized by 31,042 distinct probe sets, selected for tissue harvesting from the eyes and subsequent microarray analysis. Following the approaches of [32], 18,976 probes were deemed "sufficiently variable" based on expression quantitative trait locus (eQTL) mapping, exhibiting at least two-fold variation. For our analysis, all 18,976 probes were standardized to have zero mean and unit variance. Chiang et al. [33] identified the gene TRIM32 at probe 1389163_at as critical to Bardet-Biedl Syndrome. In this study, the response variable $y$ corresponds to probe 1389163_at, while the predictors $x_j$ represent the remaining probes. The dataset presents a challenging scenario, as the sample size ($n=120$) is small compared to the dimensionality ($p_n=18{,}975$).

    The primary goal of this analysis is to evaluate the prediction performance of the proposed IMA procedure in the presence of missing response data. To achieve this, we artificially introduce missingness under a MAR mechanism defined as

$$\mathrm{logit}\{\pi(x_{6329},\gamma)\}=\gamma_0+\gamma_1x_{6329},$$

where $\gamma=(\gamma_0,\gamma_1)^{\top}$, and $\gamma_0$ is the intercept term. The covariate $x_{6329}$ is selected by ranking all 18,975 probes according to the absolute value of their marginal correlations with the expression of probe 1389163_at, retaining the top-ranked probe as the covariate for the missingness mechanism. The true parameter values are set as $\gamma=(0.5,0.8)^{\top}$, resulting in an approximate missing proportion of 42%.
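A minimal sketch of this artificial missingness mechanism is shown below; `x6329` and `y` are illustrative names for the standardized expression of the top-ranked probe and the TRIM32 probe expression.

```r
## Sketch of the MAR mechanism imposed on the rat eye data.
gamma0 <- 0.5; gamma1 <- 0.8
pi_obs <- plogis(gamma0 + gamma1 * x6329)   # logit(pi) = gamma0 + gamma1 * x_{6329}
delta  <- rbinom(length(x6329), 1, pi_obs)  # delta = 1: response observed
y_obs  <- ifelse(delta == 1, y, NA)         # roughly 42% of responses become missing
```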

To assess the predictive performance of various methods, the complete dataset is randomly partitioned into a training set of size 80 and a validation set of size 40, with this process repeated 100 times. For implementing the proposed method, we adopt the Gaussian kernel function $K(u)=\exp(-u^2/2)/(2\pi)^{1/2}$ and perform $K=30$ multiple imputations for each missing response. The bandwidth parameter is determined using the kedd package in R, following the approach taken in Experiment 4.1. The proposed IMA method is then applied to the training set, and the predictive performance of the resulting model is evaluated by calculating the prediction error (PE) on the validation set for each of the 100 replications:

$$\mathrm{PE}=\frac{1}{n_1}\sum_{i\in\mathcal{V}}\delta_i(y_i-\hat{\mu}_i)^2,$$

where $\hat{\mu}_i$ denotes the estimated mean of the response variable $y_i$, $n_1=\sum_{i\in\mathcal{V}}\delta_i$, and $\mathcal{V}$ represents the set of indices corresponding to the observations in the validation set. A detailed summary of the dataset characteristics and experimental design is provided in Table 3.
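The prediction error on a validation split can be computed as in the sketch below, assuming `beta_hat` was fitted on the training split and `V` indexes the validation observations (names are illustrative).

```r
## Sketch of the validation prediction error (PE).
pred_error <- function(X, y, delta, beta_hat, V) {
  mu_hat <- drop(X[V, , drop = FALSE] %*% beta_hat)  # estimated means on the validation set
  obs <- delta[V] == 1                               # only observed responses enter PE
  sum((y[V][obs] - mu_hat[obs])^2) / sum(obs)
}
```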

Table 3.  Summary of dataset features and experimental design.
Feature Description
Study objective Predict expression of TRIM32 (probe 1389163_at)
Organism 12-week-old male rats
Total number of samples 120
Original number of probe sets 31,042
Filtered probes after eQTL selection 18,976
Response variable ($y$) Expression of probe 1389163_at
Predictors ($x_j$) Remaining 18,975 probe expressions
Data standardization Mean 0 and variance 1 for all predictors
Missing data mechanism MAR: $\mathrm{logit}(\pi)=\gamma_0+\gamma_1x_{6329}$
Missing rate in $y$ Approximately 42%
Training set size 80
Validation/Test set size 40
Number of repetitions 100
Number of imputations ($K$) 30
Kernel function Gaussian: $K(u)=\exp(-u^2/2)/\sqrt{2\pi}$


In this artificially generated missing-response scenario, we include an evaluation of four alternative methods, namely WMCV, WMCV1, MBIC, and CC, all of which were introduced and analyzed in the previous simulation studies. For the last three methods, genes are first ranked according to the marginal screening utility proposed by [23]. The top 200 genes are retained, and $S=20$ candidate models are subsequently constructed, each containing 10 genes. The prediction errors on the validation data are illustrated in Figure 8. Examination of Figure 8 clearly demonstrates that the proposed IMA procedure achieves superior predictive efficiency compared to the other four methods, as evidenced by the smallest PE values.

    Figure 8.  PE values of five different methods in the rat eye dataset.

    This paper investigated the prediction problem in ultrahigh-dimensional linear regression models with missing responses under the assumption of missing at random. To address the missing response issue, we proposed an effective multiple-imputation procedure for handling the missing data. Additionally, we introduced an IMA procedure that combines iterative screening and model averaging techniques to enhance prediction accuracy. The proposed approach alleviates overfitting and provides consistent estimators for the regression coefficients, ensuring reliable and accurate predictions. Through simulation studies and a real data example, we validated the performance of the proposed IMA procedure. Our results demonstrated that the proposed method outperforms existing approaches, including conventional model averaging techniques, in terms of predictive accuracy and robustness.

    While the current model averaging method assumes that missing responses follow a MAR mechanism, many real-world applications involve missing data mechanisms where the missingness is dependent on the unobserved values themselves, leading to a missing not at random (MNAR) scenario. In practical applications, testing the assumption of MAR versus MNAR is crucial to ensure the applicability of our method. In ultrahigh-dimensional settings, the Pearson Chi-square test statistic developed by [34] can be employed to identify key features in the missingness data models. Subsequently, the score test techniques proposed by [35] can be utilized to assess the type of missingness mechanism. If the missingness mechanism is identified as MAR, the proposed multiple imputation and iterative model averaging approach can be applied to achieve robust statistical predictions. However, when the missingness mechanism is MNAR, it presents significant challenges for valid inference, including issues related to the identification of population parameters and the development of appropriate multiple imputation methods. Additionally, Condition 3 in Section 3 imposes a sub-exponential tail assumption on covariates to guarantee model stability and prevent overfitting. Relaxing this condition remains an important avenue for future research to enhance the generalizability of the IMA method to data with heavier tails. These considerations suggest potential directions for enhancing the current framework to better accommodate more complex missing data mechanisms and distributional settings.

    Xianwen Ding: Writing–original draft, Formal analysis, Investigation, Methodology, Visualization, Software; Tong Su: Conceptualization, Writing–original draft, Formal analysis, Methodology, Project administration; Yunqi Zhang: Conceptualization, Funding acquisition, Validation, Resources, Writing–review & editing. All authors have read and agreed to the published version of the manuscript.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by the National Natural Science Foundation of China (No. 12001244, 12426666, 12426668, 12426409), the Major Basic Research Project of the Natural Science Foundation of the Jiangsu Higher Education Institutions (grant No. 19KJB110007), the Yunnan Fundamental Research Projects (grant No. 202401AU070212), the Youth Project of Yunnan Xingdian Talent Program, and the Zhongwu Young Teachers Program for the Innovative Talents of Jiangsu University of Technology.

    All authors declare no conflicts of interest in this paper.

Lemma A.1. Under Conditions 1–3, for any $\nu\in(0,1/2)$, constant $c_1>0$, and $1\leq j\leq p_n$, there exist some positive constants $c_2$ and $c_3$ such that

$$\Pr\big(|\hat{\beta}_j-\beta_{0j}|>c_1n^{-\nu}\big)\leq c_3n\exp\{-c_2n^{(1-2\nu)/3}\}.$$

Proof of Lemma A.1. Let $h_j(x)=E(x_jy\mid x_j=x)$ and $\hat{h}_j(x)=\sum_{l=1}^{n}\delta_lK_h(x-x_{lj})xy_l\big/\sum_{l=1}^{n}\delta_lK_h(x-x_{lj})$ for each $j=1,\ldots,p_n$. Then

$$\Pr\big(|\hat{\beta}_j-\beta_{0j}|>c_1n^{-\nu}\big)\leq\Pr\left\{\left|\frac{1}{n}\sum_{i=1}^{n}x_{ij}\hat{y}_i-E(x_jy)\right|\geq c_1n^{-\nu}\right\}=\Pr\big(|T_{n1}+T_{n2}+T_{n3}+T_{n4}|\geq c_1n^{-\nu}\big),$$

where

$$\begin{aligned}T_{n1}&=\frac{1}{np_n}\sum_{i=1}^{n}\sum_{l=1}^{p_n}(1-\delta_i)\left\{\frac{1}{K}\sum_{k=1}^{K}x_{ij}\tilde{y}^{(l)}_{ik}-\hat{h}_j(x_{ij})\right\},\qquad T_{n2}=\frac{1}{n}\sum_{i=1}^{n}(1-\delta_i)\{\hat{h}_j(x_{ij})-h_j(x_{ij})\},\\T_{n3}&=\frac{1}{n}\sum_{i=1}^{n}\delta_i\{x_{ij}y_i-h_j(x_{ij})\},\qquad T_{n4}=\frac{1}{n}\sum_{i=1}^{n}\{h_j(x_{ij})-E(x_jy)\}.\end{aligned}$$

Under the assumption $y\perp\delta\mid x_j$ for each $j=1,\ldots,p_n$, it follows from the proof of Lemma A.1 in [36] that $T_{n1}=O_p(n^{-1/2})$.

Also,

$$\begin{aligned}T_{n2}&=\frac{1}{n}\sum_{i=1}^{n}(1-\delta_i)\{\hat{h}_j(x_{ij})-h_j(x_{ij})\}\\&=\frac{1}{n}\sum_{i=1}^{n}\delta_i\{x_{ij}y_i-h_j(x_{ij})\}\frac{1-\pi(x_{ij})}{\pi(x_{ij})}+\frac{1}{n}\sum_{i=1}^{n}(1-\delta_i)\frac{n^{-1}\sum_{l=1}^{n}\delta_lK_h(x_{lj}-x_{ij})\{x_{ij}y_l-h_j(x_{lj})\}}{\eta_j(x_{ij})}\\&=\frac{1}{n}\sum_{i=1}^{n}\delta_i\{x_{ij}y_i-h_j(x_{ij})\}\frac{1-\pi(x_{ij})}{\pi(x_{ij})}+\frac{1}{n}\sum_{i=1}^{n}(1-\delta_i)\frac{n^{-1}\sum_{l=1}^{n}\delta_lK_h(x_{lj}-x_{ij})\{h_j(x_{lj})-h_j(x_{ij})\}}{\eta_j(x_{ij})}\\&\quad+\frac{1}{n}\sum_{i=1}^{n}(1-\delta_i)\frac{\{\hat{h}_j(x_{ij})-h_j(x_{ij})\}\{\eta_j(x_{ij})-\hat{\eta}_j(x_{ij})\}}{\eta_j(x_{ij})}\\&=\frac{1}{n}\sum_{i=1}^{n}\delta_i\{x_{ij}y_i-h_j(x_{ij})\}\frac{1-\pi(x_{ij})}{\pi(x_{ij})}+O_p(n^{-1/2})=\tilde{T}_{n2}+O_p(n^{-1/2}),\end{aligned}$$

where $\eta_j(x)=\pi(x)f_j(x)$ and $\hat{\eta}_j(x)=\sum_{l=1}^{n}\delta_lK_h(x_{lj}-x)/n$. Then

$$\begin{aligned}\Pr\big(|T_{n1}+T_{n2}+T_{n3}+T_{n4}|\geq c_1n^{-\nu}\big)&\leq\Pr\big(|T_{n2}+T_{n3}+T_{n4}|\geq c_1n^{-\nu}/2\big)\leq\Pr\big(|\tilde{T}_{n2}+T_{n3}+T_{n4}|\geq c_1n^{-\nu}/4\big)\\&\leq\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi(x_{ij})}\{x_{ij}y_i-E(x_jy)\}\right|\geq c_1n^{-\nu}/8\right]\\&\quad+\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n}\left\{1-\frac{\delta_i}{\pi(x_{ij})}\right\}\big[h_j(x_{ij})-E\{h_j(x_{ij})\}\big]\right|\geq c_1n^{-\nu}/8\right)=J_1+J_2.\end{aligned}$$

According to Lemma S3 of [27], under Condition 2, we have

$$\Pr\Big(\max_i|\delta_ix_{ij}y_i|\geq C\Big)\leq n\Pr\big(|\delta_ix_{ij}y_i|\geq C\big)\leq nc_4\exp(-c_5C),$$

where $C>0$ is any positive constant, and $c_4$ and $c_5$ are some positive constants. Then, under Condition 1, by taking $C=c_6n^{(1-2\nu)/3}$ for some positive constant $c_6$, we obtain

$$\begin{aligned}J_1&=\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi(x_{ij})}\{x_{ij}y_i-E(x_jy)\}\right|\geq c_1n^{-\nu}/8\right]\leq\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n}\delta_i\{x_{ij}y_i-E(x_jy)\}\right|\geq C_0c_1n^{-\nu}/8\right]\\&\leq\Pr\left[\left|\frac{1}{n}\sum_{i=1}^{n}\delta_i\{x_{ij}y_i-E(x_jy)\}\right|\geq C_0c_1n^{-\nu}/8,\ \max_i|\delta_ix_{ij}y_i|<C\right]+\Pr\Big(\max_i|\delta_ix_{ij}y_i|\geq C\Big)\\&\leq2\exp(-c_7n^{1-2\nu}/C^2)+nc_4\exp(-c_5C)\leq c_9n\exp\{-c_8n^{(1-2\nu)/3}\},\end{aligned}$$

where $c_7$, $c_8$, and $c_9$ are some positive constants, and the last inequality holds due to Hoeffding's inequality. By some similar arguments, it can be shown that $J_2$ satisfies a similar bound. Hence, we complete the proof of this lemma.

Lemma A.2. Under Conditions 1–4, for any $\nu\in(0,1/2)$ and constant $c_1>0$, there exist positive constants $c_2$ and $c_3$ such that

$$\Pr\Big(\max_j|\hat{\beta}^2_j-\beta^2_{0j}|>c_1n^{-\nu}\Big)\leq c_3np_n\exp\{-c_2n^{(1-2\nu)/3}\}.$$

Proof of Lemma A.2. For any $\nu\in(0,1/2)$ and constant $c_4>0$, since $\max_j|\beta_{0j}|\leq C_2$, there exist positive constants $c_5$ and $c_6$ such that

$$\Pr\Big(\max_j|\hat{\beta}_j|\geq C_2+c_4n^{-\nu}\Big)\leq p_n\Pr\big\{|\hat{\beta}_j-\beta_{0j}|+|\beta_{0j}|\geq C_2+c_4n^{-\nu}\big\}\leq p_n\Pr\big\{|\hat{\beta}_j-\beta_{0j}|\geq c_4n^{-\nu}\big\}\leq c_5p_nn\exp\{-c_6n^{(1-2\nu)/3}\},$$

where the last inequality holds due to Lemma A.1. This shows that $\max_j|\hat{\beta}_j|$ is bounded in probability. For each $j=1,\ldots,p_n$, notice that

$$\Pr\big(|\hat{\beta}^2_j-\beta^2_{0j}|\geq c_1n^{-\nu}\big)\leq\Pr\big(|\hat{\beta}_j||\hat{\beta}_j-\beta_{0j}|+|\beta_{0j}||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}\big)\leq\Pr\big(|\hat{\beta}_j||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/2\big)+\Pr\big(|\beta_{0j}||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/2\big).$$

For the first term, we have

$$\begin{aligned}\Pr\big(|\hat{\beta}_j||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/2\big)&=\Pr\big(|\hat{\beta}_j||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/2,\ |\hat{\beta}_j|\geq C_2+c_4n^{-\nu}\big)+\Pr\big(|\hat{\beta}_j||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/2,\ |\hat{\beta}_j|<C_2+c_4n^{-\nu}\big)\\&\leq\Pr\big(|\hat{\beta}_j|\geq C_2+c_4n^{-\nu}\big)+\Pr\big\{(C_2+c_4n^{-\nu})|\hat{\beta}_j-\beta_{0j}|>c_1n^{-\nu}/2\big\}\leq c_7n\exp\{-c_8n^{(1-2\nu)/3}\},\end{aligned}$$

where $c_7$ and $c_8$ are positive constants. We next deal with the second term:

$$\Pr\big(|\beta_{0j}||\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/2\big)\leq\Pr\big\{|\hat{\beta}_j-\beta_{0j}|\geq c_1n^{-\nu}/(2C_2)\big\}\leq c_9n\exp\{-c_{10}n^{(1-2\nu)/3}\},$$

where $c_9$ and $c_{10}$ are some positive constants. Let $c_3=c_7+c_9$ and $c_2=\min\{2c_8,2c_{10}\}$. Hence,

$$\Pr\Big(\max_j|\hat{\beta}^2_j-\beta^2_{0j}|>c_1n^{-\nu}\Big)\leq p_n\Pr\big\{|\hat{\beta}^2_j-\beta^2_{0j}|>c_1n^{-\nu}\big\}\leq c_3np_n\exp\{-c_2n^{(1-2\nu)/3}\}.$$

This completes the proof.

Proof of Theorem 3.1. By the definition of $\hat{\omega}_j$, we have

$$\hat{\omega}_j=\frac{\exp(-\mathrm{BIC}_j/2)}{\sum_{l=0}^{p_n}\exp(-\mathrm{BIC}_l/2)}=\frac{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_j\big)^{-n/2}/(\sqrt{n}\,p_n)}{\big(\|\hat{Y}\|^2\big)^{-n/2}+\sum_{l=1}^{p_n}\big(\|\hat{Y}\|^2-n\hat{\beta}^2_l\big)^{-n/2}/(\sqrt{n}\,p_n)}.$$

It can be shown that $\hat{\omega}_j$ is a monotone increasing function of $\hat{\beta}^2_j$. For convenience, let $\mathrm{BIC}_{(1)}$ be the BIC score associated with $\hat{\beta}^2_{(1)}$. Then we have

$$\hat{\omega}^{U}_{(1)}=\frac{\exp(-\mathrm{BIC}_{(1)}/2)}{\sum_{j=0}^{p_n}\exp(-\mathrm{BIC}_j/2)}=\left[1+\sum_{j\geq2}\frac{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(j)}\big)^{-n/2}}{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}\big)^{-n/2}}+\sqrt{n}\,p_n\big(1-n\hat{\beta}^2_{(1)}/\|\hat{Y}\|^2\big)^{n/2}\right]^{-1}.$$

To show $\hat{\omega}^{U}_{(1)}\xrightarrow{p}1$, it suffices to show

$$\sum_{j\geq2}\frac{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(j)}\big)^{-n/2}}{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}\big)^{-n/2}}\xrightarrow{p}0,\tag{A.1}$$
$$\sqrt{n}\,p_n\big(1-n\hat{\beta}^2_{(1)}/\|\hat{Y}\|^2\big)^{n/2}\xrightarrow{p}0.\tag{A.2}$$

First, for Eq (A.1), notice that

$$\sum_{j\geq2}\frac{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(j)}\big)^{-n/2}}{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}\big)^{-n/2}}\leq p_n\frac{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}\big)^{n/2}}{\big(\|\hat{Y}\|^2-n\hat{\beta}^2_{(2)}\big)^{n/2}}=\exp\left[\log(p_n)+\frac{n}{2}\log\left\{\frac{\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}}{\|\hat{Y}\|^2-n\hat{\beta}^2_{(2)}}\right\}\right].$$

Furthermore, by the condition $\beta^2_{(1)}-\beta^2_{(2)}>C_4$, it is noteworthy that

$$\hat{\beta}^2_{(1)}-\hat{\beta}^2_{(2)}\geq\beta^2_{(1)}-\beta^2_{(2)}-|\hat{\beta}^2_{(1)}-\beta^2_{(1)}|-|\hat{\beta}^2_{(2)}-\beta^2_{(2)}|\geq C_4-2\max_j|\hat{\beta}^2_j-\beta^2_{0j}|.\tag{A.3}$$

By Lemma A.2, we know that $\Pr(\max_j|\hat{\beta}^2_j-\beta^2_{0j}|\geq c_1n^{-\nu})\to0$. This, together with Eq (A.3), implies that, with probability tending to one, $\hat{\beta}^2_{(1)}-\hat{\beta}^2_{(2)}>C_4/2$. From the proof of Lemma A.2, we know $\max_j|\hat{\beta}_j|$ is bounded in probability. Under Condition 2 and the MAR assumption, for any $C>0$, there exist positive constants $b_1$ and $b_2$ such that

$$\begin{aligned}\Pr\big(\|\hat{Y}\|^2/n<C\big)&=\Pr\big(\|Y\|^2/n<C,\delta=1\big)+\Pr\big(\|\hat{m}(X)\|^2/n<C,\delta=0\big)\geq1-\Pr\big(\|Y\|^2/n\geq C,\delta=1\big)\\&=1-E\big[E\{I(\|Y\|^2\geq Cn,\delta=1)\mid X,Y\}\big]=1-E\{I(\|Y\|^2\geq Cn)\Pr(\delta=1\mid X)\}\\&\geq1-\Pr\big(\|Y\|^2\geq Cn\big)\geq1-b_1\exp(-b_2Cn),\end{aligned}\tag{A.4}$$

where $\hat{m}(X)=\sum_{j=1}^{p_n}\sum_{k=1}^{K}\tilde{y}^{(j)}_{ik}/(p_nK)$, and the last inequality holds due to the Markov inequality. Then $\|\hat{Y}\|^2/n$ is bounded in probability. Thus, under Condition 4, it follows that $n\hat{\beta}^2_{(1)}/\|\hat{Y}\|^2<C_3<1$. Combining Eq (A.3) and Condition 4, we have

$$\exp\left[\log(p_n)+\frac{n}{2}\log\left\{\frac{\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}}{\|\hat{Y}\|^2-n\hat{\beta}^2_{(2)}}\right\}\right]\leq\exp\left[\log(p_n)+\frac{n}{2}\log\left\{\frac{\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}}{C_4/2+\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}}\right\}\right]=\exp\left[\log(p_n)+\frac{n}{2}\log\left\{1-\frac{C_4/2}{C_4/2+\|\hat{Y}\|^2-n\hat{\beta}^2_{(1)}}\right\}\right]\xrightarrow{p}0.$$

Analogously, we can prove that Eq (A.2) holds. As a result, $\hat{\omega}^{U}_{(1)}\xrightarrow{p}1$.

Proof of Theorem 3.2. Let $\mathbf{H}_j=\mathbf{x}_j\mathbf{x}_j^{\top}/n$. Notice that

$$\hat{Y}^{(m+1)}=\hat{Y}^{(m)}-\mathbf{X}\hat{\beta}^{(m)}=\hat{Y}^{(m)}-\sum_{j=1}^{p_n}\hat{\omega}_{mj}\mathbf{H}_j\hat{Y}^{(m)},$$

where $\hat{\beta}^{(m)}=(\hat{\omega}_{m1}\hat{\beta}_{m1},\ldots,\hat{\omega}_{mp_n}\hat{\beta}_{mp_n})^{\top}$. Then we have

$$\begin{aligned}\|\hat{Y}^{(m+1)}\|^2&=\|\hat{Y}^{(m)}\|^2+\Big\|\sum_{j=1}^{p_n}\hat{\omega}_{mj}\mathbf{H}_j\hat{Y}^{(m)}\Big\|^2-2\sum_{j=1}^{p_n}\hat{\omega}_{mj}\big(\hat{Y}^{(m)}\big)^{\top}\mathbf{H}_j\hat{Y}^{(m)}\\&=\|\hat{Y}^{(m)}\|^2+\Big\|\sum_{j=1}^{p_n}\hat{\omega}_{mj}\mathbf{H}_j\hat{Y}^{(m)}\Big\|^2-2\|\hat{Y}^{(m)}\|^2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}\big/\|\hat{Y}^{(m)}\|^2\\&=\|\hat{Y}^{(m)}\|^2\Big(1-2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}\big/\|\hat{Y}^{(m)}\|^2+\sum_{1\leq j_1,j_2\leq p_n}\hat{\omega}_{mj_1}\hat{\omega}_{mj_2}\hat{\beta}_{mj_1}\hat{\beta}_{mj_2}\mathbf{x}_{j_1}^{\top}\mathbf{x}_{j_2}\big/\|\hat{Y}^{(m)}\|^2\Big).\end{aligned}$$

Note that $|\mathbf{x}_{j_1}^{\top}\mathbf{x}_{j_2}|\leq n$ for any $1\leq j_1,j_2\leq p_n$, due to the fact that $\|\mathbf{x}_j\|^2=n$. This suggests that

$$\begin{aligned}&\|\hat{Y}^{(m)}\|^2\Big(1-2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}\big/\|\hat{Y}^{(m)}\|^2+\sum_{1\leq j_1,j_2\leq p_n}\hat{\omega}_{mj_1}\hat{\omega}_{mj_2}\hat{\beta}_{mj_1}\hat{\beta}_{mj_2}\mathbf{x}_{j_1}^{\top}\mathbf{x}_{j_2}\big/\|\hat{Y}^{(m)}\|^2\Big)\\&\quad\leq\|\hat{Y}^{(m)}\|^2\Big\{1-2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}\big/\|\hat{Y}^{(m)}\|^2+n\Big(\sum_{j=1}^{p_n}\hat{\omega}_{mj}|\hat{\beta}_{mj}|\Big)^2\big/\|\hat{Y}^{(m)}\|^2\Big\}.\end{aligned}$$

Therefore, we have

$$\|\hat{Y}^{(m)}\|^2-\|\hat{Y}^{(m+1)}\|^2\geq2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}-n\Big(\sum_{j=1}^{p_n}\hat{\omega}_{mj}|\hat{\beta}_{mj}|\Big)^2\geq2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}-n\Big(\sum_{j=1}^{p_n}\hat{\omega}_{mj}\Big)\Big(\sum_{j=1}^{p_n}\hat{\omega}_{mj}\hat{\beta}^2_{mj}\Big)\geq\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}.$$

Now, by the definition of $\hat{Y}^{(m+1)}$, we have

$$\|\hat{Y}^{(m)}\|^2-\|\hat{Y}^{(m+1)}\|^2=2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}-\Big\|\sum_{j=1}^{p_n}\hat{\omega}_{mj}\mathbf{x}_j\hat{\beta}_{mj}\Big\|^2\leq2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}.$$

Notice that $\hat{\beta}^2_{mj}\leq\hat{\beta}^2_{m(1)}$ for each $j=1,\ldots,p_n$. Then

$$\|\hat{Y}^{(m)}\|^2-\|\hat{Y}^{(m+1)}\|^2\leq2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{mj}\leq2\sum_{j=1}^{p_n}n\hat{\omega}_{mj}\hat{\beta}^2_{m(1)}=2n(1-\hat{\omega}_{m0})\hat{\beta}^2_{m(1)}.$$

Lemma A.3. Under Conditions 1–4, for every $j\in\{1,2,\ldots,p_n\}$, we have $\max_j|\hat{\beta}^2_{mj}-\beta^2_{mj}|\xrightarrow{p}0$ for any $m=1,\ldots,M$.

Proof of Lemma A.3. We demonstrate this lemma for general $m$ by induction. For simplicity, we only show that the result is valid for $m=2$, assuming that it holds for $m=1$; the result for $m=1$ can be derived directly from Lemma A.2. Recall that $\beta^{(1)}_0=(\omega_{11}\beta_{11},\ldots,\omega_{1p_n}\beta_{1p_n})^{\top}\in\mathbb{R}^{p_n}$ and $\hat{\beta}^{(1)}=(\hat{\omega}_{11}\hat{\beta}_{11},\ldots,\hat{\omega}_{1p_n}\hat{\beta}_{1p_n})^{\top}\in\mathbb{R}^{p_n}$. From the definition of $\hat{\beta}_{2j}$, we have

$$\hat{\beta}_{2j}=\frac{1}{n}\mathbf{x}_j^{\top}\hat{Y}^{(2)}=\frac{1}{n}\mathbf{x}_j^{\top}\big(\hat{Y}^{(1)}-\mathbf{X}\beta^{(1)}_0\big)+\frac{1}{n}\mathbf{x}_j^{\top}\mathbf{X}\big(\beta^{(1)}_0-\hat{\beta}^{(1)}\big).$$

From the proof of Lemma A.1, to prove $\hat{\beta}_{2j}\xrightarrow{p}\beta_{2j}$ for every $j\in\{1,2,\ldots,p_n\}$, it suffices to show the following result:

$$\max_j\Big|\frac{1}{n}\mathbf{x}_j^{\top}\mathbf{X}\big(\beta^{(1)}_0-\hat{\beta}^{(1)}\big)\Big|=o_p(1).$$

By Hölder's inequality, we have

$$\max_j\Big|\frac{1}{n}\mathbf{x}_j^{\top}\mathbf{X}\big(\beta^{(1)}_0-\hat{\beta}^{(1)}\big)\Big|\leq\max_{j_1,j_2}|\hat{\sigma}_{j_1j_2}|\,\big|\beta^{(1)}_0-\hat{\beta}^{(1)}\big|_1,$$

where $\hat{\sigma}_{j_1j_2}=\mathbf{x}_{j_1}^{\top}\mathbf{x}_{j_2}/(\|\mathbf{x}_{j_1}\|\|\mathbf{x}_{j_2}\|)$. By the results of Proposition 1 in [37], under Condition 2, we obtain that $\max_{j_1,j_2}|\hat{\sigma}_{j_1j_2}-\sigma_{j_1j_2}|\xrightarrow{p}0$, and hence $\max_{j_1,j_2}|\hat{\sigma}_{j_1j_2}|=O_p(1)$. Let $\omega_{1(1)}\geq\omega_{1(2)}\geq\cdots\geq\omega_{1(p_n)}$ be the ordered statistics of $\{\omega_{1j},1\leq j\leq p_n\}$ and $\hat{\omega}_{1(j)}$ be the corresponding estimators for $j=1,2,\ldots,p_n$. By the result of Theorem 3.1, there exists a positive constant $\xi<1$ such that

$$\hat{\omega}_{1(1)}\xrightarrow{p}1,\quad\omega_{1(1)}\to1,\quad\text{and}\quad\hat{\omega}_{1(j)}\leq\xi^{n},\ \ \omega_{1(j)}\leq\xi^{n}\ \ \text{for }j\geq2.$$

    Thus, we have

$$\big|\beta_{0}^{(1)}-\hat{\beta}^{(1)}\big|_{1}\leq \big|\beta_{0(1)}-\hat{\beta}_{(1)}\big|+(p_n-1)\xi^{n}.$$

By the result of Lemma A.1, together with the above inequality, we have $\big|\beta_{0}^{(1)}-\hat{\beta}^{(1)}\big|_{1}\xrightarrow{p}0$. Hence, $\hat{\beta}_{2j}\xrightarrow{p}\beta_{2j}$. The remaining steps are similar to those in the proof of Lemma A.2 and are omitted here. This completes the proof of Lemma A.3.

Proof of Theorem 3.3. From the proof of Theorem 3.1, it can also be shown that $\hat{\omega}_{mj}$ is a monotone increasing function of $\hat{\beta}_{mj}^{2}$. Then

$$\hat{\omega}_{m(1)}=\left[1+\frac{\sum_{j\geq 2}\big(\|\hat{Y}^{(m)}\|^{2}-n\hat{\beta}_{m(j)}^{2}\big)^{-n/2}}{\big(\|\hat{Y}^{(m)}\|^{2}-n\hat{\beta}_{m(1)}^{2}\big)^{-n/2}}+np_n\big\{1-n\hat{\beta}_{m(1)}^{2}/\|\hat{Y}^{(m)}\|^{2}\big\}^{n/2}\right]^{-1}.$$

Thus, to demonstrate $\hat{\omega}_{m(1)}\xrightarrow{p}1$, it suffices to show

$$\frac{\sum_{j\geq 2}\big(\|\hat{Y}^{(m)}\|^{2}-n\hat{\beta}_{m(j)}^{2}\big)^{-n/2}}{\big(\|\hat{Y}^{(m)}\|^{2}-n\hat{\beta}_{m(1)}^{2}\big)^{-n/2}}=o_{p}(1),\qquad np_n\big(1-n\hat{\beta}_{m(1)}^{2}/\|\hat{Y}^{(m)}\|^{2}\big)^{n/2}=o_{p}(1).$$

Following the proof of Theorem 3.1, both statements hold under Conditions 1–5, which completes the proof of the first part of Theorem 3.3.

We now turn to the second part of the theorem. Notice that

$$\hat{\omega}_{m0}=\frac{\exp(-\mathrm{BIC}_{m0}/2)}{\sum_{j=0}^{p_n}\exp(-\mathrm{BIC}_{mj}/2)}=\left\{1+(np_n)^{-1}\sum_{j=1}^{p_n}\big(1-n\hat{\beta}_{mj}^{2}/\|\hat{Y}^{(m)}\|^{2}\big)^{-n/2}\right\}^{-1}.$$

Similar to the proof of Theorem 3.1, it is straightforward to show that $\|\hat{Y}^{(m)}\|^{2}/n$ is bounded in probability. Thus, assuming $\beta_{m(1)}^{2}=O(n^{-\gamma})$ with $\gamma>0$ and using the results of Lemma A.3, we have $\big(1-n\hat{\beta}_{mj}^{2}/\|\hat{Y}^{(m)}\|^{2}\big)^{-n/2}=O_{p}(1)$ uniformly in $j$. As a result, $(np_n)^{-1}\sum_{j=1}^{p_n}\big(1-n\hat{\beta}_{mj}^{2}/\|\hat{Y}^{(m)}\|^{2}\big)^{-n/2}\xrightarrow{p}0$, which implies that $\hat{\omega}_{m0}\xrightarrow{p}1$. Hence, we have completed the proof.
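To make the final rate explicit, note that reading the $O_{p}(1)$ bound above as uniform in $j$ gives
$$(np_n)^{-1}\sum_{j=1}^{p_n}\big(1-n\hat{\beta}_{mj}^{2}/\|\hat{Y}^{(m)}\|^{2}\big)^{-n/2}\leq n^{-1}\max_{1\leq j\leq p_n}\big(1-n\hat{\beta}_{mj}^{2}/\|\hat{Y}^{(m)}\|^{2}\big)^{-n/2}=O_{p}(n^{-1})=o_{p}(1),$$
which is the convergence used to conclude that $\hat{\omega}_{m0}\xrightarrow{p}1$.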

Proof of Theorem 3.4. Let $\Lambda=\beta_{0}-\hat{\beta}^{M}$; then $\|\Lambda\|^{2}=\|\Lambda_{A_n}\|^{2}+\|\Lambda_{A_n^{c}}\|^{2}$. To prove $\|\hat{\beta}^{M}-\beta_{0}\|\xrightarrow{p}0$, it suffices to show that $\|\Lambda_{A_n}\|\xrightarrow{p}0$ and $\|\Lambda_{A_n^{c}}\|\xrightarrow{p}0$. Notice that

$$\begin{aligned}
\lambda_{\max}\big(X_{A_n^{c}}^{\top}X_{A_n^{c}}\big)\|\Lambda_{A_n^{c}}\|^{2}&\leq 2\lambda_{\max}\big(X_{A_n^{c}}^{\top}X_{A_n^{c}}\big)\big\|\hat{\beta}^{M}_{A_n^{c}}\big\|^{2}+2\lambda_{\max}\big(X_{A_n^{c}}^{\top}X_{A_n^{c}}\big)\big\|\beta_{0A_n^{c}}\big\|^{2}\\
&\leq 2\,\mathrm{tr}\big(X_{A_n^{c}}^{\top}X_{A_n^{c}}\big)\big\|\hat{\beta}^{M}_{A_n^{c}}\big\|^{2}+2\,\mathrm{tr}\big(X_{A_n^{c}}^{\top}X_{A_n^{c}}\big)\big\|\beta_{0A_n^{c}}\big\|^{2}\\
&=O\big(|A_n^{c}|\,\big\|\hat{\beta}^{M}_{A_n^{c}}\big\|^{2}+|A_n^{c}|\,\big\|\beta_{0A_n^{c}}\big\|^{2}\big).
\end{aligned}$$

Furthermore, by Condition 6(iii) and the proof of Theorem 3.1, we have

$$\begin{aligned}
|A_n^{c}|\,\big\|\hat{\beta}^{M}_{A_n^{c}}\big\|^{2}&\leq M|A_n^{c}|\sum_{j\in A_n^{c}}\sum_{m=1}^{M}\hat{\omega}_{mj}^{2}\hat{\beta}_{mj}^{2}\leq M|A_n^{c}|\,\frac{\|\hat{Y}\|^{2}}{n}\sum_{j\in A_n^{c}}\sum_{m=1}^{M}\hat{\omega}_{mj}^{2}\Big\{\hat{\beta}_{mj}\big/\big(n^{-1/2}\|\hat{Y}^{(m)}\|\big)\Big\}^{2}\\
&\leq M|A_n^{c}|\sum_{j\in A_n^{c}}\sum_{m=1}^{M}\hat{\omega}_{mj}^{2}\,\frac{\|\hat{Y}\|^{2}}{n}\leq CM|A_n^{c}|\sum_{j\in A_n^{c}}\sum_{m=1}^{M}\hat{\omega}_{mj}^{2}\leq CM^{2}|A_n^{c}|\sup_{m\geq 1}\sum_{j\in A_n^{c}}\hat{\omega}_{mj}^{2}\xrightarrow{p}0.
\end{aligned}$$

By Condition 6(ii), together with Eq (A.4), it follows that $\|\Lambda_{A_n^{c}}\|\xrightarrow{p}0$. Next, we show that $\|\Lambda_{A_n}\|\xrightarrow{p}0$. By Condition 7 and Lemma 3 in [22], we have

$$\|\Lambda_{A_n}\|^{2}\leq \lambda_{\min}^{-1}\,\Lambda_{A_n}^{\top}\big(n^{-1}X_{A_n}^{\top}X_{A_n}\big)\Lambda_{A_n}\leq \lambda_{\min}^{-1}\,\big|\Lambda_{A_n}\big|_{1}\max_{j}\big(n^{-1}\big|x_{j}^{\top}X_{A_n}\Lambda_{A_n}\big|\big)\leq \lambda_{\min}^{-1}\,|A_n|^{1/2}\,\|\Lambda_{A_n}\|\max_{j}\big(n^{-1}\big|x_{j}^{\top}X_{A_n}\Lambda_{A_n}\big|\big),$$

    where the second inequality holds due to Hölder's inequality. This leads to

$$\|\Lambda_{A_n}\|\leq \lambda_{\min}^{-1}\,|A_n|^{1/2}\max_{j}\big(n^{-1}\big|x_{j}^{\top}X_{A_n}\Lambda_{A_n}\big|\big).\tag{A.5}$$

Notice that $\hat{Y}^{(M+1)}=\hat{Y}^{(M)}-X\hat{\beta}^{(M)}=\hat{Y}-X\beta_{0}+X\Lambda$. Then, by the triangle inequality, we have

$$|A_n|^{1/2}\max_{j}\big(n^{-1}\big|x_{j}^{\top}X\Lambda\big|\big)\leq |A_n|^{1/2}\max_{j}\big\{n^{-1}\big|x_{j}^{\top}(\hat{Y}-X\beta_{0})\big|\big\}+|A_n|^{1/2}\max_{j}\big(n^{-1}\big|x_{j}^{\top}\hat{Y}^{(M+1)}\big|\big).$$

For the first term on the right-hand side of the above inequality, we have

$$|A_n|^{1/2}\max_{j}\big\{n^{-1}\big|x_{j}^{\top}(\hat{Y}-X\beta_{0})\big|\big\}\leq |A_n|^{1/2}\max_{j}\Big(n^{-1}\Big|\sum_{i=1}^{n}x_{ij}\varepsilon_{i}\Big|\Big)+|A_n|^{1/2}\max_{j}\Big[\frac{1}{n}\Big|\sum_{i=1}^{n}x_{ij}(1-\delta_{i})\Big\{\frac{1}{p_nK}\sum_{l=1}^{p_n}\sum_{k=1}^{K}\tilde{y}_{ik}^{(l)}-y_{i}\Big\}\Big|\Big]=I_{1}+I_{2}.$$

By the Bonferroni inequality and Lemma A.3 in [38], we obtain $I_{1}\xrightarrow{p}0$. From the proof of Lemma A.1, for each $j=1,\ldots,p_n$, we have

$$\begin{aligned}
&\frac{1}{n}\sum_{i=1}^{n}x_{ij}(1-\delta_{i})\Big\{\frac{1}{p_nK}\sum_{l=1}^{p_n}\sum_{k=1}^{K}\tilde{y}_{ik}^{(l)}-y_{i}\Big\}\\
&\qquad=\frac{1}{np_n}\sum_{i=1}^{n}\sum_{l=1}^{p_n}(1-\delta_{i})\Big\{\frac{1}{K}\sum_{k=1}^{K}x_{ij}\tilde{y}_{ik}^{(l)}-\hat{h}_{j}(x_{ij})\Big\}+\frac{1}{n}\sum_{i=1}^{n}(1-\delta_{i})\big\{\hat{h}_{j}(x_{ij})-h_{j}(x_{ij})\big\}+\frac{1}{n}\sum_{i=1}^{n}(1-\delta_{i})\big\{h_{j}(x_{ij})-x_{ij}y_{i}\big\}\\
&\qquad=O_{p}(n^{-r}),
\end{aligned}$$

where $0<r<1/2$. Then, by Condition 6(i),

$$I_{2}=|A_n|^{1/2}\max_{j}\Big[\frac{1}{n}\Big|\sum_{i=1}^{n}x_{ij}(1-\delta_{i})\Big\{\frac{1}{p_nK}\sum_{l=1}^{p_n}\sum_{k=1}^{K}\tilde{y}_{ik}^{(l)}-y_{i}\Big\}\Big|\Big]=\frac{|A_n|^{1/2}}{n^{1-r}}\max_{j}\Big[\frac{1}{n^{r}}\Big|\sum_{i=1}^{n}x_{ij}(1-\delta_{i})\Big\{\frac{1}{p_nK}\sum_{l=1}^{p_n}\sum_{k=1}^{K}\tilde{y}_{ik}^{(l)}-y_{i}\Big\}\Big|\Big]\xrightarrow{p}0.$$

As a result, the term $|A_n|^{1/2}\max_{j}\big\{n^{-1}\big|x_{j}^{\top}(\hat{Y}-X\beta_{0})\big|\big\}\xrightarrow{p}0$. Similar to the proof of Theorem 5 in [22], under Conditions 1–3 we can show that $\max_{j}\big(n^{-1}\big|x_{j}^{\top}\hat{Y}^{(M+1)}\big|\big)\xrightarrow{p}0$; this, together with Lemma A.1 in [36], implies that $x_{j}^{\top}\hat{Y}^{(M+1)}/n=O_{p}(1)$ for each $j=1,\ldots,p_n$. Then we have $|A_n|^{1/2}\max_{j}\big(n^{-1}\big|x_{j}^{\top}\hat{Y}^{(M+1)}\big|\big)\xrightarrow{p}0$. Accordingly,

$$|A_n|^{1/2}\max_{j}\big(n^{-1}\big|x_{j}^{\top}X\Lambda\big|\big)\xrightarrow{p}0.\tag{A.6}$$

On the other hand, by the Cauchy–Schwarz inequality and $\|\Lambda_{A_n^{c}}\|\xrightarrow{p}0$, we obtain

$$|A_n|\Big\{n^{-1}\max_{j}\big|x_{j}^{\top}X_{A_n^{c}}\Lambda_{A_n^{c}}\big|\Big\}^{2}\leq 2|A_n|\,\lambda_{\max}\big\{n^{-1}X_{A_n^{c}}^{\top}X_{A_n^{c}}\big\}\|\Lambda_{A_n^{c}}\|^{2}\xrightarrow{p}0.$$

This, in conjunction with Eq (A.6), implies that $|A_n|^{1/2}\max_{j}\big(n^{-1}\big|x_{j}^{\top}X_{A_n}\Lambda_{A_n}\big|\big)\xrightarrow{p}0$. Hence the right-hand side of Eq (A.5) converges to zero in probability, so $\|\Lambda_{A_n}\|\xrightarrow{p}0$. Therefore, $\|\Lambda\|=\|\beta_{0}-\hat{\beta}^{M}\|\xrightarrow{p}0$, which completes the proof.



    [1] R. Little, D. Rubin, Statistical analysis with missing data, 3 Eds., New York: John Wiley & Sons, 2019. https://doi.org/10.1002/9781119482260
    [2] J. K. Kim, J. Shao, Statistical methods for handling incomplete data, 2 Eds., New York: Chapman and Hall/CRC, 2021. https://doi.org/10.1201/9780429321740
    [3] W. M. Elmessery, A. Habib, M. Y. Shams, T. Abd El-Hafeez, T. M. El-Messery, S. Elsayed, et al., Deep regression analysis for enhanced thermal control in photovoltaic energy systems, Sci. Rep., 14 (2024), 30600. https://doi.org/10.1038/s41598-024-81101-x
    [4] H. M. Farghaly, A. A. Ali, T. Abd El-Hafeez, Building an effective and accurate associative classifier based on support vector machine, Sylwan, 164 (2020), 39–56.
    [5] B. E. Hansen, Least squares model averaging, Econometrica, 75 (2007), 1175–1189. https://doi.org/10.1111/j.1468-0262.2007.00785.x
    [6] B. E. Hansen, J. S. Racine, Jackknife model averaging, J. Econometrics, 167 (2012), 38–46. https://doi.org/10.1016/j.jeconom.2011.06.019
    [7] Q. F. Liu, R. Okui, Heteroscedasticity-robust Cp model averaging, Econom. J., 16 (2013), 463–472. https://doi.org/10.1111/ectj.12009
    [8] X. Y. Zhang, D. L. Yu, G. H. Zou, H. Liang, Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models, J. Amer. Statist. Assoc., 111 (2016), 1775–1790. https://doi.org/10.1080/01621459.2015.1115762
    [9] M. Schomaker, A. T. K. Wan, C. Heumann, Frequentist model averaging with missing observations, Comput. Statist. Data Anal., 54 (2010), 3336–3347. https://doi.org/10.1016/j.csda.2009.07.023
    [10] V. Dardanoni, S. Modica, F. Peracchi, Regression with imputed covariates: a generalized missing-indicator approach, J. Econometrics, 162 (2011), 362–368. https://doi.org/10.1016/j.jeconom.2011.02.005
    [11] X. Y. Zhang, Model averaging with covariates that are missing completely at random, Econom. Lett., 121 (2013), 360–363. https://doi.org/10.1016/j.econlet.2013.09.008
    [12] F. Fang, W. Lan, J. J. Tong, J. Shao, Model averaging for prediction with fragmentary data, J. Bus. Econom. Statist., 37 (2019), 517–527. https://doi.org/10.1080/07350015.2017.1383263
    [13] Z. Q. Liang, Q. H. Wang, A robust model averaging approach for partially linear models with responses missing at random, Scand. J. Statist., 50 (2023), 1933–1952. https://doi.org/10.1111/sjos.12659
    [14] Z. Q. Liang, Y. Q. Zhou, Model averaging based on weighted generalized method of moments with missing responses, AIMS Math., 8 (2023), 21683–21699. https://doi.org/10.3934/math.20231106
    [15] Z. Q. Liang, S. J. Wang, L. Cai, Optimal model averaging for partially linear models with missing response variables and error-prone covariates, Stat. Med., 43 (2024), 4328–4348. https://doi.org/10.1002/sim.10176
    [16] X. Lu, L. J. Su, Jackknife model averaging for quantile regressions, J. Econometrics, 188 (2015), 40–58. https://doi.org/10.1016/j.jeconom.2014.11.005
    [17] X. Y. Zhang, G. H. Zou, H. Liang, R. J. Carroll, Parsimonious model averaging with a diverging number of parameters, J. Amer. Statist. Assoc., 115 (2020), 972–984. https://doi.org/10.1080/01621459.2019.1604363
    [18] T. Ando, K. C. Li, A model-averaging approach for high-dimensional regression, J. Amer. Statist. Assoc., 109 (2014), 254–265. https://doi.org/10.1080/01621459.2013.838168
    [19] T. Ando, K. C. Li, A weight-relaxed model averaging approach for high-dimensional generalized linear models, Ann. Statist., 45 (2017), 2654–2679. https://doi.org/10.1214/17-AOS1538
    [20] X. Cheng, B. E. Hansen, Forecasting with factor-augmented regression: a frequentist model averaging approach, J. Econometrics, 186 (2015), 280–293. https://doi.org/10.1016/j.jeconom.2015.02.010
    [21] J. Chen, D. G. Li, O. Linton, Z. D. Lu, Semiparametric ultra-high dimensional model averaging of nonlinear dynamic time series, J. Amer. Statist. Assoc., 113 (2018), 919–932. https://doi.org/10.1080/01621459.2017.1302339
    [22] W. Lan, Y. Y. Ma, J. L. Zhao, H. S. Wang, C. L. Tsai, Sequential model averaging for high dimensional linear regression models, Statist. Sinica, 28 (2018), 449–469. https://doi.org/10.5705/ss.202016.0122
    [23] J. H. Xie, X. D. Yan, N. S. Tang, A model-averaging method for high-dimensional regression with missing responses at random, Statist. Sinica, 31 (2021), 1005–1026. https://doi.org/10.5705/ss.202018.0297
    [24] X. M. He, L. Wang, H. G. Hong, Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data, Ann. Statist., 41 (2013), 342–369. https://doi.org/10.1214/13-AOS1087
    [25] C. Y. Tang, Y. S. Qin, An efficient empirical likelihood approach for estimating equations with missing data, Biometrika, 99 (2012), 1001–1007. https://doi.org/10.1093/biomet/ass045
    [26] X. R. Chen, A. T. K. Wan, Y. Zhou, Efficient quantile regression analysis with missing observations, J. Amer. Statist. Assoc., 110 (2015), 723–741. https://doi.org/10.1080/01621459.2014.928219
    [27] J. Y. Liu, R. Z. Li, R. L. Wu, Feature selection for varying coefficient models with ultrahigh-dimensional covariates, J. Amer. Statist. Assoc., 109 (2014), 266–274. https://doi.org/10.1080/01621459.2013.850086
    [28] M. Kalisch, P. Bühlmann, Estimating high-dimensional directed acyclic graphs with the PC-algorithm, J. Mach. Learn. Res., 8 (2007), 613–636. https://doi.org/10.5555/1314498.1314520
    [29] H. S. Wang, Forward regression for ultra-high dimensional variable screening, J. Amer. Statist. Assoc., 104 (2009), 1512–1524. https://doi.org/10.1198/jasa.2008.tm08516
    [30] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Methodol., 58 (1996), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
    [31] A. Azzalini, A. Capitanio, Statistical applications of the multivariate skew normal distribution, J. R. Stat. Soc. Ser. B Stat. Methodol., 61 (1999), 579–602. https://doi.org/10.1111/1467-9868.00194
    [32] T. E. Scheetz, K. Y. A. Kim, R. E. Swiderski, A. R. Philp, T. A. Braun, K. L. Knudtson, et al., Regulation of gene expression in the mammalian eye and its relevance to eye disease, Proc. Natl. Acad. Sci., 103 (2006), 14429–14434. https://doi.org/10.1073/pnas.0602562103
    [33] A. P. Chiang, J. S. Beck, H. J. Yen, M. K. Tayeh, T. E. Scheetz, R. E. Swiderski, et al., Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet-Biedl syndrome gene (BBS11), Proc. Natl. Acad. Sci., 103 (2006), 6287–6292. https://doi.org/10.1073/pnas.0600158103
    [34] X. W. Ding, J. D. Chen, X. P. Chen, Regularized quantile regression for ultrahigh-dimensional data with nonignorable missing responses, Metrika, 83 (2020), 545–568. https://doi.org/10.1007/s00184-019-00744-3
    [35] H. R. Wang, Z. P. Lu, Y. K. Liu, Score test for missing at random or not under logistic missingness models, Biometrics, 79 (2023), 1268–1279. https://doi.org/10.1111/biom.13666
    [36] D. Wang, S. X. Chen, Empirical likelihood for estimating equations with missing values, Ann. Statist., 37 (2009), 490–517. https://doi.org/10.1214/07-AOS585
    [37] B. Y. Jiang, Covariance selection by thresholding the sample correlation matrix, Statist. Probab. Lett., 83 (2013), 2492–2498. https://doi.org/10.1016/j.spl.2013.07.008
    [38] P. J. Bickel, E. Levina, Covariance regularization by thresholding, Ann. Statist., 36 (2008), 2577–2604. https://doi.org/10.1214/08-AOS600
© 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0).