This paper addresses the prediction problem in linear regression models with ultrahigh-dimensional covariates and missing responses. Assuming a missing-at-random mechanism, we introduce a novel nonparametric multiple imputation method to handle the missing response values. Based on the imputed responses, we propose an efficient iterative model averaging method that embeds an iterative screening step within the model averaging framework. The candidate-model weights are determined by the Bayesian information criterion, which balances model fit against complexity. The iterative structure keeps the procedure computationally feasible and substantially reduces the computational burden relative to conventional methods. Under mild regularity conditions, we show that the proposed method mitigates the risk of overfitting and yields consistent estimators of the regression coefficients. Simulation studies and a real-data application demonstrate its practical efficacy, with superior predictive accuracy and flexibility compared with several competing approaches.
Citation: Xianwen Ding, Tong Su, Yunqi Zhang. An efficient iterative model averaging framework for ultrahigh-dimensional linear regression models with missing data[J]. AIMS Mathematics, 2025, 10(6): 13795-13824. doi: 10.3934/math.2025621
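To make the workflow concrete, the sketch below outlines the pipeline the abstract describes: nonparametrically impute the missing responses, iteratively screen covariates to build nested candidate models, and combine the candidates with BIC-based weights. This is a minimal illustration under stated assumptions, not the paper's implementation: the kNN imputation, the inner-product screening rule, the number of imputations and candidate models, and the helper names (`impute_responses`, `bic_model_average`, `predict_average`) are all hypothetical choices made for the example.

```python
# Illustrative sketch only; the paper's exact imputation and screening rules are
# not reproduced here. Missing responses are assumed to be coded as np.nan.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression


def impute_responses(X, y, n_imputations=5, n_neighbors=10, rng=None):
    """Multiple imputation of missing y via a kNN regression fit on the covariates."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    knn = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X[obs], y[obs])
    resid_sd = np.std(y[obs] - knn.predict(X[obs]))
    imputed = []
    for _ in range(n_imputations):
        y_m = y.copy()
        # draw imputations around the nonparametric fit to reflect imputation uncertainty
        y_m[~obs] = knn.predict(X[~obs]) + rng.normal(0.0, resid_sd, size=(~obs).sum())
        imputed.append(y_m)
    return imputed


def bic_model_average(X, y, max_size=10):
    """Iteratively screen covariates, fit nested OLS candidates, and weight them by BIC."""
    n, p = X.shape
    active, models = [], []
    residual = y - y.mean()
    for _ in range(min(max_size, p)):
        # screen: covariate with the largest (absolute) inner product with the current residual
        scores = np.abs(X.T @ (residual - residual.mean()))
        scores[active] = -np.inf
        active.append(int(np.argmax(scores)))
        fit = LinearRegression().fit(X[:, active], y)
        rss = np.sum((y - fit.predict(X[:, active])) ** 2)
        # Gaussian BIC up to an additive constant
        bic = n * np.log(rss / n) + len(active) * np.log(n)
        residual = y - fit.predict(X[:, active])
        models.append((list(active), fit, bic))
    bics = np.array([m[2] for m in models])
    weights = np.exp(-(bics - bics.min()) / 2.0)
    weights /= weights.sum()
    return models, weights


def predict_average(models, weights, X_new):
    """Weighted average of the candidate-model predictions."""
    preds = np.array([fit.predict(X_new[:, idx]) for idx, fit, _ in models])
    return np.sum(weights[:, None] * preds, axis=0)
```

In a multiple-imputation setting one would run the averaging step once per imputed response vector and then pool the resulting predictions, for example by a simple average across the imputations.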