Research article

Robust variable selection for ultrahigh-dimensional linear models with nonignorable missing response

  • Published: 19 August 2025
  • We propose a robust and efficient variable selection method for ultrahigh-dimensional linear models with nonrandomly missing responses, based on modal regression. The propensity score function is specified by a semiparametric model, and we introduce a two-step estimation procedure: in the first stage, the Pearson chi-square (PC) test statistic screens the significant predictors in the sparse propensity score model; in the second stage, the generalized method of moments (GMM) yields consistent parameter estimates for the propensity score. With the estimated propensity score, we develop a feature screening and variable selection procedure based on inverse probability weighting (IPW): a modified sure independence screening (SIS) method first reduces the model dimensionality, and a penalized modal regression approach then selects the significant covariates. The proposed procedure handles ultrahigh-dimensional data with nonignorable nonresponse, and the modal-based estimation is robust against outliers and heavy-tailed errors. We also establish the asymptotic properties of the estimators under mild regularity conditions. Simulation studies and real data applications confirm the method's effectiveness in finite samples and practical settings.
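    The two-stage pipeline described above (IPW-weighted SIS screening followed by modal regression on the retained covariates) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the simulated data, the propensity model, the fixed bandwidth, and the hard-thresholding step (standing in for the paper's penalized estimation) are all assumptions of the sketch, and the PC-test/GMM estimation of the propensity score is skipped by treating the true propensity as known.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # --- Simulated data (hypothetical setup, not from the paper) ---
    n, p = 300, 500
    beta_true = np.zeros(p)
    active = [0, 1, 2]
    beta_true[active] = [3.0, 2.0, 2.5]      # sparse signal
    X = rng.standard_normal((n, p))
    eps = rng.standard_t(df=3, size=n)       # heavy-tailed error
    y = X @ beta_true + eps

    # Nonignorable missingness: the response probability depends on y
    # itself.  For this sketch the propensity pi_i is treated as known,
    # skipping the PC-screening + GMM step of the paper.
    pi = 1.0 / (1.0 + np.exp(-(2.0 + 0.3 * y)))
    delta = rng.binomial(1, pi)              # 1 = response observed
    w_ipw = delta / pi                       # inverse probability weights

    # --- Step 1: IPW-weighted marginal (SIS-type) screening ---
    util = np.abs((w_ipw * y) @ X) / n       # one marginal utility per covariate
    d = int(n / np.log(n))                   # keep the top-d covariates
    keep = np.argsort(util)[::-1][:d]

    # --- Step 2: modal regression on the screened set (modal EM) ---
    Xs = X[:, keep]
    ridge = 1e-6 * np.eye(d)                 # numerical stabilization only
    # initialize with IPW-weighted least squares
    beta = np.linalg.solve(Xs.T @ (w_ipw[:, None] * Xs) + ridge,
                           Xs.T @ (w_ipw * y))
    h = 1.0                                  # kernel bandwidth, fixed here
    for _ in range(200):
        r = y - Xs @ beta
        w = w_ipw * np.exp(-0.5 * (r / h) ** 2)   # Gaussian mode weights
        beta_new = np.linalg.solve(Xs.T @ (w[:, None] * Xs) + ridge,
                                   Xs.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new

    # Hard thresholding stands in for the paper's penalized selection
    selected = sorted(keep[np.abs(beta) > 0.5])
    ```

    The modal-EM loop downweights observations with large residuals through the Gaussian kernel, which is what makes the fit robust to the heavy-tailed errors; multiplying by the IPW weights corrects for the nonignorable nonresponse under the (here assumed known) propensity.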

    Citation: Yanting Xiao, Yifan Shi. Robust variable selection for ultrahigh-dimensional linear models with nonignorable missing response[J]. Electronic Research Archive, 2025, 33(8): 4816-4836. doi: 10.3934/era.2025217



  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)


Figures and Tables

Figures(2)  /  Tables(5)
