Research article Special Issues

Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts

  • Published: 13 May 2026
  • In the context of Global Health, massive administrative datasets have become indispensable tools for health surveillance. However, the sheer scale of Big Data can mask systemic selection biases that standard mathematical adjustments may not fully mitigate. In this study, we propose a methodological audit of a recent large-scale cohort (N = 2,975,035) concerning COVID-19 vaccination and oncological outcomes. By benchmarking the cohort's architecture against national demographic and epidemiological gold standards through single-proportion Z-tests, we identified notable structural divergences. The first inferential test yielded a Z-score of −260.39 (p < 10⁻⁵⁰), suggesting a structural under-sampling of the elderly population (32.2% deficit) relative to the reference population. The second test identified a statistically inconsistent cancer incidence deficit in the non-vaccinated control group (Z = −15.23, p < 10⁻⁵⁰). These findings indicate that the reported statistical signals may emerge as a computational consequence of structural selection bias, where an artificially deflated baseline in the control group potentially inflates Hazard Ratios. Within a One Health approach, ensuring the structural integrity of data is crucial for effective prevention and control measures. We conclude that large-scale surveillance studies could be inferentially validated against demographic benchmarks to ensure that public health conclusions are grounded in baseline equivalence, thereby safeguarding the reliability of global health monitoring.
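
    The single-proportion Z-test named in the abstract can be sketched as follows. This is a minimal illustration of the standard textbook test, not the author's code (which is available on request, per the data availability statement). The function name `one_proportion_z_test` and the sample figures (120 of 1,000 against a reference proportion of 0.15) are hypothetical placeholders and are not taken from the study.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Two-sided single-proportion Z-test against a reference proportion p0.

    Returns the Z statistic and the two-sided p-value under the normal
    approximation, using the standard error implied by p0.
    """
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("n must be positive and p0 strictly between 0 and 1")
    p_hat = successes / n                            # observed sample proportion
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)  # SE under the null hypothesis
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))     # two-sided normal tail area
    return z, p_value


# Hypothetical illustration only: these numbers are NOT from the audited cohort.
z, p = one_proportion_z_test(successes=120, n=1000, p0=0.15)
print(f"z = {z:.2f}, p = {p:.4f}")
```

    Because the standard error shrinks with the square root of the sample size, a cohort of roughly three million records turns even modest gaps from the benchmark into very large |Z| values. For sufficiently extreme Z the double-precision p-value underflows to 0.0, which is one reason such results are usually reported as an upper bound rather than an exact figure.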

    Citation: Marco Roccetti. Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts[J]. AIMS Public Health, 2026, 13(2): 589-597. doi: 10.3934/publichealth.2026031






    Data availability statement



    The author commits to providing the calculation spreadsheets and statistical code used for the inferential analysis upon reasonable request.

    Conflict of interest



    The author declares no conflict of interest.

  • © 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)