Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts

Marco Roccetti

doi:10.3934/publichealth.2026031

AIMS Public Health

2026, Volume 13, Issue 2: 589-597. doi: 10.3934/publichealth.2026031

Research article

Department of Computer Science and Engineering, University of Bologna, Bologna, Italy

Received: 28 January 2026; Revised: 28 April 2026; Accepted: 06 May 2026; Published: 13 May 2026

Abstract

In global health, massive administrative datasets have become indispensable tools for surveillance. However, the sheer scale of Big Data can mask systemic selection biases that standard mathematical adjustments may not fully mitigate. In this study, I present a methodological audit of a recent large-scale cohort (N = 2,975,035) on COVID-19 vaccination and oncological outcomes. Benchmarking the cohort's composition against national demographic and epidemiological gold standards with single-proportion Z-tests revealed notable structural divergences. The first test yielded Z = −260.39 (p < 10⁻⁵⁰), suggesting structural under-sampling of the elderly population (a 32.2% deficit relative to the reference population). The second test identified a cancer incidence deficit in the non-vaccinated control group that is statistically inconsistent with the benchmark (Z = −15.23, p < 10⁻⁵⁰). These findings indicate that the reported statistical signals may arise as a computational consequence of structural selection bias: an artificially deflated baseline in the control group can inflate hazard ratios. Within a One Health approach, ensuring the structural integrity of data is crucial for effective prevention and control measures. I conclude that large-scale surveillance studies could be inferentially validated against demographic benchmarks so that public health conclusions rest on baseline equivalence, thereby safeguarding the reliability of global health monitoring.
Keywords: one health surveillance; global health; biostatistics; computational epidemiology; inferential statistics; public health data
Citation: Marco Roccetti. Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts. AIMS Public Health, 2026, 13(2): 589-597. doi: 10.3934/publichealth.2026031
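
The single-proportion Z-test named in the abstract can be illustrated with a minimal sketch. It is not the author's code, which is available only on request. The function name `one_proportion_z_test` and the example counts are hypothetical, and the benchmark proportion `p0` is assumed to be known without error. The statistic is Z = (p̂ − p0) / sqrt(p0(1 − p0)/n), with a two-sided p-value from the standard normal distribution.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Two-sided one-sample Z-test of an observed proportion against a benchmark.

    successes: number of cohort members with the attribute (e.g. aged 65+)
    n:         cohort size
    p0:        benchmark proportion, assumed known (e.g. from national statistics)
    Returns (z, p_value).
    """
    if n <= 0 or not 0.0 < p0 < 1.0:
        raise ValueError("n must be positive and p0 strictly between 0 and 1")
    p_hat = successes / n
    # Standard error under the null hypothesis that the true proportion is p0.
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    # For very large |z| this underflows to 0.0 in double precision.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical illustration only: these counts are NOT taken from the study.
    z, p = one_proportion_z_test(successes=1500, n=10000, p0=0.18)
    print(f"Z = {z:.2f}, p = {p:.3g}")
```

With cohorts in the millions, even small gaps between the observed and benchmark proportions give very large |Z| values, so the size of the deficit itself should be read alongside the test statistic.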

Data availability statement

The calculation spreadsheets and statistical code used for the inferential analysis are available from the author upon reasonable request.

Conflict of interest

The author declares no conflict of interest.

© 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)