In the context of Global Health, massive administrative datasets have become indispensable tools for health surveillance. However, the sheer scale of Big Data can mask systemic selection biases that standard mathematical adjustments may not fully mitigate. In this study, we propose a methodological audit of a recent large-scale cohort (N = 2,975,035) concerning COVID-19 vaccination and oncological outcomes. By benchmarking the cohort's architecture against national demographic and epidemiological gold standards through single-proportion Z-tests, we identified notable structural divergences. The first inferential test yielded a Z-score of −260.39 (p < 10⁻⁵⁰), suggesting a structural under-sampling of the elderly population (32.2% deficit) relative to the reference population. The second test identified a cancer incidence deficit in the non-vaccinated control group that is statistically inconsistent with the reference incidence (Z = −15.23, p < 10⁻⁵⁰). These findings indicate that the reported statistical signals may emerge as a computational consequence of structural selection bias, where an artificially deflated baseline in the control group potentially inflates Hazard Ratios. Within a One Health approach, ensuring the structural integrity of data is crucial for effective prevention and control measures. We conclude that large-scale surveillance studies could be inferentially validated against demographic benchmarks to ensure that public health conclusions are grounded in baseline equivalence, thereby safeguarding the reliability of global health monitoring.
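The single-proportion Z-test mentioned above compares an observed cohort proportion with a fixed reference proportion. The following is a minimal sketch of that general technique, not the author's actual code: the function name `one_proportion_z_test` and the demonstration inputs (120 of 1,000 against a reference of 0.15) are hypothetical and are not taken from the study.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Single-proportion Z-test of H0: true proportion == p0.

    Returns (z, two_sided_p). The standard error uses the reference
    proportion p0, as is usual for this test.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")

    p_hat = successes / n                    # observed proportion
    se = math.sqrt(p0 * (1.0 - p0) / n)      # standard error under H0
    z = (p_hat - p0) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical illustration only (NOT the study's data):
# 120 observed cases in a sample of 1,000, reference proportion 0.15.
z_demo, p_demo = one_proportion_z_test(120, 1000, 0.15)
print(f"z = {z_demo:.3f}, two-sided p = {p_demo:.4f}")
```

With a sample in the millions, even a modest gap between the observed and reference proportions produces a very large |Z|, because the standard error shrinks with the square root of n. For extreme Z values the p-value computed this way underflows to 0.0 in double precision, which is one reason such results are usually reported as a bound.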
Citation: Marco Roccetti. Enhancing public health surveillance: A statistical validation of potential sampling bias in large retrospective vaccine cohorts. AIMS Public Health, 2026, 13(2): 589-597. doi: 10.3934/publichealth.2026031