The adoption of sophisticated analytical tools, including machine learning and large-scale data processing, has accelerated health research. However, the rigor of these complex methods depends on the integrity and validity of the underlying statistical design. I posit that advanced analyses, particularly in epidemiology, must follow a rigorous verification of methodological coherence. In this study, I used an exploratory case to demonstrate a crucial cautionary principle: complex models amplify, rather than correct, substantial methodological limitations. To this end, I applied standard descriptive and inferential statistical methods (Z-tests, confidence intervals, and t-tests), together with established national epidemiological benchmarks, to a published cohort study on vaccine outcomes and psychiatric events. This approach identified multiple statistically significant inconsistencies within the source data, including implausible incidence rates and notable baseline imbalances between groups. These findings, supported by inferential statistical evidence, demonstrate that the observed effects (e.g., contradictory hazard ratios) are not biological but are mathematical artifacts stemming from uncorrected selection and classification biases in the cohort construction. These paradoxes arise from the exclusion of prevalent psychiatric cases from the vaccinated group and the misclassification of pre-existing conditions as new incident events in the control group. My analysis demonstrates that the validity of any conclusion drawn from advanced ML or statistical modeling of public health data rests on first passing the test of basic epidemiological consistency.
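As an illustration of the kind of consistency check the abstract describes, the following minimal sketch compares an observed cohort incidence against a national benchmark using a one-sample Z-test for a proportion and a Wald confidence interval. The function name `incidence_z_test`, its parameters, and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper or from the cohort study it examines, and the paper's exact test formulations may differ.

```python
from math import sqrt
from statistics import NormalDist


def incidence_z_test(events, n, benchmark_rate, alpha=0.05):
    """One-sample z-test of an observed incidence proportion against a benchmark.

    Hypothetical helper for illustration; not the author's implementation.
    """
    if n <= 0 or not 0 < benchmark_rate < 1:
        raise ValueError("n must be positive and benchmark_rate must lie in (0, 1)")

    observed_rate = events / n

    # Standard error under the null hypothesis (rate equals the benchmark)
    se_null = sqrt(benchmark_rate * (1 - benchmark_rate) / n)
    z = (observed_rate - benchmark_rate) / se_null

    # Two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Wald confidence interval around the observed rate, clipped to [0, 1]
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se_obs = sqrt(observed_rate * (1 - observed_rate) / n)
    ci = (
        max(0.0, observed_rate - z_crit * se_obs),
        min(1.0, observed_rate + z_crit * se_obs),
    )

    return {"rate": observed_rate, "z": z, "p_value": p_value, "ci": ci}


# Made-up example: 30 events among 100,000 people versus an assumed
# benchmark incidence of 50 per 100,000.
result = incidence_z_test(events=30, n=100_000, benchmark_rate=0.0005)
print(result)
```

If the benchmark falls outside the confidence interval and the p-value is small, the observed incidence is statistically inconsistent with the benchmark, which is the type of signal that would warrant examining cohort construction before fitting any more complex model. For small event counts, an exact binomial or Poisson test may be a better choice than the normal approximation used here.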
Citation: Marco Roccetti. Before the algorithm: An exemplar case of the necessity of statistical testing for epidemiological consistency in public health data[J]. AIMS Public Health, 2026, 13(1): 121-134. doi: 10.3934/publichealth.2026008