Reproducibility has become a fundamental concern in modern statistical practice, yet its quantitative assessment remains limited for commonly used dependence measures. This study introduces a systematic evaluation of the reproducibility probability (RP), defined as the probability that the same statistical decision would be reached if an experiment were independently replicated under identical conditions. RP was examined for three widely used correlation tests (Pearson, Spearman, and Kendall) across different types of relationships and sample conditions. Through Monte Carlo simulations, RP was shown to provide a meaningful quantitative measure of the stability of statistical decisions across repeated experiments. Results indicated that the underlying relationship between variables, sample size, and noise level all influenced reproducibility. In linear relationships, RP increased with both the strength of the true correlation and the sample size. For example, under strong linear dependence ($ \rho = 0.9 $), RP exceeded $ 0.95 $ for $ n = 40 $ and approached $ 1.00 $ for $ n = 80 $. For null or weak correlations ($ \rho = 0 $ or $ \rho = 0.3 $), the tests typically yielded non-significant $ p $-values, and the corresponding RP values were generally above $ 0.5 $, reflecting stable decisions in the non-rejection region. The Pearson test demonstrated slightly higher RP in small samples due to its sensitivity to linear dependence, whereas rank-based methods achieved comparable reproducibility as the sample size increased. In contrast, under nonlinear non-monotonic and piecewise-monotonic relationships, reproducibility depended on both sample size and noise intensity. For small samples, all tests displayed highly variable RP values, while for larger samples or higher noise levels, RP values converged across methods.
The results emphasized the role of RP as a reliable indicator of correlation test stability and revealed how underlying dependence patterns influenced the reproducibility of statistical results.
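The Monte Carlo evaluation described above can be sketched in a few lines: draw two independent replicates of the same experiment, apply the chosen correlation test to each at level $\alpha$, and record whether both replicates reach the same reject/non-reject decision. The sketch below is illustrative only, assuming bivariate normal data and standard SciPy tests; the function name `estimate_rp` and the specific parameter defaults are choices made here, not the paper's implementation.

```python
import numpy as np
from scipy import stats


def estimate_rp(rho, n, test="pearson", alpha=0.05, m=500, seed=0):
    """Monte Carlo estimate of the reproducibility probability (RP):
    the probability that two independent replicates of the same
    experiment reach the same reject/non-reject decision at level alpha.
    Data are drawn from a bivariate normal with correlation rho
    (an illustrative assumption, not the only possible model)."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    test_fn = {"pearson": stats.pearsonr,
               "spearman": stats.spearmanr,
               "kendall": stats.kendalltau}[test]
    same = 0
    for _ in range(m):
        decisions = []
        for _ in range(2):  # two independent replicates, identical conditions
            xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
            _, p = test_fn(xy[:, 0], xy[:, 1])
            decisions.append(p < alpha)
        same += decisions[0] == decisions[1]
    return same / m
```

Under strong linear dependence (e.g. `estimate_rp(0.9, 40)`), both replicates reject with near-certainty, so the estimated RP should be close to 1, consistent with the abstract's figures; under independence ($\rho = 0$), both replicates usually fail to reject, so RP again sits well above $0.5$.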
Citation: Norah D. Alshahrani. Statistical reproducibility of correlation tests: Pearson, Spearman, and Kendall[J]. AIMS Mathematics, 2026, 11(1): 957-976. doi: 10.3934/math.2026042