We propose a computationally efficient statistical test for detecting correlation between two Gaussian databases. The problem is formulated as a hypothesis test: under the null hypothesis, the databases are independent; under the alternative hypothesis, they are correlated but subject to an unknown row permutation. We derive bounds on both type Ⅰ and type Ⅱ error probabilities and demonstrate that the proposed test outperforms a recently introduced method across a broad range of parameter settings. The test statistic is based on a sum of dependent indicator random variables. To effectively bound the type Ⅰ error probability, we introduce a novel graph-theoretic technique for bounding the $ k $-th moments of such statistics.
Citation: Ran Tamir. Testing for correlation in Gaussian databases via local decision making[J]. AIMS Mathematics, 2025, 10(4): 7721-7766. doi: 10.3934/math.2025355
We propose a computationally efficient statistical test for detecting correlation between two Gaussian databases. The problem is formulated as a hypothesis test: under the null hypothesis, the databases are independent; under the alternative hypothesis, they are correlated but subject to an unknown row permutation. We derive bounds on both type Ⅰ and type Ⅱ error probabilities and demonstrate that the proposed test outperforms a recently introduced method across a broad range of parameter settings. The test statistic is based on a sum of dependent indicator random variables. To effectively bound the type Ⅰ error probability, we introduce a novel graph-theoretic technique for bounding the $ k $-th moments of such statistics.
| [1] | P. Ohm, Broken promises of privacy: Responding to the surprising failure of anonymization, UCLA Law Rev., 57 (2009), 1701. |
| [2] | J. Sedayao, R. Bhardwaj, N. Gorade, Making big data, privacy, and anonymization work together in the enterprise: Experiences and issues, In: 2014 IEEE International Congress on Big Data, USA: IEEE, 2014,601–607. https://doi.org/10.1109/BigData.Congress.2014.92 |
| [3] |
L. Sweeney, Weaving technology and policy together to maintain confidentiality, J. Law Med. Ethics, 25 (1997), 98–110. https://doi.org/10.1111/j.1748-720X.1997.tb01885.x doi: 10.1111/j.1748-720X.1997.tb01885.x
|
| [4] |
F. M. Naini, J. Unnikrishnan, P. Thiran, M. Vetterli, Where you are is who you are: User identification by matching statistics, IEEE T. Inf. Foren. Sec., 11 (2016), 358–372. https://doi.org/10.1109/TIFS.2015.2498131 doi: 10.1109/TIFS.2015.2498131
|
| [5] | A. Datta, D. Sharma, A. Sinha, Provable de-anonymization of large datasets with sparse dimensions, In: International Conference on Principles of Security and Trust, Berlin, Germany: Springer, 2012,229–248. https://doi.org/10.1007/978-3-642-28641-4_13 |
| [6] | A. Narayanan, V. Shmatikov, Robust de-anonymization of large sparse datasets, In: 2008 IEEE Symposium on Security and Privacy, USA: IEEE, 2008,111–125. https://doi.org/10.1109/SP.2008.33 |
| [7] |
N. Takbiri, A. Houmansadr, D. L. Goeckel, H. Pishro-Nik, Matching anonymized and obfuscated time series to users' profiles, IEEE T. Inform. Theory, 65 (2019), 724–741. https://doi.org/10.1109/TIT.2018.2873134 doi: 10.1109/TIT.2018.2873134
|
| [8] | G. Wondracek, T. Holz, E. Kirda, C. Kruegel, A practical attack to de-anonymize social network users, In: 2010 IEEE Symposium on Security and Privacy, USA: IEEE, 2010,223–238. https://doi.org/10.1109/SP.2010.21 |
| [9] | J. Su, A. Shukla, S. Goel, A. Narayanan, De-anonymizing web browsing data with social networks, In: Proc. 26th Int. Conf. World Wide Web, 2017, 1261–1269. https://doi.org/10.1145/3038912.3052714 |
| [10] | L. Bilge, T. Strufe, D. Balzarotti, E. Kirda, All your contacts are belong to us: Automated identity theft attacks on social networks, In: Proc. 18th Int. Conf. World wide web, 2009,551–560. https://doi.org/10.1145/1526709.1526784 |
| [11] | M. Srivatsa M. Hicks, Deanonymizing mobility traces: Using social network as a side-channel, In: Proc. ACM Conf. Comput. Commun. Secur., 2012,628–637. https://doi.org/10.1145/2382196.2382262 |
| [12] | Z. Cheng, J. Caverlee, K. Lee, You are where you tweet: A content based approach to geo-locating Twitter users, In: Proc. 19th ACM Int. Conf. Inf. Knowl. Manage., 2010,759–768. https://doi.org/10.1145/1871437.1871535 |
| [13] | Z. K, B. Nazer, Detecting correlated Gaussian databases, In: IEEE International Symposium on Information Theory, Finland: IEEE, 2022, 2064–2069. https://doi.org/10.1109/ISIT50566.2022.9834731 |
| [14] | D. Elimelech, W. Huleihel, Phase transitions in the detection of correlated databases, In: International Conference on Machine Learning (PMLR), 2023, 9246–9266. |
| [15] |
V. Paslev, W. Huleihel, Testing dependency of unlabeled databases, IEEE T. Inform. Theory, 70 (2024), 7410–7431. https://doi.org/10.1109/TIT.2024.3442977 doi: 10.1109/TIT.2024.3442977
|
| [16] | D. Cullina, P. Mittal, N. Kiyavash, Fundamental limits of database alignment, In: IEEE International Symposium on Information Theory, USA: IEEE, 2018,651–655. https://doi.org/10.1109/ISIT.2018.8437908 |
| [17] | O. E. Dai, D. Cullina, N. Kiyavash, Database alignment with Gaussian features, In: Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2019, 3225–3233. |
| [18] | O. E. Dai, D. Cullina, N. Kiyavash, Achievability of nearly-exact alignment for correlated Gaussian databases, In: IEEE International Symposium on Information Theory, USA: IEEE, 2020, 1230–1235. https://doi.org/10.1109/ISIT44484.2020.9174507 |
| [19] | F. Shirani, S. Garg, E. Erkip, A concentration of measure approach to database de-anonymization, In: IEEE International Symposium on Information Theory, France: IEEE, 2019, 2748–2752. https://doi.org/10.1109/ISIT.2019.8849392 |
| [20] | S. Bakırtaş, E. Erkip, Database matching under column deletions, In: IEEE International Symposium on Information Theory, Australia: IEEE, 2021, 2720–2725. https://doi.org/10.1109/ISIT45174.2021.9518145 |
| [21] | S. Bakırtaş, E. Erkip, Database matching under adversarial column deletions, In: IEEE Information Theory Workshop, France: IEEE, 2023,181–185. https://doi.org/10.1109/ITW55543.2023.10161615 |
| [22] | S. Bakırtaş, E. Erkip, Distribution-agnostic database de-anonymization under synchronization errors, In: IEEE International Workshop on Information Forensics and Security (WIFS), Germany: IEEE, 2023, 1–6. https://doi.org/10.1109/ITW55543.2023.10161615 |
| [23] |
S. Bakırtaş, E. Erkip, Database matching under noisy synchronization errors, IEEE T. Inform. Theory, 70 (2024), 4335–4367. https://doi.org/10.1109/TIT.2024.3388990 doi: 10.1109/TIT.2024.3388990
|
| [24] | M. Chertkov, L. Kroc, F. Krzakala, M. Vergassola, L. Zdeborová, Inference in particle tracking experiments by passing messages between images, In: Proceedings of the National Academy of Sciences, 107 (2010), 7663–7668. https://doi.org/10.1073/pnas.0910994107 |
| [25] | D. Kunisky, J. Niles-Weed, Strong recovery of geometric planted matchings, In: ACM-SIAM Symposium on Discrete Algorithms, 2022,834–876. https://doi.org/10.1137/1.9781611977073.36 |
| [26] | P. Pedarsani, M. Grossglauser, On the privacy of anonymized networks, In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, 1235–1243. https://doi.org/10.1145/2020408.2020596 |
| [27] |
J. Ding, Z. Ma, Y. Wu, J. Xu, Efficient random graph matching via degree profiles, Probab. Theory Rel., 179 (2021), 29–115. https://doi.org/10.1007/s00440-020-00997-4 doi: 10.1007/s00440-020-00997-4
|
| [28] |
Y. Wu, J. Xu, S. H. Yu, Testing correlation of unlabeled random graphs, Ann. Appl. Probab., 33 (2023), 2519–2558. https://doi.org/10.1214/22-AAP1786 doi: 10.1214/22-AAP1786
|
| [29] |
C. Mao, Y. Wu, J. Xu, S. H. Yu, Testing network correlation efficiently via counting trees, Ann. Stat., 52 (2024), 2483–2505. https://doi.org/10.1214/23-AOS2261 doi: 10.1214/23-AOS2261
|
| [30] |
Y. Wu, J. Xu, H. Y. Sophie, Settling the sharp reconstruction thresholds of random graph matching, IEEE T. Inform. Theory, 68 (2022), 5391–5417. https://doi.org/10.1109/TIT.2022.3169005 doi: 10.1109/TIT.2022.3169005
|
| [31] | L. Ganassali, Sharp threshold for alignment of graph databases with Gaussian weights, In: Mathematical and Scientific Machine Learning, 2 (2022), 314–335. |
| [32] | S. M. Kay, Fundamentals of statistical signal processing: Detection theory, University of Rhode Island: Prentice Hall PTR, 1998. |
| [33] | M. Basseville, I. V. Nikiforov, Detection of abrupt changes: theory and application, Prentice-Hall, 1993. |
| [34] |
H. W. Kuhn, The Hungarian Method for the assignment problem, Nav. Res. Logist. Quart., 2 (1955), 83–97. https://doi.org/10.1002/nav.3800020109 doi: 10.1002/nav.3800020109
|
| [35] |
H. W. Kuhn, Variants of the Hungarian method for assignment problems, Nav. Res. Logist. Quart., 3 (1956), 253–258. https://doi.org/10.1002/nav.3800030404 doi: 10.1002/nav.3800030404
|
| [36] |
J. Munkres, Algorithms for the assignment and transportation problems, J. Soc. Ind. Appl. Math., 5 (1957), 32–38. https://doi.org/10.1137/0105003 doi: 10.1137/0105003
|
| [37] |
J. Edmonds, R. M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, JACM, 19 (1972), 248–264. https://doi.org/10.1145/321694.321699 doi: 10.1145/321694.321699
|
| [38] |
S. Janson, New versions of Suen's correlation inequality, Random Struct. Algor., 13 (1998), 467–483. https://doi.org/10.1002/(SICI)1098-2418(199810/12)13:3/4<467::AID-RSA15>3.0.CO;2-W doi: 10.1002/(SICI)1098-2418(199810/12)13:3/4<467::AID-RSA15>3.0.CO;2-W
|
| [39] |
R. Tamir, N. Merhav, N. Weinberger, A. G. i Fàbregas, Large deviations behavior of the logarithmic error probability of random codes, IEEE T. Inform. Theory, 66 (2020), 6635–6659. https://doi.org/10.1109/TIT.2020.2995136 doi: 10.1109/TIT.2020.2995136
|
| [40] |
L. V. Truong, G. Cocco, J. Font-Segura, A. G. i Fàbregas, Concentration properties of random codes, IEEE T. Inform. Theory, 69 (2023), 7499–7537. https://doi.org/10.1109/TIT.2023.3312326 doi: 10.1109/TIT.2023.3312326
|
| [41] |
N. Merhav, Error exponents of typical random codes, IEEE T. Inform. Theory, 64 (2018), 6223–6235. https://doi.org/10.1109/TIT.2018.2834503 doi: 10.1109/TIT.2018.2834503
|
| [42] |
N. Merhav, Statistical physics and information theory, Found. Trends Commun., 6 (2009), 1–212. https://doi.org/10.1561/0100000052 doi: 10.1561/0100000052
|
| [43] |
C. Bunte, A. Lapidoth, On the listsize capacity with feedback, IEEE T. Inform. Theory, 60 (2014), 6733–6748. https://doi.org/10.1109/TIT.2014.2355815 doi: 10.1109/TIT.2014.2355815
|
| [44] |
S. Li, Concise formulas for the area and volume of a hyperspherical cap, Asian J. Math. Stat., 4 (2011), 66–70. https://doi.org/10.3923/ajms.2011.66.70 doi: 10.3923/ajms.2011.66.70
|
| [45] | J. Riordan, An introduction to combinatorial analysis, New York: Wiley, 1980. |