Testing for correlation in Gaussian databases via local decision making

Ran Tamir; Ran Tamir

doi:10.3934/math.2025355

AIMS Mathematics

2025, Volume 10, Issue 4: 7721-7766. doi: 10.3934/math.2025355

Previous Article Next Article

Research article Special Issues

Testing for correlation in Gaussian databases via local decision making

Ran Tamir ^,

Signal and Information Processing Laboratory, ETH Zurich, 8092 Zurich, Switzerland

Received: 25 June 2024 Revised: 18 February 2025 Accepted: 19 February 2025 Published: 01 April 2025
MSC : 60F10, 62F03, 62F05, 62H20, 62H22, 68P15, 68P27

We propose a computationally efficient statistical test for detecting correlation between two Gaussian databases. The problem is formulated as a hypothesis test: under the null hypothesis, the databases are independent; under the alternative hypothesis, they are correlated but subject to an unknown row permutation. We derive bounds on both type Ⅰ and type Ⅱ error probabilities and demonstrate that the proposed test outperforms a recently introduced method across a broad range of parameter settings. The test statistic is based on a sum of dependent indicator random variables. To effectively bound the type Ⅰ error probability, we introduce a novel graph-theoretic technique for bounding the $ k $-th moments of such statistics.
- correlation detection,
- data privacy,
- de-anonymization,
- Gaussian database,
- hypothesis testing
Citation: Ran Tamir. Testing for correlation in Gaussian databases via local decision making[J]. AIMS Mathematics, 2025, 10(4): 7721-7766. doi: 10.3934/math.2025355

Related Papers:

Abstract

We propose a computationally efficient statistical test for detecting correlation between two Gaussian databases. The problem is formulated as a hypothesis test: under the null hypothesis, the databases are independent; under the alternative hypothesis, they are correlated but subject to an unknown row permutation. We derive bounds on both type Ⅰ and type Ⅱ error probabilities and demonstrate that the proposed test outperforms a recently introduced method across a broad range of parameter settings. The test statistic is based on a sum of dependent indicator random variables. To effectively bound the type Ⅰ error probability, we introduce a novel graph-theoretic technique for bounding the $ k $-th moments of such statistics.

References

[1]	P. Ohm, Broken promises of privacy: Responding to the surprising failure of anonymization, UCLA Law Rev., 57 (2009), 1701.
[2]	J. Sedayao, R. Bhardwaj, N. Gorade, Making big data, privacy, and anonymization work together in the enterprise: Experiences and issues, In: 2014 IEEE International Congress on Big Data, USA: IEEE, 2014,601–607. https://doi.org/10.1109/BigData.Congress.2014.92
[3]	L. Sweeney, Weaving technology and policy together to maintain confidentiality, J. Law Med. Ethics, 25 (1997), 98–110. https://doi.org/10.1111/j.1748-720X.1997.tb01885.x doi: 10.1111/j.1748-720X.1997.tb01885.x
[4]	F. M. Naini, J. Unnikrishnan, P. Thiran, M. Vetterli, Where you are is who you are: User identification by matching statistics, IEEE T. Inf. Foren. Sec., 11 (2016), 358–372. https://doi.org/10.1109/TIFS.2015.2498131 doi: 10.1109/TIFS.2015.2498131
[5]	A. Datta, D. Sharma, A. Sinha, Provable de-anonymization of large datasets with sparse dimensions, In: International Conference on Principles of Security and Trust, Berlin, Germany: Springer, 2012,229–248. https://doi.org/10.1007/978-3-642-28641-4_13
[6]	A. Narayanan, V. Shmatikov, Robust de-anonymization of large sparse datasets, In: 2008 IEEE Symposium on Security and Privacy, USA: IEEE, 2008,111–125. https://doi.org/10.1109/SP.2008.33
[7]	N. Takbiri, A. Houmansadr, D. L. Goeckel, H. Pishro-Nik, Matching anonymized and obfuscated time series to users' profiles, IEEE T. Inform. Theory, 65 (2019), 724–741. https://doi.org/10.1109/TIT.2018.2873134 doi: 10.1109/TIT.2018.2873134
[8]	G. Wondracek, T. Holz, E. Kirda, C. Kruegel, A practical attack to de-anonymize social network users, In: 2010 IEEE Symposium on Security and Privacy, USA: IEEE, 2010,223–238. https://doi.org/10.1109/SP.2010.21
[9]	J. Su, A. Shukla, S. Goel, A. Narayanan, De-anonymizing web browsing data with social networks, In: Proc. 26th Int. Conf. World Wide Web, 2017, 1261–1269. https://doi.org/10.1145/3038912.3052714
[10]	L. Bilge, T. Strufe, D. Balzarotti, E. Kirda, All your contacts are belong to us: Automated identity theft attacks on social networks, In: Proc. 18th Int. Conf. World wide web, 2009,551–560. https://doi.org/10.1145/1526709.1526784
[11]	M. Srivatsa M. Hicks, Deanonymizing mobility traces: Using social network as a side-channel, In: Proc. ACM Conf. Comput. Commun. Secur., 2012,628–637. https://doi.org/10.1145/2382196.2382262
[12]	Z. Cheng, J. Caverlee, K. Lee, You are where you tweet: A content based approach to geo-locating Twitter users, In: Proc. 19th ACM Int. Conf. Inf. Knowl. Manage., 2010,759–768. https://doi.org/10.1145/1871437.1871535
[13]	Z. K, B. Nazer, Detecting correlated Gaussian databases, In: IEEE International Symposium on Information Theory, Finland: IEEE, 2022, 2064–2069. https://doi.org/10.1109/ISIT50566.2022.9834731
[14]	D. Elimelech, W. Huleihel, Phase transitions in the detection of correlated databases, In: International Conference on Machine Learning (PMLR), 2023, 9246–9266.
[15]	V. Paslev, W. Huleihel, Testing dependency of unlabeled databases, IEEE T. Inform. Theory, 70 (2024), 7410–7431. https://doi.org/10.1109/TIT.2024.3442977 doi: 10.1109/TIT.2024.3442977
[16]	D. Cullina, P. Mittal, N. Kiyavash, Fundamental limits of database alignment, In: IEEE International Symposium on Information Theory, USA: IEEE, 2018,651–655. https://doi.org/10.1109/ISIT.2018.8437908
[17]	O. E. Dai, D. Cullina, N. Kiyavash, Database alignment with Gaussian features, In: Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 2019, 3225–3233.
[18]	O. E. Dai, D. Cullina, N. Kiyavash, Achievability of nearly-exact alignment for correlated Gaussian databases, In: IEEE International Symposium on Information Theory, USA: IEEE, 2020, 1230–1235. https://doi.org/10.1109/ISIT44484.2020.9174507
[19]	F. Shirani, S. Garg, E. Erkip, A concentration of measure approach to database de-anonymization, In: IEEE International Symposium on Information Theory, France: IEEE, 2019, 2748–2752. https://doi.org/10.1109/ISIT.2019.8849392
[20]	S. Bakırtaş, E. Erkip, Database matching under column deletions, In: IEEE International Symposium on Information Theory, Australia: IEEE, 2021, 2720–2725. https://doi.org/10.1109/ISIT45174.2021.9518145
[21]	S. Bakırtaş, E. Erkip, Database matching under adversarial column deletions, In: IEEE Information Theory Workshop, France: IEEE, 2023,181–185. https://doi.org/10.1109/ITW55543.2023.10161615
[22]	S. Bakırtaş, E. Erkip, Distribution-agnostic database de-anonymization under synchronization errors, In: IEEE International Workshop on Information Forensics and Security (WIFS), Germany: IEEE, 2023, 1–6. https://doi.org/10.1109/ITW55543.2023.10161615
[23]	S. Bakırtaş, E. Erkip, Database matching under noisy synchronization errors, IEEE T. Inform. Theory, 70 (2024), 4335–4367. https://doi.org/10.1109/TIT.2024.3388990 doi: 10.1109/TIT.2024.3388990
[24]	M. Chertkov, L. Kroc, F. Krzakala, M. Vergassola, L. Zdeborová, Inference in particle tracking experiments by passing messages between images, In: Proceedings of the National Academy of Sciences, 107 (2010), 7663–7668. https://doi.org/10.1073/pnas.0910994107
[25]	D. Kunisky, J. Niles-Weed, Strong recovery of geometric planted matchings, In: ACM-SIAM Symposium on Discrete Algorithms, 2022,834–876. https://doi.org/10.1137/1.9781611977073.36
[26]	P. Pedarsani, M. Grossglauser, On the privacy of anonymized networks, In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, 1235–1243. https://doi.org/10.1145/2020408.2020596
[27]	J. Ding, Z. Ma, Y. Wu, J. Xu, Efficient random graph matching via degree profiles, Probab. Theory Rel., 179 (2021), 29–115. https://doi.org/10.1007/s00440-020-00997-4 doi: 10.1007/s00440-020-00997-4
[28]	Y. Wu, J. Xu, S. H. Yu, Testing correlation of unlabeled random graphs, Ann. Appl. Probab., 33 (2023), 2519–2558. https://doi.org/10.1214/22-AAP1786 doi: 10.1214/22-AAP1786
[29]	C. Mao, Y. Wu, J. Xu, S. H. Yu, Testing network correlation efficiently via counting trees, Ann. Stat., 52 (2024), 2483–2505. https://doi.org/10.1214/23-AOS2261 doi: 10.1214/23-AOS2261
[30]	Y. Wu, J. Xu, H. Y. Sophie, Settling the sharp reconstruction thresholds of random graph matching, IEEE T. Inform. Theory, 68 (2022), 5391–5417. https://doi.org/10.1109/TIT.2022.3169005 doi: 10.1109/TIT.2022.3169005
[31]	L. Ganassali, Sharp threshold for alignment of graph databases with Gaussian weights, In: Mathematical and Scientific Machine Learning, 2 (2022), 314–335.
[32]	S. M. Kay, Fundamentals of statistical signal processing: Detection theory, University of Rhode Island: Prentice Hall PTR, 1998.
[33]	M. Basseville, I. V. Nikiforov, Detection of abrupt changes: theory and application, Prentice-Hall, 1993.
[34]	H. W. Kuhn, The Hungarian Method for the assignment problem, Nav. Res. Logist. Quart., 2 (1955), 83–97. https://doi.org/10.1002/nav.3800020109 doi: 10.1002/nav.3800020109
[35]	H. W. Kuhn, Variants of the Hungarian method for assignment problems, Nav. Res. Logist. Quart., 3 (1956), 253–258. https://doi.org/10.1002/nav.3800030404 doi: 10.1002/nav.3800030404
[36]	J. Munkres, Algorithms for the assignment and transportation problems, J. Soc. Ind. Appl. Math., 5 (1957), 32–38. https://doi.org/10.1137/0105003 doi: 10.1137/0105003
[37]	J. Edmonds, R. M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, JACM, 19 (1972), 248–264. https://doi.org/10.1145/321694.321699 doi: 10.1145/321694.321699
[38]	S. Janson, New versions of Suen's correlation inequality, Random Struct. Algor., 13 (1998), 467–483. https://doi.org/10.1002/(SICI)1098-2418(199810/12)13:3/4<467::AID-RSA15>3.0.CO;2-W doi: 10.1002/(SICI)1098-2418(199810/12)13:3/4<467::AID-RSA15>3.0.CO;2-W
[39]	R. Tamir, N. Merhav, N. Weinberger, A. G. i Fàbregas, Large deviations behavior of the logarithmic error probability of random codes, IEEE T. Inform. Theory, 66 (2020), 6635–6659. https://doi.org/10.1109/TIT.2020.2995136 doi: 10.1109/TIT.2020.2995136
[40]	L. V. Truong, G. Cocco, J. Font-Segura, A. G. i Fàbregas, Concentration properties of random codes, IEEE T. Inform. Theory, 69 (2023), 7499–7537. https://doi.org/10.1109/TIT.2023.3312326 doi: 10.1109/TIT.2023.3312326
[41]	N. Merhav, Error exponents of typical random codes, IEEE T. Inform. Theory, 64 (2018), 6223–6235. https://doi.org/10.1109/TIT.2018.2834503 doi: 10.1109/TIT.2018.2834503
[42]	N. Merhav, Statistical physics and information theory, Found. Trends Commun., 6 (2009), 1–212. https://doi.org/10.1561/0100000052 doi: 10.1561/0100000052
[43]	C. Bunte, A. Lapidoth, On the listsize capacity with feedback, IEEE T. Inform. Theory, 60 (2014), 6733–6748. https://doi.org/10.1109/TIT.2014.2355815 doi: 10.1109/TIT.2014.2355815
[44]	S. Li, Concise formulas for the area and volume of a hyperspherical cap, Asian J. Math. Stat., 4 (2011), 66–70. https://doi.org/10.3923/ajms.2011.66.70 doi: 10.3923/ajms.2011.66.70
[45]	J. Riordan, An introduction to combinatorial analysis, New York: Wiley, 1980.

Reader Comments

Your name:*

Email:*
© 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)