### Applied Computing and Intelligence

2023, Volume 3, Issue 1: 93-115. doi: 10.3934/aci.2023006
Research article

# All-pairwise squared distances lead to more balanced clustering

• Received: 09 December 2022; Revised: 14 March 2023; Accepted: 19 April 2023; Published: 15 May 2023
• In clustering, the cost function that is commonly used involves calculating all-pairwise squared distances. In this paper, we formulate the cost function using mean squared error and show that this leads to more balanced clustering compared to centroid-based distance functions, like the sum of squared distances in $k$-means. The clustering method has been formulated as a cut-based approach, more intuitively called Squared cut (Scut). We introduce an algorithm for the problem which is faster than the existing one based on the Stirling approximation. Our algorithm is a sequential variant of a local search algorithm. We show by experiments that the proposed approach provides better overall optimization of both mean squared error and cluster balance compared to existing methods.
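
The balancing effect described in the abstract can be illustrated with a standard identity: within one cluster of $n$ points, the sum of squared distances over all pairs equals $n$ times the sum of squared distances to the centroid. An all-pairwise cost therefore weights each cluster's squared error by its size, which penalizes large clusters. The sketch below is not the authors' Scut algorithm; it only compares the two cost functions on a toy 1-D dataset, and all function names and data values are illustrative assumptions.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sse(points):
    """Sum of squared distances to the centroid (k-means style cost of one cluster)."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def pairwise_cost(points):
    """Sum of squared distances over all unordered pairs within one cluster."""
    return sum(sq_dist(a, b) for a, b in combinations(points, 2))


def kmeans_cost(clusters):
    """Centroid-based cost of a partition: sum of per-cluster SSE."""
    return sum(sse(c) for c in clusters)


def all_pairwise_cost(clusters):
    """All-pairwise cost of a partition: sum of per-cluster pairwise costs."""
    return sum(pairwise_cost(c) for c in clusters)


# Toy data (assumed for illustration): five points on a line.
unbalanced = [[(0.0,), (1.0,), (2.0,), (3.0,)], [(6.0,)]]  # cluster sizes 4 and 1
balanced = [[(0.0,), (1.0,), (2.0,)], [(3.0,), (6.0,)]]    # cluster sizes 3 and 2

if __name__ == "__main__":
    for name, part in (("unbalanced", unbalanced), ("balanced", balanced)):
        print(name, kmeans_cost(part), all_pairwise_cost(part))
```

On this toy data the centroid-based cost is lower for the 4/1 split (5.0 versus 6.5), while the all-pairwise cost is lower for the 3/2 split (15.0 versus 20.0), so the two criteria prefer different partitions of the same points.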

Citation: Mikko I. Malinen, Pasi Fränti. All-pairwise squared distances lead to more balanced clustering. Applied Computing and Intelligence, 2023, 3(1): 93-115. doi: 10.3934/aci.2023006



## Figures and Tables

Figures(13)  /  Tables(5)
