The $ k $-means algorithm minimizes the variance within clusters without considering the balance of cluster sizes. Balanced $ k $-means formulates the partition step as a pairing problem that enforces strictly balanced cluster sizes, but the resulting algorithm is impractically slow, with $ \mathcal{O}(n^3) $ time complexity. Regularized $ k $-means addresses the problem with a regularization term that includes a balance parameter. It works reasonably well when balanced cluster sizes are a mandatory requirement but does not generalize well to soft balance requirements. In this paper, we revisit $ k $-means as a two-objective optimization problem with conflicting goals: minimizing the variance within clusters and minimizing the difference in cluster sizes. The proposed algorithm is a balance-driven variant of $ k $-means that initially focuses only on minimizing the variance but adds more weight to the balance constraint in each iteration. The resulting degree of balance is not determined by a control parameter that has to be tuned, but by the point of termination, which can be specified precisely by a balance criterion.
Citation: Rieke de Maeyer, Sami Sieranoja, Pasi Fränti, Balanced k-means revisited, Applied Computing and Intelligence, 3 (2023), 145-179. https://doi.org/10.3934/aci.2023008
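The iterative idea described in the abstract can be sketched in a few lines of code. The sketch below is only an illustration under assumptions, not the authors' exact formulation: it adds a cluster-size penalty `weight * sizes` to the squared-distance assignment cost, grows the weight by a fixed increment each iteration, assigns points greedily in random order, and stops once the largest and smallest clusters differ by at most `max_diff`. The penalty form, weight schedule, and parameter names are illustrative choices.

```python
import numpy as np

def balance_driven_kmeans(X, k, max_diff=0, max_iter=100, seed=None):
    """Sketch of a balance-driven k-means: the assignment cost is the squared
    distance plus a size penalty whose weight grows every iteration; the loop
    stops once the cluster sizes differ by at most max_diff."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    centers = X[rng.choice(n, size=k, replace=False)]   # random initial centroids
    labels = np.zeros(n, dtype=int)
    scale = X.var(axis=0).sum()       # keeps the size penalty on the distance scale
    weight = 0.0                      # zero weight: the first pass is plain k-means
    for _ in range(max_iter):
        sizes = np.zeros(k, dtype=int)
        for i in rng.permutation(n):  # greedy, size-aware assignment in random order
            d2 = ((X[i] - centers) ** 2).sum(axis=1)
            labels[i] = int(np.argmin(d2 + weight * sizes))
            sizes[labels[i]] += 1
        for j in range(k):            # standard centroid update
            if sizes[j] > 0:
                centers[j] = X[labels == j].mean(axis=0)
        if sizes.max() - sizes.min() <= max_diff:
            break                     # balance criterion reached: terminate here
        weight += scale / n           # assumed schedule: add balance pressure gradually
    return labels, centers
```

In this sketch, a tight `max_diff` keeps increasing the balance weight until the cluster sizes are nearly equal, whereas a looser tolerance terminates earlier and favors lower within-cluster variance.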