Subspace and metric space learning for coqualitative data clustering

Duanjiao Li; Yun Chen; Wenxing Sun; Yuhui Chen; Junwen Yao; Hua Ye; Jianguo Zhang

doi:10.3934/era.2026188

Electronic Research Archive

2026, Volume 34, Issue 6: 4191-4215. doi: 10.3934/era.2026188


Research article


1. Guangdong Power Grid Co., Ltd, Guangzhou, China
2. Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China

Received: 20 February 2026; Revised: 02 April 2026; Accepted: 29 April 2026; Published: 19 May 2026

Cluster analysis of unlabeled categorical data is crucial in a wide range of practical applications, such as medical diagnosis, financial risk assessment, and recommendation systems. Unlike numerical data residing in explicit Euclidean spaces, categorical data consists of qualitative values without inherent ordering, making the definition of object similarity a critical yet challenging determinant of clustering success. Conventional approaches typically rely on single, predefined metrics (e.g., Hamming distance or context-based measures). However, these metrics are often constructed based on limited prior knowledge or specific statistical assumptions, failing to capture the complex, intrinsic structures of diverse datasets. Consequently, the mismatch between the defined metric space and the "true" data structure significantly hinders the performance of downstream clustering tasks. To address these limitations, this paper proposes a novel subspace and metric space co-learning framework named SBMS. Instead of relying on a static measure, SBMS introduces an adaptive learning paradigm that iteratively optimizes two coupled spaces: a metric space, where multiple complementary distance metrics are fused to provide a comprehensive similarity measure; and an attribute subspace, where attribute weights are dynamically adjusted based on cluster discrimination and compactness to identify the most relevant features for each cluster. Furthermore, we provide a theoretical analysis of the proposed method, discussing its computational complexity and demonstrating the convergence properties of the optimization algorithm. Extensive experiments on real-world public datasets from various domains illustrate that SBMS effectively bridges the gap between defined and true metric spaces, yielding superior clustering accuracy and stability compared to state-of-the-art baselines.
Keywords: categorical data clustering, metric learning, subspace learning, multiobjective optimization, unsupervised learning
Citation: Duanjiao Li, Yun Chen, Wenxing Sun, Yuhui Chen, Junwen Yao, Hua Ye, Jianguo Zhang. Subspace and metric space learning for coqualitative data clustering[J]. Electronic Research Archive, 2026, 34(6): 4191-4215. doi: 10.3934/era.2026188
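
The abstract describes fusing several categorical distance metrics while weighting attributes. The following is a minimal sketch of that general idea only, not the SBMS algorithm itself: the function names, the frequency-based second metric, and the fixed weights are all illustrative assumptions that do not come from the paper. In SBMS both sets of weights are learned iteratively; here they are simply supplied by hand.

```python
from collections import Counter


def value_frequencies(data):
    """Relative frequency of each value, computed per attribute (column)."""
    n = len(data)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*data)]


def hamming_terms(x, y, freqs):
    """Per-attribute Hamming dissimilarity: 1 where values differ, else 0."""
    return [0.0 if a == b else 1.0 for a, b in zip(x, y)]


def frequency_terms(x, y, freqs):
    """Hypothetical frequency-based stand-in for a second metric (assumption):
    a mismatch costs the mean relative frequency of the two values."""
    return [0.0 if a == b else 0.5 * (f[a] + f[b])
            for a, b, f in zip(x, y, freqs)]


def fused_distance(x, y, freqs, attr_weights, metric_weights,
                   metrics=(hamming_terms, frequency_terms)):
    """Weighted sum over metrics of attribute-weighted per-attribute terms."""
    total = 0.0
    for beta, metric in zip(metric_weights, metrics):
        terms = metric(x, y, freqs)
        total += beta * sum(w * t for w, t in zip(attr_weights, terms))
    return total


# Toy data set with two categorical attributes (made up for illustration).
data = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "y")]
freqs = value_frequencies(data)

# Equal attribute weights and equal metric weights, both fixed by hand here.
d = fused_distance(data[0], data[2], freqs, [0.5, 0.5], [0.5, 0.5])
print(d)  # 0.75
```

Setting an attribute weight to zero removes that attribute from every metric at once, which is the sense in which an attribute subspace and a metric space can be adjusted separately in a sketch like this one.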




© 2026 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)