1. Introduction
In pattern recognition, cluster analysis is the formal study of methods and algorithms for dividing data into meaningful and useful clusters (groups) based on some intrinsic similarity measure [1]. It is a crucial intermediate step in exploring the potential structure of massive data for subsequent analysis. In the field of internet applications, clustering is used to identify users with different interest characteristics, thus realizing personalized service recommendation [2,3]. In the field of biology, clustering can be applied to discover patterns in genes, helping to identify subtypes of certain diseases (e.g., cancers) [4,5,6].
Nonnegative matrix factorization (NMF) models (algorithms) have been widely studied as one of the important tools for achieving clustering. The NMF model finds wide applications in various domains, owing to its parts-based representation and physical interpretability. In [7,8], NMF was used to decompose the original nonnegative high-dimensional matrix X into the product of two low-dimensional nonnegative matrices, WH, ensuring that X and WH are sufficiently close under some metric. Usually, W is referred to as the basis matrix and H is referred to as the coefficient matrix. Each element of the matrices W and H obtained above is nonnegative, and their common rank is much lower than that of X itself, so this decomposition is called a dimension-reducing decomposition. This form of decomposition effectively saves storage and computational space, improving computational efficiency. In clustering analysis, the coefficient matrix H is used as a substitute for the high-dimensional original matrix X in clustering tasks. The matrix H is of much lower dimension than the original matrix X, which saves a great deal of computation and space.
Lee and Seung [9] extended NMF by introducing a simple algorithm based on multiplicative update rules (MUR) and demonstrating its effectiveness in learning parts of objects and discovering semantic features in text. Additionally, NMF has been successfully extended and applied in various fields, including clustering and classification [10,11,12], image analysis [13,14,15], spectral analysis [16,17], and blind source separation [18,19]. Moreover, NMF has brought a series of new theories and algorithms to clustering problems. Numerous studies have shown that by introducing different constraints, such as sparsity [12,20,21], smoothness [16,22,23], and orthogonality [24,25,26], the performance of standard NMF can usually be further improved. This is because these constraints introduce other useful features beyond nonnegativity into the standard NMF model. In fact, many datasets not only do not follow linear distributions, but also lie in certain nonlinear spaces, such as manifolds. This requires the introduction of nonlinear strategies for effective processing, such as constructing a Laplacian matrix [27,28] to preserve the nonlinear geometric structure between data points. This approach is known as manifold learning, which assumes that high-dimensional data has a nonlinear mapping in a low-dimensional space.
Cai et al. effectively combined manifold learning with NMF by introducing a graph regularization term, yielding the graph regularized nonnegative matrix factorization (GNMF) [29] method. This algorithm aims to establish smooth connections between data representations, encode geometric information between data in linear and nonlinear spaces, and find the representation of the data on manifolds. The advantage of this method is that it not only preserves the inherent geometric structure, but also obtains a compact representation of hidden semantic information. Based on this, Li et al. [30] presented a graph-based local concept coordinate factorization (GLCF) method, which respects the intrinsic structure of the data through manifold kernel learning in the warped reproducing kernel Hilbert space. Hu et al. [31] combined the GNMF algorithm with the convex nonnegative matrix factorization (CNMF) algorithm and proposed the graph convex nonnegative matrix factorization (GCNMF) algorithm with manifold regularization terms; Meng et al. [32] introduced graph regularization terms in the sample space and feature space, and proposed the dual graph sparse nonnegative matrix factorization (DSNMF) algorithm. In addition, algorithms such as graph regularized discriminative nonnegative matrix factorization (GDNMF) [33], dual regularization nonnegative matrix factorization (DNMF) [34], incomplete multi-view clustering (IMVC) [35], collaborative topological graph learning (CTGL) [36], multi-view deep nonnegative matrix factorization with multi-kernel learning (MvMKDNMF) [37], nonnegative matrix factorization with graph regularization for multi-view data representation (GMultiNMF) [38], and pairwise constrained graph regularized convex nonnegative matrix factorization (PGCNMF) [39] applied constraints to the data through manifold learning, constructed different Laplacian regularization terms, and provided many basic methods for data analysis based on the idea of manifold learning.
Recently, orthogonal nonnegative matrix factorization (ONMF) [24] has received widespread attention; by imposing orthogonality constraints on NMF, it becomes more suitable for clustering tasks, as the constrained factor matrix can be viewed as an indicator matrix. This is because a parts-based representation requires the basis vectors of the factor matrices to be different from each other, which means that parts-based representation is closely related to orthogonality.
The orthogonality of a given matrix A refers to A^TA = I or AA^T = I (I is an identity matrix of appropriate size), which means that any two different columns of A are orthogonal or any two different rows of A are orthogonal.
In [20], Li et al. found that for some datasets, parts-based representation involves clustering the rows of the input data matrix, and the clustering result is described by the basis matrix W. The orthogonality constraint on W is column orthogonality, W^TW = I. In document clustering [40,41], clustering is performed on the columns of the input data matrix and is represented by the coefficient matrix H, and orthogonality among the rows of H can improve clustering performance. The orthogonality constraint is then imposed on the coefficient matrix H, which satisfies HH^T = I. The orthogonality models of the above two categories have greatly improved clustering performance [42,43,44]. Due to the excellent performance of ONMF in clustering, many update rules and variants of ONMF have been proposed [45,46,47].
In fact, ONMF was initially proposed by Ding [24], and empirical evidence demonstrated that introducing a certain degree of orthogonality constraint into the algorithm can enhance sparsity, reduce redundancy in feature vectors, and ensure the uniqueness of the solution and interpretability of clustering results. In their method, orthogonality is achieved by solving an optimization problem with orthogonality constraints.
There, ONMF was considered as a continuous relaxation (without discrete constraints) of the K-means method. Research on various data mining tasks has shown that, compared to clustering methods based on K-means and NMF, the ONMF model performs better. However, achieving the orthogonality of the basis vectors requires intensive computation, especially for word clustering in document matrices, which results in high computational costs. Later, in 2010, Li et al. proposed nonnegative matrix factorization on orthogonal subspace (NMFOS) [48], introducing the concept of penalty functions to incorporate orthogonality constraints into the objective function. This orthogonality, enforced by adding a penalty term to the objective function, is designed to reduce the computational burden and, to some extent, to improve the sparsity of the decomposition matrices (the W matrix or the H matrix). However, this model has two issues. First, when applied to certain document datasets, the algorithm may exhibit the zero-locking phenomenon; second, the convergence proof proposed in it is not rigorous enough (due to errors in the matrix differentiation), that is, it fails to effectively prove that the iterative algorithm does indeed converge.
Overall, ONMF models have shown good performance, but still face the following two challenges: significant computational costs when orthogonality is imposed as a constraint, and difficulty, or absence, of convergence proofs when the orthogonality condition W^TW = I or HH^T = I is introduced as a penalty term. Based on the above observations, this paper studies a new ONMF-based model, graph-regularized nonnegative matrix factorization on orthogonal subspace with an auxiliary variable (GNMFOSV). By adding an orthogonal penalty term and a graph Laplacian term, we achieve the orthogonality of the factor matrix through factorization, enhancing the sparsity of the coefficient matrix to some extent. Leveraging the geometric structure of the data, we enhance the algorithm's capability to recover the underlying nonlinear relationships in the data. More importantly, we introduce an intermediate variable V that is updated alternately in the auxiliary function, and prove the convergence of our algorithm using the idea of alternating updates of three variables. Meanwhile, experiments on a large number of datasets also show that this algorithm greatly improves clustering performance and has a certain degree of robustness to the dataset. In other words, experimental results on many datasets show that our algorithm maintains overall optimality.
The main contributions of this paper are as follows:
∙ The ONMF method obtains sparser results through orthogonal constraints. However, in some complex practical application scenarios, it is still difficult to interpret the physical meanings or practical implications of the basis vectors and coefficient matrices obtained after decomposition, which is not conducive to a deep understanding of the internal structure and features of the data. At the same time, it is also very difficult to prove the convergence of the algorithm. The GNMF method preserves the inherent geometric structure of the data and also acquires a compact representation of the hidden semantic information. Nevertheless, this method overemphasizes local information and relatively ignores the global structure and dependency relationships of the data. In this paper, in order to improve the interpretability of ONMF and the theoretical convergence of the algorithm, and also to retain the global structure and dependency relationships of the data in the GNMF model, the two methods are integrated, and an auxiliary function is ingeniously added, resulting in a rigorous convergence proof. The ablation experiments also fully demonstrate that integrating the graph regularization method, orthogonal constraints, and the auxiliary variable mechanism can significantly improve the clustering accuracy and stability, outperforming single methods on multiple datasets, which fully verifies the rationality and innovation of the model design.
∙ In our algorithm, different levels of orthogonality can be achieved by adjusting the parameters α1 and α2, and various regularization strengths can be obtained by tuning the parameter λ, making the algorithm more flexible and applicable to a broader range of scenarios.
∙ Extensive experiments were conducted on 13 datasets, demonstrating the algorithm's effectiveness in terms of clustering performance. The results not only highlight its efficiency in learning and enhanced discriminative power but also indicate its ability to acquire improved partial linear representations.
The remaining content of this paper is as follows: Section 2 reviews related work, including NMF, NMFOS, and GNMF. Section 3 provides a detailed introduction to our method, including the derivation of update rules and the corresponding algorithm, along with a proof of the algorithm's convergence. Section 4 presents experiments and analysis. Section 5 offers conclusions and future directions.
2. Symbols and classic models
In this section, some symbols and notations used throughout the paper are provided in Table 1 below, followed by the classic models of NMF, NMFOS, and GNMF and their multiplicative update algorithms.
2.1. NMF
NMF [9] is a classic matrix factorization algorithm that requires all elements of the factor matrices to be nonnegative. Consider the target matrix X = [x_1, \cdots, x_N]\in R_{+}^{M\times N} and treat each x_j of X as a column vector. Nonnegative matrix factorization of X refers to finding two matrices W and H (W = [w_{ik}]\in R_{+}^{M\times K}, H = [h_{kj}]\in R_{+}^{K\times N}, i = 1,2,\cdots,M, j = 1,2,\cdots,N, and k = 1,2,\cdots,K) such that the matrix product WH is as close as possible to X under a certain metric:
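(in its standard form)
X \approx WH. \quad (2.1)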
Generally speaking, W is called the basis matrix and H is called the coefficient matrix. This is because (2.1) tells us that each column vector xj of X can be linearly represented using the column vectors of W in this factorization, where the coefficients come from matrix H. The specific representation is as follows:
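(with h_j denoting the j-th column of H)
x_j \approx \sum\limits_{k = 1}^{K}w_kh_{kj} = Wh_j, \quad j = 1, 2, \cdots, N, \quad (2.2)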
where w_k represents the column vectors of W, and K in (2.2) is much smaller than M and N, or satisfies (M+N)K<MN. It is precisely because of (2.2) that the factor matrix W is called the basis matrix and the matrix H is called the coefficient matrix. Afterward, the matrix H will be used in clustering analysis applications, similarly to the indicator matrix in K-means. Meanwhile, the fact that K is much smaller than M and N makes this decomposition a dimensionality reduction, so clustering with H (rather than X) greatly saves time and space. The mathematical model of NMF is
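(in the standard Frobenius-norm form)
\min\limits_{W\geq 0,\ H\geq 0}\ \|X-WH\|_F^2, \quad (2.3)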
where W = [w_1, \cdots, w_M]\in R_{+}^{M\times K}, H = [h_1, \cdots, h_N]\in R_{+}^{K\times N}. In addition, we will import an iterative multiplicative update algorithm from Lee and Seung [50].
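(these are the well-known multiplicative update rules for (2.3))
W_{ik}\leftarrow W_{ik}\frac{(XH^T)_{ik}}{(WHH^T)_{ik}}, \qquad H_{kj}\leftarrow H_{kj}\frac{(W^TX)_{kj}}{(W^TWH)_{kj}}.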
2.2. NMFOS
Ding et al. [24] extended the traditional NMF by incorporating an orthogonality constraint on the basis or coefficient matrix into the objective function's constraints, and this model is the important ONMF model. More importantly, the equivalence between ONMF and kernel K-means methods was also proved. Li et al. [48] found that the Lagrangian multiplier used in the nonlinear constrained optimization of ONMF is a symmetric matrix that needs to be updated at each iteration, resulting in a large computational workload. To overcome this limitation, they proposed a novel approach called NMFOS (using penalty terms). An improved optimization model was then obtained in the following form:
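(up to the scaling convention on the penalty, which can be absorbed into \alpha)
\min\limits_{W\geq 0,\ H\geq 0}\ \|X-WH\|_F^2+\frac{\alpha}{2}\|HH^T-I\|_F^2, \quad (2.4)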
where X = [x_1, \cdots, x_N]\in R_{+}^{M\times N}, W = [w_1, \cdots, w_M]\in R_{+}^{M\times K}, H = [h_1, \cdots, h_N]\in R_{+}^{K\times N}, and I is the identity matrix with K rows and K columns. The nonnegative parameter \alpha\geq 0 controls the degree of orthogonality of the vectors in H. When \alpha = 0, NMFOS reduces to NMF. Through model (2.4), the following update rules were derived:
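(one standard derivation gives the following form, with constants absorbed into \alpha)
W_{ik}\leftarrow W_{ik}\frac{(XH^T)_{ik}}{(WHH^T)_{ik}}, \qquad H_{kj}\leftarrow H_{kj}\frac{(W^TX+\alpha H)_{kj}}{(W^TWH+\alpha HH^TH)_{kj}}.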
The advantage of this algorithm is that by adjusting the penalty parameter α, orthogonality can be achieved during the decomposition process without the need for additional constraints, while reducing computational complexity. By the way, it should be noted that Mirzal [51] has proved the convergence of (2.5) from [24].
2.3. GNMF
The extraction of the internal structure of data is an important application extension of NMF models. Graph regularization is a manifold-based data processing method widely used to analyze the intrinsic geometric structure of manifold data [49,52,53]. The starting point of this idea is that manifold regularization can enhance the smoothness of the data representation in both linear and nonlinear spaces by minimizing the term given later in (2.6), and, by extracting geometric structure information from data points and features, it can also enhance the clustering performance of NMF (see details in [34,54,55]). The study of NMF models based on such a minimization form leads to the classic GNMF model [29].
Next, let us briefly introduce the specific construction process of GNMF. To effectively capture the geometric structure of both the data and feature manifolds, a p-nearest neighbor graph is constructed based on the input data matrix X = [x_1, \cdots, x_N]\in R_{+}^{M\times N}. Consider a graph with N vertices, where each vertex corresponds to a data point x_j (j = 1,2,\cdots,N). For each data point x_j, we identify its p-nearest neighbors and connect them with edges (theoretically, p can take any value from \{1,2,\cdots,N\}, but in practice it is usually between 3 and 5). At the same time, a weight matrix S is defined to measure the degree to which adjacent points are similar not only in distance but also in geometric structure, thereby strengthening the influence of the geometric structure on the model. There are many options for defining the weight matrix S = \{S_{jl}\}\in R_{+}^{N\times N} on the graph; the following three are commonly used:
∙ 0–1 Weight Matrix: if N(x_j) represents the set of p-nearest neighbors of the data point x_j, then S_{jl} = 1 whenever x_l\in N(x_j) or x_j\in N(x_l), and S_{jl} = 0 otherwise.
∙ Heat Kernel Weight Matrix: if point x_j is connected to point x_l, then S_{jl} = e^{-\|x_j-x_l\|^2/\sigma}, where \sigma>0 is the kernel width; otherwise S_{jl} = 0.
∙ Dot-Product Weight Matrix: if point x_j is connected to point x_l, then S_{jl} = x_j^Tx_l; otherwise S_{jl} = 0.
It is worth noting that if the data points are normalized to unit length, the dot-product of two vectors is equal to their cosine similarity.
With the help of the weight matrix S, the graph Laplacian matrix L\in R^{N\times N} is defined to characterize the degree of similarity between points in terms of geometric structure: L = D-S, where D\in R_{+}^{N\times N} is the degree matrix with D_{ii} = \sum_{j}S_{ij}; note that D is a diagonal matrix. Inspired by matrix factorization and manifold learning, Cai et al. [29] proposed the GNMF algorithm. The core graph regularization formula is as follows:
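(in its standard form, with h_j denoting the j-th column of H)
\frac{1}{2}\sum\limits_{j,l = 1}^{N}\|h_j-h_l\|^2S_{jl} = \mathrm{Tr}(HLH^T), \quad (2.6)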
where Tr(\cdot) denotes the trace operator and S_{jl} measures the pairwise similarity of the original data points. By minimizing (2.6), an intuitive result is obtained: if two data points are close in terms of the intrinsic geometric structure of the original data distribution, their new representations h_l and h_j will also be close. Then, the optimization model of GNMF is given by:
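(in the notation above)
\min\limits_{W\geq 0,\ H\geq 0}\ \|X-WH\|_F^2+\lambda\,\mathrm{Tr}(HLH^T), \quad (2.7)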
where W = [w_1, \cdots, w_M]\in R_{+}^{M\times K}, H = [h_1, \cdots, h_N]\in R_{+}^{K\times N}. The regularization parameter \lambda\geq 0 controls the smoothness of the model. By utilizing the Lagrange method, model (2.7) can be solved and the iterative update rules below are obtained:
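(in the notation of this paper, the standard GNMF multiplicative updates read)
W_{ik}\leftarrow W_{ik}\frac{(XH^T)_{ik}}{(WHH^T)_{ik}}, \qquad H_{kj}\leftarrow H_{kj}\frac{(W^TX+\lambda HS)_{kj}}{(W^TWH+\lambda HD)_{kj}}.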
It is evident that when \lambda = 0, the update rules in (2.8) reduce to the NMF update rules. In this paper, such manifold techniques are combined with NMFOS to enhance the capability of recovering the underlying nonlinear relationships in the data.
3. A new class of NMF-type model and algorithm
In this section, we introduce a new method called GNMFOSV. More importantly, we also provide a convergence proof for this type of model, to compensate for the lack or inadequacy of convergence proofs for such algorithms. The specific construction is as follows: First, following the idea of NMFOS, the orthogonality constraint on the coefficient matrix H is added to the objective function by using the penalty function method. Second, inspired by GNMF, our new model utilizes the geometric structures inherent in the data manifold and feature manifold, turning the manifold term into an additional regularization term added to the objective function for further clustering research. From the experimental results, it can be seen that combining orthogonality and graph regularization (manifold learning) and adding them separately to the objective function not only addresses the problem of considering both the nonnegativity and the orthogonality of the coefficient matrix, but also improves the clustering performance achieved by GNMFOSV.
3.1. Methodology
As mentioned above, we incorporate both the orthogonality constraint and the Laplacian constraint into the objective function to construct GNMFOSV:
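(up to constant scalings of the individual terms)
\min\limits_{W\geq 0,\ H\geq 0}\ \|X-WH\|_F^2+\lambda\,\mathrm{Tr}(HLH^T)+\alpha\|HH^T-I\|_F^2, \quad (3.1)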
where X = [x_1, \cdots, x_N]\in R_{+}^{M\times N}, W = [w_1, \cdots, w_M]\in R_{+}^{M\times K}, H = [h_1, \cdots, h_N]\in R_{+}^{K\times N}, L\in R^{N\times N}, and I is the identity matrix with K rows and K columns. Due to the property that a nonnegative matrix with pairwise orthogonal rows can have at most one nonzero entry in each column, the generation of strictly uncorrelated feature vectors is possible. Thus, we incorporate an orthogonal penalty term, \alpha\|HH^T-I\|_F^2, into the foundational model. However, directly imposing orthogonality constraints on the coefficient matrix poses challenges for the convergence proof. Therefore, we introduce an auxiliary variable, denoted as V\in R_{+}^{K\times N}, with the same dimensions as H. Subsequently, two new parameters, \alpha_1 and \alpha_2, are introduced to split the orthogonal penalty term \alpha\|HH^T-I\|_F^2 into two second-order terms. This is equivalently replaced by \frac{\alpha_1}{2}\|I-HV^T\|_F^2+\frac{\alpha_2}{2}\|V-H\|_F^2, i.e.,
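\alpha\|HH^T-I\|_F^2\ \longrightarrow\ \frac{\alpha_1}{2}\|I-HV^T\|_F^2+\frac{\alpha_2}{2}\|V-H\|_F^2. \quad (3.2)
Note that both sides vanish simultaneously exactly when V = H and HH^T = I, which justifies the replacement.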
Therefore, we equivalently replace model (3.1) with the following new model:
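(written with the same term scalings as above; constant factors on the first two terms can be absorbed into \lambda and the \alpha's)
\min\limits_{W\geq 0,\ V\geq 0,\ H\geq 0}\ J(W,V,H) = \|X-WH\|_F^2+\lambda\,\mathrm{Tr}(HLH^T)+\frac{\alpha_1}{2}\|I-HV^T\|_F^2+\frac{\alpha_2}{2}\|V-H\|_F^2. \quad (3.3)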
This formulation incorporates the orthogonality constraint with an auxiliary variable as a penalty term in the objective function (see Figure 1). The advantage of introducing the auxiliary variable lies in its ability to deconstruct the complex orthogonality constraints found in NMFOS, thereby addressing the common challenges of convergence proofs encountered in traditional orthogonal subspace NMF methods, and facilitating rapid convergence and favorable comparative performance in our experiments.
Here, \alpha_1\geq 0 and \alpha_2\geq 0 are the two orthogonal penalty parameters, and \lambda\geq 0 is the regularization parameter. The first two parameters adjust the level of orthogonality in H and ensure the validity of Eq (3.2), while the latter balances the reconstruction error in the first term against the graph regularization term of the objective function. For convenience, we employ the 0–1 weight scheme to construct the p-nearest neighbor graph. Clearly, when \alpha_1 = \alpha_2 = \lambda = 0, model (3.3) degenerates into model (2.3). Since the objective function in model (3.3) is non-convex in W, V, and H jointly, obtaining closed-form solutions is challenging. We therefore need to explore how to develop an effective algorithm to solve this model. We adopt an iterative optimization algorithm and utilize multiplicative update rules to obtain local optima (a similar idea can be found in [56,57,58], but not the same). We can rewrite the objective function J(W,V,H) in trace form as follows:
Let \psi_{ik}, \phi_{kj}, and \theta_{kj} be the Lagrange multipliers associated with the constraints W\geq 0, H\geq 0, and V\geq 0 in the constrained model (3.3). Define the matrices \Psi = [\psi_{ik}], \Phi = [\phi_{kj}], and \Theta = [\theta_{kj}]. The Lagrangian function for the constrained optimization problem can then be expressed as:
Now, taking the derivatives of the Lagrangian function L with respect to W, V, and H yields Eqs (3.5)–(3.7). By setting Eqs (3.5)–(3.7) to zero, we can derive the expressions (3.8)–(3.10) for the Lagrange multipliers.
Furthermore, multiplying both sides of Eqs (3.8)–(3.10) by the factors W_{ik}, V_{kj}, and H_{kj}, respectively, and using the Karush-Kuhn-Tucker (KKT) conditions \psi_{ik}W_{ik} = 0, \theta_{kj}V_{kj} = 0, and \phi_{kj}H_{kj} = 0, we obtain the following three equations:
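(up to positive constant factors absorbed into the parameters, consistent with the update rules that follow)
\left(WHH^T-XH^T\right)_{ik}W_{ik} = 0, \quad (3.11)
\left(\alpha_1VH^TH+\alpha_2V-(\alpha_1+\alpha_2)H\right)_{kj}V_{kj} = 0, \quad (3.12)
\left(W^TWH+\lambda HD+\alpha_1HV^TV+\alpha_2H-W^TX-\lambda HS-(\alpha_1+\alpha_2)V\right)_{kj}H_{kj} = 0. \quad (3.13)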
Therefore, based on Eqs (3.11)–(3.13), we derive the iterative formulas for updating W, V, and H in the optimization of the model (3.3):
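(these coincide with the formulas derived in the convergence analysis in the Appendix)
W_{ik}^{t+1} = W_{ik}^{t}\frac{(XH^T)_{ik}}{(W^{t}HH^T)_{ik}}, \quad (3.14)
V_{kj}^{t+1} = V_{kj}^{t}\frac{\left[(\alpha_1+\alpha_2)H\right]_{kj}}{\left[\alpha_1V^{t}H^TH+\alpha_2V^{t}\right]_{kj}}, \quad (3.15)
H_{kj}^{t+1} = H_{kj}^{t}\frac{\left[W^TX+\lambda H^{t}S+(\alpha_1+\alpha_2)V\right]_{kj}}{\left[W^TWH^{t}+\lambda H^{t}D+\alpha_1H^{t}V^TV+\alpha_2H^{t}\right]_{kj}}. \quad (3.16)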
We summarize the resulting procedure as Algorithm 1 for solving model (3.3). The proof of the convergence of Algorithm 1 can be found in the Appendix.
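For concreteness, the following is a minimal NumPy sketch of the multiplicative updates (3.14)–(3.16). It is an illustrative implementation rather than the authors' reference code: the function name gnmfosv, the random initialization, the small constant eps guarding against division by zero, and the fixed iteration count are our own assumptions, and the weight matrix S is assumed to be a precomputed symmetric 0–1 p-nearest-neighbor graph.

import numpy as np

def gnmfosv(X, K, S, alpha1=0.01, alpha2=1000.0, lam=100.0,
            n_iter=100, eps=1e-10, seed=0):
    # X: (M, N) nonnegative data matrix; S: (N, N) symmetric 0-1 weight
    # matrix of the p-nearest-neighbor graph; K: number of clusters.
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.random((M, K))          # basis matrix
    H = rng.random((K, N))          # coefficient matrix
    V = rng.random((K, N))          # auxiliary variable
    D = np.diag(S.sum(axis=1))      # degree matrix, so that L = D - S

    for _ in range(n_iter):
        # (3.14): W <- W * (X H^T) / (W H H^T)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # (3.15): V <- V * ((a1 + a2) H) / (a1 V H^T H + a2 V)
        V *= ((alpha1 + alpha2) * H) / (alpha1 * (V @ H.T @ H) + alpha2 * V + eps)
        # (3.16): H <- H * (W^T X + lam H S + (a1 + a2) V)
        #               / (W^T W H + lam H D + a1 H V^T V + a2 H)
        num = W.T @ X + lam * (H @ S) + (alpha1 + alpha2) * V
        den = W.T @ W @ H + lam * (H @ D) + alpha1 * (H @ V.T @ V) + alpha2 * H + eps
        H *= num / den
    return W, H, V

As in Section 2.1, a cluster label for sample j can then be read off as the index of the largest entry in the j-th column of H.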
3.2. Complexity analysis of algorithm
In this subsection, we analyze the computational complexity of Algorithm 1, that is, the amount of computation required to run the algorithm. The notation O(n) indicates that the computational cost of the algorithm is of the same order as n. Recall that M denotes the number of features, N the number of samples, K the number of clusters, and p the number of nearest neighbors. The computational complexities of XH^T, WHH^T, W^TX, and W^TWH are all O(MNK). Moreover, the computational complexities of HD and HS are both O(N^2K), while those of VH^TH and HV^TV are O(NK^2). Besides, an additional complexity of O(N^2M) is needed to construct the p-nearest neighbor graph. In conclusion, the computational complexity of each iteration of Algorithm 1 is O(MNK+N^2M+N^2K+NK^2). During the experiments, we found that the differences in time consumption among the various algorithms on our machine were not significant, so time consumption data are not reported.
4. Experimental results and analysis
In this section, through extensive experiments, we verify and study the performance of our algorithm in several respects, including parameter selection analysis, clustering comparison experiments, and convergence experiments. The comparisons are based on the following datasets.
4.1. Datasets
To verify the performance superiority of the proposed model, we conduct comprehensive experiments on 13 real-world datasets involving face images, digit images, and documents.
∙ ORL*: Olivetti Research Laboratory (ORL) dataset. This dataset contains 400 face images collected from 40 individuals, and each subject has 10 grayscale images of size 32×32.
*https://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
∙ AR [59]: The AR face dataset. This dataset contains images of 65 men and 55 women, with each person contributing 14 images, resulting in a total of 1680 images. These images are resized to 32×32 pixels.
∙ Yale†: This dataset contains 165 grayscale images collected from 15 individuals. Each individual has 11 images with different facial expressions or configurations: with or without glasses, left-light, center-light, right-light, normal, happy, sad, sleepy and surprised. All images are of size 32×32.
†http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html
∙Handwritten digit‡: The dataset consists of 2000 samples collected from 10 numeric characters (0–9).
‡https://lig-membres.imag.fr/grimal/data.html
∙ UCI digit§: This dataset consists of features of handwritten numerals (0–9) from the University of California, Irvine (UCI) machine learning repository. It contains 2000 instances belonging to 10 classes.
§http://archive.ics.uci.edu/dataset/72/multiple+features
∙ Diag-Bcw¶: Breast Cancer Wisconsin Diagnostic dataset (denoted as Diag-Bcw), where the features are computed from a digitized image of a fine needle aspirate of a breast mass. It contains 569 samples in 2 categories.
¶https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
∙Caltech101 [60]: Caltech101 contains 9144 pictures of objects belonging to 101 categories, including motorbike, airplane, cannon, etc. There are about 40 to 800 images per category but most categories have about 50 images.
∙Isolet||: This dataset was generated as follows: 150 subjects spoke the name of each letter of the alphabet twice. The speakers are grouped into sets of 30 speakers each, and are referred to as isolet1, isolet2, isolet3, isolet4, and isolet5. We select the subset isolet3 in this experiment. It contains 1560 samples in 26 categories.
||http://archive.ics.uci.edu/dataset/54/isolet
∙ Vote**: Congressional Voting Records (denoted as Vote), derived from the votes of each U.S. House of Representatives Congressman on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). This dataset contains 435 documents in 2 categories.
**https://archive.ics.uci.edu/dataset/105/congressional+voting+records
∙ Abalone††: This dataset corresponds to physical measurements of abalones and was collected by the Marine Resources Division. It contains 2282 samples in 4 categories.
††https://archive.ics.uci.edu/dataset/1/abalone
A summary of the statistical characteristics of the above datasets is provided in Table 2.
4.2. Algorithms used for comparison
The proposed method is compared with the well-known K-means method and nine other popular NMF-based algorithms: NMF, ONMF, GNMF, sparse constrained nonnegative matrix factorization on the matrix H (SNMF-H), sparse constrained nonnegative matrix factorization on the matrix W (SNMF-W), RMNMF, KOGNMF, min-vol NMF, and NMFOS. As to the parameter setup of all compared methods, we adopt their default settings if feasible. For a fair comparison, we only consider algorithms based on the Frobenius norm. The description of each algorithm is as follows:
∙ K-means[61]: This algorithm operates on the original matrix without extracting or utilizing information contained in the original matrix.
∙ NMF[9]: The goal of this algorithm is to find two nonnegative matrices with significantly lower dimensions than the original matrix, while achieving strategies for features extraction and dimensionality reduction.
∙ ONMF[24]: Building upon NMF, an orthogonal constraint is introduced to the basis matrix (or coefficient matrix).
∙ GNMF[29]: By integrating graph theory and manifold assumptions with the original NMF, this algorithm preserves the intrinsic geometric structure of the data.
∙ SNMF-W and SNMF-H[62]: These two algorithms can generate sparse representations and represent the data as a linear combination of a small number of basis vectors.
∙RMNMF[63]: The Robust manifold nonnegative matrix factorization (RMNMF) algorithm combines the ℓ2,1-norm with spectral clustering to enhance robustness to outliers and preserve data geometry, while an orthonormal constraint addresses the uniqueness of the solution.
∙KOGNMF[64]: Kernel-based orthogonal graph regularized NMF (KOGNMF). By integrating graph constraints into the nonlinear NMF framework, this algorithm develops a kernel-based graph-regularized NMF.
∙min-vol NMF[65]: The min-vol NMF algorithm incorporates volume regularization to address rank-deficient basis matrices, making it applicable to real-world problems such as multispectral image unmixing. It utilizes an alternating fast projected gradient method for optimization.
∙ NMFOS[48]: Building upon ONMF, this algorithm incorporates an orthogonal constraint on the coefficient matrix as a penalty term in the objective function.
4.3. Settings
To assess clustering performance, we adopt purity, normalized mutual information (NMI), accuracy (ACC), and Rand index (RI) for comparing clustering results, which are widely used in the nonnegative learning literature.
In this subsection, we will provide descriptions and definitions of these four indicators, as well as the selection of initial matrices and the stopping criteria for algorithms.
Purity measures the extent to which each cluster contains data instances from primarily one class, and it is computed by
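\mathrm{Purity} = \frac{1}{N}\sum\limits_{i}\max\limits_{j}\,n_j^i. \quad (4.1)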
The notation n_j^i represents the number of samples of the j-th input class that are assigned to the i-th cluster, and N denotes the total number of samples. Each cluster is assigned the class whose samples appear most frequently within it. Therefore, the range of purity is [0,1], and a higher value implies better clustering performance.
NMI is commonly employed in clustering to measure the similarity between two clustering results. Its range is [0,1], with higher values indicating greater similarity between the two clustering results [63]. NMI is defined by the following formula:
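(one commonly used normalization; other variants divide by the mean of the two entropies)
\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max\left(H(C), H(C')\right)},
where \mathrm{MI}(C, C') denotes the mutual information between C and C'.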
Here, C and C′ represent a set of true labels and a set of clustering labels, respectively, and H(⋅) denotes the entropy function.
ACC is used for discovering the one-to-one relationship between classes and clusters and measures the extent to which each cluster contains data instances from the corresponding class. The range of ACC is from 0 to 1, where a higher value indicates better clustering results [66]. ACC is defined as follows:
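\mathrm{ACC} = \frac{\sum_{j = 1}^{N}\delta\left(c_j, \mathrm{map}(c_j')\right)}{N}, \quad (4.3)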
where N represents the total number of samples, and map(⋅) denotes the mapping of the computed clustering label c′j to the true clustering label ci. The definition of δ is as follows:
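\delta(x, y) = \begin{cases}1, & x = y,\\ 0, & \text{otherwise}.\end{cases} \quad (4.4)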
RI is a clustering evaluation indicator that measures the similarity between two clusters. It determines the agreement between the pairwise relationships of data points in the true cluster and the obtained cluster. RI is calculated by using the following formula:
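\mathrm{RI} = \frac{a+b}{\binom{n}{2}}, \quad (4.5)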
where a represents the number of pairs of data points assigned to the same cluster in both the true and the obtained clusterings, b represents the number of pairs of data points assigned to different clusters in both the true and the obtained clusterings, and n denotes the total number of data points, with \binom{n}{2} = n(n-1)/2 being the total number of data point pairs.
By unifying (4.1)–(4.3) and (4.5) together, the definitions of all considered clustering indicators are given in Table 3.
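To make the computation of these indicators concrete, the following is a small NumPy/SciPy sketch (our own illustrative code, assuming integer label arrays y_true and y_pred of equal length): purity and RI are computed directly from their definitions, and ACC realizes the map(·) in (4.3) with the Hungarian algorithm via scipy.optimize.linear_sum_assignment; NMI can be obtained from standard libraries such as scikit-learn.

import numpy as np
from scipy.optimize import linear_sum_assignment

def purity(y_true, y_pred):
    # For each predicted cluster, count its most frequent true class.
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        total += np.bincount(members).max()
    return total / len(y_true)

def accuracy(y_true, y_pred):
    # ACC with the best one-to-one mapping found by the Hungarian algorithm.
    D = max(int(y_true.max()), int(y_pred.max())) + 1
    count = np.zeros((D, D), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)   # maximize matched samples
    return count[row, col].sum() / len(y_true)

def rand_index(y_true, y_pred):
    # RI = (a + b) / C(n, 2): pairs on which the two clusterings agree.
    n = len(y_true)
    same_true = y_true[:, None] == y_true[None, :]
    same_pred = y_pred[:, None] == y_pred[None, :]
    agree = (same_true == same_pred)
    # remove the diagonal and count each unordered pair once
    return (agree.sum() - n) / (n * (n - 1))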
In the clustering experiments, the number of clusters K is set equal to the number of categories C of the target dataset. Due to the sensitivity of most methods to initial values, for experimental accuracy we conducted 10 separate runs of all algorithms and report their mean and standard deviation. As for the stopping criterion, we simply set the maximum number of iterations to T = 100.
The experiments were conducted on a computing system operating under Windows 10 Home Edition, 64-bit (version 10.0, Build 19045). The hardware configuration comprises an Intel(R) Core(TM) i5-7300HQ CPU, functioning at a clock speed of 2.50 GHz with four cores, and is supplemented by 8 GB of RAM. The programming environment employed for the implementation of the algorithms is MATLAB R2015b.
4.4. Parameter sensitivity
In this subsection, we will discuss the sensitivity of the parameters.
There are four parameters in our model: the regularization parameter \lambda, the orthogonal penalty parameters \alpha_1 and \alpha_2, and the number of nearest neighbors p. To reveal the impact of these terms on the performance of the model, we select \lambda, \alpha_1, and \alpha_2 from the range \left\{10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}\right\}. For the parameters \lambda, \alpha_1, and \alpha_2, we employed NMI as the evaluation metric to conduct a parameter sensitivity experiment on the ORL dataset. The number of iterations is empirically fixed at 100. As stated above, the 0–1 weighting is used for computing the weight graph in all experiments unless stated otherwise. Figure 2 plots the NMI on the ORL dataset with different parameters. The x-axis represents \alpha_2, the y-axis represents \alpha_1, and the z-axis represents NMI. As shown in Figure 2, the algorithm performance changes with variations in the three parameters. For example, we observe that the proposed GNMFOSV algorithm obtains superior performance when \lambda = 100 and \alpha_1 varies in \left\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}\right\}, and it achieves consistently good performance regardless of the range of \alpha_2. The specific analysis of Figure 2 is presented in the following paragraphs.
First, a small \lambda and a large \alpha_1 may deteriorate the clustering performance. For example, from Figure 2(c), GNMFOSV performs poorly when \alpha_1 is greater than 100. However, when the value of \alpha_1 is in a smaller range, the performance of GNMFOSV is very stable and has significant advantages. Therefore, the value of \alpha_1 will be restricted to \left\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}\right\} in the following experiments.
Moreover, when \lambda = 100, the best results appear for \alpha_1 = 0.01 and \alpha_2 = 1000, which reflects that a relatively large \lambda together with a small \alpha_1 and a large \alpha_2 is preferable. Accordingly, based on these experiments, we will use \alpha_1 = 0.01, \alpha_2 = 1000, and \lambda = 100.
Second, we choose the number of nearest neighbors p with \alpha_1 = 0.01, \alpha_2 = 1000, and \lambda = 100 fixed. The clustering results in terms of NMI as the neighborhood size p varies from 1 to 10 are shown in Figure 3. On the ORL dataset, the GNMFOSV algorithm achieves good NMI when p varies from 2 to 5. The behavior for the other indicators and datasets is similar.
In summary, our experiments (all datasets were analyzed) suggest the parameter selection p = 3, \alpha_1 = 0.01, \alpha_2 = 1000, and \lambda = 100. These selected parameter values will be used in the numerical clustering experiments throughout the entire paper. This means that the robustness of the proposed method can be guaranteed when the parameters are selected within an appropriate range; outside such a range, the GNMFOSV algorithm is sensitive to the parameter values, which calls for corresponding moderation in practical applications.
At the same time, we also conducted parameter sensitivity tests on 12 other datasets. We found that the model proposed in this paper demonstrates excellent robustness. Regrettably, constrained by the page limit, only the experimental results of the ORL dataset are presented.
4.5. Ablation study
In the ablation study of this subsection, we choose p = 3, \alpha_1 = 0.01, \alpha_2 = 1000, and \lambda = 100 as the parameter selection for GNMFOSV. In the ablation study, the loss function of the proposed GNMFOSV is examined through experiments in which different parts of it are removed. The combinations of the parts of (3.3) are denoted as follows.
\bullet Baseline (NMF): the loss function composed of only the initial NMF term (the first term) in (3.3) is set as the baseline; the corresponding objectives are written out after this list.
\bullet GNMF: the loss function composed of the NMF term and the second term in Eq (3.3).
\bullet NMFOSV: the loss function consisting of the NMF term along with the third and fourth terms in Eq (3.3).
\bullet GNMFOSV: The loss function of Eq (3.3), which is indeed the proposed algorithm in this paper.
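Writing J_{\mathrm{NMF}}, J_{\mathrm{GNMF}}, and J_{\mathrm{NMFOSV}} for the three ablated objectives (our notation, with the same term scalings as in (3.3)), they read
J_{\mathrm{NMF}}(W,H) = \|X-WH\|_F^2,
J_{\mathrm{GNMF}}(W,H) = \|X-WH\|_F^2+\lambda\,\mathrm{Tr}(HLH^T),
J_{\mathrm{NMFOSV}}(W,V,H) = \|X-WH\|_F^2+\frac{\alpha_1}{2}\|I-HV^T\|_F^2+\frac{\alpha_2}{2}\|V-H\|_F^2,
each minimized subject to nonnegativity of the corresponding factor matrices.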
Here X = [x_1, \cdots, x_N]\in R_{+}^{M\times N}, W = [w_1, \cdots, w_M]\in R_{+}^{M\times K}, H = [h_1, \cdots, h_N]\in R_{+}^{K\times N}, L \in R^{N\times N}, and I is the identity matrix with K rows and K columns. The clustering performance values presented in Tables 4-6 demonstrate that all core modules of the proposed image clustering model are indispensable. The experimental results indicate that integrating the graph regularization method, the orthogonality constraints, and the auxiliary variable mechanism can significantly enhance clustering accuracy and stability. The full model outperforms the ablated variants on multiple datasets, which validates the rationality and innovativeness of the proposed model design.
4.6. Clustering results and analysis
In this subsection, extensive experimental results are displayed to validate the effectiveness of the proposed model. The experimental results of GNMFOSV and the other comparison algorithms are given in Tables 7-19, and each result is the average of 10 runs after tuning the parameters. The clustering results of all comparison algorithms are evaluated according to the four indicators above. The best result for each indicator is marked in bold. "-" indicates that clustering results could not be obtained or that there was insufficient memory.
From the above tables, we will have the following observations.
First, the proposed method GNMFOSV achieves the best Purity, ACC, NMI, and RI values among all compared algorithms across the 13 datasets. It should be pointed out that GNMFOSV outperforms the other compared methods with considerable advantages on the ORL, AR, UCI-fou, and UCI-kar datasets. For example, it achieves an improvement of about 7.7% over the second-best method, SNMF-H, on the UCI-fou dataset in terms of ACC. Second, from the data in the above tables, the graph-based clustering algorithms GNMF and GNMFOSV show clear competitiveness in Purity, NMI, ACC, and RI compared with the other methods. This indicates that graph-based clustering algorithms are good at modeling the intrinsic structure of data. Furthermore, orthogonality-based learning is promising compared with the original NMF learning, which can be observed from the fact that the orthogonality-based algorithms, i.e., ONMF, NMFOS, and GNMFOSV, substantially boost the clustering performance on the document datasets.
Third, the comparison with the GNMF and NMFOS algorithms shows that neither graph regularization alone nor the orthogonal subspace penalty alone achieves excellent clustering performance. The GNMFOSV algorithm, however, combines the advantages of graph regularity and the orthogonal subspace, and it can significantly exceed the original single-constrained NMF algorithms. In general, our proposed GNMFOSV algorithm consistently outperforms the other compared algorithms in terms of the Purity, NMI, ACC, and RI metrics, which demonstrates the clear advantage of our method.
In conclusion, the GNMFOSV method performs consistently across all dataset types, i.e., face images, digit images, object images, and documents, which further validates its robustness in terms of steady clustering performance. The quantitative results fully demonstrate the effectiveness of our proposed GNMFOSV approach.
4.7. Convergence analysis
In this subsection, we investigate the convergence speed of the algorithm. The convergence curves of GNMFOSV on all 13 datasets are demonstrated in Figure 4. The x-axis denotes the number of iterations, and the y-axis denotes the objective function value.
It can be seen from the convergence curves (one per dataset, as depicted in the figure) that our algorithm converges rapidly on every dataset. On the AR, Handwritten digit, Diag-Bcw, Caltech7, Caltech20, and Vote datasets, the GNMFOSV algorithm achieves stable convergence within 20 iterations. On the ORL, Yale, UCI-fac, UCI-fou, UCI-kar, and Abalone datasets, the objective value of GNMFOSV drops rapidly at first, then declines gently, and finally reaches stable convergence at about 100 iterations. On the isolet3 dataset, the convergence curve of GNMFOSV is very smooth, and convergence is achieved at around 100 iterations.
Thus, Figure 4 indicates that the GNMFOSV algorithm performs very well in the actual convergence experiments, and the choice of 100 iterations is sufficient in practice.
5. Conclusions
In this paper, we have proposed a new NMF cost function for clustering problems. The derived optimization model incorporates an orthogonality constraint, a graph Laplacian term, and an auxiliary variable mechanism. The experiments show the superiority of the proposed algorithm (at least on the datasets used in this work). The proposed algorithm has the following advantages: (i) It effectively leverages the geometric structure information hidden in both the data and feature spaces. Specifically, it models the data space as a sub-manifold embedded in a high-dimensional space. After conducting matrix factorization on this sub-manifold, it obtains a low-dimensional representation of the data points, which is beneficial for clustering. Therefore, this model is more discriminative than algorithms based on orthogonal subspaces that only take into account the Euclidean structure of the data. Also, it can achieve better classification performance than graph-based algorithms, because the latter do not consider the orthogonalization of the vectors in the factor matrices. (ii) By introducing an auxiliary variable, it provides a complete proof of convergence, filling the gap in demonstrating convergence under orthogonality conditions for such algorithms. (iii) After conducting 13 groups of experiments with data of different sizes and types, the results show that it is not very sensitive to changes in the size and type of the data. The performance of the proposed algorithm remains more or less stable even when the size and type of the data change.
Author contributions
Caiping Wang: Conceptualization; Wen Li: Formal analysis; Caiping Wang, Wen Li, Junjian Zhao and Yasong Chen wrote the main manuscript text. All authors reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.
Use of Generative-AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
The first author is partially supported by the China Tobacco Hebei Industrial Co., Ltd. Technology Project, China (Grant No. HBZY2023A034). The third and fourth authors are partially supported by the Natural Science Foundation of Tianjin City, China (Grant Nos. 24JCYBJC00430 and 18JCYBJC16300), the Natural Science Foundation of China (Grant No. 11401433), and the Ordinary Universities Undergraduate Education Quality Reform Research Project of Tianjin, China (Grant No. A231005802).
Conflict of interest
The authors declare there is no conflict of interest.
Appendix
Convergence analysis of Algorithm 1
In this section, we investigate the convergence of Algorithm 1. Since the objective function of the GNMFOSV model (3.3) has a lower bound, we only need to prove that it is nonincreasing under the update rules (3.14)–(3.16). Before that, we first introduce a definition given by Lee and Seung [50], as well as a lemma with a concise proof.
Definition 1. [50] Let F:{R}_+^{\upsilon\times\omega}\to {R} be continuously differentiable with \upsilon, \omega\in\mathbb{N} , and G:{R}_+^{\upsilon\times\omega}\times {R}_+^{\upsilon\times\omega}\rightarrow {R} also be continuously differentiable. Then G is called an auxiliary function of F if it satisfies the following conditions:
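G(U, U^{t})\geq F(U), \qquad G(U, U) = F(U),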
for any U, U^t\in {R}_+^{\upsilon\times\omega} .
Lemma 1. Let G be an auxiliary function of F . Then F is nonincreasing under the sequence of matrices \{U^{t}\}_{t = 0}^{+\infty} generated by the following update rule:
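U^{t+1} = \arg\min\limits_{U}G(U, U^{t}). \quad (5.1)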
Proof. By Definition 1 and (5.1), the following formula
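F(U^{t+1})\leq G(U^{t+1}, U^{t})\leq G(U^{t}, U^{t}) = F(U^{t})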
holds naturally. □
Next, we will provide the convergence proofs for the update rules (3.14)–(3.16), which is also an important innovation of this paper. To this end, we first present the objective function F , which is introduced to replace the symbols in the objective function J(W, V, H) in (3.4), i.e., F = J(W, V, H) . Moreover, let F_{W_{ik}} , F_{V_{kj}} , and F_{H_{kj}} represent the parts of the objective function (3.4) associated with the elements W_{ik} , V_{kj} , and H_{kj} , respectively. In addition, due to the complexity of the representation of F(H) based only on H , the function F(H) will be rewritten as F(H) = F_1(H)+F_2(H) with
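(consistent with the grouping of the terms in (3.3); constant factors may be absorbed into \lambda, \alpha_1, and \alpha_2)
F_1(H) = \|X-WH\|_F^2+\lambda\,\mathrm{Tr}(HLH^T), \qquad F_2(H) = \frac{\alpha_1}{2}\|I-HV^T\|_F^2+\frac{\alpha_2}{2}\|V-H\|_F^2.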
The first and second-order derivatives of F with respect to W_{ik} are listed below:
Meanwhile, the first and second-order derivatives of F_1(H) with respect to H_{kj} will be shown below:
Note that we have omitted the partial derivatives of F_2 and F_{V_{kj}} because they are not required when constructing the auxiliary functions.
We can now define auxiliary functions. Since the update rules (3.14)–(3.16) are all updated in the form of each element of the matrix, according to Lemma 1, it is sufficient to prove that F_{W_{ik}} , F_{V_{kj}} and F_{H_{kj}} do not increase under the sequences \{W_ {ik}^t\}_{t = 0}^{+\infty} , \{V_{kj}^t\}_{t = 0}^{+\infty} , and \{H_{kj}^t\}_{t = 0}^{+\infty} generated by (3.14)–(3.16), respectively. Then, the specific expressions of auxiliary functions are
We now have the following auxiliary function for H via (5.6) and (5.7):
Next, we will provide one by one in the form of lemmas that (5.4), (5.5), and (5.8) are indeed auxiliary functions. The first is to show that G_W(W_{ik}, W_{ik}^{t}) is an auxiliary function of F_{W_{ik}} .
Lemma 2. Let G_W(W_{ik}, W_{ik}^{t}) be defined by the formula (5.4). Then, G_W(W_{ik}, W_{ik}^{t}) is an auxiliary function of F_{W_{ik}} .
Proof. By expanding F_{W_{ik}}(W_{ik}) based on the Taylor series, we have the following expansion:
To prove that G_W(W_{ik}, W_{ik}^{t})\geq F_{W_{ik}}(W_{ik}) , it is sufficient to prove the following inequality:
In fact, (5.9) is true via
In addition, (5.10) can also lead to
Therefore, G_W(W_{ik}, W_{ik}^t)\geq F_{W_{ik}}(W_{ik})
follows. Meanwhile, G_W(W_{ik}, W_{ik}) = F_{W_{ik}}(W_{ik}) holds automatically. In summary, one has that G_W(W_{ik}, W_{ik}^{t}) is an auxiliary function of F_{W_{ik}} .
□
The following lemmas are needed to prove G_V(V_{kj}, V_{kj}^{t}) is an auxiliary function of F_{V_{kj}} .
Lemma 3. [67] Let \Omega\subseteq R^N be a convex set, F:\Omega\to\ R a convex function, and \lambda_{i}\in[0, 1] with i\in\{1, \cdots, k\} and \sum_{i = 1}^k\lambda_i = 1 . Then, for all x_{i}\in\Omega , F\left(\sum_{i = 1}^k\lambda_i x_i\right)\leq\sum_{i = 1}^k\lambda_iF(x_i).
Lemma 4. Suppose F(X) = \frac12\|Y-KX\|_F^2 with X\in R^{p\times m}_+ . Let A\in R^{p\times m}_+ and K\in R^{n\times p}_+ ; then,
Proof. First, let's expand the Frobenius norm of F(X) in the following way:
Second, by Lemma 3, letting \lambda_k = \frac{K_{ik}A_{kj}}{(KA)_{ij}}
and using the auxiliary variable A , we have that
Therefore, it leads to
and the result holds.
□
So, we have the following conclusion that G_V(V_{kj}, V_{kj}^{t}) is an auxiliary function of F_{V_{kj}} .
Lemma 5. Let G_V(V_{kj}, V_{kj}^{t}) be defined by (5.5), then it serves as an auxiliary function for F_{V_{kj}} .
Proof. According to Lemmas 3 and 4, we can estimate G_V(V_{kj}, V_{kj}^{t}) as follows:
In addition, it is easy to see that G_V(V_{kj}, V_{kj}) = F_{V_{kj}}(V_{kj}) and the result holds.
□
The third is to prove that G_H(H_{kj}, H_{kj}^{t}) can serve as an auxiliary function of F_{H_{kj}} .
Lemma 6. Let G_H(H_{kj}, H_{kj}^{t}) be defined as in (5.8), then it can serve as an auxiliary function for F_{H_{kj}} .
Proof. According to the definitions of F_1(H) and F_2(H) in (5.3), to prove
it is sufficient to show that
The proof of the first part in (5.11) is similar to Lemma 2. To eliminate unnecessary misunderstandings, we will provide it in detail. First, by a Taylor series expansion of F_{1_{H_{kj}}}(H_{kj}) , we have
Note that (W^TWH^t)_{kj} = \sum_{r = 1}^{K}(W^TW)_{kr}(H^t)_{rj}\geq (W^TW)_{kk}H^t_{kj}, (\lambda H^t D)_{kj} = \lambda \sum_{r = 1}^{K}H^t_{kr}D_{rj}\geq \lambda H^t_{kj}D_{jj} \geq \lambda H^t_{kj}(D-S)_{jj} = \lambda H^t_{kj}L_{jj}, then, \frac{(W^TWH^t+\lambda H^tD)_{kj}}{H_{kj}^{t}} \geq (W^TW)_{kk}+(\lambda L)_{jj}.
Therefore,
holds and G_{H_1}(H_{kj}, H_{kj}^t)\geq F_{1}(H_{kj}) follows.
The second part of (5.11) is also similar to Lemma 5. By Lemmas 3 and 4,
follows and G_{H_2}(H_{kj}, H_{kj}^t)\geq F_2(H_{kj})
holds. Then the result follows.
□
Based on the above preparations, we now provide the main conclusion of this paper.
Theorem 1. Let X\in R_{+}^{m\times n} , W^0\in R_{+}^{m\times k} , V^0\in R_{+}^{k\times n} , and H^0\in R_{+}^{k\times n} . Then, the objective function in (3.4) is nonincreasing under sequences \{W^{t}, V^{t}, H^{t}\}_{t = 0}^{+\infty} generated by (3.14)–(3.16).
Proof. First, since \frac{\partial G_W(W_{ik}, W_{ik}^{t})}{\partial W_{ik}} = F_{W_{ik}}^{'}(W_{ik}^{t})+\frac{(W^{t}HH^T)_{ik}}{W_{ik}^{t}}(W_{ik}-W_{ik}^{t}) and letting \frac{\partial G_W(W_{ik}, W_{ik}^{t})}{\partial W_{ik}} = 0 , it then goes to F_{W_{ik}}^{'}(W_{ik}^{t})W_{ik}^{t}+(W^{t}HH^T)_{ik}W_{ik}-(W^{t}HH^T)_{ik}W_{ik}^{t} = 0. According to Lemma 2, we can get that W_{ik} = W_{ik}^{t}\frac{(XH^T)_{ik}}{(W^{t}HH^T)_{ik}}.
By Lemma 1, if W_{ik}^{t+1} = \arg\min_{W_{ik}\in R}G_W(W_{ik}, W_{ik}^{t}) = W_{ik}^{t}\frac{(XH^T)_{ik}}{(W^{t}HH^T)_{ik}}, then (3.14) is nonincreasing. Since, in addition, the objective function in (3.4) has a nonnegative lower bound, (3.14) is a convergent iterative formula.
Second, note that \frac{\partial G_{V}(V_{kj}, V_{kj}^{t})}{\partial V_{kj}} = \alpha_1\frac{V_{kj}}{V_{kj}^t}(V^tH^TH)_{kj}+\alpha_2V_{kj}-(\alpha_1+\alpha_2)H_{kj}.
Let \frac{\partial G_{V}(V_{kj}, V_{kj}^{t})}{\partial V_{kj}} = 0 , then, \alpha_1\frac{V_{kj}}{V_{kj}^t}(V^tH^TH)_{kj}+\alpha_2V_{kj} = (\alpha_1+\alpha_2)H_{kj}, \alpha_1V_{kj}(V^tH^TH)_{kj}+\alpha_2V_{kj}V_{kj}^t = (\alpha_1+\alpha_2)H_{kj}V_{kj}^t. According to Lemma 5, we can get that V_{kj} = V_{kj}^{t}\frac{\left[(\alpha_1+\alpha_2)H \right]_{kj}}{\left[\alpha_1V^tH^TH+\alpha_2V^t \right]_{kj}}. So, by Lemma 1, if V_{kj}^{t+1} = \arg\min_{V_{kj}\in R}G_V(V_{kj}, V_{kj}^{t}) = V_{kj}^{t}\frac{\left[(\alpha_1+\alpha_2)H \right]_{kj}}{\left[\alpha_1V^tH^TH+\alpha_2V^t \right]_{kj}}, then (3.15) is nonincreasing. Besides, the objective function in (3.4) has a nonnegative lower bound, and it can be concluded that the formula in (3.15) is a convergent iterative formula.
Third, it is true that \frac{\partial G_{H_1}(H_{kj}, H_{kj}^t)}{\partial H_{kj}} = F_{1_{H_{kj}}}^{'}(H_{kj}^t)+\frac{(W^TWH^t+\lambda H^tD)_{kj}}{H_{kj}^t}(H_{kj}-H_{kj}^t) and \frac{\partial G_{H_2}(H_{kj}, H_{kj}^t)}{\partial H_{kj}} = \alpha_1\frac{H_{kj}}{H_{kj}^t}(H^tV^TV)_{kj}+\alpha_2H_{kj}-(\alpha_1+\alpha_2)V_{kj}.
Let \frac{\partial G_{H}(H_{kj}, H_{kj}^{t})}{\partial H_{kj}} = \frac{\partial G_{H_1}(H_{kj}, H_{kj}^{t})}{\partial H_{kj}}+\frac{\partial G_{H_2}(H_{kj}, H_{kj}^{t})}{\partial H_{kj}} = 0 ; then,
According to Lemma 6, it leads to H_{kj} = H^t_{kj}\frac{\left[W^TX+\lambda H^tS+(\alpha_1+\alpha_2)V \right]_{kj}}{\left[W^TWH^t+\lambda H^tD+\alpha_1H^tV^TV+\alpha_2H^t \right]_{kj}}.
By Lemma 1, if
holds, then (3.16) is nonincreasing. Once again, since the objective function in (3.4) has a nonnegative lower bound, it can be concluded that the formula (3.16) is a convergent iterative formula.
□