All-pairwise squared distances lead to more balanced clustering

Mikko I. Malinen; Pasi Fränti

doi:10.3934/aci.2023006

Applied Computing and Intelligence

2023, Volume 3, Issue 1: 93-115. doi: 10.3934/aci.2023006


Research article


Mikko I. Malinen,
Pasi Fränti

Machine Learning Unit, School of Computing, University of Eastern Finland, Box 111, FIN-80101 Joensuu, FINLAND; mmali@cs.uef.fi, franti@cs.uef.fi

Academic Editor: Chih-Cheng Hung

Received: 09 December 2022 Revised: 14 March 2023 Accepted: 19 April 2023 Published: 15 May 2023

In clustering, the cost function that is commonly used involves calculating all-pairwise squared distances. In this paper, we formulate the cost function using mean squared error and show that this leads to more balanced clustering compared to centroid-based distance functions, like the sum of squared distances in $k$-means. The clustering method has been formulated as a cut-based approach, more intuitively called Squared cut (Scut). We introduce an algorithm for the problem which is faster than the existing one based on the Stirling approximation. Our algorithm is a sequential variant of a local search algorithm. We show by experiments that the proposed approach provides better overall optimization of both mean squared error and cluster balance compared to existing methods.


Citation: Mikko I. Malinen, Pasi Fränti. All-pairwise squared distances lead to more balanced clustering[J]. Applied Computing and Intelligence, 2023, 3(1): 93-115. doi: 10.3934/aci.2023006



1. Introduction

Image-based geo-localization refers to finding out the geographic coordinates of a given query image ^[1] and has broad application prospects in many fields such as autonomous driving ^[2], augmented reality ^[3] and mobile robotics ^[4].

The traditional image-based geo-localization method matches the ground-view query image against geo-tagged ground-view images in a reference database. This method is also dubbed ground-to-ground image geo-localization ^[5,6,7,8]. However, most available reference images are captured in densely populated areas, e.g., famous tourist attractions and business zones, and few or none are captured in sparsely inhabited and remote areas, where the ground-to-ground method therefore often fails. In recent years, with the rapid development of the space industry, high-resolution satellite images with GPS (Global Positioning System) tags have become easy to obtain. Cross-view image geo-localization matches the query ground image with reference satellite images to determine its geographic location. Because reference satellite images are easily accessible and cover wide areas, cross-view image geo-localization can be extended to large areas or even globally. It has therefore attracted wide attention from researchers and has become a primary research direction in current image geo-localization.

The early cross-view image geo-localization methods mainly match hand-crafted features of the ground images and satellite images, then use the position tag of the best-matching satellite image as the estimated position of the ground image ^[9,10,11,12]. For example, in 2010, Noda et al. ^[9] extracted the SIFT ^[13] and SURF ^[14] feature descriptors from the images captured by an in-vehicle camera and from GPS-tagged satellite images, and matched them to localize the vehicle. In 2011, Lin et al. ^[12] extracted four feature descriptors, namely HoG ^[15], self-similarity ^[16], gist ^[17], and color histograms, from ground and satellite images for localization. However, due to the significant geometric differences between satellite images and ground images of the same geographic location, traditional hand-crafted features lack viewpoint invariance. They cannot bridge the spatial layout differences between the ground image and the satellite image, i.e., the relative position of the same objects may differ between views. As a result, methods based on traditional hand-crafted features geo-localize the ground image with very low accuracy.

Inspired by the success of deep learning in many computer vision tasks, Workman and Jacobs ^[18] first applied deep learning methods to cross-view image geo-localization in 2015. Since then, cross-view image geo-localization based on deep learning has become the mainstream method in this direction, and researchers have proposed a series of deep models for cross-view image geo-localization with excellent performance. According to whether the image viewpoint is transformed before feature extraction, these models can be roughly classified into two categories: end-to-end-based methods and viewpoint transformation-based methods.

The end-to-end-based cross-view image geo-localization method directly feeds the ground and satellite images to the deep network to extract discriminative image features for cross-view localization. For example, in 2015, Workman and Jacobs ^[18] directly used the pre-trained AlexNet ^[19] to extract deep features of ground and satellite images for cross-view image matching. After that, researchers proposed several end-to-end deep networks such as CVM-Net ^[20], GeoCapsNet ^[21], Siam-FCANet ^[22], and CVFT ^[23], which used VGG ^[24] or ResNet ^[25] as the backbone to extract the deep features of images. Such methods mainly rely on the image appearance content to learn discriminative image features, ignoring the impact of spatial layout differences between ground and satellite images. This may cause the ground image to be geo-localized to a wrong position whose satellite image contains many semantic objects similar to those in the ground image.

The viewpoint transformation-based cross-view image geo-localization method first transforms ground or satellite images to the other viewpoint, then inputs the transformed image and the untransformed image into a deep network for matching. For example, in 2019, Regmi and Borji ^[26] fed the ground image into cGANs ^[27] to synthesize its corresponding satellite image and used this synthesized satellite image as auxiliary information to minimize the difference between the query ground image and the satellite image. Shi et al. ^[28] applied a polar transform to satellite images to generate pseudo-ground panoramic images, thereby bridging the spatial layout discrepancies. Later, the polar transform algorithm was adopted by many cross-view image geo-localization methods ^[29,30,31], while Toker et al. ^[32] fused the GAN synthesis method ^[26] and the polar transform algorithm ^[28] on satellite images to synthesize ground panoramic images closer to the real ones. Such methods achieve spatial layout alignment between ground image and satellite image features through viewpoint transformation, reduce the geometric differences caused by the drastic change between the two viewpoints, and have significantly increased retrieval accuracy compared to other methods.

In summary, cross-view image geo-localization methods based on viewpoint transformation have become the main development direction in current cross-view image geo-localization research. However, such existing methods do not consider the interference of irrelevant contents in ground or satellite images on features. For example, the ground images may contain transient moving objects and backgrounds such as cars, pedestrians and sky, and the satellite images may contain redundant content beyond the coverage of ground images due to their wide range of coverage. These task-irrelevant contents will interfere with the extracted features and reduce their discriminative power, thus seriously affecting the accuracy of cross-view image geo-localization in the real environment.

To address the above problems, this paper proposes a novel cross-view image geo-localization method, named AENet. Firstly, the EfficientNetV2 ^[33] network, which contains a channel attention mechanism, is used as the backbone to extract useful local features. Then the task-irrelevant features are further filtered out along the spatial dimensions by a Triplet Attention ^[34] layer. Moreover, this paper proposes a multiple hard negative samples weighting (MHNW) strategy, which enhances the learning of the network on multiple hard negative samples in each training batch. The contributions of this paper are as follows:

● We introduce the EfficientNetV2 network to the cross-view image matching task and propose a novel cross-view image geo-localization method, AENet, which focuses more on useful features by filtering irrelevant features along the channel and spatial dimensions.

● A multiple hard negative samples weighting (MHNW) strategy is proposed to optimize the training of the network. It emphasizes the influence of multiple hard negative samples when calculating the loss in the current batch, thus enhancing the learning ability of the network for cross-view image pairs.

● Extensive experiments on two benchmark datasets show that the proposed AENet performs significantly better than state-of-the-art algorithms for cross-view image geo-localization.

The rest of this paper is organized as follows. In Section 2, related works are discussed. Section 3 describes the detailed structure of AENet and the basic principle of MHNW strategy. Section 4 supplies the experimental results of AENet and the existing cross-view image geo-localization methods. Section 5 summarizes the paper and discusses the direction of further research.

2. Related works

In the existing cross-view image geo-localization methods based on deep learning, the geo-localization networks can be classified into four common structures, each mainly composed of a viewpoint transformation module, a feature extraction CNN module, a feature processing module and a loss function module (as shown in Figure 1). By realizing each module in different ways under different composition structures, researchers have proposed a series of high-performing cross-view image geo-localization methods, as shown in Table 1. The specific role of each module and its existing realizations are described in detail below.

Figure 1. Common network structures in cross-view image geo-localization.


Table 1. Overview and properties of cross-view geo-localization methods.

Method	Publication	View Transform Module	Feature Extraction Module	Feature Processing Module	Loss Function
Vo and Hays ^[35]	ECCV2017	-	AlexNet	Orientation Regression	DBL Loss
Workman and Jacobs ^[18]	ICCV2015	-	AlexNet	-	Euclidean Loss
Liu and Li ^[36]	CVPR2019	-	VGG	-	WSMRL
CVM-Net ^[20]	CVPR2018	-	VGG	NetVLAD	WSMRL
CVFT ^[23]	AAAI2020	-	VGG	Optimal Feature Transport	WSMRL
GeoCapsNet ^[21]	ICME2019	-	ResNet	Capsule Network	Soft-TriHard Loss
Siam-FCANet ^[22]	ICCV2019	-	ResNet	FCBAM	HERTL
Rodrigues and Tan ^[37]	WACV2021	-	ResNet	Multi-scale Attention	Contrastive Loss
SAFA ^[28]	NeurIPS2019	PT	VGG	Spatial-aware Feature Aggregation	WSMRL
DSM ^[29]	CVPR2020	PT	VGG	Dynamic Similarity Matching	WSMRL
LPN ^[31]	TCSVT2021	PT	VGG	Sequential/Column Partition	WSMRL
Polar-L2LTR ^[30]	NeurIPS2021	PT	ResNet	Transformer	WSMRL
Regmi and Shah ^[26]	CV2019	GANs	GANs	Feature Fusion	WSMRL
Toker et al. ^[32]	CVPR2021	PT+GANs	ResNet+GANs	Spatial Attention	WSMRL
PT : polar transform, DBL : distance-based logistic, WSMRL : weighted soft-margin ranking loss, HERTL : hard exemplar reweighting triplet loss


Viewpoint transformation module: This module transforms the ground view image to the corresponding satellite view image, or vice versa, to reduce the large appearance gap caused by the drastic change of viewpoint. As can be seen in Table 1, most existing methods directly input the ground and satellite images into a CNN (convolutional neural network), and their network structures are shown in Figure 1(a), (c). Since 2019, researchers have begun to transform the input ground or satellite images into the other viewpoint, and there are two main methods of viewpoint transformation. One is to transform the ground view image to the satellite view image with a cGANs network ^[26], as shown in Figure 1(b). The other is to apply a polar transformation to the satellite view image to obtain a pseudo-ground panorama image ^[28], as shown in Figure 1(d).
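
The polar transformation of a satellite image can be illustrated with a minimal NumPy sketch. The sampling rule used here is an assumption made for illustration, since the text above does not state one: each output column is treated as an azimuth angle around the image centre, and each output row as a radius that shrinks from half the image size (top row) towards the centre (bottom row). The function name `polar_transform` and the nearest-neighbour sampling are likewise hypothetical choices.

```python
import numpy as np

def polar_transform(sat, out_h, out_w):
    """Resample a square satellite image (S x S or S x S x C) into an
    out_h x out_w pseudo-ground panorama.

    Assumed mapping (not taken from the text): column j is the azimuth
    theta = 2*pi*j/out_w, row i is a radius that decreases linearly from
    S/2 (top row) to 0 (bottom row). Nearest-neighbour sampling is used.
    """
    s = sat.shape[0]
    assert sat.shape[1] == s, "expects a square satellite image"
    i = np.arange(out_h).reshape(-1, 1)           # output rows, shape (out_h, 1)
    j = np.arange(out_w).reshape(1, -1)           # output columns, shape (1, out_w)
    radius = (s / 2.0) * (out_h - i) / out_h
    theta = 2.0 * np.pi * j / out_w
    centre = (s - 1) / 2.0
    rows = centre - radius * np.cos(theta)        # broadcast to (out_h, out_w)
    cols = centre + radius * np.sin(theta)
    rows = np.clip(np.rint(rows), 0, s - 1).astype(int)
    cols = np.clip(np.rint(cols), 0, s - 1).astype(int)
    return sat[rows, cols]

sat = np.random.default_rng(0).random((64, 64, 3))   # stand-in satellite image
pano = polar_transform(sat, out_h=16, out_w=64)
print(pano.shape)
```

The output keeps the channel layout of the input and only changes the spatial sampling, so it can be fed to the same kind of CNN as the real ground panorama.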

Feature extraction CNN module: The role of this module is to extract the local features of the query ground image and the reference satellite images separately through the deep network. In the existing works, the following four CNN networks are commonly used to extract the local features: the first is AlexNet, used by methods ^[18] and ^[35]; the second is VGG or its fine-tuned version, used by methods ^[20,23,28,29,31,36]; the third is ResNet, used by methods ^[21,22,30,37]; the fourth is the GAN, used by methods ^[26,32].

Feature processing module: This module is used to process the local features extracted by the feature extraction module to obtain more discriminative image descriptors. As shown in Table 1, most of the methods process the extracted local features, except the methods in ^[18,36]. The adopted feature processing methods can be roughly divided into two categories: feature processing based on attention mechanisms and feature processing based on spatial layout learning. The attention-based feature processing method mainly learns the salient image features through the attention mechanism ^[22,28,30,32,37]. The feature processing module based on spatial layout learning aims at reducing the spatial layout difference between the local features of the ground image and those of the satellite image by learning the orientation or spatial position relationships ^[21,23,29,31,35].

Loss function module: This module aims to measure the similarity between the features extracted from the query ground image and the reference satellite image. The goal is that the closer the geographic locations of the two images, the higher the similarity. In 2017, Vo and Hays ^[35] proposed a function called Distance-based Logistic loss (DBL loss). In 2018, Hu et al. ^[20] proposed a loss function called weighted soft-margin ranking loss (WSMR) based on DBL loss to speed up the training of the network. Moreover, Hu et al. ^[20] used a hard sample mining strategy proposed by Hermans et al. ^[38] to find the hard sample pairs of satellite image and ground image which do not match but look alike, then repeatedly learned the hard sample pairs to improve the generalization ability of the network. As can be seen from Table 1, the loss function proposed by Hu et al. ^[20] has been used by many methods since then ^[21,23,26,28,29,30,31,32,36]. Additionally, Cai et al. ^[22] proposed a hard exemplar reweighting triplet loss function to mine valuable hard samples for the network to learn and thus improve the performance of the network.

In summary, in existing deep learning-based cross-view image geo-localization methods, the designed network structures can be summarized as a framework similar to the Siamese network, as shown in Figure 2. The framework consists of four parts: viewpoint transformation module, feature extraction CNN module, feature processing module, and loss function module.

Figure 2. Overall framework of the cross-view image geo-localization method.


3. AENet framework

In image geo-localization tasks, objects such as cars, pedestrians and sky not only provide no useful information, but may also interfere with the extracted image descriptor. Shallow networks can only learn features such as contour, color and texture, and cannot obtain the high-level semantic features needed to discriminate the above objects. Although deep networks such as ResNet and VGG, commonly used in existing methods, can learn richer high-level semantic features, they still have difficulty focusing on the high-level semantic features that are important for image geo-localization. Therefore, to improve the localization accuracy, the problem addressed in this section is how to make the network pay more attention to geo-localization-related objects during learning and eliminate the interference of useless features as much as possible.

To address the above problems, this section proposes a cross-view image geo-localization method based on attention efficient networks, as shown in Figure 3. Firstly, the polar transformation is used to transform the satellite images into pseudo-ground panoramic images. Then the pseudo-ground panoramic image and the actual ground panoramic image are separately input into an attention efficient network which fuses the channel and spatial attention mechanisms, called AENet. It uses EfficientNetV2 to extract the high-level semantic features of the input images, then leverages the Triplet Attention (TA) module to determine the importance of different semantic features in order to quickly focus on the semantic features that are important for image geo-localization. Finally, in the loss function module, we first determine whether the number of hard negative samples exceeds one. If it does, the loss function based on the multiple hard negative samples weighting (MHNW) strategy is used to measure the similarity between the high-level semantic features extracted from the actual ground panoramic images and the pseudo-ground panoramic images. Otherwise, the weighted soft-margin ranking loss (WSMR) proposed by Hu et al. ^[20] is used.

Figure 3. Overall architecture of AENet.


The core of the proposed method is the attention efficient network fusing channel and spatial attention mechanisms, which is used to extract the high-level semantic features of the actual ground panoramic image and the pseudo-ground panoramic image. Although the pseudo-ground panoramic images obtained from satellite images by a polar transform are similar to the actual ground panoramic images, there are still obvious visual differences. For example, the pseudo-ground panoramic image shows only unclearly, or not at all, the facades of objects that are visible from the ground view but difficult to see from the satellite view. Therefore, we do not share parameters between the two branches that extract the high-level semantic features from the pseudo-ground panoramic image and the actual ground panoramic image.

The subsequent subsections describe the following three key modules of the proposed method in detail: high-level semantic feature extraction module, high-level semantic feature optimization module and loss function module.

3.1. High-level semantic feature extraction

In cross-view image matching tasks, one way to improve the accuracy is to optimize the CNN used in the feature extraction module. Three factors affect the performance of the CNN: the depth of the network, the width of the network and the resolution of the input image. Increasing these three factors yields richer and more complex features, but increasing them indiscriminately raises training difficulty and computational cost. In recent years, the EfficientNet networks ^[33,39] have achieved great success in image classification. The improved version of EfficientNet, viz. EfficientNetV2, is reported to be more accurate than other models with a comparable number of parameters. Unlike ResNet and VGG, EfficientNetV2 is generated by using a neural architecture search technique ^[40,41] to automatically learn and balance the above three factors. It mainly consists of Fused-MBConv blocks ^[42] and MBConv ^[39,43] blocks, as shown in Figure 4.

Figure 4. The structure of EfficientNetV2 network.


In this paper, EfficientNetV2 is introduced into the field of cross-view image geo-localization. It is used as the backbone in the feature extraction module to extract rich high-level semantic features. This is mainly because its MBConv module can use the channel attention module SENet ^[44] to initially filter the features along the channel dimension. The structure of the MBConv is shown in Figure 5. It mainly consists of a 1 $\times$ 1 convolution, a Depthwise convolution, SENet, a second 1 $\times$ 1 convolution, and a residual connection, where the SENet is shown in the bottom part of Figure 5.

Figure 5. MBConv and SENet structure.


In the MBConv module, the Depthwise convolution preserves the information of each channel as much as possible by adopting a separate convolution kernel for each channel. This extracts information such as the contour, shape and color of each object in the image. We denote the output feature map of the Depthwise convolution as $f_{H \times W \times C}$ . SENet performs global average pooling on $f_{H \times W \times C}$ to obtain a feature map $f_{1 \times 1 \times C}$ of size $1 \times 1 \times C$ . This compresses the spatial dimensions to capture the energy in each feature channel and thus obtain the high-level semantic features. Then, $f_{1 \times 1 \times C}$ is fed to two fully connected layers to model the correlation between channel features and get the weight of each channel, and the weight mask is obtained by sigmoid activation. Specifically, task-irrelevant features have smaller weights, even close to 0; task-relevant features have larger weights, close to 1. Finally, $f_{H \times W \times C}$ is multiplied by the weight mask to generate the final weighted feature map $f^{'}_{H \times W \times C}$ .
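
The channel reweighting described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random rather than learned, the activation between the two fully connected layers is taken to be ReLU (the text does not name it), and the names `se_reweight` and `reduced` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(f, w1, b1, w2, b2):
    """Channel reweighting of a feature map f with shape (H, W, C)."""
    squeezed = f.mean(axis=(0, 1))                 # global average pooling -> (C,)
    hidden = np.maximum(0.0, squeezed @ w1 + b1)   # first FC layer; ReLU is an assumption
    mask = sigmoid(hidden @ w2 + b2)               # second FC layer + sigmoid -> (C,), values in (0, 1)
    return f * mask, mask                          # mask broadcasts over H and W

rng = np.random.default_rng(0)
H, W, C, reduced = 4, 16, 8, 2                     # sizes chosen only for the example
f = rng.standard_normal((H, W, C))
w1, b1 = rng.standard_normal((C, reduced)), np.zeros(reduced)
w2, b2 = rng.standard_normal((reduced, C)), np.zeros(C)
f_weighted, mask = se_reweight(f, w1, b1, w2, b2)
print(f_weighted.shape, mask.shape)
```

Each channel of the input is scaled by a single value between 0 and 1, which corresponds to the weight mask in the description above.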

3.2. High-level semantic feature optimization

Although EfficientNetV2 can better extract the characteristics of semantic objects in images, the importance of different semantic objects for image geo-localization varies, and cross-view image geo-localization also needs to consider the spatial layout relationship between the semantic objects. For example, when judging whether a ground image and a satellite image (as shown in Figure 6) belong to the same geographical location, people usually first judge whether the two images contain geographical landmarks with the same shape and color, such as houses, highways and rivers, and ignore objects that are not geographically representative, such as vehicles and pedestrians. They then decide according to whether the spatial layout relationships between the shared objects are consistent in the two images (as shown in Figure 6(b), (c), the highways in both images lie between the forest and the river).

Figure 6. Cross-view image pairs.


The attention mechanism can make the network focus more on task-relevant content objects by assigning different weights to different regions of the feature maps. Therefore, in this paper, we add an attention mechanism layer after the EfficientNetV2 network. The traditional attention mechanism modules BAM ^[45] and CBAM ^[46] fuse channel and spatial attention mechanisms, but compute the channel attention and the spatial attention separately. In contrast, Triplet Attention (TA), proposed by Misra et al. ^[34] in 2021, establishes the connection between channel attention and spatial attention through rotation operations and residual transformations to capture the dependency between the spatial dimensions and the channel dimension of the input tensor, and lets the network quickly focus on the task-relevant objects.

In view of this, this subsection utilizes the Triplet Attention module to optimize the high-level semantic features extracted by EfficientNetV2. The Triplet Attention module is used to determine the weights of different semantic objects at different positions, so that the finally obtained descriptors consider both the importance of different semantic objects and the spatial layout relationship. The structure of the TA module is shown in Figure 7. For the input local feature map $f^{'}_{C \times H \times W}$ with a shape of $1280 \times 4 \times 16$ extracted by EfficientNetV2, the TA module first permutes it to obtain two other feature maps $f^{'}_{W \times H \times C}$ and $f^{'}_{H \times C \times W}$ . Secondly, the three feature maps pass through three branches with the same structure to obtain the attention weighted feature maps $f^{\ast}_{C \times H \times W}$ , $f^{\ast}_{W \times H \times C}$ and $f^{\ast}_{H \times C \times W}$ , respectively. For example, in the first branch, the feature map $f^{'}_{C \times H \times W}$ is fed to the max-pooling layer and the avg-pooling layer, respectively. Then the obtained results are concatenated as $f_{2 \times H \times W}$ with a shape of $2 \times H \times W$ . Then, $f_{2 \times H \times W}$ is fed to a sequence composed of a $7 \times 7$ convolution, a BN layer and a Sigmoid activation to obtain the weight mask. The input $f^{'}_{C \times H \times W}$ is multiplied by the weight mask to generate the attention weighted feature map $f^{\ast}_{C \times H \times W}$ . The other two branches work analogously to obtain $f^{\ast}_{W \times H \times C}$ and $f^{\ast}_{H \times C \times W}$ . Finally, $f^{\ast}_{W \times H \times C}$ and $f^{\ast}_{H \times C \times W}$ are permuted back to the original shape $C \times H \times W$ , and the three feature maps of the same size are added element-wise and averaged to obtain the final image feature descriptors.
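
The three-branch computation can be sketched as follows. This is a simplified NumPy illustration, not the implementation used in the paper: the BN layer is omitted, the convolution kernels are random instead of learned, and the helper names (`z_pool`, `conv2d_same`, `attention_branch`, `triplet_attention`) are hypothetical. A small channel count is used so that the loop-based convolution stays fast.

```python
import numpy as np

def z_pool(x):
    """Stack max- and average-pooling over the first axis: (D, A, B) -> (2, A, B)."""
    return np.stack([x.max(axis=0), x.mean(axis=0)])

def conv2d_same(x, kernel):
    """Zero-padded 'same' cross-correlation: x (Cin, A, B), kernel (Cin, k, k) -> (A, B)."""
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(x.shape[1:])
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(xp[:, r:r + k, c:c + k] * kernel)
    return out

def attention_branch(x, kernel):
    """Pooling -> 7x7 convolution -> sigmoid mask, then reweight x (BN omitted)."""
    mask = 1.0 / (1.0 + np.exp(-conv2d_same(z_pool(x), kernel)))
    return x * mask                                # (A, B) mask broadcasts over the first axis

def triplet_attention(f, kernels):
    """f has shape (C, H, W); kernels holds three arrays of shape (2, 7, 7)."""
    b1 = attention_branch(f, kernels[0])                                          # stays (C, H, W)
    b2 = attention_branch(f.transpose(2, 1, 0), kernels[1]).transpose(2, 1, 0)    # via (W, H, C)
    b3 = attention_branch(f.transpose(1, 0, 2), kernels[2]).transpose(1, 0, 2)    # via (H, C, W)
    return (b1 + b2 + b3) / 3.0                    # element-wise addition and average

rng = np.random.default_rng(0)
f = rng.standard_normal((8, 4, 16))                # small C instead of 1280, for speed
kernels = 0.1 * rng.standard_normal((3, 2, 7, 7))
out = triplet_attention(f, kernels)
print(out.shape)
```

Both permutations used here are their own inverses, so applying the same `transpose` again restores the $C \times H \times W$ layout before the three maps are averaged.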

Figure 7. TA module structure.


3.3. Loss function based on MHNW strategy

As shown in Figure 8, if a ground image $a$ is regarded as the anchor image, the corresponding reference satellite image $p$ is called the positive sample of this anchor image, whose Euclidean distance from the anchor image $a$ is ${d}_{a, p}$ , and the satellite images $n$ taken at different geographic locations are called the negative samples of this anchor image. According to the distance to the anchor image, the negative samples can be further classified into the following three categories.

Figure 8. Anchor, positive and negative sample examples.


The sample in the first category is called an easy negative sample, whose Euclidean distance ${d}_{a, n}$ from the anchor image $a$ is much larger than the Euclidean distance ${d}_{a, p}$ between the anchor $a$ and the positive sample $p$ . Namely, this negative satellite image is obviously different from the anchor ground image $a$ , such as the satellite image in Figure 8(c). Such samples are easily distinguished by the network.

The sample in the second category is called a semi-hard negative sample, whose Euclidean distance ${d}_{a, n}$ from the anchor image $a$ is very close to ${d}_{a, p}$ , but still larger than ${d}_{a, p}$ . Namely, this satellite image sample is similar to the anchor image. For example, the semi-hard negative sample in Figure 8 and the anchor image $a$ both consist of multiple house buildings. Such negative samples cannot be distinguished by the network easily.

The sample in the third category is called a hard negative sample, whose Euclidean distance ${d}_{a, n}$ is smaller than ${d}_{a, p}$ . Namely, this satellite image sample is more similar to the anchor image than the corresponding reference satellite image. For example, the hard negative sample in Figure 8 is extremely similar to the ground image $a$ . It is hard for the network to distinguish such a negative sample from the positive sample.
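
The three categories can be expressed as a small helper. The sketch below is illustrative only: the text does not quantify "much larger", so the threshold `easy_margin` is a hypothetical parameter, and the function name `categorize_negatives` is likewise an assumption.

```python
import numpy as np

def categorize_negatives(d_ap, d_an, easy_margin=0.5):
    """Label each negative by its distance to the anchor relative to d_ap.

    hard:      d_an < d_ap
    semi-hard: d_ap <= d_an <= d_ap + easy_margin   (ties count as semi-hard here)
    easy:      d_an > d_ap + easy_margin            (easy_margin is a hypothetical threshold)
    """
    d_an = np.asarray(d_an, dtype=float)
    labels = np.full(d_an.shape, "semi-hard", dtype=object)
    labels[d_an < d_ap] = "hard"
    labels[d_an > d_ap + easy_margin] = "easy"
    return labels

labels = categorize_negatives(d_ap=1.0, d_an=[0.6, 1.1, 2.0])
print(list(labels))  # ['hard', 'semi-hard', 'easy']
```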

Intuitively, the total loss should be minimal when the distance between the anchor image and the positive sample is minimal and the distance between the anchor image and the hard negative sample is maximal. Therefore, Hu et al. ^[20] used the hard sample mining strategy proposed by Hermans et al. ^[38] to find the hard negative sample, and proposed the following weighted soft-margin ranking loss function, which combines the distances of the positive sample and the negative samples to the anchor image.

$\begin{equation} L_{\mathrm{Hu}} = \ln \left(1+e^{\alpha\left(d_{a, p}-\min \limits_{n \in B} d_{a, n}\right)}\right) \end{equation}$

(3.1)

where ${d}_{a, p}$ is the Euclidean distance between the features of each ground image $a$ and its positive sample $p$ in the current batch $B$ and $\alpha$ is a weighting parameter used to improve the convergence speed of the network.
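
For a single anchor, Eq (3.1) can be evaluated as in the following sketch. The function name `wsmr_loss` and the default value of `alpha` are assumptions for illustration; the text does not state the value of $\alpha$ used in training.

```python
import numpy as np

def wsmr_loss(d_ap, d_an, alpha=10.0):
    """Eq. (3.1) for one anchor: ln(1 + exp(alpha * (d_ap - min_n d_an))).

    d_ap:  distance between the anchor and its positive sample.
    d_an:  distances between the anchor and the negatives in the batch.
    alpha: weighting parameter (the default is a hypothetical value).
    """
    gap = d_ap - np.min(d_an)
    return float(np.logaddexp(0.0, alpha * gap))   # numerically stable ln(1 + e^x)

loss_easy = wsmr_loss(0.5, [0.9, 1.4])   # closest negative is farther than the positive
loss_hard = wsmr_loss(0.5, [0.3, 1.4])   # closest negative is nearer than the positive
print(loss_easy < loss_hard)             # True
```

Only the closest negative enters the loss, which is the property that the weighting strategy described next is designed to relax.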

During the training process of the network, we found that, for an anchor image, there may be $N$ hard negative samples $n_{1}$ , $n_{2}$ , $\cdots$ , $n_{N}$ in each batch, as shown in Figure 9. However, the hard sample mining strategy proposed by Hermans et al. ^[38] only selects the negative sample which is closest to the anchor image and ignores other hard negative samples. Thus, in order to make the network learn more adequately from hard negative samples, we designed a loss function based on a multiple hard negative samples weighting (MHNW) strategy, i.e., when there are several hard negative samples in a batch, the losses of all these hard negative samples are emphasized. Specifically, for an anchor image $a$ in a batch $B$ , first calculate the Euclidean distance between $a$ and all negative samples and take the $N$ hard negative samples $n_{i}$ $(i \in N)$ of the anchor, i.e., those whose distance from the anchor satisfies $d_{a, n_{i}} < d_{a, p}$ . Then, the difficulty of each hard negative sample is measured by $D_{i} = d_{a, {p}}-d_{a, n_{i}}$ , where the smaller $d_{a, n_{i}}$ is, the higher the difficulty. Finally, according to the difficulty of each hard negative sample $n_{i}$ , compute the weight $w_{i}$ as follows,

$\begin{equation} w_i = \frac{D_{i}}{\max (D_{i})}(i \in N) \end{equation}$

(3.2)
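The weighting in Eq. (3.2) can be illustrated with a small worked example. The sketch below assumes precomputed distances for one anchor; the function name and the sample values are hypothetical and chosen only to show how the hard negatives are selected and weighted.

```python
def mhnw_weights(d_ap, d_an_list):
    """Select hard negatives (d_an < d_ap) and weight them as in Eq. (3.2).

    Returns (hard, weights): the hard-negative distances and w_i = D_i / max(D).
    """
    hard = [d for d in d_an_list if d < d_ap]  # negatives closer than the positive
    if not hard:
        return [], []                          # no hard negatives for this anchor
    diffs = [d_ap - d for d in hard]           # difficulty D_i = d_ap - d_an_i > 0
    d_max = max(diffs)
    return hard, [D / d_max for D in diffs]

# One anchor with positive distance 1.0 and four negatives; three of them are hard.
hard, weights = mhnw_weights(1.0, [0.2, 0.6, 0.8, 1.5])
```

The closest negative always receives weight 1, and the weights of the remaining hard negatives shrink as their distance approaches $d_{a, p}$.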

Figure 9. Illustration of the anchor image with its corresponding three negative samples.


The loss of each hard negative sample is then multiplied by its weight $w_{i}$ , and the weighted losses of all samples are summed to give the final loss for the batch. Therefore, our loss is defined as

$\begin{equation} \begin{aligned} &L_{\text {MHNW}} = \frac{1}{N} \sum\limits_{a \in \mathrm{batch}} w_i \cdot \ln \left(1+e^{\alpha\left(d_{a, p}-d_{a, n_i}\right)}\right), &(i \in N) \end{aligned} \end{equation}$

(3.3)
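One possible reading of Eq. (3.3) is sketched below: for each anchor, the weighted soft-margin terms of its $N$ hard negatives are averaged, and the per-anchor values are summed over the batch. This is an interpretation, not the authors' implementation. The function names, the default `alpha`, and the choice to let an anchor without hard negatives contribute zero are all assumptions that the text does not specify.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mhnw_anchor_loss(d_ap, d_an_list, alpha=10.0):
    """Weighted soft-margin loss over all hard negatives of one anchor."""
    hard = [d for d in d_an_list if d < d_ap]   # hard negatives: d_an < d_ap
    if not hard:
        return 0.0                              # assumption: no hard negative, no loss
    diffs = [d_ap - d for d in hard]            # D_i = d_ap - d_an_i
    d_max = max(diffs)
    # w_i = D_i / max(D); the exponent alpha * (d_ap - d_an_i) equals alpha * D_i
    total = sum((D / d_max) * softplus(alpha * D) for D in diffs)
    return total / len(hard)                    # the 1/N factor

def mhnw_batch_loss(batch, alpha=10.0):
    """Sum the per-anchor losses; batch is a list of (d_ap, d_an_list) pairs."""
    return sum(mhnw_anchor_loss(d_ap, d_an, alpha) for d_ap, d_an in batch)

single = mhnw_anchor_loss(1.0, [0.5, 2.0], alpha=1.0)   # one hard negative, weight 1
batch = mhnw_batch_loss([(1.0, [0.5, 2.0]), (0.2, [0.5, 0.9])], alpha=1.0)
```

With a single hard negative the weight is 1 and the value reduces to the soft-margin term of Eq. (3.1) for that negative.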

4. Experimental results and analysis

4.1. Experimental setup

The performance of the proposed method was tested under the experimental setup shown in Table 2.

Table 2. Experimental setting.

Operation	Setup
Input Image Size	128 $\times$ 512
Dataset	CVUSA ^[47] and CVACT_val ^[36]
Training Strategies	Batch size = 24, AdamW Optimizer, learning rate = 0.00001, weight decay = 0.00005
Experimental Platform	24GB TITAN RTX GPU, PyTorch 1.7.1.
Evaluation Protocol	Recall accuracy at top K (K $\in$ {1, 5, 10, 1%})


Datasets. The experiments were conducted on two standard benchmark datasets: CVUSA and CVACT_val. The original CVUSA dataset is a large-scale dataset constructed by Workman and Jacobs ^[18], which consists of ground and satellite images from all over the U.S. Zhai et al. ^[47] selected 35,532 pairs of cross-view images from the original CVUSA dataset for training and 8,884 pairs of cross-view images for testing. The CVUSA dataset constructed by Zhai et al. has been widely used in research on cross-view image geo-localization, and thus, in this section, CVUSA denotes the CVUSA dataset constructed by Zhai et al. CVACT_val is a new city-scale cross-view image dataset constructed by Liu and Li ^[36], which densely covers the city of Canberra. This dataset provides 35,532 pairs of cross-view images as the training set and 8,884 pairs of cross-view images as the validation set. All input ground and satellite images were resized to 128 $\times$ 512.

Network training. The proposed method AENet was implemented in PyTorch and trained on a TITAN RTX GPU with 24 GB of memory. The network was initialized with parameters pre-trained on ImageNet and then updated by the AdamW optimizer. During training, the batch size was set to 24, the learning rate to 0.00001 and the weight decay to 0.00005.

Evaluation metric. The top $K$ recall accuracy proposed by Vo and Hays ^[35] was used as the evaluation metric. When $K$ is an integer, the top $K$ is the set of the $K$ satellite images whose descriptors are closest to that of a query ground image. When $K$ is a percentage, the top $K$ is the set of the $K\times T$ satellite images whose descriptors are closest to that of a query ground image, where $T$ is the total number of satellite images in the reference satellite image set. The top $K$ recall accuracy is the ratio of query images whose corresponding satellite image is in the top $K$ , and is denoted R@ $K$ . In this section, R@1, R@5, R@10, and R@1% were used to evaluate the performance of cross-view image geo-localization.
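The R@ $K$ definition above can be sketched in plain Python as follows. The sketch assumes that query `i` matches reference `i`, that descriptors are plain lists of floats, and that a percentage is passed as a fraction (0.01 for 1%) and rounded down to at least one image. The rounding rule and all names are assumptions, since the text only states $K\times T$.

```python
def squared_distance(u, v):
    """Squared Euclidean distance; it ranks neighbours like the Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recall_at_k(ground_desc, sat_desc, k):
    """Fraction of ground queries whose true satellite match is in the top K.

    ground_desc[i] is assumed to match sat_desc[i].
    k is an int (top K) or a float fraction of the reference set (0.01 for 1%).
    """
    total_refs = len(sat_desc)
    top = k if isinstance(k, int) else max(1, int(k * total_refs))
    hits = 0
    for i, query in enumerate(ground_desc):
        d_true = squared_distance(query, sat_desc[i])
        # Count the references that are strictly closer than the true match.
        closer = sum(1 for s in sat_desc if squared_distance(query, s) < d_true)
        if closer < top:
            hits += 1
    return hits / len(ground_desc)

# Toy data: query 0 ranks its match first, query 1 third, query 2 second.
ground = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
sat = [[0.1, 0.0], [4.0, 4.0], [1.2, 1.0]]
r1 = recall_at_k(ground, sat, 1)
r2 = recall_at_k(ground, sat, 2)
r3 = recall_at_k(ground, sat, 3)
```

This brute-force version compares every query with every reference, which is adequate for illustration but slow for a reference set of several thousand images.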

4.2. Comparison to the existing methods

The proposed method was compared with several state-of-the-art methods on two standard datasets, CVUSA and CVACT_val, and the experimental results are shown in Tables 3 and 4, respectively. Tables 3 and 4 show that our method has higher recall accuracy than the other methods on every metric, with the largest gains on the key evaluation metric R@1. On the CVUSA dataset, our method achieves an R@1 of 95.97%, which is 1.92 percentage points higher than the second-best method. On the CVACT_val dataset, our method achieves an R@1 of 91.78%, which is 6.89 percentage points higher than the second-best method. These results indicate that our method suppresses the interference of irrelevant features on the extracted feature descriptors, thus improving the recall accuracy.

Table 3. Comparisons with state-of-the-art methods on the CVUSA ^[47] dataset.

Model	CVUSA
Model	R@1	R@5	R@10	R@1%
Workman and Jacobs ^[18]	-	-	-	34.30
Zhai et al. ^[47]	-	-	-	43.20
Vo and Hays ^[35]	-	-	-	63.70
CVM-Net ^[20]	22.53	50.01	63.19	93.52
Regmi and Shah ^[26]	48.75	-	81.27	95.98
GeoCapsNet ^[21]	-	-	-	98.07
Siam-FCANet34 ^[22]	-	-	-	98.30
Liu and Li ^[36]	40.79	66.82	76.36	96.08
CVFT ^[23]	61.43	84.69	90.49	99.02
SAFA ^[28]	89.84	96.93	98.14	99.64
DSM ^[29]	91.96	97.50	98.54	99.67
Toker et al. ^[32]	92.56	97.55	98.33	99.57
Polar-L2LTR ^[30]	94.05	98.27	98.99	99.67
Ours	95.97	98.80	99.11	99.84


Table 4. Comparisons with state-of-the-art methods on the CVACT_val ^[36] dataset.

Model	CVACT_val
Model	R@1	R@5	R@10	R@1%
CVM-Net ^[20]	20.15	45.00	56.87	87.57
Liu and Li ^[36]	46.96	68.28	75.48	92.01
CVFT ^[23]	61.05	81.33	86.52	95.93
SAFA ^[28]	81.03	92.80	94.84	98.17
DSM ^[29]	82.49	92.44	93.99	97.32
Toker et al. ^[32]	83.28	93.57	95.42	98.22
Polar-L2LTR ^[30]	84.89	94.59	95.96	98.37
Ours	91.78	96.28	97.29	99.29


4.3. Ablation experiments

TA module. To evaluate the effectiveness of the TA module, we removed the TA module from the AENet to obtain a Baseline containing only the EfficientNetV2 network, and trained the Baseline. We also added BAM and CBAM after the Baseline and trained these two networks, denoted as Baseline+BAM and Baseline+CBAM, respectively. The comparison results on the CVUSA and CVACT_val datasets are shown in Table 5. The results show that AENet, which includes the TA module, achieves the highest recall accuracy on both datasets, reaching R@1 of 95.97% and 91.78% on CVUSA and CVACT_val, respectively. This is because the TA module can filter features by spatial location, allowing the network to focus more on the regions of the image that are relevant to the cross-view image geo-localization task, thus improving the recall accuracy.

Table 5. Ablation experiment of TA module.

Model	CVUSA				CVACT_val
Model	R@1	R@5	R@10	R@1%	R@1	R@5	R@10	R@1%
Baseline	94.06	98.21	98.85	99.80	91.14	96.06	97.02	99.11
Baseline+BAM	89.80	96.58	97.90	99.52	82.21	91.90	94.12	98.01
Baseline+CBAM	90.89	96.93	98.05	99.53	83.13	92.19	94.24	98.21
AENet	95.97	98.80	99.11	99.84	91.78	96.28	97.29	99.29


Loss function based on MHNW strategy. To test the effectiveness of the MHNW strategy proposed in this paper, we conducted ablation experiments on the CVUSA and CVACT_val datasets with and without the MHNW strategy. The experimental results are shown in Table 6, where "without MHNW" means that we use the weighted soft-margin ranking loss function based on the hard sample mining strategy proposed by Hermans et al. ^[38], and "with MHNW" means that we use the weighted soft-margin ranking loss function based on the MHNW strategy. The results show that, with the MHNW strategy, all four evaluation metrics (R@1, R@5, R@10 and R@1%) improved on both the CVUSA and CVACT_val datasets. This shows that the MHNW strategy can enhance the learning ability of the network by emphasizing multiple hard samples during training, and obtain more discriminative image features.

Table 6. Ablation experiment of MHNW strategy.

Model	CVUSA				CVACT_val
Model	R@1	R@5	R@10	R@1%	R@1	R@5	R@10	R@1%
without MHNW	95.56	98.66	99.06	99.83	91.69	96.09	97.26	99.18
with MHNW	95.97	98.80	99.11	99.84	91.78	96.28	97.29	99.29


Complexity and computation cost. To compare the complexity and computation cost of the networks, Table 7 lists the number of parameters and the GFLOPs (giga floating-point operations) of SAFA ^[28], DSM ^[29], Polar-L2LTR ^[30] and the proposed method. The results show that the proposed method has lower GFLOPs than the other three networks. In terms of the number of parameters, the proposed method has far fewer than Polar-L2LTR, the method with the second-best recall accuracy, but more than SAFA and DSM, which leaves room for improvement.

Table 7. Comparison with previous works in terms of parameters and GFLOPs.

Model	Param (M)	GFLOPs
SAFA ^[28]	29.50	15.64
DSM ^[29]	17.90	7.25
Polar-L2LTR ^[30]	195.90	44
Ours	40.34	7.14


Visualization analysis. To observe the effects of the proposed AENet more intuitively, we visualized heat maps of the extracted features. To test the superiority of the TA module, we replaced the TA module in AENet with the classical channel and spatial attention mechanism module CBAM ^[46], and then made a comparison. Figure 10 shows the heat maps of features extracted from a ground image by Baseline, Baseline+CBAM and AENet. The darker the red, the more attention the network pays to that part of the image. Figure 10 demonstrates that the AENet can successfully ignore the transient cars and the sky in the image, and pay more attention to the region relevant to the cross-view image geo-localization task.

Figure 10. Heat map of ground image features.


Figure 11 shows the heat maps of features extracted from a satellite image after a polar transform by Baseline, Baseline+CBAM and AENet. It can be clearly seen that the AENet can filter out the redundant content covered by the satellite image and focus on the content of the region common to the satellite image and the ground image.

Figure 11. Heat map of satellite image features.


Moreover, by comparing Figures 10(c) and 10(d), we can see that, when processing ground images, the combination of the EfficientNetV2 and the CBAM cannot effectively filter out the moving vehicles. By comparing Figures 11(c) and 11(d), we can see that, although the combination of the EfficientNetV2 and the CBAM can filter out the redundant content in the satellite image, it pays less attention than AENet to the region where useful features are located (green area in Figure 11(c) and red area in Figure 11(d)).

5. Conclusions

In this paper, we propose a novel AENet for cross-view image geo-localization, aiming to address the interference of irrelevant features in the feature extraction process. The proposed AENet reduces this interference by making the network focus more on the useful features through attention enhancement. In addition, this paper proposes an MHNW strategy, which can effectively improve the retrieval accuracy. We tested our method on two existing benchmark datasets, and the experimental results show that our method significantly improves the cross-view image geo-localization accuracy. However, one major limitation of the AENet is that it is not applicable when the orientations of the cross-view image pairs are not consistent. Therefore, we intend to extend the AENet to such scenarios in future work.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61872448, U1804263, 62272163), and the Science and Technology Research Project of Henan Province (No. 222102210075), China.

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this paper.

References

[1]	J. H. Ward Jr, Hierarchical grouping to optimize an objective function, J. Am. Stat. Assoc., 58 (1963), 236–244. https://doi.org/10.1080/01621459.1963.10500845 doi: 10.1080/01621459.1963.10500845
[2]	T. Kohonen, Median strings, Pattern Recogn. Lett., 3 (1985), 309–313. https://doi.org/10.1016/0167-8655(85)90061-3 doi: 10.1016/0167-8655(85)90061-3
[3]	V. Hautamäki, P. Nykänen, P. Fränti, Time-series clustering by approximate prototypes, 19th International conference on pattern recognition, (2008), 1–4. IEEE. https://doi.org/10.1109/ICPR.2008.4761105
[4]	P. Fränti, R. Mariescu-Istodor, Averaging gps segments: competition 2019, Pattern Recogn., 112 (2021), 107730. https://doi.org/10.1016/j.patcog.2020.107730 doi: 10.1016/j.patcog.2020.107730
[5]	P. Fränti, S. Sieranoja, K. Wikström, T. Laatikainen, Clustering diagnoses from 58m patient visits in Finland 2015-2018, 2022.
[6]	M. Fatemi, P. Fränti, Clustering nordic twitter users based on their connections, 2023.
[7]	M. I. Malinen, P. Fränti, Clustering by analytic functions, Inform. Sciences, 217 (2012), 31–38. https://doi.org/10.1016/j.ins.2012.06.018 doi: 10.1016/j.ins.2012.06.018
[8]	M. I. Malinen, P. Fränti, Balanced $k$ -means for clustering, in: Joint Int. Workshop on Structural, Syntactic, and Statistical Pattern Recognition (S+SSPR 2014), LNCS 8621, Joensuu, Finland, 2014.
[9]	D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn., 75 (2009), 245–248. https://doi.org/10.1007/s10994-009-5103-0 doi: 10.1007/s10994-009-5103-0
[10]	M. Inaba, N. Katoh, H. Imai, Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based $k$ -Clustering, ACM symposium on computational geometry (SCG 1994), (1994), 332–339. https://doi.org/10.1145/177424.178042 doi: 10.1145/177424.178042
[11]	J. MacQueen, Some methods of classification and analysis of multivariate observations, Berkeley Symp. Mathemat. Statist. Probab., 1 (1967), 281–297.
[12]	W. H. Equitz, A New Vector Quantization Clustering Algorithm, IEEE Trans. Acoust., Speech, Signal Processing, 37 (1989), 1568–1575. https://doi.org/10.1109/29.35395 doi: 10.1109/29.35395
[13]	P. Fränti, O. Virmajoki, V. Hautamäki, Fast agglomerative clustering using a k-nearest neighbor graph, IEEE T. Pattern Anal., 28 (2006), 1875–1881. https://doi.org/10.1109/TPAMI.2006.227 doi: 10.1109/TPAMI.2006.227
[14]	P. Fränti, O. Virmajoki, Iterative shrinking method for clustering problems, Pattern Recogn., 39 (2006), 761–765. https://doi.org/10.1016/j.patcog.2005.09.012 doi: 10.1016/j.patcog.2005.09.012
[15]	P. Fränti, Efficiency of random swap clustering, Journal of Big Data, 5 (2018), 1–29. https://doi.org/10.1186/s40537-018-0122-y doi: 10.1186/s40537-018-0122-y
[16]	B. Fritzke, Breathing k-means, arXiv: 2006.15666.
[17]	C. Baldassi, Recombinator-k-means: an evolutionary algorithm that exploits k-means++ for recombination, IEEE T. Evolut. Comput., 26 (2022), 991–1003.
[18]	A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. B, 39 (1977), 1–38. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x doi: 10.1111/j.2517-6161.1977.tb01600.x
[19]	Q. Zhao, V. Hautamäki, I. Kärkkäinen, P. Fränti, Random swap EM algorithm for finite mixture models in image segmentation, IEEE International Conference on Image Processing (ICIP), (2009), 2397–2400. https://doi.org/10.1109/ICIP.2009.5414459 doi: 10.1109/ICIP.2009.5414459
[20]	J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE T. Pattern Anal., 22 (2000), 888–905. https://doi.org/10.1109/34.868688 doi: 10.1109/34.868688
[21]	C. H. Q. Ding, X. He, H. Zha, M. Gu, H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, IEEE International Conference on Data Mining (ICDM), (2001), 107–114.
[22]	M. I. Malinen, P. Fränti, K-means: Clustering by gradual data transformation, Pattern Recogn., 47 (2014), 3376–3386. https://doi.org/10.1016/j.patcog.2014.03.034 doi: 10.1016/j.patcog.2014.03.034
[23]	R. Nallusamy, K. Duraiswamy, R. Dhanalaksmi, P. Parthiban, Optimization of non-linear multiple traveling salesman problem using k-means clustering, shrink wrap algorithm and meta-heuristics, International Journal of Nonlinear Science, 9 (2010), 171–177.
[24]	R. Mariescu-Istodor, P. Fränti, Solving the large-scale tsp problem in 1 h: Santa claus challenge 2020, Front. Robot. AI, (2021), 1–20. https://doi.org/10.3389/frobt.2021.689908 doi: 10.3389/frobt.2021.689908
[25]	D. W. Sambo, B. O. Yenke, A. Förster, P. Dayang, Optimized clustering algorithms for large wireless sensor networks: A review, Sensors, 19 (2019), 322.
[26]	J. Singh, R. Kumar, A. K. Mishra, Clustering algorithms for wireless sensor networks: A review, International Conference on Computing for Sustainable Global Development (INDIACom), (2015), 637–642.
[27]	Y. Liao, H. Qi, W. Li, Load-Balanced Clustering Algorithm With Distributed Self-Organization for Wireless Sensor Networks, IEEE Sens. J., 13 (2013), 1498–1506. https://doi.org/10.1109/JSEN.2012.2227704 doi: 10.1109/JSEN.2012.2227704
[28]	L. Yao, X. Cui, M. Wang, An energy-balanced clustering routing algorithm for wireless sensor networks, IEEE World Congress on Computer Science and Information Engineering, 3 (2009), 316–320.
[29]	P. S. Bradley, K. P. Bennett, A. Demiriz, Constrained k-means clustering, Tech. rep., MSR-TR-2000-65, Microsoft Research, 2000.
[30]	S. Zhu, D. Wang, T. Li, Data clustering with size constraints, Knowledge-Based Syst., 23 (2010), 883–889. https://doi.org/10.1016/j.knosys.2010.06.003 doi: 10.1016/j.knosys.2010.06.003
[31]	A. Banerjee, J. Ghosh, Frequency sensitive competitive learning for balanced clustering on high-dimensional hyperspheres, IEEE Transactions on Neural Networks, 15 (2004), 702–719. https://doi.org/10.1109/TNN.2004.824416 doi: 10.1109/TNN.2004.824416
[32]	C. T. Althoff, A. Ulges, A. Dengel, Balanced clustering for content-based image browsing, in: GI-Informatiktage 2011, Gesellschaft für Informatik e.V., 2011.
[33]	A. Banerjee, J. Ghosh, On scaling up balanced clustering algorithms, SIAM International Conference on Data Mining, (2002), 333–349. https://doi.org/10.1137/1.9781611972726.20 doi: 10.1137/1.9781611972726.20
[34]	Y. Chen, Y. Zhang, X. Ji, Size regularized cut for data clustering, Advances in Neural Information Processing Systems, 2005.
[35]	Y. Kawahara, K. Nagano, Y. Okamoto, Submodular fractional programming for balanced clustering, Pattern Recogn. Lett., 32 (2011), 235–243. https://doi.org/10.1016/j.patrec.2010.08.008 doi: 10.1016/j.patrec.2010.08.008
[36]	G. Tzortzis, A. Likas, The minmax k-means clustering algorithm, Pattern Recogn., 47 (2014), 2505–2516. https://doi.org/10.1016/j.patcog.2014.01.015 doi: 10.1016/j.patcog.2014.01.015
[37]	W. Tang, Y. Yang, L. Zeng, Y. Zhan, Optimizing mse for clustering with balanced size constraints, Symmetry, 11 (2019), 338. https://doi.org/10.3390/sym11030338 doi: 10.3390/sym11030338
[38]	L. Hagen, A. B. Kahng, New spectral methods for ratio cut partitioning and clustering, IEEE T. Computer-Aided D., 11 (1992), 1074–1085. https://doi.org/10.1109/43.159993 doi: 10.1109/43.159993
[39]	T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to algorithms (2nd ed.), MIT Press and McGraw-Hill, 2001.
[40]	M. X. Goemans, D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, J. ACM, 42 (1995), 1115–1145. https://doi.org/10.1145/227683.227684 doi: 10.1145/227683.227684
[41]	S. Arora, S. Rao, U. Vazirani, Expander flows, geometric embeddings and graph partitioning, J. ACM, 56 (2009), 1–37. https://doi.org/10.1145/1502793.1502794 doi: 10.1145/1502793.1502794
[42]	U. von Luxburg, A tutorial on spectral clustering, Stat. Comput., 17 (2007), 395–416. https://doi.org/10.1007/s11222-007-9033-z doi: 10.1007/s11222-007-9033-z
[43]	M. R. Garey, D. S. Johnson, Computers and intractability: A guide to the theory of NP-completeness, W. H. Freeman, 1979.
[44]	T. D. Bie, N. Cristianini, Fast sdp relaxations of graph cut clustering, transduction, and other combinatorial problems, J. Mach. Learn. Res., 7 (2006), 1409–1436.
[45]	A. Frieze, M. Jerrum, Improved approximation algorithms for max- $k$ -cut and max bisection, Algorithmica, 18 (1997), 67–81. https://doi.org/10.1007/BF02523688 doi: 10.1007/BF02523688
[46]	W. Zhu, C. Guo, A local search approximation algorithm for max- $k$ -cut of graph and hypergraph, International Symposium on Parallel Architectures, Algorithms and Programming, (2011), 236–240. https://doi.org/10.1109/PAAP.2011.35 doi: 10.1109/PAAP.2011.35
[47]	A. V. Kel'manov, A. V. Pyatkin, On the complexity of some quadratic euclidean 2-clustering problems, Comput. Math. Math. Phys., 56 (2016), 491–497. https://doi.org/10.1134/S096554251603009X doi: 10.1134/S096554251603009X
[48]	L. J. Schulman, Clustering for edge-cost minimization, Ann. ACM Symp. on Theory of Computing (STOC), (2000), 547–555. https://doi.org/10.1145/335305.335373 doi: 10.1145/335305.335373
[49]	S. Sahni, T. Gonzalez, P-complete approximation problems, J. ACM, 23 (1976), 555–565. https://doi.org/10.1145/321958.321975 doi: 10.1145/321958.321975
[50]	W. F. de la Vega, M. Karpinski, C. Kenyon, Y. Rabani, Approximation schemes for clustering problems, ACM symposium on Theory of computing (STOC '03), (2003), 50–58. https://doi.org/10.1145/780542.780550 doi: 10.1145/780542.780550
[51]	N. Guttmann-Beck, R. Hassin, Approximation algorithms for min-sum p-clustering, Discrete Appl. Math., 89 (1998), 125–142. https://doi.org/10.1016/S0166-218X(98)00100-0 doi: 10.1016/S0166-218X(98)00100-0
[52]	H. Späth, Cluster analysis algorithms for data reduction and classification of objects, Wiley, New York, 1980.
[53]	P. Fränti, S. Sieranoja, Clustering datasets, University of Eastern Finland, 2020. Available from: http://cs.uef.fi/sipu/datasets/.
[54]	P. Fränti, M. Rezaei, Q. Zhao, Centroid index: Cluster level similarity measure, Pattern Recogn., 47 (2014), 3034–3045. https://doi.org/10.1016/j.patcog.2014.03.017 doi: 10.1016/j.patcog.2014.03.017
[55]	S. Sieranoja, P. Fränti, Fast and general density peaks clustering, Pattern Recogn. Lett., 128 (2019), 551–558. https://doi.org/10.1016/j.patrec.2019.10.019 doi: 10.1016/j.patrec.2019.10.019
[56]	P. Fränti, Genetic algorithm with deterministic crossover for vector quantization, Pattern Recogn. Lett., 21 (2000), 61–68. https://doi.org/10.1016/S0167-8655(99)00133-6 doi: 10.1016/S0167-8655(99)00133-6
[57]	T. Cour, S. Yu, J. Shi, Normalized Cut Segmentation Code, 2004.

This article has been cited by:

Abhilash Durgam, Sidike Paheding, Vikas Dhiman, Vijay Devabhaktuni, Cross-View Geo-Localization: A Survey, 2024, 12, 2169-3536, 192028, 10.1109/ACCESS.2024.3507280

© 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)