
In recent years, Transformer-based object trackers have demonstrated exceptional performance in object tracking. However, traditional methods often employ single-scale pixel-level attention mechanisms to compute the correlation between templates and search regions, disrupting the object's integrity and positional information. To address these issues, we introduce a cyclic-shift mechanism to expand the diversity of sample positions and replace the traditional single-scale pixel-level attention mechanism with a multi-scale window-level attention mechanism. This approach not only preserves the object's integrity but also enriches the diversity of samples. Nevertheless, the introduced cyclic-shift operation heavily burdens storage and computation. To this end, we treat the attention computation of shifted and static windows in the spatial domain as a convolution. By leveraging the convolution theorem, we transform the attention computation of cyclic-shift samples from the spatial domain into element-wise multiplication in the frequency domain, which enhances computational efficiency and reduces data storage requirements. We conducted extensive experiments on the proposed module. The results demonstrate that the proposed module outperforms multiple existing tracking algorithms in tracking performance. Moreover, ablation studies show that the method effectively reduces the storage and computational burden without compromising performance.
Citation: Huanyu Wu, Yingpin Chen, Changhui Wu, Ronghuan Zhang, Kaiwei Chen. A multi-scale cyclic-shift window Transformer object tracker based on fast Fourier transform[J]. Electronic Research Archive, 2025, 33(6): 3638-3672. doi: 10.3934/era.2025162
As an important research area in computer vision [1,2], visual object tracking (VOT) [3] has wide applications in drone tracking [4,5,6], intelligent driving [7,8], object recognition [9,10], military applications [11,12], and other fields. Its primary research involves modeling an object's appearance and other features in the initial frame of a video sequence and then accurately locating it in subsequent frames to complete the tracking task [13]. However, in practical applications, trackers often face various challenges [14], such as deformation, partial occlusion, fast movement, and background interference [15], which lead to cumulative errors and affect tracking performance [16].
Currently, mainstream object trackers include correlation filter-based trackers and deep learning-based trackers [17,18]. By transforming complex spatial-domain correlation operations into frequency-domain element-wise multiplications, correlation filter (CF) trackers [19] achieve high computational efficiency in object tracking. This efficiency has made CF methods widely used in visual object tracking [20]. For example, KCF [21] significantly improved tracking speed and accuracy by utilizing the fast Fourier transform (FFT) [22]. However, in practical applications, the circulant matrices used in CF methods introduce boundary effects, leading to unsatisfactory performance. To address this issue, BACF [23] mitigates boundary effects by expanding the search area and cropping negative samples from the real world. Despite the widespread attention to CF methods, these approaches have limitations, such as filter degradation and insufficient robustness [24]. In recent years, researchers have made many improvements by introducing deep convolutional neural networks to enhance tracking accuracy, such as TEDS [25], DeepABCF [26], and DeepBACF [27]. Nevertheless, most CF methods employ only the initial frame of the object as the sole template during tracking, which limits the tracker's ability to model the object's appearance.
Recently, deep learning-based object trackers have gradually become the dominant framework in the object tracking field [28]. For example, SiamFC [29] applies the Siamese network [30] structure to VOT. These methods extract features from the object and search region via an offline-trained neural network and measure their similarity by a correlation operator. SiamRPN [31] designs a region proposal network (RPN) to achieve precise object localization, enhancing tracking accuracy. Subsequently, SiamRPN++ [32] employs a spatially aware sampling strategy to alleviate translation invariance issues caused by padding operations. Despite their success, Siamese networks are limited by the local linear property of correlation operations, which lack the capacity for modeling long-range information dependencies [33]. Thus, the Siamese networks may lose contextual semantic information, which is significant for robust tracking [34].
To address these limitations [35], scholars have recently introduced Transformer [36,37] architectures, which possess strong capabilities for modeling long-range dependencies, leading to the development of various Transformer-based trackers [38,39,40]. For example, TransT [41] replaces the correlation operations in Siamese networks with self-attention and cross-attention mechanisms, demonstrating the potential of attention mechanisms in VOT [42]. However, TransT relies solely on the first frame's template, which limits its ability to capture real-time appearance changes of the object. STARK [43] overcomes this issue by introducing a dynamic template to exploit spatio-temporal information and turning the tracking task into a bounding box prediction task. However, these methods rely on global self-attention for feature extraction, which has high computational complexity. To address this, the Swin Transformer [44] employs local window self-attention and a shifted window strategy, reducing the computational complexity to a linear level, and has been applied in various VOT tasks, such as SwinTrack [45]. Inspired by the Swin Transformer, CSWinTT [46] employs a novel cross-window attention calculation between the template and search region. It replaces conventional pixel-level attention with multi-scale window attention via cyclic-shift (CS) windows. This approach preserves object integrity and enriches positional information, effectively distinguishing the object from the background. Furthermore, CSWinTT adopts a multi-head, multi-scale attention mechanism to measure similarity between partitioned windows at specific scales, enhancing tracking accuracy. However, the CS strategy for each window introduces a substantial computational burden due to the large number of matrix multiplications, leading to unsatisfactory tracking speed. Therefore, optimizing the computational approach to reduce the calculation and storage resources required by the tracker is of great significance.
Motivated by correlation filter theory, which shows that the correlation between a matrix and a CS matrix can be computed efficiently using the FFT [47], a worthwhile question is whether the FFT can avoid the high storage and computational complexity caused by cyclic-shifted samples in the spatial domain. Inspired by this, we propose a multi-scale cyclic-shift window Transformer object tracker based on the fast Fourier transform (MCWTT). Unlike CSWinTT, which directly performs CS matrix operations in the spatial domain, we introduce the FFT and design a frequency-domain feature fusion block. It converts the calculation from the spatial domain to the frequency domain by regarding the matrix multiplication of CS windows as a correlation operation. Then, according to the convolution theorem, we transform the correlation operation into point-wise multiplication in the frequency domain, alleviating the increased storage and computational costs caused by traditional CS operations. For the object localization task, we borrow the idea of the STARK tracker and decouple classification from regression: a corner estimation network predicts the object's top-left and bottom-right corner coordinates, improving the model's handling of scale changes caused by rapid movements. Finally, to strengthen the model's spatio-temporal information mining, we add a prediction head that assesses the reliability of the best candidate samples and obtains dynamic templates carrying temporal information.
For this paper, our contributions are summarized as follows.
(1) We propose a window-level attention mechanism to replace the traditional pixel-level attention mechanism, thereby avoiding the potential disruption of the object's integrity and positional information caused by pixel-level attention. Moreover, combined with the CS operation, this mechanism promotes information exchange across windows, enriches training sample diversity, and improves the model's adaptability in complex scenarios.
(2) We design a novel attention calculation strategy that first treats the attention calculation in the spatial domain as a matrix correlation operation. We then leverage the advantages of the fast Fourier transform (FFT) to convert these operations into frequency-domain element-wise multiplications. This approach significantly improves computational efficiency and reduces storage costs, providing a more efficient solution for real-time object-tracking tasks.
(3) Extensive experimental analysis of the proposed MCWTT module was conducted on multiple datasets, including the LaSOT [48], OTB100 [49], and UAV123 [50] datasets, with detailed comparisons against various existing mainstream trackers. The experimental results demonstrate that the MCWTT tracker outperforms most existing methods across multiple evaluation metrics, showing significant advantages in handling scale variations and fast-moving objects. Furthermore, ablation studies indicate that the proposed multi-scale shifted window attention mechanism plays an important role in improving the tracker's performance and demonstrate the efficiency of the frequency-domain attention mechanism.
The rest of this paper is organized as follows. Section 2 introduces the preliminary knowledge. Section 3 explains the design and implementation of the proposed module. Section 4 presents experimental results comparing the proposed module with other methods on multiple datasets, as well as ablation studies. Finally, Section 5 summarizes the main findings and outlines future work.
Extracting the object's features and providing accurate prediction information in the subsequent tracking process is a key objective in object tracking tasks [51]. The circulant matrix structure expands the object samples, allowing the tracker to further exploit spatio-temporal information and deep features. However, the large amount of sample data generated by circulant matrix operations is redundant. Operating on these data directly in the spatial domain results in a significant computational burden and high computational complexity. Therefore, we leverage the connection between convolution and circulant matrices and transform the operations into element-wise frequency-domain multiplications via the convolution theorem, avoiding large matrix multiplications and inversion operations.
Two-dimensional image correlation is a special form of convolution. It can be carried out through element-wise multiplication in the frequency domain followed by the inverse Fourier transform, as shown in Eq (2.1) below:

$$I_1 \star I_2 = \mathrm{Conv2}(I_1, \bar{I}_2) = \mathrm{real}\{\mathrm{ifft2}(\hat{I}_1 \odot \hat{I}_2^{*})\}, \tag{2.1}$$

where the symbol $\star$ represents the two-dimensional correlation operation and $\mathrm{Conv2}$ denotes the two-dimensional convolution operation. $I_1, I_2 \in \mathbb{R}^{\sqrt{N} \times \sqrt{N}}$, where $\sqrt{N}$ is an integer, and $\bar{I}_2$ represents the two-dimensional reversed signal of $I_2$. The reversed signal of a two-dimensional signal is computed as follows: first, perform row-wise reversing on the matrix to obtain an intermediate matrix, and then perform column-wise reversing on this intermediate matrix. $\hat{I}_1$ and $\hat{I}_2$ represent the Fourier transforms of $I_1$ and $I_2$, and $\hat{I}_2^{*}$ represents the conjugate of $\hat{I}_2$. The symbol $\odot$ indicates element-wise multiplication, $\mathrm{ifft2}$ represents the two-dimensional inverse Fourier transform, and $\mathrm{real}$ denotes taking the real part.
When Eq (2.1) is computed in the spatial domain, the memory space occupied is $N^2 + N$, and the computational complexity is $\mathcal{O}(N^2)$. If the convolution theorem is used to transform the spatial-domain correlation into frequency-domain element-wise multiplication, the module's memory requirement drops to $4N$, and the computational complexity becomes $\mathcal{O}(8N\log_2 N + 4N)$.
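The equivalence in Eq (2.1) can be checked numerically. Below is a minimal NumPy sketch (not from the paper; the array names and sizes are illustrative) showing that the explicit spatial-domain circular correlation coincides with a single element-wise product of FFTs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                  # sqrt(N) in the notation above
I1 = rng.standard_normal((n, n))
I2 = rng.standard_normal((n, n))

# Spatial domain: explicit circular correlation, one full sum per output element.
corr_spatial = np.zeros((n, n))
for dy in range(n):
    for dx in range(n):
        corr_spatial[dy, dx] = np.sum(np.roll(I1, (-dy, -dx), axis=(0, 1)) * I2)

# Frequency domain: one element-wise product of FFTs, as in Eq (2.1).
corr_freq = np.real(np.fft.ifft2(np.fft.fft2(I1) * np.conj(np.fft.fft2(I2))))

print(np.allclose(corr_spatial, corr_freq))   # True
```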
The attention mechanism is an essential component of the Transformer model architecture, helping the model extract and fuse features from images and adjust the tracking region to enable the tracker to more effectively distinguish between foreground and background interference and more accurately track the object.
Assume the input sequence $X \in \mathbb{R}^{N \times C}$, where $N$ denotes the number of tokens and $C$ denotes the length of each token. Let $Q = XW_Q$, $K = XW_K$, and $V = XW_V$, with $Q, K, V \in \mathbb{R}^{N \times D}$, where $W_Q, W_K, W_V \in \mathbb{R}^{C \times D}$ are linear transformation matrices. Given the query feature vector $Q_i = \mathrm{Split}(Q) \in \mathbb{R}^{N \times d_i}$, the key feature vector $K_i = \mathrm{Split}(K) \in \mathbb{R}^{N \times d_i}$, and the value feature vector $V_i = \mathrm{Split}(V) \in \mathbb{R}^{N \times d_i}$ for each head, the attention is determined by the self-attention score matrix, which decides the degree of attention each part of the input sequence should receive. As shown in Eq (2.2), the attention matrix is written as follows:

$$A = \frac{Q_i K_i^{T}}{\sqrt{d_i}}, \tag{2.2}$$

where $d_i$ denotes the dimension of each token in the head.
The Softmax function converts the normalized attention score matrix into probabilities, which are then applied to the value matrix to produce the final attention output. The result of the self-attention computation can be defined by the following equation:
$$\mathrm{Attention}(Q_i,\ K_i,\ V_i) = \mathrm{Softmax}(A)\,V_i, \tag{2.3}$$

where $i$ denotes the index of the attention head.
Using multi-head attention enhances the performance of the attention mechanism. The multi-head attention calculation is defined as follows:
$$\mathrm{Multihead}(Q,\ K,\ V) = \mathrm{Concat}(H_1,\ H_2,\ \cdots,\ H_I)\,W_O, \quad \text{s.t.}\ H_i = \mathrm{Attention}(Q_i,\ K_i,\ V_i), \tag{2.4}$$

where $I$ denotes the number of attention heads, $W_O \in \mathbb{R}^{D \times D}$ is a linear transformation matrix, and $i \in [1, I]$.
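As a concrete illustration of Eqs (2.2)–(2.4), the following hedged PyTorch sketch implements scaled dot-product attention with $I$ heads. The class and parameter names are our own and are not taken from the paper; shapes follow the notation above ($N$ tokens, input width $C$, model width $D$).

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, C, D, num_heads):
        super().__init__()
        assert D % num_heads == 0
        self.h, self.d = num_heads, D // num_heads      # I heads, d_i channels per head
        self.W_Q = nn.Linear(C, D, bias=False)
        self.W_K = nn.Linear(C, D, bias=False)
        self.W_V = nn.Linear(C, D, bias=False)
        self.W_O = nn.Linear(D, D, bias=False)

    def forward(self, X):                               # X: (N, C)
        N = X.shape[0]
        # Project and split Q, K, V into heads: (h, N, d)
        Q = self.W_Q(X).view(N, self.h, self.d).transpose(0, 1)
        K = self.W_K(X).view(N, self.h, self.d).transpose(0, 1)
        V = self.W_V(X).view(N, self.h, self.d).transpose(0, 1)
        A = Q @ K.transpose(-2, -1) / self.d ** 0.5     # Eq (2.2), per head
        H = torch.softmax(A, dim=-1) @ V                # Eq (2.3)
        H = H.transpose(0, 1).reshape(N, -1)            # concatenate heads
        return self.W_O(H)                              # Eq (2.4)

out = MultiHeadAttention(C=64, D=256, num_heads=8)(torch.randn(100, 64))
```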
In this section, we introduce the proposed multi-scale cyclic-shift window Transformer object tracker (MCWTT). The main framework of the module consists of four parts: the backbone network, the feature fusion block (FFB), the feature enhancement block (FEB), and the prediction analysis module. The architecture of the module is illustrated in Figure 1.
The backbone network of the tracker is based on ResNet-50 [52], which extracts features from the search region image $S$, the initial static template image $T_s$, and the dynamic template image in the initial frame $T_d$. These inputs are fed into the backbone network to obtain the corresponding feature maps, namely the search region feature map $F_S \in \mathbb{R}^{H_S \times W_S \times D}$, the initial static template feature map $F_{TS} \in \mathbb{R}^{H_{TS} \times W_{TS} \times D}$, and the dynamic template feature map $F_{TD} \in \mathbb{R}^{H_{TD} \times W_{TD} \times D}$. These feature maps are then flattened and concatenated along the channel dimension to form a joint feature map $X$, as shown in the following equation:
$$X = \mathrm{Concat}(\mathrm{Flatten}(F_S),\ \mathrm{Flatten}(F_{TS}),\ \mathrm{Flatten}(F_{TD})), \tag{3.1}$$
where $\mathrm{Concat}$ denotes the concatenation operator and $\mathrm{Flatten}$ denotes the flattening operator. Specifically, the feature maps are reshaped using windows of size $r \times r$ and then concatenated, resulting in $L = \frac{H_S}{r} \times \frac{W_S}{r} + \frac{H_{TS}}{r} \times \frac{W_{TS}}{r} + \frac{H_{TD}}{r} \times \frac{W_{TD}}{r}$ tokens.
Positional encoding PE is added to the joint feature map X to incorporate spatial information, forming the feature fusion block input Z0, as shown in the following equation:
$$Z_0 = X + PE. \tag{3.2}$$
The positional encoding PE is defined as follows:
$$PE(n,\ i) = \begin{cases} \sin\left(\frac{n}{10000^{i/d_m}}\right), & \text{if } i = 2k, \\ \cos\left(\frac{n}{10000^{i/d_m}}\right), & \text{if } i = 2k+1, \end{cases} \tag{3.3}$$

where $n$ represents the position of the token, $i$ represents the index of the element within the token, and $d_m$ is the dimension of the positional encoding vector, with a size of $r^2 D$, so that $PE \in \mathbb{R}^{L \times (r^2 D)}$.
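For reference, a minimal NumPy sketch of Eq (3.3) is given below, taking the exponent $i/d_m$ literally as written in that equation; the function and argument names are ours.

```python
import numpy as np

def positional_encoding(L, d_m):
    """Return PE of shape (L, d_m): sine on even indices, cosine on odd indices."""
    n = np.arange(L)[:, None]                  # token positions
    i = np.arange(d_m)[None, :]                # element index within a token
    angle = n / np.power(10000.0, i / d_m)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

PE = positional_encoding(L=576, d_m=1024)      # example sizes for L and r^2 * D
```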
In this section, we introduce the feature fusion block (FFB). As mentioned in the preliminaries, transferring spatial-domain computation to the frequency-domain enhances computational efficiency and reduces memory usage. The FFB has two key parts: the spatial-domain feature fusion module (SFM) and the frequency-domain feature fusion module (FFM). These two modules integrate spatial and frequency domain features, improving the model's spatial context awareness. The following sections provide detailed explanations of these two modules.
The FFB input Zj is divided into eight feature heads, as shown in Eq (3.4)
$$[X_1,\ \cdots,\ X_8] = \mathrm{Split}(\mathrm{Reshape}(Z_j)), \tag{3.4}$$

where $\mathrm{Split}$ denotes the splitting operator, and the size of each head after splitting is $X_i \in \mathbb{R}^{L \times r \times r \times d_i}$, $i = 1, 2, \cdots, 8$.
Each feature head undergoes linear transformations to generate the query, key, and value feature tensors, i.e., Q, K and V; the linear transformation process is given by Eq (3.5)
$$\begin{cases} Q = \mathrm{LP}_1(X_i), \\ K = \mathrm{LP}_2(X_i), \\ V = \mathrm{LP}_3(X_i), \end{cases} \tag{3.5}$$

where $\mathrm{LP}$ denotes a linear mapping transformation. The sizes of the feature maps are $Q, K, V \in \mathbb{R}^{L \times r \times r \times d_i}$. The feature tensor $Q$ is reshaped into a query matrix $Q$, with a size of $Q \in \mathbb{R}^{L \times (r^2 d_i)}$.
The key and value feature tensors K, V are subjected to CS operations to generate the shifted key tensor KC and the shifted value tensor VC, respectively, as shown in Eq (3.6)
$$\begin{cases} K_C = \mathrm{CS}(K), \\ V_C = \mathrm{CS}(V), \end{cases} \tag{3.6}$$

where $\mathrm{CS}$ performs a cyclic shift along the second and third dimensions of the input tensor. The sizes of the shifted key tensor $K_C$ and the shifted value tensor $V_C$ are $K_C, V_C \in \mathbb{R}^{L \times r^2 \times r \times r \times d_i}$.
The shifted key tensor and shifted value tensor are then reshaped to obtain the shifted key matrix KC and the shifted value matrix VC, respectively
$$\begin{cases} K_C = \mathrm{Reshape}(K_C), \\ V_C = \mathrm{Reshape}(V_C), \end{cases} \tag{3.7}$$

where $K_C \in \mathbb{R}^{(L r^2) \times (r^2 d_i)}$ and $V_C \in \mathbb{R}^{(L r^2) \times (r^2 d_i)}$.
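The CS step of Eqs (3.6) and (3.7) can be sketched in PyTorch as below. This is the storage-heavy spatial-domain form (one copy of each window per shift) that the frequency-domain module later avoids; the function name and example sizes are illustrative, not from the paper, and the row ordering of the final reshape is not meant to reproduce the paper's exact index layout.

```python
import torch

def cyclic_shift(K):
    """K: (L, r, r, d) -> K_C: (L, r*r, r, r, d), one copy of each window per 2-D shift."""
    L, r, _, d = K.shape
    shifted = [torch.roll(K, shifts=(dy, dx), dims=(1, 2))
               for dy in range(r) for dx in range(r)]
    return torch.stack(shifted, dim=1)

K = torch.randn(576, 4, 4, 32)               # L tokens, r = 4 window, d_i = 32
K_C = cyclic_shift(K)                        # (576, 16, 4, 4, 32)
K_C_mat = K_C.reshape(-1, 4 * 4 * 32)        # Eq (3.7)-style matrix: (L*r^2, r^2*d_i)
```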
After reshaping, the shifted key matrix $K_C$ is transposed to obtain the transposed key matrix $K_C^{T} \in \mathbb{R}^{(r^2 d_i) \times (L r^2)}$. The transposed key matrix $K_C^{T}$ is then scaled by the factor $\sqrt{r^2 d_i}$ to facilitate the normalization of the Softmax function and avoid gradient vanishing or explosion. The query matrix $Q$ and the scaled key matrix $K_{CN} = \frac{K_C^{T}}{\sqrt{r^2 d_i}}$ are multiplied to obtain the feature attention score matrix $A$, as shown in Eq (3.8)
$$A = Q K_{CN}, \tag{3.8}$$

where $A \in \mathbb{R}^{L \times (L r^2)}$.
During the CS process, the concatenation of different regions does not guarantee the continuity of the foreground target, leading to discontinuities at the boundaries of the concatenated images, known as boundary effects. To mitigate these effects caused by CS, an attention mask matrix M is introduced to penalize shifted samples that are far from the base sample. The feature attention score matrix is added to the attention mask matrix to obtain the final feature attention matrix AM, as shown in Eq (3.9)
$$\begin{cases} A_M = A + M, \\ M = \mathbf{1}_{L \times L} \otimes (\mathrm{vec}(C))^{T}, \\ C = y^{T} \times y - \mathbf{1}_{r \times r}, \end{cases} \tag{3.9}$$

where $\mathbf{1}_{L \times L}$ is a matrix of size $L \times L$ with all elements equal to 1, and likewise $\mathbf{1}_{r \times r}$ is a matrix of size $r \times r$ with all elements equal to 1. The symbol $\otimes$ denotes the Kronecker product, with $y \in \mathbb{R}^{1 \times r}$, $y_i = \begin{cases} x_i, & x_i > 0.5, \\ 1 - x_i, & \text{otherwise}, \end{cases}$ $i = 1, 2, \cdots, r$, $x \in \mathbb{R}^{1 \times r}$, and $x_i = \frac{i-1}{r}$.
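A literal NumPy sketch of the mask construction in Eq (3.9) is shown below; it only reproduces the stated formula, and the sizes are example values.

```python
import numpy as np

def attention_mask(L, r):
    """Build M in Eq (3.9): a non-positive penalty tiled over all L x L token pairs."""
    i = np.arange(1, r + 1)
    x = (i - 1) / r                                   # x_i = (i - 1) / r
    y = np.where(x > 0.5, x, 1.0 - x)                 # favour shifts near the base sample
    C = np.outer(y, y) - np.ones((r, r))              # C = y^T y - 1_{r x r}, entries <= 0
    return np.kron(np.ones((L, L)), C.reshape(1, -1))  # M: (L, L * r^2)

M = attention_mask(L=576, r=4)                        # added to the scores: A_M = A + M
```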
The Softmax function transforms raw attention scores into a probability distribution. The normalized attention matrix is then multiplied by the shifted value matrix VC to generate the single-head attention output Ti
$$T_i = \mathrm{Softmax}(A_M) \times V_C, \tag{3.10}$$

where $T_i \in \mathbb{R}^{L \times (r^2 d_i)}$, $i = 1, 2, \cdots, 8$.
The single-head attention matrix is passed through a feed-forward neural network (FFN) to enhance the features. Specifically, the multi-head attention matrix O is formed by concatenating all single-head outputs, as shown in Eq (3.11)
$$O = \mathrm{Concat}(\mathrm{LP}_1(T_1),\ \cdots,\ \mathrm{LP}_8(T_8)), \tag{3.11}$$

where $O \in \mathbb{R}^{(L r^2) \times D}$.
The multi-head attention matrix O is added to the FFB input Zj and then normalized using layer normalization, as shown in Eq (3.12)
$$T_{\mathrm{temp}} = \mathrm{LN}(\mathrm{LP}(O) + Z_j). \tag{3.12}$$
Ttemp is passed through the FFN for non-linear transformation, and the result is added back to the original Ttemp matrix to achieve residual connection and normalization. This process aims to preserve as much of the underlying structural information of the input data as possible. By integrating the deep semantic features with the original spatial details in an organic manner, a composite representation that combines high fidelity and strong discriminability is ultimately generated, as shown in Eq (3.13)
$$Z_{j+1} = \mathrm{LN}(\mathrm{FFN}(T_{\mathrm{temp}}) + T_{\mathrm{temp}}), \tag{3.13}$$

where the FFN consists of two linear mapping layers and a non-linear rectified linear unit (ReLU) activation layer, as shown in Eq (3.14)

$$\mathrm{FFN}(X) = (\mathrm{ReLU}(X L_1 + B_1)) L_2 + B_2, \tag{3.14}$$

where $L_1$ and $L_2$ are the weight matrices for linear projection, and $B_1$ and $B_2$ are the bias matrices.
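A compact PyTorch sketch of Eqs (3.13) and (3.14) follows. The hidden width is an assumption (the paper fixes only the two-layer structure), and the module name is ours.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    def __init__(self, D, hidden=None):
        super().__init__()
        hidden = hidden or 4 * D              # hidden width: an assumption
        self.fc1 = nn.Linear(D, hidden)       # L_1, B_1 in Eq (3.14)
        self.fc2 = nn.Linear(hidden, D)       # L_2, B_2 in Eq (3.14)

    def forward(self, X):
        return self.fc2(torch.relu(self.fc1(X)))

D = 256
ffn, ln = FFN(D), nn.LayerNorm(D)
T_temp = torch.randn(576, D)                  # placeholder for T_temp
Z_next = ln(ffn(T_temp) + T_temp)             # residual connection + LayerNorm, Eq (3.13)
```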
Finally, the FFB feeds the obtained multi-head attention matrices back as its input: each layer's output serves as the input to the next layer. After repeating this process six times, the attention computation results of all six layers are obtained, yielding a more accurate final attention result, as shown in Eq (3.15)

$$Z_{j+1} = \mathrm{FFB}(Z_j), \tag{3.15}$$

where $j = 0, 1, \cdots, 5$.
Given that the data generated by circulant matrices are redundant, direct computation in the spatial domain results in high computational complexity and extensive memory usage. Consequently, transforming circulant matrix operations from spatial domain calculations to frequency domain element-wise multiplications is feasible for reducing the computational overhead. When the window size is 8, the FFM is involved in network training and inference. The structure of the FFM is shown in Figure 2.
Through Fourier transforms, the query feature tensor Q and the key feature tensor K, obtained from spatial domain computations, are converted from the spatial domain to the frequency domain, as shown in Eq (3.16)
$$\begin{cases} \hat{Q} = \mathrm{fft}_{[2,3]}(Q), \\ \hat{K} = \mathrm{fft}_{[2,3]}(K), \end{cases} \tag{3.16}$$

where $Q, K \in \mathbb{R}^{L \times r \times r \times d_i}$, and $\mathrm{fft}_{[2,3]}$ denotes the Fourier transform applied to the second and third dimensions of the feature tensors $Q$ and $K$, resulting in $\hat{Q} \in \mathbb{C}^{L \times r \times r \times d_i}$ and $\hat{K} \in \mathbb{C}^{L \times r \times r \times d_i}$.
The query feature tensor in the frequency domain $\hat{Q}$ is then expanded, with each token being copied $L$ times to obtain the expanded query feature tensors $\hat{Q}_E^{k}$, as shown in the following equation:

$$\begin{cases} \hat{Q}_E^{1} = \mathrm{Repeat}(\hat{Q}(1, :, :, :),\ L), \\ \hat{Q}_E^{2} = \mathrm{Repeat}(\hat{Q}(2, :, :, :),\ L), \\ \quad\ \vdots \\ \hat{Q}_E^{L} = \mathrm{Repeat}(\hat{Q}(L, :, :, :),\ L), \end{cases} \tag{3.17}$$

where $\mathrm{Repeat}$ denotes the repetition of the input tensor $L$ times along the first dimension, with $\hat{Q}_E^{k} \in \mathbb{C}^{L \times r \times r \times d_i}$, $k = 1, 2, \cdots, L$, $\hat{Q}(l, :, :, :) \in \mathbb{C}^{1 \times r \times r \times d_i}$, and $l = 1, 2, \cdots, L$.
The conjugate of the key feature tensor in the frequency domain ˆK∗ is computed, as shown in the following equation:
$$\hat{K}^{*} = \mathrm{Conjugate}(\hat{K}), \tag{3.18}$$
where Conjugate is the conjugation operator that negates the imaginary part of the input tensor.
To simplify the complex spatial-domain computation, the convert block executes element-wise multiplication between the expanded feature tensor $\hat{Q}_E^{k}$ and the conjugate feature tensor $\hat{K}^{*}$ obtained above. The results from the different channels of the resulting tensor are then summed. The summed frequency-domain result is then transformed back to the spatial domain: the inverse Fourier transform is applied, and the real part of the transformed result is taken and stacked as the $k$-th row of the attention weight matrix $A$, as shown in the following equation:
$$A(k, :) = \frac{1}{\sqrt{r^2 d_i}} \mathrm{Concat}\left\{\mathrm{vec}\left[\sum_{c=1}^{d_i} Q(k, :, :, c) \star K(p, :, :, c)\right]^{T}\right\} = \left(\mathrm{vec}\left(\mathrm{real}\left(\mathrm{ifft2}\left(\frac{\sum_{c=1}^{d_i} \hat{Q}_E^{k}(:, :, :, c) \odot \hat{K}^{*}(:, :, :, c)}{\sqrt{r^2 d_i}}\right)\right)\right)\right)^{T}, \tag{3.19}$$

where $\mathrm{vec}$ denotes the vectorization operator for tensors, $\mathrm{real}$ denotes taking the real part of the tensor, $\mathrm{ifft2}$ is the two-dimensional inverse Fourier transform, the symbol $\odot$ represents element-wise multiplication, and the symbol $\star$ denotes the correlation operation between matrices. For better comprehension of the convert block, the subsequent text further explains Eq (3.19).
The query feature map tensor Q and the key feature map tensor K both contain L tokens of size r×r×d. After the CS operation, as shown in Eq (3.6), the shifted key tensor KC is obtained. The query feature map tensor Q is reshaped into the query matrix Q, and the shifted key tensor KC is also reshaped into the shifted key matrix KC. The transpose of the shifted key matrix KTC is then computed and multiplied with the query matrix Q to get the feature attention matrix A, as shown in the following equation:
$$A(k,\ n) = \frac{Q(k, :) \times K_C^{T}(:,\ n)}{\sqrt{r^2 d_i}} = \frac{\left\langle Q(k, :, :, :),\ K_C(p,\ q, :, :, :)\right\rangle}{\sqrt{r^2 d_i}}, \tag{3.20}$$

where $Q(k, :)$ denotes the $k$-th row of $Q$, with a size of $Q(k, :) \in \mathbb{R}^{1 \times (r^2 d_i)}$; $K_C^{T}(:,\ n)$ denotes a column of $K_C^{T}$, with each column having a size of $K_C^{T}(:,\ n) \in \mathbb{R}^{(r^2 d_i) \times 1}$, where $n = (q-1)L + p$; and $A(k,\ n)$ represents the element in the $k$-th row and $n$-th column of $A$. $\langle Q(k, :, :, :),\ K_C(p,\ q, :, :, :)\rangle$ represents the inner product between the query feature map tensor $Q$ and the shifted key tensor $K_C$. $Q(k, :, :, :)$ denotes the $k$-th token of $Q$, and $K_C(p,\ q, :, :, :)$ denotes the $q$-th token in the $p$-th group of $K_C$, where $p = 1, 2, \cdots, L$ and $q = 1, 2, \cdots, r^2$.
As indicated in the preliminaries, matrix multiplication is convertible to tensor correlation operations, as shown in Eq (3.21)
$$[A(k,\ (1-1)L+p),\ A(k,\ (2-1)L+p),\ \cdots,\ A(k,\ (r^2-1)L+p)] = \frac{1}{\sqrt{r^2 d_i}} \mathrm{vec}\left[\sum_{c=1}^{d_i} Q(k, :, :, c) \star K_C(p,\ 1, :, :, c)\right]^{T} = \frac{1}{\sqrt{r^2 d_i}} \mathrm{vec}\left[\sum_{c=1}^{d_i} Q(k, :, :, c) \star K(p, :, :, c)\right]^{T}, \tag{3.21}$$

where $\mathrm{vec}$ denotes the vectorization operator.
The correlation operation is viewed as a convolution operation between tensors, as shown in Eq (3.22)
$$Q(k, :, :, c) \star K_C(p,\ 1, :, :, c) = Q(k, :, :, c) * \overline{K_C}(p,\ 1, :, :, c), \tag{3.22}$$

where $\overline{K_C}(p,\ 1, :, :, c)$ denotes the flipped (reversed) version of $K_C(p,\ 1, :, :, c)$, $p = 1, 2, \cdots, L$, and $q = 1, 2, \cdots, r^2$.
Thus, Eq (3.21) is rewritten as Eq (3.23)
$$[A(k,\ (1-1)L+p),\ A(k,\ (2-1)L+p),\ \cdots,\ A(k,\ (r^2-1)L+p)] = \frac{1}{\sqrt{r^2 d_i}} \mathrm{vec}\left[\sum_{c=1}^{d_i} Q(k, :, :, c) * \overline{K_C}(p,\ 1, :, :, c)\right]^{T}. \tag{3.23}$$
Traversing $p$ and concatenating the results into a new row vector also traverses $n = (q-1)L + p$, as shown in the following equation:

$$A(k, :) = \frac{1}{\sqrt{r^2 d_i}} \mathrm{Concat}\left\{\mathrm{vec}\left[\sum_{c=1}^{d_i} Q(k, :, :, c) * \overline{K_C}(p, :, :, c)\right]^{T}\right\}. \tag{3.24}$$
According to the convolution theorem, the correlation operation in the spatial domain is interchangeable with element-wise multiplication in the frequency domain. Hence, Eq (3.19) is proven. The proposed FFB algorithm in this paper is shown in Algorithm 1.
Algorithm 1: Feature fusion block (FFB)
Input: Input features $Z_0$, the number of heads in the FFB $I$, the vector $r = [1, 2, 4, 8, 1, 2, 4, 8]$ of window sizes for the attention heads, and the number of layers in the FFB $J$.
Output: Output features $Z_6$.
1: Split $Z_0$ into $I$ heads.
2: For $i \leftarrow 0$ to $I-1$
3:   For $j \leftarrow 0$ to $J-1$
4:     If $r_{i+1} = 1$ or $r_{i+1} = 2$ or $r_{i+1} = 4$
5:       Let $Z_j$ be the input to the SFM in Figure 1, and calculate the attention weight matrix $A$ via Eqs (3.6)–(3.8).
6:     Else
7:       Let $Z_j$ be the input to the FFM in Figure 2, and calculate the attention weight matrix $A$ via Eqs (3.16)–(3.19).
8:     End if
9:   End for
10: End for
11: The fused features are further calculated using Eqs (3.11)–(3.14).
12: Return $Z_6$.
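To make the SFM/FFM equivalence used in Algorithm 1 concrete, the following self-contained NumPy sketch checks, for a single query window and a single key window, that the $r^2$ inner products against all cyclic shifts of the key (the spatial-domain route of Eq (3.20)) match one inverse FFT of a channel-summed element-wise product (the frequency-domain route of Eq (3.19)). The sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
r, d = 8, 32                                   # window size 8 is the FFM case
Q = rng.standard_normal((r, r, d))             # one query window
K = rng.standard_normal((r, r, d))             # one key window
scale = np.sqrt(r * r * d)

# Spatial route (SFM-style): materialise every cyclic shift of K, take inner products.
a_spatial = np.empty((r, r))
for dy in range(r):
    for dx in range(r):
        a_spatial[dy, dx] = np.sum(Q * np.roll(K, (dy, dx), axis=(0, 1))) / scale

# Frequency route (FFM-style): channel-wise FFT products, summed, one inverse FFT.
prod = np.sum(np.fft.fft2(Q, axes=(0, 1)) * np.conj(np.fft.fft2(K, axes=(0, 1))), axis=-1)
a_freq = np.real(np.fft.ifft2(prod)) / scale

print(np.allclose(a_spatial, a_freq))          # True
```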
The feature enhancement block (FEB) and FFB have an identical number of stacked layers, with both having six layers, each containing eight attention heads. The main structure of the FEB is made up of multi-head self-attention (MSA) layers, multi-head cross-attention (MCA) layers, and feed-forward networks (FFN). This part draws on the concept from the DETR [53] object detection model, which employs the learnability of queries to steer object detection. Figure 3 displays the signal flow diagram of FEB, where the query embedding QE is a position encoding learned in the model training process, aiding the model in obtaining the spatial location information of the object.
First, the FEB input is initialized with zero vectors, i.e., $d_0 = \mathrm{Query} = 0 \in \mathbb{R}^{1 \times D}$, and added to the query embedding vector $QE$ to form the query matrix $Q_s$ and the key matrix $K_s$, which carry the object's position information. The value matrix $V_s$ is directly provided by the object sequence $\mathrm{Query}$, as shown in the following equation:

$$\begin{cases} [Q_{s1},\ Q_{s2},\ \cdots,\ Q_{s8}] = \mathrm{Split}[Q_s W_{Q_s}], \\ [K_{s1},\ K_{s2},\ \cdots,\ K_{s8}] = \mathrm{Split}[K_s W_{K_s}], \\ [V_{s1},\ V_{s2},\ \cdots,\ V_{s8}] = \mathrm{Split}[V_s W_{V_s}]. \end{cases} \tag{3.25}$$
The obtained Qs, Ks, and Vs are fed into the MSA layer to compute the similarity between Qs and Ks to identify the features most relevant to the object. The output of the MSA layer s is iterated, as shown in the following equation:
$$\begin{cases} z_{si} = \mathrm{Attention}(Q_{si},\ K_{si},\ V_{si}) = \mathrm{Softmax}\left(\frac{Q_{si} K_{si}^{T}}{\sqrt{d_i}}\right) V_{si}, \\ s = \mathrm{MSA}(Q_s,\ K_s,\ V_s) = \mathrm{Concat}(z_{s1},\ z_{s2},\ \cdots,\ z_{s8}) W_s, \end{cases} \tag{3.26}$$

where $W_{Q_s}$, $W_{K_s}$, $W_{V_s}$, and $W_s$ are linear projection weight matrices learned by the network, and $s \in \mathbb{R}^{1 \times D}$.
To avoid the occurrence of gradient explosion or vanishing, the output of the MSA layer s is subjected to residual connection and followed by a normalization layer to obtain the intermediate state output, as shown in the following equation:
$$h_1 = \mathrm{LN}(\mathrm{LP}(s) + d_j). \tag{3.27}$$
The query embedding vector QE is added to the intermediate state output h1, resulting in the query matrix Qc for the MCA layer. The final output of the feature fusion block Z6 is combined with the position encoding vector PE, forming the key matrix for the MCA layer Kc; Z6 also serves as the value matrix in the MCA layer calculation, as illustrated in Eq (3.28)
$$\begin{cases} Q_c = h_1 + QE, \\ K_c = Z_6 + PE, \\ V_c = Z_6. \end{cases} \tag{3.28}$$
Similarly Qc, Kc, and Vc are processed as shown in the following equation:
$$\begin{cases} [Q_{c1},\ Q_{c2},\ \cdots,\ Q_{c8}] = \mathrm{Split}[Q_c W_{Q_c}], \\ [K_{c1},\ K_{c2},\ \cdots,\ K_{c8}] = \mathrm{Split}[K_c W_{K_c}], \\ [V_{c1},\ V_{c2},\ \cdots,\ V_{c8}] = \mathrm{Split}[V_c W_{V_c}]. \end{cases} \tag{3.29}$$
The calculation process of the MCA layer is as follows:
$$\begin{cases} z_{ci} = \mathrm{Attention}(Q_{ci},\ K_{ci},\ V_{ci}) = \mathrm{Softmax}\left(\frac{Q_{ci} K_{ci}^{T}}{\sqrt{d_i}}\right) V_{ci}, \\ c = \mathrm{MCA}(Q_c,\ K_c,\ V_c) = \mathrm{Concat}(z_{c1},\ z_{c2},\ \cdots,\ z_{c8}) W_c, \end{cases} \tag{3.30}$$

where $W_{Q_c}$, $W_{K_c}$, $W_{V_c}$, and $W_c$ are the linear projection weight matrices learned by the network, with $c \in \mathbb{R}^{1 \times D}$.
The output of the MCA layer c is subjected to residual connection and normalization to obtain the intermediate output state h2, as shown in the following equation:
$$h_2 = \mathrm{LN}(\mathrm{LP}(c) + h_1). \tag{3.31}$$
The process described above is repeated six times, constituting the entire FEB workflow. The final output of the FEB is then fed into the prediction analysis module for subsequent tracking and prediction. The overall process is represented by Eq (3.32)
$$d_{j+1} = \mathrm{FEB}(d_j), \tag{3.32}$$

where $j = 0, 1, \cdots, 5$.
The FEB, together with the FFB, forms the encoder-decoder structure of the Transformer, which helps the model build temporal context awareness and obtain richer spatiotemporal contextual information, thereby further enhancing the precision of the model.
The prediction analysis module includes a bounding box prediction head and a classification score head. The enhanced feature d6 is directed into these heads for analysis, thus accomplishing further tracking and prediction.
The bounding box prediction head is inspired by the corner prediction head in the STARK [43] algorithm. It calculates the expected values of the probability distributions of the two probability maps corresponding to the top-left and bottom-right corners of the tracked object's bounding box, thereby obtaining the coordinates of these corners, as illustrated in Figure 4.
First, the final output of the feature fusion block $Z_6$ is cropped to obtain the search region features $Z_S \in \mathbb{R}^{(H_S W_S) \times D}$. The enhanced feature $d_6$ is also used to calculate the similarity modulation vector $m$ between them, as shown in the following equation:

$$m = Z_S d_6^{T}, \tag{3.33}$$

where $m \in \mathbb{R}^{(H_S W_S) \times 1}$.
To better enhance the important tracking regions and weaken the interference of other regions on the tracking, the similarity modulation vector m is combined with the search region features ZS via element-wise multiplication, yielding the enhanced search region features ZE, as shown in the following equation:
$$Z_E = Z_S \odot \mathrm{Repeat}(m,\ D), \tag{3.34}$$

where $\mathrm{Repeat}(m,\ D) \in \mathbb{R}^{(H_S W_S) \times D}$ indicates that the column vector $m$ is copied $D$ times and the copies are stacked side by side to form a matrix.
The obtained enhanced search region features ZE are reshaped into an enhanced feature map ε∈RHS×WS×D, which undergoes processing through a fully convolutional network (FCN), resulting in two probability distribution maps for the bounding box's top-left and bottom-right corners, namely P1(x, y) and P2(x, y), as shown in the following equation:
$$\begin{cases} P_1(x,\ y) = \mathrm{FCN}_1(\varepsilon), \\ P_2(x,\ y) = \mathrm{FCN}_2(\varepsilon), \end{cases} \tag{3.35}$$

where the FCN consists of five convolutional layers with batch normalization and ReLU activation layers (Conv-BN-ReLU). $\mathrm{FCN}_i$ represents the processing of the data by one of two FCN branches with different parameters.
The coordinates of the bounding box's two corner points are determined from the two probability maps $P_1(x,\ y)$ and $P_2(x,\ y)$ through the following equation. After these coordinates are obtained, the candidate sample $C_p \in \mathbb{R}^{H_C \times W_C \times 3}$ corresponding to the final predicted bounding box position is determined.

$$\begin{cases} (\hat{x}_1,\ \hat{y}_1) = \left(\sum_{y=0}^{H}\sum_{x=0}^{W} x P_1(x,\ y),\ \sum_{y=0}^{H}\sum_{x=0}^{W} y P_1(x,\ y)\right), \\ (\hat{x}_2,\ \hat{y}_2) = \left(\sum_{y=0}^{H}\sum_{x=0}^{W} x P_2(x,\ y),\ \sum_{y=0}^{H}\sum_{x=0}^{W} y P_2(x,\ y)\right), \end{cases} \tag{3.36}$$

where $(\hat{x}_1,\ \hat{y}_1)$ represents the coordinates of the top-left corner and, similarly, $(\hat{x}_2,\ \hat{y}_2)$ represents the coordinates of the bottom-right corner. If the prediction score of the candidate sample exceeds the set threshold, the template area is expanded to twice its original size, centered on the sample, to form the dynamic template and update it.
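A hedged PyTorch sketch of the corner expectation in Eq (3.36) is given below. The softmax normalization that turns the FCN output into a probability map, and the map size, are assumptions for illustration; the function and variable names are ours.

```python
import torch

def soft_corner(prob_map):
    """Expected (x, y) under a 2-D probability map, as in Eq (3.36)."""
    H, W = prob_map.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32))
    return (prob_map * xs).sum(), (prob_map * ys).sum()

H = W = 96                                         # assumed probability-map size
logits_tl = torch.randn(H, W)                      # stand-in for the FCN_1 output
P1 = torch.softmax(logits_tl.flatten(), dim=0).view(H, W)   # assumed normalisation
x1_hat, y1_hat = soft_corner(P1)                   # top-left corner; repeat with P2 for bottom-right
```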
During the object tracking process, the object may undergo varying changes as the tracking time progresses. Therefore, real-time capture of its latest state is crucial for accurate tracking. As shown in Figure 1, we propose a dynamic template update mechanism sampled from intermediate frames. It serves as an additional input to capture the object's changes in appearance over time and offers more temporal details. However, when the object is entirely occluded or moves out of view, the update of the dynamic template may become unreliable. Therefore, we designed a simple candidate sample scoring head to assess whether the current sample needs to be updated.
First, the enhanced feature d6 is processed through a multilayer perceptron to improve the model's learning ability. Then the output results are activated by a function to derive the score, as shown in the following equation:
$$s = \mathrm{Sigmoid}(\mathrm{MLP}(d_6)), \tag{3.37}$$

where MLP is a three-layer perceptron with ReLU activation functions in all hidden layers. The output layer employs the function $\mathrm{Sigmoid}(x) = \frac{1}{1+e^{-x}}$ as the activation function to map the output values to the range 0 to 1. As shown in Figure 1, when the score $s$ exceeds the set threshold, the model considers the candidate sample reliable and updates the dynamic template; otherwise, it does not update. Through this mechanism, the model can more effectively judge the reliability of candidate samples in complex environments, thereby improving the stability and accuracy of tracking.
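A minimal PyTorch sketch of the scoring head in Eq (3.37) follows; the hidden width and the 0.5 update threshold are assumptions, since the paper does not state them here.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    def __init__(self, D=256, hidden=256):
        super().__init__()
        # Three-layer perceptron with ReLU hidden activations, sigmoid output.
        self.mlp = nn.Sequential(
            nn.Linear(D, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, d6):
        return torch.sigmoid(self.mlp(d6))        # score s in (0, 1)

s = ScoreHead()(torch.randn(1, 256))              # d_6 placeholder of width D = 256
update_template = s.item() > 0.5                  # assumed threshold for updating
```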
We draw on the loss function design concept in the DETR architecture, adopting an end-to-end training method and using the generalized intersection over union (GIOU) loss to optimize the predicted bounding boxes. By directly optimizing the intersection over union (IOU) between the predicted boxes and the ground-truth boxes, the accuracy and robustness of the tracking are enhanced. In addition, L1 loss is introduced to optimize the position and size of the bounding box, enabling the model to predict the object's position more accurately. The loss function is shown in the following equation:
$$L = \sum_{i=1}^{B} \left[\lambda_{GIOU} L_{GIOU}(b_i,\ g_i) + \lambda_{L1} L_1(b_i,\ g_i)\right], \tag{3.38}$$

where $b_i$ is the vector composed of the top-left x-coordinate, the top-left y-coordinate, and the width and height of the $i$-th predicted bounding box, and $g_i$ is the corresponding vector for the ground-truth box of the $i$-th training sample. $\lambda_{GIOU}$ and $\lambda_{L1}$ are non-negative hyperparameters, and $B$ denotes the batch size of the training samples.
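The sketch below illustrates Eq (3.38) in PyTorch with a hand-written GIoU term. The conversion from (x, y, w, h) to corner form and the weight values ($\lambda_{GIOU}=2$, $\lambda_{L1}=5$, the common DETR/STARK choice) are assumptions, not values taken from this paper.

```python
import torch

def giou(b, g):
    """Pairwise GIoU for matched boxes b, g of shape (B, 4) in (x1, y1, x2, y2) form."""
    inter_wh = (torch.min(b[:, 2:], g[:, 2:]) - torch.max(b[:, :2], g[:, :2])).clamp(min=0)
    inter = inter_wh[:, 0] * inter_wh[:, 1]
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    area_g = (g[:, 2] - g[:, 0]) * (g[:, 3] - g[:, 1])
    union = area_b + area_g - inter
    hull_wh = (torch.max(b[:, 2:], g[:, 2:]) - torch.min(b[:, :2], g[:, :2])).clamp(min=0)
    hull = hull_wh[:, 0] * hull_wh[:, 1]              # smallest enclosing box area
    return inter / union - (hull - union) / hull

def box_loss(b_xywh, g_xywh, w_giou=2.0, w_l1=5.0):
    """Eq (3.38): GIoU term on corner form plus L1 on the raw (x, y, w, h) vectors."""
    to_xyxy = lambda t: torch.cat([t[:, :2], t[:, :2] + t[:, 2:]], dim=-1)
    giou_term = 1.0 - giou(to_xyxy(b_xywh), to_xyxy(g_xywh))
    l1_term = torch.abs(b_xywh - g_xywh).sum(dim=-1)
    return (w_giou * giou_term + w_l1 * l1_term).sum()

loss = box_loss(torch.rand(8, 4), torch.rand(8, 4))   # B = 8 example boxes
```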
In some previous studies, researchers believed that joint learning of the localization and classification tasks may lead to suboptimal results. Thus, decoupling these two tasks is necessary. The training process is divided into two independent stages for the localization and classification tasks, which are optimized to achieve the best solution. In the first stage, the whole network, except for the candidate sample scoring head, is trained end-to-end using a localization-related loss function. This stage aims to ensure the object's inclusion in all search regions, thus improving the model's localization ability. The second stage focuses on optimizing the scoring head, and its loss function is defined as shown in the following equation:
$$L_{ce} = -\sum_{i=1}^{B} \left(l_i \log(s_i) + (1 - l_i)\log(1 - s_i)\right), \tag{3.39}$$
where li is the binary label of the i-th training sample, and si is the predicted probability of the i-th sample being the object. This two-stage training strategy enables the model to learn the key features of localization and classification separately, thus attaining high tracking precision and robustness. It enhances the model's object recognition and tracking accuracy in different settings and readies it for seamless integration into practical use.
In this section, we introduce the comprehensive evaluation of the MCWTT tracking algorithm on three benchmark datasets: LaSOT [48], OTB100 [49], and UAV123 [50]. The hardware environment for the experiments was a server equipped with an Intel Core i9-12900K CPU and an Nvidia RTX 3090 24 GB GPU, with 128 GB of RAM. The software environment was based on Python 3.7 and PyTorch 1.8.2. During training, the model's basic training unit consisted of two templates and one search image. The input templates were 128 × 128 pixels, about twice the area of the target box, while the search region was 384 × 384 pixels, about five times the area of the target box. The backbone network was initialized with parameters from a pre-trained ResNet-50 network. The Transformer architecture comprised six encoder-decoder layers, each containing MSA layers, MCA layers, and FFNs, with eight heads per layer, 32 channels per head, and a dropout rate of 0.1 to prevent overfitting. During model training, the weighted loss function (GIOU and L1 losses) was minimized using the Adam [54] optimizer, with initial learning rates of $10^{-5}$ and $10^{-4}$ for the backbone network and the other network components, respectively.
LaSOT [48] is a large-scale tracking benchmark dataset with high-quality annotations and almost all real-world challenges, including fast motion, scale variation, and camera movement. It has 280 video sequences, with each frame labelled with high-quality bounding boxes. Tracker performance is evaluated using the area under the curve (AUC), the normalized precision (PNorm), and precision (P) scores. AUC shows the success rate across IoU thresholds, PNorm measures precision under the normalized distance, and P directly assesses the accuracy of the tracking results. As shown in Table 1, the MCWTT algorithm achieved an AUC score of 65.6%, a PNorm score of 74.5%, and a P score of 70%, surpassing many advanced trackers. Though outperformed by STARK [43] and MixFormer [55], it remains competitive.
Method | Year | LaSOT AUC(%) | LaSOT PNorm(%) | LaSOT P(%) | OTB100 SR(%) | OTB100 PR(%) | UAV123 AUC(%) | UAV123 P(%)
SiamRPN++ [32] | 2019 | 49.6 | 56.9 | 49.1 | – | – | 64.2 | 84.0 |
TransT [41] | 2021 | 64.2 | 73.5 | 68.2 | – | – | 66.0 | 85.2 |
STARK [43] | 2021 | 65.8 | 75.2 | 69.8 | – | – | 68.4 | 89.0 |
DSTrpn [56] | 2021 | 43.4 | 54.4 | – | 64.6 | 85.7 | – | – |
CLNet*–BAN [57] | 2022 | 52.9 | 62.3 | 52.6 | – | – | – | – |
TCTrack++ [58] | 2022 | 43.5 | 48.4 | 41.4 | 54.3 | 72.0 | 51.9 | 73.1 |
RTSFormer [34] | 2024 | 62.3 | 65.6 | 65.5 | – | – | 67.5 | – |
AGST–BR [59] | 2024 | 56.7 | – | 58.3 | – | – | 66.3 | – |
SiamRPN++–ACM [60] | 2024 | 52.3 | – | – | 71.2 | – | – | – |
MixFormer [55] | 2024 | 69.6 | 79.9 | 75.9 | 71.6 | 94.4 | 68.7 | 89.5 |
MCWTT | Ours | 65.6 | 74.5 | 70.0 | 66.7 | 86.6 | 68.7 | 89.2 |
The OTB100 [49] dataset is a widely utilized benchmark for evaluating object-tracking algorithms. Comprising 100 standardized video sequences with diverse challenging scenarios and detailed annotations, it assesses the robustness and adaptability of tracking algorithms. The evaluation follows the one-pass evaluation (OPE) protocol, tracking the entire sequence without restarts. This setup tests the tracker's performance in handling challenges like scale changes, occlusions, deformations, and motion blur. The tracking algorithms are ranked using success plots and precision plots, with success plots evaluated by the AUC, and precision plots assessed by the center position error (CPE).
Figure 5 shows a comparison of the proposed module with nine other trackers on the OTB100 dataset, including BACF [23], A3DCF [61], SRDCF [62], CSR-DCF [19], RBSCF [6], CFNet [63], DiMP [64], SiamFC [29], and SiamRPN [31]. As shown in Figure 5(a), (b), the MCWTT leads with a success rate of 66.7% and a precision of 86.6%, outperforming all the compared tracking methods while maintaining real-time capability. This validates the significant performance advantage of the proposed module in balancing real-time and accuracy requirements.
Figures 6 and 7 compare the proposed module with nine other trackers across 11 visual challenge attributes of the OTB100 dataset, namely background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out of view (OV), and scale variation (SV). Figure 6 shows the proposed module performs exceptionally well in multiple attributes, achieving success rates of 68.8% in the FM scenario and 63.4% in the OV scenario. Figure 7 shows the precision performance of the proposed module and other tracking algorithms under 11 different visual attribute challenges. The proposed module also performs exceptionally well in various attribute challenges, with 89.4% in the MB scenario and 89.2% in the SV scenario, significantly outperforming other competing algorithms and fully demonstrating the advantage of the proposed module in handling high-speed dynamic objects.
The UAV123 [50] dataset contains 123 low-altitude unmanned aerial vehicle (UAV) video sequences, totaling over 110,000 frames, encompassing a variety of background environments ranging from urban landscapes to natural scenery. It is primarily used to evaluate the performance of different tracking algorithms. The dataset is divided into three subsets, each corresponding to distinct testing scenarios. The high-quality aerial video subset has 103 sequences shot by professional UAVs at heights of 5–25 meters, with frame rates of 30 to 96 frames per second (fps) and resolutions of 720P to 4K, ideal for tracking fast-moving targets. The low-cost UAV subset contains 12 sequences captured by economical UAVs with lower video quality, lower resolution, and more noise, increasing tracking difficulty. The synthetic data subset includes eight sequences generated by a UAV simulator to mimic real-world environmental changes. UAV123 can test a tracker's ability to handle fast motion, illumination changes, and occlusion, aiding in the development of systems that remain stable in various environments. The evaluation metrics are the AUC of the success plots and the CPE of the precision plots.
Figure 8 compares the overall success and precision of the proposed module with the top tracking algorithms on the UAV123 dataset, including MixFormer [55], STARK [43], TransT [41], TCTrack++ [58], TrDiMP [65], SiamBAN [66], SiamCAR [67], SiamTPN [68], and SiamRPN++ [32]. In the success rate evaluation, the MCWTT and MixFormer both achieved 68.7%, ranking first. This indicates that the MCWTT's accuracy in object tracking is on par with the state-of-the-art methods. In the precision evaluation, the MCWTT's 89.2% is slightly lower than MixFormer's by 0.3% but surpasses other methods. This shows that MCWTT effectively balances real-time and accuracy requirements, demonstrating strong adaptability in complex scenes with occlusions, deformations, and illumination changes.
To demonstrate the tracking performance of the proposed module across diverse scenarios, we select 12 representative scene attributes from the UAV123 dataset for detailed analysis. These attributes encompass a variety of complex challenges, including scale variation (SV), aspect ratio change (ARC), low resolution (LR), fast motion (FM), full occlusion (FO), partial occlusion (PO), out-of-view (OV), background clutter (BC), illumination variation (IV), viewpoint change (VC), camera motion (CM), and similar object (SO).
As shown in Figures 9 and 10, the MCWTT exhibits robust tracking performance across all attribute scenarios. For instance, for the LR attribute (Figures 9(a) and 10(a)), the MCWTT achieves success and precision rates of 54.4% and 79.2%, respectively, outperforming all comparative algorithms. This highlights its exceptional adaptability to low-resolution objects. In FM scenarios, the MCWTT achieves success and precision rates of 66.2% and 86.8%, surpassing MixFormer by approximately 0.4% in both metrics. This demonstrates its superior efficiency and accuracy in tracking fast-moving objects. For highly challenging scenarios such as FO, the MCWTT maintains high tracking accuracy, with success and precision rates exceeding most benchmark algorithms. It ranks second only to MixFormer and STARK, further validating its reliability in handling full occlusions. These results underscore the MCWTT's versatility and robustness in addressing diverse real-world tracking challenges, particularly in scenarios involving rapid motion, low resolution, and occlusions.
To better assess and demonstrate the performance of the MCWTT, we conducted frame-by-frame comparison tests with nine other trackers on the OTB100 dataset to show the tracking effects of each tracker in different scenes. We selected five videos representing typical tracking challenges, like fast motion, occlusion, deformation and scale variation, and illumination variation. The selected videos are "Diving_1", "Girl2_1", "Jump_1", "Skating2-1_1", and "Trans_1".
(1) Fast motion and scale variation. In the low-resolution scene of the "Diving_1" video sequence (Figure 11 (a)), the athlete briefly exhibits a mid-air flipping posture. In subsequent frames, the athlete plunges into the water with significant background changes. Many trackers were unable to effectively identify the rapidly moving object, while the module proposed in this paper accurately tracked the object throughout the entire video sequence. Similarly, in the "Jump_1" video sequence (Figure 11 (c)), the object undergoes scale variations during rapid motion. Only the proposed module successfully identified and accurately tracked the target across all frames of the full video sequence.
(2) Occlusion. In the "Girl2_1" video sequence shown in Figure 11(b), when the tracked girl is completely obscured by another pedestrian in Frame 107, only the proposed module accurately distinguishes the interfering object and continues to identify the object throughout subsequent occlusion, while other tracking methods either mistrack the interference object or experience tracking drift. This scenario fully demonstrates the superiority of the proposed module. Similarly, in the "Skating2-1_1" video sequence shown in Figure 11(d), the object athlete faces consecutive obstructions by the interfering athlete. The proposed module achieves stable tracking of the object throughout the video sequence. It adaptively resizes the bounding box to ensure the structural integrity of the object within the frame.
(3) Deformation. In the "Jump_1" video sequence shown in Figure 11(c), the athlete undergoes deformation during rapid motion, posing a challenge to the tracking algorithm. It is evident from Frame 95 of the video sequence that other tracking algorithms cannot track the gymnast's human form after landing. Only the proposed module can stably track the target throughout the entire video sequence, reflecting its effective handling of deformation and scale variation. Similarly, in the "Trans_1" video sequence shown in Figure 11(e), the proposed module delivers a more accurate identification of the appearance contour of the object compared with other algorithms, significantly improving the accuracy of the tracking process.
(4) Illumination variation. In the "Trans_1" video sequence shown in Figure 11(e), the object's background changes from bright to dark, making the object's contour more blurred than in previous scenes. Such low-light conditions compromised the accuracy of several tracking algorithms, while the proposed method remained unaffected by the illumination changes and accurately identified the object.
In the above analysis, the proposed module shows better robustness and accuracy than the other trackers in complex, low-resolution tracking scenarios.
To assess the proposed module's specific contribution to tracking performance, we conduct a series of ablation experiments on the OTB100 dataset. We evaluate each component by removing it from the structure while keeping the others unchanged.
In this subsection, we compare the model without the FFM (denoted as the CWTT) with the MCWTT to verify the FFM's effectiveness. As shown in Table 2, the MCWTT nearly matches the speed of the original model with only the SFM while reducing video random access memory (VRAM) usage, highlighting the advantage of the FFM in alleviating memory cache demands. The MCWTT effectively compresses the video memory through optimization in the model's architecture and computational efficiency, significantly reducing resource requirements without sacrificing tracking performance, thereby improving the performance on the OTB100 benchmark dataset. At the same time, the reduction in video memory occupancy makes the MCWTT more suitable for devices with limited video memory, further enhancing its advantages in practical applications.
Method | VRAM usage (MiB) | Speed (fps) |
CWTT | 2318 | 25.661
MCWTT | 2172 | 25.653
In this subsection, we replace the multi-scale windows with single-scale windows and compare the model with single-scale windows (denoted as the MCWTT-S) with the MCWTT to verify the effectiveness of multi-scale windows. As shown in Figure 12, the success rate and accuracy of the module with multi-scale window configuration are significantly improved compared with MCWTT-S.
Tables 3 and 4 list the success rates and accuracies of single-scale and multi-scale window configurations under different challenging scenarios. Combining the two proves that multi-scale windows are more adaptable to complex tracking environments and capture the complete information of the object more efficiently, reflecting on the tracking results.
Method | BC | DEF | FM | IPR | IV | LR | MB | OCC | OPR | OV | SV |
MCWTT | 0.596 | 0.638 | 0.688 | 0.663 | 0.645 | 0.571 | 0.708 | 0.640 | 0.653 | 0.634 | 0.690 |
MCWTT-S | 0.541 | 0.646 | 0.653 | 0.613 | 0.614 | 0.575 | 0.674 | 0.592 | 0.614 | 0.550 | 0.647 |
Method | BC | DEF | FM | IPR | IV | LR | MB | OCC | OPR | OV | SV |
MCWTT | 0.757 | 0.861 | 0.877 | 0.862 | 0.818 | 0.787 | 0.894 | 0.834 | 0.868 | 0.829 | 0.892 |
MCWTT-S | 0.712 | 0.878 | 0.837 | 0.826 | 0.792 | 0.826 | 0.859 | 0.773 | 0.826 | 0.719 | 0.844 |
We also compared the average overlap rate and center error rate curves of the MCWTT and MCWTT-S models under the "Car2" video sequence. As shown in Figure 13, the average overlap rate of the MCWTT is higher than that of the MCWTT-S and more stable. In terms of the center error rate, the MCWTT is lower than the MCWTT-S. The test results are also reflected in the visual tracking effects of the "Car2" video sequence in Figure 14. When there are interfering objects during the car's movement, the MCWTT-S loses the object, causing tracking drift and failure. In contrast, the MCWTT remains stable in tracking, proving the effectiveness of the multi-scale window configuration used in the MCWTT.
In this paper, we propose a multi-scale cyclic-shift window Transformer object tracker based on the fast Fourier transform (MCWTT). The proposed module introduces a cyclic-shift window mechanism to increase the diversity of sample positions. It also adopts a multi-scale window-level attention mechanism instead of the traditional single-scale pixel-level attention mechanism. The proposed module not only protects the integrity of the object but also enriches the diversity of training samples, further mining the object's location information and improving the tracking accuracy of the model. In addition, we use frequency-domain feature fusion instead of spatial-domain feature fusion, converting the attention matrix calculation between cyclic-shifted samples and non-cyclic-shifted samples into the frequency domain through the convolution theorem, effectively reducing the storage and computational complexity of the samples and significantly improving the inference efficiency. Unlike the traditional encoder-decoder serial structure, we introduce a feature enhancement block in the network design and feed it back to the feature fusion block to form a signal feedback loop, further enhancing the object state estimation ability and enabling the network to handle object tracking tasks in dynamic scenes more efficiently. Theoretical analysis and experimental verification show that the network has significant advantages in dealing with complex scenes, such as scale variation, background interference, and occlusion. Ablation studies also demonstrate that the frequency-domain computation effectively reduces the computational overhead of the module while maintaining its precision and efficiency. However, there remains a performance gap compared with the state-of-the-art trackers, and enhancing interpretability and generalization while maintaining performance in harsh tracking conditions remains an open issue. Future research will focus on boosting the network's efficiency by streamlining the parameters without compromising tracking performance.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the Natural Science Foundation of Fujian Province (2024J01820, 2024J01821, 2024J01822), the Natural Science Foundation Project of Zhangzhou City (ZZ2023J37), the Principal Foundation of Minnan Normal University (KJ19019), the High-level Science Research Project of Minnan Normal University (GJ19019), the Research Project on Education and Teaching of Undergraduate Colleges and Universities in Fujian Province (FBJY20230083), the Guangdong Province Natural Science Foundation (2024A1515011766), and the State Key Laboratory Major Special Projects of the Jilin Province Science and Technology Development Plan (SKL202402024).
The authors declare there is no conflict of interest.
[1] | Y. Li, X. Yuan, H. Wu, L. Zhang, R. Wang, J. Chen, CVT-track: Concentrating on valid tokens for one-stream tracking, IEEE Trans. Circuits Syst. Video Technol., 34 (2024), 321–334. https://doi.org/10.1109/TCSVT.2024.3452231 |
[2] | S. Zhang, Y. Chen, ATM-DEN: Image inpainting via attention transfer module and decoder-encoder network, SPIC, 133 (2025), 117268. https://doi.org/10.1016/j.image.2025.117268 |
[3] | F. Chen, X. Wang, Y. Zhao, S. Lv, X. Niu, Visual object tracking: A survey, Comput. Vision Image Underst., 222 (2022), 103508. https://doi.org/10.1016/j.cviu.2022.103508 |
[4] | F. Zhang, S. Ma, Z. Qiu, T. Qi, Learning target-aware background-suppressed correlation filters with dual regression for real-time UAV tracking, Signal Process., 191 (2022), 108352. https://doi.org/10.1016/j.sigpro.2021.108352 |
[5] | S. Ma, B. Zhao, Z. Hou, W. Yu, L. Pu, X. Yang, SOCF: A correlation filter for real-time UAV tracking based on spatial disturbance suppression and object saliency-aware, Expert Syst. Appl., 238 (2024), 122131. https://doi.org/10.1016/j.eswa.2023.122131 |
[6] | J. Lin, J. Peng, J. Chai, Real-time UAV correlation filter based on response-weighted background residual and spatio-temporal regularization, IEEE Geosci. Remote Sens. Lett., 20 (2023), 1–5. https://doi.org/10.1109/LGRS.2023.3272522 |
[7] | J. Cao, H. Zhang, L. Jin, J. Lv, G. Hou, C. Zhang, A review of object tracking methods: From general field to autonomous vehicles, Neurocomputing, 585 (2024), 127635. https://doi.org/10.1016/j.neucom.2024.127635 |
[8] | X. Hao, Y. Xia, H. Yang, Z. Zuo, Asynchronous information fusion in intelligent driving systems for target tracking using cameras and radars, IEEE Trans. Ind. Electron., 70 (2023), 2708–2717. https://doi.org/10.1109/TIE.2022.3169717 |
[9] | L. Liang, Z. Chen, L. Dai, S. Wang, Target signature network for small object tracking, Eng. Appl. Artif. Intell., 138 (2024), 109445. https://doi.org/10.1016/j.engappai.2024.109445 |
[10] | R. Yao, L. Zhang, Y. Zhou, H. Zhu, J. Zhao, Z. Shao, Hyperspectral object tracking with dual-stream prompt, IEEE Trans. Geosci. Remote Sens., 63 (2025), 1–12. https://doi.org/10.1109/TGRS.2024.3516833 |
[11] | N. K. Rathore, S. Pande, A. Purohit, An efficient visual tracking system based on extreme learning machine in the defence and military sector, Def. Sci. J., 74 (2024), 643–650. https://doi.org/10.14429/dsj.74.19576 |
[12] | Y. Chen, Y. Tang, Y. Xiao, Q. Yuan, Y. Zhang, F. Liu, et al., Satellite video single object tracking: A systematic review and an oriented object tracking benchmark, ISPRS J. Photogramm. Remote Sens., 210 (2024), 212–240. https://doi.org/10.1016/j.isprsjprs.2024.03.013 |
[13] | W. Cai, Q. Liu, Y. Wang, HIPTrack: Visual tracking with historical prompts, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2024), 19258–19267. https://doi.org/10.1109/CVPR52733.2024.01822 |
[14] | L. Sun, J. Zhang, D. Gao, B. Fan, Z. Fu, Occlusion-aware visual object tracking based on multi-template updating Siamese network, Digit. Signal Process., 148 (2024), 104440. https://doi.org/10.1016/j.dsp.2024.104440 |
[15] | Y. Chen, L. Wang, eMoE-Tracker: Environmental MoE-based transformer for robust event-guided object tracking, IEEE Robot. Autom. Lett., 10 (2025), 1393–1400. https://doi.org/10.1109/LRA.2024.3518305 |
[16] | Y. Sun, T. Wu, X. Peng, M. Li, D. Liu, Y. Liu, et al., Adaptive representation-aligned modeling for visual tracking, Knowl. Based Syst., 309 (2025), 112847. https://doi.org/10.1016/j.knosys.2024.112847 |
[17] | J. Wang, S. Yang, Y. Wang, G. Yang, PPTtrack: Pyramid pooling based Transformer backbone for visual tracking, Expert Syst. Appl., 249 (2024), 123716. https://doi.org/10.1016/j.eswa.2024.123716 |
[18] | C. Wu, J. Shen, K. Chen, Y. Chen, Y. Liao, UAV object tracking algorithm based on spatial saliency-aware correlation filter, Electron. Res. Arch., 33 (2025), 1446–1475. https://doi.org/10.3934/era.2025068 |
[19] | A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, M. Kristan, Discriminative correlation filter with channel and spatial reliability, Int. J. Comput. Vision, 126 (2018), 671–688. https://doi.org/10.1007/s11263-017-1061-3 |
[20] | T. Xu, Z. Feng, X. Wu, J. Kittler, Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking, IEEE Trans. Image Process., 28 (2019), 5596–5609. https://doi.org/10.1109/TIP.2019.2919201 |
[21] | J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell., 37 (2015), 583–596. https://doi.org/10.1109/TPAMI.2014.2345390 |
[22] | E. O. Brigham, R. E. Morrow, The fast Fourier transform, IEEE Spectrum, 4 (1967), 63–70. https://doi.org/10.1109/MSPEC.1967.5217220 |
[23] | H. K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware correlation filters for visual tracking, in IEEE International Conference on Computer Vision (ICCV), (2017), 1144–1152. https://doi.org/10.1109/ICCV.2017.129 |
[24] | Z. Zhang, H. Peng, J. Fu, B. Li, W. Hu, Ocean: Object-aware anchor-free tracking, in European Conference on Computer Vision (ECCV), (2020), 771–787. https://doi.org/10.1007/978-3-030-58589-1_46 |
[25] | Y. Zhang, H. Pan, J. Wang, Enabling deformation slack in tracking with temporally even correlation filters, Neural Networks, 181 (2025), 106839. https://doi.org/10.1016/j.neunet.2024.106839 |
[26] | Y. Chen, H. Wu, Z. Deng, J. Zhang, H. Wang, L. Wang, et al., Deep-feature-based asymmetrical background-aware correlation filter for object tracking, Digit. Signal Process., 148 (2024), 104446. https://doi.org/10.1016/j.dsp.2024.104446 |
[27] | K. Chen, L. Wang, H. Wu, C. Wu, Y. Liao, Y. Chen, et al., Background-aware correlation filter for object tracking with deep CNN features, Eng. Lett., 32 (2024), 1353–1363. |
[28] | J. Zhang, Y. He, W. Chen, L. D. Kuang, B. Zheng, CorrFormer: Context-aware tracking with cross-correlation and transformer, Comput. Electr. Eng., 114 (2024), 109075. https://doi.org/10.1016/j.compeleceng.2024.109075 |
[29] | L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. Torr, Fully-convolutional siamese networks for object tracking, in European Conference on Computer Vision (ECCV), (2016), 850–865. https://doi.org/10.1007/978-3-319-48881-3_56 |
[30] | Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, S. Wang, Learning dynamic siamese network for visual object tracking, in IEEE International Conference on Computer Vision (ICCV), (2017), 1781–1789. https://doi.org/10.1109/ICCV.2017.196 |
[31] | B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2018), 8971–8980. |
[32] | B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, SiamRPN++: Evolution of siamese visual tracking with very deep networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 4277–4286. |
[33] | L. Zhao, C. Fan, M. Li, Z. Zheng, X. Zhang, Global-local feature-mixed network with template update for visual tracking, Pattern Recognit. Lett., 188 (2025), 111–116. https://doi.org/10.1016/j.patrec.2024.11.034 |
[34] | F. Gu, J. Lu, C. Cai, Q. Zhu, Z. Ju, RTSformer: A robust toroidal transformer with spatiotemporal features for visual tracking, IEEE Trans. Human Mach. Syst., 54 (2024), 214–225. https://doi.org/10.1109/THMS.2024.3370582 |
[35] | O. Abdelaziz, M. Shehata, DMTrack: Learning deformable masked visual representations for single object tracking, SIViP, 19 (2025), 61. https://doi.org/10.1007/s11760-024-03713-0 |
[36] | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in the 31st International Conference on Neural Information Processing Systems (NIPS), (2017), 6000–6010. |
[37] | O. C. Koyun, R. K. Keser, S. O. Şahin, D. Bulut, M. Yorulmaz, V. Yücesoy, et al., RamanFormer: A Transformer-based quantification approach for Raman mixture components, ACS Omega, 9 (2024), 23241–23251. https://doi.org/10.1021/acsomega.3c09247 |
[38] | H. Fan, X. Wang, S. Li, H. Ling, Joint feature learning and relation modeling for tracking: A one-stream framework, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), 341–357. https://doi.org/10.1007/978-3-031-20047-2_20 |
[39] | H. Zhang, J. Song, H. Liu, Y. Han, Y. Yang, H. Ma, AwareTrack: Object awareness for visual tracking via templates interaction, Image Vision Comput., 154 (2025), 105363. https://doi.org/10.1016/j.imavis.2024.105363 |
[40] | Z. Wang, L. Yuan, Y. Ren, S. Zhang, H. Tian, ADSTrack: Adaptive dynamic sampling for visual tracking, Complex Intell. Syst., 11 (2025), 79. https://doi.org/10.1007/s40747-024-01672-0 |
[41] | X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, H. Lu, Transformer tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 8122–8131. https://doi.org/10.1109/CVPR46437.2021.00803 |
[42] | A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16x16 words: Transformers for image recognition at scale, preprint, arXiv: 2010.11929. https://doi.org/10.48550/arXiv.2010.11929 |
[43] | B. Yan, H. Peng, J. Fu, D. Wang, H. Lu, Learning spatio-temporal transformer for visual tracking, in IEEE International Conference on Computer Vision (ICCV), (2021), 10428–10437. https://doi.org/10.1109/ICCV48922.2021.01028 |
[44] | Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, et al., Swin Transformer: Hierarchical vision transformer using shifted windows, in IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 10012–10022. https://doi.org/10.1109/ICCV48922.2021.00986 |
[45] | L. Lin, H. Fan, Z. Zhang, Y. Xu, H. Ling, SwinTrack: A simple and strong baseline for transformer tracking, in Advances in Neural Information Processing Systems (NIPS), 35 (2022), 16743–16754. |
[46] | Z. Song, J. Yu, Y. P. P. Chen, W. Yang, Transformer tracking with cyclic shifting window attention, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), 8781–8790. https://doi.org/10.1109/CVPR52688.2022.00859 |
[47] | Y. Chen, K. Chen, Four mathematical modeling forms for correlation filter object tracking algorithms and the fast calculation for the filter, Electron. Res. Arch., 32 (2024), 4684–4714. https://doi.org/10.3934/era.2024213 |
[48] | H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, et al., LaSOT: A high-quality benchmark for large-scale single object tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 5369–5378. https://doi.org/10.1109/CVPR.2019.00552 |
[49] | Y. Wu, J. Lim, M. -H. Yang, Online object tracking: A benchmark, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2013), 2411–2418. https://doi.org/10.1109/CVPR.2013.312 |
[50] | M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for UAV tracking, in Computer Vision–ECCV 2016, (2016), 445–461. https://doi.org/10.1007/978-3-319-46448-0_27 |
[51] | Y. Huang, Y. Chen, C. Lin, Q. Hu, J. Song, Visual attention learning and antiocclusion-based correlation filter for visual object tracking, J. Electron. Imaging, 32 (2023), 23. https://doi.org/10.1117/1.JEI.32.1.013023 |
[52] | K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778. |
[53] | N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in European Conference on Computer Vision (ECCV), (2020), 213–229. https://doi.org/10.1007/978-3-030-58452-8_13 |
[54] | D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, preprint, arXiv: 1412.6980. https://doi.org/10.48550/arXiv.1412.6980 |
[55] | Y. Cui, C. Jiang, G. Wu, L. Wang, MixFormer: End-to-end tracking with iterative mixed attention, IEEE Trans. Pattern Anal. Mach. Intell., 46 (2024), 4129–4146. https://doi.org/10.1109/TPAMI.2024.3349519 |
[56] | J. Shen, Y. Liu, X. Dong, X. Lu, F. S. Khan, S. Hoi, Distilled siamese networks for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell., 44 (2022), 8896–8909. https://doi.org/10.1109/TPAMI.2021.3127492 |
[57] | X. Dong, J. Shen, F. Porikli, J. Luo, L. Shao, Adaptive siamese tracking with a compact latent network, IEEE Trans. Pattern Anal. Mach. Intell., 45 (2023), 8049–8062. https://doi.org/10.1109/TPAMI.2022.3230064 |
[58] | Z. Cao, Z. Huang, L. Pan, S. Zhang, Z. Liu, C. Fu, Towards real-world visual tracking with temporal contexts, IEEE Trans. Pattern Anal. Mach. Intell., 45 (2023), 15834–15849. https://doi.org/10.1109/TPAMI.2023.3307174 |
[59] | Y. Yang, X. Gu, Attention-based gating network for robust segmentation tracking, IEEE Trans. Circuits Syst. Video Technol., 35 (2025), 245–258. https://doi.org/10.1109/TCSVT.2024.3460400 |
[60] | W. Han, X. Dong, Y. Zhang, D. Crandall, C. Z. Xu, J. Shen, Asymmetric Convolution: An efficient and generalized method to fuse feature maps in multiple vision tasks, IEEE Trans. Pattern Anal. Mach. Intell., 46 (2024), 7363–7376. https://doi.org/10.1109/TPAMI.2024.3400873 |
[61] | X. Zhu, Y. Wu, D. Xu, Z. Feng, J. Kittler, Robust visual object tracking via adaptive attribute-aware discriminative correlation filters, IEEE Trans. Multimedia, 23 (2021), 2625–2638. https://doi.org/10.1109/TMM.2021.3050073 |
[62] | M. Danelljan, H. Gustav, F. Shahbaz Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in IEEE International Conference on Computer Vision (ICCV), (2015), 4310–4318. https://doi.org/10.1109/ICCV.2015.490 |
[63] | J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P. H. S. Torr, End-to-end representation learning for correlation filter based tracking, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), 5000–5008. https://doi.org/10.1109/CVPR.2017.531 |
[64] | G. Bhat, M. Danelljan, L. V. Gool, R. Timofte, Learning discriminative model prediction for tracking, in IEEE/CVF International Conference on Computer Vision (ICCV), (2019), 6182–6191. https://doi.org/10.1109/ICCV.2019.00628 |
[65] | N. Wang, W. Zhou, J. Wang, H. Li, Transformer meets tracker: exploiting temporal context for robust visual tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 1571–1580. https://doi.org/10.1109/CVPR46437.2021.00162 |
[66] | Z. Chen, B. Zhong, G. Li, S. Zhang, R. Ji, Siamese box adaptive network for visual tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 6668–6677. https://doi.org/10.1109/CVPR42600.2020.00670 |
[67] | Y. Guo, H. Li, L. Zhang, L. Zhang, K. Deng, F. Porikli, SiamCAR: Siamese fully convolutional classification and regression for visual tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 1176–1185. https://doi.org/10.1109/CVPR42600.2020.00630 |
[68] | D. Xing, N. Evangeliou, A. Tsoukalas, A. Tzes, Siamese transformer pyramid networks for real-time UAV tracking, in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), (2022), 1898–1907. https://doi.org/10.1109/WACV51458.2022.00196 |