In recent years, Transformer-based object trackers have demonstrated exceptional performance in object tracking. However, traditional methods often employ single-scale pixel-level attention mechanisms to compute the correlation between templates and search regions, disrupting the object's integrity and positional information. To address these issues, we introduce a cyclic-shift mechanism to expand the diversity of sample positions and replace the traditional single-scale pixel-level attention mechanism with a multi-scale window-level attention mechanism. This approach not only preserves the object's integrity but also enriches the diversity of samples. Nevertheless, the introduced cyclic-shift operation heavily burdens storage and computation. To this end, we treat the attention computation of shifted and static windows in the spatial domain as a convolution. By leveraging the convolution theorem, we transform the attention computation of cyclic-shift samples from the spatial domain into element-wise multiplication in the frequency domain. This approach enhances computational efficiency and reduces data storage requirements. We conducted extensive experiments on the proposed module. The results demonstrate that it outperforms multiple existing tracking algorithms, and ablation studies show that the method effectively reduces the storage and computational burden without compromising performance.
Citation: Huanyu Wu, Yingpin Chen, Changhui Wu, Ronghuan Zhang, Kaiwei Chen. A multi-scale cyclic-shift window Transformer object tracker based on fast Fourier transform[J]. Electronic Research Archive, 2025, 33(6): 3638-3672. doi: 10.3934/era.2025162
1. Introduction
Since the emergence of the COVID-19 pandemic around December 2019, the outbreak has snowballed globally [1,2], and there is no clear sign that the new confirmed cases and deaths are coming to an end. Though vaccines are rolling out to deter the spread of the pandemic, mutations of the virus are already under way [3,4,5,6]. Although the origin of the pandemic is still under debate [7], many researchers are conducting studies from different aspects and perspectives. These could be categorised mainly into three levels: the SARS-CoV-2 genetic level [8], the COVID-19 individual country level [9,10,11] and the continental level [12,13]. In this study, we focus on the latter two levels, for which there are many methods and techniques. For example, linear and non-linear growth models, together with 2-week-kernel-window regression, are exploited in modelling the exponential growth rate of COVID-19 confirmed cases [14], and are also generalised to non-linear modelling of the pandemic [15,16]. Some research works focus on the prediction of COVID-19 spread by estimating the lead-lag effects between different countries via time-warping techniques [17], while others utilise clustering analyses to group countries via epidemiological data such as active cases and active cases per population [18]. In addition, there are studies tackling the relationship between economic variables and COVID-19-related variables [19,20], though both results show no relation between economic freedom and COVID-19 deaths, and no relation between the performance of equity markets and COVID-19 cases and deaths.
In this study, we aim to extract the features of the daily biweekly growth rates of cases and deaths at the national and continental levels. We devise orthonormal bases based on Fourier analysis [21,22], in particular Fourier coefficients, as the potential features. At the national level, we import the global time series data and sample 117 countries over 109 days [23,24]. Then we calculate the Euclidean distance matrices for the inner products between countries and between days. Based on the distance matrices, we then calculate their variabilities to delve into the distribution of the data. At the continental level, we also import the biweekly changes of cases and deaths for five continents as well as the world, with time series data for 447 days. Then we calculate their inner products with respect to the temporal frequencies and find the similarities of the extracted features between continents.
At the national level, the biweekly data bear stronger temporal features than spatial features, i.e., as time goes by, the pandemic evolves more along the time dimension than along the space (or country-wise) dimension. Moreover, there exists a strong concurrency between the features for the biweekly changes of cases and those of deaths, though there is no clear or stable trend for the extracted features. However, at the continental level, one observes a stable trend of features regarding the biweekly changes. In addition, the extracted features between continents are similar to one another, except for Asia, whose features bear no clear similarities with those of other continents.
Our approach is based on orthonormal bases, which serve as the potential features for the biweekly changes of cases and deaths. This method is straightforward and easy to comprehend. Its limitations are that the extracted features are based on the hidden frequencies of the dynamical structure, to which it is hard to assign an interpretable meaning, and that the fetched data are not complete, due to missing entries in the database. Nevertheless, the results provided in this study could help one map out the evolutionary features of COVID-19.
2. Method and implementation
Let $\delta:\mathbb{N}\to\{0,1\}$ be the function such that $\delta(n)=0$ (also written $\delta_n=0$) if $n\in 2\mathbb{N}$, and $\delta(n)=1$ if $n\in 2\mathbb{N}+1$. Given a set of point data $D=\{\vec{v}\}\subseteq\mathbb{R}^N$, we would like to decompose each $\vec{v}$ into frequency-based vectors by Fourier analysis. The features of the COVID-19 case and death growth rates are specified by the orthonormal frequency vectors $B_N=\{\vec{f}_i=(f_{ij})_{j=1}^{N}\}_{i=1}^{N}$, based on Fourier analysis, in particular Fourier series [22], where
● $f_{1j}=\sqrt{\tfrac{1}{N}}$ for all $1\le j\le N$;
● For any $2\le i\le N-1+\delta_N$,
$$f_{ij}=\sqrt{\frac{2}{N}}\cdot\cos\left[\frac{\pi}{2}\cdot\delta_i-\frac{(i-\delta_i)\,\pi}{N}\cdot j\right];\tag{2.1}$$
● If $N\in 2\mathbb{N}$, then $f_{Nj}=\sqrt{\tfrac{1}{N}}\cdot\cos(j\pi)$ for all $1\le j\le N$.
Now we have constructed an orthonormal basis $F_N=\{\vec{f}_1,\vec{f}_2,\cdots,\vec{f}_N\}$ of features for $\mathbb{R}^N$, and each $\vec{v}=\sum_{i=1}^{N}\langle\vec{v},\vec{f}_i\rangle\cdot\vec{f}_i$, where $\langle\cdot,\cdot\rangle$ is the inner product. The basis $B_N$ could also be represented by a matrix,
and the representation of a data column vector $\vec{v}=(-3,14,5,8,-12)^{\top}$ with respect to $B_5$ is calculated by $F_5\vec{v}=[\langle\vec{v},\vec{f}_i\rangle]_{i=1}^{5}$, i.e., the 5-by-1 column vector $(5.367,-16.334,-3.271,-6.434,-9.503)^{\top}$.
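Since the paper's procedures are implemented in R, the following Python (NumPy) sketch is only an illustration of the construction above; it builds $F_N$ row by row from Eq. (2.1) and reproduces the worked $B_5$ example.

```python
import numpy as np

def fourier_basis(N):
    """Orthonormal frequency basis F_N; row i-1 holds the vector f_i."""
    delta = lambda n: n % 2                   # delta(n) = 0 if n is even, 1 if odd
    j = np.arange(1, N + 1)
    F = np.empty((N, N))
    F[0] = np.sqrt(1.0 / N)                   # f_1j = sqrt(1/N)
    for i in range(2, N - 1 + delta(N) + 1):  # middle rows, Eq. (2.1)
        F[i - 1] = np.sqrt(2.0 / N) * np.cos(
            np.pi / 2 * delta(i) - (i - delta(i)) * np.pi / N * j)
    if N % 2 == 0:                            # extra alternating row for even N
        F[N - 1] = np.sqrt(1.0 / N) * np.cos(j * np.pi)
    return F

F5 = fourier_basis(5)
v = np.array([-3.0, 14.0, 5.0, 8.0, -12.0])
coeffs = F5 @ v                               # inner products <v, f_i>
# coeffs is approximately (5.367, -16.334, -3.271, -6.434, -9.503)
assert np.allclose(F5 @ F5.T, np.eye(5))      # the basis is orthonormal
assert np.allclose(F5.T @ coeffs, v)          # v is recovered from its coefficients
```

The two assertions check that the rows are orthonormal and that $\vec{v}$ is recovered exactly from its Fourier coefficients.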
2.1. Data description and handling
There are two main parts of data collection and handling: one for individual countries (the national level) and the other for individual continents (the continental level). At both levels, we fetch the daily biweekly growth rates of confirmed COVID-19 cases and deaths from Our World in Data [23,24]. Then we use R (version 4.1.0) to handle the data and implement the procedures.
Sampled targets: national. After filtering out non-essential and missing data, the effective sample consists of 117 countries over 109 days, as shown in Results. The days range from December 2020 to June 2021. Though the sampled days are not consecutive (due to missing data), the biweekly information still covers such loss. In the subsequent temporal and spatial analyses, we conduct our study based on these data.
Sampled targets: continental. As for the continental data, we collect data for the world, Africa, Asia, Europe, North America and South America. The sampled days range from March 22nd, 2020 to June 11th, 2021; in total, there are 447 days (this differs from the national level). In the subsequent temporal analysis (there is no spatial analysis at the continental level, due to the limited sample size), we conduct our study based on these data.
Notations: national. For further processing, let us introduce some notation. Let the sampled countries be indexed by $i=1,\dots,117$ and the sampled days by $t=1,\dots,109$; the days range from December 3rd, 2020 to May 31st, 2021. Let $c_i(t)$ and $d_i(t)$ be the daily biweekly growth rates of confirmed cases and deaths in country $i$ on day $t$, respectively, i.e.,
$$c_i(t):=\frac{\mathrm{case}_{i,t+13}-\mathrm{case}_{i,t}}{\mathrm{case}_{i,t}};\tag{2.3}$$
$$d_i(t):=\frac{\mathrm{death}_{i,t+13}-\mathrm{death}_{i,t}}{\mathrm{death}_{i,t}},\tag{2.4}$$
where casei,t and deathi,t denote the total confirmed cases and deaths for country i at day t, respectively. We form temporal and spatial vectors by
The vectors $c_i$ and $d_i$ give the whole time series for a given country $i$, and the vectors $v(t)$ and $w(t)$ give all countries' values on a given day $t$.
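For illustration, Eqs. (2.3) and (2.4) are simple to compute once the cumulative series are in hand; the counts in this Python sketch are hypothetical, not taken from the dataset.

```python
import numpy as np

def biweekly_growth_rate(series):
    """c_i(t) = (case[t+13] - case[t]) / case[t], per Eq. (2.3)."""
    s = np.asarray(series, dtype=float)
    t = np.arange(len(s) - 13)        # every day with a full 14-day window ahead
    return (s[t + 13] - s[t]) / s[t]

# Hypothetical cumulative confirmed cases over 17 consecutive days.
case = [100, 110, 125, 130, 150, 160, 180, 200, 210,
        230, 250, 260, 280, 300, 320, 340, 360]
rates = biweekly_growth_rate(case)    # c(1), ..., c(4)
# e.g. the first rate is (300 - 100) / 100 = 2.0
```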
Notations: continental. Similarly, let the sampled continents be indexed by $j=1,\dots,6$, and let the 447 sampled days range from March 22nd, 2020 to June 11th, 2021. We form temporal vectors for confirmed cases and deaths by
For any $m$-by-$n$ matrix $A$, we use $\min(A)$ to denote the value $\min\{a_{ij}:1\le i\le m;\ 1\le j\le n\}$, and define $\max(A)$ in the same manner. If $\vec{v}$ is a vector, $\min(\vec{v})$ and $\max(\vec{v})$ are defined analogously. The implementation goes as follows:
(1) Extract and trim the source data.
Extraction: national. Extract the daily biweekly growth rates of COVID-19 cases and deaths from the database and trim the data. The trimmed data consist of 109 time series for 117 countries, as shown in Table 1, in the form of two 117-by-109 matrices:
Row $i$ of the matrices is regarded as the temporal vector $c_i$ or $d_i$, respectively, and column $t$ is regarded as the spatial vector $v(t)$ or $w(t)$, respectively.
Extraction: continental. As for the continental data, they are collected in two 6-by-447 matrices:
$$\mathrm{Biweekly\_cont\_cases}=[x_j(\tau)]_{j=1:6}^{\tau=1:447};$$
$$\mathrm{Biweekly\_cont\_deaths}=[y_j(\tau)]_{j=1:6}^{\tau=1:447}.$$
(2) Specify the frequencies (features) for the imported data.
Basis: national. In order to decompose ci and di into some fundamental features, we specify F109 as the corresponding features, whereas to decompose v(t) and w(t), we specify F117 as the corresponding features. The results are presented in Table 2.
Table 2. Orthonormal temporal frequencies for 109 days (upper block, $F_{109}$) and orthonormal spatial frequencies for 117 countries (lower block, $F_{117}$).
Average variability: continental. For each continent j, the temporal variabilities for confirmed cases and deaths are computed by
$$\mathrm{var\_cont\_case\_time}[j]=\frac{\sum_{k=1}^{6} d_E\!\left(F_{447}x_j,\,F_{447}x_k\right)}{447};$$
$$\mathrm{var\_cont\_death\_time}[j]=\frac{\sum_{k=1}^{6} d_E\!\left(F_{447}y_j,\,F_{447}y_k\right)}{447},$$
where $d_E$ denotes the Euclidean distance.
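Assuming the frequency representations $F_{447}x_j$ are stacked as the rows of a matrix, the averaging step reduces to a pairwise Euclidean distance computation. This Python sketch uses a toy 3-by-2 matrix rather than the 6-by-447 continental data.

```python
import numpy as np

def average_variability(R):
    """For each row j of R, sum the Euclidean distances to every row k,
    then divide by the number of columns (447 in the continental case)."""
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=2)  # pairwise d_E
    return d.sum(axis=1) / R.shape[1]

# Toy representations: rows 0 and 2 coincide; row 1 is 5 units from both.
R = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 0.0]])
var = average_variability(R)   # (0+5+0)/2, (5+0+5)/2, (0+5+0)/2 = 2.5, 5.0, 2.5
```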
(6) Unify the national temporal and spatial variabilities of cases and deaths. For each country i, the unified temporal and spatial variabilities for cases and deaths are defined by
where $\sigma_{ij}$ and $\beta_{ij}$ denote the values in the $(i,j)$ cells of IP_cont_cases_time and IP_cont_deaths_time, respectively. The results are visualised by figures in Results.
3. Results
There are two main parts of results shown in this section: national results and continental results.
National results. Based on the method described in Section 2, we identify the temporal orthonormal frequencies and the spatial ones, as shown in Table 2.
The computed inner products at the country level, which serve as the values of the extracted features, for daily biweekly growth rates of cases and deaths with respect to temporal frequencies are shown in Figure 1. Similarly, the computed inner products at the country level for daily biweekly growth rates of cases and deaths with respect to spatial frequencies are shown in Figure 2. Meanwhile, their scaled variabilities are plotted in Figure 3.
Figure 1. Inner products between growth rates of cases (solid line) over 109 temporal frequencies, and inner products between growth rates of deaths (dotted line) over 109 temporal frequencies, for some demonstrative countries: Afghanistan, Albania, Algeria, Uruguay, Zambia, and Zimbabwe.
Figure 2. Inner products between growth rates of cases (solid line) over 117 spatial frequencies, and inner products between growth rates of deaths (dotted line) over 117 spatial frequencies, for some demonstrative dates: 2020/12/3, 2020/12/4, 2020/12/5, 2021/5/29, 2021/5/30, and 2021/5/31.
Continental results. According to the obtained data, we study and compare the continental features of daily biweekly growth rates of confirmed cases and deaths for Africa, Asia, Europe, North America, South America and the world. Unlike the missing data in the analysis of individual countries, the continental data are complete. We take the samples from March 22nd, 2020 to June 11th, 2021; in total, there are 447 days for the analysis. The cosine values, which measure the similarities between the representations of continents, are shown in Table 3. The results of the unified inner products with respect to confirmed cases and deaths are plotted in Figures 4 and 5, respectively.
Table 3. Cosine values (similarities) between the world, Africa, Asia, Europe, North America (No. Am.), and South America (So. Am.).
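The cosine values in Table 3 measure the angle between two continents' feature vectors; a minimal Python sketch with hypothetical vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature (inner-product) vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors: w is a scaled copy of u, so the similarity is 1;
# an orthogonal pair scores 0.
u = [2.0, -1.0, 0.5]
w = [4.0, -2.0, 1.0]
sim_parallel = cosine_similarity(u, w)                       # close to 1.0
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # close to 0.0
```

Values near 1 indicate that two continents' extracted features point in almost the same direction, regardless of their magnitudes.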
Figure 5. Unified inner product (UIP) for the world, Africa, Asia, Europe, North America, and South America with respect to daily biweekly growth rates of deaths.
Other auxiliary results that support the plotting of the graphs are appended in the Appendix. The names of the 117 sampled countries are provided in Tables A1 and A2. The dates of the sampled days are provided in Figure A1. The tabulated results for inner products of temporal and spatial frequencies at the national level are provided in Table A3. The tabulated results for inner products of temporal frequencies at the continental level are provided in Table A4. The Euclidean distance matrices for temporal and spatial representations with respect to confirmed cases and deaths are tabulated in Table A5, and their average variabilities are tabulated in Table A6.
Summaries of results. Based on the previous tables and figures, we have the following results.
(1) From Figures 1 and 2, one observes that the temporal features are much more distinct than the spatial features, i.e., if one fixes one day and extracts the features from the spatial frequencies, one obtains less distinct features than when fixing one country and extracting the features from the temporal frequencies. This indicates that SARS-CoV-2 evolves and mutates more over time than across space.
(2) For individual countries, the features for the biweekly changes of cases are almost concurrent with those of deaths. This indicates that biweekly changes of cases and deaths share similar features; in some sense, the change of deaths is still in tune with the change of confirmed cases, i.e., there is no substantial change in their relationship.
(3) For individual countries, the extracted features go up and down intermittently and there is no obvious trend. This indicates the virus is still very versatile, and it is hard to capture its fixed features at a country level.
(4) From Figure 3, one observes clear similarities, in terms of variabilities, between the daily biweekly growth rates of cases and deaths under temporal frequencies. Moreover, the distribution of the overall data is not condensed, with the labelled countries in the middle scattered across the whole range. This indicates the diversity of daily biweekly growth rates of cases and deaths across countries is still very high.
(5) From Figure 3, the daily biweekly growth rates of deaths with respect to the spatial frequencies are fairly concentrated. This indicates the extracted features regarding deaths are stable, i.e., there are clearer and more stable spatial features for daily biweekly growth rates of deaths.
(6) Comparing the individual graphs in Figures 4 and 5, they bear much the same shape, but on different scales, with deaths being more feature-oriented (this is also witnessed at the country level, as claimed in the results above). This indicates a very clear trend of features regarding daily biweekly growth rates at the continental level (a stark contrast to the third result above).
(7) From Figures 4 and 5, the higher values of inner products lie at both endpoints for the biweekly changes of cases and deaths, i.e., at low and high temporal frequencies, for all the continents except the biweekly change of deaths in Asia. This indicates the evolutionary patterns in Asia are very distinct from those of the other continents.
(8) From Table 3, the extracted features are very similar across continents, except for Asia. This echoes the result above.
4. Conclusions and future work
In this study, we identify the features of daily biweekly growth rates of COVID-19 confirmed cases and deaths via orthonormal bases (features) derived from Fourier analysis. Then we analyse the inner products, which represent the levels of the chosen features. The variabilities for each country show that the levels of deaths under spatial frequencies are much more concentrated than the others. The generated results are summarised in Section 3. There are some limitations in this study and future improvements to be made:
● The associated meanings of the orthonormal features from Fourier analysis are not yet fully explored;
● We use the Euclidean metric to measure the distances between features, which is then used to calculate the variabilities. The Euclidean metric is noted for its geometric properties, but it may not be the most suitable in the context of frequencies. One could further introduce other metrics and apply machine learning techniques to find the optimal ones;
● In this study, we choose the daily biweekly growth rates of confirmed cases and deaths as our research sources. This is a one-sided story. To obtain a fuller picture of the dynamical features, one could add other variables for comparison.
Acknowledgements
This work is supported by the Humanities and Social Science Research Planning Fund Project under the Ministry of Education of China (No. 20XJAGAT001).
Conflict of interest
No potential conflict of interest was reported by the authors.
Table A3. Inner products w.r.t. temporal (case: upper-top and death: upper-bottom blocks) and spatial (case: lower-top and death: lower-bottom blocks) frequencies at the national level.
Table A4. Temporal inner products for the continents (the world, Africa, Asia, Europe, North and South America) w.r.t. daily biweekly growth rates of cases (upper block) and deaths (lower block) from March 22nd, 2020 to June 11th, 2021 (447 days).
Table A5. Distance matrices for daily biweekly growth rates of cases (uppermost block) and deaths (2nd block) w.r.t. temporal frequencies, and those of cases (3rd block) and deaths (bottommost block) w.r.t. spatial frequencies.
Table A6. Variability for 117 countries with respect to daily biweekly growth rates of cases and deaths under temporal frequencies, and variability for 109 days with respect to daily biweekly growth rates of cases and deaths under spatial frequencies.
Y. Li, X. Yuan, H. Wu, L. Zhang, R. Wang, J. Chen, CVT-track: Concentrating on valid tokens for one-stream tracking, IEEE Trans. Circuits Syst. Video Technol., 34 (2024), 321–334. https://doi.org/10.1109/TCSVT.2024.3452231 doi: 10.1109/TCSVT.2024.3452231
[2]
S. Zhang, Y. Chen, ATM-DEN: Image inpainting via attention transfer module and decoder-encoder network, SPIC, 133 (2025), 117268. https://doi.org/10.1016/j.image.2025.117268 doi: 10.1016/j.image.2025.117268
[3]
F. Chen, X. Wang, Y. Zhao, S. Lv, X. Niu, Visual object tracking: A survey, Comput. Vision Image Underst., 222 (2022), 103508. https://doi.org/10.1016/j.cviu.2022.103508 doi: 10.1016/j.cviu.2022.103508
[4]
F. Zhang, S. Ma, Z. Qiu, T. Qi, Learning target-aware background-suppressed correlation filters with dual regression for real-time UAV tracking, Signal Process., 191 (2022), 108352. https://doi.org/10.1016/j.sigpro.2021.108352 doi: 10.1016/j.sigpro.2021.108352
[5]
S. Ma, B. Zhao, Z. Hou, W. Yu, L. Pu, X. Yang, SOCF: A correlation filter for real-time UAV tracking based on spatial disturbance suppression and object saliency-aware, Expert Syst. Appl., 238 (2024), 122131. https://doi.org/10.1016/j.eswa.2023.122131 doi: 10.1016/j.eswa.2023.122131
[6]
J. Lin, J. Peng, J. Chai, Real-time UAV correlation filter based on response-weighted background residual and spatio-temporal regularization, IEEE Geosci. Remote Sens. Lett., 20 (2023), 1–5. https://doi.org/10.1109/LGRS.2023.3272522 doi: 10.1109/LGRS.2023.3272522
[7]
J. Cao, H. Zhang, L. Jin, J. Lv, G. Hou, C. Zhang, A review of object tracking methods: From general field to autonomous vehicles, Neurocomputing, 585 (2024), 127635. https://doi.org/10.1016/j.neucom.2024.127635 doi: 10.1016/j.neucom.2024.127635
[8]
X. Hao, Y. Xia, H. Yang, Z. Zuo, Asynchronous information fusion in intelligent driving systems for target tracking using cameras and radars, IEEE Trans. Ind. Electron., 70 (2023), 2708–2717. https://doi.org/10.1109/TIE.2022.3169717 doi: 10.1109/TIE.2022.3169717
[9]
L. Liang, Z. Chen, L. Dai, S. Wang, Target signature network for small object tracking, Eng. Appl. Artif. Intell., 138 (2024), 109445. https://doi.org/10.1016/j.engappai.2024.109445 doi: 10.1016/j.engappai.2024.109445
[10]
R. Yao, L. Zhang, Y. Zhou, H. Zhu, J. Zhao, Z. Shao, Hyperspectral object tracking with dual-stream prompt, IEEE Trans. Geosci. Remote Sens., 63 (2025), 1–12. https://doi.org/10.1109/TGRS.2024.3516833 doi: 10.1109/TGRS.2024.3516833
[11]
N. K. Rathore, S. Pande, A. Purohit, An efficient visual tracking system based on extreme learning machine in the defence and military sector, Def. Sci. J., 74 (2024), 643–650. https://doi.org/10.14429/dsj.74.19576 doi: 10.14429/dsj.74.19576
[12]
Y. Chen, Y. Tang, Y. Xiao, Q. Yuan, Y. Zhang, F. Liu, et al., Satellite video single object tracking: A systematic review and an oriented object tracking benchmark, ISPRS J. Photogramm. Remote Sens., 210 (2024), 212–240. https://doi.org/10.1016/j.isprsjprs.2024.03.013 doi: 10.1016/j.isprsjprs.2024.03.013
[13]
W. Cai, Q. Liu, Y. Wang, HIPTrack: Visual tracking with historical prompts, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2024), 19258–19267. https://doi.org/10.1109/CVPR52733.2024.01822
[14]
L. Sun, J. Zhang, D. Gao, B. Fan, Z. Fu, Occlusion-aware visual object tracking based on multi-template updating Siamese network, Digit. Signal Process., 148 (2024), 104440. https://doi.org/10.1016/j.dsp.2024.104440 doi: 10.1016/j.dsp.2024.104440
[15]
Y. Chen, L. Wang, eMoE-Tracker: Environmental MoE-based transformer for robust event-guided object tracking, IEEE Robot. Autom. Lett., 10 (2025), 1393–1400. https://doi.org/10.1109/LRA.2024.3518305 doi: 10.1109/LRA.2024.3518305
[16]
Y. Sun, T. Wu, X. Peng, M. Li, D. Liu, Y. Liu, et al., Adaptive representation-aligned modeling for visual tracking, Knowl. Based Syst., 309 (2025), 112847. https://doi.org/10.1016/j.knosys.2024.112847 doi: 10.1016/j.knosys.2024.112847
[17]
J. Wang, S. Yang, Y. Wang, G. Yang, PPTtrack: Pyramid pooling based Transformer backbone for visual tracking, Expert Syst. Appl., 249 (2024), 123716. https://doi.org/10.1016/j.eswa.2024.123716 doi: 10.1016/j.eswa.2024.123716
[18]
C. Wu, J. Shen, K. Chen, Y. Chen, Y. Liao, UAV object tracking algorithm based on spatial saliency-aware correlation filter, Electron. Res. Arch., 33 (2025), 1446–1475. https://doi.org/10.3934/era.2025068 doi: 10.3934/era.2025068
[19]
A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, M. Kristan, Discriminative correlation filter with channel and spatial reliability, Int. J. Comput. Vision, 126 (2018), 671–688. https://doi.org/10.1007/s11263-017-1061-3 doi: 10.1007/s11263-017-1061-3
[20]
T. Xu, Z. Feng, X. Wu, J. Kittler, Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking, IEEE Trans. Image Process., 28 (2019), 5596–5609. https://doi.org/10.1109/TIP.2019.2919201 doi: 10.1109/TIP.2019.2919201
[21]
J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell., 37 (2015), 583–596. https://doi.org/10.1109/TPAMI.2014.2345390 doi: 10.1109/TPAMI.2014.2345390
[22]
E. O. Brigham, R. E. Morrow, The fast Fourier transform, IEEE Spectrum, 4 (1967), 63–70. https://doi.org/10.1109/MSPEC.1967.5217220 doi: 10.1109/MSPEC.1967.5217220
[23]
H. K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware correlation filters for visual tracking, in IEEE International Conference on Computer Vision (ICCV), (2017), 1144–1152. https://doi.org/10.1109/ICCV.2017.129
[24]
Z. Zhang, H. Peng, J. Fu, B. Li, W. Hu, Ocean: Object-aware anchor-free tracking, in European Conference on Computer Vision (ECCV), (2020), 771–787. https://doi.org/10.1007/978-3-030-58589-1_46
[25]
Y. Zhang, H. Pan, J. Wang, Enabling deformation slack in tracking with temporally even correlation filters, Neural Networks, 181 (2025), 106839. https://doi.org/10.1016/j.neunet.2024.106839 doi: 10.1016/j.neunet.2024.106839
[26]
Y. Chen, H. Wu, Z. Deng, J. Zhang, H. Wang, L. Wang, et al., Deep-feature-based asymmetrical background-aware correlation filter for object tracking, Digit. Signal Process., 148 (2024), 104446. https://doi.org/10.1016/j.dsp.2024.104446 doi: 10.1016/j.dsp.2024.104446
[27]
K. Chen, L. Wang, H. Wu, C. Wu, Y. Liao, Y. Chen, et al., Background-aware correlation filter for object tracking with deep CNN features, Eng. Lett., 32 (2024), 1353–1363. https://doi.org/10.1016/j.dsp.2024.104446 doi: 10.1016/j.dsp.2024.104446
[28]
J. Zhang, Y. He, W. Chen, L. D. Kuang, B. Zheng, CorrFormer: Context-aware tracking with cross-correlation and transformer, Comput. Electr. Eng., 114 (2024), 109075. https://doi.org/10.1016/j.compeleceng.2024.109075 doi: 10.1016/j.compeleceng.2024.109075
[29]
L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. Torr, Fully-convolutional siamese networks for object tracking, in European Conference on Computer Vision (ECCV), (2016), 850–865. https://doi.org/10.1007/978-3-319-48881-3_56
[30]
Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, S. Wang, Learning dynamic siamese network for visual object tracking, in IEEE International Conference on Computer Vision (ICCV), (2017), 1781–1789. https://doi.org/10.1109/ICCV.2017.196
[31]
B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2018), 8971–8980.
[32]
B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, SiamRPN++: Evolution of siamese visual tracking with very deep networks, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 4277–4286.
[33]
L. Zhao, C. Fan, M. Li, Z. Zheng, X. Zhang, Global-local feature-mixed network with template update for visual tracking, Pattern Recognit. Lett., 188 (2025), 111–116. https://doi.org/10.1016/j.patrec.2024.11.034 doi: 10.1016/j.patrec.2024.11.034
[34]
F. Gu, J. Lu, C. Cai, Q. Zhu, Z. Ju, RTSformer: A robust toroidal transformer with spatiotemporal features for visual tracking, IEEE Trans. Human Mach. Syst., 54 (2024), 214–225. https://doi.org/10.1109/THMS.2024.3370582 doi: 10.1109/THMS.2024.3370582
[35]
O. Abdelaziz, M. Shehata, DMTrack: Learning deformable masked visual representations for single object tracking, SIViP, 19 (2025), 61. https://doi.org/10.1007/s11760-024-03713-0 doi: 10.1007/s11760-024-03713-0
[36]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in the 31st International Conference on Neural Information Processing Systems (NIPS), (2017), 6000–6010.
[37]
O. C. Koyun, R. K. Keser, S. O. Şahin, D. Bulut, M. Yorulmaz, V. Yücesoy, et al., RamanFormer: A Transformer-based quantification approach for raman mixture components, ACS Omega, 9 (2024), 23241–23251. https://doi.org/10.1021/acsomega.3c09247 doi: 10.1021/acsomega.3c09247
[38]
H. Fan, X. Wang, S. Li, H. Ling, Joint feature learning and relation modeling for tracking: A one-stream framework, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), 341–357. https://doi.org/10.1007/978-3-031-20047-2_20
[39]
H. Zhang, J. Song, H. Liu, Y. Han, Y. Yang, H. Ma, AwareTrack: Object awareness for visual tracking via templates interaction, Image Vision Comput., 154 (2025), 105363. https://doi.org/10.1016/j.imavis.2024.105363
[40]
Z. Wang, L. Yuan, Y. Ren, S. Zhang, H. Tian, ADSTrack: Adaptive dynamic sampling for visual tracking, Complex Intell. Syst., 11 (2025), 79. https://doi.org/10.1007/s40747-024-01672-0
[41]
X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, H. Lu, Transformer tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 8122–8131. https://doi.org/10.1109/CVPR46437.2021.00803
[42]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16x16 words: Transformers for image recognition at scale, preprint, arXiv: 2010.11929. https://doi.org/10.48550/arXiv.2010.11929
[43]
B. Yan, H. Peng, J. Fu, D. Wang, H. Lu, Learning spatio-temporal transformer for visual tracking, in IEEE International Conference on Computer Vision (ICCV), (2021), 10428–10437. https://doi.org/10.1109/ICCV48922.2021.01028
[44]
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, et al., Swin Transformer: Hierarchical vision transformer using shifted windows, in IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 10012–10022. https://doi.org/10.1109/ICCV48922.2021.00986
[45]
L. Lin, H. Fan, Z. Zhang, Y. Xu, H. Ling, SwinTrack: A simple and strong baseline for transformer tracking, in Advances in Neural Information Processing Systems (NIPS), 35 (2022), 16743–16754.
[46]
Z. Song, J. Yu, Y. P. P. Chen, W. Yang, Transformer tracking with cyclic shifting window attention, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), 8781–8790. https://doi.org/10.1109/CVPR52688.2022.00859
[47]
Y. Chen, K. Chen, Four mathematical modeling forms for correlation filter object tracking algorithms and the fast calculation for the filter, Electron. Res. Arch., 32 (2024), 4684–4714. https://doi.org/10.3934/era.2024213
[48]
H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, et al., LaSOT: A high-quality benchmark for large-scale single object tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2019), 5369–5378. https://doi.org/10.1109/CVPR.2019.00552
[49]
Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2013), 2411–2418. https://doi.org/10.1109/CVPR.2013.312
Y. Huang, Y. Chen, C. Lin, Q. Hu, J. Song, Visual attention learning and antiocclusion-based correlation filter for visual object tracking, J. Electron. Imaging, 32 (2023), 013023. https://doi.org/10.1117/1.JEI.32.1.013023
[52]
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778.
[53]
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in European Conference on Computer Vision (ECCV), (2020), 213–229. https://doi.org/10.1007/978-3-030-58452-8_13
Y. Cui, C. Jiang, G. Wu, L. Wang, MixFormer: End-to-end tracking with iterative mixed attention, IEEE Trans. Pattern Anal. Mach. Intell., 46 (2024), 4129–4146. https://doi.org/10.1109/TPAMI.2024.3349519
[56]
J. Shen, Y. Liu, X. Dong, X. Lu, F. S. Khan, S. Hoi, Distilled siamese networks for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell., 44 (2022), 8896–8909. https://doi.org/10.1109/TPAMI.2021.3127492
[57]
X. Dong, J. Shen, F. Porikli, J. Luo, L. Shao, Adaptive siamese tracking with a compact latent network, IEEE Trans. Pattern Anal. Mach. Intell., 45 (2023), 8049–8062. https://doi.org/10.1109/TPAMI.2022.3230064
[58]
Z. Cao, Z. Huang, L. Pan, S. Zhang, Z. Liu, C. Fu, Towards real-world visual tracking with temporal contexts, IEEE Trans. Pattern Anal. Mach. Intell., 45 (2023), 15834–15849. https://doi.org/10.1109/TPAMI.2023.3307174
[59]
Y. Yang, X. Gu, Attention-based gating network for robust segmentation tracking, IEEE Trans. Circuits Syst. Video Technol., 35 (2025), 245–258. https://doi.org/10.1109/TCSVT.2024.3460400
[60]
W. Han, X. Dong, Y. Zhang, D. Crandall, C. Z. Xu, J. Shen, Asymmetric Convolution: An efficient and generalized method to fuse feature maps in multiple vision tasks, IEEE Trans. Pattern Anal. Mach. Intell., 46 (2024), 7363–7376. https://doi.org/10.1109/TPAMI.2024.3400873
[61]
X. Zhu, Y. Wu, D. Xu, Z. Feng, J. Kittler, Robust visual object tracking via adaptive attribute-aware discriminative correlation filters, IEEE Trans. Multimedia, 23 (2021), 2625–2638. https://doi.org/10.1109/TMM.2021.3050073
[62]
M. Danelljan, G. Häger, F. Shahbaz Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in IEEE International Conference on Computer Vision (ICCV), (2015), 4310–4318. https://doi.org/10.1109/ICCV.2015.490
[63]
J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P. H. S. Torr, End-to-end representation learning for correlation filter based tracking, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), 5000–5008. https://doi.org/10.1109/CVPR.2017.531
[64]
G. Bhat, M. Danelljan, L. V. Gool, R. Timofte, Learning discriminative model prediction for tracking, in IEEE/CVF International Conference on Computer Vision (ICCV), (2019), 6182–6191. https://doi.org/10.1109/ICCV.2019.00628
[65]
N. Wang, W. Zhou, J. Wang, H. Li, Transformer meets tracker: exploiting temporal context for robust visual tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 1571–1580. https://doi.org/10.1109/CVPR46437.2021.00162
[66]
Z. Chen, B. Zhong, G. Li, S. Zhang, R. Ji, Siamese box adaptive network for visual tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 6668–6677. https://doi.org/10.1109/CVPR42600.2020.00670
[67]
Y. Guo, H. Li, L. Zhang, L. Zhang, K. Deng, F. Porikli, SiamCAR: Siamese fully convolutional classification and regression for visual tracking, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 1176–1185. https://doi.org/10.1109/CVPR42600.2020.00630
[68]
D. Xing, N. Evangeliou, A. Tsoukalas, A. Tzes, Siamese transformer pyramid networks for real-time UAV tracking, in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), (2022), 1898–1907. https://doi.org/10.1109/WACV51458.2022.00196
Figure 1. The overall architecture of the MCWTT module
Figure 2. Diagram of the FFM architecture
Figure 3. The signal flow diagram of the FEB
Figure 4. The flowchart of the bounding box prediction head
Figure 5. Tracker performance comparison on the OTB100 dataset. (a) Success plot; (b) precision plot
Figure 6. The success plots for various challenging attributes on the OTB100 dataset: (a) BC; (b) DEF; (c) FM; (d) IPR; (e) IV; (f) LR; (g) MB; (h) OCC; (i) OPR; (j) OV; (k) SV
Figure 7. The precision plots for various challenging attributes on the OTB100 dataset: (a) BC; (b) DEF; (c) FM; (d) IPR; (e) IV; (f) LR; (g) MB; (h) OCC; (i) OPR; (j) OV; (k) SV
Figure 8. Tracker performance comparison on the UAV123 dataset. (a) Success plot; (b) precision plot
Figure 9. The success plots for various challenging attributes on the UAV123 dataset: (a) SV; (b) ARC; (c) LR; (d) FM; (e) FO; (f) PO; (g) OV; (h) BC; (i) IV; (j) VC; (k) CM; (l) SO
Figure 10. The precision plots for various challenging attributes on the UAV123 dataset: (a) SV; (b) ARC; (c) LR; (d) FM; (e) FO; (f) PO; (g) OV; (h) BC; (i) IV; (j) VC; (k) CM; (l) SO
Figure 11. Visualization of tracking performance on different video sequences
Figure 12. Performance comparison of the MCWTT and MCWTT-S models on the OTB100 dataset. (a) Success plot; (b) precision plot
Figure 13. Validation of the average overlap rate and central error rate curves of models with and without multi-scale windows in the "Car2" video sequence. (a) Average overlap rate; (b) central error rate
Figure 14. Comparison of models with and without multi-scale windows