
As an important research area in computer vision [1,2], visual object tracking (VOT) [3] has wide applications in drone tracking [4,5,6], intelligent driving [7,8], object recognition [9,10], military applications [11,12], and other fields. Its primary research involves modeling an object's appearance and other features in the initial frame of a video sequence and then accurately locating it in subsequent frames to complete the tracking task [13]. However, in practical applications, trackers often face various challenges [14], such as deformation, partial occlusion, fast movement, and background interference [15], which lead to cumulative errors and affect tracking performance [16].
Currently, mainstream object trackers include correlation filter-based trackers and deep learning-based trackers [17,18]. By transforming complex spatial-domain correlation operations into frequency-domain element-wise multiplications, correlation filter (CF) trackers [19] achieve high computational efficiency in object tracking. This efficiency has made CF methods widely used in visual object tracking [20]. For example, KCF [21] significantly improved tracking speed and accuracy by utilizing the fast Fourier transform (FFT) [22]. However, in practical applications, the circulant matrices used in CF methods introduce boundary effects, leading to unsatisfactory performance. To address this issue, BACF [23] mitigates boundary effects by expanding the search area and cropping real-world negative samples. Despite the widespread attention to CF methods, these approaches have limitations, such as filter degradation and insufficient robustness [24]. In recent years, researchers have made many improvements by introducing deep convolutional neural networks to enhance tracking accuracy, such as TEDS [25], DeepABCF [26], and DeepBACF [27]. Nevertheless, most CF methods employ only the initial frame of the object as the sole template during tracking, which limits the tracker's ability to model the object's appearance.
Recently, deep learning-based object trackers have gradually become the dominant framework in the object tracking field [28]. For example, SiamFC [29] applies the Siamese network [30] structure to VOT. These methods extract features from the object and search region via an offline-trained neural network and measure their similarity by a correlation operator. SiamRPN [31] designs a region proposal network (RPN) to achieve precise object localization, enhancing tracking accuracy. Subsequently, SiamRPN++ [32] employs a spatially aware sampling strategy to alleviate translation invariance issues caused by padding operations. Despite their success, Siamese networks are limited by the local linear property of correlation operations, which lack the capacity for modeling long-range information dependencies [33]. Thus, the Siamese networks may lose contextual semantic information, which is significant for robust tracking [34].
To address these limitations [35], scholars have recently introduced Transformer [36,37] architectures, which possess strong capabilities for modeling long-range dependencies, leading to the development of various Transformer-based trackers [38,39,40]. For example, TransT [41] replaces the correlation operations in Siamese networks with self-attention and cross-attention mechanisms, demonstrating the potential of attention mechanisms in VOT [42]. However, TransT relies solely on the first frame's template, which limits its ability to capture real-time appearance changes of the object. To overcome this issue, STARK [43] introduces a dynamic template to exploit spatio-temporal information and turns the tracking task into a bounding box prediction task. However, these methods rely on global self-attention for feature extraction, which is highly complex. To address this, the Swin Transformer [44] employs local window self-attention and a shifted window strategy. This approach reduces computational complexity to a linear level while maintaining efficiency, and it has been applied in various VOT tasks, such as SwinTrack [45]. Inspired by the Swin Transformer, CSWinTT [46] employs a novel cross-window attention calculation between the template and search region. It replaces conventional pixel-level attention with multi-scale window attention via cyclic-shift (CS) windows. This approach preserves object integrity and enriches positional information, effectively distinguishing the object from the background. Furthermore, CSWinTT adopts a multi-head, multi-scale attention mechanism to measure similarity between partitioned windows at specific scales, enhancing tracking accuracy. However, the CS strategy for each window introduces a substantial computational burden due to the large number of matrix multiplications, leading to unsatisfactory tracking speed. Therefore, optimizing the computational approach to reduce the calculation and storage resources required by the tracker is of great significance.
Motivated by correlation filter theory, which shows that the correlation between a matrix and a CS matrix can be computed efficiently using the FFT [47], a worthwhile question is whether the FFT can avoid the high storage and computational complexity caused by cyclically shifted samples in the spatial domain. Inspired by this, we propose a multi-scale cyclic-shift window Transformer object tracker based on the fast Fourier transform (MCWTT). Unlike CSWinTT, which performs CS matrix operations directly in the spatial domain, we introduce the FFT and design a frequency-domain feature fusion block. It converts the computation from the spatial domain to the frequency domain by regarding the matrix multiplication of CS windows as a correlation operation. Then, according to the convolution theorem, we transform the correlation operation into point-wise multiplication in the frequency domain, alleviating the increased storage and computational costs caused by traditional CS operations. For object localization, we borrow the idea of the STARK tracker and decouple classification from regression: a corner estimation network predicts the object's top-left and bottom-right corner coordinates, improving the model's handling of scale changes caused by rapid movements. Finally, to boost the model's spatio-temporal information mining, we add a prediction head that assesses the reliability of the best candidate samples and obtains dynamic templates carrying temporal information.
For this paper, our contributions are summarized as follows.
(1) We propose a window-level attention mechanism to replace the traditional pixel-level attention mechanism, thereby avoiding the potential disruption of the object's integrity and positional information caused by pixel-level attention. Moreover, combined with the CS operation, this mechanism promotes information exchange across windows, enriches training sample diversity, and improves the model's adaptability in complex scenarios.
(2) We design a novel attention calculation strategy that first treats the attention calculation in the spatial domain as a matrix correlation operation. We then leverage the advantages of the fast Fourier transform (FFT) to convert these operations into frequency-domain element-wise multiplications. This approach significantly improves computational efficiency and reduces storage costs, providing a more efficient solution for real-time object-tracking tasks.
(3) Extensive experimental analysis of the proposed MCWTT module was conducted on multiple datasets, including the LaSOT [48], OTB100 [49], and UAV123 [50] datasets, with detailed comparisons against various existing mainstream trackers. The experimental results demonstrate that the MCWTT tracker outperforms most existing methods across multiple evaluation metrics, showing significant advantages in handling scale variations and fast-moving objects in particular. Furthermore, ablation studies indicate that the proposed multi-scale shifted window attention mechanism plays an important role in improving tracking performance and also demonstrate the efficiency of the frequency-domain attention mechanism.
The rest of this paper is organized as follows. Section 2 introduces the preliminary knowledge. Section 3 explains the design and implementation of the proposed module. Section 4 presents experimental results comparing the proposed module with other methods on multiple datasets, as well as ablation studies. Finally, Section 5 summarizes the main findings and outlines future work.
Extracting the object's features and providing accurate prediction information in the subsequent tracking process is a key objective in object tracking tasks [51]. The circulant matrix structure expands the object samples, allowing the tracker to further exploit spatio-temporal information and deep features. However, the large amount of sample data generated by circulant matrix operations is redundant. If these data are operated on directly in the spatial domain, the result is a significant computational burden and high computational complexity. Therefore, we leverage the connection between convolution and circulant matrices and transform the operations into element-wise frequency-domain multiplications via the convolution theorem, avoiding large matrix multiplications and inversion operations.
Two-dimensional image correlation operations are special convolutions. The operation is carried out through element-wise multiplication in the frequency domain and the inverse Fourier transform, as shown in Eq (2.1) below:
$ {\boldsymbol{I}_1} \star {\boldsymbol{I}_2} = {\mathbf{Con}}{{\mathbf{v}}_2}\left( {{\boldsymbol{I}_1}, {{\bar{\boldsymbol{I}}}_2}} \right) = {\mathbf{real}}\left\{ {{\mathbf{iff}}{{\mathbf{t}}_2}\left( {{{\hat{\boldsymbol{I}}}_1} \odot \hat{\boldsymbol{I}}_2^*} \right)} \right\}, $ | (2.1) |
where the symbol $ \mathbf{\star }$ represents the two-dimensional correlation operation, and ${\mathbf{Con}}{{\mathbf{v}}_2}$ denotes the two-dimensional convolution operation. ${\boldsymbol{I}_1}{\text{, }}{\boldsymbol{I}_2} \in {\mathbb{R}^{\sqrt{\boldsymbol{N}} \times \sqrt{\boldsymbol{N}} }}$, where $\sqrt {\boldsymbol{N}} $ is an integer, and $ {\bar{\boldsymbol{I}}_2} $ represents the two-dimensional reversed signal of ${\boldsymbol{I}_2}$. The reversed signal of a two-dimensional signal is computed as follows: first, perform row-wise reversal on the matrix to obtain an intermediate matrix, and then perform column-wise reversal on this intermediate matrix. $ {\hat{\boldsymbol{I}}_1} $ and $ {\hat{\boldsymbol{I}}_2} $ represent the Fourier transforms of ${\boldsymbol{I}_1}$ and ${\boldsymbol{I}_2}$, and $ \hat{\boldsymbol{I}}_2^* $ represents the conjugate of $ {\hat{\boldsymbol{I}}_2} $. The symbol $ \odot $ indicates element-wise multiplication, ${\mathbf{iff}}{{\mathbf{t}}_2}$ represents the two-dimensional inverse Fourier transform, and ${\mathbf{real}}$ represents taking the real part.
When Eq (2.1) is computed in the spatial domain, the memory footprint is ${\boldsymbol{N}^2} + \boldsymbol{N}$ and the computational complexity is $\mathcal{O}\left({{\boldsymbol{N}^2}} \right)$. If, via the convolution theorem, the spatial-domain correlation is instead transformed into frequency-domain element-wise multiplication, the memory requirement drops to $4\boldsymbol{N}$ and the computational complexity becomes $\mathcal{O}\left({8\boldsymbol{N}{{\log }_2}\boldsymbol{N} + 4\boldsymbol{N}} \right)$.
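To make these savings concrete, the following minimal PyTorch sketch (an illustration added here, not part of the original derivation) verifies Eq (2.1): correlating ${\boldsymbol{I}_1}$ with every cyclic shift of ${\boldsymbol{I}_2}$ in the spatial domain agrees with a single element-wise product in the frequency domain.

```python
import torch

def spatial_circular_correlation(I1, I2):
    """Correlate I1 with every cyclic shift of I2 in the spatial domain (O(N^2))."""
    n = I1.shape[0]
    out = torch.empty(n, n)
    for dy in range(n):
        for dx in range(n):
            out[dy, dx] = (I1 * torch.roll(I2, shifts=(dy, dx), dims=(0, 1))).sum()
    return out

def fft_circular_correlation(I1, I2):
    """Same result via the convolution theorem: one element-wise product (O(N log N))."""
    return torch.fft.ifft2(torch.fft.fft2(I1) * torch.fft.fft2(I2).conj()).real

I1, I2 = torch.randn(8, 8), torch.randn(8, 8)
assert torch.allclose(spatial_circular_correlation(I1, I2),
                      fft_circular_correlation(I1, I2), atol=1e-4)
```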
The attention mechanism is an essential component of the Transformer model architecture, helping the model extract and fuse features from images and adjust the tracking region to enable the tracker to more effectively distinguish between foreground and background interference and more accurately track the object.
Assume the input sequence ${\boldsymbol{X}}$ has a size of $\boldsymbol{X} \in {\mathbb{R}^{N \times C}}$, where $N$ denotes the number of tokens and $C$ denotes the length of each token. $\boldsymbol{Q} = \boldsymbol{X}{\boldsymbol{W}^{{Q}}}$, $\boldsymbol{K} = \boldsymbol{X}{\boldsymbol{W}^{{K}}}$, and $\boldsymbol{V} = \boldsymbol{X}{\boldsymbol{W}^{{{V}}}}$, with $\boldsymbol{Q}{\text{, }}\boldsymbol{K}{\text{, }}\boldsymbol{V} \in {\mathbb{R}^{{{N}} \times {{D}}}}$, where ${\boldsymbol{W}^{{Q}}}{\text{, }}{\boldsymbol{W}^{{K}}}{\text{, }}{\boldsymbol{W}^{{{V}}}} \in {\mathbb{R}^{{{C}} \times {{D}}}}$ are linear transformation matrices. Given the per-head query, key, and value feature vectors ${{\boldsymbol{Q}}_i} = {\text{Split}}\left({\boldsymbol{Q}} \right) \in {\mathbb{R}^{N \times {d_i}}}$, ${{\boldsymbol{K}}_i} = {\text{Split}}\left({\boldsymbol{K}} \right) \in {\mathbb{R}^{N \times {d_i}}}$, and ${{\boldsymbol{V}}_i} = {\text{Split}}\left({\boldsymbol{V}} \right) \in {\mathbb{R}^{N \times {d_i}}}$, the attention mechanism is governed by the self-attention score matrix, which determines the degree of attention each part of the input sequence receives. As shown in Eq (2.2), the attention matrix is written as follows:
$\boldsymbol{A} = \frac{{{{\boldsymbol{Q}}_i}{\boldsymbol{K}}_i^{\rm{T}}}}{{\sqrt {{d_i}} }}{\text{, }}$ | (2.2) |
where $d_i$ denotes the dimension of each token in the head.
The Softmax function converts the normalized attention score matrix into probabilities, which are then applied to the value matrix to produce the final attention output. The result of the self-attention computation can be defined by the following equation:
$ {\text{Attention}}\left( {{{\boldsymbol{Q}}_i}{\text{, }}{{\boldsymbol{K}}_i}{\text{, }}{{\boldsymbol{V}}_i}} \right) = \textbf{Softmax} \left( {\boldsymbol{A}} \right){{\boldsymbol{V}}_i}{\text{, }} $ | (2.3) |
where $i$ denotes the index of the attention head.
Using multi-head attention enhances the performance of the attention mechanism. The multi-head attention calculation is defined as follows:
$ \begin{array}{l} {\text{Multihead}}\left( {{\boldsymbol{Q}}{\text{, }}{\boldsymbol{K}}{\text{, }}{\boldsymbol{V}}} \right) = \mathbf{Concat} \left( {{{\boldsymbol{H}}_1}{\text{, }}{{\boldsymbol{H}}_2}{\text{, }} \cdots {\text{, }}{{\boldsymbol{H}}_I}} \right){{\boldsymbol{W}}^{{O}}}{\text{, }} \\ {\text{s}}{\text{.t}}{\text{. }}\;{{\boldsymbol{H}}_i} = {\text{Attention}}\left( {{{\boldsymbol{Q}}_i}{\text{, }}{{\boldsymbol{K}}_i}{\text{, }}{{\boldsymbol{V}}_i}} \right){\text{, }} \end{array} $ | (2.4) |
where $I$ denotes the number of attention heads, ${\boldsymbol{W}^{{O}}} \in {\mathbb{R}^{{{D}} \times {{D}}}}$ is a linear transformation matrix, and $i \in \left[{1, I} \right]$.
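For illustration, the following PyTorch sketch implements Eqs (2.2)–(2.4) directly; the tensor sizes and random weights are placeholders rather than values used by the tracker.

```python
import torch

def multi_head_attention(X, WQ, WK, WV, WO, num_heads):
    """Multi-head attention following Eqs (2.2)-(2.4): split Q/K/V into heads,
    apply scaled dot-product attention per head, concatenate, and project."""
    Q, K, V = X @ WQ, X @ WK, X @ WV                 # each (N, D)
    D = Q.shape[1]
    d_i = D // num_heads                             # per-head token dimension
    heads = []
    for i in range(num_heads):
        Qi = Q[:, i * d_i:(i + 1) * d_i]             # Split(Q): (N, d_i)
        Ki = K[:, i * d_i:(i + 1) * d_i]
        Vi = V[:, i * d_i:(i + 1) * d_i]
        A = Qi @ Ki.T / d_i ** 0.5                   # attention scores, Eq (2.2)
        heads.append(torch.softmax(A, dim=-1) @ Vi)  # Eq (2.3)
    return torch.cat(heads, dim=-1) @ WO             # Concat + projection, Eq (2.4)

N, C, D, I = 16, 32, 64, 8                           # placeholder sizes
X = torch.randn(N, C)
WQ, WK, WV, WO = (torch.randn(C, D), torch.randn(C, D),
                  torch.randn(C, D), torch.randn(D, D))
out = multi_head_attention(X, WQ, WK, WV, WO, I)     # (N, D)
```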
In this section, we introduce the proposed multi-scale cyclic-shift window Transformer object tracker (MCWTT). The main framework of the module consists of four parts: the backbone network, the feature fusion block (FFB), the feature enhancement block (FEB), and the prediction analysis module. The architecture of the module is illustrated in Figure 1.
The backbone network of the tracker is based on ResNet-50 [52], which extracts features from the search region image $\mathcal{S}$, the initial static template image ${\mathcal{T}_s}$, and the dynamic template image ${\mathcal{T}_d}$ initialized from the first frame. These inputs are fed into the backbone network to obtain the corresponding feature maps, namely the search region feature map ${\mathcal{F}_S} \in {\mathbb{R}^{{H_S} \times {W_S} \times D}}$, the initial static template feature map ${\mathcal{F}_{{T_S}}} \in {\mathbb{R}^{{H_{TS}} \times {W_{TS}} \times D}}$, and the dynamic template feature map ${\mathcal{F}_{{T_D}}} \in {\mathbb{R}^{{H_{TD}} \times {W_{TD}} \times D}}$. These feature maps are then flattened and concatenated along the channel dimension to form a joint feature map ${\boldsymbol{X}}$, as shown in the following equation:
$ {\boldsymbol{X}} = \mathbf{Concat} \left( {\mathbf{Flatten} \left( {{\mathcal{F}_S}} \right){\text{, }}\mathbf{Flatten} \left( {{\mathcal{F}_{{T_S}}}} \right){\text{, }}\mathbf{Flatten} \left( {{\mathcal{F}_{{T_D}}}} \right)} \right){\text{, }} $ | (3.1) |
where $ \mathbf{Concat} $ denotes the concatenation operator and $ \mathbf{Flatten} $ denotes the flattening operator. Specifically, the feature maps are partitioned with windows of size $r \times r$ and concatenated, resulting in $L = \left({\frac{{{H_S}}}{r} \times \frac{{{W_S}}}{r} + \frac{{{H_{{T_S}}}}}{r} \times \frac{{{W_{{T_S}}}}}{r} + \frac{{{H_{{T_D}}}}}{r} \times \frac{{{W_{{T_D}}}}}{r}} \right)$ tokens.
Positional encoding ${{\boldsymbol{P}}_E}$ is added to the joint feature map ${\boldsymbol{X}}$ to incorporate spatial information, forming the feature fusion block input ${{\boldsymbol{Z}}_0}$, as shown in the following equation:
$ {{\boldsymbol{Z}}_0} = {\boldsymbol{X}} + {{\boldsymbol{P}}_E}{\text{.}} $ | (3.2) |
The positional encoding ${{\boldsymbol{P}}_E}$ is defined as follows:
$ {{\boldsymbol{P}}_E}\left( {n{\text{, }}i} \right) = \left\{ \begin{array}{ll} \sin \left( {\dfrac{n}{{{{10000}^{i/{d_m}}}}}} \right){\text{, }} & {\text{if }}i = 2k \\ \cos \left( {\dfrac{n}{{{{10000}^{i/{d_m}}}}}} \right){\text{, }} & {\text{if }}i = 2k + 1 \end{array} \right.{\text{, }} $ | (3.3) |
where $n $ represents the position of the token, $i$ represents the index of the element within the token, and ${d_m}$ is the dimension of the positional encoding vector, with a size of ${r^2}D$, ${{\boldsymbol{P}}_E} \in {\mathbb{R}^{L \times \left({{r^2}D} \right)}}$.
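A minimal PyTorch sketch of Eq (3.3) follows; the token count and encoding dimension are illustrative placeholders.

```python
import torch

def sinusoidal_positional_encoding(L, d_m):
    """Positional encoding of Eq (3.3): sin on even element indices, cos on odd."""
    n = torch.arange(L, dtype=torch.float32).unsqueeze(1)     # token positions (L, 1)
    i = torch.arange(d_m, dtype=torch.float32).unsqueeze(0)   # element indices (1, d_m)
    angles = n / torch.pow(10000.0, i / d_m)
    return torch.where(i.long() % 2 == 0, torch.sin(angles), torch.cos(angles))

PE = sinusoidal_positional_encoding(L=196, d_m=256)           # placeholder sizes
# Z0 = X + PE, as in Eq (3.2)
```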
In this section, we introduce the feature fusion block (FFB). As mentioned in the preliminaries, transferring spatial-domain computation to the frequency-domain enhances computational efficiency and reduces memory usage. The FFB has two key parts: the spatial-domain feature fusion module (SFM) and the frequency-domain feature fusion module (FFM). These two modules integrate spatial and frequency domain features, improving the model's spatial context awareness. The following sections provide detailed explanations of these two modules.
The FFB input ${{\boldsymbol{Z}}_j}$ is divided into eight feature heads, as shown in Eq (3.4)
$ \left[ {{\mathcal{X}_1}{\text{, }} \cdots {\text{, }}{\mathcal{X}_8}} \right] = \mathbf{Split} \left( {\mathbf{Reshape} \left( {{{\boldsymbol{Z}}_j}} \right)} \right){\text{, }} $ | (3.4) |
where $ \mathbf{Split} $ denotes the splitting operator, and the size of each head after splitting is ${\mathcal{X}_i} \in {\mathbb{R}^{L \times r \times r \times {d_i}}}$, $i = {\text{1, 2, }} \cdots {\text{, 8}}$.
Each feature head undergoes linear transformations to generate the query, key, and value feature tensors, i.e., $\mathcal{Q}$, $\mathcal{K}$ and $\mathcal{V}$; the linear transformation process is given by Eq (3.5)
$ \left\{ \begin{array}{l} \mathcal{Q} = {\mathbf{LP}_1}\left( {{\mathcal{X}_i}} \right) \\ \mathcal{K} = {\mathbf{LP}_2}\left( {{\mathcal{X}_i}} \right) \\ \mathcal{V} = {\mathbf{LP}_3}\left( {{\mathcal{X}_i}} \right) \end{array} \right.{\text{, }} $ | (3.5) |
where $\mathbf{LP} $ denotes linear mapping transformation. The sizes of feature maps are $\mathcal{Q} \in {\mathbb{R}^{L \times r \times r \times {d_i}}}$, $\mathcal{K} \in {\mathbb{R}^{L \times r \times r \times {d_i}}}$, and $\mathcal{V} \in {\mathbb{R}^{L \times r \times r \times {d_i}}}$. The feature tensor $\mathcal{Q}$ is reshaped into a query matrix ${\boldsymbol{Q}}$, with a size of ${\boldsymbol{Q}} \in {\mathbb{R}^{L \times \left({{r^2}{d_i}} \right)}}$.
The key and value feature tensors $\mathcal{K}$, $\mathcal{V}$ are subjected to CS operations to generate the shifted key tensor ${\mathcal{K}_C}$ and the shifted value tensor ${\mathcal{V}_C}$, respectively, as shown in Eq (3.6)
$ \left\{ \begin{array}{l} {\mathcal{K}_C} = \mathbf{CS} \left( \mathcal{K} \right) \\ {\mathcal{V}_C} = \mathbf{CS} \left( \mathcal{V} \right) \end{array} \right.{\text{, }} $ | (3.6) |
where $\mathbf{CS} $ performs a cyclic shift along the second and third dimensions of the input tensor. The sizes of the shifted key tensor ${\mathcal{K}_C}$ and the shifted value tensor ${\mathcal{V}_C}$ are ${\mathcal{K}_C} \in {\mathbb{R}^{L \times {r^2} \times r \times r \times {d_i}}}$ and ${\mathcal{V}_C} \in {\mathbb{R}^{L \times {r^2} \times r \times r \times {d_i}}}$.
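For illustration, the CS operator of Eq (3.6) can be sketched with torch.roll, stacking all $r^2$ cyclic shifts of each window; the tensor sizes below are placeholders.

```python
import torch

def cyclic_shift(x):
    """CS operator of Eq (3.6): for each r x r window, stack all r*r cyclic
    shifts along a new second dimension. Input (L, r, r, d) -> (L, r*r, r, r, d)."""
    L, r, _, d = x.shape
    shifted = [torch.roll(x, shifts=(dy, dx), dims=(1, 2))
               for dy in range(r) for dx in range(r)]
    return torch.stack(shifted, dim=1)

K = torch.randn(49, 8, 8, 32)      # L = 49 tokens, window size r = 8, d_i = 32
K_C = cyclic_shift(K)              # (49, 64, 8, 8, 32)
```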
The shifted key tensor and shifted value tensor are then reshaped to obtain the shifted key matrix ${{\boldsymbol{K}}_C}$ and the shifted value matrix ${{\boldsymbol{V}}_C}$, respectively
$ \left\{ \begin{array}{l} {{\boldsymbol{K}}_C} = \mathbf{Reshape} \left( {{\mathcal{K}_C}} \right) \\ {{\boldsymbol{V}}_C} = \mathbf{Reshape} \left( {{\mathcal{V}_C}} \right) \end{array} \right.{\text{, }} $ | (3.7) |
where ${{\boldsymbol{K}}_C} \in {\mathbb{R}^{\left( {L{r^2}} \right) \times \left( {{r^2}{d_i}} \right)}}$ and ${{\boldsymbol{V}}_C} \in {\mathbb{R}^{\left( {L{r^2}} \right) \times \left( {{r^2}{d_i}} \right)}}$.
After reshaping, the shifted key matrix ${{\boldsymbol{K}}_C}$ is transposed to obtain the transposed key matrix ${\boldsymbol{K}}_C^{\text{T}} \in {\mathbb{R}^{\left( {{r^2}{d_i}} \right) \times \left( {L{r^2}} \right)}}$. The transposed key matrix ${\boldsymbol{K}}_C^{\text{T}}$ is then divided by the scaling factor $\sqrt {{r^2}{d_i}} $ to stabilize the Softmax normalization and avoid gradient vanishing or explosion. The query matrix ${\boldsymbol{Q}}$ and the scaled key matrix ${{\boldsymbol{K}}_{CN}} = \frac{{{\boldsymbol{K}}_C^{\text{T}}}}{{\sqrt {{r^2}{d_i}} }}$ are multiplied to obtain the feature attention score matrix ${\boldsymbol{A}}$, as shown in Eq (3.8)
$ {\boldsymbol{A}} = {\boldsymbol{Q}}{{\boldsymbol{K}}_{CN}}{\text{, }} $ | (3.8) |
where ${\boldsymbol{A}} \in {\mathbb{R}^{L \times \left({L{r^2}} \right)}}$.
During the CS process, the concatenation of different regions does not guarantee the continuity of the foreground target, leading to discontinuities at the boundaries of the concatenated images, known as boundary effects. To mitigate these effects caused by CS, an attention mask matrix ${\boldsymbol{M}}$ is introduced to penalize shifted samples that are far from the base sample. The feature attention score matrix is added to the attention mask matrix to obtain the final feature attention matrix ${\boldsymbol{A}}_{M}$, as shown in Eq (3.9)
$ \left\{ \begin{array}{l} {{\boldsymbol{A}}_M} = {\boldsymbol{A}} + {\boldsymbol{M}} \\ {\boldsymbol{M}} = {\mathbf{1}_{L \times L}} \otimes {\left( {\mathbf{vec} \left( {\boldsymbol{C}} \right)} \right)^{\text{T}}} \\ {\boldsymbol{C}} = {{\boldsymbol{y}}^{\text{T}}} \times {\boldsymbol{y}} - {\mathbf{1}_{r \times r}} \end{array} \right.{\text{, }} $ | (3.9) |
where ${\mathbf{1}_{L \times L}}$ is a matrix of size $L \times L$ with all elements equal to 1 and, likewise, ${\mathbf{1}_{r \times r}}$ is a matrix of size $r \times r$ with all elements equal to 1. The symbol $ \otimes $ denotes the Kronecker product, and ${\boldsymbol{y}} \in {\mathbb{R}^{1 \times r}}$ is a penalty vector with entries ${y_i} = \left\{ \begin{array}{ll} {x_i}{\text{, }} & {x_i} > 0.5 \\ 1 - {x_i}{\text{, }} & {\text{otherwise}} \end{array} \right.$.
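The mask construction of Eq (3.9) can be sketched as follows. Since the definition of the entries of ${\boldsymbol{y}}$ is only partially specified above, the penalty profile used here is a placeholder assumption.

```python
import torch

def attention_mask(y, L):
    """Mask of Eq (3.9): penalize shifted samples far from the base sample.
    y is the 1 x r penalty vector described in the text (entries assumed given);
    returns M of shape (L, L * r * r), matching the score matrix A."""
    r = y.numel()
    C = y.reshape(-1, 1) @ y.reshape(1, -1) - torch.ones(r, r)   # C = y^T y - 1_{r x r}
    # M = 1_{L x L} kron vec(C)^T, with column-major vectorization of C:
    return torch.kron(torch.ones(L, L), C.T.reshape(1, -1))

y = torch.linspace(0, 1, steps=8)       # placeholder penalty profile for r = 8
M = attention_mask(y, L=49)             # (49, 49 * 64)
A_M = torch.randn(49, 49 * 64) + M      # A_M = A + M
```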
The Softmax function transforms raw attention scores into a probability distribution. The normalized attention matrix is then multiplied by the shifted value matrix ${{\boldsymbol{V}}_C}$ to generate the single-head attention output ${{\boldsymbol{T}}_i}$
$ {{\boldsymbol{T}}_i} = \mathbf{Softmax} \left( {\boldsymbol{A}}_{M} \right) \times {{\boldsymbol{V}}_C}{\text{, }} $ | (3.10) |
where ${{\boldsymbol{T}}_i} \in {\mathbb{R}^{L \times \left( {{r^2}{d_i}} \right)}}$ $\left( {i = 1{\text{, 2, }} \cdots {\text{, }}8} \right)$.
The single-head attention matrices are subsequently fused and passed through a feed-forward neural network (FFN) to enhance the features. Specifically, the multi-head attention matrix ${\boldsymbol{O}}$ is first formed by concatenating all single-head outputs, as shown in Eq (3.11)
$ {\boldsymbol{O}} = \mathbf{Concat} \left( {{{\mathbf{LP} }_1}\left( {{{\boldsymbol{T}}_1}} \right){\text{, }} \cdots {\text{, }}{{\mathbf{LP} }_8}\left( {{{\boldsymbol{T}}_8}} \right)} \right){\text{, }} $ | (3.11) |
where ${\boldsymbol{O}} \in {\mathbb{R}^{\left({L{r^2}} \right) \times D}}$.
The multi-head attention matrix ${\boldsymbol{O}}$ is added to the FFB input ${{\boldsymbol{Z}}_j}$ and then normalized using layer normalization, as shown in Eq (3.12)
$ {{\boldsymbol{T}}_{temp}} = \mathbf{LN} \left( {\mathbf{LP} \left( {\boldsymbol{O}} \right) + {{\boldsymbol{Z}}_j}} \right){\text{.}} $ | (3.12) |
${{\boldsymbol{T}}_{temp}}$ is passed through the FFN for non-linear transformation, and the result is added back to the original ${{\boldsymbol{T}}_{temp}}$ matrix to achieve residual connection and normalization. This process aims to preserve as much of the underlying structural information of the input data as possible. By integrating the deep semantic features with the original spatial details in an organic manner, a composite representation that combines high fidelity and strong discriminability is ultimately generated, as shown in Eq (3.13)
$ {{\boldsymbol{Z}}_{j + 1}} = \mathbf{LN} \left( {\mathbf{FFN} \left( {{{\boldsymbol{T}}_{temp}}} \right) + {{\boldsymbol{T}}_{temp}}} \right){\text{, }} $ | (3.13) |
where the FFN consists of two linear mapping layers and a non-linear rectified linear unit (ReLU) activation layer, as shown in Eq (3.14)
$ \mathbf{FFN} \left( {\boldsymbol{X}} \right) = \left( {\mathbf{ReLU} \left( {{\boldsymbol{X}}{{\boldsymbol{L}}_1} + {{\boldsymbol{B}}_1}} \right)} \right){{\boldsymbol{L}}_2} + {{\boldsymbol{B}}_2}{\text{, }} $ | (3.14) |
where $ {{\boldsymbol{L}}_1} $, $ {{\boldsymbol{L}}_2} $ are the weight matrices for linear projection, and $ {{\boldsymbol{B}}_1} $, $ {{\boldsymbol{B}}_2} $ are the bias matrices.
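Eqs (3.13) and (3.14) together form the standard Transformer feed-forward sub-layer; a minimal PyTorch sketch follows, in which the hidden width is an assumption rather than the paper's value.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Feed-forward network of Eq (3.14): two linear maps around a ReLU."""
    def __init__(self, d_model=256, d_hidden=1024):   # hidden width is an assumption
        super().__init__()
        self.lp1 = nn.Linear(d_model, d_hidden)       # X L1 + B1
        self.lp2 = nn.Linear(d_hidden, d_model)       # (.) L2 + B2

    def forward(self, x):
        return self.lp2(torch.relu(self.lp1(x)))

# Residual connection and layer normalization of Eq (3.13):
ffn, ln = FFN(), nn.LayerNorm(256)
T_temp = torch.randn(10, 256)
Z_next = ln(ffn(T_temp) + T_temp)
```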
Finally, the FFB is stacked layer by layer: each layer's output serves as the input to the next layer. After repeating this process six times, the attention computation results of all six layers are accumulated, yielding a more accurate final attention result, as shown in Eq (3.15)
$ {{\boldsymbol{Z}}_{j + 1}} = \mathbf{FFB} \left( {{{\boldsymbol{Z}}_j}} \right){\text{, }} $ | (3.15) |
where $j = 0{\text{, 1, }} \cdots {\text{, 5}}$.
Given that the data generated by circulant matrices are redundant, direct computation in the spatial domain results in high computational complexity and extensive memory usage. Consequently, transforming circulant matrix operations from spatial-domain calculations to frequency-domain element-wise multiplications is a feasible way to reduce the computational overhead. When the window size is 8, the FFM participates in network training and inference. The structure of the FFM is shown in Figure 2.
Through Fourier transforms, the query feature tensor $\mathcal{Q}$ and the key feature tensor $\mathcal{K}$, obtained from spatial domain computations, are converted from the spatial domain to the frequency domain, as shown in Eq (3.16)
$ \left\{ \begin{array}{l} \widehat{\mathcal{Q}} = {\mathbf{fft}_{\left[ {2{\text{, }}3} \right]}}\left( \mathcal{Q} \right) \\ \widehat{\mathcal{K}} = {\mathbf{fft}_{\left[ {2{\text{, }}3} \right]}}\left( \mathcal{K} \right) \end{array} \right.{\text{, }} $ | (3.16) |
where $\mathcal{Q} \in {\mathbb{R}^{L \times r \times r \times {d_i}}}$, $\mathcal{K} \in {\mathbb{R}^{L \times r \times r \times {d_i}}}$, and ${\mathbf{fft} _{\left[{2{\text{, }}3} \right]}}$ denotes the Fourier transform operation applied to the second and third dimensions of the feature tensors $\mathcal{Q}$ and $\mathcal{K}$, resulting in $\widehat{ \mathcal{Q}} \in {\mathbb{C}^{L \times r \times r \times {d_i}}}$ and $\widehat{ \mathcal{K}} \in {\mathbb{C}^{L \times r \times r \times {d_i}}}$.
The query feature tensor in the frequency domain $\widehat{ \mathcal{Q}}$ is then expanded, with each token being copied $L$ times to obtain the expanded query feature tensor ${\widehat{ \mathcal{Q}}_{kE}}$, as shown in the following equation:
$ \left\{ \begin{array}{l} {\widehat{\mathcal{Q}}_{1E}} = \mathbf{Repeat} \left( {\widehat{\mathcal{Q}}\left( {1{\text{, :, :, :}}} \right){\text{, }}L} \right) \\ {\widehat{\mathcal{Q}}_{2E}} = \mathbf{Repeat} \left( {\widehat{\mathcal{Q}}\left( {2{\text{, :, :, :}}} \right){\text{, }}L} \right) \\ \quad \vdots \\ {\widehat{\mathcal{Q}}_{LE}} = \mathbf{Repeat} \left( {\widehat{\mathcal{Q}}\left( {L{\text{, :, :, :}}} \right){\text{, }}L} \right) \end{array} \right.{\text{, }} $ | (3.17) |
where $ \mathbf{Repeat} $ denotes the repetition of each input tensor $L$ times along the first dimension, with ${\widehat{ \mathcal{Q}}_{kE}} \in {\mathbb{C}^{L \times r \times r \times {d_i}}}$, $k = 1{\text{, 2, }} \cdots {\text{, }}L$, $\widehat{ \mathcal{Q}}\left({l{\text{, :, :, :}}} \right) \in {\mathbb{C}^{1 \times r \times r \times {d_i}}}$, and $l = {\text{1, 2, }} \cdots {\text{, }} L$.
The conjugate of the key feature tensor in the frequency domain ${\widehat{ \mathcal{K}}^*}$ is computed, as shown in the following equation:
$ {\widehat{ \mathcal{K}}^*} = \mathbf{Conjugate} \left( {\widehat{ \mathcal{K}}} \right){\text{, }} $ | (3.18) |
where $ \mathbf{Conjugate} $ is the conjugation operator that negates the imaginary part of the input tensor.
To simplify the complex spatial-domain computation, the convert block performs element-wise multiplication between the expanded feature tensor ${\widehat{ \mathcal{Q}}_{kE}}$ and the conjugate feature tensor ${\widehat{ \mathcal{K}}^*}$ obtained above. The results from the different channels of the resulting tensor are then summed. The summed result in the frequency domain is transformed back to the spatial domain: the inverse Fourier transform is applied, and the real part of the transformed result is taken and stacked as the $k$-th row of the attention weight matrix ${\boldsymbol{A}}$, as shown in the following equation:
$ \begin{array}{rl} {\boldsymbol{A}}\left( {k{\text{, :}}} \right) & = \frac{1}{{\sqrt {{r^2}{d_i}} }}\mathbf{Concat} \left\{ {\mathbf{vec} {{\left[ {\sum\limits_{c = 1}^{{d_i}} {\mathcal{Q}\left( {k{\text{, :, :, }}c} \right) \star \mathcal{K}\left( {p{\text{, :, :, }}c} \right)} } \right]}^{\text{T}}}} \right\} \\ & = {\left( {\mathbf{vec} \left( {\mathbf{real} \left( {{\mathbf{ifft}_2}\left( {\frac{{\sum\limits_{c = 1}^{{d_i}} {{\widehat{\mathcal{Q}}_{kE}}\left( {{\text{:, :, :, }}c} \right) \odot {\widehat{\mathcal{K}}^*}\left( {{\text{:, :, :, }}c} \right)} }}{{\sqrt {{r^2}{d_i}} }}} \right)} \right)} \right)} \right)^{\text{T}}}{\text{, }} \end{array} $ | (3.19) |
where $ \mathbf{vec} $ denotes the vectorization operator for tensors, $ \mathbf{real} $ denotes the operation of taking the real part of the tensor, $ {\mathbf{ifft} _2} $ is the two-dimensional inverse Fourier transform, the symbol $ \odot $ represents element-wise multiplication, and the symbol $ \star $ denotes the correlation operation between matrices. For better comprehension of the convert block, the subsequent text further explains Eq (3.19).
The query feature map tensor $\mathcal{Q}$ and the key feature map tensor $\mathcal{K}$ both contain $L$ tokens of size $r \times r \times {d_i}$. After the CS operation, as shown in Eq (3.6), the shifted key tensor ${\mathcal{K}_C}$ is obtained. The query feature map tensor $\mathcal{Q}$ is reshaped into the query matrix ${\boldsymbol{Q}}$, and the shifted key tensor ${\mathcal{K}_C}$ is also reshaped into the shifted key matrix ${{\boldsymbol{K}}_C}$. The transpose of the shifted key matrix ${\boldsymbol{K}}_C^{\text{T}}$ is then computed and multiplied with the query matrix ${\boldsymbol{Q}}$ to get the feature attention matrix ${\boldsymbol{A}}$, as shown in the following equation:
$ {\boldsymbol{A}}\left( {k{\text{, }}n} \right) = \frac{{{\boldsymbol{Q}}\left( {k{\text{, :}}} \right) \times {\boldsymbol{K}}_C^{\text{T}}\left( {{\text{:, }}n} \right)}}{{\sqrt {{r^2}{d_i}} }} = \frac{{\left\langle {\mathcal{Q}\left( {k{\text{, :, :, :}}} \right){\text{, }}{\mathcal{K}_C}\left( {p{\text{, }}q{\text{, :, :, :}}} \right)} \right\rangle }}{{\sqrt {{r^2}{d_i}} }}{\text{ , }} $ | (3.20) |
where $ {\boldsymbol{Q}}\left({k{\text{, :}}} \right) $ denotes the $k$-th row of ${\boldsymbol{Q}}$, with a size of ${\boldsymbol{Q}}\left({k{\text{, :}}} \right) \in {\mathbb{R}^{{r^2} \times {d_i}}}$; $ {\boldsymbol{K}}_C^{\text{T}}\left({{\text{:, }}n} \right) $ denotes any column in $ {\boldsymbol{K}}_C^{\text{T}} $, with each column having a size of ${\boldsymbol{K}}_C^{\text{T}}\left({{\text{:, }}n} \right) \in {\mathbb{R}^{{r^2} \times {d_i}}}$, where $n = \left({q - 1} \right)L + p$; $ {\boldsymbol{A}}\left({k{\text{, }}n} \right) $ represents the element in the $k$-th row and $n$-th position of ${\boldsymbol{A}}$. $ \left\langle {\mathcal{Q}\left({k{\text{, :, :, :}}} \right){\text{, }}{\mathcal{K}_C}\left({p{\text{, }}q{\text{, :, :, :}}} \right)} \right\rangle $ represents the inner product between the query feature map tensor $ \mathcal{Q} $ and the shifted key tensor $ {\mathcal{K}_C} $. $ \mathcal{Q}\left({k{\text{, :, :, :}}} \right) $ denotes the $k$-th token of $ \mathcal{Q} $, and $ {\mathcal{K}_C}\left({p{\text{, }}q{\text{, :, :, :}}} \right) $ denotes the $q$-th token in the $p$-th group of $ {\mathcal{K}_C} $, where $p = 1{\text{, 2, }} \cdots {\text{, }}L$ and $q = 1{\text{, 2, }} \cdots {\text{, }}{r^2}$.
As indicated in the preliminaries, matrix multiplication is convertible to tensor correlation operations, as shown in Eq (3.21)
$ \begin{array}{l} \left[ {{\boldsymbol{A}}\left( {k{\text{, }}\left( {1 - 1} \right)L + p} \right){\text{, }}{\boldsymbol{A}}\left( {k{\text{, }}\left( {2 - 1} \right)L + p} \right){\text{, }} \cdots {\text{, }}{\boldsymbol{A}}\left( {k{\text{, }}\left( {{r^2} - 1} \right)L + p} \right)} \right] \\ = \frac{1}{{\sqrt {{r^2}{d_i}} }}\mathbf{vec} {\left[ {\sum\limits_{c = 1}^{{d_i}} {\mathcal{Q}\left( {k{\text{, :, :, }}c} \right) \star {\mathcal{K}_C}\left( {p{\text{, 1, :, :, }}c} \right)} } \right]^{\text{T}}} = \frac{1}{{\sqrt {{r^2}{d_i}} }}\mathbf{vec} {\left[ {\sum\limits_{c = 1}^{{d_i}} {\mathcal{Q}\left( {k{\text{, :, :, }}c} \right) \star \mathcal{K}\left( {p{\text{, :, :, }}c} \right)} } \right]^{\text{T}}}{\text{, }} \end{array} $ | (3.21) |
where $ \mathbf{vec} $ denotes the vectorization operator.
The correlation operation is viewed as a convolution operation between tensors, as shown in Eq (3.22)
$ \mathcal{Q}\left( {k{\text{, :, :, }}c} \right) \star {\mathcal{K}_C}\left( {p{\text{, 1, :, :, }}c} \right) = \mathcal{Q}\left( {k{\text{, :, :, }}c} \right)*{\overline {\mathcal{K}} _C}\left( {p{\text{, 1, :, :, }}c} \right){\text{, }} $ | (3.22) |
where $ {\overline {\mathcal{K}} _C}\left({p{\text{, 1, :, :, }}c} \right) $ denotes the flipped version of $ {\mathcal{K}_C} $, $p = 1{\text{, 2, }} \cdots {\text{, }}L$, and $q = 1{\text{, 2, }} \cdots {\text{, }}{r^2}$.
Thus, Eq (3.21) is rewritten as Eq (3.23)
$ \begin{array}{l} \left[ {{\boldsymbol{A}}\left( {k{\text{, }}\left( {1 - 1} \right)L + p} \right){\text{, }}{\boldsymbol{A}}\left( {k{\text{, }}\left( {2 - 1} \right)L + p} \right){\text{, }} \cdots {\text{, }}{\boldsymbol{A}}\left( {k{\text{, }}\left( {{r^2} - 1} \right)L + p} \right)} \right] \\ = \frac{1}{{\sqrt {{r^2}{d_i}} }}\mathbf{vec} {\left[ {\sum\limits_{c = 1}^{{d_i}} {\mathcal{Q}\left( {k{\text{, :, :, }}c} \right)*{{\overline {\mathcal{K}} }_C}\left( {p{\text{, 1, :, :, }}c} \right)} } \right]^{\text{T}}}{\text{.}} \end{array} $ | (3.23) |
By traversing $p$ and concatenating the results into a new row vector, the column index $n = \left({q - 1} \right)L + p$ is traversed as well, as shown in the following equation:
$ {\boldsymbol{A}}\left( {k{\text{, :}}} \right) = \frac{1}{{\sqrt {{r^2}{d_i}} }}\mathbf{Concat} \left\{ {\mathbf{vec} {{\left[ {\sum\limits_{c = 1}^{{d_i}} {\mathcal{Q}\left( {k{\text{, :, :, }}c} \right)} *{{\overline {\mathcal{K}} }_C}\left( {p{\text{, :, :, }}c} \right)} \right]}^{\text{T}}}} \right\}{\text{.}} $ | (3.24) |
According to the convolution theorem, the correlation operation in the spatial domain is interchangeable with element-wise multiplication in the frequency domain. Hence, Eq (3.19) is proven. The proposed FFB algorithm in this paper is shown in Algorithm 1.
Algorithm 1: Feature fusion block (FFB) |
Input: Input features ${{\boldsymbol{Z}}_0}$, the number of heads in the FFB $I$, and the vector ${\boldsymbol{r}} = \left[{{\text{1, 2, 4, 8, 1, 2, 4, 8}}} \right]$ constructed from window sizes of each attention head and number of layers in the FFB $J$. Output: Output features ${{\boldsymbol{Z}}_6}$. |
1: Split ${{\boldsymbol{Z}}_0}$ into $I$ heads.
2: For $i \leftarrow 0$ to $I - 1$
3:  For $j \leftarrow 0$ to $J - 1$
4:   If ${r_{i + 1}} = 1$ or ${r_{i + 1}} = 2$ or ${r_{i + 1}} = 4$
5:    Let ${{\boldsymbol{Z}}_j}$ be the input to the SFM in Figure 1, and calculate the attention weight matrix ${\boldsymbol{A}}$ via Eqs (3.6)–(3.9).
6:   Else
7:    Let ${{\boldsymbol{Z}}_j}$ be the input to the FFM in Figure 2, and calculate the attention weight matrix ${\boldsymbol{A}}$ via Eqs (3.16)–(3.19).
8:   End if
9:  End for
10: End for
11: The fused features are further calculated using Eqs (3.11)–(3.14).
12: Return ${{\boldsymbol{Z}}_6}$.
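Putting Eqs (3.16)–(3.19) together, the FFM score computation can be sketched as follows. This is an illustrative reading of the block, with the output column ordering chosen to match $n = \left( {q - 1} \right)L + p$ in Eq (3.20); the tensor sizes are placeholders.

```python
import torch

def ffm_attention_scores(Q, K):
    """Frequency-domain attention scores of Eq (3.19), as a sketch.
    Row A(k, :) correlates query token k with all r*r cyclic shifts of every
    key token, via element-wise products in the Fourier domain.
    Q, K: real tensors of shape (L, r, r, d_i). Returns A of shape (L, L*r*r)."""
    L, r, _, d_i = Q.shape
    Q_hat = torch.fft.fft2(Q, dim=(1, 2))                # Eq (3.16)
    K_hat_conj = torch.fft.fft2(K, dim=(1, 2)).conj()    # Eq (3.18)
    # Pair every query token with every key token and sum over channels c:
    prod = (Q_hat[:, None] * K_hat_conj[None, :]).sum(dim=-1)    # (L, L, r, r)
    corr = torch.fft.ifft2(prod, dim=(2, 3)).real / (r * r * d_i) ** 0.5
    # Reorder so that the column index is n = (q - 1) * L + p, as in Eq (3.20):
    return corr.permute(0, 2, 3, 1).reshape(L, L * r * r)

A = ffm_attention_scores(torch.randn(49, 8, 8, 32), torch.randn(49, 8, 8, 32))
```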
The feature enhancement block (FEB) and the FFB have an identical number of stacked layers: both have six layers, each containing eight attention heads. The main structure of the FEB is made up of multi-head self-attention (MSA) layers, multi-head cross-attention (MCA) layers, and feed-forward networks (FFNs). This part draws on the DETR [53] object detection model, which employs learnable queries to steer object detection. Figure 3 displays the signal flow diagram of the FEB, where the query embedding ${{\boldsymbol{Q}}_E}$ is a positional encoding learned during model training, aiding the model in obtaining the spatial location information of the object.
First, the FEB input is initialized with zero vectors, i.e., ${{\boldsymbol{d}}_0} = {\boldsymbol{Query}} = \mathbf{0} \in {\mathbb{R}^{1 \times D}}$, and added to the query embedding vector ${{\boldsymbol{Q}}_E}$ to form the query matrix ${{\boldsymbol{Q}}_s}$ and the key matrix ${{\boldsymbol{K}}_s}$, which carry the object's position information. The value matrix ${{\boldsymbol{V}}_s}$ is provided directly by the object sequence ${\boldsymbol{Query}}$, as shown in the following equation:
$ \left\{ \begin{array}{l} \left[ {{{\boldsymbol{Q}}_{s1}}{\text{, }}{{\boldsymbol{Q}}_{s2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{Q}}_{s8}}} \right] = \mathbf{Split} \left[ {{{\boldsymbol{Q}}_s}{{\boldsymbol{W}}_{{Q_s}}}} \right] \\ \left[ {{{\boldsymbol{K}}_{s1}}{\text{, }}{{\boldsymbol{K}}_{s2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{K}}_{s8}}} \right] = \mathbf{Split} \left[ {{{\boldsymbol{K}}_s}{{\boldsymbol{W}}_{{K_s}}}} \right] \\ \left[ {{{\boldsymbol{V}}_{s1}}{\text{, }}{{\boldsymbol{V}}_{s2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{V}}_{s8}}} \right] = \mathbf{Split} \left[ {{{\boldsymbol{V}}_s}{{\boldsymbol{W}}_{{V_s}}}} \right] \end{array} \right.{\text{.}} $ | (3.25) |
The obtained ${{\boldsymbol{Q}}_s}$, ${{\boldsymbol{K}}_s}$, and ${{\boldsymbol{V}}_s}$ are fed into the MSA layer, which computes the similarity between ${{\boldsymbol{Q}}_s}$ and ${{\boldsymbol{K}}_s}$ to identify the features most relevant to the object. The output ${\boldsymbol{s}}$ of the MSA layer is computed as shown in the following equation:
$ \left\{ \begin{array}{l} {{\boldsymbol{z}}_{si}} = {\text{Attention}}\left( {{{\boldsymbol{Q}}_{si}}{\text{, }}{{\boldsymbol{K}}_{si}}{\text{, }}{{\boldsymbol{V}}_{si}}} \right) = \mathbf{Softmax} \left( {\frac{{{{\boldsymbol{Q}}_{si}}{\boldsymbol{K}}_{si}^{\text{T}}}}{{\sqrt {{d_i}} }}} \right){{\boldsymbol{V}}_{si}} \\ {\boldsymbol{s}} = {\text{MSA}}\left( {{{\boldsymbol{Q}}_s}{\text{, }}{{\boldsymbol{K}}_s}{\text{, }}{{\boldsymbol{V}}_s}} \right) = \mathbf{Concat} \left( {{{\boldsymbol{z}}_{s1}}{\text{, }}{{\boldsymbol{z}}_{s2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{z}}_{s8}}} \right){{\boldsymbol{W}}_{\boldsymbol{s}}} \end{array} \right.{\text{, }} $ | (3.26) |
where ${{\boldsymbol{W}}_{{Q_s}}}$, $ {{\boldsymbol{W}}_{{K_s}}} $, $ {{\boldsymbol{W}}_{{V_s}}} $, and ${{\boldsymbol{W}}_{\boldsymbol{s}}}$ are linear projection weight matrices learned through the network, ${\boldsymbol{s}} \in {\mathbb{R}^{1 \times D}}$.
To avoid the occurrence of gradient explosion or vanishing, the output of the MSA layer ${\boldsymbol{s}}$ is subjected to residual connection and followed by a normalization layer to obtain the intermediate state output, as shown in the following equation:
$ {{\boldsymbol{h}}_1} = \mathbf{LN} \left( {\mathbf{LP} \left( {\boldsymbol{s}} \right) + {{\boldsymbol{d}}_j}} \right){\text{.}} $ | (3.27) |
The query embedding vector ${{\boldsymbol{Q}}_E}$ is added to the intermediate state output $ {{\boldsymbol{h}}_1} $, resulting in the query matrix ${{\boldsymbol{Q}}_c}$ for the MCA layer. The final output of the feature fusion block ${{\boldsymbol{Z}}_6}$ is combined with the position encoding vector ${{\boldsymbol{P}}_E}$, forming the key matrix for the MCA layer ${{\boldsymbol{K}}_c}$; ${{\boldsymbol{Z}}_6}$ also serves as the value matrix in the MCA layer calculation, as illustrated in Eq (3.28)
$ \left\{ \begin{array}{l} {{\boldsymbol{Q}}_c} = {{\boldsymbol{h}}_1} + {{\boldsymbol{Q}}_E} \\ {{\boldsymbol{K}}_c} = {{\boldsymbol{Z}}_6} + {{\boldsymbol{P}}_E} \\ {{\boldsymbol{V}}_c} = {{\boldsymbol{Z}}_6} \end{array} \right.{\text{.}} $ | (3.28) |
Similarly, ${{\boldsymbol{Q}}_c}$, ${{\boldsymbol{K}}_c}$, and $ {{\boldsymbol{V}}_c} $ are processed as shown in the following equation:
$ \left\{ \begin{array}{l} \left[ {{{\boldsymbol{Q}}_{c1}}{\text{, }}{{\boldsymbol{Q}}_{c2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{Q}}_{c8}}} \right] = \mathbf{Split} \left[ {{{\boldsymbol{Q}}_c}{{\boldsymbol{W}}_{{Q_c}}}} \right] \\ \left[ {{{\boldsymbol{K}}_{c1}}{\text{, }}{{\boldsymbol{K}}_{c2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{K}}_{c8}}} \right] = \mathbf{Split} \left[ {{{\boldsymbol{K}}_c}{{\boldsymbol{W}}_{{K_c}}}} \right] \\ \left[ {{{\boldsymbol{V}}_{c1}}{\text{, }}{{\boldsymbol{V}}_{c2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{V}}_{c8}}} \right] = \mathbf{Split} \left[ {{{\boldsymbol{V}}_c}{{\boldsymbol{W}}_{{V_c}}}} \right] \end{array} \right.{\text{.}} $ | (3.29) |
The calculation process of the MCA layer is as follows:
$ \left\{ \begin{array}{l} {{\boldsymbol{z}}_{ci}} = {\text{Attention}}\left( {{{\boldsymbol{Q}}_{ci}}{\text{, }}{{\boldsymbol{K}}_{ci}}{\text{, }}{{\boldsymbol{V}}_{ci}}} \right) = \mathbf{Softmax} \left( {\frac{{{{\boldsymbol{Q}}_{ci}}{\boldsymbol{K}}_{ci}^{\text{T}}}}{{\sqrt {{d_i}} }}} \right){{\boldsymbol{V}}_{ci}} \\ {\boldsymbol{c}} = {\text{MCA}}\left( {{{\boldsymbol{Q}}_c}{\text{, }}{{\boldsymbol{K}}_c}{\text{, }}{{\boldsymbol{V}}_c}} \right) = \mathbf{Concat} \left( {{{\boldsymbol{z}}_{c1}}{\text{, }}{{\boldsymbol{z}}_{c2}}{\text{, }} \cdots {\text{, }}{{\boldsymbol{z}}_{c8}}} \right){{\boldsymbol{W}}_c} \end{array} \right.{\text{, }} $ | (3.30) |
where $ {{\boldsymbol{W}}_{{Q_c}}} $, $ {{\boldsymbol{W}}_{{K_c}}} $, $ {{\boldsymbol{W}}_{{V_c}}} $, and $ {{\boldsymbol{W}}_c} $ are the linear projection weight matrices learned through the network, with ${\boldsymbol{c}} \in {\mathbb{R}^{1 \times D}}$.
The output of the MCA layer ${\boldsymbol{c}}$ is subjected to residual connection and normalization to obtain the intermediate output state ${{\boldsymbol{h}}_2}$, as shown in the following equation:
$ {{\boldsymbol{h}}_2} = \mathbf{LN} \left( {\mathbf{LP} \left( {\boldsymbol{c}} \right) + {{\boldsymbol{h}}_1}} \right){\text{.}} $ | (3.31) |
The process described above is repeated six times, constituting the entire FEB workflow. The final output of the FEB is then fed into the prediction analysis module for subsequent tracking and prediction. The overall process is represented by Eq (3.32)
$ {{\boldsymbol{d}}_{j + 1}} = \mathbf{FEB} \left( {{{\boldsymbol{d}}_j}} \right){\text{, }} $ | (3.32) |
where $j = 0{\text{, 1, }} \cdots {\text{, 5}}$.
The FEB, together with the FFB, forms the encoder-decoder structure of the Transformer, which helps the model build temporal context awareness and obtain richer spatiotemporal contextual information, thereby further enhancing the precision of the model.
The prediction analysis module includes a bounding box prediction head and a classification score head. The enhanced feature $ {{\boldsymbol{d}}_6} $ is directed into these heads for analysis, thus accomplishing further tracking and prediction.
The bounding box prediction head is inspired by the corner prediction head in the STARK [43] algorithm. It calculates the expected values of the probability distributions of the two probability maps corresponding to the top-left and bottom-right corners of the tracked object's bounding box, thereby obtaining the coordinates of these corners, as illustrated in Figure 4.
First, the final output of the feature fusion block ${{\boldsymbol{Z}}_6}$ is cropped to obtain the search region features ${{\boldsymbol{Z}}_S} \in {\mathbb{R}^{\left({{H_S}{W_S}} \right) \times D}}$. The enhanced feature $ {{\boldsymbol{d}}_6} $ is also used to calculate the similarity modulation vector ${\boldsymbol{m}}$ between them, as shown in the following equation:
$ {\boldsymbol{m}} = {{\boldsymbol{Z}}_S}{\boldsymbol{d}}_6^{\text{T}}{\text{, }} $ | (3.33) |
where ${\boldsymbol{m}} \in {\mathbb{R}^{\left({{H_S}{W_S}} \right) \times 1}}$.
To better enhance the important tracking regions and weaken the interference of other regions on the tracking, the similarity modulation vector ${\boldsymbol{m}}$ is combined with the search region features ${{\boldsymbol{Z}}_S}$ via element-wise multiplication, yielding the enhanced search region features ${{\boldsymbol{Z}}_E}$, as shown in the following equation:
$ {{\boldsymbol{Z}}_E} = {{\boldsymbol{Z}}_S} \odot \mathbf{Repeat} \left( {{\boldsymbol{m}}, D} \right). $ | (3.34) |
Here, $\mathbf{Repeat} \left({{\boldsymbol{m}}, D} \right) \in {\mathbb{R}^{\left({{H_S}{W_S}} \right) \times D}}$ denotes the matrix generated by copying the column vector ${\boldsymbol{m}}$ $D$ times and stacking the copies side by side.
The obtained enhanced search region features ${{\boldsymbol{Z}}_E}$ are reshaped into an enhanced feature map $\varepsilon \in {\mathbb{R}^{{H_S} \times {W_S} \times D}}$, which undergoes processing through a fully convolutional network (FCN), resulting in two probability distribution maps for the bounding box's top-left and bottom-right corners, namely ${{\boldsymbol{P}}_1}\left({x{\text{, }}y} \right)$ and ${{\boldsymbol{P}}_2}\left({x{\text{, }}y} \right)$, as shown in the following equation:
$ \left\{ \begin{array}{l} {{\boldsymbol{P}}_1}\left( {x{\text{, }}y} \right) = {\mathbf{FCN}_1}\left( \varepsilon \right) \\ {{\boldsymbol{P}}_2}\left( {x{\text{, }}y} \right) = {\mathbf{FCN}_2}\left( \varepsilon \right) \end{array} \right.{\text{, }} $ | (3.35) |
where each FCN consists of five convolutional layers, batch normalization layers, and ReLU activation layers (Conv-BN-ReLU); ${\mathbf{FCN}_i}$ ($i = 1{\text{, }}2$) denotes one of two FCN branches with different parameters.
To determine the coordinates of the bounding box's two corner points, the two probability maps ${{\boldsymbol{P}}_1}\left({x{\text{, }}y} \right)$ and ${{\boldsymbol{P}}_2}\left({x{\text{, }}y} \right)$ are utilized through the following equation. After acquiring these coordinates, the final predicted bounding box position candidate samples ${\mathcal{C}_p} \in {\mathbb{R}^{{H_C} \times {W_C} \times 3}}$ are determined.
$ \left\{ \begin{array}{l} \left( {{{\hat x}_1}{\text{, }}{{\hat y}_1}} \right) = \left( {\sum\limits_{y = 0}^H {\sum\limits_{x = 0}^W {x{{\boldsymbol{P}}_1}\left( {x{\text{, }}y} \right)} } {\text{, }}\sum\limits_{y = 0}^H {\sum\limits_{x = 0}^W {y{{\boldsymbol{P}}_1}\left( {x{\text{, }}y} \right)} } } \right) \\ \left( {{{\hat x}_2}{\text{, }}{{\hat y}_2}} \right) = \left( {\sum\limits_{y = 0}^H {\sum\limits_{x = 0}^W {x{{\boldsymbol{P}}_2}\left( {x{\text{, }}y} \right)} } {\text{, }}\sum\limits_{y = 0}^H {\sum\limits_{x = 0}^W {y{{\boldsymbol{P}}_2}\left( {x{\text{, }}y} \right)} } } \right) \end{array} \right.{\text{, }} $ | (3.36) |
where $ \left({{{\hat x}_1}{\text{, }}{{\hat y}_1}} \right) $ represents the coordinates of the top-left corner and, similarly, $ \left({{{\hat x}_2}{\text{, }}{{\hat y}_2}} \right) $ represents the coordinates of the bottom-right corner. If the prediction score of a candidate sample exceeds the set threshold, the template area is expanded to twice its original size, centered on the sample, to form and update the dynamic template.
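A minimal sketch of the soft-argmax readout of Eq (3.36) follows; the map size is a placeholder, and the softmax normalization reflects common practice for corner heads rather than a detail stated above.

```python
import torch

def soft_argmax_corners(P1, P2):
    """Expected corner coordinates of Eq (3.36). P1, P2: (H, W) probability
    maps for the top-left and bottom-right corners, each summing to 1."""
    H, W = P1.shape
    ys = torch.arange(H, dtype=torch.float32).unsqueeze(1)   # (H, 1)
    xs = torch.arange(W, dtype=torch.float32).unsqueeze(0)   # (1, W)
    return ((xs * P1).sum(), (ys * P1).sum()), ((xs * P2).sum(), (ys * P2).sum())

# The probability maps can come from a softmax over the flattened FCN output:
logits = torch.randn(24, 24)                                 # placeholder map size
P = torch.softmax(logits.flatten(), dim=0).reshape(24, 24)
(x1, y1), (x2, y2) = soft_argmax_corners(P, P)
```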
During the object tracking process, the object may undergo varying changes as the tracking time progresses. Therefore, real-time capture of its latest state is crucial for accurate tracking. As shown in Figure 1, we propose a dynamic template update mechanism sampled from intermediate frames. It serves as an additional input to capture the object's changes in appearance over time and offers more temporal details. However, when the object is entirely occluded or moves out of view, the update of the dynamic template may become unreliable. Therefore, we designed a simple candidate sample scoring head to assess whether the current sample needs to be updated.
First, the enhanced feature $ {{\boldsymbol{d}}_6} $ is processed through a multilayer perceptron to improve the model's learning ability. The output is then passed through an activation function to derive the score, as shown in the following equation:
$ s = \mathbf{Sigmoid} \left( {\mathbf{MLP} \left( {{{\boldsymbol{d}}_6}} \right)} \right){\text{, }} $ | (3.37) |
where $ \mathbf{MLP} $ is a three-layer perceptron with ReLU activation functions employed in all hidden layers. The output layer employs the function $\mathbf{Sigmoid} \left(x \right) = \frac{1}{{1 + {e^{ - x}}}}$ as the activation function to map the output values to the 0 to 1 range. As shown in Figure 1, when the score $s$ exceeds the set threshold, the model considers the candidate sample reliable and updates the dynamic template; otherwise, it does not update. Through this mechanism, the model can more effectively judge the reliability of candidate samples in complex environments, thereby improving the stability and accuracy of tracking.
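A sketch of the scoring head of Eq (3.37); the hidden width and the update threshold of 0.5 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Candidate scoring head of Eq (3.37): a three-layer MLP with ReLU hidden
    activations and a sigmoid output in [0, 1]."""
    def __init__(self, d=256, hidden=256):   # hidden width is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, d6):
        return torch.sigmoid(self.mlp(d6))

score = ScoreHead()(torch.randn(1, 256))
update_template = score.item() > 0.5      # threshold value is an assumption
```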
We draw on the loss function design concept in the DETR architecture, adopting an end-to-end training method and using the generalized intersection over union (GIOU) loss to optimize the predicted bounding boxes. By directly optimizing the intersection over union (IOU) between the predicted boxes and the ground-truth boxes, the accuracy and robustness of the tracking are enhanced. In addition, $ {L_1} $ loss is introduced to optimize the position and size of the bounding box, enabling the model to predict the object's position more accurately. The loss function is shown in the following equation:
$ L = \sum\limits_{i = 1}^B {{\lambda _{GIOU}}{L_{GIOU}}\left( {{{\boldsymbol{b}}_i}{\text{, }}{{\boldsymbol{g}}_i}} \right)} + {\lambda _{{L_1}}}{L_1}\left( {{{\boldsymbol{b}}_i}{\text{, }}{{\boldsymbol{g}}_i}} \right){\text{, }} $ | (3.38) |
where $ {{\boldsymbol{b}}_i} $ is the vector composed of the top-left x-coordinate, the y-coordinate, and the width and height of the $i$-th predicted bounding box, and $ {{\boldsymbol{g}}_i} $ is the vector comprising the top-left x-coordinate, the y-coordinate, and the width and height of the bounding box of the $i$-th training sample's ground-truth box. $ {\lambda _{GIOU}} $ and $ {\lambda _{{L_1}}} $ are non-negative hyperparameters, and $B$ denotes the batch size of the training samples.
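A self-contained sketch of Eq (3.38) for boxes in $(x, y, w, h)$ format follows; the loss weights follow common practice for this loss combination and are assumptions rather than the paper's values.

```python
import torch

def giou_l1_loss(pred, gt, lam_giou=2.0, lam_l1=5.0):
    """Loss of Eq (3.38) for boxes in (x, y, w, h) format (top-left corner plus
    size). The weights lam_giou and lam_l1 are assumptions, not the paper's."""
    px1, py1 = pred[:, 0], pred[:, 1]
    px2, py2 = px1 + pred[:, 2], py1 + pred[:, 3]
    gx1, gy1 = gt[:, 0], gt[:, 1]
    gx2, gy2 = gx1 + gt[:, 2], gy1 + gt[:, 3]
    inter = ((torch.min(px2, gx2) - torch.max(px1, gx1)).clamp(min=0)
             * (torch.min(py2, gy2) - torch.max(py1, gy1)).clamp(min=0))
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / union.clamp(min=1e-7)
    # Smallest enclosing box gives the GIoU penalty term:
    encl = ((torch.max(px2, gx2) - torch.min(px1, gx1))
            * (torch.max(py2, gy2) - torch.min(py1, gy1))).clamp(min=1e-7)
    giou = iou - (encl - union) / encl
    return (lam_giou * (1 - giou) + lam_l1 * (pred - gt).abs().sum(dim=1)).sum()

loss = giou_l1_loss(torch.rand(8, 4), torch.rand(8, 4))
```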
In some previous studies, researchers believed that joint learning of the localization and classification tasks may lead to suboptimal results. Thus, decoupling these two tasks is necessary. The training process is divided into two independent stages for the localization and classification tasks, which are optimized to achieve the best solution. In the first stage, the whole network, except for the candidate sample scoring head, is trained end-to-end using a localization-related loss function. This stage aims to ensure the object's inclusion in all search regions, thus improving the model's localization ability. The second stage focuses on optimizing the scoring head, and its loss function is defined as shown in the following equation:
$ {L_{ce}} = - \sum\limits_{i = 1}^B {\left( {{l_i}\mathbf{log}\left( {{s_i}} \right) + \left( {1 - {l_i}} \right)\mathbf{log} \left( {1 - {s_i}} \right)} \right)} {\text{, }} $ | (3.39) |
where $ {l_i} $ is the binary label of the $i$-th training sample, and $ {s_i} $ is the predicted probability of the $i$-th sample being the object. This two-stage training strategy enables the model to learn the key features of localization and classification separately, thus attaining high tracking precision and robustness. It enhances the model's object recognition and tracking accuracy in different settings and readies it for seamless integration into practical use.
In this section, we introduce the comprehensive evaluation of the MCWTT tracking algorithm on three benchmark datasets: LaSOT [48], OTB100 [49], and UAV123 [50]. The hardware environment for the experiments was a server equipped with an Intel Core i9-12900K CPU and an Nvidia RTX 3090 24 GB GPU, with 128 GB of RAM. The software environment was based on Python 3.7 and PyTorch 1.8.2. During training, the model's basic training unit consisted of two templates and one search image. The input templates were 128 × 128 pixels, about twice the area of the target box, while the search region was 384 × 384 pixels, about five times the area of the target box. The backbone network was initialized with parameters from a pre-trained ResNet-50 network. The Transformer architecture comprised six layers of encoder-decoders, including MSA layers, MCA layers, and FFNs. Each layer of the encoder-decoders had eight heads, with 32 channels per head and a dropout rate of 0.1 to prevent overfitting. During model training, the network parameters were optimized using the Adam [54] optimizer, with initial learning rates of $10^{-5}$ for the backbone network and $10^{-4}$ for the other network components.
LaSOT [48] is a large-scale tracking benchmark dataset with high-quality annotations covering almost all real-world challenges, including fast motion, scale variation, and camera movement. Its test set contains 280 video sequences, with each frame labelled with a high-quality bounding box. Tracker performance is evaluated using the area under the curve (AUC), the normalized precision (PNorm), and precision (P) scores. AUC shows the success rate across IoU thresholds, PNorm measures precision under the normalized distance, and P directly assesses the accuracy of the tracking results. As shown in Table 1, the MCWTT algorithm achieved an AUC score of 65.6%, a PNorm score of 74.5%, and a P score of 70%, surpassing many advanced trackers. Though outperformed by STARK [43] and MixFormer [55], it remains competitive.
Method | Year | LaSOT AUC(%) | LaSOT PNorm(%) | LaSOT P(%) | OTB100 SR(%) | OTB100 PR(%) | UAV123 AUC(%) | UAV123 P(%) |
SiamRPN++ [32] | 2019 | 49.6 | 56.9 | 49.1 | – | – | 64.2 | 84.0 |
TransT [41] | 2021 | 64.2 | 73.5 | 68.2 | – | – | 66.0 | 85.2 |
STARK [43] | 2021 | 65.8 | 75.2 | 69.8 | – | – | 68.4 | 89.0 |
DSTrpn [56] | 2021 | 43.4 | 54.4 | – | 64.6 | 85.7 | – | – |
CLNet*–BAN [57] | 2022 | 52.9 | 62.3 | 52.6 | – | – | – | – |
TCTrack++ [58] | 2022 | 43.5 | 48.4 | 41.4 | 54.3 | 72.0 | 51.9 | 73.1 |
RTSFormer [34] | 2024 | 62.3 | 65.6 | 65.5 | – | – | 67.5 | – |
AGST–BR [59] | 2024 | 56.7 | – | 58.3 | – | – | 66.3 | – |
SiamRPN++–ACM [60] | 2024 | 52.3 | – | – | 71.2 | – | – | – |
MixFormer [55] | 2024 | 69.6 | 79.9 | 75.9 | 71.6 | 94.4 | 68.7 | 89.5 |
MCWTT | Ours | 65.6 | 74.5 | 70.0 | 66.7 | 86.6 | 68.7 | 89.2 |
The OTB100 [49] dataset is a widely utilized benchmark for evaluating object-tracking algorithms. Comprising 100 standardized video sequences with diverse challenging scenarios and detailed annotations, it assesses the robustness and adaptability of tracking algorithms. The evaluation follows the one-pass evaluation (OPE) protocol, tracking the entire sequence without restarts. This setup tests the tracker's performance in handling challenges like scale changes, occlusions, deformations, and motion blur. The tracking algorithms are ranked using success plots and precision plots, with success plots evaluated by the AUC, and precision plots assessed by the center position error (CPE).
Figure 5 shows a comparison of the proposed module with nine other trackers on the OTB100 dataset, including BACF [23], A3DCF [61], SRDCF [62], CSR-DCF [19], RBSCF [6], CFNet [63], DiMP [64], SiamFC [29], and SiamRPN [31]. As shown in Figure 5(a), (b), the MCWTT leads with a success rate of 66.7% and a precision of 86.6%, outperforming all the compared tracking methods while maintaining real-time capability. This validates the significant performance advantage of the proposed module in balancing real-time and accuracy requirements.
Figures 6 and 7 compare the proposed module with nine other trackers across 11 visual challenge attributes of the OTB100 dataset, namely background clutter (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out of view (OV), and scale variation (SV). Figure 6 shows the proposed module performs exceptionally well in multiple attributes, achieving success rates of 68.8% in the FM scenario and 63.4% in the OV scenario. Figure 7 shows the precision performance of the proposed module and other tracking algorithms under 11 different visual attribute challenges. The proposed module also performs exceptionally well in various attribute challenges, with 89.4% in the MB scenario and 89.2% in the SV scenario, significantly outperforming other competing algorithms and fully demonstrating the advantage of the proposed module in handling high-speed dynamic objects.
The UAV123 [50] dataset contains 123 low-altitude unmanned aerial vehicle (UAV) video sequences, totaling over 110,000 frames and encompassing a variety of background environments ranging from urban landscapes to natural scenery. It is primarily used to evaluate the performance of different tracking algorithms. The dataset is divided into three subsets, each corresponding to distinct testing scenarios. The high-quality aerial video subset has 103 sequences shot by professional UAVs at heights of 5-25 meters, with frame rates of 30 to 96 frames per second (fps) and resolutions of 720P to 4K, ideal for tracking fast-moving targets. The low-cost UAV subset contains 12 sequences captured by economical UAVs with lower video quality, lower resolution, and more noise, increasing tracking difficulty. The synthetic data subset includes eight sequences generated by a UAV simulator to mimic real-world environmental changes. UAV123 can test a tracker's ability to handle fast motion, illumination changes, and occlusion, aiding in the development of systems that remain stable in various environments. Evaluation metrics are the AUC of success plots and the CPE of precision plots.
Figure 8 compares the overall success and precision of the proposed module with the top tracking algorithms on the UAV123 dataset, including MixFormer [55], STARK [43], TransT [41], TCTrack++ [58], TrDiMP [65], SiamBAN [66], SiamCAR [67], SiamTPN [68], and SiamRPN++ [32]. In the success rate evaluation, the MCWTT and MixFormer both achieved 68.7%, ranking first. This indicates that the MCWTT's accuracy in object tracking is on par with the state-of-the-art methods. In the precision evaluation, the MCWTT's 89.2% is slightly lower than MixFormer's by 0.3% but surpasses other methods. This shows that MCWTT effectively balances real-time and accuracy requirements, demonstrating strong adaptability in complex scenes with occlusions, deformations, and illumination changes.
To demonstrate the tracking performance of the proposed module across diverse scenarios, we select 12 representative scene attributes from the UAV123 dataset for detailed analysis. These attributes encompass a variety of complex challenges, including scale variation (SV), aspect ratio change (ARC), low resolution (LR), fast motion (FM), full occlusion (FO), partial occlusion (PO), out-of-view (OV), background clutter (BC), illumination variation (IV), viewpoint change (VC), camera motion (CM), and similar object (SO).
As shown in Figures 9 and 10, the MCWTT exhibits robust tracking performance across all attribute scenarios. For instance, for the LR attribute (Figures 9 (a) and 10 (a)), the MCWTT achieves success and precision rates of 54.4% and 79.2%, respectively, outperforming all comparative algorithms. This highlights its exceptional adaptability to low-resolution objects. In FM scenarios, the MCWTT achieves success and precision rates of 66.2% and 86.8%, surpassing MixFormer by approximately 0.4% in both metrics. This demonstrates its superior efficiency and accuracy in tracking fast-moving objects. For highly challenging scenarios such as FO, the MCWTT maintains high tracking accuracy, with success and precision rates exceeding most benchmark algorithms. It ranks second only to MixFormer and STARK, further validating its reliability in handling full occlusions. These results underscore the MCWTT's versatility and robustness in addressing diverse real-world tracking challenges, particularly in scenarios involving rapid motion, low resolution, and occlusions.
To further assess the MCWTT's performance, we conducted frame-by-frame comparison tests against nine other trackers on the OTB100 dataset to examine each tracker's behavior in different scenes. We selected five videos representing typical tracking challenges, namely fast motion, scale variation, occlusion, deformation, and illumination variation: "Diving_1", "Girl2_1", "Jump_1", "Skating2-1_1", and "Trans_1".
(1) Fast motion and scale variation. In the low-resolution scene of the "Diving_1" video sequence (Figure 11(a)), the athlete briefly exhibits a mid-air flipping posture and, in subsequent frames, plunges into the water against a significantly changing background. Many trackers fail to identify the rapidly moving object, whereas the proposed module tracks it accurately throughout the entire video sequence. Similarly, in the "Jump_1" video sequence (Figure 11(c)), the object undergoes scale variations during rapid motion, and only the proposed module identifies and accurately tracks the target across the full sequence.
(2) Occlusion. In the "Girl2_1" video sequence shown in Figure 11(b), when the tracked girl is completely obscured by another pedestrian in Frame 107, only the proposed module correctly distinguishes the interfering object and keeps identifying the target through the subsequent occlusion, while the other tracking methods either lock onto the interfering object or drift. This scenario clearly demonstrates the superiority of the proposed module. Similarly, in the "Skating2-1_1" video sequence shown in Figure 11(d), the target athlete is repeatedly obstructed by an interfering athlete. The proposed module tracks the target stably throughout the sequence and adaptively resizes the bounding box to preserve the structural integrity of the object within the frame.
(3) Deformation. In the "Jump_1" video sequence shown in Figure 11(c), the athlete deforms during rapid motion, posing a challenge to the tracking algorithms. As Frame 95 shows, the other algorithms fail to track the gymnast after landing; only the proposed module tracks the target stably through the entire sequence, reflecting its effective handling of deformation and scale variation. Similarly, in the "Trans_1" video sequence shown in Figure 11(e), the proposed module identifies the object's appearance contour more accurately than the other algorithms, significantly improving tracking accuracy.
(4) Illumination variation. In the "Trans_1" video sequence shown in Figure 11(e), the background changes from bright to dark, blurring the object's contour relative to earlier frames. These low-light conditions compromised the accuracy of several tracking algorithms, while the proposed method remained unaffected by the illumination changes and identified the object accurately.
In summary, the proposed module shows better robustness and accuracy than the other trackers in complex tracking scenarios, including low-resolution scenes.
To quantify how each component of the proposed module contributes to tracking performance, we conduct a series of ablation experiments on the OTB100 dataset: we remove one component at a time while keeping the others unchanged and evaluate the resulting performance.
In this subsection, we compare the model without the FFM (denoted as the CWTT) with the MCWTT to verify the FFM's effectiveness. As shown in Table 2, the MCWTT nearly matches the speed of the original model with only the SFM while reducing video random access memory (VRAM) usage, highlighting the FFM's advantage in alleviating memory demands. Through its architectural and computational optimizations, the MCWTT compresses video memory and significantly reduces resource requirements without sacrificing tracking performance on the OTB100 benchmark. The lower memory footprint also makes the MCWTT better suited to devices with limited VRAM, strengthening its practical applicability.
Table 2. VRAM usage and running speed of the CWTT and the MCWTT on the OTB100 dataset.

| Method | VRAM usage (MiB) | Speed (fps) |
|--------|------------------|-------------|
| CWTT   | 2318             | 25.661      |
| MCWTT  | 2172             | 25.653      |
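For reproducibility, a minimal sketch of how such numbers can be measured is given below, assuming a PyTorch implementation; `model` and `frames` are hypothetical placeholders for the tracker under test and its preprocessed input tensors, and the exact protocol behind Table 2 may differ.

```python
import time
import torch

def benchmark(model, frames, device="cuda"):
    """Return (peak VRAM in MiB, throughput in fps) over a frame sequence."""
    model = model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)  # restart peak-VRAM tracking
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    with torch.no_grad():
        for frame in frames:                    # one forward pass per frame
            model(frame.to(device))
    torch.cuda.synchronize(device)              # wait for queued GPU work
    elapsed = time.perf_counter() - start
    vram_mib = torch.cuda.max_memory_allocated(device) / 2**20
    return vram_mib, len(frames) / elapsed
```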
In this subsection, we replace the multi-scale windows with single-scale windows (this variant is denoted as the MCWTT-S) and compare it with the MCWTT to verify the effectiveness of multi-scale windows. As shown in Figure 12, the multi-scale window configuration yields significantly higher success rates and precision than the MCWTT-S.
Tables 3 and 4 list the success rates and precision of the single-scale and multi-scale window configurations under different challenging scenarios. Taken together, they show that multi-scale windows adapt better to complex tracking environments and capture the object's complete information more efficiently, which is reflected in the tracking results; an illustrative sketch of multi-scale window pooling follows Table 4.
Table 3. Success rates of the MCWTT and the MCWTT-S under 11 attribute challenges on the OTB100 dataset.

| Method  | BC    | DEF   | FM    | IPR   | IV    | LR    | MB    | OCC   | OPR   | OV    | SV    |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| MCWTT   | 0.596 | 0.638 | 0.688 | 0.663 | 0.645 | 0.571 | 0.708 | 0.640 | 0.653 | 0.634 | 0.690 |
| MCWTT-S | 0.541 | 0.646 | 0.653 | 0.613 | 0.614 | 0.575 | 0.674 | 0.592 | 0.614 | 0.550 | 0.647 |
Table 4. Precision of the MCWTT and the MCWTT-S under 11 attribute challenges on the OTB100 dataset.

| Method  | BC    | DEF   | FM    | IPR   | IV    | LR    | MB    | OCC   | OPR   | OV    | SV    |
|---------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| MCWTT   | 0.757 | 0.861 | 0.877 | 0.862 | 0.818 | 0.787 | 0.894 | 0.834 | 0.868 | 0.829 | 0.892 |
| MCWTT-S | 0.712 | 0.878 | 0.837 | 0.826 | 0.792 | 0.826 | 0.859 | 0.773 | 0.826 | 0.719 | 0.844 |
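As noted above, the following sketch makes the multi-scale window idea concrete by pooling a feature map into window-level tokens at several scales before attention; the window sizes and the average-pooling operator are illustrative assumptions, not the exact MCWTT design.

```python
import torch
import torch.nn.functional as F

def multiscale_window_tokens(feat, window_sizes=(4, 8, 16)):
    """Pool a (B, C, H, W) feature map into window-level tokens at several scales.

    Each window size yields one token per window via average pooling, so
    attention operates on windows rather than pixels; tokens from all scales
    are concatenated along the sequence dimension.
    """
    tokens = []
    for s in window_sizes:
        pooled = F.avg_pool2d(feat, kernel_size=s, stride=s)  # (B, C, H/s, W/s)
        tokens.append(pooled.flatten(2).transpose(1, 2))      # (B, N_s, C)
    return torch.cat(tokens, dim=1)                           # (B, sum of N_s, C)

x = torch.randn(1, 64, 32, 32)
print(multiscale_window_tokens(x).shape)  # torch.Size([1, 84, 64]): 64+16+4 windows
```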
We also compared the average overlap rate and center error rate curves of the MCWTT and MCWTT-S models on the "Car2" video sequence. As shown in Figure 13, the MCWTT's average overlap rate is higher and more stable than the MCWTT-S's, and its center error rate is lower. These results are mirrored in the visual tracking comparison on the "Car2" sequence in Figure 14: when interfering objects appear during the car's movement, the MCWTT-S loses the target and drifts into tracking failure, whereas the MCWTT remains stable, confirming the effectiveness of the multi-scale window configuration used in the MCWTT.
In this paper, we propose a multi-scale cyclic-shift window Transformer object tracker based on the fast Fourier transform (MCWTT). The proposed module introduces a cyclic-shift window mechanism to increase the diversity of sample positions and adopts a multi-scale window-level attention mechanism in place of the traditional single-scale pixel-level attention. This design protects the integrity of the object, enriches the diversity of training samples, mines the object's location information more thoroughly, and improves tracking accuracy. In addition, we replace spatial-domain feature fusion with frequency-domain feature fusion: via the convolution theorem, the attention computation between cyclic-shifted and non-cyclic-shifted samples is carried out in the frequency domain, which reduces sample storage and computational complexity and significantly improves inference efficiency. Unlike the traditional serial encoder-decoder structure, the network includes a feature enhancement block whose output is fed back to the feature fusion block, forming a signal feedback loop that strengthens object state estimation and lets the network handle tracking tasks in dynamic scenes more efficiently.

Theoretical analysis and experimental verification show that the network has significant advantages in complex scenes involving scale variation, background interference, and occlusion, and the ablation studies demonstrate that the cyclic-shift operation effectively reduces computational overhead while maintaining precision and efficiency. Nevertheless, a performance gap remains relative to the state-of-the-art trackers, and enhancing interpretability and generalization while maintaining performance under harsh tracking conditions is still an open issue. Future research will focus on streamlining the network's parameters to boost efficiency without compromising tracking performance.
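To illustrate the convolution-theorem step at the heart of the frequency-domain fusion summarized above, the sketch below scores a template feature map against every cyclic shift of a search feature map using a pair of FFTs. It is a minimal sketch of the principle only, not the full window-level attention computation used in the MCWTT.

```python
import torch

def cyclic_correlation_fft(template, search):
    """Score `template` against all cyclic shifts of `search` via the
    convolution theorem: circular cross-correlation in the spatial domain
    equals conj(FFT(template)) * FFT(search) in the frequency domain.

    template, search: (C, H, W) real feature maps. Returns an (H, W) map
    whose entry (dy, dx) is the correlation score at cyclic shift (dy, dx).
    """
    tf = torch.fft.rfft2(template)
    sf = torch.fft.rfft2(search)
    # Summing per-channel products makes channels act like a dot product.
    return torch.fft.irfft2((tf.conj() * sf).sum(dim=0), s=template.shape[-2:])

t = torch.randn(64, 16, 16)
s = torch.roll(t, shifts=(3, 5), dims=(1, 2))     # search = template shifted
idx = cyclic_correlation_fft(t, s).argmax()
print(divmod(idx.item(), 16))                      # -> (3, 5), the applied shift
```

Because a circular shift in the spatial domain becomes a mere phase factor in the frequency domain, all H x W shifted comparisons cost only two forward FFTs and one inverse FFT, without ever materializing the cyclic-shifted samples, which is the source of the storage and computation savings discussed above.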
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work is supported by the Natural Science Foundation of Fujian Province (2024J01820, 2024J01821, 2024J01822), Natural Science Foundation Project of Zhangzhou City (ZZ2023J37), the Principal Foundation of Minnan Normal University (KJ19019), the High-level Science Research Project of Minnan Normal University (GJ19019), the Research Project on Education and Teaching of Undergraduate Colleges and Universities in Fujian Province (FBJY20230083), the Guangdong Province Natural Science Foundation (2024A1515011766), and the State key laboratory major special projects of Jilin Province Science and Technology Development Plan (SKL202402024).
The authors declare there is no conflict of interest.