
Sound event localization and detection has been applied in various fields. Due to polyphony and noise interference, it is challenging to accurately predict sound events and the locations at which they occur. Aiming at this problem, we propose a Multiple Attention Fusion ResNet, which uses ResNet34 as the base network. Given that sound durations are not fixed and that scenes contain multiple overlapping sound events and noise, we introduce the Gated Channel Transform to enhance the residual basic block. This enables the model to capture contextual information, evaluate channel weights, and reduce the interference caused by polyphony and noise. Furthermore, Split Attention is introduced to capture cross-channel information, which strengthens the ability to distinguish overlapping events. Finally, Coordinate Attention is introduced so that the model can attend to both the channel information and the spatial location information of sound events. Experiments were conducted on two datasets, TAU-NIGENS Spatial Sound Events 2020 and TAU-NIGENS Spatial Sound Events 2021. The results demonstrate that the proposed model significantly outperforms state-of-the-art methods in environments with multiple overlapping events and directional noise interference, and it achieves competitive performance when only a single event is active.
Citation: Shouming Zhang, Yaling Zhang, Yixiao Liao, Kunkun Pang, Zhiyong Wan, Songbin Zhou. Polyphonic sound event localization and detection based on Multiple Attention Fusion ResNet[J]. Mathematical Biosciences and Engineering, 2024, 21(2): 2004-2023. doi: 10.3934/mbe.2024089
Robot visual servo control is an advanced robotics technology widely used in agricultural [1], industrial [2] and medical [3] scenarios. Vision sensors provide agile access to rich environmental information, enabling robots to operate precisely and efficiently in unstructured environments. Based on the spatial relationship between the vision sensor and the robot, visual servoing is divided into eye-in-hand and eye-to-hand configurations. According to the information fed back by the vision sensor, visual servoing is classified into position-based visual servoing [4,5,6], image-based visual servoing (IBVS) [7,8,9,10,11] and hybrid visual servoing [12,13,14]. Among them, IBVS has been the most extensively researched and applied. This study discusses the design of an image-based visual servo controller.
In visual servo tasks, system constraints are an influential factor that must be considered. If any image feature leaves the camera's field of view, the visibility constraint is violated. An actuator constraint is violated when the controller's input torque exceeds the robot's maximum permissible torque. Any of these constraint violations can cause the visual servo task to fail. Therefore, satisfying the system constraints is a problem that must be solved in visual servo control. Model predictive control (MPC) can naturally build system constraints into an optimization problem, ensuring constraint satisfaction while solving for the controller actions [15]. Thus, model predictive control methods are often used for constrained visual servo control [16,17,18,19,20,21]. A conjugate visual predictive control strategy was presented in [16] for visual servoing with internal and external constraints in uncertain environments. A predictive control method that simultaneously considers system constraints and uncertainties was proposed in [17], where the prediction model uses a depth-independent Jacobian matrix; however, it was only validated on a planar robot manipulator. A quasi-min-max MPC method applied to visual servoing was proposed in [18], where the depth values in the Jacobian matrix are fixed constants. A predictive control strategy based on a nonlinear state observer was proposed in [19] to achieve attitude control of an unmanned aerial vehicle (UAV) using the IBVS method. A nonlinear model predictive control technique based on Gaussian processes [20] was suggested for mobile robots, taking camera visibility limits and robot hardware constraints into account. A nonlinear model predictive control (NMPC) strategy [21] was proposed to solve the six-degree-of-freedom robotic constrained visual servo problem, but with a heavy computational burden. The weight matrix of the objective function in a model predictive controller must be selected, as it reflects the relative importance of the different system state variables in the objective function. Generally, the choice of the weight matrix is highly subjective and requires many attempts, and even after a large number of tests, the best control quality may not be obtained. The weight matrix tuning of model predictive control has therefore long been a concern. Shridhar and Cooper proposed a multivariable model predictive control tuning strategy, but it is only applicable to unconstrained systems [22]. Particle swarm and genetic algorithms have also been used to tune model predictive controller parameters [23,24,25].
The discipline of artificial intelligence greatly benefits from the interdisciplinary study of machine learning [26,27]. Among different approaches, reinforcement learning (RL), which has gained prominence recently, uses learning procedures to accomplish a given goal, explaining and addressing the problem of an agent interacting with its environment [28,29,30,31]. Many existing works use RL to assist intelligent control methods and achieve better control in various working scenarios. An adaptive PID control method [28] was proposed for wind energy conversion system control, using actor-critic learning to adaptively adjust the controller parameters; the robustness of the presented control strategy was confirmed by comparative simulation. A deep RL method [29] was used to learn an effective adaptive PID gain adjustment strategy for the control of a marine vessel. A double-Q-learning algorithm [30] was presented to adjust a multilevel PID controller, which was used in simulation research and experiments on a mobile robot. A maximum entropy deep RL method [31] was presented to tune an autonomous underwater vehicle (AUV) controller, maintaining the balance between performance and robustness. The Q-learning algorithm was adopted to adaptively adjust the visual servo gain in [32], which significantly improved the convergence speed and stability compared to the traditional IBVS method with a fixed servo gain; the effectiveness of that method was verified by simulations and experiments. Several works combine model predictive control with RL [33,34,35]. A model predictive control parameter tuning method based on RL was proposed in [33], where the ideal weight matrix was found in a short time. NMPC methods tuned by RL were applied to control a UAV in [34,35].
Although reinforcement learning has been applied to visual servoing, the most commonly used algorithm is still Q-learning, whose performance in continuous action spaces is limited [36]. The deep deterministic policy gradient (DDPG) is a reinforcement learning algorithm based on deep neural networks and policy gradient methods that handles continuous action spaces effectively by directly outputting continuous action values instead of selecting from a hard-coded discrete action set [37]. Compared with Q-learning, DDPG uses neural networks to approximate the value function and the policy, which allows faster convergence, and it applies a loss function combining the action value and the policy gradient, thus better optimizing both the policy and the value function. In addition, DDPG uses techniques such as an experience replay buffer and batch training to improve the stability of the algorithm [38].
Inspired by the above works, this work presents a model predictive control strategy for robot manipulator visual servoing tuned by DDPG. The main contributions of this paper are as follows:
● To improve the servo efficiency and control accuracy of constrained IBVS systems, this paper proposes an MPC-based IBVS method tuned by DDPG. A depth-independent image interaction matrix is established as the predictive model, and more suitable predictive control weight matrix parameters are trained offline by the DDPG algorithm. Then, the control signal is given through the MPC controller.
● Compared with the traditional MPC-based IBVS method, the proposed method compensates for the disadvantages of the traditional trial-and-error method in tuning the weight matrix parameters and provides better weight matrix parameters, thereby improving visual servo efficiency and steady-state accuracy.
● Compared with the MPC-based IBVS method tuned by Q-learning, the proposed method obtains higher cumulative rewards in the continuous visual servo space and reduces the settling time.
The remainder of this study is structured as follows. In Section 2, the model of the visual servo system is established to provide the prediction model needed by the model predictive controller. In Section 3, the model predictive controller of IBVS tuned by RL is designed: first the visual servo model predictive control method is introduced, then the RL and policy gradient algorithm, and finally the DDPG-based strategy for tuning the model predictive control weight matrices. In Section 4, the DDPG learning process is introduced, and the effectiveness of the proposed method is verified by simulations. Finally, the conclusions of this study are summarized in Section 5.
In robot manipulator visual servo tasks, visibility constraints and robot joint constraints are system constraints that must be considered. The MPC method can handle system input-output constraints, so this study solves the constrained visual servo problem based on MPC. The predictive model is an essential part of MPC: it predicts the future values of the process output from the input of the designed controller and the past status of the system. In this study, the relationship between the rates of change of the image coordinates of feature points and the joint velocity is described by building a depth-independent Jacobian matrix. Figure 1 shows schematic diagrams of the eye-in-hand and eye-to-hand configurations for robot manipulator visual servoing. In both configurations, the image coordinates of a feature point can be uniformly represented as
$$s = \begin{pmatrix} u \\ v \end{pmatrix} = \frac{1}{Z_i}\begin{pmatrix} c_1^T \\ c_2^T \end{pmatrix} P \begin{pmatrix} \alpha \\ 1 \end{pmatrix} \tag{2.1}$$
where $s=(u,v)^T$ denotes the 2-D coordinates of the feature point in the image frame, and $u$ and $v$ represent the pixel coordinates projected on the $u$- and $v$-axes, respectively. $Z_i$ denotes the depth of the feature point with respect to the camera frame. $C\in\Re^{3\times 4}$ denotes the unknown perspective projection matrix, and $c_1^T$, $c_2^T$ and $c_3^T$ are the first, second and third rows of $C$, respectively. $\alpha$ denotes the 3-D coordinates of the feature point in the Cartesian coordinate system. $P$ denotes the homogeneous transformation matrix, which is determined by the forward kinematics of the robot manipulator. Differentiating both sides of the above equation with respect to time yields
$$\dot{s} = \frac{1}{Z_i^2}\left(\begin{pmatrix} c_1^T \\ c_2^T \end{pmatrix}\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right)\dot{q}\,Z_i - \dot{Z}_i\begin{pmatrix} c_1^T \\ c_2^T \end{pmatrix} P \begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right) = \frac{1}{Z_i}\begin{pmatrix} c_1^T \\ c_2^T \end{pmatrix}\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right)\dot{q} - \frac{\dot{Z}_i}{Z_i^2}\begin{pmatrix} c_1^T \\ c_2^T \end{pmatrix} P \begin{pmatrix} \alpha \\ 1 \end{pmatrix} \tag{2.2}$$
The depth $Z_i$ can be linearly expressed as
$$Z_i = c_3^T P \begin{pmatrix} \alpha \\ 1 \end{pmatrix}. \tag{2.3}$$
Taking the derivative of both sides of Eq (2.3) gives
$$\dot{Z}_i = c_3^T\,\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right)\dot{q}. \tag{2.4}$$
Substituting Eq (2.4) into Eq (2.2), the time-related association between the joint velocity and the variation of the feature points can be given as
$$\dot{s} = \frac{1}{Z_i}\begin{pmatrix} c_1^T \\ c_2^T \end{pmatrix}\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right)\dot{q} - \frac{1}{Z_i}\,s\,c_3^T\,\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right)\dot{q} = \frac{1}{Z_i}\begin{pmatrix} c_1^T - u c_3^T \\ c_2^T - v c_3^T \end{pmatrix}\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right)\dot{q} = \frac{1}{Z_i}\Omega\dot{q} \tag{2.5}$$
where $q$ denotes the joint angle of the robot manipulator, $\dot{q}$ denotes the joint velocity of the robot manipulator, and $\Omega$ denotes the image Jacobian matrix, which is independent of depth. It can be given as
$$\Omega = \begin{pmatrix} c_1^T - u c_3^T \\ c_2^T - v c_3^T \end{pmatrix}\frac{\partial}{\partial q}\!\left(P\begin{pmatrix} \alpha \\ 1 \end{pmatrix}\right) \tag{2.6}$$
In the formulas above, the depth-independent image Jacobian matrix $\Omega$ and the feature-point depth $Z_i$ are nonlinear in the unknown camera and kinematic parameters. However, they can be represented linearly with regressor matrices and an unknown parameter vector through the following property.
Property 1. For any vector $\chi$, the products $\Omega\chi$ and $Z_i\chi$ can be parameterized in a linear form as
$$\Omega\chi = A(\chi, q, s)\,\theta, \tag{2.7}$$
$$Z_i\chi = B(\chi, s)\,\theta, \tag{2.8}$$
where $A(\chi,q,s)$ and $B(\chi,s)$ are the regressor matrices, and $\theta$ is the corresponding parameter vector determined by the products of the camera parameters and the robot kinematic parameters.
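To make the projection model concrete, the following sketch evaluates Eqs (2.1) and (2.3) numerically for one feature point. It is a minimal illustration: the intrinsic matrix, the transform and the point coordinates below are made-up placeholders, not the calibration used in the paper.

```python
import numpy as np

def project_feature(C, P, alpha):
    """Pixel coordinates and depth of a 3-D feature point, per Eqs (2.1) and (2.3).

    C     : 3x4 perspective projection matrix (rows c1^T, c2^T, c3^T)
    P     : 4x4 homogeneous transform from the robot forward kinematics
    alpha : 3-vector of Cartesian feature-point coordinates
    """
    alpha_h = np.append(alpha, 1.0)   # homogeneous coordinates (alpha, 1)^T
    Zi = C[2] @ P @ alpha_h           # depth, Eq (2.3)
    s = (C[:2] @ P @ alpha_h) / Zi    # image coordinates, Eq (2.1)
    return s, Zi

# Toy calibration: C = K [I | 0] with a pinhole intrinsic matrix K.
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 480.0],
              [  0.0,   0.0,   1.0]])
C = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P = np.eye(4)
P[:3, 3] = [0.1, 0.0, 0.5]            # toy pose offset
s, Zi = project_feature(C, P, np.array([0.05, -0.02, 1.0]))
print(s, Zi)                          # s ≈ [720.0, 469.3], Zi = 1.5
```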
A constrained control system is one with constraints on control inputs, outputs and system states. In practical applications, robot visual servo systems are constrained by image visibility, the physical limits of the joints, etc.; a robot visual servo system is therefore a typical constrained control system, and MPC is widely used for such systems. In this study, the model predictive controller samples the image coordinates of the feature points at a given sampling period and transforms the visual servo problem into a constrained optimization problem, generating the most suitable joint torque signal and minimizing the incremental change in joint torque while minimizing the deviation of the predicted image. The purpose is to keep the control process as stable as possible while completing the visual servo task. The objective function of the optimization problem includes weight matrices, and an appropriate weight matrix effectively improves the control performance of visual servoing. Previously, weight matrices were obtained by researchers through repeated experiments, and achieving the best control effect was challenging. In this study, a DDPG-based RL strategy is adopted to find an appropriate weight matrix for the MPC objective function and thereby optimize the visual servo performance of MPC-based IBVS. The design of the visual servo model predictive control is introduced in the next subsection, the RL and policy gradient algorithm follows it, and the reinforcement-learning-based rectification of the objective function weight matrix is introduced last. The control scheme of the presented DDPG-based MPC (DDPG-MPC) IBVS algorithm is given in Figure 2.
To develop the MPC-based IBVS controller, the discrete-time system model is obtained according to Eq (2.5).
$$s(k+1) = s(k) + T_d\frac{1}{Z_i}\Omega\, u_\tau(k) \tag{3.1}$$
When the control sequence obtained from the prediction model is applied to the IBVS system, the predicted system output over the next $N_p$ time steps is
$$\begin{aligned}
s(k+1\mid k) &= s(k) + T_d\frac{1}{Z_i(k)}\Omega(k)u_\tau(k)\\
s(k+2\mid k) &= s(k) + T_d\frac{1}{Z_i(k)}\Omega(k)u_\tau(k) + T_d\frac{1}{Z_i(k+1)}\Omega(k+1)u_\tau(k+1)\\
&\ \ \vdots\\
s(k+N_p\mid k) &= s(k) + T_d\frac{1}{Z_i(k)}\Omega(k)u_\tau(k) + \cdots + T_d\frac{1}{Z_i(k+N_p-1)}\Omega(k+N_p-1)u_\tau(k+N_c-1)
\end{aligned} \tag{3.2}$$
where $N_p$ is the prediction horizon, $N_c$ is the control horizon, and $T_d$ is the sampling period. The following constrained optimization problem can be solved to obtain the optimal joint torque input.
$$\Pi:\ \min_{\Delta U_\tau(k)}\ G\big(s_e(k), \Delta U_\tau(k)\big) \tag{3.3}$$
subject to
$$s_e(k+i) = \bar{s}(k+i) - s^*(k+i) \tag{3.4}$$
$$\Delta U_\tau(k) = \left[\Delta u_\tau(k)^T, \Delta u_\tau(k+1)^T, \cdots, \Delta u_\tau(k+N_p-2)^T, \Delta u_\tau(k+N_p-1)^T\right] \tag{3.5}$$
$$\Delta u_\tau(k) = u_\tau(k) - u_\tau(k-1) \tag{3.6}$$
where $s^*$ denotes the desired image coordinates of the feature points, $\bar{s}$ denotes the predicted image coordinates of the feature points, $s_e$ denotes the image deviation within the prediction period, $\Delta u_\tau(k)$ denotes the change in the control input, and $\Delta U_\tau(k)$ denotes the optimal sequence of control-input changes within the prediction period. To minimize the predicted image coordinate deviation together with the control input variation, the quadratic cost function can be described as
$$G\big(s_e(k),\Delta U_\tau(k)\big) = \sum_{i=1}^{N_p}\left\|\bar{s}(k+i)-s^*(k+i)\right\|^2_{R_s} + \sum_{i=1}^{N_c}\left\|\Delta u_\tau(k+i-1)\right\|^2_{R_u} \tag{3.7}$$
where normally $N_c \le N_p$, and $R_s \ge 0$ and $R_u \ge 0$ denote the weight matrices of the image coordinate deviation and of the sequence of control-input changes, respectively. The constraints of the limited visual servoing system can be presented by the following formulas:
$$s_{\min} \le s(k) \le s_{\max} \tag{3.8}$$
$$U_{\tau\min} \le U_\tau(k) \le U_{\tau\max} \tag{3.9}$$
$$q_{\min} \le q(k) \le q_{\max} \tag{3.10}$$
$$\dot{q}_{\min} \le \dot{q}(k) \le \dot{q}_{\max} \tag{3.11}$$
Equation (3.8) represents the visibility constraints, Eq (3.9) represents the torque constraints, Eq (3.10) represents the joint angle constraints, and Eq (3.11) represents the joint velocity constraints.
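As a concrete illustration of the receding-horizon optimization in Eqs (3.2)-(3.9), the sketch below solves for the input-change sequence with SciPy's SLSQP solver. It is a minimal sketch under simplifying assumptions: the depth $Z_i$ and interaction matrix $\Omega$ are frozen at their time-$k$ values over the horizon, the input-change sequence is taken over the control horizon $N_c$ as in Eq (3.7), and only the input bounds of Eq (3.9) are imposed (the constraints (3.8), (3.10) and (3.11) would be added as further inequality constraints in the same way). The function name and signature are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(s, s_star, u_prev, Omega, Zi, Td, Np, Nc, Rs, Ru, u_min, u_max):
    """One MPC-IBVS step: solve Eq (3.3) for the input-change sequence.

    Omega is the stacked depth-independent interaction matrix for all
    feature points (e.g., 8x6 for four points and a 6-DOF arm); for this
    sketch it is held constant over the prediction horizon.
    """
    m = u_prev.size
    B = (Td / Zi) * Omega                      # discrete-time input matrix, Eq (3.1)

    def input_traj(dU):
        dU = dU.reshape(Nc, m)
        return u_prev + np.cumsum(dU, axis=0)  # u(k), ..., u(k+Nc-1), Eq (3.6)

    def cost(dU):
        dU2 = dU.reshape(Nc, m)
        U = input_traj(dU)
        J, s_pred = 0.0, np.array(s, dtype=float)
        for i in range(Np):
            u = U[min(i, Nc - 1)]              # input held after the control horizon
            s_pred = s_pred + B @ u            # prediction step, Eq (3.2)
            e = s_pred - s_star                # predicted image deviation, Eq (3.4)
            J += e @ Rs @ e                    # deviation penalty, Eq (3.7)
            if i < Nc:
                J += dU2[i] @ Ru @ dU2[i]      # input-change penalty, Eq (3.7)
        return J

    # Input bounds, Eq (3.9): constraint functions must be >= 0 at the solution.
    cons = [{'type': 'ineq', 'fun': lambda dU: (input_traj(dU) - u_min).ravel()},
            {'type': 'ineq', 'fun': lambda dU: (u_max - input_traj(dU)).ravel()}]
    res = minimize(cost, np.zeros(Nc * m), method='SLSQP', constraints=cons)
    return input_traj(res.x)[0]                # receding horizon: apply first input only
```

Only the first element of the optimized input sequence is applied; the problem is re-solved at the next sampling instant.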
The weight matrices $R_s$ and $R_u$ in the minimized cost determine the control effect of the model predictive controller. $R_s$ weights the control deviations: when $R_s$ is too large, the image coordinates of the feature points still converge to the target coordinates, but the variation of the control input is neglected, producing input jitter and reducing the response quality of the control process. $R_u$ weights the changes in the control inputs: when $R_u$ is too large, the controller pays too much attention to keeping the control inputs slowly varying, the response time of the control system lengthens, and the visual servo task cannot be completed quickly. Adjusting the weight matrices of the objective function manually is time-consuming, and even then the optimal control state of the model predictive controller may not be reached. In this work, an RL-based method is proposed to adjust the weight matrices of the objective function; it is introduced in the next section.
RL has an excellent ability to solve sequential decision problems. RL is a computational approach in which robots achieve their goals by interacting with the environment. RL is usually formulated as a Markov decision process (MDP), which extends the tuple $\langle S, P, r, \delta\rangle$ of a Markov reward process with an action set and is denoted $\langle S, A, P, r, \delta\rangle$:
● S is the collection of states for the robot.
● A is the collection of actions that the robot performs according to its current state.
● $P(s'\mid s,a)$ is the state transition function, giving the probability of reaching state $s'$ from the current state $s$ under action $a$.
● r(s,a) is the reward function used to evaluate the reward generated by action a under the current state s.
● δ∈[0,1] is a discount factor, representing the importance of the future return series.
The robot decides on an action in a given state of the environment and applies the action to the environment. The environment changes, and a reward is passed to the robot along with the next state. This interaction is iterative, and the robot's goal is to maximize the expected accumulated reward over multiple rounds. When the robot performs action $a_t$ in state $s_t$, it receives a reward $r_{t+1}$, and the environment is updated to $s_{t+1}$. As time approaches infinity, the accumulated reward $r_t$ can be expressed as
$$r_t = r_{t+1} + \delta r_{t+2} + \delta^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty}\delta^k r_{t+k+1}. \tag{3.12}$$
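As a quick numerical check of Eq (3.12), the snippet below accumulates a discounted return over a finite reward sequence; the rewards are made up for illustration, and δ = 0.95 matches the value later listed in Table 1.

```python
# Discounted return per Eq (3.12): r_t = sum_k delta^k * r_{t+k+1}
delta = 0.95                       # discount factor
rewards = [1.0, 1.0, -1.0, 1.0]    # illustrative reward sequence r_{t+1}, r_{t+2}, ...
ret = sum(delta**k + 0.0 if False else delta**k * r for k, r in enumerate(rewards))
print(ret)  # 1.0 + 0.95 - 0.9025 + 0.857375 = 1.904875
```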
In RL, DDPG combines the deterministic policy gradient with deep neural networks, which effectively handles continuous action spaces. In this study, DDPG is adopted to tune an appropriate weight matrix for the IBVS model predictive control objective function. The strategy for tuning the weight matrix parameters is denoted by $\omega$.
The learning goal is to find the strategy with the highest expected cumulative return, which can be expressed as
$$\omega^*(s) = \arg\max_{\omega}\ \mathbb{E}\left[\sum_{t=0}^{\infty}\delta^t r\big(s_t,\omega(s_t)\big)\ \middle|\ s_0=s\right] \tag{3.13}$$
The expected accumulated reward under strategy $\omega$ is expressed as
$$J(\omega) = \mathbb{E}\left[\sum_{t=0}^{\infty}\delta^t r\big(s_t,\omega(s_t)\big)\right] \tag{3.14}$$
According to deterministic policy gradient theory, the gradient with respect to the strategy's parameter vector $\eta$ is expressed as
$$\nabla_\eta J(\omega) = \mathbb{E}_{s_t\sim\rho}\left[\nabla_a Q(s,a)\big|_{a=\omega(s|\eta)}\,\nabla_\eta\,\omega(s|\eta)\right] \tag{3.15}$$
where $\rho$ denotes the state distribution induced by the exploration policy, and $Q(s,a)$ denotes the action value function.
To optimize the visual servo control performance by tuning the MPC weight matrices based on DDPG, the elements of the MDP solved by DDPG can be expressed as follows:
● As shown by Eq (3.1), the variation of the feature points is jointly influenced by $T_d\frac{1}{Z_i}\Omega$ and $\dot{q}$. The weight matrix parameters determine the output joint velocity $\dot{q}$, which in turn depends on $T_d\frac{1}{Z_i}\Omega$. Therefore, $S$ is defined as $T_d\frac{1}{Z_i}\Omega$.
● $A$ is defined as the pair of MPC weight matrices $(R_s, R_u)$, where $0 \le R_s \le 30 I_{8\times 8}$ and $0 \le R_u \le 5 I_{6\times 6}$.
● $r(s,a)$ represents the effect of action $a$ under state $s$ on the MPC-based IBVS task. If the deviation between the feature-point state and the desired state on the $u$- and $v$-axes is within the threshold, a positive reward is given. If the visual servo task fails, a negative reward is given. Every control action affecting the image coordinate deviation and deviation rate is otherwise penalized accordingly, so that the visual servo task succeeds and the control efficiency improves.
$$r_t = \begin{cases} 1 & \text{for } |s_e(u)|<\zeta,\ |s_e(v)|<\zeta \\ -1 & \text{if the visual servo task fails} \\ -s_e^T W_e s_e - \dot{s}_e^T W_{\dot{e}}\, \dot{s}_e & \text{otherwise} \end{cases}$$
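The following is a minimal sketch of this reward rule. The weight values are those later given in Eqs (4.2) and (4.3); the failure flag, the threshold value of 5 pixels and the function signature are illustrative assumptions.

```python
import numpy as np

def reward(s_e, s_e_dot, failed, zeta, W_e, W_edot):
    """Per-step reward: +1 once both pixel deviations are within the threshold
    zeta, -1 if the servo task failed, otherwise a quadratic penalty on the
    image deviation and its rate of change."""
    if failed:                                     # e.g., a constraint was broken
        return -1.0
    if abs(s_e[0]) < zeta and abs(s_e[1]) < zeta:  # |se(u)|, |se(v)| within threshold
        return 1.0
    return float(-(s_e @ W_e @ s_e) - (s_e_dot @ W_edot @ s_e_dot))

# Illustrative call with the weights of Eqs (4.2)/(4.3) and a made-up threshold.
r = reward(np.array([12.0, -8.0]), np.array([1.5, -0.5]),
           failed=False, zeta=5.0,
           W_e=np.diag([0.5, 1.0]), W_edot=np.diag([0.1, 0.05]))
```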
The process of the DDPG-based MPC IBVS algorithm is shown in Algorithm 1. Four networks are required in the DDPG algorithm: the critic network, with parameters denoted $\omega^Q$; the actor network, with parameters $\omega^\tau$; the critic target network, with parameters $\omega^{Q'}$; and the actor target network, with parameters $\omega^{\tau'}$. The robot performs action $a_t$ at state $s_t$, moves to the next state $s_{t+1}$ and obtains the reward $r_t$. The Q-value of the critic target network, $Q_{c\text{-}t}$, is obtained by the following formula:
$$Q_{c\text{-}t} = r_t + \delta\, Q'\big(s_{t+1},\ \tau'(s_{t+1}|\omega^{\tau'})\,\big|\,\omega^{Q'}\big) \tag{3.16}$$
Algorithm 1: DDPG-based MPC IBVS algorithm
1. Initialize the soft updating rate σ and discount factor δ;
2. Initialize the parameterized actor network $\omega^\tau$ and parameterized critic network $\omega^Q$;
3. Initialize the parameterized actor target network $\omega^{\tau'}$ and parameterized critic target network $\omega^{Q'}$;
4. Initialize Gaussian noise κ;
5. Initialize memory pool $M_p$;
6. for episode = 1, 2, ⋯, N do … (the loop body appears in the original algorithm figure)
The loss function of the critic network is minimized by the gradient descent algorithm:
$$L_{\omega^Q} = \frac{1}{N}\sum_{i=1}^{N}\Big(Q_{c\text{-}t} - Q(s_t,a_t|\omega^Q)\Big)^2, \tag{3.17}$$
$$\nabla L_{\omega^Q} = \frac{1}{N}\sum_{i=1}^{N}\Big[Q_{c\text{-}t} - Q(s_t,a_t|\omega^Q)\Big]\,\nabla_{\omega^Q} Q(s_t,a_t|\omega^Q) \tag{3.18}$$
The actor network parameters are updated with the following formula:
$$\nabla_{\omega^\tau} J \approx \frac{1}{N}\sum_{i}\nabla_a Q(s,a|\omega^Q)\big|_{a=\tau(s|\omega^\tau)}\,\nabla_{\omega^\tau}\,\tau(s|\omega^\tau). \tag{3.19}$$
The actor target network and the critic target network are updated by the exponential smoothing method.
$$\omega^{Q'} \leftarrow \sigma\omega^{Q} + (1-\sigma)\omega^{Q'} \tag{3.20}$$
$$\omega^{\tau'} \leftarrow \sigma\omega^{\tau} + (1-\sigma)\omega^{\tau'} \tag{3.21}$$
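The sketch below mirrors these update rules in PyTorch: the critic target from Eq (3.16), the critic loss from Eqs (3.17)/(3.18), the actor update from Eq (3.19), and the soft target updates from Eqs (3.20)/(3.21). The network classes, activation layout and optimizer usage are our illustrative assumptions; only the hidden-layer sizes follow Table 1 below.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the state to the weight-matrix parameters, squashed to (0, 1);
    rescaling to the ranges [0, 30] / [0, 5] is done outside this sketch."""
    def __init__(self, s_dim, a_dim, h1=40, h2=30):   # hidden sizes as in Table 1
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, h1), nn.ReLU(),
                                 nn.Linear(h1, h2), nn.ReLU(),
                                 nn.Linear(h2, a_dim), nn.Sigmoid())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Maps a state-action pair to a scalar action value Q(s, a)."""
    def __init__(self, s_dim, a_dim, h1=40, h2=30):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, h1), nn.ReLU(),
                                 nn.Linear(h1, h2), nn.ReLU(), nn.Linear(h2, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def ddpg_update(actor, critic, actor_tgt, critic_tgt, actor_opt, critic_opt,
                batch, delta=0.95, sigma=0.0005):
    """One DDPG update over a minibatch (s, a, r, s_next) from the replay pool;
    r is expected as an (N, 1) tensor."""
    s, a, r, s_next = batch
    with torch.no_grad():                      # critic target Q-value, Eq (3.16)
        q_tgt = r + delta * critic_tgt(s_next, actor_tgt(s_next))
    critic_loss = nn.functional.mse_loss(critic(s, a), q_tgt)  # Eqs (3.17)/(3.18)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(s, actor(s)).mean()   # Eq (3.19): ascend Q along the policy
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1 - sigma).add_(sigma * p.data)      # Eqs (3.20)/(3.21)
```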
To confirm the effectiveness of the proposed method, this section presents comparative simulation experiments of different IBVS methods on the same visual servo task. The performance differences between the control methods are described in detail.
The simulation object of this study is the visual servo system of a 6-DOF robot manipulator [39]. The camera is mounted on the last joint of the manipulator. The focal length of the camera, $f_0$, is 0.0005 m. The scaling factors along the $u$- and $v$-axes are 269,167 pixels/m and 267,778 pixels/m, respectively. The coordinates of the image feature points are fed back from the camera and passed to the 6-DOF robot manipulator, which changes its pose so that the image feature points move from the initial position to the desired position. Simulation training and experiments were performed using MATLAB/Simulink on a laptop computer with a 2.3 GHz Intel Core i7. In RL, the process in which an agent executes a strategy from the start state to a termination state is usually referred to as an episode. In this study, the following rules are followed during the learning process:
1) During the learning process, if the squared difference between the current coordinates of the image feature point and the desired coordinates is lower than the set threshold, it is considered that the visual servo task is successful, and this round of learning ends.
2) During the learning process, if the squared difference between the current coordinates of the image feature point and the desired coordinates exceeds the set upper limit, the visual servo task is considered to have failed, and the current round of learning ends.
3) During the learning process, if the current coordinates of the image feature point break the image constraint, the visual servo task is considered to have failed, and the current round of learning ends.
In the actor and actor target networks, the state $s$ is the input quantity and the action $a$ is the output quantity. In the critic and critic target networks, the state-action pair $(s,a)$ is the input, and the action value function $Q(s,a)$ is the output. The activation function gives the neural network its nonlinear modeling capability. In this study, the activation function is chosen as the ReLU function described by Eq (4.1).
$$f_r(t) = \max(0, t) \tag{4.1}$$
The reward weight matrices are set as follows:
$$W_e = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix}, \tag{4.2}$$
$$W_{\dot{e}} = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.05 \end{pmatrix} \tag{4.3}$$
A total of 1000 rounds of experiments were carried out, with 20 episodes in each round; the reward value for each round is the average over its 20 episodes. The hyperparameters of the DDPG algorithm are shown in Table 1.
Table 1. Hyperparameters of the DDPG algorithm.

| Parameter | Value |
| --- | --- |
| Discount factor ($\delta$) | 0.95 |
| Network soft update parameter ($\sigma$) | 0.0005 |
| Experience replay pool size | $10^5$ |
| Number of hidden layers | 2 |
| First hidden layer size | 40 |
| Second hidden layer size | 30 |
In the simulation training and experiments, the visual servo uses four image feature points. The 3-D Cartesian coordinates of the image feature points are mapped to the two-dimensional image frame, arranged counterclockwise as $O_1 \to O_2 \to O_3 \to O_4$. We set the desired positions of the visual servo feature points in the image plane as $(400, 525)^T$ pixels, $(720, 525)^T$ pixels, $(720, 320)^T$ pixels and $(400, 320)^T$ pixels.
Remark 1. The visibility constraints on the camera $u$-axis and $v$-axis can be expressed as $u_{\min}=0$, $u_{\max}=1292$, $v_{\min}=0$, $v_{\max}=964$. If the initial coordinates of the randomly generated image feature points lie outside the camera visibility constraint, a new set of initial image feature point coordinates is randomly generated until the initial coordinates of all image feature points are within the constraint.
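The resampling rule in Remark 1 amounts to rejection sampling. A minimal sketch follows; the square arrangement of the four points and the half-size parameter are assumptions made for illustration, only the bounds come from Remark 1.

```python
import numpy as np

U_MIN, U_MAX, V_MIN, V_MAX = 0.0, 1292.0, 0.0, 964.0   # visibility bounds, Remark 1

def sample_initial_points(rng=np.random.default_rng(), half_size=120.0):
    """Draw a square of four feature points at a random image location and
    redraw until every corner satisfies the visibility constraint."""
    offsets = half_size * np.array([[-1, 1], [1, 1], [1, -1], [-1, -1]])
    while True:
        center = rng.uniform([U_MIN, V_MIN], [U_MAX, V_MAX])
        pts = center + offsets                 # candidate corner coordinates
        inside = ((pts[:, 0] >= U_MIN) & (pts[:, 0] <= U_MAX) &
                  (pts[:, 1] >= V_MIN) & (pts[:, 1] <= V_MAX))
        if inside.all():                       # accept only fully visible sets
            return pts
```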
The purpose of the comparative simulation experiments is to demonstrate the accuracy and stability of the DDPG-MPC IBVS method proposed in this paper. The simulation is divided into two parts. First, the control effect of this method is compared with the traditional model predictive control visual servo method (MPC-IBVS)[21]. Then, the control effect of the proposed method in this study is compared with the model predictive control IBVS method tuned by Q-learning (Q-learning-MPC IBVS).
In this section, the control effects of the proposed DDPG-MPC IBVS method and traditional MPC-IBVS are analyzed. In addition, to show that the proposed method has a more stable and accurate control effect, two different weight-matrix settings of the MPC-IBVS method are selected, called MPC-IBVS-A and MPC-IBVS-B. For MPC-IBVS-A, the weight matrices are set as $R_s = 2I_{8\times 8}$, $R_u = I_{6\times 6}$, the prediction horizon as $N_p = 5$ and the control horizon as $N_c = 2$. For MPC-IBVS-B, the weight matrices are set as $R_s = 25I_{8\times 8}$, $R_u = 5I_{6\times 6}$, again with $N_p = 5$ and $N_c = 2$. The sampling period of the controller is 40 ms.
The weight matrices obtained after training and rectification by the proposed method are
$$R_s = \mathrm{diag}(10,\ 8,\ 7,\ 5,\ 5,\ 6,\ 8,\ 9),\qquad R_u = \mathrm{diag}(5,\ 3,\ 2,\ 3,\ 2,\ 4) \tag{4.4}$$
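These trained matrices plug directly into the cost of Eq (3.7); for instance, they could be constructed and passed to an MPC routine like the earlier mpc_step sketch (which, again, is our illustration rather than the paper's code):

```python
import numpy as np

Rs = np.diag([10.0, 8.0, 7.0, 5.0, 5.0, 6.0, 8.0, 9.0])  # image-deviation weights, Eq (4.4)
Ru = np.diag([5.0, 3.0, 2.0, 3.0, 2.0, 4.0])             # input-change weights, Eq (4.4)
```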
The prediction horizon is set as $N_p = 5$, and the control horizon as $N_c = 2$. To test the stability of the various control methods, the desired feature point image coordinates are set to $(400, 525)^T$ pixels, $(720, 525)^T$ pixels, $(720, 320)^T$ pixels and $(400, 320)^T$ pixels. The first set of initial feature point image coordinates, $p_1$, is $(233.5, 740.7)^T$ pixels, $(240, 600)^T$ pixels, $(115.2, 612.7)^T$ pixels and $(107.3, 760.1)^T$ pixels; the second set, $p_2$, is $(917.2, 350.1)^T$ pixels, $(919, 200.3)^T$ pixels, $(801.9, 226.1)^T$ pixels and $(800.2, 374.8)^T$ pixels. The robot manipulator joint velocity is limited to 0.5 rad/s.
First, Figure 3 gives the comparative simulation results of the DDPG-MPC IBVS, MPC-IBVS-A and MPC-IBVS-B methods under the initial coordinates p1. From Figure 3(a)–(c), it can be observed that under all three IBVS methods, the image feature points can successfully respond from the initial coordinates to the desired coordinates and finally stabilize in the desired state. However, Figure 3(d)–(f) shows that the image deviation converges at the fastest rate under the proposed method compared to the other methods. The settling time of the DDPG-MPC IBVS method is approximately 5 s. The settling time of MPC-IBVS-A is approximately 10 s. The settling time of MPC-IBVS-B is approximately 8 s. The steady-state error of the proposed method is less than those of MPC-IBVS-A and MPC-IBVS-B. The above experiments show that the proposed method has better control performance when the initial position is p1.
Then, the results of the comparative simulation at the initial coordinates $p_2$ are given in Figure 4. From Figure 4(a)–(c), it can be observed that the image feature points can also respond from the initial coordinates to the desired coordinates and stabilize in the desired state under all three MPC-based IBVS methods. As in the previous simulation, the proposed method makes the image feature points respond and stabilize at the desired coordinates the fastest (approximately 5 s). Unlike the previous simulation, for the initial state $p_2$ the settling time of MPC-IBVS-A (approximately 8 s) is shorter than that of MPC-IBVS-B (approximately 10 s). This indicates that a fixed, manually chosen set of objective-function weight matrix parameters cannot consistently give the optimal control effect across different visual servo tasks. Because the weight matrices contain multiple parameters, manually selecting the optimal values is also not achievable in practical applications. As the accumulated rewards increase, the reinforcement learning training gradually finds better weight matrix parameters, thus achieving better visual servo control performance.
In IBVS tasks, the relationship between the target object and the visual servo system is usually represented by four or more feature points. As shown in Eq (4.4), the four feature points in this study correspond to the eight parameters of the image deviation weight matrix $R_s$, while the 6-DOF manipulator corresponds to the six parameters of the weight matrix $R_u$. Fourteen weight-matrix parameters therefore need to be set in the MPC-based IBVS method. Although these fourteen parameters can be chosen by trial and error well enough to complete the visual servo task, finding the optimal parameters manually requires a prohibitive amount of work and is impossible in practical applications. The proposed method obtains better weight matrix parameters with a certain number of training rounds in less than one hour, improving the servo efficiency and accuracy and effectively improving the control quality.
The evaluation results of the above simulation experiments are shown in Table 2, which demonstrates that compared with the traditional MPC-based IBVS method, the DDPG-MPC IBVS method can complete the constrained visual servo task with higher control quality.
Table 2. Evaluation results of the comparative simulation experiments.

| Initial state | Metric | DDPG-MPC IBVS | MPC-IBVS-A | MPC-IBVS-B |
| --- | --- | --- | --- | --- |
| $p_1$ | Settling time (s) | 5.24 | 9.98 | 8.13 |
| $p_1$ | Image deviation (pixels) | 4.98 | 10.21 | 10.5 |
| $p_1$ | Control overshoot | 12% | 20% | 19% |
| $p_2$ | Settling time (s) | 4.89 | 8.07 | 10.1 |
| $p_2$ | Image deviation (pixels) | 5.1 | 11.26 | 11.18 |
| $p_2$ | Control overshoot | 12% | 18% | 20% |
In this part, the results of the comparative simulation between the DDPG-MPC IBVS method and the Q-learning-MPC IBVS method are analyzed. Q-learning is an extremely important RL algorithm, in which $Q(s,a)$ is the expected benefit obtained by taking action $a$ in state $s$. The environment feeds back the corresponding reward for each action, so the main idea of the Q-learning algorithm is to build a Q-table indexed by states and actions to store the Q-values and then select the action with the maximum expected gain according to these values. In the simulation of this study, the Q-learning algorithm uses a low-dimensional Q-table, and the objective-function weight matrix is tuned discretely by the Q-learning algorithm.
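For contrast with the DDPG sketch above, a tabular Q-learning update looks as follows. The learning rate and the discretization of states and weight-matrix actions into integer indices are illustrative assumptions; the paper does not specify them here.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, delta=0.95):
    """Tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + delta * max_a' Q(s',a') - Q(s,a)).
    Q is a 2-D array indexed by discrete state and action indices."""
    Q[s, a] += alpha * (r + delta * np.max(Q[s_next]) - Q[s, a])
    return Q
```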
In the comparison experiments, the desired coordinates of the feature points are the same as in the previous section, and the initial coordinates of the feature points are $p_1$. From Figure 5(a), (b), we find that both the DDPG-MPC IBVS method and the Q-learning-MPC IBVS method cope well with the system constraints. From Figure 5(d), (e), it can be observed that DDPG-MPC IBVS achieves faster servo deviation convergence than the Q-learning-MPC IBVS method, at approximately 5 s versus 7 s, respectively. Furthermore, the deviation response curve of DDPG-MPC IBVS is smoother. This is because the action set of Q-learning is discrete, whereas DDPG trains and corrects continuous visual servo actions through continuous network functions. The DDPG algorithm therefore tunes a more appropriate objective-function weight matrix and achieves better visual servo control. As shown in Figure 5(c), the cumulative rewards of the two RL methods increase with the number of rounds, but the final cumulative reward of the DDPG algorithm is higher than that of the Q-learning algorithm, indicating that DDPG achieved better training results. This means that the DDPG-MPC IBVS method can tune more suitable weight matrix parameters, which leads to better visual servo control performance. The changes in the camera velocity $V = [v_x, v_y, v_z, w_x, w_y, w_z]^T \in \mathbb{R}^{6\times 1}$ under the two methods are shown in Figure 5(f). The camera velocities do not fluctuate much under either method, meaning both response processes are smooth.
To describe the effectiveness of the proposed method more concretely, we quantitatively analyze the visual servo control performance of model predictive control tuned by the two different reinforcement learning algorithms. As seen from Table 3, over 30 random experiments, both methods complete the constrained visual servo task with a high success rate and similar average image deviation; however, the DDPG-MPC IBVS method always converges faster.
Table 3. Quantitative comparison over 30 random experiments.

| Metric | DDPG-MPC IBVS | Q-learning-MPC IBVS |
| --- | --- | --- |
| Success rate | 100% | 100% |
| Average overshoot | 12% | 14% |
| Average settling time (s) | 5.12 | 7.39 |
| Average image deviation (pixels) | 5.06 | 5.2 |
In conclusion, the proposed method has higher servo efficiency and better control quality than the existing IBVS methods considered, and it can effectively perform visual servo tasks.
In visual servoing, system constraints, including visibility constraints and actuator constraints, must be considered. To solve the constrained visual servo problem, a model predictive control IBVS method tuned by RL is proposed in this study. First, a depth-independent Jacobian matrix is established as the predictive model, and the optimal control input is found by minimizing the cost function of the prediction error. Unlike traditional model predictive control methods, the weight matrices of the objective function are adjusted offline by the DDPG algorithm, with appropriate states, rewards and actions defined for the training process. The accumulated rewards then converge to specific values, which means that the DDPG algorithm has successfully learned appropriate weight matrix parameters. Finally, in simulation experiments with a 6-DOF manipulator, the control effect of the proposed method is compared with other visual servo control methods, and the proposed method shows better performance. In future work, we plan to explore visual servo control strategies for the delays caused by low-quality visual signals. We will design a predictive control IBVS method based on a time-delay predictive model to improve the control stability of visual servo systems with time delay.
This research was funded by the Natural Science Foundation of Heilongjiang Province, grant number KY10400210217, and the Fundamental Strengthening Program Technical Field Fund, grant number 2021-JCJQ-JJ-0026.
The authors declare there is no conflict of interest.
[1] T. K. Chan, C. S. Chin, A comprehensive review of polyphonic sound event detection, IEEE Access, 8 (2020), 103339–103373. https://doi.org/10.1109/ACCESS.2020.2999388
[2] A. Mesaros, T. Heittola, T. Virtanen, M. D. Plumbley, Sound event detection: A tutorial, IEEE Signal Process Mag., 38 (2021), 67–83. https://doi.org/10.1109/MSP.2021.3090678
[3] J. P. Bello, C. Silva, O. Nov, R. L. Dubois, A. Arora, J. Salamon, et al., Sonyc: A system for monitoring, analyzing, and mitigating urban noise pollution, Commun. ACM, 62 (2019), 68–77. https://doi.org/10.1145/3224204
[4] T. Hu, C. Zhang, B. Cheng, X. P. Wu, Research on abnormal audio event detection based on convolutional neural network (in Chinese), J. Signal Process., 34 (2018), 357–367. https://doi.org/10.16798/j.issn.1003-0530.2018.03.013
[5] D. Stowell, M. Wood, Y. Stylianou, H. Glotin, Bird detection in audio: A survey and a challenge, in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), (2016), 1–6. https://doi.org/10.1109/MLSP.2016.7738875
[6] K. K. Lell, A. Pja, Automatic COVID-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: Cough, breath, and voice, AIMS Public Health, 8 (2021), 240. https://doi.org/10.3934/publichealth.2021019
[7] K. K. Lella, A. Pja, Automatic diagnosis of COVID-19 disease using deep convolutional neural network with multi-feature channel from respiratory sound data: Cough, voice, and breath, Alexandria Eng. J., 61 (2022), 1319–1334. https://doi.org/10.1016/j.aej.2021.06.024
[8] G. Chen, M. Liu, J. Chen, Frequency-temporal-logic-based bearing fault diagnosis and fault interpretation using Bayesian optimization with Bayesian neural network, Mech. Syst. Signal Process., 145 (2020), 1–21. https://doi.org/10.1016/j.ymssp.2020.106951
[9] S. Adavanne, A. Politis, T. Virtanen, A multi-room reverberant dataset for sound event localization and detection, preprint, arXiv:1905.08546.
[10] S. R. Eddy, What is a hidden Markov model, Nat. Biotechnol., 22 (2004), 1315–1316. https://doi.org/10.1038/nbt1004-1315
[11] J. Wang, S. Sun, Y. Ning, M. Zhang, W. Pang, Ultrasonic TDoA indoor localization based on Piezoelectric Micromachined Ultrasonic Transducers, in 2021 IEEE International Ultrasonics Symposium (IUS), (2021), 1–3. https://doi.org/10.1109/IUS52206.2021.9593813
[12] C. Liu, J. Yun, J. Su, Direct solution for fixed source location using well-posed TDOA and FDOA measurements, J. Syst. Eng. Electron., 31 (2020), 666–673. https://doi.org/10.23919/JSEE.2020.000042
[13] T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, K. Takeda, BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic sound event detection, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2017), 766–770. https://doi.org/10.1109/ICASSP.2017.7952259
[14] H. Zhu, H. Wan, Single sound source localization using convolutional neural networks trained with spiral source, in 2020 5th International Conference on Automation, Control and Robotics Engineering (CACRE), (2020), 720–724. https://doi.org/10.1109/CACRE50138.2020.9230056
[15] S. Adavanne, A. Politis, J. Nikunen, T. Virtanen, Sound event localization and detection of overlapping sources using convolutional recurrent neural networks, IEEE J. Sel. Top. Signal Process., 13 (2019), 34–48. https://doi.org/10.1109/JSTSP.2018.2885636
[16] T. Komatsu, M. Togami, T. Takahashi, Sound event localization and detection using convolutional recurrent neural networks and gated linear units, in 2020 28th European Signal Processing Conference (EUSIPCO), (2021), 41–45. https://doi.org/10.23919/Eusipco47968.2020.9287372
[17] V. Spoorthy, S. G. Koolagudi, A transpose-SELDNet for polyphonic sound event localization and detection, in 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), (2023), 1–6. https://doi.org/10.1109/I2CT57861.2023.10126251
[18] J. S. Kim, H. J. Park, W. Shin, S. W. Han, AD-YOLO: You look only once in training multiple sound event localization and detection, in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2023), 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096460
[19] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 779–788. https://doi.org/10.1109/CVPR.2016.91
[20] H. Zhang, I. McLoughlin, Y. Song, Robust sound event recognition using convolutional neural networks, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2015), 559–563. https://doi.org/10.1109/ICASSP.2015.7178031
[21] H. Phan, L. Pham, P. Koch, N. Q. K. Duong, I. McLoughlin, A. Mertins, On multitask loss function for audio event detection and localization, preprint, arXiv:2009.05527.
[22] S. Adavanne, A. Politi, T. Virtanen, Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network, preprint, arXiv:1904.12769.
[23] Z. X. Han, Research on robot sound source localization method based on beamforming (in Chinese), Nanjing Univ. Inf. Sci. Technol., 2022 (2022). https://doi.org/10.27248/d.cnki.gnjqc.2021.000637
[24] T. N. T. Nguyen, K. N. Watcharasupat, N. K. Nguyen, D. L. Jones, W. S. Gan, SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection, IEEE/ACM Trans. Audio Speech Lang. Process., 30 (2022), 1749–1762. https://doi.org/10.1109/TASLP.2022.3173054
[25] A. Politis, A. Mesaros, S. Adavanne, T. Heittola, T. Virtanen, Overview and evaluation of sound event localization and detection in DCASE 2019, IEEE/ACM Trans. Audio Speech Lang. Process., 29 (2021), 684–698. https://doi.org/10.1109/TASLP.2020.3047233
[26] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, M. D. Plumbley, Polyphonic sound event detection and localization using a two-stage strategy, preprint, arXiv:1905.00268.
[27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778. https://doi.org/10.1109/CVPR.2016.90
[28] J. Naranjo-Alcazar, S. Perez-Castanos, J. Ferrandis, P. Zuccarello, M. Cobos, Sound event localization and detection using squeeze-excitation residual CNNs, preprint, arXiv:2006.14436.
[29] R. Ranjan, S. Jayabalan, T. Nguyen, W. Gan, Sound event detection and direction of arrival estimation using Residual Net and recurrent neural networks, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), (2019), 214–218. https://doi.org/10.33682/93dp-f064
[30] J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks, IEEE Trans. Pattern Anal. Mach. Intell., 42 (2020), 2011–2023. https://doi.org/10.1109/TPAMI.2019.2913372
[31] D. L. Huang, R. F. Perez, Sseldnet: A fully end-to-end sample-level framework for sound event localization and detection, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), (2021), 1–5.
[32] S. Woo, J. Park, J. Y. Lee, I. S. Kweon, CBAM: Convolutional Block Attention Module, in Proceedings of the European Conference on Computer Vision (ECCV), (2018), 3–19. https://doi.org/10.1007/978-3-030-01234-2_1
[33] J. W. Kim, G. W. Lee, C. S. Park, H. K. Kim, Sound event detection using EfficientNet-B2 with an attentional pyramid network, in 2023 IEEE International Conference on Consumer Electronics (ICCE), (2023), 1–2. https://doi.org/10.1109/ICCE56470.2023.10043590
[34] C. Xu, H. Liu, Y. Min, Y. Zhen, Sound event localization and detection based on dual attention (in Chinese), Comput. Eng. Appl., 2022 (2022), 1–11.
[35] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, Q. Hu, ECA-Net: Efficient channel attention for deep convolutional neural networks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 11534–11542. https://doi.org/10.1109/CVPR42600.2020.01155
[36] J. Jia, M. Sun, G. Wu, W. Qiu, W. G. Qiu, DeepDN_iGlu: Prediction of lysine glutarylation sites based on attention residual learning method and DenseNet, Math. Biosci. Eng., 20 (2023), 2815–2830. https://doi.org/10.3934/mbe.2023132
[37] Z. Yang, L. Zhu, Y. Wu, Y. Yang, Gated Channel Transformation for visual recognition, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 11791–11800. https://doi.org/10.1109/CVPR42600.2020.01181
[38] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, et al., ResNeSt: Split-attention networks, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (2022), 2735–2745. https://doi.org/10.1109/CVPRW56347.2022.00309
[39] Q. Hou, D. Zhou, J. Feng, Coordinate attention for efficient mobile network design, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 13708–13717. https://doi.org/10.1109/CVPR46437.2021.01350
[40] A. Politis, S. Adavanne, T. Virtanen, A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection, preprint, arXiv:2006.01919.
[41] A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, T. Virtanen, A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection, preprint, arXiv:2106.06999.
[42] A. Politis, A. Mesaros, S. Adavanne, T. Heittola, T. Virtanen, Overview and evaluation of sound event localization and detection in DCASE 2019, IEEE/ACM Trans. Audio Speech Lang. Process., 29 (2021), 684–698. https://doi.org/10.1109/TASLP.2020.3047233
[43] K. Liu, X. Zhao, Y. Hu, Y. Fu, Modeling the effects of individual and group heterogeneity on multi-aspect rating behavior, Front. Data Comput., 2 (2020), 59–77. https://doi.org/10.11871/jfdc.issn.2096-742X.2020.02.005
[44] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM Trans. Audio Speech Lang. Process., 28 (2020), 2880–2894. https://doi.org/10.1109/TASLP.2020.3030497