
Over the last two decades, considerable progress has been made in the research and application of nonlinear systems. Combined with Lyapunov stability theory, many results have been obtained in the control theory of nonlinear systems. Wu et al. [1] designed an adaptive quantitative control method for uncertain nonlinear strict-feedback systems by combining the backstepping technique with Lyapunov stability theory. Shatyrko et al. [2] studied the stabilization of Lur'e-type nonlinear indirect control systems with a time-delay argument and obtained sufficient conditions for absolute stability by the Lyapunov method. Kai et al. [3] discussed a control method that generates a desired limit-cycle-like behavior for two-dimensional discrete-time nonlinear control systems. Wei [4] discussed the boundedness of nonlinear nabla fractional-order systems and, using the nabla Laplace transform, derived two stability criteria in the form of Lyapunov theorems. Pola et al. [5] described incrementally globally asymptotically stable nonlinear control systems through symbolic models, which makes it possible to use techniques from supervisory control and algorithms from game theory for controller synthesis.
At the same time, the development of the optimization theory of dynamic systems and the research and application of nonlinear control systems have opened up a vast space of applications. Zhang et al. [6] introduced adaptive dynamic programming (ADP) for the control of synchronous generators and generator excitation in multi-machine power systems. Volckaert et al. [7] used iterative learning control (ILC) to solve the control problem of a cart-cable system with nonlinear characteristics. Gao et al. [8] used a global adaptive dynamic programming (GADP) method for cooperative adaptive cruise control (CACC) of connected vehicles with unknown nonlinear dynamics. Trélat [9] reviewed common methods for solving nonlinear optimal control problems in aerospace, such as Pontryagin's maximum principle and conjugate point theory.
At present, many scholars combine nonlinear systems with optimal control theory to study the optimal control of nonlinear systems. The main methods for solving the optimal control problem are the classical variational method, the maximum principle, and dynamic programming [10,11,12]. One of the main difficulties of these classical optimal control theories is determining the optimal control of a nonlinear system, which requires solving the Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) [13]. Due to the complexity of nonlinear systems and the various constraints of real systems, nonlinear HJB equations and nonlinear two-point boundary value problems cannot be solved analytically. Therefore, it is of great significance to study numerical methods or explicit approximations to the optimal solution for solving the optimal control problem.
With the development of various technologies in the field of artificial intelligence, many methods that combine artificial intelligence algorithms to solve nonlinear system problems have emerged [14,15]. Another powerful methodology for solving optimal control problems is the adaptive dynamic programming (ADP) algorithm, proposed by Werbos in 1991 [16]. Combining reinforcement learning with dynamic programming, ADP simulates human learning through environmental feedback and is considered a method very close to the intelligence of the human brain. It effectively addresses the "curse of dimensionality" in dynamic programming [17]. In 1997, Prokhorov and Wunsch discussed the design of Heuristic Dynamic Programming (HDP), Dual Heuristic Programming (DHP), and Global Dual Heuristic Programming (GDHP), and proposed implementation methods and training steps for ADP [18]. The essence of ADP is to use function approximation structures to approximate the performance index function and the control policy in the dynamic programming equation, and to obtain the optimal control and the optimal performance index function under the Bellman optimality principle [19,20,21]. An ADP structure generally has three parts, namely the dynamic system, the critic (performance index function), and the control action, each of which can be implemented by a neural network [22]. The mathematical derivation of the ADP method is based on minimizing the performance index function, or cost function, over an infinite horizon, where the performance index function is the integral or summation of the utility function. In [23], combined with Bellman's principle of optimality, a discount factor was introduced into the cost function to solve the optimal control problem. This brings the formulation closer to the role of the discount rate in reinforcement learning, so that the model can better account for future returns during training [24].
The research in the above literature is mainly aimed at designing and optimizing the network structure in ADP and at proposing ADP algorithms with different structures. In recent years, many scholars have also studied the stability and convergence of the closed-loop systems formed by ADP iterative algorithms. Mu et al. [25] proposed an approximate optimal control strategy that uses ADP to obtain an approximate optimal tracking control law for uncertain nonlinear systems with a predefined cost function, solving the tracking control problem of continuous-time nonlinear systems with unmatched uncertainty. Dong et al. [26] proposed a critic-only ADP algorithm to approximate the solution of the HJB equation and the corresponding optimal control strategy; it solves the tracking control problem described by Euler-Lagrange equations with uncertain system parameters, and uniformly ultimately bounded stability is guaranteed via a Lyapunov-based analysis. Song et al. [27] proposed a value iteration (VI) algorithm based on ADP to solve the fixed-point tracking control problem, gave a convergence proof of the VI algorithm, and proved that the iterative cost function converges exactly to the optimal value. Liang et al. [28] constructed a partial-policy-iteration ADP algorithm to solve the optimal control problem of nonlinear systems with discounted total rewards; compared with the traditional policy-iteration ADP algorithm, the computational load of the iterative process is reduced and the stability is improved. Fan et al. [29] proposed a novel control strategy based on robust adaptive dynamic programming (RADP) for the optimal control of a class of output-constrained continuous-time unknown nonlinear systems. Yang et al. [30] proposed a self-learning robust optimal control scheme based on ADP, using indicator functions and concurrent learning techniques to handle mismatched perturbed input-affine continuous-time nonlinear systems.
However, most ADP algorithms are either implemented offline by iterative methods or require prior knowledge of the system dynamics. Such ADP algorithms are intractable for real-time control applications, since exact knowledge of nonlinear dynamic systems is often unavailable. Many scholars use reinforcement learning methods to address these problems. Liu et al. [31] proposed a robust adaptive control algorithm based on reinforcement learning (RL) to solve the control problem of continuous-time uncertain nonlinear systems with input constraints; the algorithm uses a single evaluation network to obtain an approximate optimal control and guarantees a certain stability of the uncertain nonlinear continuous-time system. Liu et al. [32] constructed two kinds of neural networks to achieve approximate optimal control for continuous-time nonlinear systems with unknown structure and control constraints: a recurrent neural network (RNN) is used to identify the system model, and two feedforward neural networks are used as the action network and the evaluation network, respectively, to approximate the optimal control strategy and the optimal performance index function. Zhao et al. [33] proposed an RL-based cyclic fixed finite-time algorithm to solve the HJB equation for the optimal control of continuous nonlinear finite-time systems with uncertain structure. Zhao et al. [34] proposed a new reinforcement learning strategy consisting of two parts, an online learning optimal control of a nominal system and a feedforward neural network (NN) compensation that handles uncertain input constraints; this strategy solves the optimal stabilization problem of unknown nonlinear systems with uncertain input constraints. Wang et al. [35] studied the approximate optimal control of continuous-time non-affine nonlinear systems using reinforcement learning, applying an effective pre-compensation technique for a proper system transformation combined with policy iteration to alleviate the need for knowledge of the system dynamics. Kim et al. [36] proposed a model-based RL approach that uses deep neural networks (DNNs) to iteratively learn the solutions of the HJB equation and its associated equations, which can significantly improve the performance of learned policies under uncertain initial states and state noise.
Algorithms based on dynamic programming and iterative strategies start from the performance index function over the state space of the control system, explore the control effect of control sequences, and obtain the optimal control law [37,38,39,40]. The above literature has achieved strong results in the design of network structures and in the convergence proofs of policy algorithms. In the study of optimal control of nonlinear systems, more attention is paid to the control details of dynamic programming algorithms: the goal is for the evolution of the state space during the whole control process to meet the detailed requirements of the control system and ultimately achieve the control objective. This requires that the utility function in the performance index function be able to express the user's specific needs. Most of the above works use quadratic forms when defining utility functions, which do not describe the relationship between the utility function and the system state well; moreover, there are severe limitations on how the parameters of such utility functions can be set. There is little literature on how to design a utility function that is closely related to the actual control objective of a nonlinear system, and this is the research direction of this paper.
This paper proposes a detail-reward mechanism (DRM) based on the control process. By designing evaluation functions for the control objectives, multiple evaluation functions are combined into a reward function that replaces the utility function in the performance index function, and this reward function is then introduced into deep reinforcement learning algorithms. We use three algorithms to verify the effectiveness of the DRM, based on the Deep Q-Network, policy gradient, and actor-critic frameworks, respectively.
The remainder of this paper is organized as follows. Section 2 describes nonlinear system properties and introduces the limitations of utility functions in iterative algorithms. In Section 3, an alternative approach to the utility function is proposed combining Lyapunov stability and Q-learning. Taking the inverted pendulum system as the experiment object, Section 4 establishes the dynamic environment, constructs the reward function mechanism of the inverted pendulum system, analyzes the deep reinforcement learning algorithm model and designs experiments to verify the effectiveness of the algorithm. Finally, Section 5 concludes this work.
Consider a continuous-time nonlinear system given by:
˙x(t)=f[x(t),u(t),t], | (1) |
where x(t)∈Rn is the system state vector, u(t)∈Ω⊂Rm is the control input vector, and f(⋅) is a continuously differentiable vector function. The boundary conditions are that the initial condition x(t0)=x0 is fixed and the end condition x(tf) is free. The performance index function is defined as:
$J(x(t),t)=\varphi(x(t_f),t_f)+\int_{t}^{t_f}L(x(\tau),u(\tau),\tau)\,d\tau$,  (2)
where the control vector u(t) is unconstrained and continuous. In the interval [t,tf], the scalar functions φ(⋅) and L(⋅) are continuous and twice differentiable. Function J(x(t),t) is continuous and has continuous first and second order partial derivatives with respect to x(t) and t. In the admissible control domain Ω, u[t,tf]∈Ω, so the optimal performance index function is defined as:
$J^{*}(x(t),t)=\min_{u[t,t_f]\in\Omega}\Big\{\varphi(x(t_f),t_f)+\int_{t}^{t_f}L(x(\tau),u(\tau),\tau)\,d\tau\Big\}$.  (3)
Then the continuous time HJB equation can be:
$\dfrac{\partial J^{*}(x(t),t)}{\partial t}=-\min_{u(t)\in\Omega}\Big\{L(x(t),u(t),t)+\Big(\dfrac{\partial J^{*}(x(t),t)}{\partial x}\Big)^{T}f[x(t),u(t),t]\Big\}$.  (4)
For optimal control law u∗(t), Eq (4) can be written as:
$\dfrac{\partial J^{*}(x(t),t)}{\partial t}=-L(x(t),u^{*}(t),t)-\Big(\dfrac{\partial J^{*}(x(t),t)}{\partial x}\Big)^{T}f[x(t),u^{*}(t),t]$.  (5)
According to Euler's discretization [41], the discrete NCS function of the control system can be as follows:
xk+1=F(xk,uk). | (6) |
The nonlinear system is expressed by the function F(xk,uk), where xk∈Ωx⊂Rn is the system state vector, and uk∈Ωu⊂Rm is the control input vector. Ωx and Ωu are defined as:
{Ωx={xk|xk∈Rn and ||xk||<∞}Ωu={uk|uk∈Rm and ||uk||<∞}, | (7) |
where ||⋅|| denotes the Euclidean norm. Let u_k=(uk,uk+1,...) denote the control sequence from k to ∞. Let U(xk,uk) be the utility function. The performance index function is defined as:
J(xk,u_k)=∑∞i=kU(xi,ui). | (8) |
For a control sequence u_k, the optimal performance index function can be defined as:
J∗(xk)≐minu_k{J(xk, u_k)}. | (9) |
So we have:
J∗(xk)=minuk{U(xk,uk)+J∗(xk+1)}. | (10) |
Then according to Bellman's principle of optimality [42], the discrete time HJB equation can be:
J∗(xk)=U(xk,u∗(xk))+J∗(F(xk,u∗(xk))). | (11) |
The optimal single control law u∗(xk) can be expressed as:
u∗(xk)=arg minuk{U(xk,u(xk))+J∗(F(xk,u(xk)))}. | (12) |
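As a concrete illustration of the Bellman recursion in Eqs (10)-(12), the following is a minimal value-iteration sketch on a coarsely discretized scalar system. The dynamics, utility, and grids below are toy assumptions, not the paper's benchmark; the point is only how $J^{*}$ and the greedy control law are obtained numerically.

```python
import numpy as np

# Toy value iteration for Eqs (10)-(12) on a discretized scalar system
# x_{k+1} = F(x_k, u_k).  States and controls are restricted to finite grids.
X = np.linspace(-1.0, 1.0, 41)          # discretized state space
U = np.linspace(-0.5, 0.5, 11)          # discretized control space

def F(x, u):                             # toy nonlinear dynamics (assumed)
    return 0.9 * x + 0.1 * np.sin(x) + u

def utility(x, u):                       # quadratic utility U(x_k, u_k)
    return x**2 + 0.1 * u**2

J = np.zeros(len(X))                     # J*(x) initialized to zero
for _ in range(200):                     # Bellman backups, Eq (10)
    J_new = np.empty_like(J)
    for i, x in enumerate(X):
        # grid index nearest to F(x, u) for every candidate control u
        nxt = np.abs(F(x, U)[:, None] - X[None, :]).argmin(axis=1)
        J_new[i] = np.min(utility(x, U) + J[nxt])
    if np.max(np.abs(J_new - J)) < 1e-8:
        break
    J = J_new

def u_star(x):                           # greedy optimal control law, Eq (12)
    nxt = np.abs(F(x, U)[:, None] - X[None, :]).argmin(axis=1)
    return U[np.argmin(utility(x, U) + J[nxt])]
```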
The above is the basic principle of using dynamic programming to solve the optimal control of the continuous-time and discrete-time systems. The optimal control vector function and the optimal state trajectory can be obtained through the HJB equation, but the process is often very complicated or intractable. In addition, when the discrete system is of high order, the "curse of dimensionality" easily arises. A common alternative is ADP. The performance index function in ADP is usually defined in quadratic form, similar to the following expression:
$J(x(t),u(t),t)=\dfrac{1}{2}\int_{t}^{\infty}\big[x^{T}(t)Q(t)x(t)+u^{T}(t)R(t)u(t)\big]\,dt$,  (13)
where Q is a l×l-dimensional nonnegative definite matrix, and R is a m×m-dimensional positive definite matrix. The values of l and m depend on the dimensions of x(t) and u(t) respectively. The utility function is generally expressed in the following form:
$U(x_k,u_k)=x_k^{T}Qx_k+u_k^{T}Ru_k$.  (14)
In most cases, Q and R are set as identity matrices of suitable dimensions, but this is often not consistent with the real control system and its control objectives. For example, in a specific control system a high-frequency control law may damage electronic components, so high-frequency control should be avoided. The mechanical characteristics of some systems require the control law to be as smooth and flexible as possible. Moreover, a specific operating environment may require the state trajectory produced by the control law to satisfy specific requirements. This calls for a utility function with a more flexible nonlinear form. As far as we know, there is very little literature on how to assign appropriate values to Q and R so that the utility function reflects the requirements of the real control system; in fact, it is difficult or even impossible to determine the two matrices accurately to match a real control system.
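For reference, a minimal sketch of the quadratic utility of Eq (14) is given below. The particular Q and R values are illustrative assumptions; the example only makes concrete how strongly the utility depends on these two weight matrices, which is the limitation discussed above.

```python
import numpy as np

# Quadratic utility of Eq (14) with user-chosen weight matrices.
# The values of Q and R below are illustrative, not taken from the paper.
Q = np.diag([1.0, 0.1])        # state weights (l = 2 in this toy example)
R = np.diag([0.01])            # control weight (m = 1 in this toy example)

def utility(x, u):
    """U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k."""
    return float(x @ Q @ x + u @ R @ u)

x_k = np.array([0.3, -0.05])
u_k = np.array([1.2])
print(utility(x_k, u_k))       # a single scalar cost for this state-action pair
```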
In order to solve the above problems, we try to modify the utility function so that it can not only satisfy the mathematical requirements of the best performance indicator, but also satisfy our requirements for the control process. Firstly, define a new function R(xk,uk) and use it as the mapping object of the utility function U(xk,uk). Then, for a known NCS f[x(t),u(t),t], the discrete method is used to establish the function F(xk,uk), so as to establish the dynamic environment. Using the dynamic environment, the output vector xk+1 under the input vectors xk and uk can be solved, and the R(xk,uk) matching with U(xk,uk) can be output.
Thus, how to design the dynamic environment and how to design the output function R(xk,uk) corresponding to U(xk,uk) has become one of the key problems to solve in the nonlinear control systems. For mathematical convenience, we make the following conventions and constraints on R(xk,uk):
1) The discrete nonlinear control sequence can be defined as a finite Markov decision process (MDP)M=(S,A,R), where: S={x0,x1,…,xN},xi∈Ωx,N>1 is a finite set of states; A={u0,u1,…,uN},ui∈Ωu is a set of control actions; R is the reward distributions, and it can be written as R:S×A×S↦R, with R(xk,uk,xk+1)∈[0,rmax] being the reward received upon taking uk in state xk transitioning to state xk+1. The boundedness of the reward function of R(xk,uk,xk+1) is to ensure that the performance index function composed of it is bounded. The intuitive explanation of the reward function is: the higher the compliance of the control effect, the closer the value is to rmax, and the lower the compliance of the control effect, the closer the value is to 0.
2) For a specific state space κ∈Rm, it is the subspace of state space x, i.e., κ⊂x, and it is composed of specific components of state space x. In this state, there may be specific control requirements or the designer's objective evaluation of the state. We regard κ as an observation perspective of state space. Therefore, we define the evaluation function eκ(xk,uk)∈[0,1] under the perspective of κ. In this perspective, specific control can be done and the control effect (xk,uk) can be evaluated. As a component of the reward function, the evaluation function under a specific observation perspective is often helpful to guide the follow-up control to better approach the control goal. As shown in Figure 1, in a gradually stable MDP control trajectory, the control vector in state xk is uk, the reward value is rk, and xk transitions to the next state xk+1. If there is a better control vector u′k under the observation perspective κk a at this time step, the state is directly transitioned to xk+i, bypassing the middle i−1 control processes, and the MDP trajectory becomes (xk,u′k,r′k). Therefore, such a control law is suitable for increasing the reward value. According to the different observation perspectives, the evaluation function e(xk,uk) related to the state-control pair (xk,uk) is formulated to evaluate the control validity. When i>0, ei(⋅) has a value range of [0,1] and is continuously differentiable.
3) R(⋅) can be defined as:
R(xk,uk)≐∏n−1i=0ei(xk,uk),n>1 | (15) |
and
$e_0(x_k,u_k)=\begin{cases}r_{\max}, & x_k\in\Omega_x \text{ and } u_k\in\Omega_u\\ 0, & \text{otherwise.}\end{cases}$  (16)
Here r_max ≥ 1 is the maximum value of the reward function (it is set to 1 in this paper). When i>0, ei(⋅) should be continuous and twice differentiable in order to ensure that R(⋅) is also continuous and twice differentiable whenever R(⋅)>0. The evaluation function sequence is (e0(xk,uk),...,en−1(xk,uk)), where n is the number of evaluation functions, i.e., the effectiveness of control is evaluated from n different observation perspectives.
A dynamic environment is constructed from the above formulas, as shown in Figure 2. The inputs are the state-control pair (xk,uk), and the outputs are the next system state xk+1 and R(xk,uk). If R(⋅) is regarded as the reward value of the dynamic environment, the transformation from the utility function of the NCS to the reward is complete, i.e., the utility function is replaced by the reward value. Qualitatively, the smaller the value of the utility function, the greater the reward value, with the reward restricted to the interval from 0 to rmax.
In order to satisfy the Eqs (8)–(11), we can express the relationship between U(xk,uk) and R(xk,uk) with the following equation:
U(xk,uk)=e0(xk,uk)−R(xk,uk). | (17) |
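The following sketch makes the construction of Eqs (15)-(17) concrete: the reward is a product of bounded evaluation functions, and the utility it replaces is $U=e_0-R$. The evaluation functions e1 and e2 below are illustrative placeholders, not the ones used for the inverted pendulum later in the paper.

```python
import numpy as np

R_MAX = 1.0

def e0(x, u, x_bound=1.0, u_bound=1.0):
    """Admissibility check of Eq (16): r_max inside the domain, 0 outside."""
    return R_MAX if np.all(np.abs(x) < x_bound) and np.all(np.abs(u) < u_bound) else 0.0

def e1(x, u):
    """Example observation perspective: prefer states near the origin (range [0, 1])."""
    return float(np.clip(1.0 - np.linalg.norm(x), 0.0, 1.0))

def e2(x, u):
    """Example observation perspective: prefer small control effort (range [0, 1])."""
    return float(np.exp(-np.linalg.norm(u)))

def reward(x, u):
    """R(x_k, u_k) = prod_i e_i(x_k, u_k), Eq (15); any zero score vetoes the reward."""
    return e0(x, u) * e1(x, u) * e2(x, u)

def utility(x, u):
    """U(x_k, u_k) = e_0(x_k, u_k) - R(x_k, u_k), Eq (17)."""
    return e0(x, u) - reward(x, u)
```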
It is worth mentioning that the n evaluation functions and their respective observation perspectives should be designed consistent with the control objectives. The design of observation perspectives can mainly focus on two points: the first is to monitor the effect of the control law in a specific state during the control process, the second is that this design is conducive to the system to achieve an optimal control effect. The observation perspective of the evaluation function should not be repeated, and there should be no conflict in control methods between the evaluation functions, which means that the high score of the control law under one observation perspective is easier to guide the control law to obtain high scores under other observation perspectives. However, the high score of the control law under a certain observation perspective may not immediately lead to high scores under other observation perspectives, but this will cause the control law in the current state to get high scores with great probability under more observation perspectives in the next time steps.
According to the definition of expected discounted return in reinforcement learning theory, let Rk=R(xk,uk) and bring it into the following equation:
$G_t\doteq\sum_{k=t+1}^{T}\gamma^{\,k-t-1}R_k$,  (18)
where γ is the discount rate, 0≤γ≤1, Rk is the reward of step k, Rk∈R⊂R, R is reward sets, and T is a final time step.
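A minimal helper for Eq (18) is shown below; the backward accumulation is a standard way to compute the discounted return and the sample rewards are made-up numbers.

```python
def discounted_return(rewards, gamma=0.99):
    """G_t of Eq (18): sum_{k=t+1}^{T} gamma^{k-t-1} R_k, computed backwards."""
    g = 0.0
    for r in reversed(rewards):      # rewards = [R_{t+1}, ..., R_T]
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 0.5], gamma=0.9))   # 1 + 0.9*1 + 0.81*0.5
```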
The advantages of using reward function R(xk,uk) instead of utility function U(xk,uk) are as follows:
1) The reward function reflects the feasible control objectives from various observation perspectives. It may be a nonlinear function instead of the linear expression of the utility function, which can better satisfy the detailed control objectives.
2) The reward function R(xk,uk) is composed of several evaluation functions from their respective observation perspectives. Each evaluation function e(xk,uk) is specific, simple and easy to calculate, which avoids the problem that it is difficult to specify the weight matrix Q and R in the utility function of U(xk,uk).
3) For a specific real control platform, the reward function is easier to realize. According to the requirements of control objectives and control details, it is easier to adjust each evaluation function, and the adjustment of one evaluation function will not affect other evaluation functions.
4) After adding the reward function in the dynamic environment, we can try to use more skillful reinforcement learning algorithms to solve the nonlinear control problem.
In Eq (15), the reward function takes the form of a product of the evaluation functions rather than a sum or another combination. The advantages of doing this are as follows:
1) If the reward function were a sum of the evaluation functions, then changing one of the evaluation functions would also require reconsidering the weight of each evaluation function, which makes the problem very complex and whose rationality would need to be carefully verified experimentally. In the product form, changing one evaluation function only changes its own calculation method and possible gain, and there is no need to change the weights of the other evaluation functions. In fact, the output value of each function can be understood as its own weight.
2) With the product form, any evaluation function can exercise a "one-vote veto". For example, in an experiment on controlling a walking robot, no matter how high a score the evaluation function responsible for the robot's posture gives, the robot should receive a zero score as soon as it touches the boundary. According to Eq (15), when the evaluation function responsible for boundary detection gives a zero score, the reward function is zero. A sum of evaluation functions would not have this effect.
In order to comply with Eqs (3)–(5) and (11), we try to modify Eq (8) as:
$J(x_k,\underline{u}_k)=\sum_{i=k}^{\infty}U(x_i,u_i)=\sum_{i=k}^{\infty}\big(e_0(x_i,u_i)-R(x_i,u_i)\big)=\sum_{i=k}^{\infty}\big(r_{\max}-R(x_i,u_i)\big)$.  (19)
Let
Ψ(xk,u_k)=∑∞i=kR(xi,ui), | (20) |
and we have:
J(xk,u_k)=∑∞i=krmax−Ψ(xk,u_k). | (21) |
Therefore, as long as u_k conforms to the system under (6), it satisfies the optimality equation:
Ψ∗(xk)=maxu_k{Ψ(xk, u_k)}. | (22) |
Hence Ψ∗(xk) can be obtained as:
Ψ∗(xk)=maxuk{R(xk,uk)+Ψ∗(xk+1)}. | (23) |
So the optimal performance index function has another form:
$J^{*}(x_k)=\min_{\underline{u}_k}\{J(x_k,\underline{u}_k)\}=\sum_{i=k}^{\infty}r_{\max}-\Psi^{*}(x_k)$.  (24)
Recalling Eq (10), we get:
$J^{*}(x_k)=\min_{u_k}\{U(x_k,u_k)+J^{*}(x_{k+1})\}=r_{\max}-R(x_k,u^{*}(x_k))+J^{*}(F(x_k,u^{*}(x_k)))$,  (25)
and the optimal single control law is:
u∗(xk)=arg maxukΨ∗(xk). | (26) |
Comparing Eqs (26) and (14), the problem of solving the optimal performance index function is transformed into the problem of maximizing the return function. Further, combined with Eq (18), we can try to use a wider range of reinforcement learning algorithms to solve the optimization problem of nonlinear systems.
For the convenience of analysis, the stability proof is made on the basis of the following assumptions 1 and 2.
Assumption 1: The discrete NCS represented by Eq (6) is controllable, and the system state xk=0 is an equilibrium state of the discrete NCS under the condition of uk=0, i.e., F(0,0)=0.
Assumption 2: The feedback control uk=u(xk) satisfies uk=u(xk)=0 for xk=0, ∀xi∈Ωi⊂Rn, u(xk) is continuous and u(xk) stabilizes the discrete NCS on Ωi, J(xi) is finite.
Assumption 1 ensures that the object we use the deep reinforcement learning method to solve is a controllable NCS, and determines that the equilibrium point of the system is xk=0, and the output of the system function F(⋅) is 0. Assumption 2 ensures that the control law is admissible and the output is 0 at the equilibrium point, so as to meet the requirements of a controllable NCS. In the process of control sequence, the performance index function J(xi) is a bounded function. In this way, we have made restrictive assumptions about the system object and control law.
According to the definition of e(xk,uk), e(xk,uk)≥0, then R(xk,uk) is a positive definite function for any xk and uk. Let a0(xk) be an arbitrary admissible control law. According to Q-learning [43], for i=0, let Q0(xk,uk) be the initial iterative Q function constructed by a0(xk), i.e.,
Q0(xk,a0(xk))=∑∞j=0R(xk+j,a0(xk+j)). | (27) |
Thus, initial iterative Q function satisfies the following optimality equation:
Q0(xk,uk)=R(xk,uk)+Q0(xk+1,a0(xk+1)). | (28) |
Then, we get the iterative control law as follows:
a1(xk)=arg maxukQ0(xk,uk). | (29) |
Let Qi(xk,uk) be the iterative Q function constructed by ai(xk) for i=1,2,..., then we can get the following generalized optimality equation:
Qi(xk,uk)=R(xk,uk)+Qi(xk+1,ai(xk+1)). | (30) |
Therefore, the iterative control law is updated by:
ai+1(xk)=arg maxukQi(xk,uk). | (31) |
Because the reward function R(xk,uk) is a positive definite function of xk and uk, and under Assumption 1 and 2, the iterative function Qi(xk,uk),i=0,1,..., is positive definite for xk and uk. According to Eq (30), substitute ai(xk) for uk, and let Vi(xk)=Qi(xk,ai(xk)) for i=0,1,..., we can get:
$V_i(x_{k+1})-V_i(x_k)=Q_i(x_{k+1},a_i(x_{k+1}))-Q_i(x_k,a_i(x_k))=-R(x_k,a_i(x_k))<0$.  (32)
For i=0,1,..., the function Vi(xk) is positive definite for xk, so Vi(xk) is a Lyapunov function. Thus, ai(xk) is a stable control law.
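To make the iteration of Eqs (30)-(31) concrete, the following is a tabular sketch on a toy discretized system: the Q function is evaluated under the current law $a_i$, then the law is improved greedily. The grids, dynamics, and reward are illustrative stand-ins, and a discount factor is added to keep the toy iteration bounded, which is a deviation from the undiscounted statement above.

```python
import numpy as np

X = np.linspace(-1.0, 1.0, 21)
U = np.linspace(-0.5, 0.5, 5)

def F(x, u):                        # toy controllable dynamics with F(0,0)=0
    return np.clip(0.8 * x + u, -1.0, 1.0)

def R(x, u):                        # bounded reward in [0, 1], larger near the origin
    return np.exp(-(x**2 + u**2))

def nearest(x):                     # map a continuous state back onto the grid
    return int(np.abs(X - x).argmin())

policy = np.zeros(len(X), dtype=int)           # a_0(x): index into U
for _ in range(50):                            # outer policy-improvement loop
    Q = np.zeros((len(X), len(U)))
    for _ in range(200):                       # evaluate Q_i under a_i, Eq (30)
        Q_new = np.empty_like(Q)
        for i, x in enumerate(X):
            for j, u in enumerate(U):
                nxt = nearest(F(x, u))
                # discount 0.95 keeps this toy, always-positive-reward iteration bounded
                Q_new[i, j] = R(x, u) + 0.95 * Q[nxt, policy[nxt]]
        Q = Q_new
    policy = Q.argmax(axis=1)                  # greedy improvement, Eq (31)
```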
As discussed above, it is difficult to give a specific expression of the control objective through the utility function for an actual control system. With the help of the DRM, we can instead construct the reward function R(xk,uk) to replace the utility function. The key step in solving the nonlinear optimization problem is therefore transformed into constructing a reinforcement learning dynamic environment and designing the reward function R(xk,uk). According to Eqs (15) and (16), determining the evaluation functions is one of the important tasks of the DRM. Ng et al. [44] proposed reward shaping theory: if a potential-based function is added to the reward function, the optimal policy remains unchanged, so the reward can be modified without affecting the Markov decision process. Reward shaping has been used in multi-agent reinforcement learning and is effective in collaborative-competitive robot operations and multi-objective tasks [45,46,47,48,49]. This approach gives us some inspiration, especially regarding the construction of the potential function and how to avoid reward conflicts.
Using the mathematical models above, the problem of obtaining a control method for a nonlinear system can be transformed into the problem of training the control strategies of one or more agents, establishing a value function, and maximizing the reward. In this section, taking the inverted pendulum system as the experimental object, we analyze its characteristics, establish the dynamic environment, design the reward function according to different control objectives, and then conduct experiments with three deep reinforcement learning algorithm models. The experimental results are discussed at the end.
Figure 3 is a schematic diagram of the structure of the inverted pendulum system. Our ultimate goal is to control the force exerted to the cart and make it move left or right to maintain the balance of the single pole mounted on the cart. The typical parameters of inverted pendulum-cart system setup are selected as: mass of the cart (mc): 1.34 kg, mass of the pendulum (mp): 0.09 kg, length of the swing pole (l): 0.40 m, length of the cart rail (lr): 0.40 m, friction coefficient of the cart & pole rotation is assumed negligible. The acceleration due to gravity g = 9.81 m/s2.
In [50], the deep reinforcement learning algorithm model is used to realize the balance of the cart-pole in the inverted pendulum system, but the relevant control theory in the NCS is not involved, and the design of the reward function is not introduced in detail. The system function of the inverted pendulum system model in this paper is expressed as follows [51,52]:
$\ddot{\theta}=\dfrac{g\,m\sin\theta-\cos\theta\,(F+m_p l\dot{\theta}^{2}\sin\theta)}{l\,(4m/3-m_p\cos^{2}\theta)}$,  (33)
$\ddot{\chi}=\dfrac{F+m_p l(\dot{\theta}^{2}\sin\theta-\ddot{\theta}\cos\theta)}{m}$,  (34)
where m=mc+mp, ¨χ is the acceleration of the cart along the rail, and ¨θ is the angular acceleration of the swing pole. The nonlinear equations (33) and (34) of the inverted pendulum dynamic system represent a control-input-affine system. These two nonlinear equations can be written in the following standard state-space form:
˙x(t)=f(x(t))+g(x(t))u(t). | (35) |
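A minimal Euler-discretized simulator of the dynamics in Eqs (33)-(34) is sketched below, written in the standard cart-pole form to which these equations correspond. The physical parameters follow the values given above; the time step dt and the use of the pole length l as stated (rather than a half length) are assumptions of this sketch.

```python
import numpy as np

MC, MP, L, G = 1.34, 0.09, 0.40, 9.81   # cart mass, pole mass, pole length, gravity
M = MC + MP

def step(state, force, dt=0.02):
    """state = (chi, chi_dot, theta, theta_dot); returns the next state via Eq (6)."""
    chi, chi_dot, theta, theta_dot = state
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    # Eq (33): angular acceleration of the pole
    theta_acc = (G * M * sin_t - cos_t * (force + MP * L * theta_dot**2 * sin_t)) \
                / (L * (4.0 * M / 3.0 - MP * cos_t**2))
    # Eq (34): linear acceleration of the cart
    chi_acc = (force + MP * L * (theta_dot**2 * sin_t - theta_acc * cos_t)) / M
    # forward-Euler integration
    return (chi + dt * chi_dot,
            chi_dot + dt * chi_acc,
            theta + dt * theta_dot,
            theta_dot + dt * theta_acc)
```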
For Eq (35), let $x(t)=(\chi,\dot{\chi},\theta,\dot{\theta})^{T}$, let $\theta_{\max}$ be the maximum deviation of the swing pole from the balance angle, let $\chi_{\max}$ be the maximum deviation of the cart from the origin of the guide rail, $\dot{\chi}_{\max}$ the maximum speed of the cart, and $\dot{\theta}_{\max}$ the maximum rotation speed of the swing pole. The system state space and the corresponding evaluation function are defined as follows:
$\Omega_x=\Big\{x \,\Big|\, -\tfrac{l_r}{2}\le\chi\le\tfrac{l_r}{2},\; -\dot{\chi}_{\max}\le\dot{\chi}\le\dot{\chi}_{\max},\; -\theta_{\max}\le\theta\le\theta_{\max},\; -\dot{\theta}_{\max}\le\dot{\theta}\le\dot{\theta}_{\max}\Big\}$,  (36)
$e_0(x,u)=\begin{cases}1, & x\in\Omega_x\\ 0, & x\notin\Omega_x.\end{cases}$  (37)
The output of the evaluation function e0 lies in [0, 1]. In order to describe the expected position of the cart on the rail and guide the cart toward the equilibrium position of the rail with greater probability, we construct the following evaluation function:
$e_1(x,u)=1-\dfrac{2\,\mathrm{sgn}(\chi)\,\chi}{l_r}$.  (38)
From the perspective of control, the cart "has to" drive the swing pole into a better state space at some positions of the rail. In this way, we can establish an evaluation function to better guide the system into the equilibrium state. If it is said that the desired balance angle is zero in the equilibrium state, then at different positions of the rail, we assign position-related dynamic balance angles to guide the movement of the cart. The position-related balance angle is given by:
$\theta_0=-\dfrac{2\chi\,\theta_{\max}}{l_r}$.  (39)
Then the evaluation function is established as:
$e_2(x,u)=\begin{cases}1-\dfrac{|\theta-\theta_0|}{\theta_{\max}}, & |\theta-\theta_0|\le\theta_{\max}\\ 0, & |\theta-\theta_0|>\theta_{\max} \text{ or } \theta>\theta_{\max} \text{ or } \theta<-\theta_{\max}.\end{cases}$  (40)
Notice that the output range of e2(⋅) is [0, 1], and θmax is the maximum allowable deflection angle of the swing pole.
In the balancing process of inverted pendulum system, not only the system is required to reach the equilibrium state in a limited time, but also the energy consumed should be considered. It is expected that the energy consumed will reach the minimum in the process of state transition under two adjacent time steps. Therefore, we make an evaluation function as follows:
e3(x,u)=η(x,u)φ(x), | (41) |
where, η(x,u) is the power consumption function of the system under the action of control output u in x state; φ(x) is the coefficient function in x state. Without considering the external disturbance of the system such as air resistance, the force power consumption function can be expressed by the following formula:
$\eta(x_k,u_k)=W\big(x_{k+1}^{u=F}\big)+\varsigma\big(x_{k+1}^{u=F}\big)-W\big(x_{k+1}^{u=0}\big)-\varsigma\big(x_{k+1}^{u=0}\big)$,  (42)
here, $W(x_{k+1}^{u=0})$ is defined as the energy loss after the system transitions to the next state when no control action is output in state $x_k$, and $\varsigma(x_{k+1}^{u=0})$ is the energy loss caused by system friction in the same case of no action output. $W(x_{k+1}^{u=F})$ is defined as the energy loss after the system transitions to the next state under the control output $u=F$ in state $x_k$, and $\varsigma(x_{k+1}^{u=F})$ is the corresponding energy loss caused by system friction.
In the system represented by Eqs (33) and (34), friction is not considered, so $\varsigma(x_{k+1}^{u=0})=\varsigma(x_{k+1}^{u=F})=0$. Therefore, in state $x_k$, the system energy is:
W(xk)=Tc(xk)+Tp(xk)+Vp(xk), | (43) |
where Tc(xk) is the kinetic energy of the cart, Tp(xk) is the kinetic energy of the swing pole, and Vp(xk) is the potential energy of the swing pole. According to the inverted pendulum model, we can get:
$T_c(x_k)=\tfrac{1}{2}m_c\dot{\chi}^{2}_{x_k}$,  (44)
$T_p(x_k)=\tfrac{1}{2}m_p\Big[\Big(\tfrac{d(\chi_{x_k}+\frac{l}{2}\sin\theta_{x_k})}{dt}\Big)^{2}+\Big(\tfrac{d(\frac{l}{2}\cos\theta_{x_k})}{dt}\Big)^{2}\Big]+\tfrac{1}{6}m_p l^{2}\dot{\theta}^{2}_{x_k}=\tfrac{1}{2}m_p\dot{\chi}^{2}_{x_k}+\tfrac{1}{2}m_p l\,\dot{\chi}_{x_k}\dot{\theta}_{x_k}\cos\theta_{x_k}+\tfrac{7}{24}m_p l^{2}\dot{\theta}^{2}_{x_k}$,  (45)
$V_p(x_k)=\tfrac{1}{2}m_p g l\cos\theta_{x_k}$,  (46)
where θxk is the swing pole angle in state xk. According to the above formulas, calculate W(xu=0k+1) and W(xu=Fk+1) respectively, and then calculate η(xk,uk). Note the coefficient function φ(xk), which reflects the user's specific allocation method of energy in combination with the current state. For example, it is required to minimize the single-step energy, or establish the minimum theoretical value of energy according to the current state, and then require the single-step energy to approach the value as much as possible, and ensure that the final evaluation function is between 0 and 1.
Note that in the control process of the system, e0(x,u) gives the maximum value of evaluation 1, e1(x,u) is the evaluation function of cart displacement χ in the state vector, e2(x,u) is the evaluation function of χ and θ in the state vector, and e3(x,u) is the evaluation function of θ, ˙χ and ˙θ in the state vector. The final evaluation effect is expressed as:
R(xk,uk)=∏3i=0ei(xk,uk). | (47) |
R(xk,uk) is a nonlinear description that acts as the utility function, which differs from the description in Eq (8). In other words, R(xk,uk) reflects the user's specific control needs in a more fine-grained way. Notably, users can add or modify specific control requirements via the evaluation functions, and the modification of one evaluation function does not affect the roles of the others. In the dynamic environment, the reward function behaves like a panel of "referees", where the referees are the evaluation functions. Each referee gives a score between 0 and 1, and the reward function then aggregates the referees' scores by multiplication. Each referee scores from its own viewing angle, and no two referees share the same viewing angle; these viewing angles are the observation perspectives shown in Figure 2. As mentioned above, no evaluation function may violate the control laws of the system, which means the referees are required to guide and encourage the "performer": the performer (i.e., the control law) that follows the guidance of one referee's high score finds it easier to achieve the final control effect. This final effect is the best one, close to rmax, recognized by every referee.
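The detail-reward of Eq (47) built from Eqs (36)-(40) can be sketched as follows. The energy term e3 of Eq (41) is omitted here for brevity, and the velocity limits are assumed values; the rail length and maximum deflection angle follow the experimental setup described later.

```python
import numpy as np

LR = 0.40                  # rail length l_r (m)
THETA_MAX = 0.174          # maximum pole deflection (rad)
CHI_DOT_MAX, THETA_DOT_MAX = 2.0, 2.0   # assumed velocity limits

def e0(x):
    """Eq (37): 1 inside the admissible state space, 0 outside."""
    chi, chi_dot, theta, theta_dot = x
    inside = (abs(chi) <= LR / 2 and abs(chi_dot) <= CHI_DOT_MAX and
              abs(theta) <= THETA_MAX and abs(theta_dot) <= THETA_DOT_MAX)
    return 1.0 if inside else 0.0

def e1(x):
    """Eq (38): prefer cart positions near the middle of the rail."""
    chi = x[0]
    return 1.0 - np.sign(chi) * 2.0 * chi / LR

def e2(x):
    """Eqs (39)-(40): prefer pole angles near the position-dependent balance angle."""
    chi, theta = x[0], x[2]
    theta0 = -2.0 * chi * THETA_MAX / LR
    if abs(theta - theta0) > THETA_MAX or abs(theta) > THETA_MAX:
        return 0.0
    return 1.0 - abs(theta - theta0) / THETA_MAX

def detail_reward(x, u=None):
    """Eq (47): product of the evaluation functions (e3 omitted in this sketch)."""
    return e0(x) * e1(x) * e2(x)
```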
In this paper, R(⋅) is regarded as the reward value of the dynamic environment, and deep reinforcement learning algorithms are designed to interact with the dynamic environment from three perspectives: state-value, policy, and actor-critic. During the interaction, the agent's execution policy is optimized through the reinforcement learning algorithm to maximize the reward. The value-based deep reinforcement learning algorithm is exemplified by the Deep Q-Network (DQN), the policy-based algorithm by the policy gradient algorithm, and the actor-critic framework by the Deep Deterministic Policy Gradient (DDPG). The three deep reinforcement learning models are analyzed and described below.
DQN is a value-based iteration reinforcement learning algorithm [53]. When the state and action space generated during the interaction between the agent and the environment is discrete and the dimension is not high, the Q-Learning method can be used to establish the state-action value Q table. Update the Q-values in the table according to the Bellman optimality equation. In the inverted pendulum system, the states of the cart and the swing pole are high-dimensional continuous. Using the Q-Learning method to build a table is very difficult, and it often causes the "curse of dimensionality". When the states and actions are high-dimensional and continuous, the construction table can be transformed into a function fitting problem, and the Q value is generated by fitting the function instead of the table, so that similar states can obtain similar output actions. In the function fitting of high-dimensional input, the advantage of deep neural network for complex feature extraction can be used to map the input to the output through neural network. Therefore, the essence of the DQN method is to combine the deep neural network and Q-Learning, map the state action value Q through the network structure, and update the agent action strategy according to the Bellman optimal equation. Algorithm 1 shows the DQN algorithm flow.
[Algorithm 1: DQN algorithm flow (pseudocode figure)]
In Algorithm 1, $s_t=\{\chi_t,\theta_t,\dot{\chi}_t,\dot{\theta}_t\}$ is the state of the inverted pendulum system at time t. Here, χt is the displacement of the cart, θt is the angle of the swing pole, ˙χt is the speed of the cart, and ˙θt is the angular speed of the swing pole. rt is the reward value obtained from the interaction with the dynamic environment at time t.
DQN records the tuple {st,at,rt+1,st+1} generated by the agent in each state and stores it in the experience replay. The deep neural network randomly selects the tuple in the experience replay as the training label to update the network parameters. Randomly selecting previous experience in the experience replay will make the network more efficient, solving the problem of correlation and non-static distribution to a certain extent [54].
DQN adopts two network structures with the same structure but different parameters, namely action-value network (Q-network) and target action-value network (target Q-network). When the action-value network is used to approximate the Q value, the update of the Q value is prone to oscillation, resulting in unstable learning behavior of the agent. The target action-value network reduces the correlation between the predicted Q value and the estimated Q value to a certain extent to improve the stability of the algorithm. In the network parameter update process, the parameters used by the Q-network are the latest parameters of each update, and the Q-network parameters are copied to the target Q-network after multiple iterations.
The agent maps the estimated Q value through Q-network at st, uses the epsilon-greedy [55] strategy to select the Q value under the corresponding action, takes the action at corresponding to the Q value to interact with the environment, and records the generated tuple {st,at,rt+1,st+1} and stores it in the experience replay. The target Q-network maps st+1 as input to the predicted Q value, selects the sum of the maximum predicted Q value and the immediate reward rt+1 as the TD target[56], uses the mean square error between the TD target and the Q value as a loss function for updating network parameters. The loss function between target action-value function and action-value function in DQN is as follows:
$\mathrm{Target}Q = r_{t+1}+\gamma \max_{a'} \hat{Q}_{w^{-}}(s_{t+1},a')$,  (48)
$L(\theta)=\mathbb{E}_{\{s_t,a_t,r_{t+1},s_{t+1}\}\in N}\big[(\mathrm{Target}Q-Q_{w}(s_t,a_t))^2\big]$.  (49)
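A condensed PyTorch sketch of the update in Eqs (48)-(49) is given below: a Q-network, a target Q-network, an experience replay, and the TD-target regression. Network sizes and hyper-parameters follow Tables 2 and 3 where given; the rest (ε-greedy action selection, terminal-state masking, and periodic copying of the Q-network into the target network) are omitted or assumed.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 25))
target_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                           nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 25))
target_net.load_state_dict(q_net.state_dict())   # periodically re-synchronized in training
optimizer = torch.optim.Adam(q_net.parameters(), lr=2e-4)
replay = deque(maxlen=2000)                       # experience replay memory
GAMMA = 0.95

def dqn_update(batch_size=32):
    """One gradient step on the loss of Eq (49) using a random replay mini-batch."""
    if len(replay) < batch_size:
        return
    s, a, r, s2 = zip(*random.sample(replay, batch_size))
    s  = torch.as_tensor(np.array(s),  dtype=torch.float32)
    a  = torch.as_tensor(a,  dtype=torch.int64)
    r  = torch.as_tensor(r,  dtype=torch.float32)
    s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)

    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q_w(s_t, a_t)
    with torch.no_grad():                                      # Eq (48): TD target
        target = r + GAMMA * target_net(s2).max(1).values
    loss = nn.functional.mse_loss(q, target)                   # Eq (49)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```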
In the DQN series of reinforcement learning algorithms, we mainly approximate the value function, learn the action value function based on the value, and then adopt a similar greedy strategy to select the agent output action according to the estimated action value function. But the DQN series of reinforcement learning algorithms have the following problems: 1) Algorithms based on value functions can only process discrete actions, but cannot process continuous actions output by the agent. 2) The algorithm based on the value function cannot solve the stochastic policy problem. The optimal strategy corresponding to the DQN series of methods is a deterministic strategy, and the action corresponding to the maximum value is taken as the optimal strategy. Aiming at the shortcomings of value-based reinforcement learning algorithms, a policy-based reinforcement learning algorithm, policy gradient, is introduced [57].
The following is the state value function gradient formula in Monte Carlo sampling, where Pr(s→x,k,π) is the probability of transitioning from state s to state x in k steps under policy π:
$\begin{aligned}\nabla v_{\pi}(s)&=\nabla\Big[\sum_{a}\pi(a|s)q_{\pi}(s,a)\Big]\\&=\sum_{a}\Big[\nabla\pi(a|s)q_{\pi}(s,a)+\pi(a|s)\nabla q_{\pi}(s,a)\Big]\\&=\sum_{a}\Big[\nabla\pi(a|s)q_{\pi}(s,a)+\pi(a|s)\nabla\sum_{s',r}p(s',r|s,a)\big(r+v_{\pi}(s')\big)\Big]\\&=\sum_{a}\Big[\nabla\pi(a|s)q_{\pi}(s,a)+\pi(a|s)\sum_{s'}p(s'|s,a)\nabla v_{\pi}(s')\Big]\\&=\sum_{x\in S}\sum_{k=0}^{\infty}\Pr(s\to x,k,\pi)\sum_{a}\nabla\pi(a|x)q_{\pi}(x,a).\end{aligned}$  (50)
∇vπ(s) in Eq (50) is converted as follows:
$\begin{aligned}\nabla J(w)&=\nabla v_{\pi}(s_0)\\&=\sum_{s}\sum_{k=0}^{\infty}\Pr(s_0\to s,k,\pi)\sum_{a}\nabla\pi(a|s)q_{\pi}(s,a)\\&=\sum_{s}\eta(s)\sum_{a}\nabla\pi(a|s)q_{\pi}(s,a)\\&=\sum_{s'}\eta(s')\sum_{s}\frac{\eta(s)}{\sum_{s'}\eta(s')}\sum_{a}\nabla\pi(a|s)q_{\pi}(s,a)\\&=\sum_{s'}\eta(s')\sum_{s}\mu(s)\sum_{a}\nabla\pi(a|s)q_{\pi}(s,a)\\&\propto\sum_{s}\mu(s)\sum_{a}\nabla\pi(a|s)q_{\pi}(s,a)\\&=\mathbb{E}_{\pi}\Big[\sum_{a}q_{\pi}(S_t,a)\nabla\pi(a|S_t,w)\Big].\end{aligned}$  (51)
∇J(w) is the gradient of the policy parameters, μ(s) is the policy distribution under the policy π, and the learning goal of the policy parameters is to maximize the value of J(w), so the update of the policy parameters is similar to solving the gradient of J(w), which is:
$\begin{aligned}w_{t+1}&=w_{t}+\alpha\nabla J(w_{t})\\&=w_{t}+\alpha\mathbb{E}_{\pi}\Big[\sum_{a}q_{\pi}(S_t,a)\nabla\pi(a|S_t,w)\Big]\\&=w_{t}+\alpha\sum_{a}\hat{q}(S_t,a,w')\nabla\pi(a|S_t,w).\end{aligned}$  (52)
In the above equation, $w'$ and $w$ are the parameters of the action-value network and the policy network, respectively. In the Monte Carlo policy gradient, the sampled action $A_t$ under the policy π is used to replace $a$. Therefore, ∇J(w) can be written as:
$\begin{aligned}\nabla J(w)&=\mathbb{E}_{\pi}\Big[\sum_{a}q_{\pi}(S_t,a)\nabla\pi(a|S_t,w)\Big]\\&=\mathbb{E}_{\pi}\Big[\sum_{a}\pi(a|S_t,w)q_{\pi}(S_t,a)\frac{\nabla\pi(a|S_t,w)}{\pi(a|S_t,w)}\Big]\\&=\mathbb{E}_{\pi}\Big[q_{\pi}(S_t,A_t)\frac{\nabla\pi(A_t|S_t,w)}{\pi(A_t|S_t,w)}\Big]\\&=\mathbb{E}_{\pi}\Big[G_t\frac{\nabla\pi(A_t|S_t,w)}{\pi(A_t|S_t,w)}\Big].\end{aligned}$  (53)
$G_t$ is the return obtained from state $S_t$; the quantity $G_t\frac{\nabla\pi(A_t|S_t,w)}{\pi(A_t|S_t,w)}$ can be computed by sampling, and its expected value equals the gradient. Substituting this gradient into the update $w_{t+1}=w_t+\alpha\nabla J(w_t)$, we get:
$w_{t+1}\doteq w_{t}+\alpha G_t\dfrac{\nabla\pi(A_t|S_t,w_t)}{\pi(A_t|S_t,w_t)}=w_{t}+\alpha G_t\nabla\ln\pi(A_t|S_t,w_t)$.  (54)
Algorithm 2 shows the process of the Monte Carlo policy gradient (REINFORCE) algorithm. In Algorithm 2, $s=\{\chi,\theta,\dot{\chi},\dot{\theta}\}$ represents the state of the inverted pendulum system, and r represents the immediate reward obtained on a state transition, which is equivalent to R(xk,uk).
[Algorithm 2: Monte Carlo policy gradient (REINFORCE) algorithm flow (pseudocode figure)]
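A condensed PyTorch sketch of the Monte Carlo policy gradient update of Eq (54) is shown below: after one episode, the policy parameters are moved along $G_t\nabla\ln\pi(A_t|S_t,w)$. The network size and learning rate follow Tables 4 and 5; the baseline-free form and the episodic data layout are assumptions of this sketch.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                       nn.Linear(128, 128), nn.ReLU(),
                       nn.Linear(128, 3), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
GAMMA = 0.99

def reinforce_update(states, actions, rewards):
    """states: list of 4-element lists, actions: list of ints, rewards: list of floats."""
    returns, g = [], 0.0
    for r in reversed(rewards):              # returns G_t computed backwards, Eq (18)
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()

    s = torch.as_tensor(states, dtype=torch.float32)
    a = torch.as_tensor(actions)
    g = torch.as_tensor(returns, dtype=torch.float32)
    log_pi = torch.distributions.Categorical(probs=policy(s)).log_prob(a)
    loss = -(g * log_pi).sum()               # gradient ascent on Eq (54) via a negated loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```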
According to the output action characteristics of the agent in reinforcement learning, strategies can be divided into deterministic strategies and random strategies. Deterministic policy means that the action of the agent output through the policy in a certain state is deterministic. The random strategy refers to the probability that the agent outputs multiple actions through the strategy in a certain state, and the agent chooses to execute the action according to the probability. In the DQN series of methods, the greedy strategy is used to select the action corresponding to the maximum Q value, which belongs to the greedy deterministic strategy. When the action set output by the agent is continuous-valued or discrete-valued with very high dimensionality, we use a stochastic strategy to study the probability of action output, which increases the amount of computation. The Deep Deterministic Policy Gradient (DDPG) algorithm is often used to solve the action selection problem in a high-dimensional continuous state space [58]. The algorithm is tuned on the basis of the Actor-Critic framework, making full use of DQN and policy gradients. The DDPG algorithm uses a training network and a target network in the design of the network structure. Through the target network, the correlation between the predicted value and the estimated value can be reduced to a certain extent, and the stability of the algorithm can be improved. Training the network with samples from the experience replay minimizes the correlation between samples.
The DDPG algorithm mainly includes the following parts: ① The network includes two parts, Actor and Critic, in which Actor and Critic are composed of training network and target network respectively. ② Introduce the experience replay to store the data of the interaction between the agent and the environment, and use soft update to update the parameters of the target network to ensure the stability of the training network [59]. ③ Add random noise to the Actor network output to ensure that the agent has a certain exploration ability when selecting actions [60]. Algorithm 3 shows the process of the DDPG algorithm.
[Algorithm 3: DDPG algorithm flow (pseudocode figure)]
In algorithm 3, N is random noise, μ(s|wμ) is the Actor network, Q(s,a|wQ) is the Critic network, μ′(s|wμ′) is the target Actor network and Q′(s,a|wQ′) is the target critic network. The agent obtains the execution action at through the network μ at st, and interacts with the environment after adding noise to the action to generate an immediate reward rt+1, and the state is updated to st+1. The agent stores the tuple {st,at,rt+1,st+1} in the experience replay. When the experience replay data reaches a certain amount, it samples data from the experience replay to train the network Q. Q-Network needs to evaluate the actions made by μ. Therefore, Q-Network uses the deterministic policy gradient method to optimize the network parameters of μ, and the network parameter replication process adopts a soft update method. The agent keeps repeating the above process until the policy network converges.
The target action value function in the DDPG algorithm can be expressed as:
$y_i=r_{i+1}+\gamma Q'\big(s_{i+1},\mu'(s_{i+1}\mid w^{\mu'})\mid w^{Q'}\big)$.  (55)
The loss function of the Critic network is:
$L(w)=\dfrac{1}{N}\sum_{i}\big(y_i-Q(s_i,a_i\mid w^{Q})\big)^{2}$.  (56)
Using the sampled deterministic policy gradient, the update of the policy network parameters can be expressed as:
$\nabla_{w^{\mu}}J\approx\dfrac{1}{N}\sum_{i}\nabla_{a}Q(s,a\mid w^{Q})\big|_{s=s_i,\,a=\mu(s_i)}\,\nabla_{w^{\mu}}\mu(s\mid w^{\mu})\big|_{s=s_i}$.  (57)
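The following is a condensed PyTorch sketch of one DDPG update per Eqs (55)-(57): the critic is regressed onto the target value $y_i$, the actor follows the deterministic policy gradient, and the target networks are soft-updated. Network sizes, τ, γ, and the learning rates are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

actor    = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 1), nn.Tanh())
critic   = nn.Sequential(nn.Linear(5, 128), nn.ReLU(), nn.Linear(128, 1))
actor_t  = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 1), nn.Tanh())
critic_t = nn.Sequential(nn.Linear(5, 128), nn.ReLU(), nn.Linear(128, 1))
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
GAMMA, TAU = 0.99, 0.005

def ddpg_update(s, a, r, s2):
    """s, s2: (N,4) tensors; a: (N,1) tensor; r: (N,) tensor."""
    with torch.no_grad():                                            # Eq (55): target y_i
        y = r.unsqueeze(1) + GAMMA * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)   # Eq (56)
    opt_c.zero_grad()
    critic_loss.backward()
    opt_c.step()

    # Eq (57): deterministic policy gradient, expressed as a loss to minimize
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    opt_a.zero_grad()
    actor_loss.backward()
    opt_a.step()

    for net, tgt in ((actor, actor_t), (critic, critic_t)):          # soft target update
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```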
This section describes and analyzes the experimental results. We use three reinforcement learning algorithms to train the inverted pendulum control model, and analyze the differences between the different algorithms based on their training data and test data (Section 4.4.1). Finally, we discuss the current research deficiencies and propose future research directions (Section 4.4.2).
This section compares the experimental results of three deep reinforcement learning algorithms (DQN, policy-gradient, DDPG) in the dynamic environment of the inverted pendulum system, including the impact of the DRM settings on the experimental accuracy, and the differences between different algorithms. The experiment is divided into two parts: ⅰ) The influence of the setting of the detail-reward function under three deep reinforcement learning algorithms on the experimental accuracy in the training and testing phase. ⅱ) Differences between three deep reinforcement learning algorithms under the same detail-reward function. In the dynamic environment of the inverted pendulum system, the initial states of the inverted pendulum system in the training and testing phase are shown in Table 1.
| Experimental phase | Cart position (m) | Cart speed (m/s) | Swing pole angle (rad) | Swing pole speed (rad/s) |
| --- | --- | --- | --- | --- |
| Training phase | (−0.2, 0.2) | (−0.05, 0.05) | (−0.174, 0.174) | (−0.05, 0.05) |
| Testing phase | −0.12 | 0.2 | 0.1 | 0.2 |
*Note: In the training phase of the inverted pendulum system, the initial state of the cart and swing pole is a random number within the corresponding parameter range. In the experiments, the units of the cart position and the swing pole angle are the same as those in the table.
1) Experimental comparison of reward functions with different details under the same algorithm
Tables 2 and 3 show the reinforcement learning parameters and network parameters of the DQN algorithm in the inverted pendulum system.
| Episodes | Steps | Discount factor | Epsilon (ε) | Memory |
| --- | --- | --- | --- | --- |
| 2000 | 300 | 0.95 | 0.9 | 2000 |
| Network | Network structure | Learning rate | Loss function | Activation function | Batch size |
| --- | --- | --- | --- | --- | --- |
| Action-value Q network | (4, 128) (128, 128) (128, 25) | 0.0002 | MSE | ReLU | 32 |
In Table 3, we set the number of outputs of the action-value Q network to 25. As shown in reference [61], the richer the choice of the control vector u(k), the higher the controllability, but if the action set is too large, the convergence of the network model during training deteriorates. Therefore, a reasonable choice of the number of action values output by the Q network can improve the controllability of the inverted pendulum system without degrading the training.
We first describe the agent training rules used in the following experiments. An episode is one round of agent training, and in each episode the agent may learn at most 300 times. The admissible range of the cart position is [−0.2, 0.2] and that of the swing pole angle is [−0.174, 0.174]. As long as the cart and the swing pole remain within these ranges, the episode continues: the step counter is incremented by 1 and the instant reward is added to the cumulative reward; otherwise, the learning round is terminated. The total steps and cumulative reward of each episode are recorded. A sketch of these rules is given after this paragraph.
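The sketch below illustrates these training rules, assuming a simplified environment interface (reset/step) and an agent with act/learn methods that are not specified in the paper.

```python
def run_episode(env, agent, max_steps=300, cart_limit=0.2, angle_limit=0.174):
    """One training episode: learn while the cart and pole stay inside the set ranges."""
    state = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = agent.act(state)
        next_state, reward = env.step(action)        # assumed simplified interface
        cart_pos, _, pole_angle, _ = next_state
        # Exit the episode as soon as the cart or pole leaves the admissible region.
        if abs(cart_pos) > cart_limit or abs(pole_angle) > angle_limit:
            break
        agent.learn(state, action, reward, next_state)
        steps += 1
        total_reward += reward                        # accumulate the instant reward
        state = next_state
    return steps, total_reward                        # recorded for each episode
```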
Figure 4 shows the training and testing process of the inverted pendulum system using the DQN algorithm in a dynamic environment. Figure 4(a) shows the number of admissible control actions output by the agent in each episode of training, and Figure 4(b) shows the cumulative reward obtained by the agent in each episode. Figure 4(c), (d) show the results in the testing phase: Figure 4(c) shows the cart position at each step and Figure 4(d) shows the swing pole angle at each step. In the inverted pendulum system, the cart should be kept near the midpoint of the rail (χ = 0) and the swing pole angle near zero (θ = 0).
As can be seen from Figure 4, the setting of the DRM affects the reinforcement learning performance on the inverted pendulum. In Figure 4(a), (b), when a single-reward function (i.e., R = e0) is used and the episode count is below 400, the number of times the rule is met is less than 50 and the per-episode reward is less than 50. Beyond 400 episodes the reward gradually increases, but the reward curve fluctuates strongly, the generalization of the trained model is weak, and the agent does not learn useful experience well. When the detail-reward function (i.e., R = ∏ ei) is used, the number of times the rule is met grows with the number of episodes. In Figure 4(b), after about 200 episodes the training process tends to converge, the reward approaches 250, and the reward curve is relatively flat.
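To make the two settings concrete, the sketch below contrasts a single-reward function R = e0 with a detail-reward function R = ∏ ei; the particular evaluation functions e1 and e2 are illustrative assumptions, since the exact definitions used in the experiments are given earlier in the paper and not repeated here.

```python
import numpy as np


def e0(state, cart_limit=0.2, angle_limit=0.174):
    """Binary evaluation: 1 while the system stays inside the admissible region, else 0."""
    cart_pos, _, pole_angle, _ = state
    return float(abs(cart_pos) <= cart_limit and abs(pole_angle) <= angle_limit)


def single_reward(state):
    """Single-reward function R = e0."""
    return e0(state)


def detail_reward(state, cart_limit=0.2, angle_limit=0.174):
    """Detail-reward function R = prod_i e_i: product of per-detail evaluations in [0, 1]."""
    cart_pos, _, pole_angle, _ = state
    e1 = np.exp(-abs(cart_pos) / cart_limit)     # assumed: rewards the cart staying near x = 0
    e2 = np.exp(-abs(pole_angle) / angle_limit)  # assumed: rewards the pole staying near theta = 0
    return e0(state) * e1 * e2
```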
In Figure 4(c), (d), when the single-reward function is used, at step 15 the cart position is 0.01 and the swing pole angle is 0.33 rad, which exceeds the admissible range, i.e., the agent fails to control the system. With the detail-reward function, after 40 steps the cart position and the swing pole angle are both close to zero, and the position and angle curves are relatively smooth.
The above analysis of the step, reward, cart position, and swing pole angle curves in Figure 4 shows that, when the DQN algorithm is used to train an agent in the dynamic environment of the inverted pendulum system, the detail-reward function yields better experimental results than the single-reward function.
Tables 4 and 5 show the reinforcement learning parameters and network parameters of the policy gradient algorithm in the inverted pendulum system. Figure 5 shows the training and testing process of an inverted pendulum system using the policy gradient algorithm in a dynamic environment.
Table 4. Reinforcement learning parameters of the policy gradient algorithm.

| Episodes | Steps | Discount factor |
| --- | --- | --- |
| 15,000 | 300 | 0.99 |

Table 5. Network parameters of the policy gradient algorithm.

| Network | Network structure | Learning rate | Loss function | Optimizer |
| --- | --- | --- | --- | --- |
| Policy network | (4, 128), (128, 128), (128, 3) | 1 × 10⁻³ | Cross entropy | Adam |
As can be seen from Figure 5(a), (b), the training process converges faster when the single-reward function is used: at about 3000 episodes, the step count and the reward both reach the maximum value of 300. With the detail-reward function, the training process converges more slowly and the reward rises more gently: at about 14,000 episodes, the steps and the reward reach their maximum values of 300 and 250, respectively. According to Eq (37), the single-reward function means that only the evaluation function e0(⋅) plays a role, and it takes only the values 0 and 1, whereas the detail-reward function takes values in [0, 1]. This is why the maximum rewards differ in the two cases.
In area A2 of Figure 5(c), when the detail-reward function is used, the cart position curve is relatively flat, with a minimum of −0.05 and a maximum of 0.04. With the single-reward function, the cart position curve oscillates strongly, with a minimum of −0.077 and a maximum of 0.05. Area A1 of Figure 5(d) shows that the swing pole angle varies within [−0.05, 0.05] with the detail-reward function and within [−0.07, 0.06] with the single-reward function, i.e., the swing angle range of the former is smaller than that of the latter.
By analyzing the step, reward, cart position, and swing pole angle curves in Figure 5, we conclude that when the agent is trained with the policy-gradient algorithm, the single-reward function requires fewer training episodes than the detail-reward function, but in the testing phase the detail-reward function produces better experimental results.
Tables 6 and 7 show the reinforcement learning parameters and network parameters of the DDPG algorithm in the inverted pendulum system. Figure 6 shows the training and testing process while using the DDPG algorithm in a dynamic environment.
Table 6. Reinforcement learning parameters of the DDPG algorithm.

| Episodes | Steps | Discount factor | Soft update coefficient | Memory |
| --- | --- | --- | --- | --- |
| 2000 | 300 | 0.99 | 1 × 10⁻² | 10,000 |

Table 7. Network parameters of the DDPG algorithm.

| Network | Network structure | Learning rate | Loss function | Optimizer | Batch size |
| --- | --- | --- | --- | --- | --- |
| Actor network | (4, 128), (128, 128), (128, 1) | 1 × 10⁻³ | −Q(s,a\|wQ) | Adam | 64 |
| Critic network | (5, 128), (128, 128), (128, 1) | 1 × 10⁻² | MSE | Adam | 64 |
It can be seen from Figure 6(a), (b) that with the single-reward function, when the episode count is below 180, the number of times the rule is met is less than 100 and the per-episode reward is less than 50; as the episodes increase, the step and reward curves fluctuate strongly. In contrast, with the detail-reward function, after about 180 episodes the reward curve is relatively flat and the per-episode reward is around 210.
In Figure 6(c), (d), when the single-reward function is used, at step 15 the cart position is 0.05 in the positive direction and the swing pole angle is 0.3 rad, i.e., the agent fails to control the system. With the detail-reward function, after 200 steps both the cart position and the swing pole angle are close to zero, and the position and angle curves are relatively smooth.
From the above analysis it can be concluded that, when the DDPG algorithm is used to train the agent, the reward obtained with the single-reward function is larger than that obtained with the detail-reward function in most cases. However, the reward curve under the single-reward function fluctuates strongly and the resulting model is unstable; moreover, the model trained with the single-reward function loses control during testing. The experimental results obtained with the detail-reward function are therefore better than those obtained with the single-reward function.
2) Comparison of experimental effects of different algorithms under the same detail-reward function
Through the analysis and comparison of the above experimental results, the following conclusions can be drawn: i) Using DRM in the inverted pendulum system under deep reinforcement learning is effective and innovative. ii) A good design of the evaluation functions is crucial for the training and testing process; under each of the algorithms, the detail-reward function produces better experimental results than the single-reward function.
Figures 7 and 8 compare, in the testing phase, the models generated by the different deep reinforcement learning algorithms trained under the same detail-reward function. Figure 7 shows the cart position at each step of the inverted pendulum system when DRM is applied. After 50 steps, the model generated by the DQN algorithm converges to a point close to 0; the model generated by the policy-gradient algorithm fluctuates in the range [−0.06, 0.05]; and under the model generated by the DDPG algorithm, the cart position gradually converges towards zero as the number of steps increases. The right part of Figure 7 is an enlarged view of area B. It shows that the model generated by the policy-gradient algorithm fluctuates strongly, i.e., the cart position varies over a range of more than 0.06, whereas the models generated by DQN and DDPG have a small fluctuation range, meaning the cart stays close to the equilibrium point.
Figure 8 shows the swing pole angle at each step under the models generated by the three algorithms when DRM is applied. After 50 steps, the model generated by the DQN algorithm converges to near zero, the model generated by the policy-gradient algorithm fluctuates within [−0.05, 0.05], and the model generated by the DDPG algorithm gradually converges to zero as the number of steps increases. The right part of Figure 8 is an enlarged view of area A. After 200 steps, the models generated by DDPG and DQN keep the swing pole angle oscillating around 0, whereas under the policy-gradient model the angle fluctuates within [−0.06, 0.06], a larger range than for the other two models.
By analyzing the curves in Figures 7 and 8, the following conclusions can be drawn: i) During testing, the stability of the policy-gradient algorithm is weaker than that of DQN and DDPG, and the cart position and swing pole angle are difficult to converge to zero. ii) During testing with the DQN algorithm, the cart position and swing pole angle both stay close to zero. However, the DQN algorithm outputs discrete actions; in a continuous control system, frequent discrete action outputs affect system stability, and on a physical platform large pulsed current outputs can greatly reduce the lifetime of electronic devices. iii) The DDPG algorithm outputs continuous actions. Compared with the DQN algorithm, its convergence is slower, but during testing the cart position and swing pole angle gradually converge to zero as the number of steps increases, so the output of the DDPG model can ensure the stability of the inverted pendulum system. Comparing the three deep reinforcement learning algorithms, DDPG, as a continuous-action algorithm with DRM introduced, is well suited to the inverted pendulum control system and achieves good experimental results.
According to the characteristics of the inverted pendulum NCS, deep reinforcement learning with DRM can solve the optimal control problem of the system. The above experiments compare the control accuracy of the inverted pendulum system under different reward functions. At present, this work has the following limitations: 1) DRM is applied only to the inverted pendulum system and not to more complex nonlinear systems. 2) A novel DRM method is presented, and the evaluation functions and reward function are established for the inverted pendulum system as an example, but no unified rules for constructing evaluation functions are given, i.e., their design depends on the designer. 3) The parameters of the inverted pendulum system are limited; for example, the swing pole angle is restricted to [−0.174, 0.174], and control over a broader state space is not studied. 4) The dynamic environment designed in this paper does not consider the unstable disturbances of a real inverted pendulum experimental platform, which limits the deployment of the trained agent models on real physical platforms.
Future work will improve on the following aspects: 1) We will apply the DRM method to more nonlinear systems and summarize its applicability in a wider range of experimental settings. 2) The construction of an evaluation function realizes the reward for a specific observation perspective; through experiments in more nonlinear environments, we will summarize the validity of the evaluation functions and their design method. 3) We will enlarge the control range of the swing pole angle in the inverted pendulum system and use deep reinforcement learning algorithms to solve the global optimal control problem. 4) We will design the target policy and the action (behavior) policy separately according to the importance-sampling off-policy learning theory in deep reinforcement learning: the action policy is optimized by training in the dynamic environment, while the target policy corresponds to the real experimental platform. The action policy samples data in the dynamic environment, which are used to estimate the target policy on the real platform, thus realizing optimal control of the real system.
In this paper, a reward mechanism based on control details is proposed for the first time. According to the specific requirements of the control objectives, the state space is divided into different observation perspectives, and an evaluation function is designed for each; the final reward function is then composed as the product of the evaluation functions, replacing the utility function in the HJB equation. Based on Lyapunov stability theory and Q-learning, the stability of the control law is proved. Taking the nonlinear inverted pendulum system as the experimental object, the detail-reward mechanism is introduced into the deep reinforcement learning algorithms. After designing the reward function according to the detail-reward mechanism, the effects of different reward functions on experimental accuracy are compared under three algorithms: DQN, policy gradient, and DDPG. The effectiveness of the method is demonstrated experimentally, and the optimal reward function and deep reinforcement learning model for the inverted pendulum system are determined. In the future we will: 1) apply the detail-reward mechanism to other deep reinforcement learning algorithms, such as A3C, TD3, and PPO, for verification; 2) apply the detail-reward mechanism to more nonlinear control systems, such as multi-agent and robot motion control systems; 3) optimize the models generated by the deep reinforcement learning algorithms to improve the control effect.
The authors acknowledge the support from the Joint Development Research Institute of Intelligent Motion Control Technology of the Liaoning Provincial Department of Education and the National Key R & D Program of China (Grant No. 2017YFB1300700).
The authors declare that they have no conflicts of interest in this work.