
The transformer model has recently become a milestone in artificial intelligence, raising the performance of tasks such as machine translation and computer vision to levels previously unattainable. However, this strong performance comes with a large memory overhead and enormous computing requirements, which significantly hinder the deployment of energy-efficient transformer systems. Owing to their high parallelism, low latency, and low power consumption, field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) achieve higher energy efficiency than graphics processing units (GPUs) and central processing units (CPUs), and are therefore widely used to accelerate deep learning algorithms. Several papers have addressed the deployment of the transformer on dedicated hardware for acceleration, but comprehensive studies of this area are lacking. We therefore summarize hardware-oriented transformer compression algorithms and their accelerator implementations to provide an overview of this research domain. This paper first introduces the transformer model framework and its computation process. It then discusses hardware-friendly compression algorithms for self-attention and the transformer, together with a review of state-of-the-art hardware accelerator frameworks. Finally, we consider some promising topics in transformer hardware acceleration, such as high-level design frameworks and selecting the optimal device using reinforcement learning.
Citation: Shizhen Huang, Enhao Tang, Shun Li, Xiangzhan Ping, Ruiqi Chen. Hardware-friendly compression and hardware acceleration for transformer: A survey[J]. Electronic Research Archive, 2022, 30(10): 3755-3785. doi: 10.3934/era.2022192
In this paper, we consider the following diffusion equation on a polygonal domain $\Omega\subset\mathbb{R}^2$, where $\alpha$ is a positive, piecewise-constant diffusion coefficient and $f\in L^2(\Omega)$:
$$-\nabla\cdot(\alpha\nabla u)=f\ \ \text{in }\Omega,\qquad u=0\ \ \text{on }\partial\Omega. \tag{1}$$
To approximate (1) while taking advantage of adaptive mesh refinement (AMR) to save valuable computational resources, the adaptive finite element method on quadtree meshes is among the most popular approaches in the engineering and scientific computing community [20]. Compared with simplicial meshes, quadtree meshes offer preferable accuracy and robustness, and many mature software packages (e.g., [1,2]) support them. To guide the AMR, one common approach is a posteriori error estimation: computable quantities indicate where the mesh needs to be refined or coarsened, thus balancing the spatial distribution of the error and improving the accuracy obtained per unit of computing power. Residual-based and recovery-based error estimators are among the most popular choices; in terms of accuracy, the recovery-based error estimator shows more appealing attributes [28,3].
More recently, many researchers have studied flux recovery, constructing a post-processed flux in a structure-preserving approximation space. Using (1) as an example, given that the data $f\in L^2(\Omega)$, the exact flux $-\alpha\nabla u$ lies in $H(\mathrm{div};\Omega)$, so it is natural to seek the post-processed flux in an $H(\mathrm{div})$-conforming space.
However, these
More recently, a new class of methods called the virtual element methods (VEM) was introduced in [4,8], which can be viewed as a polytopal generalization of tensorial/simplicial finite elements. Since then, many applications of VEM have been studied. A usual VEM workflow splits the method, as well as the finite-dimensional approximation space, into a consistency (approximation) part and a stabilization part. This allows flexible constructions of spaces that preserve structures of the continuous problem, such as higher-order continuity and exactly divergence-free fields, among others. VEM functions are represented merely by their degrees of freedom (DoF) functionals, not by pointwise values. In computation, if an optimal-order discontinuous approximation can be computed elementwise, then adding an appropriate parameter-free stabilization suffices to guarantee convergence under common assumptions on the mesh geometry.
The adoption of polytopal elements brings many distinctive advantages; for example, treating rectangular elements with hanging nodes as polygons allows a simple construction of
The major ingredient in our study is an $H(\mathrm{div})$-conforming virtual element space on the polygonal mesh induced by the quadtree, in which the recovered flux is constructed.
Let $\mathcal T$ be a quadtree partition of $\Omega$; the conforming finite element approximation $u_{\mathcal T}\in Q_k(\mathcal T)\cap H^1_0(\Omega)$ satisfies
$$(\alpha\nabla u_{\mathcal T},\nabla v_{\mathcal T})=(f,v_{\mathcal T}),\qquad\forall v_{\mathcal T}\in Q_k(\mathcal T)\cap H^1_0(\Omega), \tag{2}$$
in which the standard inner-product notation is adopted, and
$$Q_k(\mathcal T):=\{v\in H^1(\Omega):\ v|_K\in Q_k(K),\ \forall K\in\mathcal T\},$$
and on each rectangular element $K=[a,b]\times[c,d]$,
$$Q_k(K):=P_{k,k}(K)=\{p(x)q(y):\ p\in P_k([a,b]),\ q\in P_k([c,d])\},$$
where $P_k([a,b])$ denotes the space of polynomials of degree at most $k$ on $[a,b]$.
On the quadtree mesh $\mathcal T$, let $\mathcal N$ be the set of all nodes and, for an element $K\in\mathcal T$, let $\mathcal N_K$ be the set of its vertices. The set of hanging nodes is
$$\mathcal N_H:=\{z\in\mathcal N:\ \exists K\in\mathcal T,\ z\in\partial K\setminus\mathcal N_K\}. \tag{3}$$
Otherwise the node is a regular node. For each interior edge $e$ shared by two elements, with $v^-$ and $v^+$ denoting the traces of a (possibly discontinuous) function $v$ from the two sides and $\gamma\in[0,1]$ a weight, we define the weighted average
$$\{v\}^{\gamma}_e:=\gamma v^-+(1-\gamma)v^+.$$
In this subsection, the quadtree mesh $\mathcal T$ is embedded into a polygonal mesh $\mathcal T_{\mathrm{poly}}$: every hanging node is treated as a genuine vertex of the elements on whose boundaries it lies, so that a rectangular element with hanging nodes becomes a polygon.
For the embedded element
Subsequently,
On each element $K\in\mathcal T_{\mathrm{poly}}$, the local virtual element space for the flux is
$$\mathcal V_k(K):=\{\tau\in H(\mathrm{div};K)\cap H(\mathrm{rot};K):\ \nabla\cdot\tau\in P_{k-1}(K),\ \nabla\times\tau=0,\ \tau\cdot n_e\in P_k(e)\ \ \forall e\subset\partial K\}. \tag{4}$$
An $H(\mathrm{div})$-conforming global space is obtained by gluing the local spaces through their edge traces:
$$\mathcal V_k:=\{\tau\in H(\mathrm{div}):\ \tau|_K\in\mathcal V_k(K)\ \text{on }K\in\mathcal T_{\mathrm{poly}}\}. \tag{5}$$
Next we turn to define the degrees of freedom (DoFs) of this space. To this end, we define the set of scaled monomials
$$P_k(e):=\mathrm{span}\Bigl\{1,\ \frac{s-m_e}{h_e},\ \Bigl(\frac{s-m_e}{h_e}\Bigr)^2,\ \dots,\ \Bigl(\frac{s-m_e}{h_e}\Bigr)^k\Bigr\}, \tag{6}$$
where $s$ is the arc-length coordinate on $e$, $m_e$ is the midpoint of $e$, and $h_e$ is the length of $e$; and on each element $K$, with centroid $x_K$ and diameter $h_K$,
$$P_k(K):=\mathrm{span}\Bigl\{m_{\alpha}(x):=\Bigl(\frac{x-x_K}{h_K}\Bigr)^{\alpha},\ |\alpha|\le k\Bigr\}. \tag{7}$$
The degrees of freedom (DoFs) are then set as follows for a $\tau\in\mathcal V_k(K)$:
$$\begin{aligned}
&(e)\ k\ge 1:\quad \int_e(\tau\cdot n_e)\,m\,ds,\quad \forall m\in P_k(e),\ \text{on }e\subset\mathcal E_{\mathrm{poly}};\\
&(i)\ k\ge 2:\quad \int_K\tau\cdot\nabla m\,dx,\quad \forall m\in P_{k-1}(K)/\mathbb R,\ \text{on }K\in\mathcal T_{\mathrm{poly}}.
\end{aligned} \tag{8}$$
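For concreteness, the edge moments in (8e) can be evaluated by a one-dimensional Gauss quadrature against the scaled monomials (6). The following is a minimal NumPy sketch of this computation (our own illustration, not the paper's code); the flux passed in is a generic callable.

```python
import numpy as np

def edge_dofs(flux, p0, p1, k=1, nq=4):
    """Edge DoFs (8e): moments of tau . n_e against the scaled monomials
    {((s - m_e)/h_e)^j, j = 0..k} of (6) on the edge [p0, p1]."""
    p0, p1 = np.asarray(p0, float), np.asarray(p1, float)
    t = p1 - p0                                    # tangent vector of the edge
    h_e = np.linalg.norm(t)                        # edge length
    n_e = np.array([t[1], -t[0]]) / h_e            # a unit normal (fixed orientation)
    xq, wq = np.polynomial.legendre.leggauss(nq)   # Gauss nodes/weights on [-1, 1]
    pts = 0.5 * (p0 + p1) + 0.5 * np.outer(xq, t)  # quadrature points on the edge
    s_scaled = 0.5 * xq                            # (s - m_e)/h_e in [-1/2, 1/2]
    normal_flux = np.array([np.dot(flux(x, y), n_e) for x, y in pts])
    # moment against each scaled monomial; the 0.5*h_e factor is the 1D Jacobian
    return np.array([0.5 * h_e * np.sum(wq * normal_flux * s_scaled**j)
                     for j in range(k + 1)])

# Example: flux tau = (x + y, x - y) on the edge from (0, 0) to (1, 0)
print(edge_dofs(lambda x, y: np.array([x + y, x - y]), [0, 0], [1, 0], k=1))
```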
Remark 1. We note that in our construction, the degrees of freedom determining the curl of a VEM function in the original space of [8] are replaced by a curl-free constraint, thanks to the flexibility of the virtual element framework. The reason we opt for this subspace is that the true flux $-\alpha\nabla u$ is curl-free within each element when $\alpha$ is elementwise constant.
As the data
Consider
On each interior edge $e$ shared by two elements $K^-$ and $K^+$, the diffusion-weighted average of the numerical flux in the normal direction is
$$\{-\alpha\nabla u_{\mathcal T}\}^{\gamma_e}_e\cdot n_e:=\bigl(\gamma_e(-\alpha_{K^-}\nabla u_{\mathcal T}|_{K^-})+(1-\gamma_e)(-\alpha_{K^+}\nabla u_{\mathcal T}|_{K^+})\bigr)\cdot n_e, \tag{9}$$
where the weight is chosen as
$$\gamma_e:=\frac{\alpha_{K^+}^{1/2}}{\alpha_{K^+}^{1/2}+\alpha_{K^-}^{1/2}}. \tag{10}$$
First, for both the $k=1$ and the $k\ge 2$ case, the normal component of the recovered flux $\sigma_{\mathcal T}$ on each edge is prescribed through the edge DoFs by
$$\sigma_{\mathcal T}\cdot n_e=\{-\alpha\nabla u_{\mathcal T}\}^{\gamma_e}_e\cdot n_e. \tag{11}$$
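For illustration, the weight (10) and the prescribed normal component (11) on a single interior edge can be computed directly from the two one-sided gradients of $u_{\mathcal T}$; the following sketch uses our own (hypothetical) variable names and a constant-per-element $\alpha$, as in the paper.

```python
import numpy as np

def recovered_normal_flux(grad_minus, grad_plus, alpha_minus, alpha_plus, n_e):
    """Diffusion-weighted average (9)-(11) of the numerical flux on an
    interior edge shared by K^- and K^+ (n_e is a fixed unit normal)."""
    n_e = np.asarray(n_e, float)
    # Weight (10): gamma_e = sqrt(alpha_{K+}) / (sqrt(alpha_{K+}) + sqrt(alpha_{K-}))
    gamma_e = np.sqrt(alpha_plus) / (np.sqrt(alpha_plus) + np.sqrt(alpha_minus))
    flux_minus = -alpha_minus * np.asarray(grad_minus, float)
    flux_plus = -alpha_plus * np.asarray(grad_plus, float)
    # (9) and (11): prescribed edge value of sigma_T . n_e
    return (gamma_e * flux_minus + (1.0 - gamma_e) * flux_plus) @ n_e

# Example: a large jump in alpha across a vertical edge
print(recovered_normal_flux(grad_minus=[1.0, 0.0], grad_plus=[0.01, 0.0],
                            alpha_minus=1.0, alpha_plus=100.0, n_e=[1.0, 0.0]))
```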
In the lowest order case $k=1$, the elementwise divergence of $\sigma_{\mathcal T}$ is the constant determined by the divergence theorem and the edge data (11):
$$|K|\,\nabla\cdot\sigma_{\mathcal T}=\int_K\nabla\cdot\sigma_{\mathcal T}\,dx=\int_{\partial K}\sigma_{\mathcal T}\cdot n_{\partial K}\,ds=\sum_{e\subset\partial K}\int_e\sigma_{\mathcal T}\cdot n_{\partial K}|_e\,ds. \tag{12}$$
If $k\ge 2$, the divergence of $\sigma_{\mathcal T}$ on $K$ is set as
$$\nabla\cdot\sigma_{\mathcal T}=\Pi_{k-1}f+c_K. \tag{13}$$
The reason to add the constant $c_K$ is to keep (13) compatible with the edge data (11) through the divergence theorem; a direct computation gives
$$c_K=\frac{1}{|K|}\Bigl(-\int_K\Pi_{k-1}f\,dx+\sum_{e\subset\partial K}\int_e\{-\alpha\nabla u_{\mathcal T}\}^{\gamma_e}_e\cdot n_{\partial K}|_e\,ds\Bigr). \tag{14}$$
Consequently, for $k\ge 2$ the interior DoFs of $\sigma_{\mathcal T}$ are determined by integration by parts: for all $q\in P_{k-1}(K)$,
$$(\sigma_{\mathcal T},\nabla q)_K=-(\Pi_{k-1}f+c_K,q)_K+\sum_{e\subset\partial K}\bigl(\{-\alpha\nabla u_{\mathcal T}\}^{\gamma_e}_e\cdot n_{\partial K}|_e,\,q\bigr)_e. \tag{15}$$
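The constant $c_K$ in (14) only requires the element average of $\Pi_{k-1}f$ and the zeroth-order edge moments already available from (11). A small illustrative helper follows (the argument names are ours, not from the paper's code):

```python
import numpy as np

def divergence_correction(area_K, mean_proj_f, edge_flux_integrals):
    """Constant c_K of (14): enforces the divergence theorem for sigma_T on K.

    area_K              : |K|
    mean_proj_f         : (1/|K|) * int_K Pi_{k-1} f dx
    edge_flux_integrals : per-edge values of int_e {-alpha grad u_T}^gamma . n ds,
                          oriented with the outward normal of K
    """
    boundary_total = float(np.sum(edge_flux_integrals))
    return (-mean_proj_f * area_K + boundary_total) / area_K

# Example: unit-area element, mean of Pi_{k-1} f equal to 1,
# boundary flux integrals summing to 0.9  ->  c_K = -0.1
print(divergence_correction(1.0, 1.0, [0.3, -0.1, 0.4, 0.3]))
```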
Toward constructing a computable local error indicator, inspired by the VEM formulation [8], the recovered flux is projected onto a space with a much simpler structure. A local oblique projection $\Pi:\ L^2(K)^2\to\nabla P_k(K)$ is defined by
$$(\Pi\tau,\nabla p)_K=(\tau,\nabla p)_K,\qquad\forall p\in P_k(K)/\mathbb R. \tag{16}$$
Next we show that this projection can be computed directly from the DoFs (8) for vector fields in $\mathcal V_k(K)$. When $k=1$, integration by parts gives
$$(\tau,\nabla p)_K=-(\nabla\cdot\tau,p)_K+(\tau\cdot n,p)_{\partial K}. \tag{17}$$
By the definition of the space (4), when $k=1$ the divergence $\nabla\cdot\tau$ is a constant on $K$ determined by the edge DoFs through the divergence theorem, and $\tau\cdot n$ is a polynomial on each edge, so both terms in (17) are computable from the DoF set $(e)$. When $k\ge 2$,
$$(\tau,\nabla p)_K=-(\nabla\cdot\tau,\Pi_{k-1}p)_K+(\tau\cdot n,p)_{\partial K}=(\tau,\nabla\Pi_{k-1}p)_K+(\tau\cdot n,p-\Pi_{k-1}p)_{\partial K}, \tag{18}$$
which can be evaluated using both DoF sets $(e)$ and $(i)$ in (8).
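In practice, (16) is a small linear system for the coefficients of $\Pi\tau$ in the basis $\{\nabla m_\alpha\}$: $Gc=b$ with $G_{ij}=(\nabla m_j,\nabla m_i)_K$ and $b_i=(\tau,\nabla m_i)_K$, where the right-hand side is exactly what (17)–(18) make computable from the DoFs. Below is a minimal sketch for $k=1$ on a rectangular element, evaluating both sides by a tensor-product Gauss quadrature for a generic $\tau$ (an illustration under our own naming, not the paper's implementation):

```python
import numpy as np

def oblique_projection_k1(tau, rect):
    """Oblique projection (16) for k = 1 on a rectangle K = [x0,x1] x [y0,y1]:
    Pi tau = sum_j c_j grad m_j, with m_1 = (x-x_K)/h_K, m_2 = (y-y_K)/h_K.
    Solves the 2x2 system G c = b, G_ij = (grad m_j, grad m_i)_K,
    b_i = (tau, grad m_i)_K, both evaluated by Gauss quadrature."""
    x0, x1, y0, y1 = rect
    hK = max(x1 - x0, y1 - y0)        # element size (this scaling cancels in Pi tau)
    grads = np.array([[1.0 / hK, 0.0], [0.0, 1.0 / hK]])   # grad m_1, grad m_2
    xq, wq = np.polynomial.legendre.leggauss(3)             # 3-point Gauss rule
    xs = 0.5 * (x0 + x1) + 0.5 * (x1 - x0) * xq
    ys = 0.5 * (y0 + y1) + 0.5 * (y1 - y0) * xq
    jac = 0.25 * (x1 - x0) * (y1 - y0)
    G = np.zeros((2, 2))
    b = np.zeros(2)
    for i, wi in enumerate(wq):
        for j, wj in enumerate(wq):
            w = wi * wj * jac
            t = np.asarray(tau(xs[i], ys[j]), float)
            G += w * grads @ grads.T     # (grad m_i, grad m_j)_K entries
            b += w * grads @ t           # (tau, grad m_i)_K entries
    c = np.linalg.solve(G, b)
    return c @ grads                     # constant projected vector field on K

# Example: project tau = (x, -y) on the unit square; result is its mean (0.5, -0.5)
print(oblique_projection_k1(lambda x, y: (x, -y), (0.0, 1.0, 0.0, 1.0)))
```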
Given the recovered flux $\sigma_{\mathcal T}$ in Section 3, the recovery-based local error indicator on $K$ consists of a flux part and a residual part,
$$\eta_{\mathrm{flux},K}:=\|\alpha^{-1/2}(\sigma_{\mathcal T}+\alpha\nabla u_{\mathcal T})\|_K,\qquad\text{and}\qquad \eta_{\mathrm{res},K}:=\alpha_K^{-1/2}h_K\|f-\nabla\cdot\sigma_{\mathcal T}\|_K, \tag{19}$$
and the local indicator is
$$\eta_K=\begin{cases}\eta_{\mathrm{flux},K} & \text{when }k=1,\\ \bigl(\eta_{\mathrm{flux},K}^2+\eta_{\mathrm{res},K}^2\bigr)^{1/2} & \text{when }k\ge 2.\end{cases} \tag{20}$$
A computable counterpart of the flux part is obtained by applying the oblique projection (16),
$$\hat\eta_{\mathrm{flux},K}:=\|\alpha_K^{-1/2}\Pi(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\|_K, \tag{21}$$
with the oblique projection $\Pi$ defined in (16); the part left out by $\Pi$ is measured by a stabilization term
$$\hat\eta_{\mathrm{stab},K}:=|\alpha_K^{-1/2}(I-\Pi)(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})|_{S,K}. \tag{22}$$
Here $|\cdot|_{S,K}:=S_K(\cdot,\cdot)^{1/2}$ with the stabilization bilinear form
$$S_K(v,w):=\sum_{e\subset\partial K}h_e(v\cdot n_e,w\cdot n_e)_e+\sum_{\alpha\in\Lambda}(v,\nabla m_{\alpha})_K\,(w,\nabla m_{\alpha})_K, \tag{23}$$
where $\Lambda$ is the index set of the non-constant scaled monomials $m_\alpha$ in (7). The computable error estimator is then
$$\hat\eta^2=\begin{cases}\displaystyle\sum_{K\in\mathcal T}\bigl(\hat\eta_{\mathrm{flux},K}^2+\hat\eta_{\mathrm{stab},K}^2\bigr)=:\sum_{K\in\mathcal T}\hat\eta_K^2 & \text{when }k=1,\\[2mm] \displaystyle\sum_{K\in\mathcal T}\bigl(\hat\eta_{\mathrm{flux},K}^2+\hat\eta_{\mathrm{stab},K}^2+\eta_{\mathrm{res},K}^2\bigr)=:\sum_{K\in\mathcal T}\hat\eta_K^2 & \text{when }k\ge 2.\end{cases} \tag{24}$$
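Once the per-element pieces (21), (22) and, for $k\ge 2$, the residual part are available, assembling (24) is a root-sum-of-squares. A schematic helper, assuming the elementwise arrays have already been computed (names are ours):

```python
import numpy as np

def total_estimator(eta_flux, eta_stab, eta_res=None):
    """Global estimator (24) from elementwise contributions.

    eta_flux, eta_stab : arrays of hat{eta}_{flux,K} and hat{eta}_{stab,K}
    eta_res            : array of eta_{res,K}; omit (None) for the k = 1 case
    Returns (hat{eta}, array of local hat{eta}_K).
    """
    eta_flux = np.asarray(eta_flux, float)
    eta_stab = np.asarray(eta_stab, float)
    local_sq = eta_flux**2 + eta_stab**2
    if eta_res is not None:              # the k >= 2 case includes the residual part
        local_sq = local_sq + np.asarray(eta_res, float)**2
    return np.sqrt(local_sq.sum()), np.sqrt(local_sq)

# Example with three elements (k = 1 case)
eta, eta_K = total_estimator([0.1, 0.05, 0.2], [0.02, 0.01, 0.03])
print(eta, eta_K)
```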
In this section, we shall prove that the proposed recovery-based estimator is efficient, i.e., it is locally bounded by the standard residual-based quantities and the data oscillation.
Theorem 4.1. Let $u_{\mathcal T}$ solve (2) and let $\sigma_{\mathcal T}$ be the flux recovered in Section 3. Then on each $K\in\mathcal T$,
$$\hat\eta_{\mathrm{flux},K}^2\lesssim \mathrm{osc}(f;K)^2+\eta_{\mathrm{elem},K}^2+\eta_{\mathrm{edge},K}^2, \tag{25}$$
where
$$\mathrm{osc}(f;K)=\alpha_K^{-1/2}h_K\|f-\Pi_{k-1}f\|_K,\qquad \eta_{\mathrm{elem},K}:=\alpha_K^{-1/2}h_K\|f+\nabla\cdot(\alpha\nabla u_{\mathcal T})\|_K,$$
$$\eta_{\mathrm{edge},K}:=\Bigl(\sum_{e\subset\partial K}\frac{h_e}{\alpha_K+\alpha_{K_e}}\,\|[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]\|_e^2\Bigr)^{1/2}.$$
In the edge jump term, $K_e$ denotes the element sharing the edge $e$ with $K$, and $[\![\cdot]\!]$ denotes the jump across $e$.
Proof. Let $p\in P_k(K)/\mathbb R$ be such that $\nabla p=\alpha_K^{-1}\Pi(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})$, so that $\alpha_K^{1/2}\|\nabla p\|_K=\hat\eta_{\mathrm{flux},K}$. Then by (16) and integration by parts,
$$\hat\eta_{\mathrm{flux},K}^2=(\Pi(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),\nabla p)_K=(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T},\nabla p)_K=-(\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),p)_K+\sum_{e\subset\partial K}\int_e(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\cdot n_{\partial K}|_e\,p\,ds. \tag{26}$$
By (11), without loss of generality we assume $K=K^-$ on each edge $e\subset\partial K$, so that with $K_e=K^+$,
$$(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\cdot n_e=\bigl((1-\gamma_e)\alpha_K\nabla u_{\mathcal T}|_K-(1-\gamma_e)\alpha_{K_e}\nabla u_{\mathcal T}|_{K_e}\bigr)\cdot n_e=\frac{\alpha_K^{1/2}}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]_e. \tag{27}$$
The boundary term in (26) can then be rewritten as
$$\int_e(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\cdot n_e\,p\,ds=\int_e\frac{1}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]_e\,\alpha_K^{1/2}p\,ds\lesssim\frac{h_e^{1/2}}{(\alpha_K+\alpha_{K_e})^{1/2}}\,\|[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]\|_e\;\alpha_K^{1/2}h_e^{-1/2}\|p\|_e. \tag{28}$$
By a trace inequality on an edge of a polygon (Lemma 7.1) and the Poincaré inequality for the mean-zero $p\in P_k(K)/\mathbb R$,
$$h_e^{-1/2}\|p\|_e\lesssim h_K^{-1}\|p\|_K+\|\nabla p\|_K\lesssim\|\nabla p\|_K.$$
As a result,
$$\sum_{e\subset\partial K}\int_e(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\cdot n_e\,p\,ds\lesssim\eta_{\mathrm{edge},K}\,\alpha_K^{1/2}\|\nabla p\|_K=\eta_{\mathrm{edge},K}\,\hat\eta_{\mathrm{flux},K}.$$
For the bulk term on $K$, when $k=1$, $\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})$ is constant on $K$, hence
$$\begin{aligned}
-(\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),p)_K
&\le |\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})|\,|K|^{1/2}\,\|p\|_K
=\frac{1}{|K|^{1/2}}\Bigl|\int_K\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\,dx\Bigr|\,\|p\|_K\\
&=\frac{1}{|K|^{1/2}}\Bigl|\sum_{e\subset\partial K}\int_e(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\cdot n_{\partial K}|_e\,ds\Bigr|\,\|p\|_K\\
&\lesssim\Bigl(\sum_{e\subset\partial K}\frac{h_e^{1/2}}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,\|[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]\|_e\Bigr)\,\alpha_K^{1/2}\|\nabla p\|_K
\lesssim\eta_{\mathrm{edge},K}\,\hat\eta_{\mathrm{flux},K}.
\end{aligned}$$
When $k\ge 2$, by (13),
$$-(\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),p)_K=-(\Pi_{k-1}f+c_K+\nabla\cdot(\alpha_K\nabla u_{\mathcal T}),p)_K\le\bigl(\|f-\Pi_{k-1}f\|_K+\|f+\nabla\cdot(\alpha\nabla u_{\mathcal T})\|_K+|c_K|\,|K|^{1/2}\bigr)\|p\|_K. \tag{29}$$
The first two terms can be handled by combining them with $\|p\|_K\lesssim h_K\|\nabla p\|_K$, which yields $\mathrm{osc}(f;K)$ and $\eta_{\mathrm{elem},K}$. For the last term, by (14),
$$\begin{aligned}
c_K|K|^{1/2}&=\frac{1}{|K|^{1/2}}\Bigl(-\int_K(\Pi_{k-1}f-f)\,dx-\int_K\bigl(f+\nabla\cdot(\alpha\nabla u_{\mathcal T})\bigr)\,dx+\int_K\nabla\cdot(\alpha\nabla u_{\mathcal T})\,dx+\sum_{e\subset\partial K}\int_e\{-\alpha\nabla u_{\mathcal T}\}^{\gamma_e}_e\cdot n_e\,ds\Bigr)\\
&\le\|f-\Pi_{k-1}f\|_K+\|f+\nabla\cdot(\alpha\nabla u_{\mathcal T})\|_K+\frac{1}{|K|^{1/2}}\sum_{e\subset\partial K}\int_e\bigl(\alpha_K\nabla u_{\mathcal T}-\{\alpha\nabla u_{\mathcal T}\}^{\gamma_e}_e\bigr)\cdot n_e\,ds\\
&\lesssim\|f-\Pi_{k-1}f\|_K+\|f+\nabla\cdot(\alpha\nabla u_{\mathcal T})\|_K+\sum_{e\subset\partial K}h_e^{-1/2}\,\frac{\alpha_K^{1/2}}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,\|[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]\|_e. \tag{30}
\end{aligned}$$
Combining the terms above with $\|p\|_K\lesssim h_K\|\nabla p\|_K$, we obtain
$$-(\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),p)_K\lesssim\bigl(\mathrm{osc}(f;K)+\eta_{\mathrm{elem},K}+\eta_{\mathrm{edge},K}\bigr)\,\alpha_K^{1/2}\|\nabla p\|_K,$$
and the theorem follows.
Theorem 4.2. Under the same setting as Theorem 4.1, let $\hat\eta_{\mathrm{stab},K}$ be defined by (22); then
$$\hat\eta_{\mathrm{stab},K}^2\lesssim \mathrm{osc}(f;K)^2+\eta_{\mathrm{elem},K}^2+\eta_{\mathrm{edge},K}^2. \tag{31}$$
The constant depends on
Proof. This theorem follows directly from the norm equivalence Lemma 7.3:
$$|\alpha_K^{-1/2}(I-\Pi)(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})|_{S,K}\lesssim|\alpha_K^{-1/2}(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})|_{S,K},$$
while evaluating the DoFs
Theorem 4.3. Under the same setting as Theorem 4.1, on any $K\in\mathcal T$,
$$\hat\eta_K\lesssim \mathrm{osc}(f;K)+\|\alpha^{1/2}\nabla(u-u_{\mathcal T})\|_{\omega_K}, \tag{32}$$
with a constant independent of
Proof. This is a direct consequence of Theorems 4.1 and 4.2 and the fact that the residual-based error indicator is efficient, by a common bubble function argument.
In this section, we shall prove that the computable error estimator $\hat\eta$ is reliable, i.e., it bounds the energy error $\|\alpha^{1/2}\nabla(u-u_{\mathcal T})\|$ up to data oscillation.
Assumption 1 (
By Assumption 1, we denote the father
Assumption 2 (Quasi-monotonicity of $\alpha$).
Denote
$$\pi_zv=\begin{cases}\dfrac{\int_{\omega_z\cap\omega_{m(z)}}v\,\phi_z}{\int_{\omega_z\cap\omega_{m(z)}}\phi_z} & \text{if }z\in\Omega,\\[2mm] 0 & \text{if }z\in\partial\Omega. \end{cases} \tag{33}$$
We note that if
$$Iv:=\sum_{z\in\mathcal N_1}(\pi_zv)\,\phi_z. \tag{34}$$
Lemma 4.4 (Estimates for the quasi-interpolation). For $v\in H_0^1(\Omega)$ and any $K\in\mathcal T$,
$$\alpha_K^{1/2}h_K^{-1}\|v-Iv\|_K+\alpha_K^{1/2}\|\nabla Iv\|_K\lesssim\|\alpha^{1/2}\nabla v\|_{\omega_K}, \tag{35}$$
and for any $z\in\mathcal N_1$,
$$\sum_{K\subset\omega_z}h_z^{-2}\|\alpha^{1/2}(v-\pi_zv)\phi_z\|_K^2\lesssim\|\alpha^{1/2}\nabla v\|_{\omega_z}^2, \tag{36}$$
in which
Proof. The estimate for
Denote by $\mathcal N_{\partial\Omega}$ and $\mathcal N_I$ the subsets of nodes associated with the Dirichlet boundary and with the interface where $\alpha$ has a jump, respectively. The data oscillation is defined as
$$\mathrm{osc}(f;\mathcal T)^2:=\sum_{z\in\mathcal N_1\cap(\mathcal N_{\partial\Omega}\cup\mathcal N_I)}h_z^2\|\alpha^{-1/2}f\|_{\omega_z}^2+\sum_{z\in\mathcal N_1\setminus(\mathcal N_{\partial\Omega}\cup\mathcal N_I)}h_z^2\|\alpha^{-1/2}(f-f_z)\|_{\omega_z}^2, \tag{37}$$
with
Theorem 4.5. Let
$$\|\alpha^{1/2}\nabla(u-u_{\mathcal T})\|\lesssim\bigl(\hat\eta^2+\mathrm{osc}(f;\mathcal T)^2\bigr)^{1/2}. \tag{38}$$
For
$$\|\alpha^{1/2}\nabla(u-u_{\mathcal T})\|\lesssim\hat\eta, \tag{39}$$
where the constant depends on
Proof. Let $\varepsilon:=u-u_{\mathcal T}$. By the Galerkin orthogonality $(\alpha\nabla\varepsilon,\nabla I\varepsilon)=0$, integration by parts with $\sigma_{\mathcal T}\in H(\mathrm{div};\Omega)$, and the Cauchy-Schwarz inequality,
$$\begin{aligned}
\|\alpha^{1/2}\nabla\varepsilon\|^2&=(\alpha\nabla(u-u_{\mathcal T}),\nabla(\varepsilon-I\varepsilon))
=(\alpha\nabla u+\sigma_{\mathcal T},\nabla(\varepsilon-I\varepsilon))-(\alpha\nabla u_{\mathcal T}+\sigma_{\mathcal T},\nabla(\varepsilon-I\varepsilon))\\
&=(f-\nabla\cdot\sigma_{\mathcal T},\varepsilon-I\varepsilon)-(\alpha\nabla u_{\mathcal T}+\sigma_{\mathcal T},\nabla(\varepsilon-I\varepsilon))\\
&\le\Bigl(\sum_{K\in\mathcal T}\alpha_K^{-1}h_K^2\|f-\nabla\cdot\sigma_{\mathcal T}\|_K^2\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal T}\alpha_Kh_K^{-2}\|\varepsilon-I\varepsilon\|_K^2\Bigr)^{1/2}\\
&\quad+\Bigl(\sum_{K\in\mathcal T}\alpha_K^{-1}\|\alpha\nabla u_{\mathcal T}+\sigma_{\mathcal T}\|_K^2\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal T}\alpha_K\|\nabla(\varepsilon-I\varepsilon)\|_K^2\Bigr)^{1/2}\\
&\lesssim\Bigl(\sum_{K\in\mathcal T}(\eta_{\mathrm{res},K}^2+\eta_{\mathrm{flux},K}^2)\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal T}\|\alpha^{1/2}\nabla\varepsilon\|_{\omega_K}^2\Bigr)^{1/2}.
\end{aligned}$$
Applying the norm equivalence of
When
$$(f,\varepsilon-I\varepsilon)=\sum_{z\in\mathcal N_1}\sum_{K\subset\omega_z}(f,(\varepsilon-\pi_z\varepsilon)\phi_z)_K, \tag{40}$$
in which a patch-wise constant
$$\begin{aligned}
(f-\nabla\cdot\sigma_{\mathcal T},\varepsilon-I\varepsilon)&=(f,\varepsilon-I\varepsilon)-(\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),\varepsilon-I\varepsilon)\\
&=\sum_{z\in\mathcal N_1}\sum_{K\subset\omega_z}(f,(\varepsilon-\pi_z\varepsilon)\phi_z)_K-(\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T}),\varepsilon-I\varepsilon)\\
&\le\bigl(\mathrm{osc}(f;\mathcal T)^2\bigr)^{1/2}\Bigl(\sum_{z\in\mathcal N_1}\sum_{K\subset\omega_z}h_z^{-2}\|\alpha^{1/2}(\varepsilon-\pi_z\varepsilon)\phi_z\|_K^2\Bigr)^{1/2}\\
&\quad+\Bigl(\sum_{K\in\mathcal T}\alpha_K^{-1}h_K^2\|\nabla\cdot(\sigma_{\mathcal T}+\alpha_K\nabla u_{\mathcal T})\|_K^2\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal T}\alpha_Kh_K^{-2}\|\varepsilon-I\varepsilon\|_K^2\Bigr)^{1/2}.
\end{aligned}$$
Applying the inverse inequality of Lemma 7.2 on
The numerical experiments are prepared using the bilinear element for common AMR benchmark problems. The code for this paper is publicly available at https://github.com/lyc102/ifem, implemented in MATLAB.
The adaptive finite element method (AFEM) iterative procedure follows the standard loop
SOLVE → ESTIMATE → MARK → REFINE.
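In code, this loop is a thin driver around four problem-specific routines. The skeleton below is only a schematic of the cycle; the solver, estimator, marking, and quadtree refinement are passed in as placeholder callbacks rather than the actual iFEM functions.

```python
def afem(mesh, solve, estimate, mark, refine, tol=1e-3, max_iter=30):
    """Standard AFEM loop: SOLVE -> ESTIMATE -> MARK -> REFINE.
    `solve`, `estimate`, `mark`, `refine` are user-supplied callbacks;
    `estimate` returns the array of local indicators hat{eta}_K."""
    history = []
    u = None
    for it in range(max_iter):
        u = solve(mesh)                          # SOLVE: discrete solution on current mesh
        eta_K = estimate(mesh, u)                # ESTIMATE: local error indicators
        total = sum(e**2 for e in eta_K) ** 0.5  # global estimator
        history.append((it, len(eta_K), total))
        if total < tol:                          # stop when the estimated error is small
            break
        marked = mark(eta_K)                     # MARK: e.g., the bulk (Doerfler) criterion
        mesh = refine(mesh, marked)              # REFINE: quadtree refinement of marked cells
    return u, mesh, history
```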
The linear system is solved within MATLAB. In each MARK step, a minimal subset $\mathcal M\subset\mathcal T$ is selected by the Dörfler (bulk) criterion
$$\sum_{K\in\mathcal M}\hat\eta_K^2\ge\theta\sum_{K\in\mathcal T}\hat\eta_K^2,\qquad\text{for }\theta\in(0,1).$$
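This bulk criterion amounts to sorting the elements by their indicators and keeping the smallest set carrying a $\theta$-fraction of the total estimated error; a minimal sketch (the value of $\theta$ in the example is illustrative only):

```python
import numpy as np

def doerfler_mark(eta_K, theta):
    """Bulk (Doerfler) marking: smallest set M with
    sum_{K in M} eta_K^2 >= theta * sum_K eta_K^2."""
    eta_sq = np.asarray(eta_K, float) ** 2
    order = np.argsort(eta_sq)[::-1]          # element indices, largest indicator first
    cumulative = np.cumsum(eta_sq[order])
    m = np.searchsorted(cumulative, theta * eta_sq.sum()) + 1
    return order[:m]                          # indices of marked elements

print(doerfler_mark([0.5, 0.1, 0.4, 0.05], theta=0.3))
```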
Throughout all examples, we fix
$$\eta_{\mathrm{Residual},K}^2:=\alpha_K^{-1}h_K^2\|f+\nabla\cdot(\alpha\nabla u_{\mathcal T})\|_K^2+\frac12\sum_{e\subset\partial K}\frac{h_e}{\alpha_K+\alpha_{K_e}}\|[\![\alpha\nabla u_{\mathcal T}\cdot n_e]\!]\|_e^2,$$
Let
$$\text{effectivity index}:=\eta/\|\alpha^{1/2}\nabla\varepsilon\|,\qquad\text{where }\varepsilon:=u-u_{\mathcal T},\ \ \eta=\eta_{\mathrm{Residual}}\ \text{or}\ \hat\eta,$$
i.e., the closer the effectivity index is to 1, the more accurately the estimator measures the error of interest. We estimate the convergence orders $r_\eta$ and $r_{\mathrm{err}}$ by least-squares fits of
$$\ln\eta_n\sim -r_\eta\ln N_n+c_1,\qquad\text{and}\qquad \ln\|\alpha^{1/2}\nabla(u-u_{\mathcal T})\|\sim -r_{\mathrm{err}}\ln N_n+c_2,$$
where the subscript $n$ denotes the $n$-th AFEM iteration and $N_n$ is the corresponding number of degrees of freedom.
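The orders $r_\eta$ and $r_{\mathrm{err}}$ are thus the negative slopes of least-squares lines through the log-log data collected over the AFEM iterations; a small sketch with illustrative data (not results from the paper):

```python
import numpy as np

def convergence_rate(N, values):
    """Least-squares fit of ln(values) ~ -r * ln(N) + c over AFEM iterations;
    returns the rate r (positive for a decreasing quantity)."""
    slope, _ = np.polyfit(np.log(np.asarray(N, float)),
                          np.log(np.asarray(values, float)), 1)
    return -slope

# Illustrative data: error decreasing like N^{-1/2}, the optimal rate
# for the energy error of bilinear elements in 2D
N = np.array([1e2, 4e2, 1.6e3, 6.4e3])
err = 2.0 * N**-0.5
print(convergence_rate(N, err))   # approximately 0.5
```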
In this example, a standard AMR benchmark on the L-shaped domain is tested. The true solution
The solution
This example is a common benchmark test problem for elliptic interface problems introduced in [9] (see also [17,12]). The true solution in polar coordinates is $u(r,\theta)=r^{\gamma}\mu(\theta)$, where
$$\mu(\theta)=\begin{cases}\cos((\pi/2-\delta)\gamma)\cdot\cos((\theta-\pi/2+\rho)\gamma) & \text{if }0\le\theta\le\pi/2,\\ \cos(\rho\gamma)\cdot\cos((\theta-\pi+\delta)\gamma) & \text{if }\pi/2\le\theta\le\pi,\\ \cos(\delta\gamma)\cdot\cos((\theta-\pi-\rho)\gamma) & \text{if }\pi\le\theta<3\pi/2,\\ \cos((\pi/2-\rho)\gamma)\cdot\cos((\theta-3\pi/2-\delta)\gamma) & \text{if }3\pi/2\le\theta\le 2\pi.\end{cases}$$
While
$$\gamma=0.1,\qquad R\approx 161.4476387975881,\qquad \rho=\pi/4,\qquad \delta\approx -14.92256510455152,$$
By this choice, the function is very singular near the origin, as the maximum regularity it has is $H^{1+\gamma-\epsilon}(\Omega)$ for any $\epsilon>0$.
The AFEM procedure for this problem stops when the relative error reaches
A postprocessed flux with the minimum
However, we do acknowledge that the technical tool involving interpolation is essentially limited to
The author is grateful for the constructive advice from the anonymous reviewers. This work was supported in part by the National Science Foundation under grants DMS-1913080 and DMS-2136075, and no additional revenues are related to this work.
Unlike the identity-matrix stabilization commonly used in most of the VEM literature, for our purpose the following inner product with stabilization is used on each $K\in\mathcal T_{\mathrm{poly}}$:
$$((\sigma,\tau))_K:=(\Pi\sigma,\Pi\tau)_K+S_K((I-\Pi)\sigma,(I-\Pi)\tau), \tag{41}$$
where
To show the inverse inequality and the norm equivalence used in the reliability bound, on each element, we need to introduce some geometric measures. Consider a polygonal element
Proposition 1. Under Assumption 1,
Lemma 7.1 (Trace inequality on small edges [13]). If Proposition 1 holds, then for $v\in H^1(K)$ and any edge $e\subset\partial K$,
$$h_e^{-1/2}\|v\|_e\lesssim h_K^{-1}\|v\|_K+\|\nabla v\|_K. \tag{42}$$
Proof. The proof follows essentially equation (3.9) in [13, Lemma 3.3], as a standard scaled trace inequality on a triangle $T_e\subset K$ having $e$ as an edge:
$$h_e^{-1/2}\|v\|_e\lesssim h_e^{-1}\|v\|_{T_e}+\|\nabla v\|_{T_e}\lesssim h_K^{-1}\|v\|_K+\|\nabla v\|_K.$$
Lemma 7.2 (Inverse inequalities). Under Assumption 1, we have the following inverse estimates for $\tau\in\mathcal V_k(K)$:
$$\|\nabla\cdot\tau\|_K\lesssim h_K^{-1}\|\tau\|_K,\qquad\text{and}\qquad\|\nabla\cdot\tau\|_K\lesssim h_K^{-1}S_K(\tau,\tau)^{1/2}. \tag{43}$$
Proof. The first inequality in (43) can be shown using a bubble function trick. Choose $p:=\nabla\cdot\tau\in P_{k-1}(K)$ and let $b_K$ be an interior bubble function on $K$; then
$$\|\nabla\cdot\tau\|_K^2\lesssim(\nabla\cdot\tau,pb_K)=-(\tau,\nabla(pb_K))\le\|\tau\|_K\|\nabla(pb_K)\|_K,$$
and then
$$\|\nabla(pb_K)\|_K\le\|b_K\nabla p\|_K+\|p\nabla b_K\|_K\le\|b_K\|_{\infty,K}\|\nabla p\|_K+\|p\|_K\|\nabla b_K\|_{\infty,K}.$$
Consequently, the first inequality in (43) follows from the standard inverse estimate for polynomials, $\|\nabla p\|_K\lesssim h_K^{-1}\|p\|_K$, together with $\|\nabla b_K\|_{\infty,K}\lesssim h_K^{-1}$.
To prove the second inequality in (43), by integration by parts we have
$$\|\nabla\cdot\tau\|_K^2=(\nabla\cdot\tau,p)=-(\tau,\nabla p)+\sum_{e\subset\partial K}(\tau\cdot n_e,p)_e. \tag{44}$$
Expand $p=\nabla\cdot\tau$ in the scaled monomial basis, $p=\sum_{\alpha}p_{\alpha}m_{\alpha}$, let $M$ be the mass matrix of this basis on $K$ and $\mathbf p$ the coefficient vector; then
$$\|p\|_K^2=\mathbf p^{\top}M\mathbf p\ge\mathbf p^{\top}\mathrm{diag}(M)\mathbf p\ge\min_j M_{jj}\,\|\mathbf p\|_{\ell^2}^2\simeq h_K^2\|\mathbf p\|_{\ell^2}^2, \tag{45}$$
since
$$\begin{aligned}
\|\nabla\cdot\tau\|_K^2&\le\Bigl(\sum_{\alpha\in\Lambda}(\tau,\nabla m_{\alpha})_K^2\Bigr)^{1/2}\Bigl(\sum_{\alpha\in\Lambda}p_{\alpha}^2\Bigr)^{1/2}+\Bigl(\sum_{e\subset\partial K}h_e\|\tau\cdot n_e\|_e^2\Bigr)^{1/2}\Bigl(\sum_{e\subset\partial K}h_e^{-1}\|p\|_e^2\Bigr)^{1/2}\\
&\lesssim S_K(\tau,\tau)^{1/2}\bigl(\|\mathbf p\|_{\ell^2}+h_K^{-1}\|p\|_K+\|\nabla p\|_K\bigr).
\end{aligned}$$
As a result, the second inequality in (43) is proved upon applying the inverse inequality $\|\nabla p\|_K\lesssim h_K^{-1}\|p\|_K$ for polynomials together with (45).
Remark 2. While the proof in Lemma 7.2 relies on
Lemma 7.3 (Norm equivalence). Under Assumption 1, let $\|\tau\|_{h,K}:=((\tau,\tau))_K^{1/2}$ with the inner product (41); then for any $\tau\in\mathcal V_k(K)$,
$$\gamma_*\|\tau\|_K\le\|\tau\|_{h,K}\le\gamma^*\|\tau\|_K, \tag{46}$$
where both
Proof. First we consider the lower bound. By the triangle inequality,
$$\|\tau\|_K\le\|\Pi\tau\|_K+\|\tau-\Pi\tau\|_K.$$
Since $\|\Pi\tau\|_K$ appears in $\|\tau\|_{h,K}$ and $(I-\Pi)\tau\in\mathcal V_k(K)$, it suffices to show
$$\|\tau\|_K^2\lesssim S_K(\tau,\tau),\qquad\text{for }\tau\in\mathcal V_k(K). \tag{47}$$
To this end, we consider the weak solution of the following auxiliary boundary value problem on $K$:
$$\begin{cases}\Delta\psi=\nabla\cdot\tau & \text{in }K,\\[1mm] \dfrac{\partial\psi}{\partial n}=\tau\cdot n_{\partial K} & \text{on }\partial K.\end{cases} \tag{48}$$
By a standard Helmholtz decomposition result (e.g., Proposition 3.1, Chapter 1 of [23]), $\tau-\nabla\psi=\nabla^{\perp}\phi$ for some $\phi$, and since $\tau\in\mathcal V_k(K)$ is curl-free while $(\tau-\nabla\psi)\cdot n_{\partial K}=0$ by (48), we have
$$\|\tau-\nabla\psi\|_K^2=(\tau-\nabla\psi,\nabla^{\perp}\phi)=0.$$
Consequently, we have essentially proved the unisolvency of the modified VEM space (4), and $\tau=\nabla\psi$ on $K$. Hence
$$\begin{aligned}
\|\tau\|_K^2=(\tau,\nabla\psi)_K&=-(\nabla\cdot\tau,\psi)_K+(\tau\cdot n_{\partial K},\psi)_{\partial K}\\
&\le\|\nabla\cdot\tau\|_K\|\psi\|_K+\sum_{e\subset\partial K}\|\tau\cdot n_e\|_e\|\psi\|_e\\
&\le\|\nabla\cdot\tau\|_K\|\psi\|_K+\Bigl(\sum_{e\subset\partial K}h_e\|\tau\cdot n_e\|_e^2\Bigr)^{1/2}\Bigl(\sum_{e\subset\partial K}h_e^{-1}\|\psi\|_e^2\Bigr)^{1/2}. \tag{49}
\end{aligned}$$
Proposition 1 allows us to apply an isotropic trace inequality on an edge of a polygon (Lemma 7.1); combining it with the Poincaré inequality for $\psi$, which may be taken with zero mean on $K$, we get
$$h_e^{-1/2}\|\psi\|_e\lesssim h_K^{-1}\|\psi\|_K+\|\nabla\psi\|_K\lesssim\|\nabla\psi\|_K.$$
Furthermore, applying the inverse estimate in Lemma 7.2 on the bulk term above, we have
$$\|\tau\|_K^2\lesssim S_K(\tau,\tau)^{1/2}\,\|\nabla\psi\|_K,$$
which, since $\|\nabla\psi\|_K=\|\tau\|_K$, proves the validity of (47) and thus yields the lower bound.
To prove the upper bound, by the definition (23) of $S_K$ it suffices to show, for each edge $e\subset\partial K$ and each $\alpha\in\Lambda$,
$$h_e\|\tau\cdot n_e\|_e^2\lesssim\|\tau\|_K^2,\qquad\text{and}\qquad|(\tau,\nabla m_{\alpha})_K|\lesssim\|\tau\|_K. \tag{50}$$
To prove the first inequality, by Proposition 1 again, consider an edge bubble function $b_e$ supported on the triangle $T_e\subset K$ with $e$ as an edge, satisfying
$$\|\nabla b_e\|_{\infty,K}=O(1/h_e),\qquad\text{and}\qquad\|b_e\|_{\infty,K}=O(1). \tag{51}$$
Denote by $q_e$ a polynomial extension of $\tau\cdot n_e\in P_k(e)$ to $T_e$; then
$$\begin{aligned}
\|\tau\cdot n_e\|_e^2&\lesssim(\tau\cdot n_e,b_eq_e)_e=(\tau\cdot n,b_eq_e)_{\partial K}=(\tau,q_e\nabla b_e)_K+(\nabla\cdot\tau,b_eq_e)_K\\
&\le\|\tau\|_K\|q_e\nabla b_e\|_{T_e}+\|\nabla\cdot\tau\|_K\|q_eb_e\|_{T_e}
\le\|\tau\|_K\|q_e\|_{T_e}\|\nabla b_e\|_{\infty,K}+\|\nabla\cdot\tau\|_K\|q_e\|_{T_e}\|b_e\|_{\infty,K}.
\end{aligned}$$
Now by the fact that
The second inequality in (50) can be verified straightforwardly using the scaling of the monomials (7):
$$|(\tau,\nabla m_{\alpha})_K|\le\|\tau\|_K\|\nabla m_{\alpha}\|_K\lesssim\|\tau\|_K, \tag{52}$$
since $\|\nabla m_{\alpha}\|_{\infty,K}\lesssim h_K^{-1}$ and $|K|^{1/2}\simeq h_K$.
Hence, (46) is proved.
[1] | A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017. https://doi.org/10.48550/arXiv.1706.03762 |
[2] | Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, et al., Learning deep transformer models for machine translation, preprint, arXiv: 1906.01787. |
[3] | S. A. Chowdhury, A. Abdelali, K. Darwish, J. Soon-Gyo, J. Salminen, B. J. Jansen, Improving arabic text categorization using transformer training diversification, in Proceedings of the Fifth Arabic Natural Language Processing Workshop (COLING-WANLP), (2020), 226–236. https://aclanthology.org/2020.wanlp-1.21 |
[4] | X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, M. Zhou, et al., A tensorized transformer for language modeling, preprint, arXiv: 1906.09777. |
[5] | J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, preprint, arXiv: 1810.04805. |
[6] | Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, et al., RoBERTa: A robustly optimized BERT pretraining approach, preprint, arXiv: 1907.11692. |
[7] | H. Xu, B. Liu, L. Shu, P. S. Yu, BERT post-training for review reading comprehension and aspect-based sentiment analysis, preprint, arXiv: 1904.02232. |
[8] | P. Shi, J. Lin, Simple BERT models for relation extraction and semantic role labeling, preprint, arXiv: 1904.05255. |
[9] | V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, preprint, arXiv: 1910.01108. |
[10] | Y. Cheng, D. Wang, P. Zhou, T. Zhang, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Signal Process. Mag., 35 (2018), 126–136. https://doi.org/10.1109/MSP.2017.2765695 |
[11] | S. Cheng, D. Lucor, J. P. Argaud, Observation data compression for variational assimilation of dynamical systems, J. Comput. Sci., 53 (2021), 101405. https://doi.org/10.1016/j.jocs.2021.101405 |
[12] | S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, J. Du, On-demand deep model compression for mobile devices: A usage-driven model selection framework, in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, (2018), 389–400. https://doi.org/10.1145/3210240.3210337 |
[13] | S. Liu, J. Du, K. Nan, Z. Zhou, H. Liu, Z. Wang, et al., AdaDeep: A usage-driven, automated deep model compression framework for enabling ubiquitous intelligent mobiles, IEEE Trans. Mob. Comput., 20 (2021), 3282–3297. https://doi.org/10.1109/TMC.2020.2999956 |
[14] | V. L. Tran, S. E. Kim, Efficiency of three advanced data-driven models for predicting axial compression capacity of CFDST columns, Thin-Walled Struct., 152 (2020), 106744. https://doi.org/10.1016/j.tws.2020.106744 |
[15] | Z. X. Hu, Y. Wang, M. F. Ge, J. Liu, Data-driven fault diagnosis method based on compressed sensing and improved multiscale network, IEEE Trans. Ind. Electron., 67 (2020), 3216–3225. https://doi.org/10.1109/TIE.2019.2912763 |
[16] | S. Cheng, I. C. Prentice, Y. Huang, Y. Jin, Y. K. Guo, R. Arcucci, Data-driven surrogate model with latent data assimilation: Application to wildfire forecasting, J. Comput. Phys., 464 (2022). https://doi.org/10.1016/j.jcp.2022.111302 |
[17] | S. Yang, Z. Zhang, C. Zhao, X. Song, S. Guo, H. Li, CNNPC: End-edge-cloud collaborative CNN inference with joint model partition and compression, IEEE Trans. Parallel Distrib. Syst., (2022), 1–1. https://doi.org/10.1109/TPDS.2022.3177782 |
[18] | H. He, S. Jin, C. K. Wen, F. Gao, G. Y. Li, Z. Xu, Model-driven deep learning for physical layer communications, IEEE Wireless Commun., 26 (2019), 77–83. https://doi.org/10.1109/MWC.2019.1800447 |
[19] | Z. Liu, M. del Rosario, Z. Ding, A markovian model-driven deep learning framework for massive MIMO CSI feedback, IEEE Trans. Wireless Commun., 21 (2022), 1214–1228. https://doi.org/10.1109/TWC.2021.3103120 |
[20] | W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, preprint, arXiv: 2002.10957. |
[21] | X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, et al., TinyBERT: Distilling BERT for natural language understanding, preprint, arXiv: 1909.10351. |
[22] | S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for BERT model compression, preprint, arXiv: 1908.09355. |
[23] | H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jegou, Training data-efficient image transformers & distillation through attention, in Proceedings of the 38th International Conference on Machine Learning (ICML), (2021), 10347–10357. https://doi.org/10.48550/arXiv.2012.12877 |
[24] | P. Michel, O. Levy, G. Neubig, Are sixteen heads really better than one?, Adv. Neural Inf. Process. Syst., preprint, arXiv: 1905.10650. |
[25] | M. A. Gordon, K. Duh, N. Andrews, Compressing BERT: Studying the effects of weight pruning on transfer learning, preprint, arXiv: 2002.08307. |
[26] | T. Chen, Y. Cheng, Z. Gan, L. Yuan, L. Zhang, Z. Wang, Chasing sparsity in vision transformers: An end-to-end exploration, Adv. Neural Inf. Process. Syst., (2021), 19974–19988. https://doi.org/10.48550/arXiv.2106.04533 |
[27] | T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, et al., The lottery ticket hypothesis for pre-trained BERT networks, Adv. Neural Inf. Process. Syst., (2020), 15834–15846. https://doi.org/10.48550/arXiv.2007.12223 |
[28] | S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, et al., Q-BERT: Hessian based ultra low precision quantization of BERT, preprint, arXiv: 1909.05840. |
[29] | Z. Liu, Y. Wang, K. Han, S. Ma, W. Gao, Post-training quantization for vision transformer, preprint, arXiv: 2106.14156. |
[30] | H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, et al., BinaryBERT: Pushing the limit of BERT quantization, preprint, arXiv: 2012.15701. |
[31] | O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat, Q8BERT: Quantized 8Bit BERT, in the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS 2019, (2019), 36–39. https://doi.org/10.1109/EMC2-NIPS53020.2019.00016 |
[32] | Z. Wu, Z. Liu, J. Lin, Y. Lin, S. Han, Lite transformer with long-short range attention, preprint, arXiv: 2004.11886. |
[33] | L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, Q. Liu, DynaBERT: Dynamic BERT with adaptive width and depth, preprint, arXiv: 2004.04037. |
[34] | M. Chen, H. Peng, J. Fu, H. Ling, AutoFormer: Searching transformers for visual recognition, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 12250–12260. https://doi.org/10.1109/ICCV48922.2021.01205 |
[35] | P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, et al., Compressing large-scale transformer-based models: A case study on BERT, Trans. Assoc. Comput. Linguist., 9 (2021), 1061–1080. https://doi.org/10.1162/tacl_a_00413 |
[36] | S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput., 9 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 |
[37] | J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, preprint, arXiv: 1412.3555. |
[38] | D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, preprint, arXiv: 1409.0473. |
[39] | B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, et al., FTRANS: energy-efficient acceleration of transformers using FPGA, in Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), (2020), 175–180. https://doi.org/10.1145/3370748.3406567 |
[40] | T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, et al., A^3: Accelerating attention mechanisms in neural networks with approximation, in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), (2020), 328–341. https://doi.org/10.1109/HPCA47549.2020.00035 |
[41] | T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, et al., ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), (2021), 692–705. https://doi.org/10.1109/ISCA52012.2021.00060 |
[42] | X. Zhang, Y. Wu, P. Zhou, X. Tang, J. Hu, Algorithm-hardware co-design of attention mechanism on FPGA devices, ACM Trans. Embed. Comput. Syst., 20 (2021), 1–24. https://doi.org/10.1145/3477002 |
[43] | S. Lu, M. Wang, S. Liang, J. Lin, Z. Wang, Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer, in IEEE International SOC Conference, (2020), 84–89. https://doi.org/10.1109/ISCA52012.2021.00060 |
[44] | A. Parikh, O. Täckström, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, (2016), 2249–2255. https://doi.org/10.48550/arXiv.1606.01933 |
[45] | Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, et al., A structured self-attentive sentence embedding, preprint, arXiv: 1703.03130 |
[46] | M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, (2002), 380–388. https://doi.org/10.1145/509907.509965 |
[47] | X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, S. F. Chang, Fast orthogonal projection based on kronecker product, in 2015 IEEE International Conference on Computer Vision (ICCV), (2015), 2929–2937. https://doi.org/10.1109/ICCV.2015.335 |
[48] | Y. Gong, S. Kumar, H. A. Rowley, S. Lazebnik, Learning binary codes for high-dimensional data using bilinear projections, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2013), 484–491. https://doi.org/10.1109/CVPR.2013.69 |
[49] | M. Wang, S. Lu, D. Zhu, J. Lin, Z. Wang, A high-speed and low-complexity architecture for softmax function in deep learning, in 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), (2018), 223–226. https://doi.org/10.1109/APCCAS.2018.8605654 |
[50] | R. Hu, B. Tian, S. Yin, S. Wei, Efficient hardware architecture of softmax layer in deep neural network, in 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), (2018), 1–5. https://doi.org/10.1109/ICDSP.2018.8631588 |
[51] | L. Deng, G. Li, S. Han, L. Shi, Y. Xie, Model compression and hardware acceleration for neural networks: A comprehensive survey, Proc. IEEE, 108 (2020), 485–532. https://doi.org/10.1109/JPROC.2020.2976475 |
[52] | C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, et al., CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices, in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), (2017), 395–408. https://doi.org/10.1145/3123939.3124552 |
[53] | S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, et al., C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs, in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), (2018), 11–20. https://doi.org/10.1145/3174243.3174253 |
[54] | L. Zhao, S. Liao, Y. Wang, Z. Li, J. Tang, B. Yuan, Theoretical properties for neural networks with weight matrices of low displacement rank, in Proceedings of the 34th International Conference on Machine Learning (ICML), (2017), 4082–4090. https://doi.org/10.48550/arXiv.1703.00144 |
[55] | V. Y. Pan, Structured matrices and displacement operators, in Structured Matrices and Polynomials: Unified Superfast Algorithms, Springer Science & Business Media, (2001), 117–153. https://doi.org/10.1007/978-1-4612-0129-8 |
[56] | J. O. Smith, Mathematics of the discrete fourier transform (DFT): with audio applications, in Mathematics of the Discrete Fourier Transform (DFT): With Audio Applications, Julius Smith, (2007), 115–164. https://ccrma.stanford.edu/~jos/st/ |
[57] | Z. Liu, G. Li, J. Cheng, Hardware acceleration of fully quantized BERT for efficient natural language processing, in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), (2021), 513–516. https://doi.org/10.23919/DATE51398.2021.9474043 |
[58] | M. Sun, H. Ma, G. Kang, Y. Jiang, T. Chen, X. Ma, et al., VAQF: Fully automatic software-hardware co-design framework for low-bit vision transformer, preprint, arXiv: 2201.06618. |
[59] | Z. Liu, Z. Shen, M. Savvides, K. T. Cheng, ReActNet: Towards precise binary neural network with generalized activation functions, in Computer Vision–ECCV 2020 (ECCV), (eds. Vedaldi. A., Bischof. H., Brox. T., Frahm. J.-M.), Cham, Springer International Publishing, (2020), 143–159. https://doi.org/10.1007/978-3-030-58568-6_9 |
[60] | M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, in Computer Vision–ECCV 2016 (ECCV), (eds. Leibe. B., Matas. J., Sebe. N., Welling. M.), Cham, Springer International Publishing, (2016), 525–542. https://doi.org/10.1007/978-3-319-46493-0_32 |
[61] | S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, preprint, arXiv: 1510.00149. |
[62] | W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in Advances in Neural Information Processing Systems (NeurIPS), Curran Associates, (2016). https://doi.org/10.48550/arXiv.1608.03665 |
[63] | X. Ma, F. M. Guo, W. Niu, X. Lin, J. Tang, K. Ma, et al., PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices, in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), (2020), 5117–5124. https://doi.org/10.1609/aaai.v34i04.5954 |
[64] | B. Li, Z. Kong, T. Zhang, J. Li, Z. Li, H. Liu, et al., Efficient transformer-based large scale language representations using hardware-friendly block structured pruning, preprint, arXiv: 2009.08065. |
[65] | S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, et al., Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity, in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), (2019), 63–72. https://doi.org/10.1145/3289602.3293898 |
[66] | H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, et al., Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning, in 2021 22nd International Symposium on Quality Electronic Design (ISQED), (2021), 142–148. https://doi.org/10.1109/ISQED51717.2021.9424344 |
[67] | C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, et al., Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs, in Proceedings of the 2018 on Great Lakes Symposium on VLSI (GLSVLSI), Chicago, IL, USA, Association for Computing Machinery, (2018), 353–358. https://doi.org/10.1145/3194554.3194625 |
[68] | S. Narang, E. Undersander, G. Diamos, Block-sparse recurrent neural networks, preprint, arXiv: 1711.02782. |
[69] | P. Qi, E. H. M. Sha, Q. Zhuge, H. Peng, S. Huang, Z. Kong, et al., Accelerating framework of transformer by hardware design and model compression co-optimization, in 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), (2021), 1–9. https://doi.org/10.1109/ICCAD51958.2021.9643586 |
[70] | P. Qi, Y. Song, H. Peng, S. Huang, Q. Zhuge, E. H. M. Sha, Accommodating transformer onto FPGA: Coupling the balanced model compression and FPGA-implementation optimization, in Proceedings of the 2021 on Great Lakes Symposium on VLSI (GLSVLSI), Virtual Event, USA, Association for Computing Machinery, (2021), 163–168. https://doi.org/10.1145/3453688.3461739 |
[71] | D. So, Q. Le, C. Liang, The evolved transformer, in Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR, (2019), 5877–5886. https://doi.org/10.48550/arXiv.1901.11117 |
[72] | H. Wang, Efficient algorithms and hardware for natural language processing, Graduate Theses, Retrieved from the Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127440. |
[73] | H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, et al., Bit fusion: Bit-Level dynamically composable architecture for accelerating deep neural network, in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), (2018), 764–775. https://doi.org/10.1109/ISCA.2018.00069 |
[74] | R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, et al., Templates for the solution of linear systems: Building blocks for iterative methods, in Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics, (1994), 39–55. https://doi.org/10.1137/1.9781611971538 |
[75] | W. Liu, B. Vinter, CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication, in Proceedings of the 29th ACM on International Conference on Supercomputing (ICS), Newport Beach, California, USA, Association for Computing Machinery, (2015), 339–350. https://doi.org/10.1145/2751205.2751209 |
[76] | R. Kannan, Efficient sparse matrix multiple-vector multiplication using a bitmapped format, in 20th Annual International Conference on High Performance Computing (HiPC), (2013), 286–294. https://doi.org/10.1109/HiPC.2013.6799135 |
[77] | W. Jiang, X. Zhang, E. H. M. Sha, L. Yang, Q. Zhuge, Y. Shi, et al., Accuracy vs. efficiency: achieving both through FPGA-implementation aware neural architecture search, in Proceedings of the 56th Annual Design Automation Conference 2019 (DAC), Las Vegas NV USA, ACM, (2019), 1–6. https://doi.org/10.1145/3316781.3317757 |
[78] | W. Jiang, E. H. M. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, et al., Achieving super-linear speedup across multi-FPGA for real-time DNN inference, preprint, arXiv: 1907.08985. |
[79] | W. Jiang, X. Zhang, E. H. M. Sha, Q. Zhuge, L. Yang, Y. Shi, et al., XFER: A novel design to achieve super-linear performance on multiple FPGAs for real-time AI, in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Seaside, CA, USA, Association for Computing Machinery, (2019), 305. https://doi.org/10.1145/3289602.3293988 |