
Hardware-friendly compression and hardware acceleration for transformer: A survey


  • Received: 18 June 2022 Accepted: 27 July 2022 Published: 16 August 2022
  • The transformer model has recently become a milestone in artificial intelligence, raising the performance of tasks such as machine translation and computer vision to a previously unattainable level. This strong performance, however, comes with a large memory overhead and an enormous demand for computing power, which significantly hinders the deployment of energy-efficient transformer systems. Owing to their high parallelism, low latency, and low power consumption, field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) achieve higher energy efficiency than graphics processing units (GPUs) and central processing units (CPUs), and are therefore widely used to accelerate deep learning algorithms. Several papers have addressed deploying the transformer on dedicated hardware for acceleration, but comprehensive studies of this area are lacking. We therefore summarize hardware-oriented transformer compression algorithms and their hardware-accelerator implementations to provide a comprehensive overview of this research domain. This paper first introduces the transformer model framework and its computation process. Second, it discusses hardware-friendly compression algorithms based on self-attention and the transformer, along with a review of state-of-the-art hardware accelerator frameworks. Finally, we consider some promising topics in transformer hardware acceleration, such as high-level design frameworks and selecting the optimal device using reinforcement learning.

    Citation: Shizhen Huang, Enhao Tang, Shun Li, Xiangzhan Ping, Ruiqi Chen. Hardware-friendly compression and hardware acceleration for transformer: A survey[J]. Electronic Research Archive, 2022, 30(10): 3755-3785. doi: 10.3934/era.2022192




    In this paper, we consider the following diffusion equation on $\Omega\subset\mathbb{R}^2$:

    $$-\nabla\cdot(\alpha\nabla u)=f \quad\text{in } \Omega,\qquad u=0 \quad\text{on } \partial\Omega. \tag{1}$$

    To approximate (1), taking advantage of adaptive mesh refinement (AMR) to save valuable computational resources, the adaptive finite element method on quadtree meshes is among the most popular approaches in the engineering and scientific computing community [20]. Compared with simplicial meshes, quadtree meshes provide preferable performance in terms of accuracy and robustness. There are many mature software packages (e.g., [1,2]) on quadtree meshes. To guide the AMR, one possible way is through a posteriori error estimation: computable quantities are constructed to indicate where the mesh needs to be refined or coarsened, thus balancing the spatial distribution of the error and improving the accuracy per unit of computing power. Residual-based and recovery-based error estimators are among the most popular choices. In terms of accuracy, the recovery-based error estimator shows more appealing attributes [28,3].

    More recently, newer developments on flux recovery have been studied by many researchers, constructing a post-processed flux in a structure-preserving approximation space. Using (1) as an example, given that the data $f\in L^2(\Omega)$, the flux $-\alpha\nabla u$ is in $H(\mathrm{div};\Omega):=\{\boldsymbol{v}\in \boldsymbol{L}^2(\Omega):\ \nabla\cdot\boldsymbol{v}\in L^2(\Omega)\}$, which carries a weaker continuity constraint than the vertex-patch based recoveries in [28,3], where the recovered flux is $H^1(\Omega)$-conforming. The $H(\mathrm{div})$-flux recovery shows more robustness than vertex-patch based ones (e.g., [11,10]).

    However, these $H(\mathrm{div})$-flux recovery techniques work mainly on conforming meshes. For nonconforming discretizations on nonmatching grids, some simple treatments of hanging nodes exist by recovering the flux on a conforming mother mesh [22]. To the best of our knowledge, there is no literature on local $H(\mathrm{div})$-flux recovery on multilevel irregular quadtree meshes. One major difficulty is that it is impossible to recover a robust computable polynomial flux satisfying the $H(\mathrm{div})$-continuity constraint, that is, a flux continuous in the normal direction on edges with hanging nodes.

    More recently, a new class of methods called virtual element methods (VEM) was introduced in [4,8], which can be viewed as a polytopal generalization of tensorial/simplicial finite elements. Since then, many applications of VEM have been studied. A usual VEM workflow splits the method into a consistency (approximation) part and a stability part, and splits the finite dimensional approximation space accordingly. It allows flexible constructions of spaces that preserve structures of the continuous problems, such as higher-order continuity, exact divergence-free spaces, and many others. VEM functions are represented merely by their degree-of-freedom (DoF) functionals, not by pointwise values. In computation, if an optimal-order discontinuous approximation can be computed elementwise, then adding an appropriate parameter-free stabilization suffices to guarantee convergence under common assumptions on the geometry of the mesh.

    The adoption of polytopal elements brings many distinctive advantages; for example, treating rectangular elements with hanging nodes as polygons allows a simple construction of an $H(\mathrm{div})$-conforming finite dimensional approximation space on meshes with multilevel irregularities. We shall follow this approach to perform the flux recovery for a conforming $Q_k$ discretization of problem (1). Recently, quadtree meshes with an arbitrary level of irregularity have been studied in [21,26,15]. Analyses of the residual-based error estimator on 1-irregular (balanced) quadtree meshes can be found, e.g., in [14]. In the virtual element context, Zienkiewicz-Zhu (ZZ)-type recovery techniques are studied for linear elasticity in [18] and for diffusion problems in [24]. In [18,24], the recovered flux is in $H^1$ and associated with nodal DoFs, and thus cannot yield a robust estimate when the diffusion coefficient has a sharp contrast [11,10]. The first equilibrated flux recovery in $H(\mathrm{div})$ for virtual element methods is studied in [19]. While [19] recovers a flux by solving a mixed problem globally, we opt for a cheap and simple weighted averaging locally.

    The major ingredient in our study is an $H(\mathrm{div})$-conforming virtual element space modified from the ones used in [8,5] (Section 2.2). Afterwards, an $H(\mathrm{div})$-conforming flux is recovered by a robust weighted averaging of the numerical flux, in which some unique properties of the tensor-product type element $Q_k$ are exploited (Section 3). The a posteriori error estimator is constructed based on the projected flux elementwise. The efficiency of the local error indicator is then proved by bounding it above by the residual-based error indicator (Section 4.1). The reliability of the recovery-based error estimator is then shown under certain assumptions (Section 4.2). These estimates are verified numerically by some common AMR benchmark problems implemented in the publicly available finite element software library iFEM [16] (Section 5).

    If $\Omega$ is not a rectangle, $u$ is extended by $0$ to a rectangular $\tilde{\Omega}$; therefore, without loss of generality, we assume $\Omega$ is partitioned into a shape-regular $\mathcal{T}=\{K\}$ with rectangular elements, and $\alpha:=\alpha_K$ is assumed to be a piecewise positive constant with respect to $\mathcal{T}$. The weak form of problem (1) is then discretized in a tensor-product finite element space as follows:

    $$(\alpha\nabla u_{\mathcal{T}},\nabla v_{\mathcal{T}})=(f,v_{\mathcal{T}}),\quad \forall v_{\mathcal{T}}\in Q_k(\mathcal{T})\cap H^1_0(\Omega), \tag{2}$$

    in which standard notation is adopted: $(\cdot,\cdot)_D$ denotes the inner product on $L^2(D)$, $\|\cdot\|_D:=(\cdot,\cdot)_D^{1/2}$, and the subscript is omitted when $D=\Omega$. The discretization space is

    $$Q_k(\mathcal{T}):=\{v\in H^1(\Omega):\ v|_K\in Q_k(K),\ \forall K\in\mathcal{T}\},$$

    and on $K=[a,b]\times[c,d]$,

    $$Q_k(K):=P_{k,k}(K)=\{p(x)q(y):\ p\in P_k([a,b]),\ q\in P_k([c,d])\},$$

    where $P_k(D)$ stands for the space of polynomials of degree at most $k$ on $D$. Henceforth, we shall simply denote $Q_k(\mathcal{T})=:Q_k$ when no ambiguity arises.

    On $K$, the set of its 4 vertices and the set of the 4 edges of the same generation as $K$ are denoted by $\mathcal{N}_K$ and $\mathcal{E}_K$, respectively. The sets of nodes and edges in $\mathcal{T}$ are denoted by $\mathcal{N}:=\cup_{K\in\mathcal{T}}\mathcal{N}_K$ and $\mathcal{E}:=\cup_{K\in\mathcal{T}}\mathcal{E}_K$. A node $z\in\mathcal{N}$ is called a hanging node if it lies on $\partial K$ but is not a vertex of $K\in\mathcal{T}$, and we denote the set of hanging nodes by $\mathcal{N}_H$:

    $$\mathcal{N}_H:=\{z\in\mathcal{N}:\ \exists K\in\mathcal{T},\ z\in\partial K\setminus\mathcal{N}_K\}. \tag{3}$$

    Otherwise the node $z\in\mathcal{N}$ is a regular node. If every edge $e\in\mathcal{E}$ contains at most $l$ hanging nodes, the partition $\mathcal{T}$, as well as the elements these hanging nodes lie on, is called $l$-irregular.

    For each edge $e\in\mathcal{E}$, a unit normal vector $\boldsymbol{n}_e$ is fixed by specifying its direction: pointing rightward for vertical edges and upward for horizontal edges. If the exterior normal of an element on this edge shares the same orientation as $\boldsymbol{n}_e$, this element is denoted by $K^-$; otherwise it is denoted by $K^+$, i.e., $\boldsymbol{n}_e$ points from $K^-$ to $K^+$. The intersection of the closures of $K^+$ and $K^-$ is always an edge $e\in\mathcal{E}$. However, we note that by the definition in (3) it is possible that $e\in\mathcal{E}_{K^+}$ but not in $\mathcal{E}_{K^-}$, or vice versa, if there exists a hanging node on $e$ (see e.g., Figure 1). For any function or distribution $v$ well-defined on the two elements, define $[\![v]\!]_e:=v^--v^+$ on an edge $e\not\subset\partial\Omega$, in which $v^-$ and $v^+$ are defined in the limiting sense $v^{\pm}=\lim_{\epsilon\to 0^{\pm}}v(\boldsymbol{x}+\epsilon\boldsymbol{n}_e)$ for $\boldsymbol{x}\in e$. If $e$ is a boundary edge, the function $v$ is extended by zero outside the domain to compute $[\![v]\!]_e$. Furthermore, the following notation denotes a weighted average of $v$ on edge $e$ for a weight $\gamma\in[0,1]$:

    $$\{v\}^{\gamma}_e:=\gamma v^-+(1-\gamma)v^+.$$
    Figure 1.  For the upper right element $K\in\mathcal{T}$, $\mathcal{N}_K=\{z_2,z_4,z_5,z_6\}$. For $K\in\mathcal{T}_{\mathrm{poly}}$, $\mathcal{N}_K=\{z_i\}_{i=1}^{7}$.

    In this subsection, the quadtree mesh $\mathcal{T}$ of interest is embedded into a polygonal mesh $\mathcal{T}\hookrightarrow\mathcal{T}_{\mathrm{poly}}=\{K_{\mathrm{poly}}\}$. On a given quadrilateral element $K$, consider for example $v_{\mathcal{T}}\in Q_1(K)$: it has 4 degrees of freedom associated with the 4 nodes $\{z\}$. Its numerical flux $-\alpha\nabla v_{\mathcal{T}}\cdot\boldsymbol{n}$ is well-defined on the 4 edges $\{e\}$ locally on $K$, such that on each edge it is a polynomial defined on the whole edge, regardless of the number of hanging nodes on that edge. Using Figure 1 as an example, on the upper right element $K$, $\nabla v_{\mathcal{T}}|_K\cdot\boldsymbol{n}|_{\overline{z_2z_6}}\in P_1(\overline{z_2z_6})$ is a linear function in the $y$-variable.

    For the embedded element $K_{\mathrm{poly}}\in\mathcal{T}_{\mathrm{poly}}$, which geometrically coincides with $K$, it includes all the hanging nodes, while the set of edges is formed accordingly as the edges of the cyclic graph of the vertices. We shall denote the set of all edges on $\mathcal{T}_{\mathrm{poly}}$ by $\mathcal{E}_{\mathrm{poly}}$. Using Figure 1 as an example, it is possible to define a flux on $K$ with a piecewise linear normal component on $\overline{z_2z_6}$, which now consists of three edges on $K_{\mathrm{poly}}$.

    Subsequently, $K_{\mathrm{poly}}\in\mathcal{T}_{\mathrm{poly}}$ shall be denoted simply by $K\in\mathcal{T}_{\mathrm{poly}}$ in the context of flux recovery, and the notation $e\subset\partial K$ denotes an edge on the boundary of $K$, which takes into account the edges having one or both end points as hanging nodes.

    On $\mathcal{T}_{\mathrm{poly}}$, we consider the following Brezzi-Douglas-Marini-type virtual element modification, inspired by the ones used in [8,5]. The local space on a $K\in\mathcal{T}_{\mathrm{poly}}$ is defined, for $k\ge 1$, as

    $$\mathcal{V}_k(K):=\{\boldsymbol{\tau}\in H(\mathrm{div};K)\cap H(\mathrm{rot};K):\ \nabla\cdot\boldsymbol{\tau}\in P_{k-1}(K),\ \nabla\times\boldsymbol{\tau}=0,\ \boldsymbol{\tau}\cdot\boldsymbol{n}_e\in P_k(e),\ \forall e\subset\partial K\}. \tag{4}$$

    An H(div)-conforming global space for recovering the flux is then

    $$\mathcal{V}_k:=\{\boldsymbol{\tau}\in H(\mathrm{div};\Omega):\ \boldsymbol{\tau}|_K\in\mathcal{V}_k(K)\ \text{ on each } K\in\mathcal{T}_{\mathrm{poly}}\}. \tag{5}$$

    Next we turn to defining the degrees of freedom (DoFs) of this space. To this end, we define the set of scaled monomials $P_k(e)$ on an edge $e$: $e$ is parametrized by $[0,h_e]\ni s\mapsto \boldsymbol{a}+s\boldsymbol{t}_e$, where $\boldsymbol{a}$ is the starting point of $e$ and $\boldsymbol{t}_e$ is the unit tangential vector of $e$. The basis set for $P_k(e)$ is chosen as:

    $$P_k(e):=\mathrm{span}\Bigl\{1,\ \frac{s-m_e}{h_e},\ \Bigl(\frac{s-m_e}{h_e}\Bigr)^2,\ \dots,\ \Bigl(\frac{s-m_e}{h_e}\Bigr)^k\Bigr\}, \tag{6}$$

    where $m_e=h_e/2$ represents the midpoint under this parametrization. Similarly to the edge case, the basis set of $P_k(K)$ is chosen as follows (see e.g., [4]):

    $$P_k(K):=\mathrm{span}\Bigl\{m_{\boldsymbol{\alpha}}(\boldsymbol{x}):=\Bigl(\frac{\boldsymbol{x}-\boldsymbol{x}_K}{h_K}\Bigr)^{\boldsymbol{\alpha}},\ |\boldsymbol{\alpha}|\le k\Bigr\}. \tag{7}$$
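    To make the scaled bases (6) and (7) concrete, here is a minimal Python sketch (our own illustration, not code from the paper or from iFEM; the function names and the use of NumPy are assumptions) that evaluates both bases on a rectangle and an edge.

```python
import numpy as np

def scaled_monomials_2d(x, y, xK, yK, hK, k):
    """Evaluate the scaled monomial basis (7) of P_k(K) at points (x, y).

    Basis functions: m_alpha = ((x - x_K)/h_K)^a * ((y - y_K)/h_K)^b with a + b <= k.
    Returns an array of shape (..., n_basis)."""
    xs = (np.asarray(x, dtype=float) - xK) / hK
    ys = (np.asarray(y, dtype=float) - yK) / hK
    return np.stack([xs**a * ys**b for a in range(k + 1) for b in range(k + 1 - a)], axis=-1)

def scaled_monomials_edge(s, he, k):
    """Evaluate the scaled edge basis (6) of P_k(e) at arc-length parameters s in [0, h_e]."""
    t = (np.asarray(s, dtype=float) - he / 2.0) / he
    return np.stack([t**j for j in range(k + 1)], axis=-1)

# Example: basis of P_2(K) on K = [0, 1]^2 evaluated at the centroid (all non-constant terms vanish).
print(scaled_monomials_2d(0.5, 0.5, 0.5, 0.5, 1.0, 2))
```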

    The degrees of freedom (DoFs) are then set as follows for a $\boldsymbol{\tau}\in\mathcal{V}_k$:

    $$\begin{aligned}
    &(e)\quad k\ge 1: && \int_e (\boldsymbol{\tau}\cdot\boldsymbol{n}_e)\,m\,\mathrm{d}s,\quad \forall m\in P_k(e),\ \text{on each } e\in\mathcal{E}_{\mathrm{poly}};\\
    &(i)\quad k\ge 2: && \int_K \boldsymbol{\tau}\cdot\nabla m\,\mathrm{d}\boldsymbol{x},\quad \forall m\in P_{k-1}(K)/\mathbb{R},\ \text{on each } K\in\mathcal{T}_{\mathrm{poly}}.
    \end{aligned} \tag{8}$$

    Remark 1. We note that in our construction, the degrees of freedom determining the curl of a VEM function, originally in [8], are replaced by a curl-free constraint, thanks to the flexibility of the virtual element framework. The reason we opt for this subspace is that the true flux $-\alpha\nabla u$ is locally curl-free, since we have assumed that $\alpha$ is a piecewise constant. The unisolvency of the set of DoFs (8) including the curl part can be found in [8]. For the modified space (4), a simplified argument is given in the proof of Lemma 7.3.

    As the data $f\in L^2(\Omega)$, the true flux $\boldsymbol{\sigma}=-\alpha\nabla u\in H(\mathrm{div};\Omega)$. Consequently, we shall seek a postprocessed flux $\boldsymbol{\sigma}_{\mathcal{T}}\in\mathcal{V}_k\subset H(\mathrm{div};\Omega)$ by specifying the DoFs in (8). Throughout this section, whenever considering an element $K\in\mathcal{T}$, we treat it as a polygon $K\in\mathcal{T}_{\mathrm{poly}}$.

    Consider $-\alpha_K\nabla u_{\mathcal{T}}$, the numerical flux on $K$. We note that $-\alpha_K\nabla u_{\mathcal{T}}|_K\in P_{k-1,k}(K)\times P_{k,k-1}(K)$. The normal flux on each edge $e\in\mathcal{E}_{\mathrm{poly}}$ is in $P_k(e)$, since $\boldsymbol{n}_e=(\pm 1,0)$ and $x=\mathrm{const}$ on vertical edges, while $\boldsymbol{n}_e=(0,\pm 1)$ and $y=\mathrm{const}$ on horizontal edges. Therefore, the edge-based DoFs can be computed by a simple averaging, thanks to the matching polynomial degrees of the numerical flux and the functions in $\mathcal{V}_k$.

    On each $e=\partial K^+\cap\partial K^-$, define

    $$\{-\alpha\nabla u_{\mathcal{T}}\}^{\gamma_e}_e\cdot\boldsymbol{n}_e:=\bigl(\gamma_e(-\alpha_{K^-}\nabla u_{\mathcal{T}}|_{K^-})+(1-\gamma_e)(-\alpha_{K^+}\nabla u_{\mathcal{T}}|_{K^+})\bigr)\cdot\boldsymbol{n}_e, \tag{9}$$

    where

    $$\gamma_e:=\frac{\alpha_{K^+}^{1/2}}{\alpha_{K^+}^{1/2}+\alpha_{K^-}^{1/2}}. \tag{10}$$

    First, for both the $k=1$ and $k\ge 2$ cases, the normal component of the recovered flux is set as

    $$\boldsymbol{\sigma}_{\mathcal{T}}\cdot\boldsymbol{n}_e=\{-\alpha\nabla u_{\mathcal{T}}\}^{\gamma_e}_e\cdot\boldsymbol{n}_e. \tag{11}$$
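    As a concrete illustration of the weighted averaging (9)-(11) (our own sketch with hypothetical inputs, not the paper's code), the following Python function combines the two one-sided normal fluxes on an interior edge using the weight (10).

```python
import numpy as np

def averaged_normal_flux(alpha_minus, alpha_plus, flux_n_minus, flux_n_plus):
    """Weighted average (9) of the one-sided numerical normal fluxes on an edge e.

    flux_n_minus/plus: samples of -alpha*grad(u_T).n_e taken from K^- and K^+ (e.g., at edge
    quadrature points). By (10), gamma_e weights the K^- trace using the square root of the
    coefficient on the opposite element K^+."""
    gamma_e = np.sqrt(alpha_plus) / (np.sqrt(alpha_plus) + np.sqrt(alpha_minus))
    return gamma_e * np.asarray(flux_n_minus) + (1.0 - gamma_e) * np.asarray(flux_n_plus)

# Example: a coefficient jump alpha_- = 1, alpha_+ = 100 biases the average toward the K^- trace.
print(averaged_normal_flux(1.0, 100.0, np.array([2.0, 2.1]), np.array([1.0, 0.9])))
```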

    In the lowest order case $k=1$, $\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}$ is a constant on $K$ by (4); thus the construction (11) alone, which constitutes the edge DoFs $(e)$ in (8), determines the divergence $\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}$ in $K$ as follows:

    $$|K|\,\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}=\int_K\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}\,\mathrm{d}\boldsymbol{x}=\int_{\partial K}\boldsymbol{\sigma}_{\mathcal{T}}\cdot\boldsymbol{n}_K\,\mathrm{d}s=\sum_{e\subset\partial K}\int_e\boldsymbol{\sigma}_{\mathcal{T}}\cdot\boldsymbol{n}_K|_e\,\mathrm{d}s. \tag{12}$$
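    A minimal sketch of (12) (ours, with made-up data; not the paper's implementation): for $k=1$ the constant divergence on $K$ follows from the edge integrals of the recovered normal flux, with signs accounting for the orientation of each global edge normal relative to the outward normal of $K$.

```python
import numpy as np

def constant_divergence_from_edges(area_K, edge_flux_integrals, signs):
    """Recover the constant div(sigma_T) on K via (12).

    edge_flux_integrals[i] = integral over e_i of sigma_T . n_{e_i} ds (global orientation);
    signs[i] = +1 if n_K|_{e_i} coincides with n_{e_i}, -1 otherwise; area_K = |K|."""
    return float(np.dot(signs, edge_flux_integrals)) / area_K

# Example: a unit-area element with four boundary edges.
print(constant_divergence_from_edges(1.0,
                                     np.array([0.10, 0.05, -0.02, 0.02]),
                                     np.array([1, 1, -1, -1])))
```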

    If $k\ge 2$, after the normal component (11) is set, on each $K$ let $\Pi_{k-1}$ denote the $L^2$-projection onto $P_{k-1}(K)$, and we let

    $$\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}=\Pi_{k-1}f+c_K. \tag{13}$$

    The reason to add $c_K$ is that we have set the normal components of the recovered flux first, without relying on the divergence information, while in general $\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}\neq\Pi_{k-1}f$, as otherwise the divergence theorem would be violated in (12). As a result, an element-wise constant $c_K$ is added to ensure the compatibility of $\boldsymbol{\sigma}_{\mathcal{T}}$ locally on each $K$. It is straightforward to verify that $c_K$ has the following form, and later we shall show that $c_K$ affects neither the efficiency nor the reliability of the error estimates:

    $$c_K=\frac{1}{|K|}\Bigl(-\int_K\Pi_{k-1}f\,\mathrm{d}\boldsymbol{x}+\sum_{e\subset\partial K}\int_e\{-\alpha\nabla u_{\mathcal{T}}\}^{\gamma_e}_e\cdot\boldsymbol{n}_K|_e\,\mathrm{d}s\Bigr). \tag{14}$$

    Consequently, for $k\ge 2$, the set $(i)$ of DoFs can be set as: for all $q\in P_{k-1}(K)$,

    $$(\boldsymbol{\sigma}_{\mathcal{T}},\nabla q)_K=-(\Pi_{k-1}f+c_K,\,q)_K+\sum_{e\subset\partial K}\bigl(\{-\alpha\nabla u_{\mathcal{T}}\}^{\gamma_e}_e\cdot\boldsymbol{n}_K|_e,\,q\bigr)_e. \tag{15}$$

    Toward constructing a computable local error indicator, inspired by the VEM formulation [8], the recovered flux is projected onto a space with a much simpler structure. A local oblique projection $\Pi:\boldsymbol{L}^2(K)\to\nabla P_k(K)$, $\boldsymbol{\tau}\mapsto\Pi\boldsymbol{\tau}$, is defined as follows:

    $$(\Pi\boldsymbol{\tau},\nabla p)_K=(\boldsymbol{\tau},\nabla p)_K,\quad \forall p\in P_k(K)/\mathbb{R}. \tag{16}$$

    Next we show that this projection operator can be computed straightforwardly for vector fields in $\mathcal{V}_k(K)$.

    When $k=1$, we can compute the right-hand side of (16) as follows:

    $$(\boldsymbol{\tau},\nabla p)_K=-(\nabla\cdot\boldsymbol{\tau},p)_K+(\boldsymbol{\tau}\cdot\boldsymbol{n},p)_{\partial K}. \tag{17}$$

    By the definition of the space (4) when $k=1$, $\nabla\cdot\boldsymbol{\tau}$ is a constant on $K$ and can be determined by the edge DoFs $(e)$ in (8), similarly to (12). Moreover, $p|_e\in P_1(e)$; thus the boundary term can be evaluated using the DoFs $(e)$ in (8).

    When $k\ge 2$, the right-hand side of (16) can be evaluated following a procedure similar to (17): exploiting the fact that $\nabla\cdot\boldsymbol{\tau}\in P_{k-1}(K)$, we have

    $$(\boldsymbol{\tau},\nabla p)_K=-(\nabla\cdot\boldsymbol{\tau},\Pi_{k-1}p)_K+(\boldsymbol{\tau}\cdot\boldsymbol{n},p)_{\partial K}=(\boldsymbol{\tau},\nabla\Pi_{k-1}p)_K+(\boldsymbol{\tau}\cdot\boldsymbol{n},p-\Pi_{k-1}p)_{\partial K}, \tag{18}$$

    which can be evaluated using both DoF sets (e) and (i).

    Given the recovered flux $\boldsymbol{\sigma}_{\mathcal{T}}$ in Section 3, the recovery-based local error indicator $\eta_{\mathrm{flux},K}$ and the element residual $\eta_{\mathrm{res},K}$ are defined as follows:

    $$\eta_{\mathrm{flux},K}:=\|\alpha^{-1/2}(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha\nabla u_{\mathcal{T}})\|_K,\quad\text{and}\quad \eta_{\mathrm{res},K}:=\alpha_K^{-1/2}h_K\|f-\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}\|_K, \tag{19}$$

    then

    $$\eta_K=\begin{cases}\eta_{\mathrm{flux},K} & \text{when } k=1,\\[1mm] \bigl(\eta_{\mathrm{flux},K}^2+\eta_{\mathrm{res},K}^2\bigr)^{1/2} & \text{when } k\ge 2.\end{cases} \tag{20}$$

    A computable $\hat{\eta}_{\mathrm{flux},K}$ is defined as:

    $$\hat{\eta}_{\mathrm{flux},K}:=\|\alpha_K^{-1/2}\Pi(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\|_K, \tag{21}$$

    with the oblique projection $\Pi$ defined in (16). The stabilization part $\hat{\eta}_{\mathrm{stab},K}$ is

    $$\hat{\eta}_{\mathrm{stab},K}:=|\alpha_K^{-1/2}(I-\Pi)(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})|_{S,K}. \tag{22}$$

    Here $|\cdot|_{S,K}:=\bigl(S_K(\cdot,\cdot)\bigr)^{1/2}$ is the seminorm induced by the following stabilization:

    $$S_K(\boldsymbol{v},\boldsymbol{w}):=\sum_{e\subset\partial K}h_e(\boldsymbol{v}\cdot\boldsymbol{n}_e,\boldsymbol{w}\cdot\boldsymbol{n}_e)_e+\sum_{\boldsymbol{\alpha}\in\Lambda}(\boldsymbol{v},\nabla m_{\boldsymbol{\alpha}})_K(\boldsymbol{w},\nabla m_{\boldsymbol{\alpha}})_K, \tag{23}$$

    where $\Lambda$ is the index set for the monomial basis of $P_{k-1}(K)/\mathbb{R}$, with cardinality $k(k+1)/2-1$; i.e., the second term in (23) is dropped in the $k=1$ case. We note that this is a slightly modified version of the standard stabilization for an $H(\mathrm{div})$-function in [8], as we have replaced the edge DoFs by an integral. In Section 7.1 it is shown that the integral-based stabilization still yields the crucial norm equivalence result.

    The computable error estimator ˆη is then

    $$\hat{\eta}^2=\begin{cases}\displaystyle\sum_{K\in\mathcal{T}}\bigl(\hat{\eta}_{\mathrm{flux},K}^2+\hat{\eta}_{\mathrm{stab},K}^2\bigr)=:\sum_{K\in\mathcal{T}}\hat{\eta}_K^2 & \text{when } k=1,\\[2mm] \displaystyle\sum_{K\in\mathcal{T}}\bigl(\hat{\eta}_{\mathrm{flux},K}^2+\hat{\eta}_{\mathrm{stab},K}^2+\eta_{\mathrm{res},K}^2\bigr)=:\sum_{K\in\mathcal{T}}\hat{\eta}_K^2 & \text{when } k\ge 2.\end{cases} \tag{24}$$
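    In an implementation, the local quantities in (19), (21) and (22) are scalars per element and (20)/(24) are simple sums; the following sketch (ours, with hypothetical per-element arrays) assembles $\hat{\eta}_K$ and $\hat{\eta}$.

```python
import numpy as np

def assemble_estimator(eta_flux, eta_stab, eta_res=None):
    """Combine the local pieces (21), (22) and, for k >= 2, (19) into hat-eta_K and hat-eta,
    following (24). Pass eta_res=None for the lowest order case k = 1."""
    eta_K_sq = np.asarray(eta_flux, dtype=float) ** 2 + np.asarray(eta_stab, dtype=float) ** 2
    if eta_res is not None:                       # residual part enters only when k >= 2
        eta_K_sq = eta_K_sq + np.asarray(eta_res, dtype=float) ** 2
    return np.sqrt(eta_K_sq), float(np.sqrt(eta_K_sq.sum()))

eta_K, eta_total = assemble_estimator([0.10, 0.02], [0.03, 0.01], [0.05, 0.005])
print(eta_K, eta_total)
```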

    In this section, we shall prove the proposed recovery-based estimator ˆηK is efficient by bounding it above by the residual-based error estimator. In the process of adaptive mesh refinement, only the computable ˆηK is used as the local error indicator to guide a marking strategy of choice.

    Theorem 4.1. Let $u_{\mathcal{T}}$ be the solution to problem (2), and let $\hat{\eta}_{\mathrm{flux},K}$ be the error indicator in (24). On $K\in\mathcal{T}_{\mathrm{poly}}$, $\hat{\eta}_{\mathrm{flux},K}$ can be locally bounded by the residual-based quantities:

    $$\hat{\eta}_{\mathrm{flux},K}^2\lesssim \mathrm{osc}(f;K)^2+\eta_{\mathrm{elem},K}^2+\eta_{\mathrm{edge},K}^2, \tag{25}$$

    where

    $$\mathrm{osc}(f;K)=\alpha_K^{-1/2}h_K\|f-\Pi_{k-1}f\|_K,\quad \eta_{\mathrm{elem},K}:=\alpha_K^{-1/2}h_K\|f+\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\|_K,\quad\text{and}\quad \eta_{\mathrm{edge},K}:=\Bigl(\sum_{e\subset\partial K}\frac{h_e}{\alpha_K+\alpha_{K_e}}\bigl\|[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]\bigr\|_e^2\Bigr)^{1/2}.$$

    In the edge jump term, $K_e$ is the element on the opposite side of $K$ with respect to an edge $e\subset\partial K$. The constant depends on $k$ and the number of edges of $K$.

    Proof. Let $\alpha_K^{-1}\Pi(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})=:\nabla p$ on $K$; then $p\in P_k(K)/\mathbb{R}$ and we have

    $$\hat{\eta}_{\mathrm{flux},K}^2=(\Pi(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),\nabla p)_K=(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}},\nabla p)_K=-(\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),p)_K+\sum_{e\subset\partial K}\int_e(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\cdot\boldsymbol{n}_K|_e\,p\,\mathrm{d}s. \tag{26}$$

    By (11), without loss of generality we assume $K=K^-$ (the local orientation of $e$ agrees with the global one, i.e., $\boldsymbol{n}_K|_e=\boldsymbol{n}_e$) and $K_e=K^+$, the element opposite to $K$ with respect to $e$, so that $\gamma_e=\alpha_{K_e}^{1/2}/(\alpha_{K_e}^{1/2}+\alpha_K^{1/2})$; then on each edge $e\subset\partial K$ we have

    $$(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\cdot\boldsymbol{n}_e=\bigl((1-\gamma_e)\alpha_K\nabla u_{\mathcal{T}}|_K-(1-\gamma_e)\alpha_{K_e}\nabla u_{\mathcal{T}}|_{K_e}\bigr)\cdot\boldsymbol{n}_e=\frac{\alpha_K^{1/2}}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]_e. \tag{27}$$

    The boundary term in (26) can then be rewritten as

    $$\int_e(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\cdot\boldsymbol{n}_e\,p\,\mathrm{d}s=\int_e\frac{1}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]_e\,\alpha_K^{1/2}p\,\mathrm{d}s\lesssim\frac{h_e^{1/2}}{(\alpha_K+\alpha_{K_e})^{1/2}}\bigl\|[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]_e\bigr\|_e\,\alpha_K^{1/2}h_e^{-1/2}\|p\|_e. \tag{28}$$

    By a trace inequality on an edge of a polygon (Lemma 7.1) and the Poincaré inequality for $p\in P_k(K)/\mathbb{R}$, we have

    $$h_e^{-1/2}\|p\|_e\lesssim h_K^{-1}\|p\|_K+\|\nabla p\|_K\lesssim\|\nabla p\|_K.$$

    As a result,

    $$\sum_{e\subset\partial K}\int_e(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\cdot\boldsymbol{n}_e\,p\,\mathrm{d}s\lesssim\eta_{\mathrm{edge},K}\,\alpha_K^{1/2}\|\nabla p\|_K=\eta_{\mathrm{edge},K}\,\hat{\eta}_{\mathrm{flux},K}.$$

    For the bulk term on $K$ in (26), when $k=1$, by (12), the representation in (27), and the Poincaré inequality for $p\in P_k(K)/\mathbb{R}$ again with $h_K\eqsim|K|^{1/2}$, we have

    $$\begin{aligned}
    -(\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),p)_K&\le|\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})|\,|K|^{1/2}\|p\|_K=\frac{1}{|K|^{1/2}}\Bigl|\int_K\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\,\mathrm{d}\boldsymbol{x}\Bigr|\,\|p\|_K\\
    &=\frac{1}{|K|^{1/2}}\Bigl|\sum_{e\subset\partial K}\int_e(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\cdot\boldsymbol{n}_e\,\mathrm{d}s\Bigr|\,\|p\|_K\\
    &\lesssim\Bigl(\sum_{e\subset\partial K}\frac{\alpha_K^{1/2}}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,h_e^{1/2}\bigl\|[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]_e\bigr\|_e\Bigr)\,\|\nabla p\|_K\lesssim\eta_{\mathrm{edge},K}\,\hat{\eta}_{\mathrm{flux},K}.
    \end{aligned}$$

    When $k\ge 2$, by (13),

    $$-(\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),p)_K=-(\Pi_{k-1}f+c_K+\nabla\cdot(\alpha_K\nabla u_{\mathcal{T}}),p)_K\lesssim\bigl(\|f-\Pi_{k-1}f\|_K+\|f+\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\|_K+|c_K|\,|K|^{1/2}\bigr)\|p\|_K. \tag{29}$$

    The first two terms can be handled by combining the weights $\alpha_K^{-1/2}$ and $h_K$ coming from $\|p\|_K\lesssim h_K\|\nabla p\|_K$. For $c_K$, it can be estimated straightforwardly as follows:

    $$\begin{aligned}
    c_K|K|^{1/2}&=\frac{1}{|K|^{1/2}}\Bigl(\int_K(f-\Pi_{k-1}f)\,\mathrm{d}\boldsymbol{x}-\int_K\bigl(f+\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\bigr)\,\mathrm{d}\boldsymbol{x}+\int_K\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\,\mathrm{d}\boldsymbol{x}+\sum_{e\subset\partial K}\int_e\{-\alpha\nabla u_{\mathcal{T}}\}^{\gamma_e}_e\cdot\boldsymbol{n}_e\,\mathrm{d}s\Bigr)\\
    &\lesssim\|f-\Pi_{k-1}f\|_K+\|f+\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\|_K+\frac{1}{|K|^{1/2}}\sum_{e\subset\partial K}\int_e\bigl(\alpha_K\nabla u_{\mathcal{T}}-\{\alpha\nabla u_{\mathcal{T}}\}^{\gamma_e}_e\bigr)\cdot\boldsymbol{n}_e\,\mathrm{d}s\\
    &\lesssim\|f-\Pi_{k-1}f\|_K+\|f+\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\|_K+\sum_{e\subset\partial K}\frac{\alpha_K^{1/2}}{\alpha_K^{1/2}+\alpha_{K_e}^{1/2}}\,h_e^{-1/2}\bigl\|[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]_e\bigr\|_e.
    \end{aligned} \tag{30}$$

    The two terms on $K$ can be treated in the same way as the first two terms in (29), while the edge terms are handled similarly to the $k=1$ case. As a result, we have shown

    $$-(\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),p)_K\lesssim\bigl(\mathrm{osc}(f;K)+\eta_{\mathrm{elem},K}+\eta_{\mathrm{edge},K}\bigr)\,\alpha_K^{1/2}\|\nabla p\|_K,$$

    and the theorem follows.

    Theorem 4.2. Under the same setting as Theorem 4.1, let $\hat{\eta}_{\mathrm{stab},K}$ be the estimator in (22); we have

    $$\hat{\eta}_{\mathrm{stab},K}^2\lesssim \mathrm{osc}(f;K)^2+\eta_{\mathrm{elem},K}^2+\eta_{\mathrm{edge},K}^2. \tag{31}$$

    The constant depends on k and the number of edges on K.

    Proof. This theorem follows directly from the norm equivalence Lemma 7.3:

    $$|\alpha_K^{-1/2}(I-\Pi)(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})|_{S,K}\lesssim|\alpha_K^{-1/2}(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})|_{S,K},$$

    while evaluating the DoFs (e) and (i) using (11) and (15) reverts us back to the proof of Theorem 4.1.

    Theorem 4.3. Under the same setting as Theorem 4.1, on any $K\in\mathcal{T}_{\mathrm{poly}}$, with $\omega_K$ defined as the collection of elements in $\mathcal{T}$ that share at least one vertex with $K$,

    $$\hat{\eta}_K\lesssim \mathrm{osc}(f;K)+\|\alpha^{1/2}\nabla(u-u_{\mathcal{T}})\|_{\omega_K}, \tag{32}$$

    with a constant independent of $\alpha$, but dependent on $k$ and the maximum number of edges of the elements in $\mathcal{T}_{\mathrm{poly}}$.

    Proof. This is a direct consequence of Theorem 4.1 and 4.2 and the fact that the residual-based error indicator is efficient by a common bubble function argument.

    In this section, we shall prove that the computable error estimator $\hat{\eta}$ is reliable under two common assumptions in the a posteriori error estimation literature. For the convenience of the reader, we rephrase them here using a "layman" description; for more detailed and technical definitions, please refer to the literature cited.

    Assumption 1 ($\mathcal{T}$ is $l$-irregular [14]). Any given $\mathcal{T}$ is refined from a mesh with no hanging nodes by quadsecting red-refinement. For any two neighboring elements in $\mathcal{T}$, the difference in their refinement levels is at most $l$ for a uniformly bounded constant $l$; i.e., any edge $e\in\mathcal{E}$ has at most $l$ hanging nodes.

    By Assumption 1, we denote the father 1-irregular mesh of $\mathcal{T}$ by $\mathcal{T}_1$. On $\mathcal{T}_1$, a subset of all nodes is denoted by $\mathcal{N}_1$, which includes the regular nodes $\mathcal{N}_R$ on $\mathcal{T}_1$, as well as $\mathcal{N}_E$, the set of end points of edges having a hanging node as midpoint. By [14, Theorem 2.1], there exists a set of bilinear nodal basis functions $\{\phi_z\}$ associated with $z\in\mathcal{N}_1$, such that $\{\phi_z\}$ form a partition of unity and can be used to construct a Clément-type quasi-interpolation. Furthermore, the following assumption ensures that the Clément-type quasi-interpolant is robust with respect to the coefficient distribution on a vertex patch when taking nodal DoFs as a weighted average.

    Assumption 2 (Quasi-monotonicity of $\alpha$ [6]). On $\mathcal{T}$, let $\phi_z$ be the bilinear nodal basis associated with $z\in\mathcal{N}_1$, with $\omega_z:=\mathrm{supp}\,\phi_z$. For every element $K\subset\omega_z$, $K\in\mathcal{T}$, there exists a simply connected element path leading to $\omega_{m(z)}$, which is a Lipschitz domain containing the elements where the piecewise constant coefficient $\alpha$ achieves its maximum (or minimum) on $\omega_z$.

    Denote

    $$\pi_z v=\begin{cases}\dfrac{\int_{\omega_z\cap\omega_{m(z)}}v\,\phi_z}{\int_{\omega_z\cap\omega_{m(z)}}\phi_z} & \text{if } z\notin\partial\Omega,\\[2mm] 0 & \text{if } z\in\partial\Omega.\end{cases} \tag{33}$$

    We note that if $\alpha$ is a constant on $\omega_z$, then $(1,(v-\pi_z v)\phi_z)_{\omega_z}=0$. A quasi-interpolation $\mathcal{I}:L^2(\Omega)\to Q_1(\mathcal{T}_1)$ can be defined as

    $$\mathcal{I}v:=\sum_{z\in\mathcal{N}_1}(\pi_z v)\,\phi_z. \tag{34}$$

    Lemma 4.4 (Estimates for $\pi_z$ and $\mathcal{I}$). Under Assumptions 1 and 2, the following estimates hold for any $v\in H^1(\omega_K)$:

    $$\alpha_K^{1/2}h_K^{-1}\|v-\mathcal{I}v\|_K+\alpha_K^{1/2}\|\nabla\mathcal{I}v\|_K\lesssim\|\alpha^{1/2}\nabla v\|_{\omega_K}, \tag{35}$$

    and for $z\in\mathcal{N}_1$,

    $$\sum_{K\subset\omega_z}h_z^{-2}\|\alpha^{1/2}(v-\pi_z v)\phi_z\|_K^2\lesssim\|\alpha^{1/2}\nabla v\|_{\omega_z}^2, \tag{36}$$

    in which $h_z:=\max_{K\subset\omega_z}h_K$, and here $\omega_K$ denotes the union of elements in $\mathcal{T}_1$ sharing at least a node (hanging or regular) with $K$.

    Proof. The estimate for $\pi_z$ follows from [6, Lemma 2.8]. For $\mathcal{I}$, its error estimate and stability rely only on the partition of unity property of the nodal basis set $\{\phi_z\}$ (see e.g., [27]); therefore the proof follows the same argument as the one used on triangulations in [6, Lemma 2.8].

    Denote the subset of nodes $\{z\}\subset\mathcal{N}_1$ (i) on the boundary by $\mathcal{N}_{\partial\Omega}$ and (ii) whose patch $\omega_z$ carries a discontinuous coefficient $\alpha$ by $\mathcal{N}_I$. For the lowest order case, we need the following oscillation term for $f$:

    $$\mathrm{osc}(f;\mathcal{T})^2:=\sum_{z\in\mathcal{N}_1\cap(\mathcal{N}_{\partial\Omega}\cup\mathcal{N}_I)}h_z^2\|\alpha^{-1/2}f\|_{\omega_z}^2+\sum_{z\in\mathcal{N}_1\setminus(\mathcal{N}_{\partial\Omega}\cup\mathcal{N}_I)}h_z^2\|\alpha^{-1/2}(f-f_z)\|_{\omega_z}^2, \tag{37}$$

    with $f_z:=\int_{\omega_z}f\,\phi_z\,\big/\int_{\omega_z}\phi_z$.

    Theorem 4.5. Let $u_{\mathcal{T}}$ be the solution to problem (2), and let $\hat{\eta}$ be the computable error estimator in (24). Under Assumptions 1 and 2, we have for $k=1$

    $$\|\alpha^{1/2}\nabla(u-u_{\mathcal{T}})\|\lesssim\bigl(\hat{\eta}^2+\mathrm{osc}(f;\mathcal{T})^2\bigr)^{1/2}. \tag{38}$$

    For $k\ge 2$,

    $$\|\alpha^{1/2}\nabla(u-u_{\mathcal{T}})\|\lesssim\hat{\eta}, \tag{39}$$

    where the constant depends on l and k.

    Proof. Let $\varepsilon:=u-u_{\mathcal{T}}\in H^1_0(\Omega)$, and let $\mathcal{I}\varepsilon\in Q_1(\mathcal{T}_1)\subset Q_1(\mathcal{T})$ be the quasi-interpolant (34) of $\varepsilon$. Then by the Galerkin orthogonality, the fact that $\alpha\nabla u+\boldsymbol{\sigma}_{\mathcal{T}}\in H(\mathrm{div};\Omega)$, the Cauchy-Schwarz inequality, and the interpolation estimates (35), we have for $k\ge 2$:

    $$\begin{aligned}
    \|\alpha^{1/2}\nabla\varepsilon\|^2&=(\alpha\nabla(u-u_{\mathcal{T}}),\nabla(\varepsilon-\mathcal{I}\varepsilon))=(\alpha\nabla u+\boldsymbol{\sigma}_{\mathcal{T}},\nabla(\varepsilon-\mathcal{I}\varepsilon))-(\alpha\nabla u_{\mathcal{T}}+\boldsymbol{\sigma}_{\mathcal{T}},\nabla(\varepsilon-\mathcal{I}\varepsilon))\\
    &=(f-\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}},\varepsilon-\mathcal{I}\varepsilon)-(\alpha\nabla u_{\mathcal{T}}+\boldsymbol{\sigma}_{\mathcal{T}},\nabla(\varepsilon-\mathcal{I}\varepsilon))\\
    &\le\Bigl(\sum_{K\in\mathcal{T}}\alpha_K^{-1}h_K^2\|f-\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}}\|_K^2\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal{T}}\alpha_K h_K^{-2}\|\varepsilon-\mathcal{I}\varepsilon\|_K^2\Bigr)^{1/2}\\
    &\quad+\Bigl(\sum_{K\in\mathcal{T}}\alpha_K^{-1}\|\alpha\nabla u_{\mathcal{T}}+\boldsymbol{\sigma}_{\mathcal{T}}\|_K^2\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal{T}}\alpha_K\|\nabla(\varepsilon-\mathcal{I}\varepsilon)\|_K^2\Bigr)^{1/2}\\
    &\lesssim\Bigl(\sum_{K\in\mathcal{T}}\bigl(\eta_{\mathrm{res},K}^2+\eta_{\mathrm{flux},K}^2\bigr)\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal{T}}\|\alpha^{1/2}\nabla\varepsilon\|_{\omega_K}^2\Bigr)^{1/2}.
    \end{aligned}$$

    Applying the norm equivalence of η to ˆη by Lemma 7.3, as well as the fact that the number of elements in ωK is uniformly bounded by Assumption 1, yields the desired estimate.

    When $k=1$, the residual term on $K$ can be further split thanks to $\Delta Q_1(K)=\{0\}$. First we notice that, by the fact that $\{\phi_z\}$ form a partition of unity,

    $$(f,\varepsilon-\mathcal{I}\varepsilon)=\sum_{z\in\mathcal{N}_1}\sum_{K\subset\omega_z}(f,(\varepsilon-\pi_z\varepsilon)\phi_z)_K, \tag{40}$$

    in which a patch-wise constant $f_z$ (a weighted average of $f$) can be further inserted by the definition of $\pi_z$ (33) if $\alpha$ is a constant on $\omega_z$. Therefore, by the assumption of $\alpha_K$ being a piecewise constant, splitting (40), we have

    $$\begin{aligned}
    (f-\nabla\cdot\boldsymbol{\sigma}_{\mathcal{T}},\varepsilon-\mathcal{I}\varepsilon)&=(f,\varepsilon-\mathcal{I}\varepsilon)-(\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),\varepsilon-\mathcal{I}\varepsilon)\\
    &=\sum_{z\in\mathcal{N}_1}\sum_{K\subset\omega_z}(f,(\varepsilon-\pi_z\varepsilon)\phi_z)_K-(\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}}),\varepsilon-\mathcal{I}\varepsilon)\\
    &\lesssim\bigl(\mathrm{osc}(f;\mathcal{T})^2\bigr)^{1/2}\Bigl(\sum_{z\in\mathcal{N}_1}\sum_{K\subset\omega_z}h_z^{-2}\|\alpha^{1/2}(\varepsilon-\pi_z\varepsilon)\phi_z\|_K^2\Bigr)^{1/2}\\
    &\quad+\Bigl(\sum_{K\in\mathcal{T}}\alpha_K^{-1}h_K^2\|\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\|_K^2\Bigr)^{1/2}\Bigl(\sum_{K\in\mathcal{T}}\alpha_K h_K^{-2}\|\varepsilon-\mathcal{I}\varepsilon\|_K^2\Bigr)^{1/2}.
    \end{aligned}$$

    Applying the inverse inequality in Lemma 7.2 to $\|\nabla\cdot(\boldsymbol{\sigma}_{\mathcal{T}}+\alpha_K\nabla u_{\mathcal{T}})\|_K$ and the projection estimate (36) for $\pi_z$, the rest follows the same argument as the one used in the $k\ge 2$ case.

    The numerical experiments use the bilinear element on common AMR benchmark problems. The codes for this paper are publicly available at https://github.com/lyc102/ifem, implemented using iFEM [16]. The linear algebraic system on an $l$-irregular quadtree is implemented following the conforming prolongation approach [15] as $P^{\top}AP\,\boldsymbol{u}=P^{\top}\boldsymbol{f}$, where $A$ is the locally assembled stiffness matrix for all nodes in $\mathcal{N}$, and $\boldsymbol{u}$ and $\boldsymbol{f}$ are the solution vector associated with $\mathcal{N}_R$ and the load vector associated with $\mathcal{N}$, respectively. $P:\mathbb{R}^{\dim\mathcal{N}_R}\to\mathbb{R}^{\dim\mathcal{N}}$ is a prolongation operator mapping a conforming $H^1$-bilinear finite element function defined on the regular nodes to all nodes; it consists of an identity block for the regular nodes and a weight matrix $W$ for the hanging nodes, where $W$ is assembled locally by a recursive kNN query in $\mathcal{N}_H$, while the polygonal mesh data structure embedding is built automatically during the construction of $P$. For details we refer the readers to https://github.com/lyc102/ifem/tree/master/research/polyFEM.
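    The conforming-prolongation solve above admits a compact illustration; the following Python sketch (ours, with a made-up 3-node system in which node 2 is a hanging node constrained to the midpoint value of nodes 0 and 1) mirrors the algebra $P^{\top}AP\,\boldsymbol{u}=P^{\top}\boldsymbol{f}$, whereas the actual iFEM code assembles $W$ sparsely and locally.

```python
import numpy as np

# P maps regular-node values to all nodes: identity block for regular nodes,
# averaging weights W for hanging nodes (hypothetical values below).
n_regular = 2
W = np.array([[0.5, 0.5]])                   # hanging node = midpoint of edge (node0, node1)
P = np.vstack([np.eye(n_regular), W])        # shape: (n_all_nodes, n_regular_nodes)

A = np.array([[ 3.0, -1.0, -1.0],            # made-up SPD "stiffness" matrix on all 3 nodes
              [-1.0,  3.0, -1.0],
              [-1.0, -1.0,  3.0]])
f = np.array([1.0, 1.0, 0.0])                # made-up load vector on all nodes

A_c = P.T @ A @ P                            # condensed conforming system
f_c = P.T @ f
u_regular = np.linalg.solve(A_c, f_c)        # unknowns live on the regular nodes only
u_all = P @ u_regular                        # hanging-node value follows from the constraint
print(u_regular, u_all)
```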

    The adaptive finite element method (AFEM) iterates the standard loop

    SOLVE → ESTIMATE → MARK → REFINE.

    The linear system is solved by MATLAB mldivide. In MARK, the Dörfler $L^2$-marking is used with the local error indicator $\hat{\eta}_K$, in which a minimal subset $\mathcal{M}\subset\mathcal{T}$ is chosen such that

    $$\sum_{K\in\mathcal{M}}\hat{\eta}_K^2\ \ge\ \theta\sum_{K\in\mathcal{T}}\hat{\eta}_K^2,\quad\text{for }\theta\in(0,1).$$
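    A minimal sketch (ours) of the Dörfler marking step: sort the local indicators $\hat{\eta}_K^2$ in decreasing order and mark the smallest leading set whose cumulative sum reaches the fraction $\theta$ of the total.

```python
import numpy as np

def dorfler_mark(eta_K_sq, theta=0.3):
    """Return element indices of a minimal set M with sum_{K in M} eta_K^2 >= theta * total."""
    eta_K_sq = np.asarray(eta_K_sq, dtype=float)
    order = np.argsort(eta_K_sq)[::-1]                 # largest local indicators first
    cumulative = np.cumsum(eta_K_sq[order])
    n_marked = int(np.searchsorted(cumulative, theta * eta_K_sq.sum())) + 1
    return order[:n_marked]

# Example: with theta = 0.3 only the dominant element is marked for red-refinement.
print(dorfler_mark([0.5, 0.01, 0.2, 0.02, 0.3], theta=0.3))
```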

    Throughout all examples, we fix $\theta=0.3$. Afterwards, $\mathcal{T}$ is refined by a red-refinement quadsecting the marked elements. For comparison, we compute the standard residual-based local indicator for $K\in\mathcal{T}_{\mathrm{poly}}$:

    $$\eta_{\mathrm{Residual},K}^2:=\alpha_K^{-1}h_K^2\|f+\nabla\cdot(\alpha\nabla u_{\mathcal{T}})\|_K^2+\frac{1}{2}\sum_{e\subset\partial K}\frac{h_e}{\alpha_K+\alpha_{K_e}}\bigl\|[\![\alpha\nabla u_{\mathcal{T}}\cdot\boldsymbol{n}_e]\!]\bigr\|_e^2.$$

    Let $\eta_{\mathrm{Residual}}^2=\sum_{K\in\mathcal{T}}\eta_{\mathrm{Residual},K}^2$. The residual-based estimator $\eta_{\mathrm{Residual}}$ is computed only for comparison purposes and is not used in marking. The AFEM procedure stops when the relative error reaches a threshold. The effectivity indices of the different estimators are compared:

    $$\text{effectivity index}:=\eta\,/\,\|\alpha^{1/2}\nabla\varepsilon\|,\quad\text{where }\varepsilon:=u-u_{\mathcal{T}},\ \eta=\eta_{\mathrm{Residual}}\text{ or }\hat{\eta};$$

    i.e., the closer the effectivity index is to 1, the more accurately this estimator measures the error of interest. We use an order-5 Gaussian quadrature to compute $\|\alpha^{1/2}\nabla(u-u_{\mathcal{T}})\|$ elementwise. The orders of convergence for the various $\eta$'s and for $\|\alpha^{1/2}\nabla(u-u_{\mathcal{T}})\|$ are computed, for which $r_{\eta}$ and $r_{\mathrm{err}}$ are defined as the slopes of the linear fits of $\ln\eta_n$ and $\ln\|\alpha^{1/2}\nabla(u-u_{\mathcal{T},n})\|$ in the asymptotic regime:

    $$\ln\eta_n\approx -r_{\eta}\ln N_n+c_1,\quad\text{and}\quad\ln\|\alpha^{1/2}\nabla(u-u_{\mathcal{T},n})\|\approx -r_{\mathrm{err}}\ln N_n+c_2,$$

    where the subscript $n$ stands for the iteration number in the AFEM cycles and $N_n:=\#(\mathcal{N}_R\setminus\mathcal{N}_{\partial\Omega})$. $r_{\eta}$ and $r_{\mathrm{err}}$ are considered optimal when close to $1/2$.
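    The rates $r_{\eta}$ and $r_{\mathrm{err}}$ are obtained by a least-squares fit of the log-quantities against $\ln N_n$ in the asymptotic regime; a small sketch (ours, with made-up data) using numpy.polyfit:

```python
import numpy as np

def convergence_rate(N, quantity, skip=2):
    """Least-squares slope r such that ln(quantity) ~ -r * ln(N) + c,
    using only the asymptotic regime (skipping the first `skip` AFEM iterations)."""
    slope, _ = np.polyfit(np.log(N[skip:]), np.log(quantity[skip:]), 1)
    return -slope

# Made-up sequence decaying like N^{-1/2}, the optimal rate for the lowest order element.
N = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
eta = 2.0 * N ** (-0.5)
print(convergence_rate(N, eta))   # approximately 0.5
```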

    In this example, a standard AMR benchmark on the L-shaped domain is tested. The true solution is $u=r^{2/3}\sin(2\theta/3)$ in polar coordinates on $\Omega=(-1,1)\times(-1,1)\setminus\bigl([0,1)\times(-1,0]\bigr)$. The AFEM procedure stops when the relative error reaches 0.01. The adaptively refined mesh can be found in Figure 2a. While both estimators show the optimal rate of convergence in Figure 2b, the effectivity index is 4.52 for $\eta_{\mathrm{Residual}}$ and 2.24 for $\hat{\eta}$.

    Figure 2.  The result of the L-shape example. (a) The adaptively refined mesh with 1014 DoFs. (b) Convergence in Example 1.

    The solution $u=\tan^{-1}\bigl(\alpha(r-r_0)\bigr)$ is defined on $\Omega=(0,1)^2$ with $r:=\sqrt{(x+0.05)^2+(y+0.05)^2}$, $\alpha=100$, and $r_0=0.7$. The true solution exhibits a sharp transition layer (Figure 3a). The convergence results can be found in Figure 3b. In this example, the AFEM procedure stops when the relative error reaches 0.05. Additionally, we note that allowing $l$-irregular meshes ($l\ge 2$) makes the AMR procedure more efficient at capturing the singularity of the solution. A simple comparison can be found in Figure 4. The effectivity indices for $\eta_{\mathrm{Residual}}$ and $\hat{\eta}$ are 5.49 and 2.08, respectively.

    Figure 3.  The result of the circular wave front example. (a) uT on a 3-irregular mesh with #DoFs=1996, the relative error is 14.3%. (b) Convergence in Example 2.
    Figure 4.  Comparison of the adaptively refined meshes. (a) 1-irregular mesh, #DoFs=1083, the relative error is 21.8%. (b) 4-irregular mesh, and #DoFs=1000, the relative error is 17.8%.

    This example is a common benchmark test problem introduced in [9] (see also [17,12]) for elliptic interface problems. The true solution $u=r^{\gamma}\mu(\theta)$ is harmonic in each of the four quadrants, and $\mu(\theta)$ takes a different expression within each quadrant:

    $$\mu(\theta)=\begin{cases}\cos\bigl((\pi/2-\delta)\gamma\bigr)\,\cos\bigl((\theta-\pi/2+\rho)\gamma\bigr) & \text{if } 0\le\theta\le\pi/2,\\ \cos(\rho\gamma)\,\cos\bigl((\theta-\pi+\delta)\gamma\bigr) & \text{if } \pi/2\le\theta\le\pi,\\ \cos(\delta\gamma)\,\cos\bigl((\theta-\pi-\rho)\gamma\bigr) & \text{if } \pi\le\theta<3\pi/2,\\ \cos\bigl((\pi/2-\rho)\gamma\bigr)\,\cos\bigl((\theta-3\pi/2-\delta)\gamma\bigr) & \text{if } 3\pi/2\le\theta\le 2\pi.\end{cases}$$

    The coefficient is $\alpha=R$ in the first and third quadrants and $\alpha=1$ in the second and fourth quadrants, and the true flux $-\alpha\nabla u$ is glued together using the $H(\mathrm{div})$-continuity conditions. We choose the following set of parameters for $u$:

    $$\gamma=0.1,\quad R\approx 161.4476387975881,\quad \rho=\pi/4,\quad \delta\approx -14.92256510455152.$$

    By this choice, the function is very singular near the origin, as the maximum regularity it has is $H^{1+\gamma}_{\mathrm{loc}}(\Omega\setminus\{0\})$. Through integration by parts, it can be computed accurately that $\|\alpha^{1/2}\nabla u\|\approx 0.56501154$. For the detailed formula and more possible choices of the parameters above, we refer the reader to [17].
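    For reference, a direct transcription (ours) of the exact solution $u=r^{\gamma}\mu(\theta)$ of this Kellogg-type benchmark with the parameters listed above; $R$ and $\delta$ are the approximate values quoted in the text (with $\delta$ taken negative, as in the standard benchmark), so the snippet is illustrative rather than authoritative.

```python
import numpy as np

gamma = 0.1
rho = np.pi / 4
delta = -14.92256510455152        # approximate value quoted above

def mu(theta):
    """Angular part of the exact solution, piecewise over the four quadrants."""
    theta = np.mod(theta, 2 * np.pi)
    if theta <= np.pi / 2:
        return np.cos((np.pi / 2 - delta) * gamma) * np.cos((theta - np.pi / 2 + rho) * gamma)
    if theta <= np.pi:
        return np.cos(rho * gamma) * np.cos((theta - np.pi + delta) * gamma)
    if theta < 3 * np.pi / 2:
        return np.cos(delta * gamma) * np.cos((theta - np.pi - rho) * gamma)
    return np.cos((np.pi / 2 - rho) * gamma) * np.cos((theta - 3 * np.pi / 2 - delta) * gamma)

def u_exact(x, y):
    """u = r^gamma * mu(theta) in polar coordinates centered at the origin."""
    r = np.hypot(x, y)
    return r ** gamma * mu(np.arctan2(y, x))

print(u_exact(0.3, 0.4))
```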

    The AFEM procedure for this problem stops when the relative error reaches 0.05, and the resulting mesh and finite element approximation during the refinement can be found in Figure 5, and the AFEM procedure shows optimal rate of convergence in Figure 6. The effectivity index for ηResidual is 2.95, and 1.33 for ˆη.

    Figure 5.  The result of the Kellogg example. (a) The adaptively refined mesh with #DoFs=2001 on which the energy error is 0.0753, this number is roughly 75% of the number of DoFs needed to achieve the same accuracy if using conforming linear finite element on triangular grid (see [17,Section 4]). (b) The finite element approximation with #DoFs=1736.
    Figure 6.  The convergence result of the Kellogg example.

    A postprocessed flux with the minimum $H(\mathrm{div})$ continuity requirement is constructed for tensor-product type finite elements. The implementation can easily be ported to finite elements on quadtrees to make use of the vast existing finite element libraries in the engineering community. Theoretically, the local error indicator is efficient, and the global estimator is shown to be reliable under the assumptions that (i) the mesh has bounded irregularities, and (ii) the diffusion coefficient is a quasi-monotone piecewise constant. Numerically, we have observed that the local error indicator and the global estimator are, respectively, efficient and reliable (in the asymptotic regime). Moreover, the recovery-based estimator is more accurate than the residual-based one.

    However, we acknowledge that the technical tool involving interpolation essentially limits the reliability result to 1-irregular meshes. A simple weighted averaging has its restrictions and is hard to generalize to $hp$-finite elements, or to discretizations with curved edges/isoparametric elements. Nevertheless, we have shown that the flexibility of the virtual element framework allows further modification of the space in which we perform the flux recovery to cater to such needs.

    The author is grateful for the constructive advice from the anonymous reviewers. This work was supported in part by the National Science Foundation under grants DMS-1913080 and DMS-2136075, and no additional revenues are related to this work.

    Unlike the identity-matrix stabilization commonly used in most of the VEM literature, for $\boldsymbol{\tau}\in\mathcal{V}_k(K)$ we opt for a mass matrix/DoF hybrid stabilizer. Let $\|\boldsymbol{\tau}\|_{h,K}^2:=((\boldsymbol{\tau},\boldsymbol{\tau}))_K$ and

    $$((\boldsymbol{\sigma},\boldsymbol{\tau}))_K:=(\Pi\boldsymbol{\sigma},\Pi\boldsymbol{\tau})_K+S_K\bigl((I-\Pi)\boldsymbol{\sigma},(I-\Pi)\boldsymbol{\tau}\bigr), \tag{41}$$

    where SK(,) is defined in (23).

    To show the inverse inequality and the norm equivalence used in the reliability bound, we need to introduce some geometric measures on each element. Consider a polygonal element $K$ and an edge $e\subset\partial K$; let $l_e$ be the height measuring how far from the edge $e$ one can advance into an interior subset of $K$, and denote by $T_e\subset K$ a right triangle with height $l_e$ and base $e$.

    Proposition 1. Under Assumption 1, $\mathcal{T}_{\mathrm{poly}}$ satisfies: (1) the number of edges of every $K\in\mathcal{T}_{\mathrm{poly}}$ is uniformly bounded above; (2) for any edge $e$ on every $K$, $l_e/h_e$ is uniformly bounded below.

    Lemma 7.1 (Trace inequality on small edges [13]). If Proposition 1 holds, then for $v\in H^1(K)$ and $K\in\mathcal{T}_{\mathrm{poly}}$ we have

    $$h_e^{-1/2}\|v\|_e\lesssim h_K^{-1}\|v\|_K+\|\nabla v\|_K,\quad\text{on } e\subset\partial K. \tag{42}$$

    Proof. The proof follows essentially equation (3.9) in [13,Lemma 3.3] as a standard scaled trace inequality on e toward Te reads

    $$h_e^{-1/2}\|v\|_e\lesssim h_e^{-1}\|v\|_{T_e}+\|\nabla v\|_{T_e}\lesssim h_K^{-1}\|v\|_K+\|\nabla v\|_K.$$

    Lemma 7.2 (Inverse inequalities). Under Assumption 1, we have the following inverse estimates for $\boldsymbol{\tau}\in\mathcal{V}_k(K)$ in (4) on any $K\in\mathcal{T}_{\mathrm{poly}}$, with constants depending on $k$ and the number of edges of $K$:

    $$\|\nabla\cdot\boldsymbol{\tau}\|_K\lesssim h_K^{-1}\|\boldsymbol{\tau}\|_K,\quad\text{and}\quad\|\nabla\cdot\boldsymbol{\tau}\|_K\lesssim h_K^{-1}S_K(\boldsymbol{\tau},\boldsymbol{\tau})^{1/2}. \tag{43}$$

    Proof. The first inequality in (43) can be shown using a bubble function trick. Choose $b_K$ to be a bubble function of $T_e$, where $e$ is the longest edge of $K$. Denote $p:=\nabla\cdot\boldsymbol{\tau}\in P_{k-1}(K)$; we have

    $$\|\nabla\cdot\boldsymbol{\tau}\|_K^2\lesssim(\nabla\cdot\boldsymbol{\tau},p\,b_K)=-(\boldsymbol{\tau},\nabla(p\,b_K))\le\|\boldsymbol{\tau}\|_K\|\nabla(p\,b_K)\|_K,$$

    and then $\|\nabla(p\,b_K)\|_K$ can be estimated as follows:

    $$\|\nabla(p\,b_K)\|_K\le\|b_K\nabla p\|_K+\|p\,\nabla b_K\|_K\le\|b_K\|_{\infty,K}\|\nabla p\|_K+\|p\|_K\|\nabla b_K\|_{\infty,K}.$$

    Consequently, the first inequality in (43) follows from the above by the standard inverse estimate for polynomials $\|\nabla p\|_K\lesssim h_K^{-1}\|p\|_K$, and the properties of the bubble function $\|b_K\|_{\infty,K}=O(1)$ and $\|\nabla b_K\|_{\infty,K}=O(h_K^{-1})$.

    To prove the second inequality in (43), by integration by parts we have

    $$\|\nabla\cdot\boldsymbol{\tau}\|_K^2=(\nabla\cdot\boldsymbol{\tau},p)_K=-(\boldsymbol{\tau},\nabla p)_K+\sum_{e\subset\partial K}(\boldsymbol{\tau}\cdot\boldsymbol{n}_e,p)_e. \tag{44}$$

    Expand $\nabla\cdot\boldsymbol{\tau}=p$ in the monomial basis, $p(\boldsymbol{x})=\sum_{\boldsymbol{\alpha}\in\Lambda}p_{\boldsymbol{\alpha}}m_{\boldsymbol{\alpha}}(\boldsymbol{x})$, and denote the mass matrix $M:=\bigl((m_{\boldsymbol{\alpha}},m_{\boldsymbol{\gamma}})_K\bigr)_{\boldsymbol{\alpha}\boldsymbol{\gamma}}$ and $\boldsymbol{p}:=(p_{\boldsymbol{\alpha}})_{\boldsymbol{\alpha}\in\Lambda}$; it is straightforward to see that

    $$\|p\|_K^2=\boldsymbol{p}^{\top}M\boldsymbol{p}\gtrsim\boldsymbol{p}^{\top}\mathrm{diag}(M)\,\boldsymbol{p}\ge\min_j M_{jj}\,\|\boldsymbol{p}\|_2^2\gtrsim h_K^2\|\boldsymbol{p}\|_2^2, \tag{45}$$

    since $\int_K(x-x_K)^l(y-y_K)^m\,\mathrm{d}x\,\mathrm{d}y\ge 0$ for the off-diagonal entries of $M$, due to $K$ being geometrically a rectangle (with additional vertices). As a result, applying the trace inequality in Lemma 7.1 to (44) yields

    $$\|\nabla\cdot\boldsymbol{\tau}\|_K^2\lesssim\Bigl(\sum_{\boldsymbol{\alpha}\in\Lambda}(\boldsymbol{\tau},\nabla m_{\boldsymbol{\alpha}})_K^2\Bigr)^{1/2}\Bigl(\sum_{\boldsymbol{\alpha}\in\Lambda}p_{\boldsymbol{\alpha}}^2\Bigr)^{1/2}+\Bigl(\sum_{e\subset\partial K}h_e\|\boldsymbol{\tau}\cdot\boldsymbol{n}_e\|_e^2\Bigr)^{1/2}\Bigl(\sum_{e\subset\partial K}h_e^{-1}\|p\|_e^2\Bigr)^{1/2}\lesssim S_K(\boldsymbol{\tau},\boldsymbol{\tau})^{1/2}\bigl(\|\boldsymbol{p}\|_2+h_K^{-1}\|p\|_K+\|\nabla p\|_K\bigr).$$

    As a result, the second inequality in (43) follows upon applying an inverse inequality to $\|\nabla p\|_K$ together with the estimate (45).

    Remark 2. While the proof of Lemma 7.2 relies on $K$ being a rectangle, the result holds for a much broader class of polygons by changing the basis of $P_{k-1}(K)$ from the simple scaled monomials to the quasi-orthogonal ones in [25,7] and applying the isotropic polygon scaling result in [13].

    Lemma 7.3 (Norm equivalence). Under Assumption 1, let $\Pi$ be the oblique projection defined in (16); then the following relation holds for $\boldsymbol{\tau}\in\mathcal{V}_k(K)$ in (4) on any $K\in\mathcal{T}_{\mathrm{poly}}$:

    $$\gamma_*\|\boldsymbol{\tau}\|_K\le\|\boldsymbol{\tau}\|_{h,K}\le\gamma^*\|\boldsymbol{\tau}\|_K, \tag{46}$$

    where both $\gamma_*$ and $\gamma^*$ depend on $k$ and the number of edges of $K$.

    Proof. First we consider the lower bound, by triangle inequality,

    $$\|\boldsymbol{\tau}\|_K\le\|\Pi\boldsymbol{\tau}\|_K+\|(I-\Pi)\boldsymbol{\tau}\|_K.$$

    Since $\Pi\boldsymbol{\tau}\in\mathcal{V}_k(K)$, it suffices to establish the following in order to prove the lower bound in (46):

    $$\|\boldsymbol{\tau}\|_K^2\lesssim S_K(\boldsymbol{\tau},\boldsymbol{\tau}),\quad\text{for }\boldsymbol{\tau}\in\mathcal{V}_k(K). \tag{47}$$

    To this end, we consider the weak solution to the following auxiliary boundary value problem on K:

    $$\begin{cases}\Delta\psi=\nabla\cdot\boldsymbol{\tau} & \text{in } K,\\ \nabla\psi\cdot\boldsymbol{n}=\boldsymbol{\tau}\cdot\boldsymbol{n}_K & \text{on } \partial K.\end{cases} \tag{48}$$

    By a standard Helmholtz decomposition result (e.g., [23, Chapter 1, Proposition 3.1]), we have $\boldsymbol{\tau}-\nabla\psi=\mathbf{curl}\,\phi$. Moreover, since on $\partial K$, $0=(\boldsymbol{\tau}-\nabla\psi)\cdot\boldsymbol{n}=\mathbf{curl}\,\phi\cdot\boldsymbol{n}=\partial\phi/\partial s$, we can further choose $\phi\in H_0^1(K)$. As a result, by the assumption that $\nabla\times\boldsymbol{\tau}=0$ for $\boldsymbol{\tau}$ in the modified virtual element space (4), we can verify that

    $$\|\boldsymbol{\tau}-\nabla\psi\|_K^2=(\boldsymbol{\tau}-\nabla\psi,\mathbf{curl}\,\phi)_K=0.$$

    Consequently, we have essentially proved the unisolvency of the modified VEM space (4), and $\boldsymbol{\tau}=\nabla\psi$. We further note that $\psi$ in (48) can be chosen in $H^1(K)/\mathbb{R}$, and thus

    $$\begin{aligned}
    \|\boldsymbol{\tau}\|_K^2=(\boldsymbol{\tau},\nabla\psi)_K&=-(\nabla\cdot\boldsymbol{\tau},\psi)_K+(\boldsymbol{\tau}\cdot\boldsymbol{n}_K,\psi)_{\partial K}\\
    &\le\|\nabla\cdot\boldsymbol{\tau}\|_K\|\psi\|_K+\sum_{e\subset\partial K}\|\boldsymbol{\tau}\cdot\boldsymbol{n}_e\|_e\|\psi\|_e\\
    &\le\|\nabla\cdot\boldsymbol{\tau}\|_K\|\psi\|_K+\Bigl(\sum_{e\subset\partial K}h_e\|\boldsymbol{\tau}\cdot\boldsymbol{n}_e\|_e^2\Bigr)^{1/2}\Bigl(\sum_{e\subset\partial K}h_e^{-1}\|\psi\|_e^2\Bigr)^{1/2}.
    \end{aligned} \tag{49}$$

    Proposition 1 allows us to apply the isotropic trace inequality on an edge of a polygon (Lemma 7.1), which, combined with the Poincaré inequality for $\psi\in H^1(K)/\mathbb{R}$, gives, on every $e\subset\partial K$,

    $$h_e^{-1/2}\|\psi\|_e\lesssim h_K^{-1}\|\psi\|_K+\|\nabla\psi\|_K\lesssim\|\nabla\psi\|_K.$$

    Furthermore applying the inverse estimate in Lemma 7.2 on the bulk term above, we have

    $$\|\boldsymbol{\tau}\|_K^2\lesssim S_K(\boldsymbol{\tau},\boldsymbol{\tau})^{1/2}\|\nabla\psi\|_K,$$

    which, since $\nabla\psi=\boldsymbol{\tau}$, proves the validity of (47) and thus yields the lower bound.

    To prove the upper bound, by $\|\Pi\boldsymbol{\tau}\|_K\le\|\boldsymbol{\tau}\|_K$, it suffices to establish the reversed direction of (47) on a single edge $e$ and for a single monomial basis function $m_{\boldsymbol{\alpha}}\in P_{k-1}(K)$:

    $$h_e\|\boldsymbol{\tau}\cdot\boldsymbol{n}_e\|_e^2\lesssim\|\boldsymbol{\tau}\|_K^2,\quad\text{and}\quad|(\boldsymbol{\tau},\nabla m_{\boldsymbol{\alpha}})_K|\lesssim\|\boldsymbol{\tau}\|_K. \tag{50}$$

    To prove the first inequality, by Proposition 1 again, consider the edge bubble function $b_e$ such that $\mathrm{supp}\,b_e=\overline{T_e}$. We can let $b_e=0$ on $e'\subset\partial K$ for $e'\neq e$. It is easy to verify that:

    $$\|\nabla b_e\|_{\infty,K}=O(1/h_e),\quad\text{and}\quad\|b_e\|_{\infty,K}=O(1). \tag{51}$$

    Denote $q_e:=\boldsymbol{\tau}\cdot\boldsymbol{n}_e$, and extend it to $K$ by a constant extension in the normal direction on the rectangular strip $R_e\subset K$ with respect to $e$ (notice $\mathrm{supp}\,b_e\subset R_e$); we have

    $$\begin{aligned}
    \|\boldsymbol{\tau}\cdot\boldsymbol{n}_e\|_e^2\lesssim(\boldsymbol{\tau}\cdot\boldsymbol{n}_e,b_e q_e)_e=(\boldsymbol{\tau}\cdot\boldsymbol{n},b_e q_e)_{\partial K}&=(\nabla\cdot\boldsymbol{\tau},q_e b_e)_K+(\boldsymbol{\tau},\nabla(b_e q_e))_K\\
    &\le\|\nabla\cdot\boldsymbol{\tau}\|_K\|q_e b_e\|_{T_e}+\|\boldsymbol{\tau}\|_K\|\nabla(q_e b_e)\|_{T_e}\\
    &\lesssim\|\nabla\cdot\boldsymbol{\tau}\|_K\|q_e\|_{T_e}\|b_e\|_{\infty,K}+\|\boldsymbol{\tau}\|_K\|q_e\|_{T_e}\|\nabla b_e\|_{\infty,K}.
    \end{aligned}$$

    Now the fact that $\|q_e\|_{T_e}\lesssim h_e^{1/2}\|q_e\|_e$, the scaling of the edge bubble function in (51), and the first inverse estimate of Lemma 7.2 yield the first part of (50).

    The second inequality in (50) can be estimated straightforwardly by the scaling of the monomials in (7):

    $$|(\boldsymbol{\tau},\nabla m_{\boldsymbol{\alpha}})_K|\le\|\boldsymbol{\tau}\|_K\|\nabla m_{\boldsymbol{\alpha}}\|_K\lesssim\|\boldsymbol{\tau}\|_K\,h_K^{-1}|K|^{1/2}\lesssim\|\boldsymbol{\tau}\|_K. \tag{52}$$

    Hence, (46) is proved.



    [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017. https://doi.org/10.48550/arXiv.1706.03762
    [2] Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, et al., Learning deep transformer models for machine translation, preprint, arXiv: 1906.01787.
    [3] S. A. Chowdhury, A. Abdelali, K. Darwish, J. Soon-Gyo, J. Salminen, B. J. Jansen, Improving arabic text categorization using transformer training diversification, in Proceedings of the Fifth Arabic Natural Language Processing Workshop (COLING-WANLP), (2020), 226–236. https://aclanthology.org/2020.wanlp-1.21
    [4] X. Ma, P. Zhang, S. Zhang, N. Duan, Y. Hou, M. Zhou, et al., A tensorized transformer for language modeling, preprint, arXiv: 1906.09777.
    [5] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, preprint, arXiv: 1810.04805.
    [6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, et al., RoBERTa: A robustly optimized BERT pretraining approach, preprint, arXiv: 1907.11692.
    [7] H. Xu, B. Liu, L. Shu, P. S. Yu, BERT post-training for review reading comprehension and aspect-based sentiment analysis, preprint, arXiv: 1904.02232.
    [8] P. Shi, J. Lin, Simple BERT models for relation extraction and semantic role labeling, preprint, arXiv: 1904.05255.
    [9] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, preprint, arXiv: 1910.01108.
    [10] Y. Cheng, D. Wang, P. Zhou, T. Zhang, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Signal Process. Mag., 35 (2018), 126–136. https://doi.org/10.1109/MSP.2017.2765695 doi: 10.1109/MSP.2017.2765695
    [11] S. Cheng, D. Lucor, J. P. Argaud, Observation data compression for variational assimilation of dynamical systems, J. Comput. Sci., 53 (2021), 101405. https://doi.org/10.1016/j.jocs.2021.101405 doi: 10.1016/j.jocs.2021.101405
    [12] S. Liu, Y. Lin, Z. Zhou, K. Nan, H. Liu, J. Du, On-demand deep model compression for mobile devices: A usage-driven model selection framework, in Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, (2018), 389–400. https://doi.org/10.1145/3210240.3210337
    [13] S. Liu, J. Du, K. Nan, Z. Zhou, H. Liu, Z. Wang, et al., AdaDeep: A usage-driven, automated deep model compression framework for enabling ubiquitous intelligent mobiles, IEEE Trans. Mob. Comput., 20 (2021), 3282–3297. https://doi.org/10.1109/TMC.2020.2999956 doi: 10.1109/TMC.2020.2999956
    [14] V. L. Tran, S. E. Kim, Efficiency of three advanced data-driven models for predicting axial compression capacity of CFDST columns, Thin-Walled Struct., 152 (2020), 106744. https://doi.org/10.1016/j.tws.2020.106744 doi: 10.1016/j.tws.2020.106744
    [15] Z. X. Hu, Y. Wang, M. F. Ge, J. Liu, Data-driven fault diagnosis method based on compressed sensing and improved multiscale network, IEEE Trans. Ind. Electron., 67 (2020), 3216–3225. https://doi.org/10.1109/TIE.2019.2912763 doi: 10.1109/TIE.2019.2912763
    [16] S. Cheng, I. C. Prentice, Y. Huang, Y. Jin, Y. K. Guo, R. Arcucci, Data-driven surrogate model with latent data assimilation: Application to wildfire forecasting, J. Comput. Phys., 464 (2022). https://doi.org/10.1016/j.jcp.2022.111302
    [17] S. Yang, Z. Zhang, C. Zhao, X. Song, S. Guo, H. Li, CNNPC: End-edge-cloud collaborative CNN inference with joint model partition and compression, IEEE Trans. Parallel Distrib. Syst., (2022), 1–1. https://doi.org/10.1109/TPDS.2022.3177782 doi: 10.1109/TPDS.2022.3177782
    [18] H. He, S. Jin, C. K. Wen, F. Gao, G. Y. Li, Z. Xu, Model-driven deep learning for physical layer communications, IEEE Wireless Commun., 26 (2019), 77–83. https://doi.org/10.1109/MWC.2019.1800447 doi: 10.1109/MWC.2019.1800447
    [19] Z. Liu, M. del Rosario, Z. Ding, A markovian model-driven deep learning framework for massive MIMO CSI feedback, IEEE Trans. Wireless Commun., 21 (2022), 1214–1228. https://doi.org/10.1109/TWC.2021.3103120 doi: 10.1109/TWC.2021.3103120
    [20] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, preprint, arXiv: 2002.10957.
    [21] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, et al., TinyBERT: Distilling BERT for natural language understanding, preprint, arXiv: 1909.10351.
    [22] S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for BERT model compression, preprint, arXiv: 1908.09355.
    [23] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jegou, Training data-efficient image transformers & distillation through attention, in Proceedings of the 38th International Conference on Machine Learning (ICML), (2021), 10347–10357. https://doi.org/10.48550/arXiv.2012.12877
    [24] P. Michel, O. Levy, G. Neubig, Are sixteen heads really better than one?, Adv. Neural Inf. Process. Syst., preprint, arXiv: 1905.10650.
    [25] M. A. Gordon, K. Duh, N. Andrews, Compressing BERT: Studying the effects of weight pruning on transfer learning, preprint, arXiv: 2002.08307.
    [26] T. Chen, Y. Cheng, Z. Gan, L. Yuan, L. Zhang, Z. Wang, Chasing sparsity in vision transformers: An end-to-end exploration, Adv. Neural Inf. Process. Syst., (2021), 19974–19988. https://doi.org/10.48550/arXiv.2106.04533 doi: 10.48550/arXiv.2106.04533
    [27] T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, et al., The lottery ticket hypothesis for pre-trained BERT networks, Adv. Neural Inf. Process. Syst., (2020), 15834–15846. https://doi.org/10.48550/arXiv.2007.12223 doi: 10.48550/arXiv.2007.12223
    [28] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, et al., Q-BERT: Hessian based ultra low precision quantization of BERT, preprint, arXiv: 1909.05840.
    [29] Z. Liu, Y. Wang, K. Han, S. Ma, W. Gao, Post-training quantization for vision transformer, preprint, arXiv: 2106.14156.
    [30] H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, et al., BinaryBERT: Pushing the limit of BERT quantization, preprint, arXiv: 2012.15701.
    [31] O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat, Q8BERT: Quantized 8Bit BERT, in the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS 2019, (2019), 36–39. https://doi.org/10.1109/EMC2-NIPS53020.2019.00016
    [32] Z. Wu, Z. Liu, J. Lin, Y. Lin, S. Han, Lite transformer with long-short range attention, preprint, arXiv: 2004.11886.
    [33] L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, Q. Liu, DynaBERT: Dynamic BERT with adaptive width and depth, preprint, arXiv: 2004.04037.
    [34] M. Chen, H. Peng, J. Fu, H. Ling, AutoFormer: Searching transformers for visual recognition, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), (2021), 12250–12260. https://doi.org/10.1109/ICCV48922.2021.01205
    [35] P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, et al., Compressing large-scale transformer-based models: A case study on BERT, Trans. Assoc. Comput. Linguist., 9 (2021), 1061–1080. https://doi.org/10.1162/tacl_a_00413 doi: 10.1162/tacl_a_00413
    [36] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput., 9 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 doi: 10.1162/neco.1997.9.8.1735
    [37] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, preprint, arXiv: 1412.3555.
    [38] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, preprint, arXiv: 1409.0473.
    [39] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, et al., FTRANS: energy-efficient acceleration of transformers using FPGA, in Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), (2020), 175–180. https://doi.org/10.1145/3370748.3406567
    [40] T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, et al., A³: Accelerating attention mechanisms in neural networks with approximation, in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), (2020), 328–341. https://doi.org/10.1109/HPCA47549.2020.00035
    [41] T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, et al., ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), (2021), 692–705. https://doi.org/10.1109/ISCA52012.2021.00060
    [42] X. Zhang, Y. Wu, P. Zhou, X. Tang, J. Hu, Algorithm-hardware co-design of attention mechanism on FPGA devices, ACM Trans. Embed. Comput. Syst., 20 (2021), 1–24. https://doi.org/10.1145/3477002 doi: 10.1145/3477002
    [43] S. Lu, M. Wang, S. Liang, J. Lin, Z. Wang, Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer, in IEEE International SOC Conference, (2020), 84–89. https://doi.org/10.1109/ISCA52012.2021.00060
    [44] A. Parikh, O. Täckström, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, (2016), 2249–2255. https://doi.org/10.48550/arXiv.1606.01933
    [45] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, et al., A structured self-attentive sentence embedding, preprint, arXiv: 1703.03130
    [46] M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, (2002), 380–388. https://doi.org/10.1145/509907.509965
    [47] X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, S. F. Chang, Fast orthogonal projection based on kronecker product, in 2015 IEEE International Conference on Computer Vision (ICCV), (2015), 2929–2937. https://doi.org/10.1109/ICCV.2015.335
    [48] Y. Gong, S. Kumar, H. A. Rowley, S. Lazebnik, Learning binary codes for high-dimensional data using bilinear projections, in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2013), 484–491. https://doi.org/10.1109/CVPR.2013.69
    [49] M. Wang, S. Lu, D. Zhu, J. Lin, Z. Wang, A high-speed and low-complexity architecture for softmax function in deep learning, in 2018 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), (2018), 223–226. https://doi.org/10.1109/APCCAS.2018.8605654
    [50] R. Hu, B. Tian, S. Yin, S. Wei, Efficient hardware architecture of softmax layer in deep neural network, in 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), (2018), 1–5. https://doi.org/10.1109/ICDSP.2018.8631588
    [51] L. Deng, G. Li, S. Han, L. Shi, Y. Xie, Model compression and hardware acceleration for neural networks: A comprehensive survey, Proc. IEEE, 108 (2020), 485–532. https://doi.org/10.1109/JPROC.2020.2976475 doi: 10.1109/JPROC.2020.2976475
    [52] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, et al., CirCNN: Accelerating and compressing deep neural networks using block-circulant weight matrices, in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), (2017), 395–408. https://doi.org/10.1145/3123939.3124552
    [53] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, et al., C-LSTM: Enabling efficient LSTM using structured compression techniques on FPGAs, in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), (2018), 11–20. https://doi.org/10.1145/3174243.3174253
    [54] L. Zhao, S. Liao, Y. Wang, Z. Li, J. Tang, B. Yuan, Theoretical properties for neural networks with weight matrices of low displacement rank, in Proceedings of the 34th International Conference on Machine Learning (ICML), (2017), 4082–4090. https://doi.org/10.48550/arXiv.1703.00144
    [55] V. Y. Pan, Structured matrices and displacement operators, in Structured Matrices and Polynomials: Unified Superfast Algorithms, Springer Science & Business Media, (2001), 117–153. https://doi.org/10.1007/978-1-4612-0129-8
    [56] J. O. Smith, Mathematics of the discrete fourier transform (DFT): with audio applications, in Mathematics of the Discrete Fourier Transform (DFT): With Audio Applications, Julius Smith, (2007), 115–164. https://ccrma.stanford.edu/~jos/st/
    [57] Z. Liu, G. Li, J. Cheng, Hardware acceleration of fully quantized BERT for efficient natural language processing, in 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), (2021), 513–516. https://doi.org/10.23919/DATE51398.2021.9474043
    [58] M. Sun, H. Ma, G. Kang, Y. Jiang, T. Chen, X. Ma, et al., VAQF: Fully automatic software-hardware co-design framework for low-bit vision transformer, preprint, arXiv: 2201.06618.
    [59] Z. Liu, Z. Shen, M. Savvides, K. T. Cheng, ReActNet: Towards precise binary neural network with generalized activation functions, in Computer Vision–ECCV 2020 (ECCV), (eds. Vedaldi. A., Bischof. H., Brox. T., Frahm. J.-M.), Cham, Springer International Publishing, (2020), 143–159. https://doi.org/10.1007/978-3-030-58568-6_9
    [60] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, in Computer Vision–ECCV 2016 (ECCV), (eds. Leibe. B., Matas. J., Sebe. N., Welling. M.), Cham, Springer International Publishing, (2016), 525–542. https://doi.org/10.1007/978-3-319-46493-0_32
    [61] S. Han, H. Mao, W. J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, preprint, arXiv: 1510.00149.
    [62] W. Wen, C. Wu, Y. Wang, Y. Chen, H. Li, Learning structured sparsity in deep neural networks, in Advances in Neural Information Processing Systems (NeurIPS), Curran Associates, (2016). https://doi.org/10.48550/arXiv.1608.03665
    [63] X. Ma, F. M. Guo, W. Niu, X. Lin, J. Tang, K. Ma, et al., PCONV: The missing but desirable sparsity in DNN weight pruning for real-time execution on mobile devices, in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), (2020), 5117–5124. https://doi.org/10.1609/aaai.v34i04.5954
    [64] B. Li, Z. Kong, T. Zhang, J. Li, Z. Li, H. Liu, et al., Efficient transformer-based large scale language representations using hardware-friendly block structured pruning, preprint, arXiv: 2009.08065.
    [65] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, et al., Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity, in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), (2019), 63–72. https://doi.org/10.1145/3289602.3293898
    [66] H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, et al., Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning, in 2021 22nd International Symposium on Quality Electronic Design (ISQED), (2021), 142–148. https://doi.org/10.1109/ISQED51717.2021.9424344
    [67] C. Ding, A. Ren, G. Yuan, X. Ma, J. Li, N. Liu, et al., Structured weight matrices-based hardware accelerators in deep neural networks: FPGAs and ASICs, in Proceedings of the 2018 on Great Lakes Symposium on VLSI (GLSVLSI), Chicago, IL, USA, Association for Computing Machinery, (2018), 353–358. https://doi.org/10.1145/3194554.3194625
    [68] S. Narang, E. Undersander, G. Diamos, Block-sparse recurrent neural networks, preprint, arXiv: 1711.02782.
    [69] P. Qi, E. H. M. Sha, Q. Zhuge, H. Peng, S. Huang, Z. Kong, et al., Accelerating framework of transformer by hardware design and model compression co-optimization, in 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), (2021), 1–9. https://doi.org/10.1109/ICCAD51958.2021.9643586
    [70] P. Qi, Y. Song, H. Peng, S. Huang, Q. Zhuge, E. H. M. Sha, Accommodating transformer onto FPGA: Coupling the balanced model compression and FPGA-implementation optimization, in Proceedings of the 2021 on Great Lakes Symposium on VLSI (GLSVLSI), Virtual Event, USA, Association for Computing Machinery, (2021), 163–168. https://doi.org/10.1145/3453688.3461739
    [71] D. So, Q. Le, C. Liang, The evolved transformer, in Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR, (2019), 5877–5886. https://doi.org/10.48550/arXiv.1901.11117
    [72] H. Wang, Efficient algorithms and hardware for natural language processing, Graduate Theses, Retrieved from the Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127440.
    [73] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, et al., Bit fusion: Bit-Level dynamically composable architecture for accelerating deep neural network, in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), (2018), 764–775. https://doi.org/10.1109/ISCA.2018.00069
    [74] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, et al., Templates for the solution of linear systems: Building blocks for iterative methods, in Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, Society for Industrial and Applied Mathematics, (1994), 39–55. https://doi.org/10.1137/1.9781611971538
    [75] W. Liu, B. Vinter, CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication, in Proceedings of the 29th ACM on International Conference on Supercomputing (ICS), Newport Beach, California, USA, Association for Computing Machinery, (2015), 339–350. https://doi.org/10.1145/2751205.2751209
    [76] R. Kannan, Efficient sparse matrix multiple-vector multiplication using a bitmapped format, in 20th Annual International Conference on High Performance Computing (HiPC), (2013), 286–294. https://doi.org/10.1109/HiPC.2013.6799135
    [77] W. Jiang, X. Zhang, E. H. M. Sha, L. Yang, Q. Zhuge, Y. Shi, et al., Accuracy vs. efficiency: achieving both through FPGA-implementation aware neural architecture search, in Proceedings of the 56th Annual Design Automation Conference 2019 (DAC), Las Vegas NV USA, ACM, (2019), 1–6. https://doi.org/10.1145/3316781.3317757
    [78] W. Jiang, E. H. M. Sha, X. Zhang, L. Yang, Q. Zhuge, Y. Shi, et al., Achieving super-linear speedup across multi-FPGA for real-time DNN inference, preprint, arXiv: 1907.08985.
    [79] W. Jiang, X. Zhang, E. H. M. Sha, Q. Zhuge, L. Yang, Y. Shi, et al., XFER: A novel design to achieve super-linear performance on multiple FPGAs for real-time AI, in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Seaside, CA, USA, Association for Computing Machinery, (2019), 305. https://doi.org/10.1145/3289602.3293988
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
