With the rapid development of information science and wireless communication, digital signal processing (DSP) has grown explosively[1]. DSP has become an indispensable technology in various fields, e.g., control systems[2,3], communication systems[4,5], and image processing[6]. Moreover, to meet the demands for low latency and high throughput in fields such as the Internet of Things (IoT)[7], artificial intelligence[8], and 6G[9], enhancing the performance of digital signal processors is becoming increasingly important.
Existing hardware implementations of digital signal processing designs are mainly divided into three types, i.e., the application-specific integrated circuit (ASIC), system on chip (SoC), and field-programmable gate array (FPGA)[10]. Specifically, ASIC-based designs minimize area and power consumption while delivering high performance. However, their fixed hardware circuits are not programmable and incur high development costs[11,12]. Moreover, SoC-based designs such as the TMS320C6678 (C6678) are programmable and customizable to a certain extent, but their performance tends to saturate when leveraging multiple cores[13]. They also suffer from limited performance and high power consumption due to low data transmission speed and insufficient hardware optimization. In contrast, the FPGA has become increasingly popular in DSP design due to its programmability, parallel computing units, and ability to significantly reduce development costs while maintaining high performance and flexibility[14].
There are already several works on implementing DSP algorithms with FPGA. For instance, Saeed et al.[15] introduced an FPGA-based compatible FFT/IFFT processor with lower area and latency. Paul et al.[16] migrated different forms of FIR and IIR filters to an FPGA. Vijay et al.[17] proposed a parallel pipeline based on FIR filters to efficiently implement IIR filters. Kavitha et al.[18] designed an LMS-based adaptive filter algorithm that reduces area and power consumption by eliminating multipliers.
Despite their good performance, the above FPGA-based works are specifically designed to accelerate one or two DSP algorithms. Constrained by limited hardware resources, such designs cannot deploy all the required algorithms on the target FPGA [19]. Even if all DSP algorithms are deployed, the execution efficiency of each algorithm is low because of the small amount of allocated resources. As a result, such works cannot support complex scenarios. For example, in multi-functional scenarios, where different DSP algorithms are required and updated, such designs must re-design and re-generate the related RTL circuits for the diverse DSP algorithms, which makes them difficult to adapt to various applications.
A reliable solution is to design an FPGA-based processor compatible with multiple DSP algorithms. Such a processor has sufficient parallel computing resources and can be reconfigured to support new algorithms by modifying a small portion of the design. However, several challenges remain in implementing a general processor, such as scheduling data and control when users' requirements are updated, and building reusable computation units compatible with different DSP algorithms.
In this paper, we propose an FPGA-based overlay processor for DSP algorithms, named DSP-OPU. Specifically, we present a customized instruction set and an overlay hardware architecture for the DSP-OPU. We further incorporate multiple reconfigurable computation engines (RCE) in DSP-OPU to mitigate performance saturation. These RCEs can be reused for different DSP algorithms, thereby reducing resource utilization and enhancing system integration. Additionally, we design an efficient compiler to schedule and convert algorithms into instructions that can be run directly on the FPGA.
In summary, the contributions of our work are as follows:
● An overlay architecture for DSP. We design a reconfigurable and domain-specific overlay processor for DSP algorithms. Specifically, DSP-OPU is capable of accelerating various DSP algorithms with a flexible and scalable hardware architecture. It is also optimized to be highly parallel, ensuring high computation efficiency.
● Multiple reconfigurable computation engines. In DSP-OPU, we propose a series of novel computation engines. The engines are highly flexible, enabling reconfigurable interconnections to accelerate the required algorithm and thereby significantly reducing power consumption. Unlike the SoC-based C6678 chip, our DSP-OPU alleviates performance saturation as the number of engines grows, owing to efficient bandwidth management.
● User-friendly SW-HW co-design. On the software side, we customize an instruction set architecture (ISA) for DSP-OPU, which can be easily extended to support newly emerging algorithms with strong scalability. A simple and efficient compiler is also proposed to generate executable instruction streams from users' detailed requirements. The compiler provides hardware-friendly optimizations, including model parsing, instruction generation, and instruction and data optimization. On the hardware side, we leverage a reconfigurable data path tied to our customized instructions. This data path breaks the dependency between instructions and data, thereby significantly improving on-chip data scheduling and enhancing transmission efficiency.
● High performance. We implement our DSP-OPU on Kintex-7 XC7K325T (325T) FPGA and Xilinx Alveo U200 (U200) FPGA. With comprehensive evaluations, DSP-OPU demonstrates competitive performance, achieving up to 29 × lower latency and 70 × better energy efficiency compared to C6678 SoC. In addition, compared to other DSP implementations for a single algorithm on FPGA, DSP-OPU achieves up to 4.5 × speedup and 4.4 × better energy efficiency.
DSP is an essential technology in multiple fields such as wireless communication systems and IoT. Various commonly used and efficient DSP algorithms serve different functions, e.g., FFT, FIR, IIR, and LMS. Specifically, FFT facilitates time-frequency transformations by decomposing the signal spectrum into distinct frequency components. FIR and IIR are responsible for signal filtering and frequency response control: FIR processes the signal through convolution with a set of coefficients, while IIR introduces feedback from the output signal. LMS is an adaptive filtering algorithm that adjusts the filter coefficients based on the error between the desired and actual outputs. The computational details of the different DSP algorithms are outlined in Table 1.
Algorithm | Expression of the algorithm |
FFT | $X(k)=\sum_{n=0}^{N-1}x(n)W_N^{kn}$ |
FIR | $y(n)=\sum_{k=0}^{N-1}h(k)x(n-k)$ |
IIR | $y(n)=\sum_{k=0}^{N}b_k x(n-k)-\sum_{k=1}^{N}a_k y(n-k)$ |
LMS | $y(n)=w^T(n)x(n)$, $e(n)=d(n)-y(n)$, $w(n+1)=w(n)+2\mu e(n)x(n)$ |
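For readers who prefer code to summation notation, the following is a minimal C++ sketch of the FIR and LMS recursions in Table 1. It is a plain scalar reference model; the function names and the zero-padding convention for x(m) with m < 0 are illustrative and not part of DSP-OPU itself.

```cpp
#include <cstddef>
#include <vector>

// Scalar reference models of the FIR and LMS recursions in Table 1.
// Zero-padding is assumed for x(m) with m < 0; names are illustrative.

// FIR: y(n) = sum_{k=0}^{N-1} h(k) * x(n - k)
float fir_sample(const std::vector<float>& h, const std::vector<float>& x, std::size_t n) {
    float y = 0.0f;
    for (std::size_t k = 0; k < h.size() && k <= n; ++k)
        y += h[k] * x[n - k];
    return y;
}

// One LMS update: y(n) = w^T(n)x(n), e(n) = d(n) - y(n), w(n+1) = w(n) + 2*mu*e(n)*x(n)
float lms_step(std::vector<float>& w, const std::vector<float>& x, float d, float mu) {
    float y = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i) y += w[i] * x[i];
    const float e = d - y;                                    // instantaneous error
    for (std::size_t i = 0; i < w.size(); ++i) w[i] += 2.0f * mu * e * x[i];
    return e;
}
```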
In recent years, performance demands on DSP technology have risen continuously, especially regarding the latency of individual algorithms[20,21]. However, existing DSP chips have not been updated for several years and offer limited performance. In contrast, a growing number of high-performance DSP algorithm accelerators have been proposed on FPGA, which can run DSP algorithms 2–300× faster than existing DSP chips, as indicated in Table 2. Therefore, the FPGA is a worthwhile option for implementing DSP algorithms[22]. The throughput in Table 2 is reported in millions of samples per second (MS/s) and is obtained by analyzing the processing time of the samples.
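As a quick sanity check of how the MS/s figures relate to processing time (the 10.6 µs value below is back-computed from Table 2 rather than an independently reported measurement):

$$\text{Throughput}=\frac{N_{\text{samples}}}{T_{\text{proc}}},\qquad \frac{1024\ \text{samples}}{\approx 10.6\ \mu\text{s}}\approx 97\ \text{MS/s},$$

which reproduces the 1024-point FFT throughput listed for the C6678 in Table 2.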
Designs | Platform | Algorithm | Point | Throughput (MS/s)
Garrido et al.[23] | Virtex-7 XC7VS332T | FFT | 1024 | 2720.00
Potsangbam et al.[24] | Virtex-7 XC7VS335T | FIR | \ | 195.00
Potsangbam et al.[24] | Virtex-7 XC7VS335T | IIR | \ | 212.00
Ezilarasan et al.[25] | Virtex-5 XC5VLX30 | LMS | \ | 112.00
DSP SoC | TMS320C6678 | FFT | 1024 | 97.00
DSP SoC | TMS320C6678 | FIR | \ | 57.00
DSP SoC | TMS320C6678 | IIR | \ | 116.00
DSP SoC | TMS320C6678 | LMS | \ | 0.35
However, mainstream FPGA-based DSP implementations are primarily tailored to a single DSP algorithm and cannot simultaneously handle diverse algorithmic workflows. A feasible way to accelerate different algorithms is to deploy multiple DSP IPs on the FPGA. Nevertheless, such designs often require a trade-off among the number of supported algorithms, the FPGA hardware resources, and the throughput of the IPs. Moreover, deploying multiple IPs on an FPGA leaves limited resources for each IP, and for most algorithms fewer allocated resources lead to lower throughput, which restricts the performance of each IP.
Several SoC designs that support multiple DSP algorithms, such as the C6678, are widely used in industry. The C6678 is based on a SIMD instruction architecture and generates corresponding instructions for different algorithms through a specialized compiler. However, the C6678 still suffers from wasted resources, since multiple computing cores cannot be leveraged simultaneously to accelerate a single task. Moreover, its practical performance on multi-task workloads is severely limited by transmission speed. Constrained by the available transmission bandwidth, the data transfer time of the C6678 often exceeds the computation time. This limitation grows with the number of cores and leads to performance saturation, as shown in Figure 1.
In order to address the above challenges, we devise an FPGA-based overlay processor called DSP-OPU to support the acceleration of various DSP algorithms with multiple RCEs and less performance saturation. We introduce a multi-level datapath and integrate it with RCEs efficiently, achieving complete separation of instructions and data, which enhances overall throughput and computation efficiency. A flexible ISA is also proposed in our work with a user-friendly compiler. The compiler can transform the requirements of users into detailed instruction streams, which can be decoded and executed on the underlying hardware architecture.
A substantial challenge in designing the OPU lies in the microarchitecture. The microarchitecture must minimize control overhead while preserving the convenience of runtime adjustment and functionality. To address this challenge, we have developed a unique data path module that facilitates parameterized, register-based customization and mode switching for the data path. These parameter registers directly receive parameters provided by instructions, enabling seamless transitions between different data paths. Additionally, this data path module can establish connections between different computation engines, forming a reconfigurable pipeline structure for high-performance DSP algorithm computations. Furthermore, we propose a multi-engine system with shared memory to meet the high-bandwidth demands of data access.
We deploy DSP-OPU on FPGA and design a dedicated instruction set on the software side. The microarchitecture is depicted in Figure 2. Our DSP-OPU primarily consists of the following components: data pre-processing, the instruction subsystem, the Level 3 data path (L3DP), the multiple-engine shared memory (MESM), the multiple-engine shared memory controller (MESMC), the processing element of the reconfigurable pipeline (PERP), and the reconfigurable computation engines (RCE). Upon receiving user-supplied parameters and the data to be processed, the compiler first converts them into corresponding instructions while simultaneously handling the data. Subsequently, the instructions and data are transmitted for execution on the hardware processor. A detailed description of the instruction set and compiler is provided in the following sections.
The data path module serves as the core of DSP-OPU, and the proposed microarchitecture, instruction set, and compiler are all designed around the data path. The data path realizes the majority of data scheduling in the architecture. Once the data path is constructed, input data are continuously transmitted along it. Therefore, this architecture possesses innate hardware advantages for accelerating algorithms with fixed computation flows. To reduce logical complexity and resource consumption, the data path of our DSP-OPU is divided into three levels, i.e., L1DP, L2DP, and L3DP. In L1DP, the data paths mainly exist between the adder, multiplier, register, and transformation unit, and these units form a computing module, the PERP. In L2DP, the data paths mainly exist between different PERPs, and four PERPs, together with the RCE input and output modules, form an RCE. In L3DP, the data paths mainly exist between the architecture input and output, the multiple RCEs, and the MESM.
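As a reading aid, the hierarchy above can be summarized with the following hypothetical C++ data model. The struct names, member names, and the pair-based connection representation are our own illustration; the actual design is Verilog RTL.

```cpp
#include <array>
#include <utility>
#include <vector>

// Hypothetical software model of the three-level data path hierarchy.
// Names are ours; the real design is implemented in Verilog RTL.

struct ComputeUnit { enum class Kind { Adder, Multiplier, Register, Transform } kind; };

struct PERP {                                  // processing element of reconfigurable pipeline
    std::array<ComputeUnit, 4> units;          // adder, multiplier, register, transformation unit
    std::vector<std::pair<int, int>> l1dp;     // L1DP: connections between the units above
};

struct RCE {                                   // reconfigurable computation engine
    std::array<PERP, 4> perps;                 // four PERPs per RCE
    std::vector<std::pair<int, int>> l2dp;     // L2DP: connections between PERPs
};

struct DspOpu {
    std::vector<RCE> rces;                     // 2 on the 325T, 8 on the U200
    std::vector<std::pair<int, int>> l3dp;     // L3DP: I/O, RCEs, and MESM connections
};
```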
To meet the highly parallel computing demands of DSP algorithms, a distinctive computing module housing multiple RCEs has been devised. Balancing control complexity and parallelism, four PERPs are employed instead of directly interconnecting all computing modules within the RCE. Within a single RCE, while maintaining a constant number of internal computing modules, increasing the number of PERPs significantly reduces the number of data paths and concurrently lowers pipeline complexity. To strike a balance between control overhead and reconfigurable pipeline complexity, various metrics of the RCE are evaluated, as shown in Figure 3. Pipeline complexity indicates the number of different pipeline types into which a single computational PERP can be reconfigured, while pipeline length denotes the number of PERPs in a single RCE. Each pipeline type corresponds to a computational process that a PERP can support. Considering control overhead, reconfigurable pipeline complexity, and the deployable PERP count, and based on the curve intersections and instruction design in Figure 3, we settle on four PERPs per RCE.
Within the RCE, PERPs are connected via the Level 2 RCE data path (L2DP), as illustrated in Figure 4. All PERPs are connected through the data path. Depending on the data path selected, a PERP can transmit data to the next PERP, obviating the need for data scheduling through caches. It is worth noting that the simplified representation of data paths in the figure understates the actual count, since each PERP encompasses multiple computing modules. The selection of paths is specified by instructions, enabling the RCE to be reconfigured into different pipelines through the data paths and thereby substantially improving the hardware-level optimization of each algorithm.
It is noteworthy that DSP-OPU achieves strong scalability and compatibility with new digital signal processing algorithms: on FPGAs with more resources, the number of RCEs can simply be increased. This scalability is achieved by concatenating additional computation modules, enhancing parallelism and pipeline length without significantly increasing control complexity. The instruction set already reserves instructions for switching among RCEs, so no modification of the ISA is required.
Each PERP comprises three functional modules: an addition module, a multiplication module, and a transformation module, as shown in Figure 3. The modules within a PERP are connected through the first-level processing element data path (L1DP), using the same connectivity method as the L2DP.
The addition module can simultaneously perform four 32-bit floating-point additions or one 64-bit floating-point addition. Likewise, the multiplication module can simultaneously execute four 32-bit floating-point multiplications or one 64-bit floating-point multiplication. The transformation module primarily performs logical operations and cache control on data, such as memory protection, data shifting, data rearrangement, and similar operations. Because different operations require different numbers of clock cycles, and to ensure that each piece of data enters a module at the correct time, each module can produce delayed or continuous data output. The required delay is specified, together with the module's state, by the input instructions.
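A minimal functional sketch of the mode switching described above, assuming a simple enum-based configuration; the type names, configuration fields, and delay representation are illustrative and not the actual RTL interface.

```cpp
#include <array>
#include <cstdint>

// Functional sketch of the addition module's two modes and delay control.
// Type and field names are illustrative, not the actual RTL interface.

enum class AddMode : std::uint8_t { Four32BitAdds, One64BitAdd };

struct AddModuleConfig {
    AddMode  mode         = AddMode::Four32BitAdds;
    unsigned delay_cycles = 0;                 // delay before results are driven out
};

// Mode 1: four 32-bit floating-point additions in parallel.
std::array<float, 4> add32x4(const std::array<float, 4>& a, const std::array<float, 4>& b) {
    return { a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3] };
}

// Mode 2: one 64-bit floating-point addition.
double add64(double a, double b) { return a + b; }
```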
Given the DSP-OPU structure, there is no need for instructions and data to be input simultaneously. Additionally, there are no extra requirements for the order of instruction input. Therefore, the instructions generated by the compiler will be stored in the instruction cache module. After retrieving these instructions from the instruction cache, the unpacking module first segments them into control instructions for different modules and transmits them to the decoders of each module. Instructions destined for multiple RCEs also undergo further processing through a multi-engine navigator to ensure rapid and precise instruction input.
The architecture we propose ensures versatility by configuring different data paths, achieving high parallelism for various functionalities. The L3DP, located external to the RCE, is a critical component ensuring structural generality and is primarily divided into three parts: the external input-output data path (EIODP), memory input-output data path (MIODP), and computation module input-output data path (CIODP). The data path connections among these three parts are similar to L2DP.
The EIODP in our DSP-OPU supports up to eight 32-bit data paths or four 64-bit data paths for input or output, with the data paths directed by instructions either to the RCE or to the BRAM components in the architecture. Without BRAM caching, sending data directly through the CIODP to the RCE reduces instruction complexity and computation delay and enhances the versatility of the structure. When BRAM caching is needed, the design satisfies the data retrieval rules, preserving the high parallelism of our architecture and supporting algorithms such as FFT that require repeated data access.
The MIODP consists of 32 channel groups with 128 channels in total. Each channel is constructed to include two 32-bit data paths, facilitating efficient data transmission and processing. What sets the MIODP apart is its remarkable flexibility in data handling. The MIODP can support up to 256 32-bit data paths or 128 64-bit data paths for input or output according to related instructions. In order to ensure the flexibility of data access, each channel group can be connected to all other channel groups within the CIODP or EIODP.
The CIODP is equipped with 16 input channel groups and 16 output channel groups, with 128 channels in total, and each channel includes two 32-bit data paths. Unlike the MIODP, the functions of the channel groups in the CIODP are fixed and do not support switching between input and output. To minimize control complexity and ensure the versatility of the structure, the CIODP is configured by channel group, similar to the MIODP.
According to the computation flows of various DSP algorithms, some algorithms require frequent access to previously used data. To ensure efficient parallel data retrieval, data caching is necessary. The caching capacity can be specified by allocating different block RAM (BRAM) according to the amount of resources in the target FPGA. The BRAM module in our DSP-OPU includes 32 BRAM groups, each with 4 BRAM instances and eight 32-bit data paths. This configuration supports up to 256 32-bit data paths or 128 64-bit data paths simultaneously. For simplified control and consistency, the channels in each BRAM group are mapped one-to-one with the channels in the MIODP.
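For reference, the data path and memory parameters quoted in this subsection can be collected as constants; this is a reading aid only, and the identifiers are ours rather than names from the design files.

```cpp
// Data path and memory parameters quoted above, collected as constants.
// Identifiers are illustrative; values are taken from the text.
constexpr unsigned kEiodpMax32BitPaths = 8;    // EIODP: up to 8 x 32-bit or 4 x 64-bit paths
constexpr unsigned kMiodpChannelGroups = 32;   // MIODP: 128 channels, 2 x 32-bit paths each
constexpr unsigned kMiodpMax32BitPaths = 256;  // or 128 x 64-bit paths
constexpr unsigned kCiodpInputGroups   = 16;   // CIODP: fixed-direction channel groups
constexpr unsigned kCiodpOutputGroups  = 16;   // 128 CIODP channels in total
constexpr unsigned kBramGroups         = 32;   // 4 BRAM instances per group
constexpr unsigned kBramMax32BitPaths  = 256;  // mapped one-to-one onto MIODP channels
```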
Leveraging the distinctive structure of DSP-OPU, there is no simultaneous entry requirement for instructions and data into the FPGA. Consequently, the compiler transfers data to the FPGA in the form of instruction blocks and data blocks. Preliminary processing of the data received from the compiler is essential to identify instruction and data blocks. Subsequently, these blocks are transformed into the necessary data structures for transmission to the instruction subsystem or L3DP.
A concise and efficient instruction set is of paramount importance for a reconfigurable processor[26,27]. In our DSP-OPU, we classify instructions into two types: computational instructions (C-type) and memory instructions (M-type). The C-type instructions are 32 bits long and are responsible for specifying the states of the computing modules and the configurations of the various data paths. The M-type instructions are also 32 bits long and are responsible for configuring memory addresses and memory states and for initiating data transfers.
The instruction set design enables our DSP-OPU to dynamically generate instruction streams that can be directly executed on the hardware circuits. Once loaded onto the FPGA board, the instruction streams control the operation mode of the underlying hardware modules. On the software side, the compiler analyzes user requirements, i.e., algorithm type, data type, data length, number of computation points, and filter order, and transforms them into instruction streams with specific configurations. The DSP-OPU ISA includes instructions that control data path connections, computation module states, and data access. By calling and combining these instructions, we can describe the computation process and support the execution of arbitrary DSP algorithms. Note that our DSP-OPU can support multiple DSP algorithms simultaneously by specifying the number of algorithms together with the required types and configurations.
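A hypothetical sketch of the user-side request that the compiler analyzes, with one field per parameter listed above; the struct and field names are illustrative and not the compiler's actual interface.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical front-end description of a user request, one field per
// parameter the compiler analyzes. Names are illustrative only.

enum class Algorithm : std::uint8_t { FFT, FIR, IIR, LMS };
enum class DataType  : std::uint8_t { Float32, Float64 };

struct DspRequest {
    Algorithm   algorithm = Algorithm::FFT;
    DataType    data_type = DataType::Float32;
    std::size_t data_len  = 0;   // number of input samples
    std::size_t points    = 0;   // number of computation points, e.g., a 1024-point FFT
    std::size_t order     = 0;   // filter order for FIR/IIR/LMS
};
```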
The instruction operational process of the proposed RCE is illustrated in Figure 5. The beginning of the computation is predetermined by the Compute Flag Signal (CFS) instruction. At t1, the compiler initiates the transmission of instructions to the hardware in the form of instruction blocks. The order of instruction block transmission is determined by the number of instructions of each functional type rather than by the instruction functionality; specifically, we prefer to transmit the blocks containing more instructions first. This is because the construction of data paths does not require a specific order, but the execution speed of instructions of the same functionality is limited: the more instructions of the same functional type, the longer the execution time required. The underlying hardware executes instructions from t1 to t2, and the execution time depends on the total number of instructions of the current instruction type. At t2, the transfer of algorithmic data begins, triggering post-processing at t3. Data processing and output end at t4, followed by the configuration of the next algorithm. The configuration time varies with the complexity of the different algorithms.
Based on the above execution process, we realize a complete separation of instructions and data, which are not input at the same time. The advantage of this approach is that instruction execution is simplified and there is no need to adjust the instruction input order. Moreover, during instruction optimization, once the instructions and data have been transferred to a certain core, no further operations are needed for that core. As a result, while one core is processing data, other cores can be configured in parallel, which makes full use of the hardware caching and significantly improves the throughput of multi-core tasks.
The C-type instructions are determined by the opcode and can be divided into four types.
Module state specification. The module state specification (state) instruction defines the operation type and synchronization behavior of the computation modules within a PERP. It controls the operation type within a single PERP and determines whether a delayed operation is necessary, along with the delay duration. For example, an adder can be configured for operations such as 32-bit floating-point addition, 32-bit floating-point subtraction, or 64-bit floating-point addition. Given the similarity among various operations, e.g., addition and subtraction, configuring the operation type through instructions keeps the control overhead minimal while ensuring synchronized data output for operation types with different computation cycles.
Inter-PE connection. The inter-PE connection (IPC) instruction is responsible for the connectivity of the L1DP within the RCE. The L1DP partially replaces the data scheduling instructions found in general-purpose instruction sets for the engines. Moreover, it significantly enhances the data scheduling capability of the engines, particularly in the context of DSP algorithms.
Inter-engine connection. The inter-engine connection (IEC) instruction specifies the connectivity of the L2DP between engines. Serving as a bridge between the L1DP and the L3DP, the L2DP significantly reduces the resource overhead of data path control. The L2DP also simplifies multi-engine optimization: multiple engines can collaboratively accelerate a single algorithm by being connected in series through the L2DP.
External engine connection. External engine connection (EEC) instruction is utilized to specify the connectivity of the L3DP for engines. L3DP functions as a bridge connecting external data, MESM, and RCE. The existence of L3DP broadens the algorithm support range for the entire processor. Complex algorithms can be supported by employing operations such as data caching and rescheduling through L3DP.
As mentioned above, C-type instructions are primarily employed for reconfiguring computation modules and various levels of data paths. The construction of diverse pipelines using different data paths enables the realization of a wide range of algorithmic operations. In multiple RCE architectures, the number of RCEs directly influences the complexity and total length of the pipelines. Thus, the number of RCEs directly affects the variety and types of algorithms that can be supported. Simultaneously, this method efficiently utilizes hardware computation resources through a pipelined format, avoiding the need for complex data scheduling operations and effectively harnessing the abundant computational resources of FPGA.
The M-type instructions, also 32 bits long, perform configuration operations related to memory functionality. The three types of M-type instructions are defined as follows:
BRAM group address specification. BRAM group address specification (Addr) instruction is designed to identify the BRAM group responsible for storing data and the initial address within that BRAM group. Due to various data requirements across multiple RCEs, this instruction is essential for precise data allocation, specifying exact data storage addresses, and providing offset addresses to ensure the accurate reading and storing of different data.
Memory state specification. Memory state specification (MSS) instruction is designed to specify the status of individual channels in the memory. It encompasses details such as whether the channels are enabled, operating in data input or data output mode, whether the output is looped, the total number of outputs, and other relevant parameters.
Computation flag signal. The compute flag signal (CFS) instruction controls the output of data from the memory. It marks the commencement of computation, at which point data start to flow out of the BRAM. A single CFS instruction suffices to initiate each computation stage. For algorithms that buffer data through memory, it is necessary to wait for the buffering to complete before the instruction takes effect; consequently, the instruction can specify the number of cycles by which execution is delayed. Algorithms that do not buffer data through memory do not need to issue this instruction.
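To make the 32-bit instruction formats concrete, the following sketch encodes the seven instruction kinds above. The opcode values and the 8-bit-opcode/24-bit-payload split are assumptions for illustration only, since the actual bit layout of the ISA is not specified here.

```cpp
#include <cstdint>

// Illustrative encoder for 32-bit C-type and M-type instruction words. The
// opcode values and the opcode/payload split are assumptions, not the real ISA.

enum class Opcode : std::uint8_t {
    State = 0x01,  // module state specification (C-type)
    IPC   = 0x02,  // inter-PE connection        (C-type)
    IEC   = 0x03,  // inter-engine connection    (C-type)
    EEC   = 0x04,  // external engine connection (C-type)
    Addr  = 0x11,  // BRAM group address         (M-type)
    MSS   = 0x12,  // memory state specification (M-type)
    CFS   = 0x13   // compute flag signal        (M-type)
};

constexpr std::uint32_t encode(Opcode op, std::uint32_t payload) {
    return (static_cast<std::uint32_t>(op) << 24) | (payload & 0x00FFFFFFu);
}

// Example: a hypothetical CFS word asking the memory to delay output by 16 cycles.
constexpr std::uint32_t cfs_delay_16 = encode(Opcode::CFS, 16);
```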
Compilers serve as vital tools that bridge the gap between user requirements and low-level hardware computation. We carefully design a dedicated compiler, specifically tailored for our DSP-OPU, to enable end-to-end execution. The compiler parses models and optimizes data structures to reduce the overhead of algorithm execution. Through compilation, it transforms DSP algorithm models into a series of instructions executable by DSP-OPU. These instructions are then transmitted via PCIe to the cache of our DSP-OPU and subsequently executed by the decoding module. As shown in Figure 6(a), the end-to-end compiler consists of three stages: model parsing, instruction generation, and data optimization. During the model parsing and instruction generation stages, the compiler's primary task is to extract essential information from the model file and generate the necessary instruction sequences. In the data optimization stage, the compiler structurally optimizes the model file and instruction sequences based on the hardware characteristics.
We abandon the conventional general-purpose instruction architecture, trading off some universality to achieve processor performance comparable to custom accelerators. The instructions rely heavily on the interconnections between the various levels of data paths. Consequently, even for algorithms whose differences are minor, the types and quantities of the corresponding instructions can differ significantly. This poses a significant challenge in designing a compiler that supports a range of algorithms.
Within the compiler, two instruction libraries are introduced: one contains pre-set algorithmic instruction sequences, and the other comprises general computational-flow instruction sequences. The compiler can adeptly assemble these instruction fragments, generating the corresponding instruction sequences for various algorithms or models of different sizes, thereby minimizing situations in which certain algorithms are left entirely unsupported.
To fully harness these two instruction libraries, the compiler is initialized by determining the current algorithm type. The code supporting the algorithms is stored within the compiler. In the first phase of the model parsing stage, the compiler deduces the current digital signal processing algorithm type from the user's invocations of pre-stored algorithmic functions and extracts the required parameters. Additionally, the compiler can infer the needed algorithm type or instruction fragments from the description provided by the user.
Following the details of the algorithm type and parameters, the compiler determines whether the current algorithm is part of the existing algorithms. If so, the compiler retrieves the instruction fragments for that algorithm directly from the existing algorithmic instruction library and assembles the required instruction sequence. If not, the compiler chooses appropriate instruction segments from the computational flow instruction library based on the computational process of the current algorithm and generates the instruction sequence.
To fully exploit the hardware capability, it is crucial to further optimize the instructions and data based on the hardware architecture. DSP-OPU boasts multiple computation engines, and the key challenge in instruction optimization lies in achieving the multi-engine acceleration for algorithms. To tackle this challenge, we have integrated two additional instruction libraries into the compiler: the engine instruction library and the memory instruction library. Similar to the existing algorithmic instruction library and computational flow instruction library, these two new libraries mainly store instruction fragments related to the linkage methods of engines and memory.
At the outset of the optimization phase, the compiler evaluates the scale of the current task based on the model parameters obtained in the previous stage. Subsequently, it selects the necessary instruction fragments from the two instruction libraries and integrates them into the existing instruction sequence to generate an optimized instruction sequence.
After optimizing the instruction sequence, we ensure the full utilization of all engines within the hardware. However, achieving accurate computations requires us to confirm the correct allocation of model data to their respective computation engines. According to the instruction design in our DSP-OPU, our instruction sequence imposes no requirements on the input order. Therefore, we evaluate the current method of instruction partitioning and data storage based on the data path configuration information and memory address details within the instructions. Using this information as a foundation, we appropriately partition the data and instructions.
We package the optimized instructions and data into data blocks, as shown in Figure 6(b), and transmit them to the hardware. Considering the notable differences in the quantities of the different instruction types, we prioritize transmitting the instruction types with larger quantities to maximize the transfer rate and hardware decoding efficiency. In addition, we employ distinct identification markers for partitioning, ensuring that the hardware can correctly distinguish between different types of instructions and data.
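The packaging and ordering rule can be sketched as follows, assuming hypothetical marker values and a flat 32-bit word stream; the real block format is not specified in the text.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of the packaging step: instruction groups with more instructions are
// transmitted first, and each block is delimited by an identification marker.
// Marker values and the flat 32-bit stream layout are assumptions.

struct InstrGroup {
    std::uint8_t type_marker;                 // identifies the instruction type
    std::vector<std::uint32_t> words;         // 32-bit instruction words
};

std::vector<std::uint32_t> package_blocks(std::vector<InstrGroup> groups,
                                          const std::vector<std::uint32_t>& data_block) {
    // Prioritize the instruction types with larger quantities.
    std::sort(groups.begin(), groups.end(),
              [](const InstrGroup& a, const InstrGroup& b) { return a.words.size() > b.words.size(); });

    std::vector<std::uint32_t> stream;
    for (const auto& g : groups) {
        stream.push_back(0xA0000000u | g.type_marker);       // instruction-block marker
        stream.insert(stream.end(), g.words.begin(), g.words.end());
    }
    stream.push_back(0xB0000000u);                           // data-block marker
    stream.insert(stream.end(), data_block.begin(), data_block.end());
    return stream;
}
```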
For the implementation of DSP-OPU, we deploy our work on the 325T FPGA and the U200 FPGA. Two RCE units are deployed on the 325T, and eight RCE units are deployed on the U200. We implement DSP-OPU in Verilog HDL and synthesize it with Vivado 2020.1. The compiler is designed and implemented in C++.
In the theoretical evaluation, we compare the theoretical and practical acceleration of DSP-OPU with the C6678 SoC, a widely used and high-performance digital signal processor, as shown in Table 3. The theoretical values for the C6678 are derived from its official documentation for 32-bit floating-point data. In practice, we use the throughput of LMS as the benchmark for comparison. The actual acceleration of DSP-OPU far surpasses the theoretical ratio relative to the C6678, owing to the efficient data path mechanism that achieves acceleration effects comparable to custom accelerators.
Designs | Theoretical RCE performance (GMAC/s) | Practical throughput (MS/s) |
C6678 | 44.89 (1×) | 0.952 (1×) |
Ours-325T | 38.4 (0.86×) | 10 (10.50×) |
Ours-U200 | 71.6 (1.60×) | 18.67 (19.61×) |
We also evaluate the proportion of different data in transmission, as depicted in Figure 7. Specifically, we present a statistical analysis of the proportions of logic data and instruction data for various algorithms at different sizes. It is noticeable that the instruction data are much smaller in volume. This makes our architecture more effective, as the data transmitted from the host are mainly used for logic computation, alleviating the performance saturation caused by transmission bandwidth limitations in multi-task scenarios.
In the context of the FFT algorithm, we establish a 1024-point radix-2 FFT as a reference benchmark. For filtering algorithms, our attention is focused on 32-tap LMS, 5th-order IIR, and 32-tap FIR filters, and the performance on the C6678 SoC is also included. Specifically, the results of both the C6678 SoC and DSP-OPU are based on 32-bit floating-point data.
Due to the absence of support for LMS and IIR algorithms in Xilinx IP, and the suboptimal throughput of the 1024-point FFT IP at 64 MS/s, we implement customized IPs based on Xilinx IP to address these limitations. The customized IPs employ 32-bit fixed-point data, providing certain advantages in terms of resource utilization and throughput. The customized IP leverages a similar implementation approach to DSP-OPU, except that it encapsulates the implementation of each algorithm as a separate IP, in order to evaluate the resource consumption when deploying multiple IPs simultaneously.
Table 4 details the hardware resource utilization. In this context, the 325T employs 2 RCEs, and the U200 utilizes 8 RCEs. The U200 still retains unused resources suitable for additional RCEs; we refrain from deploying more RCEs to avoid the potential frequency degradation associated with excessive data paths. We further summarize the performance of the DSP-OPU overlay compared with other works in Table 5, reporting the frequency (MHz), latency (ns), throughput (MS/s), power (W), and LUT, FF, DSP, and BRAM utilization. The power consumption in Table 5 is obtained from the Vivado Power Estimator analysis report, which reflects the design characteristics of the FPGA chip.
FPGA | Num of RCE | LUT Usage | LUT Utilization | FF Usage | FF Utilization | DSP Usage | DSP Utilization | BRAM Usage | BRAM Utilization
XC7K325T | 2 | 197,624 | 96.97% | 118,736 | 29.13% | 512 | 60.95% | 256 | 57.53%
Alveo U200 | 8 | 770,496 | 65.17% | 474,944 | 20.09% | 2048 | 29.94% | 256 | 11.85%
Designs | Platform | WL | Alg | Order | Point | Freq (MHz) | Latency (ns) | Throughput (MS/s) | Power (W) | LUT | FF | DSP | BRAM
Pakize et al.[28] | Virtex-7 XC7VS330T | 16 | FFT | \ | 1024 | 339 | \ | 1261 | \ | 12.5k | 22.5k | 150 | 6
Garrido et al.[23] | Virtex-6 XC6VSX475T | 16 | FFT | \ | 1024 | 475 | \ | 1900 | \ | 10.3k | 10.3k | 12 | 0
Garrido et al.[23] | Virtex-7 XC7VS330T | 16 | FFT | \ | 1024 | 680 | \ | 2720 | 1.6 | 10.5k | 10.5k | 12 | 0
Ezilarasan et al.[25] | Virtex-4 XC4VFX12 | \ | LMS | 32 | \ | 109 | 9.152 | 109 | \ | \ | \ | \ | \
Ezilarasan et al.[25] | Virtex-5 XC5VLX30 | \ | LMS | \ | \ | 111 | 8.95 | 112 | \ | \ | \ | \ | \
Potsangbam et al.[24] | Virtex-7 XC7AS200T | \ | FIR | 3 | \ | 120 | 5.12 | 195 | 8.053 | \ | \ | \ | \
Potsangbam et al.[24] | Virtex-7 XC7AS200T | \ | IIR | 1 | \ | 120 | 4.72 | 212 | 5.293 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | FFT | \ | 1024 | 1400 | \ | 135.8 | 15.00 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | LMS | 16 | \ | 1400 | 1044 | 0.952 | 15.00 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | IIR | 7 | \ | 1400 | 105 | 9.52 | 15.00 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | FIR | 32 | \ | 1400 | 12.56 | 79.8 | 15.00 | \ | \ | \ | \
Customized IP | Xilinx Alveo U200 | 32 | FFT | \ | 1024 | 200 | \ | 200 | 6.32 | 59k | 116k | 153 | 58
Customized IP | Xilinx Alveo U200 | 32 | LMS | 16 | \ | 200 | 20.08 | 50 | 6.32 | 59k | 116k | 153 | 58
Customized IP | Xilinx Alveo U200 | 32 | IIR | 7 | \ | 200 | 5.09 | 196 | 6.32 | 59k | 116k | 153 | 58
Customized IP | Xilinx Alveo U200 | 32 | FIR | 32 | \ | 200 | 5.09 | 197 | 6.32 | 59k | 116k | 153 | 58
DSP-OPU | Kintex-7 XC7K325T | 32 | FFT | \ | 1024 | 150 | \ | 480 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Kintex-7 XC7K325T | 32 | LMS | 16 | \ | 150 | 100 | 10 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Kintex-7 XC7K325T | 32 | IIR | 7 | \ | 150 | 6.67 | 150 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Kintex-7 XC7K325T | 32 | FIR | 32 | \ | 150 | 6.67 | 150 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | FFT | \ | 1024 | 280 | \ | 896 | 12.3 | 770k | 475k | 2048 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | LMS | 16 | \ | 280 | 53.57 | 18.67 | 12.3 | 770k | 475k | 2048 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | IIR | 7 | \ | 280 | 3.57 | 280 | 12.3 | 770k | 475k | 2048 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | FIR | 32 | \ | 280 | 3.57 | 280 | 12.3 | 770k | 475k | 2048 | 256
Our implementation is superior to the C6678 SoC in throughput and processing speed, mainly owing to the high parallelism of the DSP-OPU engines. In terms of FFT performance, our architecture deployed on the U200 FPGA achieves 6.6× higher throughput than the C6678. For the LMS algorithm, the throughput is improved by 19×. For the filter algorithms, we are 3.5–29× faster with 1.2× lower power consumption at the same filter order. Compared with the customized IP, the latency of DSP-OPU is higher for LMS due to the longer cycle count of the floating-point data type; for the other algorithms, the latency is essentially the same across data types.
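These ratios follow directly from the throughput figures in Table 5:

$$\frac{896}{135.8}\approx 6.6\ (\text{FFT}),\qquad \frac{18.67}{0.952}\approx 19.6\ (\text{LMS}),\qquad \frac{280}{9.52}\approx 29.4\ (\text{IIR}),\qquad \frac{280}{79.8}\approx 3.5\ (\text{FIR}).$$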
Note that the four customized IPs can achieve execution throughput similar to DSP-OPU. However, they support only the four algorithms in Table 5, and with significant limitations. Specifically, for the FFT algorithm, the customized IPs cannot support more than 1024 points. For the LMS algorithm, the customized IPs cannot support filter orders greater than 16, and they likewise cannot support IIR and FIR algorithms with larger orders. In contrast, our DSP-OPU is compatible with complex algorithm configurations and can support up to 100,000 points for the FFT algorithm and up to 128 filter orders for the FIR algorithm.
Compared with the algorithm accelerators deployed on FPGA, our architecture demonstrates superior performance in all areas except FFT. In contrast to the accelerators designed by Pakize et al.[28] and Garrido et al.[23], our architecture uses 32-bit floating-point data, while their accelerators use 16-bit fixed-point data. Doubling the word length within the same architecture results in roughly a fourfold reduction in throughput. Additionally, floating-point computation generally offers better precision but higher latency than fixed-point computation. If the same word length were used, our architecture could achieve equal or even better throughput than the other architectures.
In Figure 8, the DSP-OPU architecture is compared with the C6678 chip in terms of execution time for different multi-task scenarios. We use a workload from the radar domain[29], in which 16 segments of data are processed by FFT, FIR, and IFFT operations. Each segment consists of 1024 32-bit floating-point samples, and both the ideal and practical times are compared. The ideal time is the sum of the computation times of the algorithms on the cores, while the practical extra time includes the transmission time for instructions and algorithm data. The transmission rate is determined by the PCIe rate supported by the C6678.
It is evident that DSP-OPU outperforms the C6678 by a significant margin in both ideal time and practical extra time, and its proportion of practical extra time is also lower than that of the C6678. If the instruction data of DSP-OPU were as large as those of the C6678, the proportion of practical extra time would increase; even so, DSP-OPU's lower proportion of practical extra time reflects its higher transmission efficiency.
In Table 6, we compare the influence of different numbers of computing engines. With more computing engines, longer reconfigurable pipelines are available, making the design more suitable for accelerating complex algorithms. To facilitate the performance analysis of the multi-core design, we set the operating frequency of both the 325T and U200 FPGAs to 150 MHz. When deployed on the 325T, our architecture has two computing engines, and there is a significant decrease in throughput once the order of the LMS algorithm exceeds 16. On the other hand, when the architecture is deployed on the U200 with 8 computing engines, it can support higher-order LMS algorithms without any change in throughput. For the FFT algorithm, the throughput depends on the number of points and is largely unaffected by the increase in the number of cores. The implementation of the FIR algorithm is relatively simple, and two computation engines can already support a filter order of 64, which indicates that the number of engines does not affect FIR throughput.
Platform | Num of RCE | Algorithm | Order | Point | Frequency (MHz) | Throughput (MS/s)
XC7K325T | 2 | FFT | \ | 1024 | 150 | 480
XC7K325T | 2 | FFT | \ | 2048 | 150 | 436
XC7K325T | 2 | FFT | \ | 4096 | 150 | 400
XC7K325T | 2 | FIR | 16 | \ | 150 | 150
XC7K325T | 2 | FIR | 32 | \ | 150 | 150
XC7K325T | 2 | FIR | 64 | \ | 150 | 150
XC7K325T | 2 | IIR | 16 | \ | 150 | 150
XC7K325T | 2 | IIR | 32 | \ | 150 | 150
XC7K325T | 2 | IIR | 64 | \ | 150 | 35
XC7K325T | 2 | LMS | 16 | \ | 150 | 10
XC7K325T | 2 | LMS | 32 | \ | 150 | 2
XC7K325T | 2 | LMS | 64 | \ | 150 | 0.8
Alveo U200 | 8 | FFT | \ | 1024 | 150 | 480
Alveo U200 | 8 | FFT | \ | 2048 | 150 | 436
Alveo U200 | 8 | FFT | \ | 4096 | 150 | 400
Alveo U200 | 8 | FIR | 16 | \ | 150 | 150
Alveo U200 | 8 | FIR | 32 | \ | 150 | 150
Alveo U200 | 8 | FIR | 64 | \ | 150 | 150
Alveo U200 | 8 | IIR | 16 | \ | 150 | 150
Alveo U200 | 8 | IIR | 32 | \ | 150 | 150
Alveo U200 | 8 | IIR | 64 | \ | 150 | 150
Alveo U200 | 8 | LMS | 16 | \ | 150 | 10
Alveo U200 | 8 | LMS | 32 | \ | 150 | 10
Alveo U200 | 8 | LMS | 64 | \ | 150 | 10
In Figure 9, we compare the energy efficiency of the DSP-OPU architecture with that of the C6678 chip and the customized IP. The efficiency is calculated by dividing throughput by power. Note that the graph uses a logarithmic scale, so the actual differences are larger than they appear in Figure 9. The efficiency of our architecture deployed on the 325T is 8.3–70× better than that of the C6678.
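Using the throughput and power figures in Table 5, the endpoints of this range can be reproduced as

$$\frac{150/3.4}{79.8/15}\approx 8.3\ (\text{FIR}),\qquad \frac{150/3.4}{9.52/15}\approx 70\ (\text{IIR}),$$

where each ratio divides the DSP-OPU 325T efficiency (MS/s per watt) by the corresponding C6678 efficiency.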
In this paper, we propose DSP-OPU, a reconfigurable FPGA-based overlay processor for implementing various DSP algorithms. First, we introduce a reconfigurable data path that allows a reconfigurable pipeline for a specific algorithm to be built by connecting multiple computing cores through instructions. Second, we design an instruction set for DSP-OPU, ensuring the flexibility of our design. A simple and efficient compiler is also introduced to schedule and optimize the computation flow. Compared to the C6678, our DSP-OPU achieves up to 29× speedup and 70× higher energy efficiency across different algorithms. Compared to FPGA-based implementations of a single algorithm, we achieve up to 4.5× speedup and 4.4× better energy efficiency.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was financially supported in part by the National Key Research and Development Program of China under Grant 2021YFA1003602, in part by the Shanghai Pujiang Program under Grant 22PJD003.
The authors declare there are no conflicts of interest.
[1] |
R. Luo, A. Khalil, A. Ahmad, M. Azeem, I. Gafurjan, M. F. Nadeem, Computing The partition dimension of certain families of toeplitz graph, Front. Comput. Neurosci., 2022 (2022), 1–7. https://doi.org/10.3389/fncom.2022.959105 doi: 10.3389/fncom.2022.959105
![]() |
[2] | Z. Chu, M. K. Siddiqui, S. Manzoor, S. A. K. Kirmani, M. F. Hanif, M. H. Muhammad, On rational curve fitting between topological indices and entropy measures for graphite carbon nitride, Polycyclic Aromat. Compd., 2022 (2022), 1–18. |
[3] | I. Tugal, D. Murat, Çizgelerde yapısal karmaşıklığın olçülmesinde farklı parametrelerin kullanımı, Muş Alparslan Üniv. Mühendislik Mimarlık Fak. Derg., 2022 (2022), 22–29. |
[4] | M. A. Alam, M. U. Ghani, M. Kamran, Degree-based entropy for a non-kekulean benzenoid graph, J. Math., 2022 (2022). |
[5] | X. Wang, M. K. Siddiqui, S. A. K. Kirmani, S. Manzoor, S. Ahmad, M. Dhlamini, On topological analysis of entropy measures for silicon carbides networks, Complexity, 2021 (2021). https://doi.org/10.1155/2021/4178503 |
[6] | M. S. Alatawi, A. Ahmad, A. N. A. Koam, Edge weight-based entropy of nagnesium iodide graph, J. Math., 2021 (2021), 1–7. |
[7] | M. Rashid, S. Ahmad, M. Siddiqui, M. Kaabar, On computation and analysis of topological index-based invariants for complex coronoid systems, Complexity, 2021 (2021). |
[8] |
R. Huang, M. H. Muhammad, M. K. Siddiqui, S. Khalid, S. Manzoor, E. Bashier, Analysis of topological aspects for Metal-Insulator transition superlattice network, Complexity, 2022 (2022), 8344699. https://doi.org/10.1155/2022/8344699 doi: 10.1155/2022/8344699
![]() |
[9] |
M. K. Siddiqui, S. Manzoor, S. Ahmad, M. K. A. Kaabar, On computation and analysis of entropy measures for crystal structures, Math. Probl. Eng., 2021 (2021), 9936949. https://doi.org/10.1155/2021/9936949 doi: 10.1155/2021/9936949
![]() |
[10] |
Y. Chu, M. Imran, A. Q. Baig, S. Akhter, M. K. Siddiqui, On M-polynomial-based topological descriptors of chemical crystal structures and their applications, Eur. Phys. J. Plus, 135 (2020), 874. https://doi.org/10.1140/epjp/s13360-020-00893-9 doi: 10.1140/epjp/s13360-020-00893-9
![]() |
[11] |
C. Feng, M. H. Muhammad, M. K. Siddiqui, S. A. K. Kirmani, S. Manzoor, M. F. Hanif, On entropy measures for molecular structure of remdesivir system and their applications, Int. J. Quantum Chem., 122 (2022), e26957. https://doi.org/10.1002/qua.26957 doi: 10.1002/qua.26957
![]() |
[12] |
M. Imran, A. Ahmad, Y. Ahmad, M. Azeem, Edge weight based entropy measure of different shapes of carbon nanotubes, IEEE Access, 9 (2021), 139712–139724. https://doi.org/10.1109/ACCESS.2021.3119032 doi: 10.1109/ACCESS.2021.3119032
![]() |
[13] |
R. Huang, M. K. Siddiqui, S. Manzoor, S. Khalid, S. Almotairi, On physical analysis of topological indices via curve fitting for natural polymer of cellulose network, Eur. Phys. J. Plus, 137 (2022), 410. https://doi.org/10.1140/epjp/s13360-022-02629-3 doi: 10.1140/epjp/s13360-022-02629-3
![]() |
[14] |
K. Julietraja, P. Venugopal, S. Prabhu, A. K. Arulmozhi, M. K. Siddiqui, Structural analysis of three types of PAHs using entropy measures, Polycyclic Aromat. Compd., 42 (2022), 1–31. https://doi.org/10.1080/10406638.2021.1884101 doi: 10.1080/10406638.2021.1884101
![]() |
[15] | S. Manzoor, M. K. Siddiqui, S. Ahmad, On entropy measures of polycyclic hydroxychloroquine used for novel coronavirus (COVID-19) treatment, Polycyclic Aromat. Compd., 2020 (2020), 1–26. |
[16] | S. Manzoor, M. K. Siddiqui, S. Ahmad, Degree-based entropy of molecular structure of hyaluronic acid–curcumin conjugates, Eur. Phys. J. Plus, 136 (2021), 1–21. |
[17] | S. Manzoor, M. K. Siddiqui, S. Ahmad, On physical analysis of degree-based entropy measures for metal–organic superlattices, Eur. Phys. J. Plus, 136 (2021), 1–22. |
[18] | R. Huang, M. K. Siddiqui, S. Manzoor, S. Ahmad, M. Cancan, On eccentricity-based entropy measures for dendrimers, Heliyon, 7 (2021), e07762. |
[19] |
A. N. A. Koam, M. Azeem, M. K. Jamil, A. Ahmad, K. H. Hakami, Entropy measures of Y-junction based nanostructures, Shams Eng. J., 14 (2023), 101913. https://doi.org/10.1016/j.asej.2022.101913 doi: 10.1016/j.asej.2022.101913
![]() |
[20] |
F. E. Alsaadi, S. A. H. Bokhary, A. Shah, U. Ali, J. Cao, M. O. Alassafi, et al., On knowledge discovery and representations of molecular structures using topological indices, J. Artif. Intell. Soft Comput. Res., 11 (2021), 21–32. https://doi.org/10.2478/jaiscr-2021-0002 doi: 10.2478/jaiscr-2021-0002
![]() |
[21] |
M. C. Shanmukha, A. Usha, N. S. Basavarajappa, K. C. Shilpa, Graph entropies of porous graphene using topological indices, Comput. Theor. Chem., 1197 (2021), 113142. https://doi.org/10.1016/j.comptc.2021.113142 doi: 10.1016/j.comptc.2021.113142
![]() |
[22] |
X. Zuo, M. F. Nadeem, M. K. Siddiqui, M. Azeem, Edge weight based entropy of different topologies of carbon nanotubes, IEEE Access, 9 (2021), 102019–102029. https://doi.org/10.1109/ACCESS.2021.3097905 doi: 10.1109/ACCESS.2021.3097905
![]() |
[23] |
S. R. J. Kavitha, J. Abraham, M. Arockiaraj, J. Jency, K. Balasubramanian, Topological characterization and graph entropies of tessellations of kekulene structures: existence of isentropic structures and applications to thermochemistry, nuclear magnetic resonance, and electron spin resonance, J. Phys. Chem., 125 (2021), 8140–8158. https://doi.org/10.1021/acs.jpca.1c06264 doi: 10.1021/acs.jpca.1c06264
![]() |
[24] |
M. Imran, S. Manzoor, M. K. Siddiqui, S. Ahmad, M. H. Muhammad, On physical analysis of synthesis strategies and entropy measures of dendrimers, Arab. J. Chem., 15 (2022), 103574. https://doi.org/10.1016/j.arabjc.2021.103574 doi: 10.1016/j.arabjc.2021.103574
![]() |
[25] | S. Manzoor, M. K. Siddiqui, S. Ahmad, Computation of entropy measures for phthalocyanines and porphyrins dendrimers, Int. J. Quant. Chem., 122 (2022), e26854. |
[26] | J. Abraham, M. Arockiaraj, J. Jency, S. R. J. Kavitha, K. Balasubramanian, Graph entropies, enumeration of circuits, walks and topological properties of three classes of isoreticular metal organic frameworks, J. Math. Chem., 60 (2022), 695–732. https://doi.org/10.1007/s10910-021-01321-8 |
[27] | M. Arockiaraj, J. Jency, J. Abraham, S. R. J. Kavitha, K. Balasubramanian, Two-dimensional coronene fractal structures: topological entropy measures, energetics, NMR and ESR spectroscopic patterns and existence of isentropic structures, Mol. Phys., 120 (2022), e2079568. https://doi.org/10.1080/00268976.2022.2079568 |
[28] | P. Juszczuk, J. Kozak, G. Dziczkowski, S. Głowania, T. Jach, B. Probier, Real-world data difficulty estimation with the use of entropy, Entropy, 23 (2021), 1621. https://doi.org/10.3390/e23121621 |
[29] | J. Liu, M. H. Muhammad, S. A. K. Kirmani, M. K. Siddiqui, S. Manzoor, On analysis of topological aspects of entropy measures for polyphenylene structure, Polycyclic Aromat. Compd., 2022 (2022), 1–21. https://doi.org/10.1080/10406638.2022.2043914 |
[30] | M. Rashid, S. Ahmad, M. K. Siddiqui, S. Manzoor, M. Dhlamini, An analysis of eccentricity-based invariants for biochemical hypernetworks, Complexity, 2021 (2021), 1974642. https://doi.org/10.1155/2021/1974642 |
[31] | M. K. Siddiqui, S. Manzoor, S. Ahmad, M. K. A. Kaabar, On computation and analysis of entropy measures for crystal structures, Math. Probl. Eng., 2021 (2021), 9936949. https://doi.org/10.1155/2021/9936949 |
[32] | Y. Shang, Sombor index and degree-related properties of simplicial networks, Appl. Math. Comput., 419 (2022), 126881. https://doi.org/10.1016/j.amc.2021.126881 |
[33] | Y. Shang, Lower bounds for Gaussian Estrada index of graphs, Symmetry, 10 (2018), 325. https://doi.org/10.3390/sym10080325 |
[34] | S. Khan, S. Pirzada, Y. Shang, On the sum and spread of reciprocal distance Laplacian eigenvalues of graphs in terms of Harary index, Symmetry, 14 (2022), 1937. https://doi.org/10.3390/sym14091937 |
[35] | M. Azeem, M. F. Nadeem, Metric-based resolvability of polycyclic aromatic hydrocarbons, Eur. Phys. J. Plus, 136 (2021), 1–14. https://doi.org/10.1140/epjp/s13360-021-01399-8 |
[36] | A. Ahmad, A. N. A. Koam, M. H. F. Siddiqui, M. Azeem, Resolvability of the starphene structure and applications in electronics, Ain Shams Eng. J., 2021 (2021), forthcoming. https://doi.org/10.1016/j.asej.2021.09.014 |
[37] | M. F. Nadeem, M. Hassan, M. Azeem, S. U. D. Khan, M. R. Shaik, M. A. F. Sharaf, et al., Application of resolvability technique to investigate the different polyphenyl structures for polymer industry, J. Chem., 2021 (2021), 1–8. https://doi.org/10.1155/2021/6633227 |
[38] | M. Azeem, M. K. Jamil, A. Javed, A. Ahmad, Verification of some topological indices of Y-junction based nanostructures by M-polynomials, J. Math., 2022 (2022), forthcoming. https://doi.org/10.1155/2022/8238651 |
[39] | M. Azeem, M. Imran, M. F. Nadeem, Sharp bounds on partition dimension of hexagonal Möbius ladder, J. King Saud Univ. Sci., 2021 (2021), forthcoming. https://doi.org/10.1016/j.jksus.2021.101779 |
[40] | H. Raza, J. B. Liu, M. Azeem, M. F. Nadeem, Partition dimension of generalized Petersen graph, Complexity, 2021 (2021), 1–14. https://doi.org/10.1155/2021/5592476 |
[41] | A. N. A. Koam, A. Ahmad, M. Ibrahim, M. Azeem, Edge metric and fault-tolerant edge metric dimension of hollow coronoid, Mathematics, 9 (2021), 1405. https://doi.org/10.3390/math9121405 |
[42] | H. Wang, M. Azeem, M. F. Nadeem, A. U. Rehman, A. Aslam, On fault-tolerant resolving sets of some families of ladder networks, Complexity, 2021 (2021), 9939559. https://doi.org/10.1155/2021/9939559 |
[43] | A. Shabbir, M. Azeem, On the partition dimension of tri-hexagonal alpha-boron nanotube, IEEE Access, 9 (2021), 55644–55653. https://doi.org/10.1109/ACCESS.2021.3071716 |
[44] | M. F. Nadeem, M. Azeem, A. Khalil, The locating number of hexagonal Möbius ladder network, J. Appl. Math. Comput., 66 (2021), 149–165. https://doi.org/10.1007/s12190-020-01430-8 |
[45] | H. M. A. Siddiqui, M. A. Arshad, M. F. Nadeem, M. Azeem, A. Haider, M. A. Malik, Topological properties of supramolecular chain of different complexes of N-salicylidene-L-Valine, Polycyclic Aromat. Compd., 42 (2022), 6185–6198. https://doi.org/10.1080/10406638.2021.1980060 |
[46] | H. Raza, M. F. Nadeem, A. Ahmad, M. A. Asim, M. Azeem, Comparative study of valency-based topological indices for tetrahedral sheets of clay minerals, Curr. Org. Synth., 18 (2021), 711–718. https://doi.org/10.2174/1570179418666210709094729 |
Algorithm | Expression of the algorithm |
FFT | $X(k)=\sum_{n=0}^{N-1} x(n)\,W_N^{kn}$ |
FIR | $y(n)=\sum_{k=0}^{N-1} h(k)\,x(n-k)$ |
IIR | $y(n)=\sum_{k=0}^{N} b_k\,x(n-k)-\sum_{k=1}^{N} a_k\,y(n-k)$ |
LMS | $y(n)=\mathbf{w}^{T}(n)\,\mathbf{x}(n)$, $e(n)=d(n)-y(n)$, $\mathbf{w}(n+1)=\mathbf{w}(n)+2\mu\,e(n)\,\mathbf{x}(n)$ |
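The expressions above are the standard textbook definitions of the four kernels. As a quick, software-only reference (not the hardware implementation evaluated in this work), the sketch below implements them in plain Python, assuming zero initial conditions, list-valued inputs, and a power-of-two FFT length:

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.
    Implements X(k) = sum_{n=0}^{N-1} x(n) * W_N^{kn}, with W_N = exp(-2*pi*j/N)."""
    N = len(x)
    if N == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    X = [0j] * N
    for k in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * k / N)   # twiddle factor W_N^k
        X[k] = even[k] + w * odd[k]
        X[k + N // 2] = even[k] - w * odd[k]
    return X

def fir(x, h):
    """y(n) = sum_{k=0}^{N-1} h(k) * x(n-k), taking x(n) = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def iir(x, b, a):
    """y(n) = sum_{k=0}^{N} b_k x(n-k) - sum_{k=1}^{N} a_k y(n-k); a = [a_1, ..., a_N]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k - 1] * y[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        y.append(acc)
    return y

def lms(frames, desired, w, mu):
    """One adaptation step per frame: y = w^T x, e = d - y, w <- w + 2*mu*e*x."""
    for x, d in zip(frames, desired):
        y = sum(wi * xi for wi, xi in zip(w, x))
        e = d - y
        w = [wi + 2 * mu * e * xi for wi, xi in zip(w, x)]
    return w
```

For example, `fir([1, 2, 3, 4], [0.5, 0.5])` returns the two-tap moving average `[0.5, 1.5, 2.5, 3.5]`.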
Designs | Platform | Algorithm | Point | Throughput (MS/s)
Garrido et al. [23] | Virtex-7 XC7VS332T | FFT | 1024 | 2720.00
Potsangbam et al. [24] | Virtex-7 XC7VS335T | FIR | \ | 195.00
Potsangbam et al. [24] | Virtex-7 XC7VS335T | IIR | \ | 212.00
Ezilarasan et al. [25] | Virtex-5 XC5VLX30 | LMS | \ | 112.00
DSP SoC | TMS320C6678 | FFT | 1024 | 97.00
DSP SoC | TMS320C6678 | FIR | \ | 57.00
DSP SoC | TMS320C6678 | IIR | \ | 116.00
DSP SoC | TMS320C6678 | LMS | \ | 0.35
Designs | Theoretical RCE performance (GMAC/s) | Practical throughput (MS/s) |
C6678 | 44.89 (1×) | 0.952 (1×) |
Ours-325T | 38.4 (0.86×) | 10 (10.50×) |
Ours-U200 | 71.6 (1.60×) | 18.67 (19.61×) |
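The parenthesized factors in the table are relative to the TMS320C6678 baseline and can be reproduced directly from the raw columns. A minimal sanity-check script, with the values copied from the table above:

```python
# Speedup factors relative to the C6678 baseline, recomputed from the table values.
baseline_gmacs, baseline_msps = 44.89, 0.952
designs = {"Ours-325T": (38.4, 10.0), "Ours-U200": (71.6, 18.67)}
for name, (gmacs, msps) in designs.items():
    print(f"{name}: {gmacs / baseline_gmacs:.2f}x theoretical, "
          f"{msps / baseline_msps:.2f}x practical")
# Prints roughly 0.86x / 10.50x for Ours-325T and 1.60x / 19.61x for Ours-U200,
# matching the factors quoted in the table.
```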
FPGA | Num of RCE | LUT Usage | LUT Utilization | FF Usage | FF Utilization | DSP Usage | DSP Utilization | BRAM Usage | BRAM Utilization
XC7K325T | 2 | 197,624 | 96.97% | 118,736 | 29.13% | 512 | 60.95% | 256 | 57.53%
Alveo U200 | 8 | 770,496 | 65.17% | 474,944 | 20.09% | 2048 | 29.94% | 256 | 11.85%
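Because the table reports both absolute usage and the corresponding utilization percentage, the total capacity of each device is implied by usage / utilization. A small sketch of that back-calculation (values copied from the table; the percentages are rounded, so the results are approximate):

```python
# Back-compute the implied device capacity from reported usage and utilization.
reports = {
    "XC7K325T": {"LUT": (197624, 0.9697), "FF": (118736, 0.2913),
                 "DSP": (512, 0.6095), "BRAM": (256, 0.5753)},
    "Alveo U200": {"LUT": (770496, 0.6517), "FF": (474944, 0.2009),
                   "DSP": (2048, 0.2994), "BRAM": (256, 0.1185)},
}
for device, resources in reports.items():
    implied = {name: round(usage / util) for name, (usage, util) in resources.items()}
    print(device, implied)
```

For the XC7K325T this works out to roughly 203.8k LUTs, 407.6k flip-flops, 840 DSP slices and 445 block RAMs, i.e., the utilization figures are taken against the full device.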
Designs | Platform | WL | Alg | Order | Point | Freq (MHz) | Latency (ns) | Throughput (MS/s) | Power (W) | LUT | FF | DSP | BRAM
Pakize et al. [28] | Virtex-7 XC7VS330T | 16 | FFT | \ | 1024 | 339 | \ | 1261 | \ | 12.5k | 22.5k | 150 | 6
Garrido et al. [23] | Virtex-6 XC6VSX475T | 16 | FFT | \ | 1024 | 475 | \ | 1900 | \ | 10.3k | 10.3k | 12 | 0
Garrido et al. [23] | Virtex-7 XC7VS330T | 16 | FFT | \ | 1024 | 680 | \ | 2720 | 1.6 | 10.5k | 10.5k | 12 | 0
Ezilarasan et al. [25] | Virtex-4 XC4VFX12 | \ | LMS | 32 | \ | 109 | 9.152 | 109 | \ | \ | \ | \ | \
Ezilarasan et al. [25] | Virtex-5 XC5VLX30 | \ | LMS | \ | \ | 111 | 8.95 | 112 | \ | \ | \ | \ | \
Potsangbam et al. [24] | Virtex-7 XC7AS200T | \ | FIR | 3 | \ | 120 | 5.12 | 195 | 8.053 | \ | \ | \ | \
Potsangbam et al. [24] | Virtex-7 XC7AS200T | \ | IIR | 1 | \ | 120 | 4.72 | 212 | 5.293 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | FFT | \ | 1024 | 1400 | \ | 135.8 | 15.00 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | LMS | 16 | \ | 1400 | 1044 | 0.952 | 15.00 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | IIR | 7 | \ | 1400 | 105 | 9.52 | 15.00 | \ | \ | \ | \
DSP SoC | TMS320C6678 | 32 | FIR | 32 | \ | 1400 | 12.56 | 79.8 | 15.00 | \ | \ | \ | \
Customized IP | Xilinx Alveo U200 | 32 | FFT | \ | 1024 | 200 | \ | 200 | 6.32 | 59k | 116k | 153 | 58
Customized IP | Xilinx Alveo U200 | 32 | LMS | 16 | \ | 200 | 20.08 | 50 | 6.32 | 59k | 116k | 153 | 58
Customized IP | Xilinx Alveo U200 | 32 | IIR | 7 | \ | 200 | 5.09 | 196 | 6.32 | 59k | 116k | 153 | 58
Customized IP | Xilinx Alveo U200 | 32 | FIR | 32 | \ | 200 | 5.09 | 197 | 6.32 | 59k | 116k | 153 | 58
DSP-OPU | Kintex-7 XC7K325T | 32 | FFT | \ | 1024 | 150 | \ | 480 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Kintex-7 XC7K325T | 32 | LMS | 16 | \ | 150 | 100 | 10 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Kintex-7 XC7K325T | 32 | IIR | 7 | \ | 150 | 6.67 | 150 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Kintex-7 XC7K325T | 32 | FIR | 32 | \ | 150 | 6.67 | 150 | 3.4 | 197k | 118k | 512 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | FFT | \ | 1024 | 280 | \ | 896 | 12.3 | 770k | 475k | 2048 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | LMS | 16 | \ | 280 | 53.57 | 18.67 | 12.3 | 770k | 475k | 2048 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | IIR | 7 | \ | 280 | 3.57 | 280 | 12.3 | 770k | 475k | 2048 | 256
DSP-OPU | Xilinx Alveo U200 | 32 | FIR | 32 | \ | 280 | 3.57 | 280 | 12.3 | 770k | 475k | 2048 | 256
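For the rows that report both a per-sample latency and a throughput, the two columns appear to be reciprocals (throughput in MS/s ≈ 1000 / latency in ns). A short check under that assumption, using a few rows copied from the table above:

```python
# Verify throughput (MS/s) ~= 1000 / per-sample latency (ns) for selected rows.
rows = [
    ("Potsangbam et al. [24]", "FIR", 5.12, 195),
    ("Potsangbam et al. [24]", "IIR", 4.72, 212),
    ("TMS320C6678",            "LMS", 1044, 0.952),
    ("DSP-OPU (XC7K325T)",     "IIR", 6.67, 150),
    ("DSP-OPU (Alveo U200)",   "LMS", 53.57, 18.67),
]
for design, alg, latency_ns, reported in rows:
    derived = 1000.0 / latency_ns  # MS/s
    print(f"{design:24s} {alg}: derived {derived:7.2f} MS/s vs reported {reported} MS/s")
```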
Platform | Num of RCE | Algorithm | Order | Point | Frequency (MHz) | Throughput (MS/s)
XC7K325T | 2 | FFT | \ | 1024 | 150 | 480
XC7K325T | 2 | FFT | \ | 2048 | 150 | 436
XC7K325T | 2 | FFT | \ | 4096 | 150 | 400
XC7K325T | 2 | FIR | 16 | \ | 150 | 150
XC7K325T | 2 | FIR | 32 | \ | 150 | 150
XC7K325T | 2 | FIR | 64 | \ | 150 | 150
XC7K325T | 2 | IIR | 16 | \ | 150 | 150
XC7K325T | 2 | IIR | 32 | \ | 150 | 150
XC7K325T | 2 | IIR | 64 | \ | 150 | 35
XC7K325T | 2 | LMS | 16 | \ | 150 | 10
XC7K325T | 2 | LMS | 32 | \ | 150 | 2
XC7K325T | 2 | LMS | 64 | \ | 150 | 0.8
Alveo U200 | 8 | FFT | \ | 1024 | 150 | 480
Alveo U200 | 8 | FFT | \ | 2048 | 150 | 436
Alveo U200 | 8 | FFT | \ | 4096 | 150 | 400
Alveo U200 | 8 | FIR | 16 | \ | 150 | 150
Alveo U200 | 8 | FIR | 32 | \ | 150 | 150
Alveo U200 | 8 | FIR | 64 | \ | 150 | 150
Alveo U200 | 8 | IIR | 16 | \ | 150 | 150
Alveo U200 | 8 | IIR | 32 | \ | 150 | 150
Alveo U200 | 8 | IIR | 64 | \ | 150 | 150
Alveo U200 | 8 | LMS | 16 | \ | 150 | 10
Alveo U200 | 8 | LMS | 32 | \ | 150 | 10
Alveo U200 | 8 | LMS | 64 | \ | 150 | 10