Systematic Deep-Learning Architecture Evaluations with DNN Accelerator Design Exploring Frameworks
In recent years, deep neural networks (DNNs) have seen considerable success. Executing these complex models on embedded systems is problematic, however, because edge devices have tight resource and power constraints. Many studies in both industry and academia have therefore developed specialized hardware accelerators for energy-efficient, high-throughput mapping of DNN workloads. Each accelerator has its own memory architecture and dataflow configuration. However, the vast majority are ad hoc designs that represent local optima derived from a limited design space. Early-stage design space exploration tools can quickly and accurately predict the area, performance, and energy of DNN inference accelerators from high-level network topology and architecture parameters, without the need for low-level RTL coding. These tools help designers find Pareto-optimal designs with the best dataflow while staying within hardware constraints and meeting computing demands. In this paper, the reader will first learn what a hardware accelerator is and what its key components are, before moving on to the newest dataflow, reconfigurability, and design approaches.
Keywords: DNN, Design Space Exploration, Accelerator
A wide range of DNN models has been developed to attain high accuracy. The real boom for CNNs, which are the most extensively employed models for object identification and recognition, occurred in 2012 when AlexNet won the ILSVRC competition by outperforming previous methods. Since then, as computer hardware and memory resources have become more readily available, DNN models have grown increasingly complicated. Hardware options for developing and deploying DNNs are diverse, ranging from general-purpose (CPUs and GPUs) to programmable (FPGAs) to special-purpose ASICs. In this work, we survey DNN accelerator design exploration frameworks based on systolic array architectures, namely ZigZag, Timeloop, and SCALE-Sim, on prominent neural networks such as AlexNet, and present energy-cost trade-offs and other critical metrics for the designs examined by these frameworks.
DNN Accelerator Design Exploration Frameworks
Timeloop is an infrastructure for evaluating and exploring the architecture design space of DNN accelerators. It has two main components: a model that provides performance, area, and energy projections, and a mapper that constructs and searches the mapspace of a given workload on the targeted architecture. Accelergy is an energy estimation methodology for accelerators that accepts design specifications composed of both high-level and low-level components. It can estimate area and energy for flexible components, which hardware designers need when designing DNN accelerators.
Timeloop’s operation consists of creating a mapspace for a given workload on an architecture, exploring that mapspace to find the optimal mapping, and reporting the performance and energy metrics of that optimal mapping. Timeloop needs the following inputs to generate a mapspace:
- The shape and parameterization of a workload, e.g., the dimensions of the input, output, and weight tensors used in a DNN layer, and the set of operations needed to compute the layer.
- A specification of the architecture’s hardware organization.
- The constraints that the architecture imposes on ways in which the computation can be partitioned across the hardware and scheduled over time.
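As a minimal sketch of what the shape and parameterization of a workload amount to, the following Python snippet describes one convolution layer and derives its tensor sizes and operation count. The class name, field names, and the layer dimensions are assumptions made for illustration (a first layer in the style of AlexNet); this is not Timeloop's actual input schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Shape of one convolution layer (illustrative names, not Timeloop's schema)."""
    N: int  # batch size
    C: int  # input channels
    M: int  # output channels (filters)
    P: int  # output height
    Q: int  # output width
    R: int  # filter height
    S: int  # filter width
    stride: int = 1

    def input_hw(self):
        # Input extent needed for a P x Q output with an R x S filter, no padding.
        height = (self.P - 1) * self.stride + self.R
        width = (self.Q - 1) * self.stride + self.S
        return height, width

    def tensor_sizes(self):
        # Element counts of the three tensors used by the layer.
        height, width = self.input_hw()
        return {
            "weights": self.M * self.C * self.R * self.S,
            "inputs": self.N * self.C * height * width,
            "outputs": self.N * self.M * self.P * self.Q,
        }

    def macs(self):
        # One multiply-accumulate per point of the 7-dimensional loop nest.
        return self.N * self.M * self.P * self.Q * self.C * self.R * self.S


# Assumed example shape, chosen only for illustration.
conv1 = ConvLayer(N=1, C=3, M=96, P=55, Q=55, R=11, S=11, stride=4)
print(conv1.tensor_sizes(), conv1.macs())
```

The tensor sizes bound how much data each memory level must hold, while the MAC count fixes the amount of arithmetic that any mapping has to schedule.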
Once the mapspace is constructed, Timeloop evaluates the performance and energy efficiency of a set of mappings within the space. Its model determines the counts of various activities, including arithmetic operations, memory accesses, and network transfers. It also estimates access counts for multicast data and for data forwarded across PEs. Combined with the energy cost per access supplied by the Accelergy energy model, these access counts are used to determine the energy consumption of the workload.
- Workload Specification - Specifying the shape and parameterization of a workload, e.g., the dimensions of the input, output, and weight tensors used in a DNN layer, and the set of operations needed to compute the layer.
- Architecture Specification - Specifying the hardware organization, i.e., the topology of interconnected compute and storage units, and mapspace constraints that limit the set of mappings allowed by the hardware.
- Mappings - A mapping describes the way in which the operation space and the associated dataspaces are split into chunks (i.e., tiles) at each level of the memory hierarchy and among multiple memory instances at each level.
- Mapspace Constraints - Specifying dataflow constraints on the hardware. Factors specify the loop bounds, and permutations specify the loop ordering within the tiles. Constraints are needed because an unconstrained mapspace becomes very large.
- Mapspace Construction and Search - A mapspace is the set of all legal mappings generated by the mapper. A search routine samples a mapping from the pruned and constrained mapspace, evaluates it using the architecture model, and chooses the next mapping to evaluate based on some heuristic.
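The interplay of mappings, constraints, and search listed above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Timeloop's mapper: the workload has only three dimensions, a mapping is a single tile size per dimension, the only constraint is an on-chip capacity limit, and `toy_cost` is a hypothetical stand-in for the architecture model.

```python
import random


def divisors(n):
    """All tile sizes that evenly divide a loop bound."""
    return [d for d in range(1, n + 1) if n % d == 0]


def tile_footprint(tile):
    # Words held on chip for one tile of a toy C x M x P workload (1x1 filter):
    # weights C*M, inputs C*P, outputs M*P.
    c, m, p = tile["C"], tile["M"], tile["P"]
    return c * m + c * p + m * p


def toy_cost(bounds, tile):
    # Hypothetical cost: off-chip words moved if every tile is fetched once.
    n_tiles = 1
    for dim in bounds:
        n_tiles *= bounds[dim] // tile[dim]
    return n_tiles * tile_footprint(tile)


def random_search(bounds, capacity, samples=200, seed=0):
    """Sample mappings, discard illegal ones, and keep the cheapest legal one."""
    rng = random.Random(seed)
    choices = {dim: divisors(bound) for dim, bound in bounds.items()}
    best, best_cost = None, float("inf")
    for _ in range(samples):
        tile = {dim: rng.choice(choices[dim]) for dim in bounds}
        if tile_footprint(tile) > capacity:  # mapspace constraint
            continue
        cost = toy_cost(bounds, tile)
        if cost < best_cost:
            best, best_cost = tile, cost
    return best, best_cost


bounds = {"C": 8, "M": 16, "P": 32}
best_tile, best_cost = random_search(bounds, capacity=256)
print(best_tile, best_cost)
```

Random sampling is only one possible heuristic; the same loop structure accommodates exhaustive enumeration or a guided search by changing how the next `tile` is chosen.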
Timeloop’s architecture model evaluates a mapping by first analyzing tiles of data that represent the mapping and measuring the transfer of data between them to obtain their access counts. Next, these access counts are used to derive the access counts to hardware components, which combined with Accelergy give performance and energy estimations. The technology model also provides an area estimate based on the specified architecture in the inputs.
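The final step of this evaluation, combining access counts with a per-access energy cost, reduces to a weighted sum. The sketch below shows that arithmetic in Python; the component names, access counts, and energy values are placeholders assumed for illustration, not numbers produced by Timeloop or Accelergy.

```python
def total_energy_pj(access_counts, energy_per_access_pj):
    """Return per-component and total energy as count x energy-per-access."""
    missing = set(access_counts) - set(energy_per_access_pj)
    if missing:
        raise KeyError(f"no energy cost for: {sorted(missing)}")
    per_component = {
        name: count * energy_per_access_pj[name]
        for name, count in access_counts.items()
    }
    return per_component, sum(per_component.values())


# Placeholder access counts for one mapping (assumed values).
counts = {"DRAM": 1_000, "GlobalBuffer": 20_000, "RegisterFile": 300_000, "MAC": 100_000}
# Placeholder energy per access in picojoules (assumed values).
costs = {"DRAM": 200.0, "GlobalBuffer": 6.0, "RegisterFile": 1.0, "MAC": 1.0}

breakdown, total = total_energy_pj(counts, costs)
print(breakdown, total)
```

Even with these made-up numbers the breakdown illustrates why mappings matter: a small number of accesses to an expensive memory level can contribute as much energy as a far larger number of cheap accesses.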
Results from Timeloop
The architecture and mapspace constraints specified by the original Eyeriss authors were used to generate energy estimations for the various components listed in the architecture constraints.
ZigZag is a memory-centric Design Space Exploration (DSE) framework for rapid DNN acceleration. 
The ZigZag framework takes the neural network workload definition as input (i.e. the dimension sizes of Conv2D, DepthwiseConv2D, and Dense or fully connected layers). It also takes the hardware constraints (i.e. memory utilization, precision), a memory pool (a set of memory instances from which different memory hierarchies can be built), and technology-dependent costs (i.e. the cost of each MAC or interconnection). As output, ZigZag provides Pareto-optimal solutions (i.e. optimal accelerator architectures, optimal spatial/temporal mappings, and the corresponding energy, performance, and area). Inside the ZigZag framework, there are three primary components: the Architecture Generator, the Mapping Search Engines, and the Hardware Cost Estimator. Hence, the ZigZag framework can be explored from the following perspectives:
- Unified Design Point Representation: model each design point in the framework, including information from the hardware, the algorithm, and the mapping.
- Standardized Hardware Cost Estimation: hardware cost estimation given a unified design point representation.
- Automated Design Point Generation: automatic generation of design points by searching the design space.
- Unified Design Point Representation: The whole design space is split into two dimensions. First, the hardware architecture may consist of a balanced memory hierarchy (every memory level stores all three operands) or an unbalanced one (e.g. the weight operand has two memory levels while input and output have three). On the other dimension, the algorithm-to-hardware mapping can be even or uneven. In the diagram, we have one design point example for each of the combination categories. For example:
  - Balanced memory with even mapping: In (a), the alphanumerical pair in each block acts like a for loop (for FX from 1 to 5, with another for loop on top of it for C from 1 to 2, and so on). The DRAM, Global Buffer, Register File, and MAC levels all have the same mappings.
  - Unbalanced memory with even mapping: In (b), as long as the memory is shared, the loop tiling and loop blocking are even.
  - Balanced memory with uneven mapping: In (c), loop tiling can be done in whichever way fits the memory capacity.
  - Unbalanced memory with uneven mapping: In (d), the mapping space has far more freedom than in other comparable frameworks.
  This representation is beneficial because it gives a clear view of the memory hierarchy (e.g. that the memory has two levels) and of the total algorithm size, since all loop dimensions are captured along with their temporal and spatial information.
- Standardized Hardware Cost Estimation: Given the above representation, the hardware cost is estimated using the loop relevance principle. Depending on whether each loop is relevant or irrelevant for each operand, a set of equations can be extracted to estimate data size, MAC operations, etc. for each operand at each memory level.
- Automated Design Point Generation: Once the ZigZag framework has the hardware cost estimation, it automatically generates design points, each defined as a combination of hardware architecture, spatial mapping, and temporal mapping. Different search methods have been developed for each of these design perspectives. For the hardware architecture, ZigZag can automatically generate the memory hierarchy from user-defined memory candidates, using what is called a memory-pool-based memory hierarchy search engine. For spatial and temporal mapping, different search methods can be used, such as exhaustive, heuristic, and iterative search.
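The loop relevance principle mentioned in the cost estimation step can be made concrete with a small Python sketch. It is a simplified reading of the idea rather than ZigZag's actual equations: a loop is relevant to an operand if its index addresses that operand, the data size of the operand at a memory level is the product of its relevant loop extents, and the product of its irrelevant loop extents gives the reuse. The dimension names (B, K, C, OX, OY, FX, FY) and the example loop extents are assumptions chosen for illustration.

```python
from math import prod

# Loops whose index addresses the operand (convolution semantics):
# weights are indexed by K, C, FX, FY; outputs by B, K, OX, OY.
RELEVANT = {
    "W": ("K", "C", "FX", "FY"),
    "O": ("B", "K", "OX", "OY"),
}


def tile_size(operand, loops, stride=1):
    """Elements of `operand` covered by the loops at and below a memory level."""
    def extent(dim):
        return loops.get(dim, 1)

    if operand == "I":
        # Inputs are partially relevant to OX/FX and OY/FY (sliding window).
        ix = (extent("OX") - 1) * stride + extent("FX")
        iy = (extent("OY") - 1) * stride + extent("FY")
        return extent("B") * extent("C") * ix * iy
    return prod(extent(dim) for dim in RELEVANT[operand])


def reuse_factor(operand, loops):
    """Product of irrelevant loop extents: reuse per element (W and O only)."""
    return prod(n for dim, n in loops.items() if dim not in RELEVANT[operand])


# Assumed loop extents allocated at and below one memory level.
loops = {"K": 4, "C": 2, "OX": 3, "FX": 5}
for operand in ("W", "I", "O"):
    print(operand, tile_size(operand, loops))
```

Applying the same two functions at every memory level of a design point yields the per-operand data sizes and reuse figures that a cost model would then convert into access counts.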
Moreover, ZigZag applies pruning principles in each search engine to reduce the design space. For the memory hierarchy, a constraint is placed on the size/cost ratio or area between memories at different levels of the hierarchy. For each spatial mapping scenario, only the Pareto-optimal solutions are explored further in the temporal mapping search. For temporal mapping, ZigZag ensures that data reuse exists at each memory level during the loop tiling phase. As an example, data stationarity can be maximized at the lower memory levels during the loop ordering phase. Beyond these, further search optimizations can be applied in this step.
Results from ZigZag
As can be seen from the results, with automatic memory hierarchy generation, ZigZag achieves up to 64% memory energy savings compared to Timeloop and to the combined Timeloop-ZigZag flow (i.e. the hardware architecture generated by Timeloop fed as input into ZigZag), excluding MAC cost (which remains constant across the various optimizations for each layer).
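The Pareto-based pruning between the spatial and temporal search stages can be sketched as a dominance filter. The Python below is a generic illustration, not code from ZigZag: the design names and their (energy, latency) values are made up, and lower is assumed to be better for both metrics.

```python
def pareto_front(points):
    """Keep the points that no other point dominates (lower is better)."""
    front = []
    for name, energy, latency in points:
        dominated = any(
            (e2 <= energy and l2 <= latency) and (e2 < energy or l2 < latency)
            for _, e2, l2 in points
        )
        if not dominated:
            front.append((name, energy, latency))
    return front


# Made-up candidate mappings as (name, energy, latency).
designs = [
    ("A", 10.0, 5.0),
    ("B", 8.0, 7.0),
    ("C", 12.0, 6.0),
    ("D", 9.0, 4.0),
]
survivors = pareto_front(designs)
print(survivors)
```

Only the surviving candidates would be handed to the next search stage, which is how this kind of pruning shrinks the design space without discarding any design that is best for some trade-off between the two metrics.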
Having compared these state-of-the-art deep learning accelerator frameworks in terms of architecture mapping, performance, energy, area, and cost estimation, we make the following observations:
- Timeloop supports even architecture mapping, whereas ZigZag supports uneven mapping, which gives it greater freedom in the mapping space. This opens up new mapping possibilities and can therefore find better design points. SCALE-Sim does not support architecture mapping, as it evaluates one architecture at a time.
- While Timeloop and SCALE-Sim operate on a constrained search space with basic heuristics, ZigZag supports advanced heuristics, resulting in efficient and scalable architecture designs.
- The mapping analysis in Timeloop and SCALE-Sim is done using fast simulation, whereas in ZigZag it is done analytically. There is therefore a trade-off between analysis time and the performance of the resulting architecture design.
- The hardware cost estimation is fine-grained in Timeloop but coarse-grained in ZigZag, which gives ZigZag's architecture design space more liberty. However, the better choice again depends on the developer’s use case and the desired neural network design.
- SCALE-Sim does not generate optimized memory hierarchy designs, which makes it a smaller design space exploration framework than Timeloop and ZigZag. As per our observations, ZigZag is much more efficient at memory hierarchy generation than other state-of-the-art accelerator generators, as it takes the hardware architecture, spatial mapping, and temporal mapping into consideration and maps them onto a Pareto-optimal graph.
The importance and application of deep learning have skyrocketed in the last decade. In a short amount of time, these algorithms have in some tasks been able to surpass human accuracy. However, their high effectiveness comes from high algorithmic complexity, which demands substantial computational power. Researchers have therefore created hardware platforms for accelerating such algorithms. Given the common trend of electronics moving towards mobile devices and IoT nodes, these hardware platforms will need to take a low-power, efficiency-oriented approach. As a result, critical components of the system, such as memory and access to it, must be considered from the beginning of the design process, and many dataflows and techniques for optimizing all critical aspects of accelerators have been developed. For example, spatial architectures that distribute part of the storage elements directly in the processing elements to enable data reuse are better at reducing the impact of memory on power consumption. The goal of this paper is to provide an updated survey that focuses primarily on state-of-the-art architectures from the last three years and evaluates them across key metrics such as the performance of accelerator-generated DNN designs, hardware cost estimation, and energy. The contribution of this work is the collection and comparison of the latest architectures that have not been covered in prior surveys.
 Shyamnath Gollakota and Dina Katabi. 2008. Zigzag decoding: Combating hidden terminals in wireless networks. In Proceedings of the ACM SIGCOMM 2008 conference on Data communication. 159–170.
 Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel Emer. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 304–315. https://doi.org/10.1109/ISPASS.2019.00042
 Ananda Samajdar, Jan Moritz Joseph, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2020. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-sim. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 58–68.