Heyho! Here you can download my first paper AMAIX: A Generic Analytical Model for Deep Learning Accelerators free of charge. We successfully submitted this paper to the SAMOS XX conference, where it won the best paper award. I guess it will soon be published in the Springer Lecture Notes in Computer Science (LNCS).
In the next section I'll briefly describe what we did in this paper using the kind of language I prefer (not that super duper fancy paper-language). But of course you are also invited to read the paper ;)
Just a short description
As with many technical products, you want to know as early as possible in the development process how good the product you are developing actually is.
This is especially true when there are many competitors and the market is therefore highly competitive.
Such a situation can be found with so-called deep learning accelerators (DLAs), i.e. hardware for accelerating AI applications.
This is fuelled by the fact that extreme growth is predicted for this market.
At their AI day in 2019, Qualcomm said that they expect a 10x growth in revenue, from 1.8 billion US dollars in 2018 to 17 billion dollars in 2025.
And this is just for AI accelerators in data centers.
Thus, many large tech giants (actually all of them), but also small start-ups, are trying to establish a dominant position as early as possible. If you don't believe it: here's a list of companies currently trying to enter this market.
So, long story short: if you want to be successful in this market, you need methods to estimate your design's key performance indicators (chip area, power consumption, computing power, etc.). Well-known methods include RTL simulations (e.g. Verilog) or system-level simulations (e.g. SystemC). But if you have a Verilog or SystemC model of your DLA, your project is already at an advanced stage.
A method which you can use even right after you've had the initial idea for your DLA are so-called analytical models. These models try to estimate a system's KPIs using math or simple algorithms. A pen and paper or an Excel spreadsheet is everything you need to get started with them. The problem with analytical models is their extremely simplifying nature. If your system has any kind of non-determinism or dynamic behaviour, the obtained results will surely be pretty inaccurate. As most compute systems are very dynamic or include non-determinism (caches, for example), analytical models are usually not of great help. But how well do they work for these emerging deep learning accelerators?
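To give you a flavour of how simple such a model can be, here's a toy example (my own illustration, not from the paper): estimating how long it takes to load a feature map over a memory interface with a fixed bandwidth. All numbers below are made up.

```python
# Toy analytical model (my own illustration, not from the paper):
# time to load a feature map over a fixed-bandwidth memory interface.

def load_time_s(height, width, channels, bytes_per_element, bandwidth_bytes_per_s):
    """Pen-and-paper estimate: total data volume divided by bandwidth."""
    total_bytes = height * width * channels * bytes_per_element
    return total_bytes / bandwidth_bytes_per_s

# Example: a 224x224x3 int8 image over a hypothetical 10 GB/s interface
t = load_time_s(224, 224, 3, 1, 10e9)  # about 15 microseconds
```

That's the whole "model": one line of arithmetic you could just as well do on paper.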
The paper provides you with all the details, so here's just a summarized answer to this question: at least for our examined case study (the NVDLA) we could estimate the execution time pretty well. There are many reasons for that, so let me list a few of them:
- The NVDLA is quite simple. Convolutional neural networks (the workload of the NVDLA) may consist of many layers, but there are only a few underlying operation types.
- There is only a small control flow overhead. In a few cycles the pipelines are filled and then the operations start.
- There are no significant dynamic effects. Rather than using caches, the NVDLA incorporates application managed buffers.
- The NVDLA is bottlenecked either by the available memory bandwidth or by its maximum compute power.
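That bottleneck structure boils down to taking a maximum of two terms: compute time versus memory time. A minimal sketch of such an estimate in Python (the parameter names and example numbers are my own, not the notation or values from the paper):

```python
# Minimal roofline-style estimate (a sketch; parameter names and
# example values are my own, not taken from the paper).

def estimated_time_s(total_ops, total_bytes, peak_ops_per_s, bandwidth_bytes_per_s):
    """The layer is limited either by compute (ops / peak throughput)
    or by memory (bytes / bandwidth) -- whichever takes longer wins."""
    compute_time = total_ops / peak_ops_per_s
    memory_time = total_bytes / bandwidth_bytes_per_s
    return max(compute_time, memory_time)

# Example: a layer with 1e9 operations moving 50 MB of data on a
# hypothetical DLA with 1 TOP/s peak compute and 10 GB/s DRAM bandwidth.
t = estimated_time_s(1e9, 50e6, 1e12, 10e9)  # memory-bound: 5 ms
```

Because it's just two divisions and a max, sweeping this over thousands of hardware configurations is trivially fast.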
The last point is of particular interest, as it allows us to use the so-called roofline model. A significant part of the paper is about how to rearrange this model for DLAs and apply it to the NVDLA. The cool thing is that this model takes some of the NVDLA's configurable hardware parameters as input and gives you the estimated execution time as output. If you pour this into a Python script, you can evaluate thousands of designs in a few seconds and generate nice graphs like this, which you can use for design space exploration:
Besides design space exploration, there are still some open topics which we haven't studied yet. I think that analytical models could be an interesting addition for DLA compilers. For example, many DLAs support a so-called Winograd convolution, which basically allows you to convolve with fewer operations compared to a standard convolution. The downside is that you need more weights, leading to a higher memory bandwidth consumption. In my eyes, a smart compiler could analyse the system and choose the right operation depending on the bottleneck of the system.
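To make that trade-off concrete, here's a back-of-the-envelope comparison (my own sketch, using the commonly cited Winograd variant F(2x2, 3x3); whether a particular DLA implements exactly this variant is a separate question):

```python
# Back-of-the-envelope Winograd vs. direct convolution (my own sketch,
# for the Winograd variant F(2x2, 3x3); not taken from the paper).

def direct_conv_macs(h, w, c_in, c_out, k=3):
    """MACs for a standard k x k convolution, same-size output assumed."""
    return h * w * c_in * c_out * k * k

def winograd_f2x2_3x3_macs(h, w, c_in, c_out):
    """Winograd F(2x2, 3x3): each 2x2 output tile needs 4x4 = 16
    multiplications instead of 2*2*9 = 36, a 2.25x reduction."""
    tiles = (h // 2) * (w // 2)
    return tiles * 16 * c_in * c_out

# Fewer operations ...
ops_direct = direct_conv_macs(16, 16, 1, 1)       # 2304 MACs
ops_wino = winograd_f2x2_3x3_macs(16, 16, 1, 1)   # 1024 MACs (2.25x fewer)

# ... but each 3x3 kernel is transformed into a 4x4 tile, so roughly
# 16/9 (about 1.78x) more weight bytes have to be fetched from memory.
```

So whether Winograd pays off depends on whether the layer is compute-bound (then it helps) or memory-bound (then the extra weight traffic can hurt), which is exactly what an analytical model in the compiler could decide.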
Anyway, this was the "short" summary of our paper. If you have any questions, feel free to write me an e-mail (see About or use the address in the paper).