.. ECE 574

.. attention:: This document was last updated |today|

.. _01asictechnology:

ASIC Technology
===============

.. important::

   The purpose of this lecture is as follows.

   * To answer the questions: what is an ASIC? Why do we need it? [slides]
   * To review the historical context of ASIC hardware design [Feynman, SIA]
   * To discuss new insights into the needs of ASIC hardware design [Leiserson]
   * To discuss the future opportunities of ASIC hardware design [Hennessy]

.. attention::

   The following references are relevant background to this lecture.

   * Slides: What is an ASIC? Why do we need it? :download:`PDF `
   * Feynman, Richard P. (1960). There's Plenty of Room at the Bottom. Engineering and Science, 23 (5), pp. 22-36. ISSN 0013-7812. https://calteches.library.caltech.edu/1976/1/1960Bottom.pdf
   * 2013 International Technology Roadmap for Semiconductors (ITRS). https://www.semiconductors.org/resources/2013-international-technology-roadmap-for-semiconductors-itrs/
   * M. Bohr, "A 30 Year Retrospective on Dennard's MOSFET Scaling Paper," IEEE Solid-State Circuits Society Newsletter, vol. 12, no. 1, pp. 11-13, Winter 2007. doi: 10.1109/N-SSC.2007.4785534. https://ieeexplore.ieee.org/document/4785534
   * C. E. Leiserson, N. C. Thompson, J. S. Emer, B. C. Kuszmaul, B. W. Lampson, D. Sanchez, et al., "There's plenty of room at the Top: What will drive computer performance after Moore's law?" Science, vol. 368, no. 6495, 2020. https://doi.org/10.1126/science.aam9744
   * John L. Hennessy, David A. Patterson, "A New Golden Age for Computer Architecture," Commun. ACM, 62(2), pp. 48-60, 2019. https://doi.org/10.1145/3282307

**If history teaches us anything ...**

This course presents a modern view on the hardware design process, with specific attention to ASIC technology. Our goal is to understand what matters in modern hardware design. Over the past decades, hardware has gone through profound changes that deeply affect how modern hardware design works.

We specifically consider ASIC design of digital hardware based on standard cell technology. This is just one target among many different variants (FPGA, sea-of-gates, full-custom hardware, ...), but it quite likely dominates modern chip design. Furthermore, our observations for ASIC standard cell design also apply, in many respects, to design for other targets. In this introductory lecture, we look at the past and attempt to explain the current state of hardware design.

Plenty of room at the bottom
----------------------------

A unique characteristic of digital hardware is that it is small, very small. One billion transistors on a chip is not uncommon nowadays, and Apple's M1 Ultra processor (2022) crosses the one-hundred-billion-transistor threshold. That is a mind-boggling number of devices to be able to hold in one's hand.

Hardware has not always been small. You may have seen images of the room-filling computers of the mid-20th century -- machines with the equivalent of a few thousand logical switches. From there to today, we figured out how to scale the hardware design process from :math:`10^4` switches to :math:`10^{11}` switches. That is an increase of 7 orders of magnitude! This required scaling those early switches down to infinitesimal size. And the reason why this scaling can be done at all is that the components of nature are also of infinitesimal size. One of the first people to remind us of this was Prof.
Richard P. Feynman, one of the greatest physicists of modern times. In 1959 he gave a talk, published as *There's Plenty of Room at the Bottom*, which explores the physical limitations that nature imposes on building small computers. His conclusion was, essentially, that most of the limitations in scaling were related to the engineering of small systems, not to limits that nature itself had imposed.

Printing Small
^^^^^^^^^^^^^^

For example, Feynman goes through a thought experiment to print the entire Encyclopaedia Britannica, a 24-volume encyclopedia, on the head of a pin one sixteenth of an inch in diameter. He concludes that this can be done by scaling the printing down 25,000 times. The question, then, is whether features that small can be printed at all. The smallest feature of the Britannica is a grayscale halftone dot of about one one-hundred-and-twentieth of an inch. Reducing that dot 25,000 times would leave a dot 80 Angstrom across -- roughly 32 metal atoms. The shrunken dot therefore occupies just about 1,000 atoms, and that would be the tiniest feature that has to be printed. He concludes that the Britannica would physically fit on the head of a pin, and that the only thing stopping us from doing it is the absence of a (1959) printing technology that can produce such small features.

Feynman goes one step further, by evaluating how much room would be needed to store *all* of human knowledge, estimated at 24 million books in 1959. Storing each bit in a cube of 5 by 5 by 5 atoms (so roughly 125 atoms per bit), he concludes that a cube of material one two-hundredth of an inch wide is sufficient to store those 24 million books. Therefore, Feynman states that there is PLENTY of room at the bottom, not just room at the bottom.
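Feynman's arithmetic is easy to reproduce. The short Python sketch below is merely illustrative; the 2.5 Angstrom atomic spacing and the bits-per-book figure are assumptions made here, not numbers from Feynman's talk. It lands within rounding distance of his estimates.

.. code::

   # A back-of-the-envelope check of Feynman's estimates. The 2.5 Angstrom
   # atomic spacing and the bits-per-book figure are assumptions made here
   # for illustration; the other numbers come from the text above.
   INCH = 2.54e-2            # meters
   ANGSTROM = 1e-10          # meters
   ATOM = 2.5 * ANGSTROM     # assumed spacing between metal atoms

   # Printing: the smallest halftone dot of the Britannica, shrunk 25,000 times.
   dot = (INCH / 120) / 25_000
   print(f"shrunken dot: {dot / ANGSTROM:.0f} Angstrom, "
         f"about {dot / ATOM:.0f} atoms across")

   # Storage: a cube of material 1/200 inch wide, at 5 x 5 x 5 = 125 atoms/bit.
   atoms_per_side = (INCH / 200) / ATOM
   capacity_bits = atoms_per_side ** 3 / 125

   # Assuming a (hypothetical) average of 4e7 bits per book, pictures included.
   print(f"cube capacity: {capacity_bits:.1e} bits, "
         f"or roughly {capacity_bits / 4e7 / 1e6:.0f} million books")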
Computing in the Small
^^^^^^^^^^^^^^^^^^^^^^

The next question Feynman tackles is whether it is possible to build machines that are very small. He points out that biology has excelled at building small machines. For example, humans have no trouble recognizing faces using a computer the size of a brain. Conversely, solving the same problem with 1959 technology would require an enormous computer, which would not only be very big but also much slower, precisely because of its physical size.

However, Feynman points out that the limitations on small machines are set by limits of engineering. Any manufactured artifact is subject to manufacturing tolerances, and it may not have a homogeneous material structure. That puts hard limits on how small you can effectively build. He further points out that scaling changes the nature of a design. For example, materials become relatively stronger at smaller scale, and some properties start to change -- such as the magnetism created by magnetic dipoles. As electric circuits are physically scaled down, their properties change as well. For example, their natural frequency increases significantly. Therefore, electric circuits may have to be redesigned when they are scaled down. But again, scaling only causes engineering challenges; it does not run into physical limitations.

Feynman then goes through a thought experiment on how to build such a small computer. He describes something which is amazingly similar to how modern chips are produced: a mechanism in which one large machine makes several smaller-size copies of itself. This can be repeated several times to create smaller and smaller machines, and all the time this can be achieved with roughly a constant amount of raw material, because the machines become so much smaller. Recall that modern chip-making machines still use the principle of optical reduction to print a miniature version of a design onto raw material.

Moore's law
^^^^^^^^^^^

The famous Moore's law is captured in a 1965 paper by Dr. Gordon Moore, and observes that the number of transistors integrated on a chip doubles roughly every two years. In essence, Gordon Moore showed that the observations of Feynman are confirmed in practice: scaling works because the engineering problems of scaling can be solved. The amazing insight of Moore's law is that the scaling works as well as it does. The following figure by Ritchie and Roser demonstrates, quite dramatically, just how much scaling Moore's law has been able to provide: 7 orders of magnitude, and still the scaling continues! It is fashionable to say that 'Moore's law is dead' or 'ending', but the truth is that *nothing* else in engineering has achieved the same level of improvement. Compared to 50 years ago, does a car consume 7 orders of magnitude less fuel? No! Compared to 50 years ago, do we cover 7 orders of magnitude more topics in the engineering curriculum? No! It is correct that the physical limits of scaling are real and, in this sense, Moore's law faces a wall. But engineers are never stopped by walls. Instead, they find a way around and continue.

.. figure:: images/moore.png
   :figwidth: 600px
   :align: center

Historically, two scaling laws have driven the race to put more components on a chip. The best-known is Moore's law. Another factor with an equally large impact is Dennard's scaling law, which was used until 2006 to decide how to adjust the supply voltage of transistors as they are shrunk. The supply voltage of a digital technology matters because it determines the maximum voltage that can occur across the gate, source, and drain when the transistor is turned off, and it determines the peak current that can flow through the transistor when the gate switches on.

.. figure:: images/inverterlayout.png
   :figwidth: 600px
   :align: center

Before 2006, when Dennard scaling still applied, the following rules were used to drive scaling. When the length :math:`L` and width :math:`W` of the transistor each scale by a factor of 0.7, the following holds.

* The area of the transistor drops to half (= twice as many transistors in the same die space).
* The capacitance :math:`C` of the transistor input scales by 0.7.
* The voltage :math:`V` across the transistor must be scaled by 0.7, to keep the electrical field within the transistor constant (excessive field strength can destroy the transistor).
* The switching delay of the transistor scales by 0.7, as a smaller capacitance is switched across a smaller voltage, albeit with a smaller charging current.
* The dynamic power consumption of the transistor reduces by half, as the dynamic power consumption is given by :math:`C V^2 f`.

Furthermore, because there are twice as many transistors in the same die area, the power density (power per unit of area) remains constant.
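The arithmetic behind these rules is compact enough to check directly. The following minimal sketch (an illustration written for this text, not taken from Dennard's paper) multiplies out the factors for a single 0.7x generation and confirms that per-transistor power halves while the power density stays constant.

.. code::

   # A minimal sketch of the Dennard scaling rules listed above, for one
   # technology generation in which L and W each shrink by a factor of 0.7.
   s = 0.7

   area  = s * s          # transistor area   -> ~0.5x (twice the density)
   C     = s              # gate capacitance  -> 0.7x
   V     = s              # supply voltage    -> 0.7x (constant field)
   delay = s              # switching delay   -> 0.7x
   f     = 1 / delay      # clock frequency   -> ~1.43x
   power = C * V**2 * f   # dynamic power C*V^2*f -> ~0.5x per transistor

   print(f"area {area:.2f}, delay {delay:.2f}, power {power:.2f}")

   # Power density = per-transistor power / per-transistor area ~= 1.0,
   # which is why pre-2006 chips could add transistors AND raise the clock
   # without overheating.
   print(f"power density {power / area:.2f}")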
The beauty of Dennard scaling is that chips become faster and can carry more transistors with every generation. However, since 2006, the frequency no longer increases and the voltage no longer decreases with every new technology generation. There are several reasons for this. First, Dennard scaling does not take static power consumption (transistor leakage current) into account, and leakage deteriorates the power density. Second, Dennard scaling does not take the delay through the interconnect into account. Wires do not shrink in the same manner as transistors, so the interconnect delay plays an increasingly important role in each new scaled-down generation.

The impact of the end of Dennard scaling on chip performance is considerable, as can be clearly seen in K. Rupp's `chart of microprocessor data `_.

.. figure:: images/ruppcpudata.png
   :figwidth: 600px
   :align: center

We can clearly see that the number of transistors keeps increasing with every new generation (orange). But, since processors do not run any faster (green), it becomes harder to get more instructions completed on a single processor (blue). One important trend to keep system performance growing is to increase the number of cores per chip (black).

And still, even this scaling story comes to an end, because the scaling of transistors runs into physical and economic limitations. Advanced process technologies are still able to shrink transistors, but it becomes very expensive to do so. As observed by several authors, the cost per transistor is no longer decreasing (for example, read `this post `_ by D. O'Loughlin on the rising cost of semiconductors). The implications of a constant transistor cost are not minor. Simply put, it no longer pays off to migrate a given system function to a smaller technology node. You can still get increased capacity by moving to a smaller technology node, but such a chip will be more expensive, too.

.. figure:: images/endofmoore.png
   :figwidth: 600px
   :align: center

   Slide by Marvell Semiconductor (`Marvell Investor Day Slide 43 `_)

And hence, we must revise how digital systems are built. This is a profound change with implications at every level of abstraction. Indeed, the only way to make hardware systems faster is to make every transistor perform more efficiently. Throwing more transistors at a problem is no longer a catch-all solution, because more transistors also *cost* more.

Plenty of room at the top
-------------------------

Intel shipped 14nm chips in 2014, but then stalled on 10nm integration, which was originally planned for 2016 and only reached production in 2019. Scaling down becomes increasingly difficult, and in order to keep improving, we will have to build *better* computers rather than merely smaller ones. The following discussion is limited to improvements that can be achieved with traditional computing techniques; it does not consider alternative computing mechanisms such as quantum computing.

The problem of building better computers (and not just smaller computers) is captured by Dr. Charles Leiserson and his colleagues in a paper called *There's plenty of room at the Top*. They observe that, to compensate for the increased difficulty of scaling, innovation is needed in three areas: better software, better algorithms, and better hardware architecture.

Better software
^^^^^^^^^^^^^^^

Leiserson observes that the efficiency of software (in terms of speed) varies by orders of magnitude. Consider, for example, a matrix multiplication, which is typically expressed as a three-level loop:

.. code::

   for i in range(4096):
       for j in range(4096):
           for k in range(4096):
               C[i][j] += A[i][k] * B[k][j]

Depending on the programming environment, the execution time varies greatly, *even if the exact same hardware is utilized*. The numbers provided in the paper tell it all:

+---------------------+----------------+----------------+
| Language            | Rel. Speed     | Rel. Speedup   |
+=====================+================+================+
| Python              | 1              | NA             |
+---------------------+----------------+----------------+
| Java                | 11             | 10.8           |
+---------------------+----------------+----------------+
| C                   | 47             | 4.4            |
+---------------------+----------------+----------------+
| Parallel Loops      | 366            | 7.8            |
+---------------------+----------------+----------------+
| Parallel + Locality | 6,727          | 18.4           |
+---------------------+----------------+----------------+
| + Vectorization     | 23,224         | 3.5            |
+---------------------+----------------+----------------+
| + AVX Intrinsics    | 62,806         | 2.7            |
+---------------------+----------------+----------------+

In this table, the second column lists the speedup relative to the Python baseline, and the third column lists the additional speedup over the previous row.

The reason for this speedup across almost 5 orders of magnitude is due to several factors, such as (1) algorithmic improvements, (2) making sure that the problem becomes a better 'fit' for the hardware, and (3) making sure that all the special features of the hardware, such as vector instructions, are properly used.
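To make the 'fit for the hardware' point concrete, the sketch below contrasts the naive triple loop with the same product computed through NumPy, whose matrix-multiply operator dispatches to a vectorized, cache-blocked BLAS kernel. The matrix size, the use of NumPy, and the timing harness are choices made here for illustration; they are not the measurements from Leiserson's paper, but the gap they expose is of the same nature.

.. code::

   # Illustrative only (not the paper's experiment): the same matrix product
   # computed with the naive triple loop and with NumPy's matmul operator,
   # which calls a vectorized, cache-blocked BLAS kernel. The matrix size is
   # kept small so that the naive version finishes in a reasonable time.
   import time
   import numpy as np

   n = 128
   A = np.random.rand(n, n)
   B = np.random.rand(n, n)

   def naive_matmul(A, B):
       m = len(A)
       C = np.zeros((m, m))
       for i in range(m):
           for j in range(m):
               for k in range(m):
                   C[i][j] += A[i][k] * B[k][j]
       return C

   t0 = time.perf_counter(); C1 = naive_matmul(A, B); t_naive = time.perf_counter() - t0
   t0 = time.perf_counter(); C2 = A @ B; t_blas = time.perf_counter() - t0

   assert np.allclose(C1, C2)
   print(f"naive: {t_naive:.2f} s  BLAS: {t_blas:.5f} s  "
         f"speedup: {t_naive / t_blas:.0f}x")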
Another way of looking at this result is to say that the flexibility and generality of native Python come at an enormous efficiency overhead. As scaling becomes harder, the impact of this overhead is increasingly felt at the higher layers of abstraction in the application stack. Therefore, applications must move 'closer' to the hardware to experience the same effect that Moore's law provided for decades.

Dr. Leiserson uses the term *software bloat* to describe this inefficiency. He observes that this inefficiency is essentially caused by the common practice, in software engineering, of solving problems by *reduction*. This means the following. You have to solve problem A, but you already have a solution to problem B. Rather than building a computer to solve problem A, you figure out a technique to transform problem A into problem B, and you are done. Reduction can be applied recursively: if problem A can be solved through problem B, and problem B can be solved through problem C, then problem A can be solved through problem C in two steps. And hence, we end up with a complex stack of software machines: Python runs as an interpreter on an operating system that runs on a hypervisor that runs on a processor which is implemented with a microarchitecture. We don't build a hardware machine to do a matrix multiplication; instead, we write a Python program and call it a day. But that shortcut appears to cost tremendously in efficiency.

Better Algorithms
^^^^^^^^^^^^^^^^^

In addition to better software, Dr. Leiserson observes that better algorithms are needed. Algorithms are invented (like science) rather than evolved (like engineering). And hence, a new insight into how to solve, for example, a graph optimization problem can suddenly decrease the complexity order of the problem and cause a jump in performance of 10x or more. Better algorithms therefore require careful design and development. Progress in algorithm design has come from three directions.

1. Attacking new problem domains. Novel problems such as machine learning, post-quantum cryptography, social networking, and robotics have spurred innovation in problem solving.

2. The need for scalability has pushed algorithm designers towards better solutions that can handle larger data sets. For example, the average of a large data set can be estimated quickly by computing the average of a random sample of that data set (see the short sketch after this list). Similarly, approximate computing has developed techniques to derive answers that are close to correct, often with large savings in runtime and resource cost.

3. Finally, tailoring algorithms towards the hardware is another source of improvement. Historically, algorithms have been developed against a single-thread random-access machine. New algorithms, however, may take into account the multi-core structure of GPUs or the distributed nature of FPGA fabric. Developing an algorithm with the final architecture in mind is one way to bring algorithms closer to hardware, but it is not easy.
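As a concrete illustration of the sampling idea in the second point above, the following sketch (a toy example written for this text, not taken from the paper) estimates the mean of a large data set from a small random sample:

.. code::

   # Illustrative sketch: estimate the mean of a large data set from a small
   # random sample instead of touching every element.
   import random

   data = [random.gauss(100, 15) for _ in range(1_000_000)]

   exact = sum(data) / len(data)          # scans all 1,000,000 elements
   sample = random.sample(data, 10_000)   # scans only 10,000 elements
   estimate = sum(sample) / len(sample)

   print(f"exact mean: {exact:.2f}  sampled estimate: {estimate:.2f}")
   # The estimate is typically within a fraction of a percent of the exact
   # value, for about 1/100th of the work.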
Better Hardware Architecture
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The third area of improvement enumerated in Dr. Leiserson's paper is that of hardware architecture. In the past, the role of hardware was straightforward: transistor sizes shrink with every generation, so using more transistors to solve a given problem was an easy way to turn technology scaling into performance improvement.

Clock frequency increases are no longer a source of improvement, either. Since 2006, the so-called Dennard scaling has ended. Dennard observed that the smaller feature sizes of every hardware generation allow one to reduce the voltage (to keep the electric field constant in a smaller transistor) and to increase the clock (because of a smaller, faster transistor). The net effect of Dennard scaling was that the power density remained more or less constant: each transistor consumed less power, while there were more transistors per unit of area. Around 2006, however, the supply voltage could no longer be reduced at the same rate, mainly because of leakage, and raising the clock frequency any further would have pushed the power density beyond what can be cooled on-chip. As a result, newer generations of chips still shrink their transistors, but they no longer run at higher clock frequencies.

So in the future, every transistor will have to become more efficient, because you will no longer automatically get more (affordable) transistors with every new technology generation. In other words, a given functionality X will have to be accomplished with *fewer* (not more!) transistors, because that is the only way to increase hardware parallelism on the chip. A good example of this observation is the SPECINT benchmark, which measures processor performance. Since 2004, single-core machines have demonstrated a 3x performance gain, while multicore machines have demonstrated a 20-30x performance gain.

Innovation in hardware architecture must come from two factors:

1. Use simple processors, but use lots of them per chip. This increases the overall parallelism on the chip.

2. Use domain specialization, and apply hardware specialized towards a single task. Example success stories include the GPU and Google's Tensor Processing Unit (TPU).

An excellent historical study of computer architecture and the crucial role of hardware is offered by Hennessy and Patterson's `review paper `_. They review several seminal architectures and highlight their role in computer history. Hennessy and Patterson conclude that there are three major opportunities for future computer architecture improvements.
1. Better programming of cores. Note that this is similar to Leiserson's argument that software and hardware should live closer together.

2. Domain-specific architectures, such as GPUs, network processors, deep-learning processors, and software-defined radios, will offer the next level of improvement in programmable machines. Domain-specific architectures, or DSAs for short, outperform classic general-purpose machines for three reasons. First, they offer better parallelism. Second, they provide better memory hierarchies for the problem at hand. Third, they allow for custom precision in their operations. Examples of successful DSAs today are GPUs and TPUs (a short sketch of the custom-precision idea follows after this list).

3. Open architectures (think RISC-V) will enable a broader range of innovation opportunities by enabling customization at every level of the processing abstraction.
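To make the third DSA advantage -- custom precision -- concrete, the sketch below (a generic illustration, not the number format of the TPU or of any other specific accelerator) computes a dot product with 8-bit quantized values and compares it against the float32 result:

.. code::

   # Illustrative sketch of the custom-precision idea behind DSAs: a dot
   # product computed on 8-bit quantized values needs a quarter of the
   # storage of float32 and maps onto much cheaper arithmetic units, at the
   # cost of a small error. The simple scale-factor quantization below is a
   # generic example only.
   import numpy as np

   x = np.random.uniform(-1, 1, 1024).astype(np.float32)
   w = np.random.uniform(-1, 1, 1024).astype(np.float32)

   scale = 127.0
   xq = np.round(x * scale).astype(np.int8)   # quantize to 8-bit integers
   wq = np.round(w * scale).astype(np.int8)

   exact = float(np.dot(x, w))
   # Low-precision hardware typically multiplies 8-bit values and accumulates
   # in a wider integer; we mimic that with an int32 accumulation.
   approx = int(np.dot(xq.astype(np.int32), wq.astype(np.int32))) / (scale * scale)

   print(f"float32 dot: {exact:.4f}  int8 dot: {approx:.4f}  "
         f"difference: {abs(exact - approx):.4f}")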