Lecture 11: ASIC vs FPGA

5:00 ASIC vs FPGA

In this course we are making use of an SRAM FPGA from Intel called 
Cyclone V. But what other technologies can we use to implement hardware?

We will be making a (high-level) comparison between the different options,
and put two of the more common techniques against each other:
	1. Field Programmable Gate Arrays (FPGA)
	2. Application Specific Integrated Circuits (ASIC)

Our objective is to understand the major differences between these
two technologies.

We will start by explaining two basic points in digital design that
affect both the FPGA design methods as well as the ASIC design methods:
power dissipation and timing.

From there we will review FPGA technology and ASIC technology in turn,
followed by the design tools used to program an FPGA and to design an ASIC.

Finally, we put both of them next to each other to compare.

Note that the following is necessarily a very crude and abridged comparison.
It highlights the important differences, but skimps on many of the finer
points. Nevertheless, by the end of this lecture, you should have a sense of
both technologies, enough to appreciate technological differences.

Reading:

1/ J. Rabaey, Digital Integrated Circuits
2/ Weste & Harris, CMOS VLSI Design 

5:05 The CMOS inverter driving other gates

(drawing of a CMOS inverter)

     Vdd
    PMOS
      +---->-------- long wire -------------> fan-out
    NMOS
     Vss

  Ron = on-resistance
        larger transistors have smaller Ron

  Cin = input capacitance
        larger transistors have larger Cin

  Rwire = wire series resistance
          longer wire has a larger Rwire

  Cwire = wire parasitic capacitance
          longer wire has a larger Cwire

  Switch time ~ (Ron + Rwire).(Cin,fanout + Cwire) 

  Hence, faster circuits
     - need larger drive transistors
     - need shorter wires
     - need lower fan-out
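The three conclusions above can be checked numerically with a minimal
sketch of the switch time estimate. The component values are hypothetical,
chosen only for illustration, and the fan-out load is assumed to be
'fanout' identical gate inputs of capacitance Cin each.

```python
def switch_time(r_on, r_wire, c_in, fanout, c_wire):
    """First-order RC estimate of the switch time, in seconds.

    Assumption: the fan-out load is modeled as `fanout` identical
    gate inputs, each with input capacitance c_in.
    """
    return (r_on + r_wire) * (fanout * c_in + c_wire)


# Hypothetical values (ohm and farad), not taken from any real process.
base = switch_time(r_on=1000.0, r_wire=200.0,
                   c_in=2e-15, fanout=4, c_wire=10e-15)

# Larger drive transistor: smaller Ron.
stronger_driver = switch_time(r_on=500.0, r_wire=200.0,
                              c_in=2e-15, fanout=4, c_wire=10e-15)

# Shorter wire: both Rwire and Cwire are halved.
shorter_wire = switch_time(r_on=1000.0, r_wire=100.0,
                           c_in=2e-15, fanout=4, c_wire=5e-15)

# Lower fan-out: fewer gate inputs to charge.
lower_fanout = switch_time(r_on=1000.0, r_wire=200.0,
                           c_in=2e-15, fanout=2, c_wire=10e-15)

for name, t in [("base", base),
                ("stronger driver", stronger_driver),
                ("shorter wire", shorter_wire),
                ("lower fan-out", lower_fanout)]:
    print(f"{name:16s} {t * 1e12:6.1f} ps")
```

Each of the three changes lowers the estimate relative to the base case.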

5:05. Power Dissipation

The power dissipation in a CMOS circuit is given by

	P = alpha . (CL + Csc) . Vswing . Vdd . f + Ileak . Vdd
        -------------------------------------   -----------
                          DYNAMIC                  STATIC

alpha = activity factor
      = probability that a net makes a transition in a given cycle
      ~ proportion of the nets that make a transition in a given cycle

CL  = capacitive load
    = the total equivalent capacitance present in gate inputs, and in wires

Csc = short-circuit capacitance
    = the equivalent capacitance that models short-circuit current

Vswing = voltage swing at the output (difference between the high and
         the low output level). In principle, up to Vdd

Vdd = operating voltage

f   = operating frequency

Ileak = leakage current

The Capacitive Load is determined by:
		- the fanout (the number of gates driven by a gate)
		- the length and nature of the wire driven
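As a rough illustration, the power formula above can be evaluated
directly. This is a sketch only: all numbers below are hypothetical
chip-level values, and the leakage current is assumed to stay constant
when Vdd changes, which is a simplification.

```python
def cmos_power(alpha, c_load, c_sc, v_swing, v_dd, freq, i_leak):
    """Return (dynamic, static) power in watts.

    dynamic = alpha * (CL + Csc) * Vswing * Vdd * f
    static  = Ileak * Vdd
    """
    dynamic = alpha * (c_load + c_sc) * v_swing * v_dd * freq
    static = i_leak * v_dd
    return dynamic, static


# Hypothetical values: 10% activity, 1 nF total load, 50 MHz clock.
dyn, stat = cmos_power(alpha=0.1, c_load=1e-9, c_sc=0.1e-9,
                       v_swing=1.1, v_dd=1.1, freq=50e6, i_leak=1e-3)

# Same circuit at half the supply voltage, with full output swing.
# Assumption: Ileak is kept constant for simplicity.
dyn_half, stat_half = cmos_power(alpha=0.1, c_load=1e-9, c_sc=0.1e-9,
                                 v_swing=0.55, v_dd=0.55, freq=50e6,
                                 i_leak=1e-3)

print(f"dynamic {dyn * 1e3:.3f} mW, static {stat * 1e3:.3f} mW")
print(f"at Vdd/2: dynamic {dyn_half * 1e3:.3f} mW, "
      f"static {stat_half * 1e3:.3f} mW")
```

With full swing (Vswing = Vdd), the dynamic term scales with the square
of Vdd, so halving the supply cuts dynamic power by a factor of four.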

5:10 Timing

A digital circuit has a timing characteristic.
Every combinational path has a given propagation delay:
	- input to output
	- input to register input
	- register output to output
	- register output to register input

The slowest of such paths in a digital circuit is called the critical path.

The difference between the clock period and the delay of the critical
path is called the slack. For a correct circuit, the slack must be positive.

The synthesis tools will attempt to implement the circuit
such that the resulting slack is positive.
The user can set the desired clock period as a constraint.
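A minimal sketch of the slack computation, taking slack as the clock
period constraint minus the delay of the slowest path. The path names
and delays (in ns) are made up for illustration; a real timing analyzer
also accounts for setup time, clock skew and so on, which are ignored here.

```python
def critical_path(path_delays):
    """Return (name, delay) of the slowest path."""
    name = max(path_delays, key=path_delays.get)
    return name, path_delays[name]


def slack(clock_period, path_delays):
    """Slack = clock period constraint - critical path delay."""
    _, delay = critical_path(path_delays)
    return clock_period - delay


# Hypothetical propagation delays in nanoseconds.
paths_ns = {
    "input to output": 5.0,
    "input to register": 3.2,
    "register to output": 4.1,
    "register to register": 8.7,
}

name, delay = critical_path(paths_ns)
print(f"critical path: {name} ({delay} ns)")

for period in (10.0, 8.0):
    s = slack(period, paths_ns)
    verdict = "meets timing" if s > 0 else "fails timing"
    print(f"clock period {period} ns: slack {s:+.1f} ns, {verdict}")
```

Tightening the constraint from 10 ns to 8 ns turns the slack negative,
which is the point where the tools have to restructure the circuit.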

By reducing the clock period constraint, the digital synthesis
tools will have to work harder to find a solution that is
fast enough. Typically, they may use a bigger, stronger gate,
or a design that computes more things in parallel (e.g. a carry
save adder instead of a ripple carry adder).

By using a bigger, stronger gate, we reduce Ron and
hence are able to shorten the rise time.
This will, of course, also increase the power dissipation,
since larger transistors present a larger Cin to the gate driving them.

Hence, by decreasing the clock period constraint, the tools will
produce circuits that are faster, larger, and that consume more
power.

5:15. ASIC Technology

We will not talk about silicon processing, other than mentioning
the following: There are several layers of n-doped and p-doped 
silicon which form transistors. These transistors are interconnected
with metal wires.

(slide 2)

Let's start with a review of how digital circuits can get implemented.

	- Full Custom Design:
		Complete custom-designed layout of transistors and
		interconnect. Labor intensive and complex. Only for
		the most advanced and most demanding designs.

    - Semicustom Design:
        Automates the mechanism of laying out wires and transistors
        according to a pattern.

        Cell-based Semicustom Design:
        Creates a unit of design, larger than a transistor, called
        a cell. These cells are then placed on a silicon die and
        a custom network is created to interconnect them

        There are two kinds of cell technologies:

        	Standard cells: Simple gates and registers with few
        	inputs. Standard cells have a regular geometry which
        	allows them to be arranged on a power grid. 

        	Macro cells: Complex functions such as memories,
        	complex I/O functions, optimized intellectual property
        	designs, etc. Macro cells have an irregular geometry.

(slide 3-5)
        
        Array Based Design:
        Creates an array of uniform cells which may be specialized.
        The designer no longer needs to place the cells, but only
        needs to customize the cell function to the application.

        Pre-diffused (Gate Array):
        The network and the specific cell function are decided
        before the last fabrication step.

        Pre-wired (FPGA):
        The network and the specific cell function are decided
        after fabrication, when the chip is configured in the field.

Besides the logic functions, another major challenge of on-chip
design is coping with interconnect.

Besides the normal signal wires, there are two networks on a chip
that deserve special consideration:

  1/ The power network, which needs to distribute power evenly
     across a chip

  2/ The clock network, which needs to distribute the clock with
     minimal skew across a chip

Both the power network and the clock network are handled
separately by the design tools.

5:20 FPGA Technology

The basic concept of programmable gate array logic is to provide
a chip to the customer that can be customized in the field (hence
the name). Programming involves two aspects:

   - Deciding on the logic function that needs to be implemented
   - Storing the configuration that determines the logic function

There are commonly three different strategies to store the programming
of an FPGA:

   1. Fuse based FPGA: 

      The programming is done once by blowing fuses, or by 
      connecting them (antifuses).

      This is a program-once technology. The advantage is that 
      fuses are small, and that the configuration can be stored
      on-chip. The disadvantage, of course, is that the functionality
      cannot be changed.

   2. Nonvolatile (flash-based) FPGA:

      The configuration is stored in EEPROM cells, nonvolatile memory
      cells that are located on chip. EEPROM requires extra processing
      steps, making the manufacturing of such FPGA more expensive.

   3. Volatile (SRAM-based) FPGA:

      The configuration is stored in SRAM cells, located on the chip.
      This is the most popular technology at the moment, and both market
      leaders (Xilinx and Intel, formerly Altera) concentrate on
      SRAM-based FPGA. The disadvantage is that the configuration content
      needs to be stored off-chip, and before the chip can be used,
      you have to configure it. 

The precursors of the FPGA were called Programmable Logic Arrays.
These chips were specialized in implementing logic functions.

(slide 6, 7)

More recent generations of FPGAs use a regular array of logic
functions, interleaved with a regular array of interconnect.

A modern FPGA uses an 'island style' arrangement.
There are five configurable elements in such an FPGA:
  1/ There are logic elements, called Configurable Logic Blocks (Xilinx)
     or Logic Elements (Altera)
  2/ There are memory elements, hard macros of memory cells
  3/ There are configurable multipliers, hard macros with arithmetic logic
  4/ There are dedicated clock generation and distribution elements
  5/ There are configurable input/output pins

(slide 8)

Let's decode what the FPGA on our board says:

		5CSEMA5F31C6N

		Cyclone V
		SoC (dual core embedded ARM)
		M 1 hard memory controller (they mean: a hard macro
			that controls DRAM)
		A5 85K logic elements
		F31 896 pin package
		C Commercial Temp Range
		6 Speed Grade (fastest)
		N lead free packaging
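The decoding above can be written down as a small table-driven sketch.
The field names and the fixed field widths are assumptions made for this
example (in particular the split of the leading "5CSE" into "5C" and
"SE"); the meanings are the ones listed in these notes and apply to this
specific part number only.

```python
# (field name, number of characters, meaning for 5CSEMA5F31C6N)
# Field names and widths are assumptions for this sketch.
FIELDS = [
    ("family",      2, "Cyclone V"),
    ("variant",     2, "SoC (dual core embedded ARM)"),
    ("hard_ip",     1, "1 hard memory controller"),
    ("member",      2, "85K logic elements"),
    ("package",     3, "896 pin package"),
    ("temperature", 1, "Commercial Temp Range"),
    ("speed_grade", 1, "Speed Grade (fastest)"),
    ("suffix",      1, "lead free packaging"),
]


def decode(part_number):
    """Split a part number into fixed-width fields.

    Only meaningful for codes laid out exactly like 5CSEMA5F31C6N.
    """
    expected = sum(width for _, width, _ in FIELDS)
    if len(part_number) != expected:
        raise ValueError(f"expected {expected} characters")
    result = {}
    pos = 0
    for name, width, meaning in FIELDS:
        result[name] = (part_number[pos:pos + width], meaning)
        pos += width
    return result


decoded = decode("5CSEMA5F31C6N")
for name, (code, meaning) in decoded.items():
    print(f"{name:12s} {code:4s} {meaning}")
```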

Notes on temperature range and speed grade:
	- Integrated circuits slow down at high temperature
	- Temperature range is the range in which the circuit will
	  work according to the datasheet specs
	- Different chips have different speed, even when produced
	  using an identical layout
	- Speed testing during manufacturing determines how fast a
	  chip can work -> this leads to the speed grade
	- This process is called 'binning' - faster, better components
	  are marked as such, and are sold at a premium
 
5:35 Logic elements in the Cyclone V

(slide 10-11)

The logic elements:
   - contain lookup tables
   - contain registers
   - contain carry chains

Each logic element can thus serve for different purposes:
   - flip-flop storage
   - lookup table logic function
   - both storage and lookup table logic function
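To make the lookup table idea concrete, here is a toy model of a logic
element: a truth table indexed by the inputs, followed by an optional
register. It is a generic sketch, not the actual Cyclone V logic element;
the class and function names are invented for this example, and carry
chains are not modeled.

```python
def make_lut(func, n_inputs):
    """Precompute the truth table of func as a list of 2**n_inputs bits."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the logic function by a single table lookup."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return table[index]


class LogicElement:
    """Toy logic element: one LUT plus one optional output register."""

    def __init__(self, table, registered):
        self.table = table
        self.registered = registered
        self.q = 0  # flip-flop state

    def clock(self, *inputs):
        """Apply inputs for one clock cycle and return the output."""
        comb = lut_read(self.table, *inputs)
        if not self.registered:
            return comb       # lookup table logic function only
        out = self.q          # value stored at the previous clock edge
        self.q = comb         # value captured at this clock edge
        return out


# The same hardware implements different functions, depending only
# on the table contents: sum and carry of a full adder.
sum_table = make_lut(lambda a, b, c: a ^ b ^ c, 3)
carry_table = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)

print("sum LUT  ", sum_table)
print("carry LUT", carry_table)
```

Changing the function means changing the table contents, which is what
the configuration of the FPGA stores.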

The logic elements are arranged in blocks of 10, called LAB
or MLAB. These blocks connect to local interconnect and global row/
column interconnect.

The idea is that there is a hierarchy in the network. Short connections
between logic elements can be done through small, short wires (i.e. deepest
layer metal). Long connections between LEs far apart will need to take the
'highway' of global interconnect, which can carry a limited set of signals
across the chip, very quickly.

5:40 Memory elements in the Cyclone V

(slide 12)

Besides these logic elements, the Cyclone V can implement
memory modules - RAM modules that accept an address. We will
discuss the instantiation of these later.

The Cyclone V A5 has:
	446 blocks of 'M10K' modules, configurable 10 kilobit RAM modules
	679 blocks of MLAB modules, configurable 640 bit modules built from logic elements

	In total, the Cyclone V A5 chip has 4,884 Kilobit of on-chip memory.
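The total follows from the two block counts, as this short check shows.
It assumes that one kilobit is 1024 bits and that an M10K block holds
exactly 10 kilobit.

```python
M10K_BLOCKS = 446
M10K_BITS = 10 * 1024   # assumption: 1 kilobit = 1024 bits
MLAB_BLOCKS = 679
MLAB_BITS = 640

m10k_kbit = M10K_BLOCKS * M10K_BITS / 1024
mlab_kbit = MLAB_BLOCKS * MLAB_BITS / 1024
total_kbit = m10k_kbit + mlab_kbit

print(f"M10K : {m10k_kbit:8.3f} Kbit")
print(f"MLAB : {mlab_kbit:8.3f} Kbit")
print(f"total: {total_kbit:8.3f} Kbit")
```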

5:45 DSP elements in Cyclone V

(slide 13)

Multipliers are extensively used in signal processing, communications,
and accelerated scientific computation. FPGAs have dedicated multipliers
on board. Why? Because they are much faster than the multipliers you
would implement using configurable logic elements.

The DSP elements contain multiply-accumulation logic.

The Cyclone V A5 has:
    150 DSP modules.
    This means it can do up to 300 18*18 bit multiplies PER CLOCK CYCLE.
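A back-of-the-envelope sketch of what these numbers imply. The 100 MHz
clock and the 1024-term dot product are hypothetical examples, and the
cycle count ignores pipeline latency and the final summation.

```python
import math

DSP_MODULES = 150
MULTIPLIES_PER_CYCLE = 300
MULTIPLIERS_PER_MODULE = MULTIPLIES_PER_CYCLE // DSP_MODULES


def peak_multiplies_per_second(clock_hz):
    """Upper bound: every 18x18 multiplier is busy in every cycle."""
    return MULTIPLIES_PER_CYCLE * clock_hz


def cycles_for_products(n_terms, multipliers=MULTIPLIES_PER_CYCLE):
    """Cycles needed to form n_terms products with all multipliers busy."""
    return math.ceil(n_terms / multipliers)


def mac(acc, a, b):
    """One multiply-accumulate step, as done inside a DSP element."""
    return acc + a * b


print("18x18 multipliers per DSP module:", MULTIPLIERS_PER_MODULE)
print("peak at a hypothetical 100 MHz:",
      peak_multiplies_per_second(100e6), "multiplies/s")
print("cycles for 1024 products:", cycles_for_products(1024))

# A small dot product computed as a chain of MAC steps.
acc = 0
for a, b in zip([1, 2, 3], [4, 5, 6]):
    acc = mac(acc, a, b)
print("dot product:", acc)
```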

5:50 ASIC Design Flow

(slide 14)

The semicustom design flow goes through the following steps:

- Design Capture: 
     - writing Verilog code (RTL)
     - integrate hard macros as black-box modules in RTL
       (ie. we write a functional equivalent in RTL, but
        in the actual chip implementation, we will use a hard macro)
- Front-end Design:
     - select design constraints 
         - tech library
         - target clock speed
         - optimizations such as pipelining etc
     - perform logic synthesis
     - result is a netlist of cells in specified technology
- Back-end Design:
     - Floorplan: decide on the overall organization of the chip
         - choose chip size
         - choose IO pin locations
         - decide on power network architecture
     - Placement
         - place cells within core area
     - Routing
         - clock tree synthesis
         - build interconnections between cells

- Simulation/Verification is extensively used during the design
    - to verify functionality
    - to verify if the technology mapped result is still functional

5:50. FPGA Design Flow

(slide 15)

The FPGA design flow contains similar steps.
However, there is no 'backend' in the ASIC sense.

The FPGA tools are concerned with finding the proper FPGA programming.

The constraints of the design include:
  - Desired speed
  - I/O pin binding
  - placement of specific functions
    (FPGAs can emulate 'hard macros' through selected configuration
     of cells)

Demo: investigate the backend results of place-and-route for the
      tonegen design

In Quartus - Select Fitter - Chip Planner
   - Demonstrate LAB
   - Demonstrate Logic Cell
   - Demonstrate IO
   - Demonstrate fan-out of interconnect

6:00 ASIC vs FPGA

Finally, let us pit ASIC against FPGA.

(slide 17)

The following are the design criteria:

NRE = Non Recurring Engineering Cost
    = Cost to design the very first circuit
    = Engineering Cost + Prototype Manufacturing Cost

Unit Cost = Cost per instance of the design

Power = Power consumption of the design (assuming all technologies
        implement the same function)

Time to Market = time from conception to product for sale
Performance = Maximum speed available in this technology

Cost to the manufacturer
   = NRE + number_of_units_sold * Unit Cost

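An ASIC has a higher NRE but a lower unit cost than an FPGA, so the
two cost curves have a cross-over point.
Below that volume, FPGA is cheaper. For very high volume, ASIC will
always be cheaper than FPGA.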
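The cross-over point follows from setting the two cost expressions
equal. Below is a minimal sketch; the NRE and unit cost figures are
invented for illustration and do not describe any real product.

```python
def total_cost(nre, unit_cost, units):
    """Cost to the manufacturer = NRE + number_of_units_sold * Unit Cost."""
    return nre + units * unit_cost


def crossover_volume(nre_asic, unit_asic, nre_fpga, unit_fpga):
    """Smallest volume at which the ASIC is strictly cheaper.

    Solves nre_asic + n * unit_asic < nre_fpga + n * unit_fpga for n.
    Expects integer costs, with the ASIC having the higher NRE and
    the lower unit cost.
    """
    if nre_asic <= nre_fpga or unit_asic >= unit_fpga:
        raise ValueError("no cross-over for these costs")
    return (nre_asic - nre_fpga) // (unit_fpga - unit_asic) + 1


# Hypothetical costs, in arbitrary currency units.
NRE_ASIC, UNIT_ASIC = 2_000_000, 5
NRE_FPGA, UNIT_FPGA = 100_000, 50

n_star = crossover_volume(NRE_ASIC, UNIT_ASIC, NRE_FPGA, UNIT_FPGA)
print("ASIC becomes cheaper from", n_star, "units")

for units in (1_000, n_star, 1_000_000):
    asic = total_cost(NRE_ASIC, UNIT_ASIC, units)
    fpga = total_cost(NRE_FPGA, UNIT_FPGA, units)
    print(f"{units:>9} units: ASIC {asic:>10}, FPGA {fpga:>10}")
```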