Lecture 11: ASIC vs FPGA

5:00 ASIC vs FPGA

In this course we are making use of an SRAM FPGA from Intel called Cyclone V. But what other technologies can we use to implement hardware?

We will make a (high-level) comparison between the different options, and put two of the more common techniques against each other:

1. Field Programmable Gate Arrays (FPGA)
2. Application Specific Integrated Circuits (ASIC)

Our objective is to understand the major differences between these two technologies. We will start by explaining two basic points in digital design that affect both the FPGA design methods and the ASIC design methods: power dissipation and timing. From there we will review the technologies of FPGA and ASIC respectively, followed by the design tools used to program an FPGA and to design an ASIC. Finally, we put both of them next to each other to compare.

Note that the following is necessarily a very crude and abridged comparison. It highlights the important differences, but skimps on many of the finer points. Nevertheless, by the end of this lecture, you should have a sense of both technologies, enough to appreciate the technological differences.

Reading:
1/ J. Rabaey, Digital Integrated Circuits
2/ Weste & Harris, CMOS VLSI Design

5:05 The CMOS inverter driving other gates

(drawing of a CMOS inverter)

      Vdd
      PMOS
        +---->-------- long wire -------------> fan-out
      NMOS
      Vss

Ron   = on-resistance                larger transistors have smaller Ron
Cin   = input capacitance            larger transistors have larger Cin
Rwire = wire series resistance       longer wire has a larger Rwire
Cwire = wire parasitic capacitance   longer wire has a larger Cwire

Switch time ~ (Ron + Rwire) . (Cin,fanout + Cwire)

Hence, faster circuits
- need larger drive transistors
- need shorter wires
- need lower fan-out

5:05. Power Dissipation

The power dissipation in a CMOS circuit is given by

P = alpha . (CL + Csc) . Vswing . Vdd . f  +  Ileak . Vdd
    -------------------------------------     -----------
                   DYNAMIC                       STATIC

alpha  = activity factor
       = probability that a net makes a transition in a given cycle
       ~ proportion of the nets that make a transition in a given cycle
CL     = capacitive load
       = the total equivalent capacitance present in gate inputs, and in wires
Csc    = short-circuit capacitance
       = the equivalent capacitance that models short-circuit current
Vswing = voltage swing at the output. In principle, up to Vdd
Vdd    = operating voltage
f      = operating frequency
Ileak  = leakage current

The capacitive load is determined by:
- the fanout (the number of gates driven by a gate)
- the length and nature of the wire driven

5:10 Timing

A digital circuit has a timing characteristic. Every combinational path has a given propagation delay:
- input to output
- input to register input
- register output to output
- register output to register input

The slowest of such paths in a digital circuit is called the critical path.

The difference between the clock period and the delay of the critical path is called the slack (slack = clock period - critical path delay). For a correct circuit, the slack must be positive.

The synthesis tools will attempt to implement the circuit such that the resulting slack is positive. The user can set the desired clock period as a constraint.

By reducing the clock period constraint, the digital synthesis tools will have to work harder to find a solution that is fast enough. Typically, they may use a bigger, stronger gate, or a design that computes more things in parallel (e.g. carry save adder instead of ripple carry adder).

By using a bigger, stronger gate, we reduce Ron and hence are able to shorten the rise time (trise). This will, of course, also increase the power dissipation.

Hence, by decreasing the clock period constraint, the tools will produce circuits that are faster, larger, and that consume more power.

5:15.
ASIC Technology

We will not talk about silicon processing, other than mentioning the following: there are several layers of n-doped and p-doped silicon which form transistors. These transistors are interconnected with metal wires.

(slide 2)

Let's start with a review of how digital circuits can be implemented.

- Full Custom Design: Complete custom-designed layout of transistors and interconnect. Labor intensive and complex. Only for the most advanced and most demanding designs.

- Semicustom Design: Automates the mechanism of laying out wires and transistors according to a pattern.

  Cell-based Semicustom Design: Creates a unit of design, larger than a transistor, called a cell. These cells are then placed on a silicon die and a custom network is created to interconnect them. There are two kinds of cell technologies:

    Standard cells: Simple gates and registers with few inputs. Standard cells have a regular geometry which allows them to be arranged on a power grid.

    Macro cells: Complex functions such as memories, complex I/O functions, optimized intellectual property designs, etc. Macro cells have an irregular geometry.

  (slide 3-5)

  Array Based Design: Creates an array of uniform cells which may be specialized. The designer no longer needs to place the cells, but only needs to customize the cell function to the application.

    Pre-diffused (Gate Array): The network and the specific cell function are decided before the last fabrication step.

    Pre-wired (FPGA): The network and the specific cell function are decided after fabrication, when the chip is programmed in the field.

Besides the logic functions, another major challenge of on-chip design is coping with interconnect. In addition to the normal signal wires, there are two networks on a chip that deserve special consideration:

1/ The power network, which needs to distribute power evenly across a chip
2/ The clock network, which needs to distribute the clock with minimal skew across a chip

Both the power network and the clock network are handled separately by the design tools.
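The power and clock networks exist to serve the two quantities introduced at the start of this lecture: power dissipation and timing. As a rough numeric sketch of those two formulas (not a model of any real chip), the Python below evaluates the dynamic and static power terms and the slack. All numbers and function names are made-up assumptions for illustration only.

```python
def dynamic_power(alpha, c_load, c_sc, v_swing, vdd, freq):
    """Dynamic term of the power equation: alpha.(CL + Csc).Vswing.Vdd.f (watts)."""
    return alpha * (c_load + c_sc) * v_swing * vdd * freq


def static_power(i_leak, vdd):
    """Static term of the power equation: Ileak.Vdd (watts)."""
    return i_leak * vdd


def slack(clock_period, critical_path_delay):
    """Slack = clock period - critical path delay (seconds). Must be positive."""
    return clock_period - critical_path_delay


# Hypothetical operating point (assumed values, not taken from a datasheet).
ALPHA = 0.1        # 10% of the nets toggle per cycle
C_LOAD = 1.0e-9    # total switched capacitance, farads
C_SC = 0.1e-9      # equivalent short-circuit capacitance, farads
VDD = 1.1          # volts; full swing assumed, so Vswing = Vdd
FREQ = 50e6        # 50 MHz clock
I_LEAK = 10e-3     # 10 mA total leakage

p_dyn = dynamic_power(ALPHA, C_LOAD, C_SC, VDD, VDD, FREQ)
p_stat = static_power(I_LEAK, VDD)
path_slack = slack(clock_period=1.0 / FREQ, critical_path_delay=17.5e-9)

print(f"dynamic power: {p_dyn * 1e3:.3f} mW")
print(f"static power:  {p_stat * 1e3:.3f} mW")
print(f"slack:         {path_slack * 1e9:.2f} ns")
```

Doubling FREQ in this sketch doubles the dynamic term and halves the clock period, which turns the slack negative for the same critical path. That is the trade-off the synthesis tools have to resolve with larger gates or more parallel logic.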
5:20 FPGA Technology

The basic concept of programmable gate array logic is to provide a chip to the customer that can be customized in the field (hence the name). Programming involves two aspects:
- Deciding on the logic function that needs to be implemented
- Storing the configuration that determines the logic function

There are commonly three different strategies to store the programming of an FPGA:

1. Fuse based FPGA: The programming is done once by blowing fuses, or by connecting them (antifuses). This is a program-once technology. The advantage is that fuses are small, and that the configuration can be stored on-chip. The disadvantage, of course, is that the functionality cannot be changed.

2. Nonvolatile (flash-based) FPGA: The configuration is stored in EEPROM cells, nonvolatile memory cells that are located on chip. EEPROM requires extra processing steps, making the manufacturing of such FPGAs more expensive.

3. Volatile (SRAM-based) FPGA: The configuration is stored in SRAM cells, located on the chip. This is the most popular technology at the moment, and both market leaders (Xilinx and Intel, formerly Altera) concentrate on SRAM-based FPGAs. The disadvantage is that the configuration content needs to be stored off-chip, and that the chip has to be configured before it can be used.

The early generations of FPGA were called Programmable Logic Arrays. These chips were specialized in implementing logic functions.

(slide 6, 7)

More recent generations of FPGAs use a regular array of logic functions, interleaved with a regular array of interconnect.

A modern FPGA uses an 'island style' arrangement.
There are five configurable elements in such an FPGA:

1/ There are logic elements, called Configurable Logic Blocks (Xilinx) or Logic Elements (Altera)
2/ There are memory elements, hard macros of memory cells
3/ There are configurable multipliers, hard macros with arithmetic logic
4/ There are dedicated clock generation and distribution elements
5/ There are configurable input/output pins

(slide 8)

Let's decode what the FPGA on our board says: 5CSEMA5F31C6N

5CSE   Cyclone V SoC (dual core embedded ARM)
M      1 hard memory controller (they mean: a hard macro that controls DRAM)
A5     85K logic elements
F31    896 pin package
C      Commercial Temp Range
6      Speed Grade (fastest)
N      lead free packaging

Notes on temperature range and speed grade:
- Integrated circuits slow down at high temperature
- Temperature range is the range in which the circuit will work according to the datasheet specs
- Different chips have different speeds, even when produced using an identical layout
- Speed testing during manufacturing determines how fast a chip can work -> this leads to the speed grade
- This process is called 'binning': faster, better components are marked as such, and are sold at a premium

5:35 Logic elements in the Cyclone V

(slide 10-11)

The logic elements:
- contain lookup tables
- contain registers
- contain carry chains

Each logic element can thus serve different purposes:
- flip-flop storage
- lookup table logic function
- both storage and lookup table logic function

The logic elements are arranged in blocks of 10, called LAB or MLAB. These blocks connect to local interconnect and global row/column interconnect.

The idea is that there is a hierarchy in the network. Short connections between logic elements can be made through small, short wires (i.e. deepest layer metal). Long connections between LEs that are far apart need to take the 'highway' of global interconnect, which can carry a limited set of signals across the chip, very quickly.
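To make the "lookup table logic function" idea concrete, here is a minimal Python sketch of a generic k-input lookup table. The configuration is a truth table of 2^k bits, and evaluating the logic function is a memory read addressed by the inputs. The helper names `make_lut` and `lut_read` and the 4-input size are assumptions for illustration. This is not a model of the actual Cyclone V logic element.

```python
def make_lut(func, num_inputs=4):
    """Return the configuration bits (truth table) that make a LUT compute func.

    Entry `address` holds func applied to the bits of `address`, with
    input 0 as the least significant address bit.
    """
    table = []
    for address in range(2 ** num_inputs):
        bits = [(address >> i) & 1 for i in range(num_inputs)]
        table.append(int(func(*bits)) & 1)
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form an address, the output is the stored bit."""
    address = 0
    for i, bit in enumerate(inputs):
        address |= (bit & 1) << i
    return table[address]


# The same hardware structure, two different configurations.
xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)                # 4-input parity
majority3 = make_lut(lambda a, b, c: a + b + c >= 2, num_inputs=3)

print("xor4 configuration bits:     ", xor4)
print("majority3 configuration bits:", majority3)
print("xor4(1,0,1,1) =", lut_read(xor4, 1, 0, 1, 1))
```

Changing the function only changes the stored bits, not the structure. This is why an SRAM-based FPGA can be reprogrammed by loading a new configuration.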
5:40 Memory elements in the Cyclone V

(slide 12)

Besides these logic elements, the Cyclone V can implement memory modules: RAM modules that accept an address. We will discuss the instantiation of these later.

The Cyclone V A5 has:
  446 blocks of 'M10K' modules, configurable 10 kilobit RAM modules
  679 blocks of MLAB modules, configurable 640 bit modules built from logic elements

In total, the Cyclone V A5 chip has 4,884 Kilobit of on-chip memory.

5:45 DSP elements in Cyclone V

(slide 13)

Multipliers are extensively used in signal processing, communications, and accelerated scientific computation. FPGAs have dedicated multipliers on board. Why? Because they are much faster than the multipliers you would implement using configurable logic elements.

The DSP elements contain multiply-accumulation logic.

The Cyclone V A5 has 150 DSP modules. This means it can do up to 300 18*18 bit multiplies PER CLOCK CYCLE.

5:50 ASIC Design Flow

(slide 14)

The semicustom design flow goes through the following steps:

- Design Capture:
  - writing Verilog code (RTL)
  - integrate hard macros as black-box modules in RTL (i.e. we write a functional equivalent in RTL, but in the actual chip implementation, we will use a hard macro)
- Front-end Design:
  - select design constraints
    - tech library
    - target clock speed
    - optimizations such as pipelining etc
  - perform logic synthesis
    - result is a netlist of cells in specified technology
- Back-end Design:
  - Floorplan: decide on the overall organization of the chip
    - choose chip size
    - choose IO pin locations
    - decide on power network architecture
  - Placement
    - place cells within core area
  - Routing
    - clock tree synthesis
    - build interconnections between cells
- Simulation/Verification is extensively used during the design
  - to verify functionality
  - to verify if the technology mapped result is still functional

5:50. FPGA Design Flow

(slide 15)

The FPGA design flow contains similar steps. However, there is no 'backend' in the ASIC sense.
The FPGA tools are concerned with finding the proper FPGA programming. The constraints of the design include:
- Desired speed
- I/O pin binding
- placement of specific functions (FPGAs can emulate 'hard macros' through selected configuration of cells)

Demo: investigate the backend results of place-and-route for the tonegen design

In Quartus
- Select Fitter
- Chip Planner
- Demonstrate LAB
- Demonstrate Logic Cell
- Demonstrate IO
- Demonstrate fan-out of interconnect

6:00 ASIC vs FPGA

Finally, let us pit ASIC against FPGA.

(slide 17)

The following are the design criteria:

NRE            = Non Recurring Engineering Cost
               = Cost to design the very first circuit
               = Engineering Cost + Prototype Manufacturing Cost
Unit Cost      = Cost per instance of the design
Power          = Power consumption of the design (assuming all technologies implement the same function)
Time to Market = time from conception to product for sale
Performance    = Maximum speed available in this technology

Cost to the manufacturer = NRE + number_of_units_sold * Unit Cost

ASIC and FPGA have a crossover point: an ASIC has a higher NRE but a lower unit cost than an FPGA. For very high volume, ASIC will always be cheaper than FPGA.
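The crossover point follows directly from the cost formula above. The Python sketch below sets the two total costs equal and solves for the number of units. The dollar figures are invented placeholders chosen only to make the arithmetic easy to follow. They are not real ASIC or FPGA prices, and the function names are hypothetical.

```python
import math


def total_cost(nre, unit_cost, units):
    """Cost to the manufacturer = NRE + number_of_units_sold * Unit Cost."""
    return nre + units * unit_cost


def crossover_units(nre_asic, unit_asic, nre_fpga, unit_fpga):
    """Smallest volume at which the ASIC is no more expensive than the FPGA.

    Returns None if the ASIC unit cost is not lower than the FPGA unit cost,
    in which case a higher ASIC NRE is never recovered.
    """
    if unit_asic >= unit_fpga:
        return None
    if nre_asic <= nre_fpga:
        return 0
    # Solve nre_asic + n * unit_asic = nre_fpga + n * unit_fpga for n.
    return math.ceil((nre_asic - nre_fpga) / (unit_fpga - unit_asic))


# Placeholder numbers (assumptions, for illustration only).
NRE_ASIC, UNIT_ASIC = 2_000_000, 5    # high NRE, cheap units
NRE_FPGA, UNIT_FPGA = 50_000, 55      # low NRE, expensive units

n_star = crossover_units(NRE_ASIC, UNIT_ASIC, NRE_FPGA, UNIT_FPGA)
print("crossover volume:", n_star, "units")

for units in (1_000, n_star, 1_000_000):
    asic = total_cost(NRE_ASIC, UNIT_ASIC, units)
    fpga = total_cost(NRE_FPGA, UNIT_FPGA, units)
    cheaper = "ASIC" if asic <= fpga else "FPGA"
    print(f"{units:>9} units: ASIC {asic:>11,}  FPGA {fpga:>11,}  -> {cheaper}")
```

Below the crossover volume the FPGA's low NRE wins. Above it, the ASIC's lower unit cost dominates.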