Review 2 - Nios II, Looking back and forward

Introduction
Cyclic Redundancy Check
- The example code
- Running the example
Study Guide

Introduction

Over the past few lectures we discussed primarily the Nios II processor and its use in the context of System-on-Chip architecture. We’ll look back at the design of Nios II based systems by walking through the design of a complete Platform-based design. The example is taken from the Nios II Custom Instruction Guide and involves the design of a Cyclic Redundancy Check module integrated as a custom instruction.

Cyclic Redundancy Check

The CRC is nice example of optimization of hardware/software optimization. The idea of a CRC is similar to that of a checksum bit. With a CRC you compute a checksum code from a message, a stream of bytes. Then, at a later point, you re-compute the checksum over the message and verify that you have obtained the same checksum. If that is the case, then the message is unchanged.

But the ability of a checksum to detect errors is limited. In the case if a single checksum bit, the checksum bit is obtained by XOR-ing all message bits together. Therefore any even number of bitflips in the message will go undetected. Indeed, with a single checksum bit, half of the messages must have the same checksum. Since an error may map a messsage into an abitrary other message, this means that a single-bit checksum can miss half of the possible errors. The probability of failing to detect an error with a single checksum bit is 1/2.

A CRC does better. A CRC computes a checksum over N bits, such that the probability of failure is 1/(1 << N). CRCs are extensively used in storage and transmission equipment. The CRC that we will implement is CRC-32, a standardized version of CRC which computes a 32-bit checksum. It is the same CRC as the one used to protect Ethernet data frames. An in-depth discussion of CRC is beyond the scope of this lecture, and we only touch upon the main ideas. Refer to the introduction by R. Williams if you want to learn more.

The following is the gist of CRC computation. CRC checksums are computed in Galois Field arithmethic. A message, a string of bits, is represented as the coefficients of a Galois Field polynomial. For example, the single-byte message 0xC7 = 11000111 is represented as the polynomial x^7 + x^6 + x^2 + x + 1. The CRC is the remainder obtained by dividing a message polynomial by the CRC polynomial. For CRC-32, this is a polynomial of 32-nd degree. The following illustrates the division process of the sample message by a shorter CRC polynomial, x^3 + x + 1. This CRC polynomial will generate a 3-bit checksum.

The process starts by padding three zero checksum bits to the message, and then apply a long-tail division (in Galois Field arithmetic) to obtain the remainder.

message to transmit: 11000111
initial checksum:    000

m(x)    11000111 000
crc     1011
        -------- ---
        01110111 000
crc      1011
        -------- ---
        00101111 000
crc       1011
        -------- ---
        00000011 000
crc           10 11
        -------- ---
        00000001 110
crc            1 011
        -------- ---
r(x)    00000000 101

The remainder is x^2 + 1 which corresponds to the checksum 101.

To verify the checksum for a received message, simply pad the checksum to the message and repeat this process. The resulting remainder should now by 000.

received message:  11000111
received checksum: 101

t(x)    11000111 101
crc     1011
        -------- ---
        01110111 101
crc      1011
        -------- ---
        00101111 101
crc       1011
        -------- ---
        00000011 101
crc           10 11
        -------- ---
        00000001 011
crc            1 011
        -------- ---
r(x)'   00000000 000 - message is correct

This process appears slow because of its bit-serial nature. The standard approach is to pre-compute partial checksums over longer groups of bits, and then use a lookup table to store and apply these partial checksums.

For example, consider a two-bit substring, which can have four values, leading to four partial checksums:

  00  -> checksum 000
  01  -> checksum 011 
  10  -> checksum 110
  11  -> checksum 101

m(x)  01 000     
crc    1 011   
      ------  
r(x)  00 011 

m(x)  10 000    
crc   10 11
      ------
      00 110

m(x)  11 000
crc   10 11
      ------
      01 110
crc    1 011
      ------
r(x)  00 101

With this table, we can compute the checksum in steps of two bits in an original, longer message.

message to transmit: 11000111
initial checksum:    000

m(x)    11000111 000
rem(11)   101        
        -------- ---
          101111 000
rem(10)     110
        -------- ---
            0011 000
rem(00)       00 0           
        -------- ---
              11 000
rem(11)          101                                  

The example custom instruction design uses a CRC-32 (32-bit checksum) and uses an 8-bit lookup table to accelerate the software.

Another approach to accelerate CRC computation is to ‘unroll the bitserial computation’ and compute all of the bits in a single clock cycle this is the strategy used by the CRC computation.

The example code

The example-nios-crc repository contains three different implementations of the CRC-32 design. The first is a slow, bit-serial implementation in software. The second is a faster software implementation that uses the table-lookup method. The third is a custom-instruction design that uses an unrolled CRC computation that evaluates the CRC of up to 32 message bits in one clock cycle.

You can find these implementations in the repository https://github.com/vt-ece4530-f19/example-nios-crc.git. For example, the following is the slow, bit-serial software implementation. A special feature of this code is the ability to ‘reflect’ the data. This is a compatability feature with some CRC implementations, that read the bits in a byte from ‘right to left’ instead of from ‘left to right’ as in our example. When this is the case, we need to reflect the bits in a byte before processing them in a software CRC.

crc crcSlow(unsigned char const message[], int nBytes) {
  crc            remainder = INITIAL_REMAINDER;
  int            byte;
  unsigned char  bit;

    for (byte = 0; byte < nBytes; ++byte) {
        remainder ^= (REFLECT_DATA(message[byte]) << (WIDTH - 8));
        for (bit = 8; bit > 0; --bit) {
            if (remainder & TOPBIT)  {
                remainder = (remainder << 1) ^ POLYNOMIAL;
            } else {
                remainder = (remainder << 1);
            }
        }
    }

    return (REFLECT_REMAINDER(remainder) ^ FINAL_XOR_VALUE);

}   /* crcSlow() */

The accelerated software version, which uses the custom instruction, feeds in up to 32 bits at a time of the message. The design uses an extended custom-instruction, with the n-parameter indicating the length of the input message and the length of the output checksum.

Running the example

The following is a walkthrough of the commands to use to synthesize the example. You’ll need a Nios II Command shell.

# download the example code
git clone https://github.com/vt-ece4530-f19/example-nios-crc.git

# optionally, open the code in quartus
# optionally, open the platform in Platform Designer
# synthesize the code
quartus_sh --flow compile examplenioscrc

# download the bitstream
nios2-configure-sof -d 2 examplenioscrc.sof

# directly create bsp from command line
nios2-bsp hal software/hal . \
              platformnioscrc \
              --set hal.sys_clk_timer none \
              --set hal.timestamp_timer "timer_0" \
              --set hal.enable_small_c_library true 

# directly create application makefile from command line
nios2-app-generate-makefile.exe \
              --bsp-dir software/hal \
              --src-files=software/ci_crc.c  \
              --src-files=software/crc.c  \
              --src-files=software/crc_main.c \
              --elf-name crc.elf \
              --set APP_INCLUDE_DIRS \".\"

make

# open a second nios command shell and run nios2-terminal
# then run the application

nios2-download crc.elf --go

Study Guide

The midterm covers Lecture 9 to 15 included. Note that Lecture 14 was a discussion on Homework 5. That lecture did not introduce new materials.

The lecture materials cover lecture notes, the design examples, and the handouts discussed in class. You do not have to memorize manuals. You should be able to reverse engineer the meaning of a Nios II assembly program given a corresponding C program. You do not have to study MSP-430 specific materials, but you have to understand design concepts that are common between MSP-430 and Nios-II. For example, you have to understand what a linker description file does, and you have to be able to analyze and navigate a memory map of an SoC.

The questions on the midterm are design-oriented, which means that they will test your knowledge in the context of short design problems. Hence, insight into the course content is much more important than memorizing the content.

Lecture 9 - Hardware-Software Synchronization
- The meaning of a synchronization point
- The meaning of master and slave in synchronization
- Using Binary Semaphores to achieve synchronization
- Single-semaphore and dual-semphore synchronization
- One-way and two-way handshake
- Handshake implementation with software and memory-mapped registers
- Handshake implementation with hardware and finite-state machines
Lecture 10 - Nios II Core
- Nios II overall architecture
- Function and meaning of tightly-coupled memory
- Function and meaning of custom-instruction logic
- Link between assembly programs and C programs
Lecture 11 - On-chip Bus
- Basic elements in a System-on-Chip architecture
- Bus elements: master, slave, bridge, memory map, bus segments,
- Hardware customization in the SoC and their hardware/software interfaces
- Simply bus transations
- Bus arbitration
- Platform Designer flow: Board support Package
- Example: example-nios-timer
Lecture 12 - On Chip Bus 2
- Bus Locking
- Test-and-set operation
- Bus Transaction Splitting and Bus Pipelining
- Burst Transfers
- (We did not discuss Bus Transfer Sizing)
- Avalon MM Bus
Lecture 13 - Nios II Custom Instructions
- Combinational, Multi-cycle, Extended custom instructions
- Opcode mapping of custom instructions
- Byte-merge combinational instruction example
- Multiply-accumulate extended instruction example
- Signed multiplication
- Integration of custom instructions in the Platform Design Flow
- Example: example-nios-ci
Lecture 15 - Synchronous Dataflow
- Synchronous Dataflow: Actor, Token, Queue
- Multi-rate Dataflow
- Periodic Admissible Schedule
- Topology Matrix, Firing Vector
- Procedure to compute a Period Admissible Sequential Schedule