Introduction

We will temporarily leave the world of HPS and Linux to discuss a fundamental topic in hardware-software codesign. We will define control flow and data flow in the context of C programs and assembly programs. And we will show how the analysis of control flow and data flow of a program can reveal fundamental dependencies in your design. The analysis of control flow and data flow of a program leads to two data structures called a control flow graph and a data flow graph, respectively.

Control Edges and Data Edges

Control edges and data edges are two orthogonal representations of a C or assembly program. They describe different aspects. Control edges describe the sequence of instructions and/or statements. Data edges describe the flow of data troughout the program.

All control edges together form a graph called the Control Flow Graph (CFG). All data edges together form a graph called the Data Flow Graph (DFG). Neither control edges nor data edges are a complete representation of a C (or assembly) program. In particular, you need both to fully capture the meaning of the C (or assembly) program.

A C (or assembly) program = CFG + DFG

We next define control edges and data edges. The two concepts are closely related but not identical.

A Data Edge is a relation between two operations in a program, such that data produced by one operation is consumed by the other.

A Control Edge is a relation between two operations in a program, such that one operation will be executed after the other.

Why do we care about distinguishing control edges and data edges? Both of these describe different aspects of a program.

  • You can think of a Data Edge as a fundamental constraint of the program’s meaning. If you execute two operations on a manner that would violate a data edge relationship, then you change the meaning of the program.

  • A Control Edge on the other hand is an implementation effect. The fact that two operations have to execute sequentially is not a constraint of the program’s meaning. Rather, it is an implementation issue. The two operations have to execute sequentially because there is not enough parallel hardware available to run these operations in parallel.

It’s clear where you would go with such a concept. If you want to build hardware from a C program, then you have to understand the DFG in detail. The DFG combines all operations together and links them through the data they produce and/or consume. A CFG, on the other hand, is not a fundamental property. For a parallel implementation of a sequential program, we would very much like to get rid of the sequential features, and thus eliminate the CFG.

Construction of the CFG and the DFG

We now discuss a systematic procedure to draw a CFG and a DFG for a given program. We start with a simple example in C. The following max function determines the largest of two inputs a and b.

int max(int a, int b) {
  if (a > b)
    r = a;
  else
    r = b;
  return r;

First, the draw the CFG and the DFG we have to decide on the nodes of the graph. We will use the easy approximation that each C statement corresponds to one node. Hence, we could annotate the program with the node numbers as follows. The source node is the node that provides the input, while the sink node is the node that carries the result away.

int max(int a, int b) { // 1 - source node
  if (a > b)            // 2
    r = a;              // 3
  else
    r = b;              // 4
  return r;             // 5 - sink node

Next, following the C semantics, we draw the control flow graph, by interconnecting the nodes that have a sequential dependency. For the max function we find the following control flow graph. Note how decision nodes are at the source of more than one control node.

                        // control edges
int max(int a, int b) { // 1 -> 2
  if (a > b)            // 2 -> 3 or 2 -> 4
    r = a;              // 3 -> 5
  else
    r = b;              // 4 -> 5
  return r;             // 5 - sink node

The data flow graph (DFG) must be derived after the control flow graph. The data flow graph has the same set of nodes as the control flow graph: there is one node per C statement. The data flow graph requires identification of the data dependencies between every node. To simplify this, we start by annotating, with each node, what variables are read and what variables are written.

                        //        a    b    r
int max(int a, int b) { // 1      W    W
  if (a > b)            // 2      R    R
    r = a;              // 3      R         W
  else
    r = b;              // 4           R    W
  return r;             // 5                R

Next, we draw each data dependency. A data dependency goes from a node that writes into a variable to another node that reads from the variable. To have a valid dependency, we must identify the correct ‘write’ node for each ‘read’ node. That is done as follows.

  1. Start with a node that reads from a variable. For example, node 3 in the example reads variable a. That read operation is the endpoint of a data dependency.

  2. Next, walk backward in the control flow graph until you find a node that writes the same variable. That is the starting point of a data dependency. For example, going backward from node 3, we visit node 2, and then node 1. Only node 1 writes a. Therefore, the data dependency for a goes from 1 to 3.

The search for data dependencies has to be performed breadth-first and consider every control path that leads to the endpoint of the data dependency. For example, the read of r in node 5 could be done by node 4 or else by node 3. This is because there is a control flow from node 3 to node 5 5 as well as from node 4 to node 5.

                        //           data edges
                        //       a       b      r
int max(int a, int b) { // 1     1->2   1->2
                        //       1->3   1->4  
  if (a > b)            // 2           
    r = a;              // 3                   3->5
  else
    r = b;              // 4                   4->5
  return r;             // 5               

The same analysis can also be done at assembly level. Here is the max function in Nios assembly. Recall that this function follows the Nios ABI: the input arguments are in r4 and r5, while the output argument is in r2.

                     // CFG             r2  r4  r5
max(r4, r5):         // 1->2                W   W
  mov r2, r4         // 2->3            W   R
  bge r4, r5, .L2    // 3->4, 3->5          R   R
  mov r2, r5         // 4->5            W       R
.L2: 
  ret (r2)           // 5               R

Using the CFG and the read/write operations in registers, we can derive the DFG.

                     //           data edges
                     //        r2     r4      r5
max(r4, r5):         // 1            1->2    1->3
                     //              1->3    1->4
  mov r2, r4         // 2     2->5
  bge r4, r5, .L2    // 3     
  mov r2, r5         // 4     2->5
.L2: 
  ret (r2)           // 5         

Dealing with loops

DFG and CFG contruction is interesting when the program becomes more complicated. For example, when there are loops, backtracing in the control flow graph can lead to unexpected data dependencies. This is an area where data flow analysis really shines, since it can alert you to read/write dependencies that would go unnoticed otherwise. In the second example we consider a GCD algorithm.

                            //        control edges
unsigned gcd(unsigned a,    // 1      1->2
             unsigned b) {  

  while (a != b)            // 2      2->3, 2->6
    if (b > a)              // 3      3->4, 3->5
      b = b - a;            // 4      4->2
    else
      a = a - b;            // 5      5->2

  return a;                 // 6
}

To determine the DFG, first collect the read/write operations for each node.

                            //        a    b
unsigned gcd(unsigned a,    // 1      W    W
             unsigned b) {  

  while (a != b)            // 2      R    R
    if (b > a)              // 3      R    R
      b = b - a;            // 4      R    W,R
    else
      a = a - b;            // 5      W,R  R

  return a;                 // 6      R
}

Next, we derive the data flow edges, per variable. We start with ‘a’. Every node that reads a is the end point of a data dependency. The starting point is a corresponding write node that can be directly reached through a reverse control dependency. These dependencies go across loop boundaries, and can be non-trivial.

Consider, for example, the read of a in node 5. There are two possible nodes that write a: node 1 and node 5. To find which are try dependencies, follow the reverse path in the control graph. From 5->3->2->1, we see that node 1 has a data dependency on node 5 for a. Next, from 5->3->2->5, we see that node 5 also has a dependency on itself. The following program shows all data dependencies.

                            //         data edges   
                            //         a      b
unsigned gcd(unsigned a,    // 1      1->2   1->2    
             unsigned b) {            1->3   1->3
                                      1->4   1->4
                                      1->5   1->5
                                      1->6   1->6

  while (a != b)            // 2            
    if (b > a)              // 3            
      b = b - a;            // 4             4->4
                            //               4->3
                            //               4->2
                            //               4->5
    else
      a = a - b;            // 5      5->5
                            //        5->6
                            //        5->2
                            //        5->3    
                            //        5->4

  return a;                 // 6      
}

Harder cases

Control Flow Analysis and Data Flow Analysis work well if we can statically observe every dependency from the source code. That is not always the case.

  • Pointers obscure data dependencies. Pointers can alias other variables so that a read/write from a dereferenced pointer really is a dependency into another variable (while at the same time being a read-operation for the pointer).
              //        CFG        a   b   c    DFG(a)  DFG(b)
  a = 5;      //  1     1->2       W
  b = &a;     //  2     2->3           W
  c = *b;     //  3     3->exit    R   R   W     1->3    2->3
  • Arrays make data dependencies more complex, since a single array reference can be any of the data members of the array. There are two ways to look at a reference such as a[i]. The first is to look at a[i] as a scalar variable. This requires knowledge of i - which may be hard, when i depends on the runtime state of the program. The second is to look at the array a as an atomic entity, and consider every access into the array a dependency on the complete array. The dependencies identified in each case may be different, as the following example shows.
                //       CFG        a[3]   i    DFG(a)  DFG(a[3])
   i    = 3;    //  1    1->2              W   
   a[i] = 1;    //  2    2->3        W     R     
   a[0] = 0;    //  3    3->4
   b    = a[3]; //  4    4->exit     R           2->4    1->4

Our analysis technique is not quite advanced enough to handle arrays in a general manner. In the following, we will stick to the scalar case.

Other constructions

for-loop

for (a=0; a<3; a++) {
   b[a] = 2;
}

We break down the compound for loop as follows:

                        //        CFG       a      DFG(a)
for (a=0;               // 1      1->2      W      
          a<3;          // 2      2->4      R      1->2, 3->2
                                  2->exit
               a++) {   // 3      3->2      W,R    1->3, 3->3
   b = b + a;           // 4      4->3      R      1->4, 3->4
}

case

                        //        CFG        a     DFG(a)
switch(a) {             // 1      1->2       R     initial->1
                                  1->3
                                  1->4
  case '1': b = a;      // 2      2->exit    R     initial->2
            break;
  case '2': c = a;      // 3      3->exit    R     init->3
            break; 
  default : b = 0;      // 4      4->exit
}

do-while loop

                            //      CFG        a      DFG(a)
 a = 1;                     // 1    1->2       W
 do {                        
    a = a - 1;              // 2    2->3       R,W    1->2, 2->2
 } while (a != 0);          // 3    3->2       R      2->3
                            //      3->exit

Application: Converting C to Hardware

In recent years, high level synthesis has rapidly grown in interest and importance. The need to design hardware faster, while stil delivering performance speedup due to parallellization, has made the use of automatic synthesis tools essential.

The following is an snippet of input code for Vivado High-level Synthesis, a compiler for Xilinx FPGA’s. Several other examples can be found on the Xilinx Github repository.

void diff_sq_acc(din_t a[N], din_t b[N], dout_t *dout) {
  int i;
  int acc= 0;
  int a_reg, b_reg, sub, sub2;

  for(i=0; i<N; i++)  {
    #pragma HLS PIPELINE II=1
    a_reg  = a[i];
    b_reg  = b[i];
    sub = a_reg - b_reg;
    sub2 = sub*sub;
    acc += sub2;
  }

  *dout = acc;
}

Dataflow and controlflow analysis is one of the steps done in the conversion of a C program into a parallel implementation of it.

We will illustrate this by converting a small C function into hardware using a very (very) simple conversion approach, based on the following rules.

  1. Each single C statement will execute in a clock cycle
  2. Each expression (operation) becomes a datapath operation
  3. Each variable becomes a register
  4. The FSM controller is extracted from the CFG

The function computes a modulo-179 operation. This particular design is interesting as it can achieve a mod-179 by only using division and multiplication with powers of two. In hardware, such operations can be implemented efficiently using shift operations and bitwise and operations.

unsigned modk(unsigned x, unsigned k) { return (x & ((1 << k) - 1)); }
unsigned divk(unsigned x, unsigned k) { return (x >> k); }

unsigned modulo(unsigned x) {
  unsigned r, q, k, a, m, z;
  m = 0xB3; // 179
  k = 8;
  a = (1 << k) - m;
  r = modk(x, k);
  q = divk(x, k);
  do {
    do {
      r = r + modk(q * a, k);
      q = divk(q * a, k);
    } while (q != 0);
    q = divk(r, k);
    r = modk(r, k);
  } while (q != 0);
  z = (r >= m) ? r - m : r;
  return z;
}

We will first simplify this program and substitute some of the variable that hold constants. Our conversion rules state that variables will become registers, and there’s no point storing a constant as a register!

unsigned modulo(unsigned x) {
  unsigned r, q, z;
  r = x & 255;
  q = x >> 8;
  do {
    do {
      r = r + (q * 77) & 255;
      q = (q * 77) >> 8;
    } while (q != 0);
    q = r >> 8;
    r = r & 255;
  } while (q != 0);
  z = (r >= m) ? r - m : r;
  return z;
}

Next, we extract the CFG and the DFG.

                               //      CFG        x   r   q   z
unsigned modulo(unsigned x) {  //  1   1->2       W
  unsigned r, q, z;
  r = x & 255;                 //  2   2->3       R   W
  q = x >> 8;                  //  3   3->4       R       W 
  do {
    do {
      r = r + (q * 77) & 255;  //  4   4->5           W,R R
      q = (q * 77) >> 8;       //  5   5->6               W,R
    } while (q != 0);          //  6   6->7               R
                                       6->4
    q = r >> 8;                //  7   7->8           R   W
    r = r & 255;               //  8   8->9           W,R
  } while (q != 0);            //  9   9->10              R
                                       9->4
  z = (r >= m) ? r - m : r;    // 10   10->11         R       W
  return z;                    // 11   11->exit
}
                               //      CFG        DFG(r)   DFG(q)
unsigned modulo(unsigned x) {  //  1   1->2       
  unsigned r, q, z;
  r = x & 255;                 //  2   2->3          
  q = x >> 8;                  //  3   3->4               
  do {
    do {
      r = r + (q * 77) & 255;  //  4   4->5       2->4     3->4    
                               //                 8->4     5->4
      q = (q * 77) >> 8;       //  5   5->6                3->5
                               //                          5->5
    } while (q != 0);          //  6   6->7                5->6
                                       6->4
    q = r >> 8;                //  7   7->8       4->7
    r = r & 255;               //  8   8->9       4->8    
  } while (q != 0);            //  9   9->10               7->9    
                                       9->4
  z = (r >= m) ? r - m : r;    // 10   10->11     8->10    
  return z;                    // 11   11->exit   10->11
}

We can now implement the FSMD. If we apply our conversion rules, we end up with an 11-cycle implementation that uses three registers r,q and z.

However, there is a trivial optimization possible, which we will implement as well. The rule is the following:

Two successive states in the CFG can be merged into one, if they have no data dependencies between them, and if the resulting cluster has a unique entry point and exit point.

It’s easy to see why that is the case. If two successive states have no data dependencies, that means that they write into different registers, so they can run in the same clock cycle. Second, if the cluster has a unique entry point and exit point, then the sucessive states will always run jointly.

The following annotation reflects this optimization. Notice also how we are now able to extract the FSM states and corresponding datapath expressions.

                               //       FSM          Datapath
unsigned modulo(unsigned x) {  //  1          
  unsigned r, q, z;  
  
  r = x & 255;                 //  2    LOAD         r <= x&255  
  q = x >> 8;                  //  3                 q <= x>>8
  
  -----------------------------------  ----------  ------------
  do {  
    do {  
      r = r + (q * 77) & 255;  //  4    COMPUTE1     r <= r + mul
                               //                  
      q = (q * 77) >> 8;       //  5                 q <= mul>>8
                               //        
  
  -----------------------------------  ----------  ------------
    } while (q != 0);          //  6    INNER        (q != 0)?
                                         
  
  -----------------------------------  ----------  ------------
    q = r >> 8;                //  7    COMPUTE2     q <= r>>8
    r = r & 255;               //  8                 r <= r&255
  
  -----------------------------------  ----------  ------------
  } while (q != 0);            //  9    OUTER        (q != 0)?
                                         
  
  -----------------------------------  ----------  ------------
  
  z = (r >= m) ? r - m : r;    // 10    ADJ          z = ...
                               //         
  
  return z;                    // 11     
}

Finally, we can write a Verilog implementation for this design. This transformation is mechanical; all of the conversion work and analysis was done by identifying (and optimizing) control flow dependencies and data flow dependencies.

module mod179(
  input  wire        clk,
  input  wire        reset,
  input  wire [15:0] x,
    input  wire        start,
    output reg         done,
  output reg [7:0]   z);

   localparam [2:0] load     = 3'b000,
                    compute1 = 3'b001,
          inner    = 3'b010,
          compute2 = 3'b011,
          outer    = 3'b100,
          adj      = 3'b101;
   
   reg [15:0] q_reg, q_next;
   reg [15:0] r_reg, r_next;
   reg [ 2:0] state_reg, state_next;

   wire [15:0] mul;
   
   always @(posedge clk, posedge reset)
     begin
     q_reg     <= reset ? 16'b0 : q_next;
     r_reg     <= reset ? 16'b0 : r_next;
     state_reg <= reset ? load  : state_next;
   end

   assign mul = q_reg[7:0] * 8'd77;
   
   always @*
     begin
     q_next     = q_reg;
     r_next     = r_reg;
     state_next = state_reg;
     done       = 1'd0;
     z          = 8'd0;
     case (state_reg)

       load:
       if (start) begin
         q_next     = x[15:8];
       r_next     = x[7:0];
       state_next = compute1;
           end
       
     compute1: 
       begin
       r_next     = r_reg + mul[7:0];
       q_next     = mul[15:8];
       state_next = inner;
       end
       
     inner:
       if (q_reg != 16'd0)
         state_next = compute1;
       else
         state_next = compute2;
       
     compute2:
       begin
       q_next     = r_reg[15:8];
       r_next     = r_reg[7:0];
       state_next = outer;
       end
       
     outer:
       if (q_reg != 16'd0)
         state_next = compute1;
       else
         state_next = adj;
       
     adj:
       begin
       done = 1'd1;
       z = (r_reg >= 8'd179) ? (r_reg - 8'd179) : r_reg;
       state_next = load;
       end
       
     default:
       state_next = load;
       
     endcase
   end

endmodule

Conclusions

The analysis of dataflow and controlflow dependencies is an important element in building insight into the behavior of a design specification. Understanding the data flow is crucial to identify opportunities for parallel execution. Understanding the control flow helps to construct a hardware controller that drives and FSMD.