Introduction

This week we’re discussing several important and useful techniques that will help you in the codesign challenge. Today, we’ll cover debugging and profiling. On Thursday, we’ll cover basic ideas in performance optimization and the identification of performance bottlenecks.

Configuring and running example-nios-sdram

The following figure illustrates the example platform that we will use in this (and the next few) lectures. It has the following features.

  • Nios 2/e processor at 50MHz system clock.
  • On-chip 64KByte RAM holding the text segment of the program
  • SDRAM controller to access an off-chip 64Mbyte memory (16-bit wide) running at 100MHz system clock. The SDRAM holds static and runtime data of the C program, including rwdata, rodata, heap, stack, and bss.
  • Three PIO ports to drive the HEX displays, red LEDs and keys on the DE1-SoC kit, respectively.
  • Two 32-bit interval timers for system-timing and performance timing.
  • A PLL to create three different clock regions: 50MHz system clock, 100 MHz SDRAM controller clock, and 100MHz skew-adjusted SDRAM clock.

Figure: Nios 2 SDRAM Test Platform


To compile and set up the example on your DE1-SoC kit, proceed as follows.

  • Download the repository on your laptop.
git clone https://github.com/vt-ece4530-f19/example-nios-sdram
  • Compile the bitstream. You can, optionally, load the design in Quartus by opening exampleniossdram.qpf. In addition, you can load the platform in Platform Designer by opening platformniossdram. This enables you to configure design components, add new components, adjust the memory map, and so forth. To compile the bitstream from the command line (Nios II Command Shell), simply use the following.
quartus_sh --flow compile exampleniossdram
  • Once you have the bitstream (sof), download it to the DE1-SoC board.
nios2-configure-sof -d 2 exampleniossdram.sof
  • Next, create a board support package. Run the Nios2 BSP editor with nios2-bsp-editor. Open platformniossdram.sopcinfo and select the following settings.

    • sys_clk_timer is connected to timer_0
    • timestamp_timer is connected to timer_1
    • Under Advanced/hal/linker, turn off enable_alt_load. This prevents code and data from being relocated at startup.

The load facility (which we don’t want to use) works as follows. When enable_alt_load is selected, the loader will first place all code segments in on-chip RAM, and then copy the data segments (rwdata, rodata) over to the SDRAM. That makes sense if the program is stored in a non-volatile memory, and the loader is part of the processor boot sequence. In our case, we don’t use a non-volatile memory nor a bootloader; we use nios2-download. Since the SDRAM is much bigger than the on-chip RAM, we want the loader to directly copy the data segments into SDRAM.

Figure: Disable alt_load facility


  • Exit the BSP editor, and compile the BSP code.
cd software
nios2-bsp-generate-files  \
    --settings=hal_bsp/settings.bsp  \
    --bsp-dir=hal_bsp
cd hal_bsp
make
  • Compile a sample application (there are several examples in the software subdirectory). The spigot algorithm computes digits of pi. Increasing n, the number of digits requested, easily grows it into a program with high computational and storage complexity.
cd spigot
nios2-app-generate-makefile \
    --bsp-dir ../hal_bsp \
    --elf-name main.elf \
    --src-files spigot.c
make
  • To run the application, open a Nios 2 terminal. Next, download the application to the Nios.
# in a first Nios 2 Command Shell
nios2-terminal

# in a second Nios 2 Command Shell
nios2-download main.elf --go

Debugging

Software contains bugs. To find these bugs, we wish to observe software closely during operation. In hardware/software codesign, we have used (and will use) a wide array of techniques to observe the software in action. Let’s enumerate our arsenal of debugging tools.

  • One of the most basic debugging techniques uses blinking LEDs and/or numerical HEX displays. Many of the design examples provided for your board, including the example we’ll discuss today, use a ‘heartbeat’ LED. On your DE1-SoC board, such a heartbeat LED shows that the board is powered, that the clock is correctly configured, and that a bitstream was loaded on the FPGA. As an example, the following heartbeat signal blinks two LEDs at a rate of approximately 3 Hz (bit 23 of a counter clocked at 50 MHz toggles at 50 MHz / 2^24). In addition, the two LEDs encode the state of a ‘locked’ signal: they blink in unison when locked (locked == 1'b1) and in an alternating fashion when unlocked (locked == 1'b0).
   reg [23:0] heartbeat;

   // heartbeat indicator
   always @(posedge CLOCK_50, negedge KEY[0])
     if (KEY[0] == 1'b0)
       heartbeat <= 24'b0;
     else
       heartbeat <= heartbeat + 1'b1;
   assign LEDR[9] =  heartbeat[23];
   assign LEDR[8] = ~heartbeat[23] ^ locked;
  • A second, very common debugging technique uses printf of internal data structures and events during program execution. On an embedded board, printf needs to be implemented through a UART. In our Nios-based designs, we used a JTAG-UART.

  • A third debugging technique is to use a software debugger. A software debugger is a tool that provides direct control over the execution of a program. It enables a designer to load, start and stop a program, to observe intermediate values, and to watch for certain events or stop the program at a precise point during its execution. In contrast to the printf technique, using a debugger does not require instrumenting the program with extra printf function calls. On the downside, debugging with a software debugger removes the real-time characteristics of the program. For example, real-time interrupt service routines are hard to handle in a debugger, since stopping the program also means that real-time interrupts are no longer served. We will discuss debugging with GNU gdb in further detail.

  • A fourth debugging technique is to use simulation. This provides even greater visibility of the system, at the cost of requiring a simulation model for the system under development. On the other hand, using a simulation model, time itself can be simulated so that we can closely monitor real-time behavior. Building a simulation model for a complex system (such as DE1-SoC) is hard. Imagine the simulation of a design for the DE1-SoC board using Modelsim. We need Verilog to simulate the design on the FPGA. In addition, we also need Verilog for any of the chips on the board that we want to include in the simulation. And we need to design a testbench to exercise the simulation.

  • A fifth technique, which can be used in parallel with a software debugger, is to use a hardware logic analyzer. The logic analyzer provides detailed information of the state of hardware signals. In FPGA design, the logic analyzer can be compiled into the design-under-test (DUT). In contrast to a software debugger, a hardware logic analyzer captures hardware activities in full detail, and at the granularity of a single clock cycle. For example, a logic analyzer is good to study the internal hardware details of a coprocessor.

Debugging on a host

We are interested in source-level debugging, which allows you to step through each line of your program as you run it. To prepare a program for source-level debugging, we have to compile debug information into the program. The debug information contains the (address) location of every function and every variable in memory. The debug information associates each line number in the source code with a location in the text segment.

In GCC, you can add debug information into your program with the -g flag. In the following examples, we will compile the spigot application on the host (x86 processor), using GCC for Cygwin. You may need to install GCC (and GDB, the debugger) for Cygwin if you don’t have it yet.

gcc -g spigot.c -o spigot.exe

After the application is compiled, you can load it in the GNU debugger, gdb. The debugger allows you to list source code, step through the program, set breakpoints, inspect memory and variables, and much more. The following shows typical commands.

gdb spigot.exe
  • The list command shows source code. Functions can be referenced by name.
(gdb) list main
11        for (i=2; i<n-1; ++i)
12          printf("%04d", pi[i]);
13        printf("\n");
14      }
15
16      int main( ) {
17        int n = 50;  /* number of pi digits */
18        unsigned short *pi = (unsigned short*) malloc(n * sizeof(unsigned short));
19        div_t d;
20        int i, j, t;
  • The b command sets a breakpoint at a line number or a function name. When the execution of a program reaches a breakpoint, execution is halted and control is returned to the debugger. You can list the active breakpoints using the info command. Breakpoints can be deleted with the delete command.
(gdb) b main
Breakpoint 1 at 0x4011f8: file spigot.c, line 17.
(gdb) b 22
Breakpoint 2 at 0x401212: file spigot.c, line 22.
(gdb) info b
Num     Type           Disp Enb Address    What
1       breakpoint     keep y   0x004011f8 in main at spigot.c:17
2       breakpoint     keep y   0x00401212 in main at spigot.c:22
(gdb) delete 2
(gdb) info b
Num     Type           Disp Enb Address    What
1       breakpoint     keep y   0x004011f8 in main at spigot.c:17
  • The r command runs a program. You can add arguments (similar to the command line). The s command steps through a program and into a function. The n command steps through a program and over a function. The c command continues execution after a breakpoint was hit.
(gdb) r
Starting program: /home/ece4530f19/example-nios-sdram/software/spigot/spigot.exe

Thread 1 "a" hit Breakpoint 1, main () at spigot.c:17
17        int n = 50;  /* number of pi digits */
(gdb) s
18        unsigned short *pi = (unsigned short*) malloc(n * sizeof(unsigned short));
(gdb) c
Continuing.
3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294
[Inferior 1 (process 972) exited normally]
  • During execution, you can print variables and expressions with the p command. You can dump the contents of memory with the x command.
(gdb) p j
$9 = 49
(gdb) p &j
$10 = (int *) 0x62cbe8
(gdb) x/4b 0x62cbe8
(gdb) x/4w &j
0x62cbe8:       49      664     -2147174212     6474800
  • There is a text-mode user interface (TUI) that helps you to look at the source code (or assembly code) while stepping through the code. You can exit the TUI using ctrl-x ctrl-A.
(gdb) layout split
   ┌──spigot.c─────────────────────────────────────────────────────────────────────────────────┐
   │11        for (i=2; i<n-1; ++i)                                                            │
   │12          printf("%04d", pi[i]);                                                         │
   │13        printf("\n");                                                                    │
   │14      }                                                                                  │
   │15                                                                                         │
   │16      int main( ) {                                                                      │
B+>│17        int n = 50;  /* number of pi digits */                                           │
   │18        unsigned short *pi = (unsigned short*) malloc(n * sizeof(unsigned short));       │
   │19        div_t d;                                                                         │
   │20        int i, j, t;                                                                     │
   │21                                                                                         │
   │22        memset(pi, 0, n*sizeof(unsigned short));                                         │
   │23        pi[1]=4;                                                                         │
   │24                                                                                         │
   ┌───────────────────────────────────────────────────────────────────────────────────────────┐
   │0x4011ea <main>         push   %ebp                                                        │
   │0x4011eb <main+1>       mov    %esp,%ebp                                                   │
   │0x4011ed <main+3>       and    $0xfffffff0,%esp                                            │
   │0x4011f0 <main+6>       sub    $0x40,%esp                                                  │
   │0x4011f3 <main+9>       call   0x4013bc <__main>                                           │
B+>│0x4011f8 <main+14>      movl   $0x32,0x30(%esp)                                            │
   │0x401200 <main+22>      mov    0x30(%esp),%eax                                             │
   │0x401204 <main+26>      add    %eax,%eax                                                   │
   │0x401206 <main+28>      mov    %eax,(%esp)                                                 │
   │0x401209 <main+31>      call   0x4013cc <malloc>                                           │
   │0x40120e <main+36>      mov    %eax,0x2c(%esp)                                             │
   │0x401212 <main+40>      mov    0x30(%esp),%eax                                             │
   │0x401216 <main+44>      add    %eax,%eax                                                   │
   │0x401218 <main+46>      mov    %eax,0x8(%esp)                                              │
   └───────────────────────────────────────────────────────────────────────────────────────────┘
native Thread 7068.0x3d18 In: main                                           L17   PC: 0x4011f8
(gdb) 

The following table summarizes common gdb commands.

Command           Purpose
file <exe>        Selects file for debugging
b <function>      Sets a breakpoint at a function
b <line>          Sets a breakpoint at a line number
info b            Lists all breakpoints
r                 Starts the program
c                 Continues the program after a breakpoint was hit
p <expression>    Prints the value of a variable or expression
help <topic>      Displays help on topic
layout <mode>     Enables text-mode UI, with mode one of split, src, asm, regs

Debugging on a target

You can use gdb on your DE1-SoC board as well. In that case, the debugger application (gdb) runs on your PC while the program being debugged runs on a Nios on your DE1-SoC. This is called remote debugging. The JTAG connection provides an efficient and fast connection into the Nios processor, through which the debugger can start, stop and examine the processor remotely. However, the software environment has to be properly set up.

On the host, you have to run an application called nios2-gdb-server. This program connects to the Nios processor on the board. It also opens a network port and listens for a gdb client to connect on that port.

nios2-gdb-server --tcpport auto --tcppersist

Using cable "DE-SoC [USB-1]", device 2, instance 0x00
Processor is already paused
Listening on port 52464 for connection from GDB:         

Next, you can run the nios2-elf-gdb in the directory where you have compiled the software application for the Nios.

nios2-elf-gdb main.elf

You then have to connect the debugger to the target, through the nios2-gdb-server application. This is done using the target command of gdb:

(gdb) target remote :52464
Remote debugging using :52464
0x00000000 in ?? ()

The port 52464 corresponds to the port announced by the gdb server. In the window where you run the nios2-gdb-server, you will see the connection being made.

$ nios2-gdb-server --tcpport auto --tcppersist
Using cable "DE-SoC [USB-1]", device 2, instance 0x00
Processor is already paused
Listening on port 52464 for connection from GDB: accepted

Before you can run the program, you have to download the executable to the target. That is done using the load command. In case you started nios2-elf-gdb without an argument, you can select the executable with the file command. It is worthwhile to take a close look at the output of the debugger, as it tells you exactly where it is copying the various sections of the code. For example, we see that rodata and rwdata are copied into the SDRAM (with starting address 0x0), while the other parts are copied into the on-chip memory (with starting address 0x04000000).

(gdb) load
Loading section .rodata, size 0x308 lma 0x0
Loading section .rwdata, size 0x1aec lma 0x308
Loading section .entry, size 0x20 lma 0x4000000
Loading section .exceptions, size 0x210 lma 0x4000020
Loading section .text, size 0xfb50 lma 0x4000230
Start address 0x4000230, load size 72564
Transfer rate: 185 KB/sec, 392 bytes/write.
(gdb)

You can now use the same debugging commands as under host debugging. For example:

(gdb) list
23        pi[1]=4;
24
25        for (i=(int)(3.322*4*n); i>0; --i) {
26
27          t = 0;
28          for (j=n-1; j>=0; --j) {
29            t += pi[j] * i;
30            pi[j] = t % 10000;
31            t /= 10000;
32          }
(gdb) b 29
Breakpoint 2 at 0x40003d4: file spigot.c, line 29.
(gdb) c
Continuing.

Breakpoint 1, main () at spigot.c:17
17        int n = 50;  /* number of pi digits */
(gdb) c
Continuing.

Breakpoint 2, main () at spigot.c:29
29            t += pi[j] * i;
(gdb) p i
$1 = 664
(gdb) p &i
$2 = (int *) 0x3ffffcc
(gdb) quit

When you exit the debugging session, the nios2-gdb-server connection is broken. Because of the --tcppersist command line option given earlier, the server immediately starts listening for a new connection:

$ nios2-gdb-server --tcpport auto --tcppersist
Using cable "DE-SoC [USB-1]", device 2, instance 0x00
Processor is already paused
Listening on port 52464 for connection from GDB: accepted
Exiting due to 'k' command from GDB
Leaving target processor paused
Processor is already paused
Listening on port 52464 for connection from GDB:

Performance Profiling

To understand where software spends its time, we have to use a profiler. You may recall from our earlier discussion on hardware acceleration (Lecture 8) that the innermost loop of a program is the most influential to the overall performance. This is because it contains the most frequently executed code.

A profiler is a tool that helps you find the ‘innermost loop’. Depending on the tool, a profiler can tell you how many times each function was executed, how many instructions were executed, how many times each variable was read and written, and so on.

We will look at one specific example, the GNU gprof profiler, which is a basic profiler that helps you understand how many times each function was executed, and how much each function’s execution time contributed to the overall program’s execution time (as a percentage).

Gprof requires an instrumentation step to your program. The instrumentation consists of two things:

  • First, every function call is instrumented with a call to a profiler function called mcount. This function will keep track of how many times each function is executed, as well as the sequence of functions. In other words, gprof remembers which function calls which function in your program. gprof then uses this to construct a ‘call graph’, a graph that represents the set of functions called by a parent function.

  • Next, during profiling, gprof will periodically sample the program counter (using a system timer interrupt). The sample frequency is not very high, from 100Hz to perhaps a few kHz. From these samples, gprof can determine, statistically, where the program spends most of its time. By combining the program counter with debug information, gprof can determine what function is executing. This leads to the self time of a function.

The profiling data, consisting of the sampled program counter and the call graph, can then be combined into call graph timing information which includes the self time of a function as well as the time spent in all of the callees of that function.

We’ll illustrate this with the following example, which computes a modulo-179 reduction in a deliberately convoluted manner.

#include <stdio.h>

unsigned modk(unsigned x, unsigned k) {
  return (x & ((1 << k) - 1));
}

unsigned divk(unsigned x, unsigned k) {
  return (x >> k);
}

unsigned modulo(unsigned x) {
  unsigned r, q, k, a, m, z;
  m = 0xB3; // 179
  k = 8;
  a = (1 << k) - m;
  r = modk(x, k);
  q = divk(x, k);
  do {
    do {
      r = r + modk(q * a, k);
      q = divk(q * a, k);
    } while (q != 0);
    q = divk(r, k);
    r = modk(r, k);
  } while (q != 0);
  z = (r >= m) ? r - m : r;
  return z;
}

unsigned prod(unsigned a) {
  unsigned i, p = 1;
  for (i=0; i<a; i++)
    p = modulo(p * i);
  return i;
}

int main(void) {
  unsigned i, j;
  unsigned extended_profile = 1;  /* 1: long run for profiling, 0: print results */

  if (extended_profile) {
    for (i=0; i<1000; i++)
      for (j = 1; j< 179; j++)
        prod(j);
  } else {
    for (j = 1; j< 179; j++)
      printf("%u %u\n", j, prod(j));
  }

  return 0;
}

Profiling on a host

To enable profiling, we have to compile the program with debug information enabled, as well as with profiling enabled. The -pg flag of gcc enables profiling, while the -g flag enables debug information.

gcc -pg -g mod179.c -o mod179

Next, you can run the program. As a side effect of the profiling instrumentation, the program writes the profiling data to a file gmon.out when it exits.

$ ./mod179.exe
$ ls
gmon.out  mod179.c  mod179.exe

To post-process the profiling data, use gprof. This displays two tables. The first is a sorted list of call data per function. The second table is a sorted list of call graph edges with profile data. Here is the first table.

$ gprof mod179.exe
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 40.79      0.31     0.31                             __fentry__
 30.26      0.54     0.23                             _mcount_private
 14.47      0.65     0.11 15931000     0.01     0.01  modulo
  6.58      0.70     0.05 47793000     0.00     0.00  modk
  5.26      0.74     0.04 47793000     0.00     0.00  divk
  1.32      0.75     0.01   178000     0.06     1.18  prod
  1.32      0.76     0.01                             main

The table indicates that, among the functions of the program itself, most of the execution time goes into the modulo function. The two higher-ranked functions, __fentry__ and _mcount_private, are part of the profiling framework and represent overhead. As you may suspect, profiling adds overhead to the execution time because of the instrumented mcount calls. Luckily, this overhead is reported separately by gprof.

Further, the difference between self and total is that self relates only to the function itself, while total includes the function and all its descendants. For example, the main function is called only once, but will account for all of the execution time. Most of that time is spent in functions called by main. Thus, the self time of main is negligible while the total time of main covers almost the entire program.

The second table produced by gprof is shown next. This table allows you to keep track of the paths. It shows precisely how you end up in each lower-level function by listing all of the functions called along that path. The table is ranked according to the contribution of the call subgraph to the total execution time. In addition, for each ranked function, the table lists the parents in the call graph, so that you know who is responsible for the function calls.

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 1.32% of 0.76 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     40.8    0.31    0.00                 __fentry__ [1]
-----------------------------------------------
                                                 <spontaneous>
[2]     30.3    0.23    0.00                 _mcount_private [2]
-----------------------------------------------
                                                 <spontaneous>
[3]     28.9    0.01    0.21                 main [3]
                0.01    0.20  178000/178000      prod [4]
-----------------------------------------------
                0.01    0.20  178000/178000      main [3]
[4]     27.6    0.01    0.20  178000         prod [4]
                0.11    0.09 15931000/15931000     modulo [5]
-----------------------------------------------
                0.11    0.09 15931000/15931000     prod [4]
[5]     26.3    0.11    0.09 15931000         modulo [5]
                0.05    0.00 47793000/47793000     modk [6]
                0.04    0.00 47793000/47793000     divk [7]
-----------------------------------------------
                0.05    0.00 47793000/47793000     modulo [5]
[6]      6.6    0.05    0.00 47793000         modk [6]
-----------------------------------------------
                0.04    0.00 47793000/47793000     modulo [5]
[7]      5.3    0.04    0.00 47793000         divk [7]
-----------------------------------------------

gprof is not perfect and has some caveats. The most important of these is that the timing data is collected in a statistical manner, by sampling the PC. If the profiled program is too short, you will end up with significant sampling error. Make sure that you profile a sufficiently long trace to minimize this error.

Further reading: gprof: a call graph execution profiler.

Profiling on a target

You can also use gprof on Nios2. In that case, you have to prepare a board support package that supports profiling. After running the program, you have to transfer the profiling information from the target to the host.

Figure: Enable profiling in BSP Editor


Enabling profiling in the BSP Editor and regenerating the BSP files creates a board support package that has profiling enabled. In addition, when you regenerate the application makefile, the application will be built with the profiling flags as well.

$ nios2-app-generate-makefile --bsp-dir ../hal_bsp --elf-name main.elf --src-files mod179.c
$ make
...
nios2-elf-gcc -xc -MP -MMD -c -I../hal_bsp//HAL/inc -I../hal_bsp/ -I../hal_bsp//drivers/inc  -pipe -D__hal__ -DALT_PROVIDE_GMON -DALT_NO_INSTRUCTION_EMULATION -DALT_SINGLE_THREADED    -O0 -g -Wall   -mno-hw-div -mno-hw-mul -mno-hw-mulx -pg -mgpopt=global  -o obj/default/mod179.o mod179.c
...
[main build complete]

You can run the program as usual, but you have to add the --write-gmon flag. Open a nios2-terminal and run the program as follows. Note that when the program exits, nios2-download uploads the gprof data from the target to the host.

$ nios2-download main.elf --write-gmon gmon.out --go
Using cable "DE-SoC [USB-1]", device 2, instance 0x00
Processor is already paused
Initializing CPU cache (if present)
OK
Downloaded 74KB in 0.1s
Verified OK
Running target program until exit
Uploaded GMON data: 6K in 0.0s
Leaving target processor paused

You can now consult the profiling data using nios2-elf-gprof. Here is the top portion of each table. You can verify that a noticeable amount of time is spent in compiler-intrinsic functions such as __mulsi3, __umoddi3, and so forth. This is because we are using a Nios II/e, which has no hardware multiplier or divider. Second, the Nios program executes significantly slower than the same program on the host (a 50MHz CPU that completes roughly 0.1 instructions per cycle versus a 2GHz CPU that completes more than one instruction per cycle!). Therefore, the table below shows the profile with the extended_profile flag set to 0.

$ nios2-elf-gprof.exe main.elf
Flat profile:

Each sample counts as 0.001 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 25.28      0.27     0.27    15931     0.02     0.04  modulo
 18.53      0.46     0.20    47793     0.00     0.00  modk
 14.31      0.62     0.15    47793     0.00     0.00  divk
 10.73      0.73     0.11                             alt_get_errno
  7.58      0.81     0.08      178     0.45     4.10  prod
  6.85      0.88     0.07        4    18.18    18.18  altera_avalon_jtag_uart_close
  3.25      0.92     0.03    51362     0.00     0.00  __mulsi3
  2.45      0.94     0.03      356     0.07     0.27  ___vfprintf_internal_r
  1.63      0.96     0.02      834     0.02     0.03  __udivdi3
  1.58      0.98     0.02      834     0.02     0.03  __umoddi3
  1.13      0.99     0.01      534     0.02     0.02  __sfvwrite_r
...
                     Call graph (explanation follows)


granularity: each sample hit covers 32 byte(s) for 0.09% of 1.06 seconds

index % time    self  children    called     name
                0.00    0.93       1/1           _start [2]
[1]     87.7    0.00    0.93       1         alt_main [1]
                0.01    0.84       1/1           main [3]
                0.00    0.06       1/1           exit [15]
                0.00    0.02       1/4           close [12]
                0.00    0.00       1/1           alt_io_redirect [46]
                0.00    0.00       1/1           atexit [73]
                0.00    0.00       1/1           _do_ctors [121]
                0.00    0.00       1/1           __register_exitproc [117]
                0.00    0.00       1/1           alt_irq_init [68]
                0.00    0.00       1/1           alt_sys_init [70]
-----------------------------------------------
                                                 <spontaneous>
[2]     87.7    0.00    0.93                 _start [2]
                0.00    0.93       1/1           alt_main [1]
                0.00    0.00       1/1           alt_load [69]
-----------------------------------------------
                0.01    0.84       1/1           alt_main [1]
[3]     80.7    0.01    0.84       1         main [3]
                0.08    0.65     178/178         prod [4]
                0.00    0.11     178/178         printf [9]
-----------------------------------------------
                0.08    0.65     178/178         main [3]
[4]     68.8    0.08    0.65     178         prod [4]
                0.27    0.37   15931/15931       modulo [5]
                0.01    0.00   15931/51362       __mulsi3 [19]

Conclusions

We have discussed two different tools, gdb and gprof, and their use on a host processor and on a target processor. Both tools are a significant help to software developers. gdb helps to track down bugs. gprof helps to identify where a program spends its cycles. When moving the debugging and profiling from a host to a target, you have to take extra steps to prepare the target. The debugging user interface and profile analysis steps always execute on the host.