RISC-V: Baremetal From The Ground Up (Chipyard Edition)

This article will walk you through the behind-the-scenes of how a baremetal C program is compiled and linked as a RISC-V binary file.

Let's start with something simple. The "hello world" equivalent in the embedded systems world is the blinky LED program:

// some handy macros to do bit operations
#define SET_BITS(REG, BIT)              ((REG) |= (BIT))
#define CLEAR_BITS(REG, BIT)            ((REG) &= ~(BIT))

// peripheral MMIO addresses
#define GPIOA_OUTPUT_VAL                0x1001000CUL
#define GPIOA_OUTPUT_EN                 0x10010008UL
#define CLINT_MTIME                     0x0200BFF8UL

// the pin we are using
const unsigned int GPIO_PIN = 0x01;

// a global counter
volatile unsigned int counter;

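// A simple delay function: busy-wait until the given number of timer ticks has elapsed.
void delay(unsigned int ticks) {
  // record the timer value on entry
  unsigned int mtime_start = *(volatile unsigned int *)CLINT_MTIME;
  while ((*(volatile unsigned int *)CLINT_MTIME) - mtime_start < ticks) {}
}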

void main() {
  // enable GPIOA as output
  SET_BITS(*(volatile unsigned int *)GPIOA_OUTPUT_EN, GPIO_PIN);

  while (1) {
    // if counter is even, turn on the LED, otherwise turn it off
    if (counter % 2 == 0) {
      SET_BITS(*(volatile unsigned int *)GPIOA_OUTPUT_VAL, GPIO_PIN);
    } else {
      CLEAR_BITS(*(volatile unsigned int *)GPIOA_OUTPUT_VAL, GPIO_PIN);
    }

    // delay for 1000 ticks
    delay(1000);
    
    // increment the counter
    counter += 1;
  }

  // we won't reach here if everything is working
}

This might look intimidating. Let's break down the elements:

First, we define SET_BITS and CLEAR_BITS as function-like macros. These come in handy because, when operating on Memory-Mapped Input/Output (MMIO) registers, we usually work at the bit level, touching only the fields we care about and leaving the remaining bits intact.
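The bit-level behavior can be tried on a host machine with an ordinary variable standing in for a register. The following is a minimal sketch: the helper set_then_clear is a hypothetical name introduced only for this illustration, and the macros are local copies in which the clear mask is built with ~ (bitwise NOT).

```c
#include <stdint.h>

// Local copies of the two macros. The clear mask is built with ~ (bitwise NOT),
// so only the requested bits are cleared and all other bits are preserved.
#define SET_BITS(REG, BIT)    ((REG) |= (BIT))
#define CLEAR_BITS(REG, BIT)  ((REG) &= ~(BIT))

// Hypothetical helper: apply both macros to a copy of a register value and
// return the result, so the effect can be inspected without real hardware.
static uint32_t set_then_clear(uint32_t reg, uint32_t set_mask, uint32_t clear_mask) {
  SET_BITS(reg, set_mask);      // turn on the bits in set_mask
  CLEAR_BITS(reg, clear_mask);  // turn off the bits in clear_mask
  return reg;
}
```

Starting from 0xF0, setting bit 0 and clearing bit 4 leaves 0xE1: every bit that was not named in either mask keeps its previous value.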

Then, we define GPIOA_OUTPUT_VAL, GPIOA_OUTPUT_EN, and CLINT_MTIME. These are the memory addresses of the corresponding MMIO registers.

Followed by that, we write our first actual line of C program, which defines GPIO_PIN as a constant. We also define a global variable called counter.

Note:

We only defined the addresses of the MMIO registers that we are going to use here. And to demonstrate the read-only data section, we deliberately define GPIO_PIN as a global constant instead of a macro.

In a more proper program, these elements are defined in a slightly different way (see CLINT, GPIO, and HAL_GPIO in Baremetal-IDE as an example).

Then, we define delay(), which reads the mtime register in the CLINT to keep track of time, and stalls the program for a given number of ticks.
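The comparison inside such a delay loop relies on unsigned subtraction, which keeps working even when the timer value wraps around. Here is a small host-side sketch of that arithmetic. The helper names ticks_elapsed and delay_done are hypothetical, and the timer readings are passed in as plain numbers instead of being read from the CLINT.

```c
#include <stdint.h>

// Hypothetical helper: number of ticks between two timer readings.
// Unsigned subtraction wraps modulo 2^32, so the result stays correct
// even if the counter overflowed between the two readings.
static uint32_t ticks_elapsed(uint32_t start, uint32_t now) {
  return now - start;
}

// Hypothetical helper: returns 1 once at least `ticks` ticks have passed.
static int delay_done(uint32_t start, uint32_t now, uint32_t ticks) {
  return ticks_elapsed(start, now) >= ticks;
}
```

A busy-wait delay simply keeps re-reading the timer until a check like delay_done becomes true.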

Note:

We are using ticks, instead of a physical unit of time like seconds or milliseconds, in the delay function. This is because without special circuits such as a Real-Time Clock (RTC), the SoC has no sense of how fast the real-world wall clock runs. The ratio between ticks and seconds is determined by the input clock frequency and the internal clock tree settings of the SoC.
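If the timer frequency of a particular SoC configuration is known, the conversion from a physical duration to ticks is a single multiplication and division. The sketch below assumes a hypothetical helper ms_to_ticks and takes the timer frequency as a parameter, since the actual value depends on the clock setup and is not given here.

```c
#include <stdint.h>

// Hypothetical helper: convert milliseconds to timer ticks.
// timer_hz is the frequency at which the timer counter increments; it is an
// assumption supplied by the caller, not a value defined by this article.
static uint64_t ms_to_ticks(uint64_t ms, uint64_t timer_hz) {
  return (ms * timer_hz) / 1000u;
}
```

For instance, with an assumed 32768 Hz timer, one second corresponds to 32768 ticks, so delay(1000) would last only about 30 milliseconds on such a configuration.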

Finally, we move to the main() function. Inside the function, we first enable the GPIO output functionality, and then proceed to an infinite loop. Inside the loop, we toggle the LED every time the loop restarts, delay for 1000 ticks, and then increment the counter.

Note:

You might be more familiar with a main that returns an int (for example, int main() ending in return 0;).

Here, main is declared void because in embedded systems main is normally an infinite loop and never returns. It does not make sense for an embedded program to "exit from main()", since there is no additional code past main().

In order to compile this C program to something our SoC can understand, we need to use the RISC-V Toolchain.

RISC-V Toolchain

The RISC-V Toolchain is a collection of executables that helps us to compile, assemble, and link the program we write in C/C++ to binary format. It can also provide tools for us to debug and analyze the generated binaries.

There is a wide range of choices of toolchains, usually marked by different prefixes. The following is a simple list of the common ones that we may encounter:

  • TODO

Here, we will use the riscv-gnu-toolchain from riscv-collab (it comes with the prefix riscv64-unknown-elf-).

For toolchain installation, see Setting up RISC-V Toolchain.

In the toolchain directory, we can see a set of executables:

-gcc is the most general one. You can consider it as the entry executable which can invoke the compiler, the linker, and the assembler by passing it different compiler flags.

-as is the assembler itself.

-ld is the linker itself.

-objdump, -readelf, and -nm are ELF file analyzers.

-objcopy is the format converter. It can convert between ELF, raw binary, hex, and many other formats.

All of these toolchain executables run on the host machine, but they know the architecture of the target SoC, and thus can build binaries in a format that our target can understand.

Build Process

Now let's dive into the build process. There are several stages in the build process. Normally, the toolchain will join several stages together to speed up the build process. Here, we pass special flags to the toolchain to let it stop at each stage, so we can take a look at the intermediate contents.

Pre-processing Stage

The first stage is the pre-processing stage.

In this stage, the compiler will resolve all the preprocessor macros and directives (basically, every line that starts with "#").

By default, the compiler will not generate this intermediate "main.i" file for us. To get it, we pass the -E argument to tell the compiler to stop after pre-processing. We use the -o argument to specify the output file.

We can see that in main.i, all of the macro defines are processed and replaced with their definition contents.
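The same textual replacement can also be observed from inside C by stringifying an expanded macro. This is only an illustrative trick, not part of the build flow: STR, XSTR, expanded, and expansion_mentions are hypothetical names, and reg is never declared because it is only ever turned into text.

```c
#include <string.h>

// Local copy of the macro under inspection.
#define SET_BITS(REG, BIT)  ((REG) |= (BIT))

// Two-level stringification: XSTR expands its argument first,
// then STR turns the expanded tokens into a string literal.
#define STR(x)   #x
#define XSTR(x)  STR(x)

// Text the preprocessor produces for SET_BITS(reg, 0x01).
static const char *expanded = XSTR(SET_BITS(reg, 0x01));

// Hypothetical helper: check whether the expansion contains a given token.
static int expansion_mentions(const char *needle) {
  return strstr(expanded, needle) != NULL;
}
```

The resulting string contains the |= operator and the argument tokens, but no longer contains the name SET_BITS, which mirrors what we see in main.i.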

Code Generation Stage

The "main.i" file is then passed through the compiler again for the code-generation stage.

In this stage, all of the high-level C/C++ code will be converted to architecture-specific assembly language.

Similarly, the compiler will not generate this intermediate file for us by default, and we need to use the -S argument to tell the compiler to stop after code generation.

The resulting file is our familiar RISC-V assembly code.

Assembling Stage

At the assembling stage, the assembly language will be further converted into binary instructions.

The output file is also called a "relocatable object file". The word "relocatable" indicates that the addresses in the program (where each piece of code is placed in memory) are not determined yet.

Same as before, we need to supply the -c flag to prevent the compiler from proceeding to the linking stage.

The relocatable object file is in Executable and Linkable Format (ELF). Since it is a binary format, we can no longer examine the content directly with a text editor, so we need the toolchain to decode it.
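One quick sanity check on any ELF file is its first four bytes, which are always 0x7F followed by the characters 'E', 'L', 'F'. The sketch below checks a byte buffer for that signature. The function name looks_like_elf is hypothetical, and reading the bytes of main.o from disk is left out.

```c
#include <stddef.h>

// Hypothetical helper: returns 1 if the buffer begins with the ELF magic
// number (0x7F 'E' 'L' 'F'), and 0 otherwise.
static int looks_like_elf(const unsigned char *buf, size_t len) {
  return len >= 4 &&
         buf[0] == 0x7F &&
         buf[1] == 'E' &&
         buf[2] == 'L' &&
         buf[3] == 'F';
}
```

Everything after this signature is structured binary data, which is why we rely on tools such as objdump and readelf to decode it.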

Analyzing Relocatable Object Files

There's still one last stage (linking stage) remaining, but let's take a side track here and examine the content of the generated "main.o" file first.

The ELF format describes how various elements of the code (e.g. code, data, read-only data, uninitialized data) are located in different sections.

We will use riscv64-unknown-elf-objdump to analyze our program.

Display Section Headers

Let's first examine the section headers in main.o.

By running objdump with the -h argument, we can print out all the section headers in an ELF file.

.text section holds the code of the program.

.data section holds the initialized global data.

.bss section holds the uninitialized global data. The memory mapped to this section will be reset to zero by the program's boot code. The name "bss" stands for "block started by symbol", and is kept for historical reasons. With the RISC-V compiler's default settings, a .sbss section is also generated, which stands for "small .bss data", and our counter variable is placed there.

.rodata section holds the read-only data. With the RISC-V compiler's default settings, a .srodata section is generated instead, which stands for "small .rodata data", and our GPIO_PIN constant is placed there.

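.comment and .riscv.attributes are metadata sections added by the toolchain: .comment records the compiler version string, and .riscv.attributes records the ISA and ABI attributes of the target.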

Note that all the sections start at address 0x00. This is why .o files are called relocatable: all of the addresses are relative, and it is the linker's job, during the linking stage, to convert these relative addresses into absolute locations.
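To connect the section names back to source code, the following sketch shows one definition of each kind and the section it typically ends up in. All names are hypothetical, and the exact placement (for example .data versus .sdata) depends on the compiler and its flags, so the comments describe the usual case only.

```c
// Initialized global: its initial value is stored in the object file,
// typically in .data (or .sdata for small objects).
unsigned int initialized_global = 42;

// Uninitialized global: only its size is recorded, typically in .bss
// (or .sbss). It is expected to read as zero once the program starts.
unsigned int uninitialized_global;

// Read-only constant: typically placed in .rodata (or .srodata).
const unsigned int read_only_constant = 0x01;

// Function body: machine code placed in .text.
unsigned int sum_of_globals(void) {
  return initialized_global + uninitialized_global + read_only_constant;
}
```

Compiling a file like this with -c and inspecting it with objdump -h is a simple way to see how each definition contributes to the size of a section.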

Display Full Content

With the -s argument, we can print out the full contents of every section in the ELF file. The result will be large, so we redirect the output to a file.

Display Disassembly

With the -d argument, we can print out the disassembly of the .text section.

Linking Stage

This is the final stage before we can get an executable binary program.

The linker will put different pieces of code and data to our desired address locations, resolve all the not-yet-defined symbols, and merge all the programs and external libraries into a single file.

We need to tell the linker how we want the program to be linked together, and that is through the use of a linker script.

Linker scripts are written in linker commands, with the file extension .ld.

Linker Commands

ENTRY defines the entry point of the program. It is the first piece of code the MCU will execute. The debugger will also set the initial PC location according to this entry value.

The syntax is ENTRY(entry_symbol_name), where entry_symbol_name is the name of the entry function.

MEMORY defines the various memory regions in the MCU and provides their locations and sizes. The linker also uses these values to calculate the total code size and memory usage, and to determine whether the program fits inside the memory.

Each region is declared inside a MEMORY { ... } block in the form name (attributes) : ORIGIN = origin, LENGTH = length.

The attributes are defined as follows:

R Read-only sections

W Read and write sections

X Sections containing executable code

A Allocated sections

I Initialized sections

L Initialized sections, same as I

! Invert the meaning of the following symbols

SECTIONS defines which sections are mapped to which memory regions, as well as the order of the mapping. It generates the defined output sections in the final ELF file. For example, we can map the .text section to the FLASH region.

Syntax of the SECTIONS command is shown below.

When the virtual memory address (VMA) and the load memory address (LMA) are the same, we only need to specify the virtual memory address.

Writing Linker Script

For the sake of simplicity, we will ignore the C runtime setup and interrupt routines for now. Our program will enter main directly and start running the blinky LED program.

Thus, the entry symbol of our program will just be main.

In Chipyard tutorial SoC design, we have three memory regions

To keep things simple, we will stack every section on top of each other on scratchpad memory.

Now we have our unsafe-but-usable linker script:

Finally, we are ready for linking.

With the -T argument, we tell gcc which linker script to use when linking the program.

Also for simplicity, we are not going to link the standard C library for now. To do that, we add the -nostdlib argument.

Format Conversion

Loading the Program

TODO

Startup Code

Our LED has successfully blinked. However, if we try running other more complex programs, they might fail. This is because we have made a lot of assumptions about the state of the SoC when we enter the main() function.

This is usually set up with a startup file. This piece of the program will be responsible for setting up the interrupt vector, initializing the stack, zeroing out the .bss section, and sometimes also copying the .data section to SRAM. Hence, we will write our own startup file to properly initialize the SoC.
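Two of these responsibilities, zeroing .bss and copying .data, are plain memory loops and can be sketched in C. This is a simplified illustration under stated assumptions: zero_bss and copy_data are hypothetical names, the region boundaries are passed in as pointers instead of coming from linker-generated symbols, and a real startup file usually does this in assembly before the stack and C environment are ready.

```c
#include <stdint.h>

// Hypothetical helper: clear every word in [start, end), as the startup
// code does for the .bss region before main() runs.
static void zero_bss(uint32_t *start, uint32_t *end) {
  while (start < end) {
    *start++ = 0;
  }
}

// Hypothetical helper: copy initial values from the load location (src)
// to the run-time location [dst, dst_end), as done for the .data region.
static void copy_data(const uint32_t *src, uint32_t *dst, uint32_t *dst_end) {
  while (dst < dst_end) {
    *dst++ = *src++;
  }
}
```

In an actual startup file, the pointer arguments would be replaced by symbols exported from the linker script that mark where each region starts and ends.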

// TODO: change

Boot Flow:

  1. The program starts at the BootROM `path`.

  2. Jump to the entry point, which is at the label: _enter in freedom-metal/src/entry.S.

  3. Initialize global pointer gp register using the generated symbol __global_pointer$.

  4. Write mtvec register with early_trap_vector as default exception handler.

  5. Read mhartid into register a0 and call _start, which exists in crt0.S.

  6. Initialize stack pointer, sp, with _sp generated symbol. Harts with mhartid of one or larger are offset by (_sp + __stack_size * mhartid). The __stack_size field is generated in the linker file.

  7. Check if mhartid == __metal_boot_hart and run the init code if they are equal. All other harts skip init and go to the Post Init Flow, step #15.

  8. Boot Hart Init Flow Begins Here

  9. Init data section to destination in defined RAM space

  10. Copy ITIM section, if ITIM code exists, to destination

  11. Zero out bss section

  12. Call atexit library function which registers the libc and freedom-metal destructors to run after main returns

  13. Call __libc_init_array library function, which runs all functions marked with __attribute__((constructor)).

  14. Post Init Flow Begins Here

  15. Call the C routine __metal_synchronize_harts, where hart 0 will release all harts once their individual msip bits are set. The msip bit is typically used to assert a software interrupt on individual harts; however, interrupts are not yet enabled, so msip in this case is used as a gatekeeping mechanism.

  16. Check misa register to see if floating point hardware is part of the design, and set up mstatus accordingly.

  17. Single or multi-hart design redirection step

      a. If the design is single-hart only, or a multi-hart design without a C-implemented function secondary_main, ONLY the boot hart will continue to main().

      b. For multi-hart designs, all other CPUs will enter sleep via the WFI instruction through the weak secondary_main label in crt0.S, while the boot hart runs the application program.

      c. In a multi-hart design which includes a C-defined secondary_main function, all harts will enter secondary_main as the primary C function.

Interrupt Vector

Stack Initialization

__stack_size

__boot_hart_idx

__global_pointer$

_sp: Address of the end of the stack for hart 0. It is used to initialize the stack pointer, since the stack grows toward lower addresses in memory. On a multi-hart system, the start address of the stack for each hart is calculated using (_sp + __stack_size * mhartid).

metal_segment_bss_target_start and metal_segment_bss_target_end: Used to zero out global data mapped to the .bss section.

metal_segment_data_source_start, metal_segment_data_target_start, and metal_segment_data_target_end: Used to copy data from the image to its destination in RAM.

metal_segment_itim_source_start, metal_segment_itim_target_start, and metal_segment_itim_target_end: Code or data can be placed in itim sections using __attribute__((section(".itim"))).
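The per-hart stack calculation described for _sp above can be written out as a small function. This is a host-side sketch of the formula only: hart_stack_start is a hypothetical name, and the values for _sp, __stack_size, and mhartid are passed in as plain numbers instead of coming from the linker script and the CSR.

```c
#include <stdint.h>

// Hypothetical helper: initial stack pointer for a given hart, following
// the formula (_sp + __stack_size * mhartid).
static uintptr_t hart_stack_start(uintptr_t sp, uintptr_t stack_size, uintptr_t hartid) {
  return sp + stack_size * hartid;
}
```

With an assumed _sp of 0x80004000 and a __stack_size of 0x400, hart 0 starts at 0x80004000 and hart 2 starts at 0x80004800, so each hart gets its own non-overlapping stack region.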
