Late Breaking Results: Waveform-based Performance Analysis of RISC-V Processors

Lucas Klemmer
Institute for Complex Systems
Johannes Kepler University Linz
Linz, Austria
lucas.klemmer@jku.at

Daniel Große
Institute for Complex Systems
Johannes Kepler University Linz
Linz, Austria
daniel.grosse@jku.at

ABSTRACT
In this paper, we demonstrate the use of the open-source domain-specific language WAL to analyze performance metrics of RISC-V processors. The WAL programs calculate these metrics by evaluating the processor's signals while “walking” over the simulation waveform (VCD). The presented WAL programs are flexible and generic, and can be easily adapted to different RISC-V cores.

1 INTRODUCTION
Today, the processor market is dominated by a few proprietary Instruction Set Architectures (ISAs) and only a handful of very large corporations. RISC-V is an open and royalty-free ISA [10] striving for innovation through collaboration. The open nature of RISC-V has enabled even small companies and community projects to develop their own processors, which take advantage of RISC-V’s permissive license and its extensibility to explore new ideas and markets, often with highly specialized hardware.

However, this development brings its own set of challenges: the sheer number of available RISC-V cores, which are often highly configurable and extensible, makes it very hard and time-consuming for both designers and users to compare different cores and core configurations against each other [6, 9]. A sophisticated analysis of the cores is needed to obtain relevant performance metrics. Since a wide range of cores has to be evaluated, the analysis solution must satisfy several requirements: (1) the analysis must be powerful enough to cover complex analysis tasks, (2) it must be implementation-agnostic and easy to port to new cores, and (3) it must be easy to integrate into existing workflows.

In this paper, we use the open-source Waveform Analysis Language (WAL) [5] to analyze relevant metrics for several RISC-V implementations, ranging from extremely area-efficient cores to pipelined cores with higher performance. WAL has been realized as a Domain Specific Language (DSL) [8]. The language allows creating analysis programs that use the values from the VCD waveforms generated during simulation of a RISC-V core. Our contributions are flexible WAL programs for different performance metrics. The programs can be adapted and used on a wide variety of RISC-V microarchitectures.

Our experimental results demonstrate that the WAL-based analysis can clearly highlight the differences between the analyzed cores. In addition, we can quantify the performance improvements of different core configurations that can be set by enabling additional features, such as instruction caches or branch prediction.

DAC ’22, July 10–14, 2022, San Francisco, CA, USA © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-9142-9/22/07. $15.00 https://doi.org/10.1145/3489517.3530623

Figure 1: Waveform of instruction control and data signals.

2 PROCESSOR ANALYSIS WITH WAL
We demonstrate how flexible and generic processor analysis programs can be created in WAL. To this end, we first illustrate how WAL programs work (Section 2.1). Then, we introduce WAL programs to analyze the number of executed instructions per cycle (Section 2.2) and to calculate the percentage of cycles with stalled pipeline stages (Section 2.3). We consider four well-known RISC-V cores. Two of the cores are small and area-efficient [2, 3], while the other two cores are more sophisticated, pipelined, and capable of running Linux in some configurations [1, 4, 7].

2.1 WAL Program Principle
In contrast to other programming languages, WAL programs have direct access to all signal values of a waveform. Accessing signals in WAL is similar to accessing variables, with the difference that the value returned depends on the loaded waveform and the time at which the signal is accessed. Consider the waveform in Figure 1. The WAL expression (&& clk instr_done) returns true at a given time point in the waveform if and only if the clk and instr_done signals are both set to 1. In Figure 1, all time points at which the expression evaluates to true are highlighted in green. WAL provides a large collection of functions that can be used to analyze waveforms. For example, the count function can be used to count how many instructions are executed on the waveform with the WAL expression (count (&& clk instr_done)).
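The evaluation principle described above can be mimicked outside of WAL. The following Python sketch is only an illustration, not WAL itself: the waveform values are made up (they are not read from Figure 1), and the helper names both_high and count are hypothetical stand-ins for the WAL expression and the count function.

```python
# Hypothetical waveform: one sampled value per signal and time point.
# The values are invented for illustration and are not taken from Figure 1.
waveform = {
    "clk":        [0, 1, 0, 1, 0, 1, 0, 1],
    "instr_done": [0, 0, 0, 1, 1, 1, 0, 0],
}


def both_high(wave, t):
    """Mimics (&& clk instr_done) evaluated at time point t."""
    return wave["clk"][t] == 1 and wave["instr_done"][t] == 1


def count(wave, predicate):
    """Mimics count: the number of time points at which predicate holds."""
    length = len(next(iter(wave.values())))
    return sum(1 for t in range(length) if predicate(wave, t))


# Time points 3 and 5 are the only ones where both signals are 1.
executed = count(waveform, both_high)
print(executed)  # prints 2
```

The key point is that the same expression yields a different result at every time point, so an analysis reduces to walking over the time axis and aggregating.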

2.2 Instructions Per Cycle
First, we analyze the raw performance of each core in terms of executed Instructions Per Cycle (IPC). Since all analyzed cores are single-core architectures, the best theoretical IPC score is 1.0. This means that the core executes and commits one instruction in each clock cycle. However, this is almost impossible to achieve in practice, for example, due to branching and memory-induced delays.

The WAL program for IPC analysis is split into two separate parts, a generic and core-independent analysis part and the core-specific code which has to be provided by the user.
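To make the split between generic and core-specific code concrete, here is a minimal Python sketch of the idea. It is not the paper's WAL listing: the trace representation (a list of per-time-point dictionaries), the signal names retire_valid, wb_valid, and wb_stall, and the two imaginary cores are all assumptions made for illustration.

```python
def ipc(trace, instr_done, setup=None, cleanup=None):
    """Generic part: IPC for any core, given a core-specific predicate.

    trace      -- list of dicts, one per time point, mapping signal -> value
    instr_done -- core-specific function: did an instruction finish here?
    setup / cleanup -- optional core-specific hooks (hypothetical)
    """
    if setup:
        setup(trace)
    # In this toy model, every time point with clk == 1 is one cycle.
    cycles = sum(1 for s in trace if s["clk"] == 1)
    instructions = sum(1 for s in trace if s["clk"] == 1 and instr_done(s))
    if cleanup:
        cleanup(trace)
    return instructions / cycles if cycles else 0.0


# Core-specific part: each core signals a finished instruction differently.
def core_a_done(sample):
    return sample["retire_valid"] == 1


def core_b_done(sample):
    return sample["wb_valid"] == 1 and sample["wb_stall"] == 0


# Invented trace for imaginary core A: 4 cycles, 2 finished instructions.
trace_a = [
    {"clk": 1, "retire_valid": 0}, {"clk": 0, "retire_valid": 0},
    {"clk": 1, "retire_valid": 1}, {"clk": 0, "retire_valid": 1},
    {"clk": 1, "retire_valid": 0}, {"clk": 0, "retire_valid": 0},
    {"clk": 1, "retire_valid": 1}, {"clk": 0, "retire_valid": 1},
]
print(ipc(trace_a, core_a_done))  # prints 0.5
```

Porting the analysis to another core then only requires a new predicate such as core_b_done; the ipc function itself stays unchanged.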

The generic WAL program to perform the IPC analysis is shown in Listing 1. The function performs the IPC analysis for all waveforms passed in the traces parameter. For each trace, the trace is first loaded in Line 3, and then the optional setup function is called in Line 4. Users can define the optional setup and clean-up functions to perform core-specific setup and clean-up operations. Then, the number of executed instructions is calculated in Line 5 using the

WAL uses a LISP-style prefix notation.
In addition to the IPC analysis, the WAL RISC-V library also supports the calculation of pipeline stall activity on the VexRiscv processor.
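As a rough illustration of how a stalled-cycle percentage can be derived from per-cycle signal values, consider the following Python sketch. It rests on assumptions that are not taken from the paper: a cycle is counted as stalled if at least one stage reports a stall, and the signal names decode_stall and execute_stall are hypothetical rather than actual VexRiscv signals.

```python
def stalled_cycle_percentage(cycles, stall_signals):
    """Percentage of cycles in which at least one listed stall signal is 1.

    cycles        -- list of dicts, one per clock cycle, signal -> value
    stall_signals -- names of the per-stage stall signals (hypothetical)
    """
    if not cycles:
        return 0.0
    stalled = sum(
        1 for cycle in cycles
        if any(cycle[name] == 1 for name in stall_signals)
    )
    return 100.0 * stalled / len(cycles)


# Invented per-cycle samples: cycles 1 and 2 each have one stalled stage.
cycles = [
    {"decode_stall": 0, "execute_stall": 0},
    {"decode_stall": 1, "execute_stall": 0},
    {"decode_stall": 0, "execute_stall": 1},
    {"decode_stall": 0, "execute_stall": 0},
]
print(stalled_cycle_percentage(cycles, ["decode_stall", "execute_stall"]))  # prints 50.0
```

Other definitions, such as counting stalls per stage, would produce different percentages, so the chosen definition should be stated alongside any reported number.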


### 3 EXPERIMENTAL RESULTS

The results for the IPC and pipeline stall activity are summarized in Table 1. To obtain the experimental results, we analyzed the waveforms produced by running the Dhrystone benchmark on each core.

The Core column shows the name of the analyzed RISC-V core, and the Configuration column shows the core configuration. The IPC and Stalled Cycles columns show the number of instructions per cycle and the percentage of cycles with stalled pipeline stages, respectively. The IPC results show large differences in the performance of the evaluated cores. For example, the IBEX icache configuration (IPC 0.89) achieves more than 44 times the IPC of the SERV core (IPC 0.02). Enabling more sophisticated features, e.g., better branch prediction, for the VexRiscv and IBEX cores significantly reduces the number of cycles in which the pipeline is stalled. For IBEX, stalled cycles drop from 48% in the default configuration to 19% with the instruction cache, while the IPC rises from 0.63 to 0.89. This reduction correlates with the performance improvements seen in the IPC column.
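The relative comparisons can be reproduced directly from the IPC values in Table 1. The short Python sketch below does this for a subset of the rows; the dictionary layout and the helper name ipc_ratio are illustrative choices, while the numbers are copied from the table.

```python
# IPC values copied from Table 1 (subset of rows).
ipc_results = {
    ("SERV", "servant"): 0.02,
    ("Picorv32", "default"): 0.24,
    ("VexRiscv", "smallest"): 0.33,
    ("VexRiscv", "fullNoMmuMaxPerf"): 0.63,
    ("IBEX", "default"): 0.63,
    ("IBEX", "icache"): 0.89,
}


def ipc_ratio(fast, slow):
    """How many times higher the IPC of `fast` is compared to `slow`."""
    return ipc_results[fast] / ipc_results[slow]


# IBEX icache versus SERV: the "more than 44 times" comparison.
print(round(ipc_ratio(("IBEX", "icache"), ("SERV", "servant")), 1))
# Effect of enabling the instruction cache on IBEX.
print(round(ipc_ratio(("IBEX", "icache"), ("IBEX", "default")), 2))
```

Because the table reports IPC with two decimal places, ratios involving very small values such as 0.02 are sensitive to rounding and should be read as approximate.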

### 4 CONCLUSIONS

In this paper, we have demonstrated the use of the open-source language WAL to analyze performance metrics. We have shown that the analysis programs can be written in a generic form and that microarchitecture-specific details can be handled via user-defined functions. In the experiments, our analysis programs highlighted large performance differences between RISC-V cores with diverse microarchitectures.

### ACKNOWLEDGMENTS

This work has partially been supported by the LIT Secure and Core Systems Lab funded by the State of Upper Austria.

### REFERENCES


<table>
<caption>Table 1: IPC and stalled cycles per core and configuration</caption>
<thead>
<tr>
<th>Core</th>
<th>Configuration</th>
<th>IPC</th>
<th>Stalled Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>SERV</td>
<td>servant</td>
<td>0.02</td>
<td>not pipelined</td>
</tr>
<tr>
<td>Picorv32</td>
<td>default</td>
<td>0.24</td>
<td>not pipelined</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>microNoCsr</td>
<td>0.33</td>
<td>63%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>smallest</td>
<td>0.33</td>
<td>66%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>smallAndProductive</td>
<td>0.42</td>
<td>54%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>smallAndProductiveCache</td>
<td>0.47</td>
<td>51%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>twoThreeStage</td>
<td>0.47</td>
<td>48%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>secure</td>
<td>0.57</td>
<td>42%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>linux</td>
<td>0.57</td>
<td>38%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>full</td>
<td>0.57</td>
<td>35%</td>
</tr>
<tr>
<td>VexRiscv</td>
<td>fullNoMmuMaxPerf</td>
<td>0.63</td>
<td>33%</td>
</tr>
<tr>
<td>IBEX</td>
<td>default</td>
<td>0.63</td>
<td>48%</td>
</tr>
<tr>
<td>IBEX</td>
<td>icache</td>
<td>0.89</td>
<td>19%</td>
</tr>
</tbody>
</table>