Efficient RISC-V Floating Point Simulation

Contents

1. Introduction
2. The Story
3. Floating Point Basics
4. The Problems
5. How Other Simulators Work
6. Methods
7. Result & Discussion
- 7.1 Clean Room Benchmarks
- 7.2 My Method vs. QEMU
8. Conclusion & Outlook
9. References

1. Introduction

This post is basically an extended version of my recently published paper TODO. In order to guide expectations right from the start, I would like to answer three essential questions.

What is this post about and is it worth reading?
This post is about floating point (FP) arithmetic in simulators/emulators. So, if you ever wondered how simulators/emulators like QEMU or gem5 handle floating point arithmetic, the following might be of interest for you. Although the title says RISC-V, the methods presented here are applicable to most other Instruction Set Architectures (ISAs) as well. In fact, I also present a little section about Apple’s Rosetta 2 (x64-on-ARM).

Should I read the paper or this blog post?
Read this post for the reasons described in the next answer.

Why did I spend my free time rewriting something I already spent weeks on?
Blog posts are better than papers because:

I don’t have to appeal to reviewers
No page limit
Additional material (data, videos, code, etc.)

How to cite?
TODO

2. The Story

In 2022, a colleague of mine and his friend took the courage and founded a startup. Their flagship product is a RISC-V simulator called SIM-V [1], which can be used to simulate RISC-V systems on x64 (or other) machines. One of the key selling points is the almost native performance. The simulated system is so fast, that you can interact with it like a real system.

So, how does one make a simulator go 🚀🚀🚀?
I am certainly not giving away any secrets when I reveal that the underlying technology is Dynamic Binary Translation (DBT). So basically the same method that is used by QEMU. With DBT, binary instructions of the target system (RISC-V in our case) are translated into instructions of the host system (i.e. x64) at runtime and executed. If possible, instructions are translated 1-to-1 (or at least 1-to-only-a-few), which also explains the native speed. For example, one could simply translate a RISC-V 32-bit floating point (FP) addition fadd.s to an x64 FP addition addss. Semantically, these two instructions seem to be identical, at least at first sight.

My colleagues thought so too and implemented it this way in their first version of SIM-V. In practice, this method actually works quite well. You can boot Linux systems with it, and execute many applications without encountering problems.

One of the few applications that doesn’t work with this method is the RISV-V Architectural Test Framework (RISCOF). Unfortunately, that’s a real showstopper, since passing these tests is required to license the RISC-V trademark. Or to quote RISCOF’s documentation:

Passing the tests and having the results approved by RISC-V International is a prerequisite to licensing the RISC-V trademarks in connection with the design.

So, passing these tests was top priority and my colleagues asked me to do an investigation. After taking a closer look at the failing tests, I could pinpoint the following 6 reasons why they failed:

Different NaN encodings
Different instruction semantics
x64’s missing RMM rounding mode
NaN boxing
Floating Point exception flags
NaN Propagation

In the following, I will explain each of these points in greater detail. Subsequently, I show how other simulators and how I solve these issues.

But first, I’ll explain some basics about FP arithmetic, IEEE 754, and how it is implemented in RISC-V and x64. Feel free to skip the next section if you are already familiar with these topics.

3. Floating Point Basics

3.1 The Math

Floating point (FP) numbers are the most common way to approximate real numbers in computing. You find them in most programming languages with names such as float, double, f32 or f64. Due to the many ways FP arithmetic can be implemented, adhering to standards avoids a lot of problems. This is why most software and hardware follows the IEEE 754 standard. But also standards might be erroneous or incomplete, which is why there are now 3 versions:

IEEE 754 1985, 20 pages [2]
IEEE 754 2008, 70 pages [3]
IEEE 754 2019, 84 pages [4]

They differ mostly in some details, which will be discussed later.

The most important number formats defined by IEEE 754 are binary32 and binary64. If you program C/C++, you already know them as float and double. In Rust they are called f32 and f64. A FP number comprises a sign, a significand, an exponent, and a bias with the following bit representation:

Note that the bias is implicit and fixed. It is used to reach negative numbers in the exponent without using two’s complement. Ultimately, the numerical value of an FP number is given by:

\begin{equation} f = (-1)^{sign} \cdot (1.s_{p-1}s_{p-2}…s_1)_2 \cdot 2^{exponent-bias} \end{equation}

In the formula $s_i$ refers to the bit at position $i$ in the significand. However, there are quite a few corner cases to represent some special values.

The first case is subnormal numbers. Whenever $exponent$ is 0, the implicit leading 1 turns into a 0. So we get:

\begin{equation} f = (-1)^{sign} \cdot (0.s_{p-1}s_{p-2}…s_1)_2 \cdot 2^{-bias} \end{equation}

Having these special cases gives us some cool mathematical properties, like additions and subtractions that never underflow. However, in many other regards like hardware complexity, some mathematical proofs, or timing side channels, it can be a pain.

Another special value is infinity. If all bits in the exponent are set and the significand is 0, the value is interpreted as $\pm \infty$.

The last special value is NaN (Not a Number), which comes in two different flavors: quiet (qNaN) and signaling (sNaN). qNaNs are used to represent non-meaningful results (e.g. $\infty-\infty$), while sNaNs are intended to be used for uninitialized variables/memory. The bit pattern of a NaNs is an exponent with all bits set and a significand that is not 0. How the encoding of qNaN and sNaN differ is explained in Section “4.1 Different NaN Encoding”.

While Equation 1 is often used to introduce and understand the concept of IEEE FP numbers, the $p-1$ significand bits with an implicit leading 1 complicate mathematical proofs. A representation more suited for mathematical adventures is:

\begin{equation} \label{eq:float1} f = M \cdot 2^{e - p + 1}, \quad e=exponent-bias \end{equation}

With this representation, the significand is shifted so far, that it becomes an integer value. Due to the finite number of bits in binary32 and binary64, the precision $p$, the significand $M$, and the exponent $e$ are constrained by the values given in the following table:

data type	exponent range	precision bits	significand range
binary32	$ e_{f,min}=-126 \leq e_f \leq 127 = e_{f,max}$	$p_f=24$	$\left\lvert M_f \right\rvert \leq 2^{24}-1$
binary64	$ e_{d,min}=-1022 \leq e_d \leq 1023 = e_{d,max}$	$p_f=53$	$\left\lvert M_d \right\rvert \leq 2^{53}-1$

Note that the $p$ precision bits include the implicit leading 1. For example, a binary32 value has a precision of 24 bits of which 23 bits are explicitly stored. Hence, the representation is only suitable for normal numbers! Or in other words: don’t use this model to represent subnormal numbers!

Another really painful aspect of FP numbers is rounding errors. Whenever mathematical operations, such as additions or multiplications, are performed on FP numbers, rounding errors may occur. In literature and this post, rounding is symbolized by the $\circ$ operator. While rounding errors are hard to avoid, most FP hardware allows to control the sign of the error by means of rounding modes. With these modes, you can control whether the final result is rounded down, up, to the nearest number, or however you define it. The most recent IEEE 754 standard defines 5 rounding modes:

roundTiesToEven (mandatory)
roundTiesToAway (introduced in 2008, not mandatory)
roundTowardPositive (mandatory)
roundTowardNegative (mandatory)
roundTowardZero (mandatory)

To indicate which rounding mode is used in mathematical representations, a little acronym is added to the circle operator. For example, $\circ_{RNE32}(a+b)$ corresponds to a 32-bit addition under Round Nearest, Ties to Even (RNE) rounding mode. I’m using the acronyms from the RISC-V spec [5]. In the following, if no rounding mode is given, RNE shall be assumed.

To assess the numerical impact of these errors, one can use the standard error model of FP arithmetic [6]. According to the model, the error of many arithmetic operations (+, −, /, ·, √), including underflows, can be represented as:

\begin{equation} \label{eq:standard-error-model} \begin{gathered} z = (a \, \text{op}\, b) \cdot (1 + \epsilon ) + \eta = \circ(a \, \text{op}\, b) \\\
\epsilon \eta = 0, \quad |\epsilon| \leq \textbf{u}, \quad \eta \leq |2^{e_{min}}| \cdot \textbf{u}, \quad \textbf{u} = 2^{-p} \end{gathered} \end{equation}

Whereby $\eta$ and $\epsilon$ are used to distinguish between subnormal and normal numbers:

Normal number: $\eta=0$
Subnormal number: $\epsilon=0$

The relative error $\epsilon$ is bounded by the so-called unit roundoff error $\textbf{u}$. Note that this formula only works for the round-to-nearest rounding. To account for other rounding modes as well, you can use a roundoff error of $2\textbf{u}$. This is also referred to as the machine epsilon.

3.2 RISC-V Floating Point

In this subsection, I’ll explain how FP arithmetic works on RISC-V systems. All information presented here is based on the RISC-V ISA manual [5].

In general, RISC-V is organized in so-called extensions. Each extensions defines a certain set of instructions and other characteristics, which can be assembled to larger systems in a modular way. This includes FP arithmetic, which is used in the extensions F, D, Q, Zfa, Zfh, Zfhmin, Zfinx, Zhinx, and Zhinxmin. Moreover, there is a vector extension V, which also uses FP arithmetic. Vanilla 32-bit and 64-bit FP arithmetic is provided by the extensions F and D respectively.

All FP extensions mostly adhere to the latest IEEE 754 2019 standard [4]. Accordingly, there are 5 FP exceptions and 5 rounding modes. Reading FP exceptions and setting rounding modes is achieved by reading/writing the fcsr register (see Figure below).
Opposed to many other ISAs, RISC-V doesn’t trigger hardware traps when encountering FP exceptions. Hence, you cannot catch, for example, a resulting underflow without constantly checking the fcsr register.
Another interesting characteristic of RISC-V is the instruction-embedded rounding mode. That means, it possible to specify an operation’s rounding mode directly in the instruction’s encoding. However, if the instruction’s rounding mode encodes to “dynamic”, a global rounding mode from fcsr is used instead.
A special peculiarity, that is not part of the IEEE standard, is RISC-V’s hardware-assisted NaN boxing. With NaN boxing, the upper bits of an M-bit FP register are saturated if an N-bit value is written to it with $M>N$. Also, values smaller than FLEN (FP register width) are only considered valid if the upper bits in the register are set. For example, if a 32-bit FP value resides in a 64-bit register, it is only considered valid if the top 32 bits are set to 1. This means, instructions working solely on 32-bit FP values must check the upper bits when reading the operands and set them when writing back the result. Since the whole 64-bit value encodes to a negative qNaN, there is no risk of creating valid values by accident.
One issue where the IEEE standard leaves/left too much freedom in my oppinion are canonical qNaNs. A canonical qNaN is the specific bit pattern returned by the hardware if it executed an invalid operation (e.g. 0/0). For example, a 32-bit zero-through-zero division will result in 0x7fc00000 for 32-bit FP registers. The same 32-bit division for 64-bit FP registers results in a NaN-boxed value of 0xffffffff7fc00000. But more on that later in Subsection Different Canonical qNaN Encodings.

3.3 x64 Floating Point

Similar to RISC-V, FP arithmetic on x64 is also defined by extensions. Yet, the story for this ISA is a little bit more convoluted.

The first FP ISA for x64 was introduced in 1980 by the x87 extension. This extension was succeeded by SSE in 1999, which not only provided scalar FP arithmetic but also vector instructions. Even though SSE mostly superseded x87, today’s x64 CPUs still support the x87 extension for legacy reasons I guess. Modern compilers like gcc primarly generate SSE instructions when it comes to scalar FP arithmetic. There are only a few corner cases like long double, for which gcc will still generate x87 code.

In 2011, Intel and AMD released the first processor including the AVX extension, which had new SIMD and scalar instructions. This was followed by AVX-512 in 2016, which adds scalar FP instructions using an instruction-encoded rounding mode. Yet AVX-512 isn’t even supported by many modern CPUs and in general doesn’t seem to be a very beloved child. Or to quote Linux Torvalds: “I hope Intel’s AVX-512 ‘dies a painful death’.” (https://www.phoronix.com/news/Linus-Torvalds-On-AVX-512).

So, after having introduced 4 different FP extensions, which one is relevant for the following? It’s not x87 due to its obsolescence, and it’s not AVX-512 due to its unpopularity. Consequently, we are left with SSE and AVX. Since SSE is the default extension

Since SSE was intdocued in 1999, it adhere to the most recent IEEE standard at that time, which was IEEE 754-1985 [2].

In addition to the aforementioned functional differences of certain instructions, there are other important subtleties which need to be considered. For example, x64 misses the RMM rounding mode, which was introduced in later standards (see Figure above).
The five FP exceptions (invalid, underflow, overflow, inexact, divide-by-zero) were already defined in the first standard matching RISC-V in this regard. Yet mapping the FP exceptions from host to target turned out to be one of the most difficult challenges, as shown in the subsequent section.
In addition to the five standard FP exceptions, x64 also defines a denormal flag for the detection of subnormal results.
One further peculiarity is x64’s support for treating subnormal numbers as 0 using the FTZ and DAZ flags. Depending on the microarchitecture, the processing of subnormal numbers can reduce FPU performance by an order of magnitude or more [7]. If subnormal numbers are treated as 0, there is no risk of hampering performance. This flush-to-zero mode was designed for 3D applications where performance is a greater concern than accuracy [8].
In contrast to RISC-V, the x64 ISA also allows to specify which FP exceptions cause a trap. The corresponding masking bits are selected in the FMASK field, as depicted in the Figure above.
Another difference between RISC-V and x64 is the canonical NaN encoding. On x64 systems, the canonical NaN uses a negative sign, as opposed to the RISC-V positive sign. That means, a 32-bit qNaN as a result of an invalid operation would be encoded as 0xffc00000.

4 The Problems

As already mentioned in Section 2, we are facing 6 different problems when executing RISC-V instructions on x64 hosts. In the following, I provide a more detailed explanation.

4.1 Different Canonical qNaN Encodings

For some operands, certain FP instructions cannot provide a meaningful result. For example, when multiplying ∞ and 0, or when adding +∞ and -∞. To indicate the occurrence of an invalid operation, a specific pattern bit pattern has to be returned. This pattern is referred to as a qNaN (quiet Not A Number). There is also an sNaN (signaling Not A Number), but this is rather irrelevant in our case. So, how does the bit pattern of a qNaN look like?
The IEEE 754 standard from 1985 defines a NaN very vaguely as a number with all exponent bits set to one, and a non-zero significand. The exact difference between a qNaN and an sNaN was specified in the 2008 version, with a qNaN having a leading “1” in the significand and sNaN having a leading “0”. So, according to the latest IEEE 754 standard, a 32-bit qNaN looks like this:

x111 1111 11xx xxxx xxxx xxxx xxxx xxxx
x = arbitrary bit

As you can see, there’s not only one qNaN, but a whole range of patterns, leaving an ISA designer with the problem of which exact pattern to return when encountering an invalid operation. Since IEEE 754 unfortunately does not give a recommendation here, we see various patterns in practice. The following extended table from [9] shows the qNaN patterns of some popular ISAs.

ISA	Sign	Significand	IEEE 754 2008 compliant
SPARC	0	11111111111111111111111	✓
RISC-V $< v2.1$	0	11111111111111111111111	✓
MIPS	0	01111111111111111111111	✗
PA-RISC	0	01000000000000000000000	✗
x64	1	10000000000000000000000	✓
Alpha	1	10000000000000000000000	✓
ARM64	0	10000000000000000000000	✓
PowerPc	0	10000000000000000000000	✓
Loongson	0	10000000000000000000000	✓
RISC-V $\geq v2.1$	0	10000000000000000000000	✓

As you can see, the qNaN of RISC-V and x64 differ in their signs. Thus, if we were to translate RISC-V FP instructions one-to-one to x64, we’d have to check for qNaNs after each instruction. If qNaN is encountered as a result, the sign must be inverted. In case you’d like to see the different qNaNs, execute the following code on different ISAs:

// x64:    0xffc00000
// RISC-V: 0x7fc00000
// MIPS:   0x7fbfffff
#include <iostream>
#include <ios>

int main() {
  float a = 0.f;
  float b = 0.f;
  a /= b; // Generates a canonical qNaN.

  unsigned int* c = reinterpret_cast<unsigned int *>(&a);
  std::cout << std::hex << "0x" << *c << std::endl;
  return 0;
}

4.2 Different Instruction Semantics

Now to one of my favorite problems, which shows in an absurd way that even IEEE standards created by experts are not impeccable. Let’s start with a simple question: What is the maximum of an sNaN and an arbitrary number? Or expressed directly as instructions:

x64: maxss 5.f, sNaN = ?
RISC-V: fmax  5.f, sNaN = ?

The answers to this question are as numerous as they are confusing:

x64:
  maxss 5.f, sNaN = sNaN
  maxss sNaN, 5.f = 5.f
RISC-V <2.2:
  fmax  5.f, sNaN = qNaN
  fmax  sNaN, 5.f = qNaN
RISC-V 2.2:
  fmax  5.f, sNaN = 5.f
  fmax  sNaN, 5.f = 5.f

I guess the results show quite well, that some instructions cannot be mapped 1-to-1.

So, why is that? The answer is interesting, but not relevant for the understanding of the rest of the post. Thus, feel free to skip the rest of this subsection.

Let’s start with the odd behavior of the x64 maxss instruction. When the modern x64 floating point arithmetic was introduced as part of the SSE extension in 1999, the current IEEE 754 standard was still from 1985. If you look into this standard and look for guidance on maximum/minimum instructions, you find exactly… nothing! So, here is my guess how Intel’s engineers made it more or less compliant. Instead of regarding the maximum/mininum instruction as atomic, you define it using order relations. For example, using C++ syntax, you could define it as:

a > b ? a : b;

Fortunately, we find some information about comparisons in the standard. IEEE 754 1985 defines any comparisons with NaNs as unordered, requiring false to be returned [10]. This means, 5.f > sNaN is false, as well as sNaN > 5.f. Also things like sNaN == sNaN evaluate to false. So if every comparison with NaN is false, our maximum/minimum instruction defined by order relations will always return the second operand (b) if one or more operands are NaN. And that’s exactly what you see with x64’s maxss instruction.

A few years later, the IEEE 754 2008 standard was published, which finally included a definition of the maximum/minimum operation (see subsection 5.3.1 General operations, maxNum and minNum). According to this standard, maximum/mininum should return a qNaN when one of the operands is a sNaN. If only one of the operands is a qNaN, the number shall be returned. This definition was adopted by the RISC-V ISA for the fmax/fmin instruction and kept until version 2.2. In comparison to maxss, this instruction is commutative, which is what a maximum/minimum operation should be in my opinion. So apparently, the experts thought about commutativity, but a closer look reveals they forgot about associativity. In his article The IEEE Standard 754: One for the History Books [11] the author David G. Hough confirms that the aspect of associativity in the presence of NaNs was simply overseen. To show you what is meant by this, consider the following operations:

max(6.f, max(5.f, sNaN)) = max(6.f, qNaN) = 6.f
max(max(6.f, 5.f), sNaN) = max(6.f, sNaN) = qNaN

If you just follow the standard, you get different results depending on the way the operations are associated. That sounds like a possible source of trouble, so the experts rectified the definition in the IEEE 754 2019 standard.

To be more precise, they replaced maxNum and minNum with the associative operations maximumNumber and minimumNumber. They also introduced maximum and minimum, but these are not relevant in the context of RISC-V. These new operations simply do not turn sNaNs into qNaNs which makes them associative and commutative. Since RISC-V tries to adhere to IEEE 754 standard and is also not afraid to change things, the fmax and fmin were adjusted in version 2.2. So here we are. We just needed 34 years to figure out what the maximum/minimum of two values is.

Besides maximum and minimum, also other instructions like fused multiply-add and float to integer conversions show slightly different behavior. Execute the following program on x64 and RISC-V to see it with your own eyes:

#include <cfenv>
#include <cmath>
#include <iostream>
#include <limits>

template <typename T>
using nl = std::numeric_limits<T>;

int main() {
  // Maximum/Minimum
  float res1 = nl<float>::signaling_NaN();
  float res2 = 5.f;
#ifdef __x86_64
  asm volatile("maxss %0, %1" :"=x"(res1) : "x"(5.0f));
  asm volatile("maxss %0, %1" :"=x"(res2) : "x"(nl<float>::signaling_NaN()));
#elif __riscv
  asm volatile("fmax.s %0, %1, %2" :"=f"(res1) : "f"(5.0f) , "f"(res1));
  asm volatile("fmax.s %0, %1, %2" :"=f"(res2) : "f"(nl<float>::signaling_NaN()) , "f"(res2));
#else
  static_assert(false, "No architecture detected.");
#endif
  std::cout << "max(sNaN, 5.f) = " << res1 << std::endl
            << "max(5.f, sNaN) = " << res2 << std::endl;

  // Fused Multiply-Add
  std::feclearexcept(FE_ALL_EXCEPT);
  float res = std::fma(0, nl<float>::infinity(), nl<float>::quiet_NaN());
  std::cout << "Invalid: " << std::fetestexcept(FE_INVALID) << std::endl;

  // Float to Integer
  volatile float a = 2e10;
  std::cout << "(int)2e10 = " << (int) a << std::endl;

  return 0;
}

On x64 the output is:

max(sNaN, 5.f) = 5
max(5.f, sNaN) = nan
Invalid: 0
(int)2e10 = -2147483648

On RISC-V you get:

max(sNaN, 5.f) = 5
max(5.f, sNaN) = 5
Invalid: 16
(int)2e10 = 2147483647

4.3 The Missing Rounding Mode

As already explained in the background section, x64 misses the “roundTiesToAway”, which was introduced in the IEEE 754 2008 standard. So, whenever we want to simulate RISC-V FP instructions under a “roundTiesToAway”, the host’s FPU cannot be used. Yet, this is a corner case, as most applications just use the default RNE rounding mode.

4.4 NaN Boxing

Now to a unique feature/clarification that was introduced in 2017 with version 2.2 of the RISC-V ISA [12]. Until version 2.2, there was no definition of how 32-bit FP values are encoded in 64-bit registers. This can lead to several problems as described in [13] and [14]. After a lively discussion, the chosen solution was a NaN boxing scheme, which was used in no other ISA at that point as far as I know (remark: in 2019 OpenRISC 1000 also adopted NaN Boxing with version 1.3). That means, if a 32-bit FP value is stored in a 64-bit FP register, the upper 32 bits are set to 1’s. Hence, the 32-bit FP value is basically a payload of a 64-bit negative qNaN.

This gives you some advantage in terms of debuging capabilities, but requires additional treatment for emulation. If you want to see NaN boxing in action, execute the following code on RISC-V and x64:

#include <iostream>

int main() {
  const float a = -0.f;
  const double b = -0.;
  double out;

#ifdef __x86_64
  // Storing the float does not touch the upper bits.
  // Hence, the output is 0x8000000080000000 (-1.0609978955e-314).
  asm volatile("movsd %2, %%xmm0 \n\t\
                movss %1, %%xmm0 \n\t\
                movsd %%xmm0, %0"
                : "=x" (out) : "x" (a), "x" (b) : "xmm0");
#elif __riscv
  // Output should be -qNaN due to RISC-V NaN boxing.
  asm volatile("fmv.d f0, %2 \n\t\
                fmv.s f0, %1 \n\t\
                fmv.d %0, f0"
                : "=f" (out), "=f" (a) : "f" (b) : "f0");
#else
  static_assert(false, "No architecture detected.");
#endif

  std::cout << "out = " << out << std::endl;
  return 0;
}

4.5 NaN Propagation

A feature recommended but not mandated by IEEE 754 is NaN propagation. The idea is to propagate inputs NaN payloads through instruction as some kind of diagnostic information. It is part of x64 and ARM, but RISC-V doesn’t mandate it due to additional hardware costs. To see how it looks like, execute the following code on x64 and RISC-V:

// x64:    0xffc00123
// RISC-V: 0x7fc00000
#include <iostream>

int main() {
  float a = 0.f;
  float b;
  unsigned int *ai = reinterpret_cast<unsigned int *>(&a);
  unsigned int *bi = reinterpret_cast<unsigned int *>(&b);
  *bi = 0xffc00123;
  a += b;

  std::cout << std::hex << "0x" << *ai << std::endl;
  return 0;
}

4.6 Floating Point Exception Flags

Whenever FP instructions are executed, certain exceptions may occur. The IEEE 754 standard defines 5 exception flags which indicate irregularities during an instruction’s execution:

invalid (e.g.: $\infty-\infty = qNaN$)
underflow (e.g.: $(1.5046328E−36)^2=0$)
overflow (e.g.: $(1.5845633𝐸29)^2=\infty$)
inexact (e.g: $0.00390625+65536=65536$)
divide-by-zero: (e.g: $1/+0 = \infty$)

This was already defined in the first standard and hasn’t changed. So, what is the problem if RISC-V and x64 are equal in this regard? Finding a working solution isn’t the problem, but having a fast one is.

But let me begin with the naive approach, that I call FPU guards. It involves the following steps to load and save the FP exception flags from the mxcsr register:

Save host FPU state
Load target FPU state
Execute target FP instruction(s)
Save target FPU state
Load host FPU state

Or in C++ terms, it could look like this:

#include <cfenv>

struct fpu_guard {
  std::fenv_t envp;
  void lock() {
    std::fegetenv(&envp);
    std::fesetenv(&envp);
  }

  void unlock() {
    std::fegetenv(&envp);
    std::fesetenv(&envp);
  }
};

int main() {
  fpu_guard fg;
  float a, b, c;

  fg.lock();
  a = b + c;
  fg.unlock();

  return 0;
}

It’s simple, maintainable, and ISA-agnostic. So why not use it? Because it is ridiculously slow. The lock guard, including FP operation, just comprises a few instructions, so you’d expect a performance in the range of 100-1000MIPS. But what you get is merely 2-4 MIPS, even on the most modern machines.

As a computer engineer, it’s my passion to explore such mysteries, which is what I will do in the rest of this subsection. The slow part of my code is obviously the lock guard, which is implemented by fegetenv and fesetenv from the standard library. Consequently, analyzing the corresponding code in glibc seems to be the next logical step. With a few minutes of research, I found the following code (which I also deconvoluted and commented a little bit) for the fegetenv function.

int __fegetenv (fenv_t *envp) {
  // x87 state
  __asm__ ("fnstenv %0\n" : "=m" (*envp));
  __asm__ ("fldenv %0\n" : "=m" (*envp));

  // SSE state
  __asm__ ("stmxcsr %0\n" : "=m" (envp->__mxcsr));

  return 0;
}

As you can see, it only comprises 3 instructions. Two of them are responsible for the x87 part (yey, legacy), while only one is needed to fetch the mxcsr register. In a profiling run, I could see the x87 part taking about 90% of the total execution time of the function. That’s a big share, considering that x87 is an obsolete extension for which compilers, with a few exceptions, no longer generate code.
So, I decided to remove the x87 instructions and reevaluate the performance. Now it was faster, but still far away from my excpectations. Since there’s only one remaining instruction, the case is clear more or less. In the infinite realms of the internet, I found this cool website/document, which analyzed the throughput and latency of all x64 instructions. The following table summarizes it for the LDMXCSR and STMXCSR instructions (load and store of the MXCSR register).

µArch	Latency		Reciprocal Throughput
	LDMXCSR	STMXCSR	LDMXCSR	STMXCSR
AMD Zen 2	-	-	17	16
AMD Zen 3	13	13	20	15
AMD Zen 4	13	13	21	15
Intel Coffee Lake	5	4	3	1
Intel Cannon Lake	5	4	3	1
Intel Ice Lake	6	4	3	1

As you can see in the table, executing these instructions is relatively costly (13 cycles latency for the AMD Zen microarchitecture). Surprisingly, AMD also performs much worse than Intel. Since I used an AMD machine for my benchmarks, better results could have been obtained with an Intel CPU. Anyway, I ultimately wanted an approach that works well on all microarchitectures, so I decided to go for something different as shown later.

A possible approach to hide the expensive cost of LDMXCSR and STMXCSR, is to only invoke them when the simulator switches between the generated code and the host environment. As already hinted in the FPU guard description, multiple instructions can be between LDMXCSR and STMXCSR. I guess this allows to attain reasonable performance, but you drastically reduce the code modularity. You also increase the cost of switching between simulator and simulated code. So, in the end, I took a different way.

But before I present that, the next section shows how other simulators deal with all these problems.

5. How Other Simulators Work

Whenever I code something, I try to get some inspiration from other projects first. Or as one of my colleagues said: “Before you code something simulation-related, ask yourself: What would QEMU do?”
Wise words to live by, so the next sections dissect the floating point implementations of a few simulators, such as QEMU, rv8, and gem5. I also present all academic works that have been published in this field. Don’t worry, it’s only 3 papers.

5.1 Soft Float

The open-source projects gem5 [15], Spike [16], Uni Bremem RISC-V VP [17], [18] and QEMU [19] pre-v4.0.0, all use a method called soft float to simulate FP arithmetic. Note that QEMU changed to a different approach in version 4.0.0, but more on that later. The idea of soft float is to use integer arithmetic and boolean operations to mimic arbitrary FP behavior. It often comes as a C/C++ library, making it easy to integrate. For example, all simulators listed above use the open-source library Berkley Softfloat by J. Hauser [20], which is based on the IEEE 754 1985 standard. Soft float libraries that implement the more recent IEEE 754 2008 standard include SoftFP by F. Bellard [21], and FLIP by C.-P. Jeannerod et al. [22]. Besides generic solutions in programming languages like C, there are also architecture-optimized soft float libraries. For example, RVfplib [23] is an optimized soft float library for RISC-V systems that do not include the F or D extension.

The availability of multiple open-source libraries and the ease of use make it the most popular FP arithmetic simulation approach. If you are starting to develop your own simulator, I recommend to use it for the first proof of concept. That’s also what we did at MachineWare. Yet, the performance might be somewhat disappointing. Using tens or hundreds of integer instructions to simulate one FP instruction can easily reduce your performance by that same factor. Some exact slowdown factors are provided in the results section.

If you want to enjoy the full pain of coding your own soft float library, the Handbook of Floating Point Arithmetic [24] provides you with all the necessary background information.

5.2 rv8

rv8 [25], [26] is an open-source, DBT-based, RISC-V emulator for x64 hosts. With rv8, the RISC-V target exception flags and rounding mode are mapped 1-to-1 to the x64 host. As explained in Section 4.5 Floating Point Exception Flags RISC-V and x64 are equivalent with regards to their exception flags. Thus, checking and setting the exceptions flags is simply achieved by accessing the host’s mxcsr register. Yet mapping the rounding modes poses a problem, as x64 is not endowed with a mode for RMM. The solution of rv8 for this problem can be found in the following Code in Line 9 (rv8/src/asm/fpu.h):

inline void fenv_setrm(int rm) {
    int x86_mxcsr_val = __builtin_ia32_stmxcsr();
    x86_mxcsr_val &= ~x86_mxcsr_RC_RZ;
    switch (rm) {
        case rv_rm_rne: x86_mxcsr_val |= x86_mxcsr_RC_RN; break;
        case rv_rm_rtz: x86_mxcsr_val |= x86_mxcsr_RC_RZ; break;
        case rv_rm_rdn: x86_mxcsr_val |= x86_mxcsr_RC_DN; break;
        case rv_rm_rup: x86_mxcsr_val |= x86_mxcsr_RC_UP; break;
        case rv_rm_rmm: x86_mxcsr_val |= x86_mxcsr_RC_RN; break;
    }
    __builtin_ia32_ldmxcsr(x86_mxcsr_val);
}

In the function fenv_setrm(int rm) the RISC-V rounding mode is loaded into the host FPU. The missing rounding mode RMM of x64 is incorrectly mapped to RNE. This trades accuracy for simplicity/performance and leads to rv8 not being compliant with the official RISC-V standard.

Other problems, such as NaN boxing or semantically different instructions, are solved by rectifications in software. Furthermore, FP instructions are not directly translated, but use an interpreter. This interpreter falls back to standard C++ operators to implement RISC-V instructions. For example, the following Code shows the implementation of the fadd and fmax instructions.

P::ux exec_inst_rv32(T &dec, P &proc, P::ux pc_offset) {
    // ...
    switch (dec.op) {
        case rv_op_fadd_s:
          if (rvf) {
              fenv_setrm((fcsr >> 5) & 0b111);
              freg[dec.rd] = freg[dec.rs1] + freg[dec.rs2];
          }
          break;
          case rv_op_fmax_s:
              if (rvf) {
                  freg[dec.rd] = (freg[dec.rs1] > freg[dec.rs2]) || isnan(freg[dec.rs2])
                                ? freg[dec.rs1] : freg[dec.rs2];
              }
          break;
    // ...
  }
}

5.3 QEMU post-v4.0.0

As of version 4.0.0, there is an ongoing effort to speed up the FP performance of QEMU by using the method from Guo et al. [27]. The initial idea of Guo et al. was to calculate the result of a FP instruction on the host FPU, while the exception flags are emulated in software. This approach showed a significant overhead of the calculation of the inexact exception, leading to no speedup compared to soft float. Note that they could find a fast solution for additions, but more on that in Section 5.5 You et al..
To avoid the high costs for the inexact calculation, an FP operation is preceded by a quick check, whether the exception must be calculated at all. Because with most ISAs, such as RISC-V or x64, the inexact exception is sticky. That means, if an instruction has generated an inexact result, the inexact exception remains set, even if subsequent instructions produce exact results. Since most applications do not clear the inexact exception and tend to generate an inexact result at some point, it can be assumed that in most cases the inexact exception is already set. Hence, there’s no need to recalculate it for every instruction.

An example for the square root instruction in QEMU using the method of Guo et al. is shown in the following, simplified, code from qemu/fpu/softfloat.c: (yes, despite not being a mere soft float implementation, the file is called “softfloat” ¯\(ツ)/¯ )

static inline bool can_use_fpu(const float_status *s) {
    if (QEMU_NO_HARDFLOAT)
        return false;

    return likely(s->f_excep_flags & f_flag_inexact && s->f_round_mode == f_round_near_even);
}

float32 float32_sqrt(float32 xa, float_status *s) {
    union_float32 ua, ur;

    ua.s = xa;
    if (unlikely(!can_use_fpu(s)))
        goto soft;

    float32_input_flush1(&ua.s, s);
    if (unlikely(!float32_is_zero_or_normal(ua.s) || float32_is_neg(ua.s)))
        goto soft;

    ur.h = sqrtf(ua.h);
    return ur.s;

    soft: return soft_f32_sqrt(ua.s, s);
}

The function starts with a call to can_use_fpu. This corresponds to the aforementioned check whether the inexact flag must be calculated at all. Furthermore, the host FPU can only be used if the host rounding mode corresponds to the target’s rounding mode. It is assumed that the default C rounding mode of RNE is used and not altered during execution. Hence, a check of the target’s rounding mode is sufficient. Since some target architectures like PowerPC do not have a sticky inexact exception, the check can be skipped in advance by defining the macro QEMU_NO_HARDFLOAT accordingly.

To also avoid setting the underflow and invalid exception, the soft float fall back is called if the input is negative or subnormal. The idea of extending Guo’s method by checking both underflow and invalid flags was proposed by Cota et al. [28]. It was also E. G. Cota who committed the code to QEMU in 2018. If all checks passed, the hard float function sqrtf is called, resulting in a sqrtss instruction for x64 hosts. In case the checks do not pass, the corresponding soft float function soft_f32_sqrt is called.

Using the method of Guo and Cota, the performance of FP instructions can be increased by a factor of more than $2\times$. However, this speedup can only be achieved if the RNE mode is used and if an inexact exception occurs at some point. To tackle the latter issue, at least for additions, Guo et al. developed a quick inexact check, which resembles the Fast2Sum algorithm by T. J. Dekker [29].

5.4 Rosetta 2

Rosetta 2 is Apple’s x64-on-ARM emulator, which was introduced in 2020 to aid the transition from x64 to ARM-based Apple Silicon [30]. Despite translating instructions from x64 to ARM, which is not the focus of this post, the underlying principle can be applied to any architecture as well. In fact, I’m currently implementing a similar thing for RISC-V, but shhhh.

Since Apple does not disclose the technical details of their products, the following statements are based on internet sources. In general, most problems of x64-to-ARM FP simulation concern non-standard behavior and cases labeled as “implementation defined”. For example, the FTZ and DAZ flags of the x64 ISA are not part of the IEEE 754 standard. These flags allow to individually flush the input and output of an instruction to zero. Similarly, the ARM ISA also allows to flush numbers to zero, yet there is no way to control both input and output as on x64.

According to [31], Apple introduced an alternate FP mode to solve this problem in hardware. By setting a certain bit in the ARM FP control register, x64 FP arithmetic can be mimicked. While the Rosetta 2 approach allows for maximum performance, it requires full control of the ISA and silicon. Shortly after Apple’s release of the M1 processor [32], the first physical implementation of the alternate FP mode, ARM officially included this mode in the ARMv8 ISA. More specifically, it is part of ARMv8.7 architecture extension from January 2021 [33] and technically referenced it as FEAT_AFP. Thus, in the future, the alternate FP mode could also find its way in the products of other manufacturers.

Interestingly, just recently I saw this article about Loongson’s LBT extension for hardware accelerated DBT. The Loongson ISA manual and this article still lack important details, but I guess that parts of the additional hardware features go into a similar direction as FEAT_AFP.

5.5 You et al.

As mentioned, in Section 5.3 QEMU post-v4.0.0, Guo et al. [27] tried to implement software-based calculations for the inexact exception, but could only come up with a solution for additions/subtractions. Their solution looked as follows:

inexact = ((a + b) - a) < b

Guo et al. don’t mention it in their paper, but that is pretty much the so-called Fast2Sum algorithm that was introduced in 1971 by T. J. Dekker [29]. According to Dekker, the result of a rounded addition can be described by the sum of its exact value and a residual:

\begin{equation} \label{eq:fast2sum-main} \begin{gathered} a + b - r = s = \circ(a + b) \\\
r = \circ(b - \circ(s - a)) \quad with : |a|>|b| \end{gathered} \end{equation}

The residual can be calculated by rounded FP instructions as follows:

\begin{equation} \label{eq:fast2sum-residuum} \begin{aligned} r = \circ(b - \circ(s - a)) \quad with : |a|>|b| \end{aligned} \end{equation}

As mathematically proven by Dekker, the residual $r$ holds the exact rounding error of the addition of the variables $a$ and $b$. Hence, if the residual $r$ is not 0, the FP addition was inexact. Additionally, the value of the residual also determines the rounding direction of the preceding addition $\circ(a + b)$. For values greater than 0, the result of the addition was rounded down; for values less than 0, the result was rounded up. This fact wasn’t used by Guo et al. [27], but by You et al. [34] in 2019. Note that Guo et al. [27] and You et al. [34] share a similar co-author. So, ultimately, a solution to emulate RUP rounding using RNE on the host might look like this

// Fast2Sum for RUP
float c = a + b; // Result.
float x = fabs(a) > fabs(b) ? a : b;
float y = fabs(a) > fabs(b) ? b : a;
float r = y - (c - x); // Rounding error.
if (r != 0) {
  inexact = true;
  if (r > 0) {
        c = nextup(c); // Next greater FP value.
        overflow = is_inf(c) ? true : overflow;
  }
}

While You et al. and Guo et al. managed to develop fast inexact checks and rounding adjustments for additions/subtractions, other arithmetic instructions remained untouched. They developed an inexact check for FMA instructions using integer-based intermediate results, but their measurements show no speedup compared to a soft float implementation. So, let’s take a look at a more successful attempt in the next section.

5.6 Sarrazin et al.

The approach from Sarrazin et al. [35] isn’t really about determining the inexactness of FMA, but it comes close to it. Interestingly, their work was published in 2016, which predates the unsuccessful attempt of You et al. [34] in 2019.

The group of Sarrazin faced the problem of emulating FMA instructions on systems with hardware FMA support. So, they combined UpMul with the 2Sum algorithm to get the following equations: \begin{equation} \label{eq:ErrFma-residual} \begin{gathered} M = \circ_{64}(a \cdot b) \\\
S,T = 2Sum(M, \circ_{64}(c)) \\\
r = \circ_{32}(S) \\\
E = ||S-r|| \\\
with : \circ_{32}(a)=a, \quad \circ_{32}(b)=b, \quad \circ_{32}(c)=c \end{gathered} \end{equation} The output of the 2Sum algorithm is identical to the Fast2Sum algorithm, which was presented in the previous subsection. A more detailed discussion about the differences and performance implication is provided in the following section. The residual $T$ (yes, suboptimal variable name) determines if the addition $c$ and $a \cdot b$ was inexact. This can have an impact on the rounding if $E$ is in the middle of two 32-bit FP numbers ($E=2^{e_r - p}$). So, if $E$ is equal to $2^{e_r - p}$, you have to check $S$ and $T$, and adapt $r$ accordingly.

As you can see, that doesn’t really indicate if the calculation was inexact or not. Later in Section 6.5 Fast 32-bit Fused Multiply-Add, I show how the equations can be rearranged to fulfill that purpose.

One major disadvantage of the method by Sarrazin et al. is the dependence on larger data types. If the residual of a 32-bit FMA instruction is computed, at least 64-bit FP precision is required. Or more precisely, the larger data type needs at least $2p$ significand bits. Hence, this algorithm does not work for double precision values on x64 systems. The 80-bit precision provided by x87 FPU cannot be used, as it does not have $2p$ significand bits.

6. Methods

In this section, I show which methods I used and developed to equip MachineWare’s SIM-V simulator with an ultra-fast FP arithmetic. As shown in the previous section, there are numerous ways to simulate FP arithmetic. To make life easy for myself, I implemented a soft float library for the first proof concept. With soft float, SIM-V was able to pass the RISCOF, but the performance was underwhelming. So, for the second attempt, I implemented QEMU’s method. This already increased the speed significantly, and profiling showed that there was only a limited room for optimization. In more than 99.9% of all cases, the critical exception flags are already set and don’t need to be recalculated.
From the point of view of a programmer, certainly good - there is nothing more to do!
For an ongoing Phd under pressure to publish, rather suboptimal - there is nothing more to research!

Ok, but what if I focus on some of the corner cases in which QEMU’s method doesn’t perform well? For instance, if the target doesn’t use RNE, QEMU always has to fall back to soft float. You et al. [34] already showed how the residual of an addition could be used to account for different rounding modes. But they didn’t propose any methods for other arithmetic instructions, such as multiplication, division, or square root.

So, in the following, I will show for all relevant arithmetic instructions, how to quickly calculate a residual that can be used to determine inexactness and perform directed rounding. I call this approach floppy float, because it’s somewhere between soft and hard float. As far as I know, the methods for division and square root haven’t been described anywhere else in literature so far. The goal of the method is to perform equally fast as QEMU for standard rounding, and outperform it for non-standard rounding.

Besides using mathematical proofs to check the validity of the approaches, all instructions were verified using the RISC-V Architecture Test [37], as well as hand-crafted tests to confirm corner cases.

NOTE
In the following I’m using a positive residual (e.g. $c_{exact} + r = \circ(a+b)$). Hence, if $r>0$, the result was rounded up, and if $r<0$, the result was rounded down. In my opinion it feels more intuitive this way.

6.1 Fast Addition/Subtraction

As explained in Section 5.5 You et al. the work of You et. al [34] uses the Fast2Sum algorithm for the calculation of the residual $r$. This requires two arithmetic operations, but the operands must be sorted by absolute value. Consequently, branching instructions might be needed, which can lead to performance penalties. As an alternative without sorted operands, O. Møller [38] proposed the 2Sum algorithm in 1965. Similar to Dekker’s Fast2Sum algorithm, the 2Sum’s motivation was to increase accuracy in floating point calculation. But roughly 50 years later, we found a way to use it to speed up our simulations! Opposed to the Fast2Sum algorithm, the 2Sum algorithm does not require branching instructions, but involves more arithmetic instructions: \begin{equation} \label{eq:2sum-main} \begin{gathered} c_{exact} + r = c = \circ(a+b) \\\
a’ = \circ(c-b) \quad b’ = \circ(c-a’) \\\
\delta_a = \circ(a’ - a) \quad \delta_b = \circ(b’ - b) \quad r = \circ(\delta_a + \delta_b) \end{gathered} \end{equation}

This algorithm also exhibits some potential for instruction-level parallelism/vectorization, as the data dependency graph reveals:

In some benchmark experiments I ran, the 2Sum algorithm was ~10% faster than the Fast2Sum algorithm when working on randomized data. If the input data is predictable, thus favorable to the branch predictor, both algorithms achieve the same performance. Ultimately, a 32-bit FP add for RUP rounding might look like this:

// RUP case
float c = a + b; // Result.
float ad = c - b;
float bd = c - a;
float da = ad - a;
float db = bd - b;
float r = da + db; // Residual.
if (r != 0.f) {
    inexact = true;
    if (r > 0.f) {
        c = nextup(c);
        overflow = (c == infinity) ? true : overflow;
    }
}

6.2 Fast 32-bit Multiplication

For the fast calculation and rounding of multiplications, I exploited one interesting property of IEEE FP numbers: multiplying two 32-bit FP values as 64-bit values always yields an exact result! Similar to addition, this allows to calculate a residual, which can be used for rounding and setting the inexact flag. For the sake of simplicity, I will call this approach UpMul from now on.

So, let’s start with some operands $a$ and $b$ as 32-bit FP values. In a first step, these are upcasted to 64-bit values and then multiplied. Since the number of significands more than doubles from 32-bit FP to 64-bit FP, the result of the multiplication can be represented exactly. If the exact value is subtracted from the erroneous value, the residual remains: \begin{equation} \label{eq:upmul-main} \begin{gathered} c_{exact} + r = c = a \cdot b + r = \circ_{32}(a \cdot b) \\\
r = a \cdot b + r - (a \cdot b) = \circ_{64}(\circ_{32}(a \cdot b) - \circ_{64}(a \cdot b) ) \end{gathered} \end{equation}

The mathematical proof is provided at the end of this section. A C/C++ implementation for the RUP rounding mode can be found in the following code:

// RUP case
float c = a * b;
double r = (double)c - (double)a * (double)b;
if (r != 0.) {
  inexact = true;
  if (r < 0.) {
    c = nextup(c); // Next greater FP value.
    overflow = is_inf(c) ? true : overflow;
  }
  underflow = (is_subnormal(c) || is_zero(c)) ? true : underflow;
}

As shown in the code, an inexact calculation has occurred if $r\neq 0$. Subsequently, the result is rectified in case the host hardware rounded it down. This could lead to an overflow, hence the result is checked for infinity. According to the RISC-V ISA, tininess is detected after rounding, requiring an underflow check after rectification. Note that underflow only occurs when the result is subnormal and inexact.

So, now let’s take a look at mathematical proof of this method. The formula can be derived by first showing that the multiplication of the 32-bit values as 64-bit values is exact. Using Equation \ref{eq:float1} the multiplication can be expressed as: \begin{equation} \label{eq:upmul3} \begin{aligned} a \cdot b = M_a \cdot M_b \cdot 2^{e_a + e_b - 2p_f + 2} = c = M_c \cdot 2^{e_c - p_d + 1} \end{aligned} \end{equation} As stated in Section 3.1 The Math, this model is not suitable for subnormal numbers. So, how to deal with this case? The trick is, we don’t need to consider it!
Casting 32-bit FP values to 64 bit can never lead to subnormal results.
And even the following multiplication cannot lead to subnormal results.
Why is that?
The smallest subnormal 32-bit FP number is $2^{e_{f,min}- p_f + 1} = 2^{-149}$. Multiplying the smallest subnormal 32-bit FP number with itself results in $2^{2 \cdot -149} = 2^{-298}$. These results are still far away from the 64-bit subnormal range, which begins at $2^{e_{d,min}} = 2^{-1022}$. GG EZ!

Next, we derive the maximum ranges of $M_c$ and $e_c$: \begin{equation} \label{eq:upmul5} \begin{gathered} |M_c| = |M_a \cdot M_b| \leq (2^{p_f}-1)^2 \leq (2^{24}-1)^2 \leq 2^{48} - 1 \leq 2 ^{p_d} - 1 \leq 2 ^{53} - 1 \\\
|e_c| = |e_a + e_b - 2p_f + p_d + 1| \leq 260 \leq |e_{d,min}| \end{gathered} \end{equation} Since both $M_c$ and $e_c$ fit into the range of a double-precision value, the result of the multiplication is exact. From Equation \ref{eq:upmul5} we can also see why $2p$ significand bits are required to represent a multiplication exactly.

As the final step, the exactness of the subtraction needs to be shown. Here I simply used Sterbenz’ Lemma [39] . According to his Lemma, the subtraction of two very close FP numbers is always exact. Interesting remark: this only works if the FP number format supports subnormal. Or to express it mathematically:

\begin{equation} \label{eq:sterbenz} \begin{gathered} \text{if} \quad a/2 \leq b \leq 2a \\\
\text{then} \quad \circ(b - a) = b - a \end{gathered} \end{equation}

Since the values of $\circ_{64}(a \cdot b)$ and $\circ_{32}(a \cdot b)$ differ by not more than $2\times$ their subtraction is exact.

6.3 Fast 32-bit Division

For the fast division, I developed a new method called UpDiv, which was not seen in any other work before. Similar to the UpMul method from before, both operands must be 32-bit FP values, and the goal is to compute the residual $r$. However, in this case, the exact determination of the residual of a division is overambitious, as certain rational numbers cannot be represented with a finite number of significand bits. Nevertheless, the exact value of the residual is not crucial for our endeavor. Rather, we want to know whether there was a rounding error, and if it is positive or negative. In mathematical terms, an approximation of the residual $\tilde{r}$ is sought, for which $sgn(\tilde{r})=sgn(r)$ is satisfied. Such an approximation is obtained by: \begin{equation} \label{eq:updiv-main} \begin{gathered} a / b + r = c_{exact} + r = c = \circ_{32}(a / b) \\\
\tilde{r} = \circ_{64}(\circ_{64}(\circ_{32}(a / b) \cdot b) - a) \cdot sgn(b) \end{gathered} \end{equation}

And in terms of C/C++:

// RUP case
float c = a / b;
double r = (double)c * (double)b - (double)a;
r = signbit(b) ? -r : r;
if (r != 0.) {
  inexact = true;
  if (r < 0.) {
    c = nextup(c); // Next greater FP value.
    overflow = is_inf(c) ? true : overflow;
  }
  underflow = (is_subnormal(c) || is_zero(c)) ? true : underflow;
}

If you are interested in the mathematical proof, here it comes.

The equation can be derived by using the standard model of FP arithmetic extended for subnormals (see Equation \ref{eq:standard-error-model}). According to the model, the error of the FP division, including underflow and overflow, can be represented as follows: \begin{equation} \label{eq:updiv3} \begin{aligned} \frac{a}{b} \cdot (1 + \epsilon_1 ) + \eta_1 = \circ_{32}(a/b) = a / b + r \end{aligned} \end{equation} If the result of the division is upcasted to 64-bit and multiplied by the value of $b$, which is also upcasted to 64-bit, the result must be exact (see previous subsection). This allows to calculate the approximation $\tilde{a}$ as follows: \begin{equation} \label{eq:updiv4} \begin{aligned} \tilde{a} = a + a \epsilon_1 + b \eta_1 = \circ_{64}(b \cdot \circ_{32}(a/b)) \end{aligned} \end{equation} Subtracting $a$ from $\tilde{a}$ yields Equation \ref{eq:updiv5}: \begin{equation} \label{eq:updiv5} \begin{gathered} z = \circ_{64}(\tilde{a} - a) = \circ_{64}(a - \circ_{64}(b \cdot \circ_{32}(a/b))) = (a \epsilon_1 + b \eta_1)(1 + \epsilon_2) \\\
z = \begin{cases} b \eta_1 (1 + \epsilon_2) & subn.\\\
a \epsilon_1 (1 + \epsilon_2) & else \end{cases} \end{gathered} \end{equation} Although this addition can be inexact, which is described by $\epsilon_2$, the result 0 can only be obtained if the preceding division was exact ($\epsilon_1=\eta_1=0$). Otherwise, the sign of $z$ is directly determined by $a \epsilon_1$ or $b \eta_1$. Next, Equation \ref{eq:updiv5} is rearranged to: \begin{equation} \label{eq:updiv7} \begin{aligned} \epsilon_1 = \frac{z}{a \cdot (1 + \epsilon_2)} \quad \eta_1 = \frac{z}{b \cdot (1 + \epsilon_2)} \end{aligned} \end{equation} Inserting Equation \ref{eq:updiv7} into Equation \ref{eq:updiv3} yields for both cases the following residual: \begin{equation} \label{eq:updiv8} r = \frac{z} {b \cdot (1 + \epsilon_2)} = \frac{\circ_{64}(a - \circ_{64}(b \cdot \circ_{32}(a/b)))} {b \cdot (1 + \epsilon_2)} \end{equation} Therefore, the residual can only be 0 if $z$ is 0 as well. Likewise, the sign of $r$ is directly determined by $z$ and $b$. Consequently, we conclude $sgn(\tilde{r}) = sgn(r)$.

6.4 Fast 32-bit Square Root

The calculation of a fast square root and its residual follows the same principle as the UpDiv algorithm. Hence, I named it UpSqrt. I exploit that multiplication is the inverse operation of square root, and that multiplication with larger data types is exact. The residual results according to Equation \ref{eq:upsqrt-main}:

\begin{equation} \label{eq:upsqrt-main} \begin{gathered} \sqrt{a} + r = b_{exact} + r = b = \circ_{32}(\sqrt{a}) \\\
\tilde{r} = \circ_{64}(\circ_{64}(\circ_{32}(\sqrt{a})^2) - a) \end{gathered} \end{equation}

The proof of the algorithm is equivalent to the proof of the UpDiv algorithm. Again, an approximation $\tilde{r}$ for the residual $r$ with $sgn(r) = sgn(\tilde{r})$ is sought. And again, the property that the multiplication is precise on the one hand is exploited again, if a larger data type is available, and on the other hand that the multiplication can be used as an inverse function of the actual operation. The final result is the following expression: \begin{equation} \label{eq:upsqrt2} \begin{aligned} r = \sqrt{\frac{\tilde{r}}{1+\epsilon_2}+a} - \sqrt{a} \end{aligned} \end{equation} Since the sign of $r$ is only dependent on $\tilde{r}$, $sgn(r) = sgn(\tilde{r})$ holds. Here’s the corresponding C/C++ code:

// RUP case
float b = sqrt(a)
double r = (double)b * (double)b - (double)a;
if (r != 0.) {
    inexact = true;
    if (r < 0.) {
        b = nextup(b);
    }
}

And here’s the proof. According to the standard error model of FP, the 64-bit multiplication of the 32-bit square root $a$ results in: \begin{equation} \circ_{64}(\circ_{32}(\sqrt{a})^2) = \circ_{32}(\sqrt{a})^2 = (a \cdot (1 + \epsilon_1))^2 \end{equation} Note, that a square root cannot produce a subnormal result (thus no $\eta$) and that a 64-bit multiplication of 32-bit values is always exact. The latter is the same property of FP that I already used in the previous two sections. Next, we subtract $a$:

\begin{equation} \label{eq:upsqrt-proof1} \begin{gathered} \tilde{r} = \circ_{64}(\circ_{32}(\sqrt{a})^2 - a) = ((a \cdot (1 + \epsilon_1))^2 - a)\cdot (1 + \epsilon_2) \end{gathered} \end{equation}

And rearrange the formula:

\begin{equation} \label{eq:upsqrt-proof2} \begin{gathered} \epsilon_1 = \sqrt{\frac{\tilde{r}}{(1+\epsilon_2) \cdot a}+1} - 1 \end{gathered} \end{equation}

Inserting $\epsilon_1$ into $\sqrt{a} \cdot \epsilon_1 = r$ gives us:

\begin{equation} \label{eq:eq:upsqrt-proof3} \begin{aligned} r = \sqrt{\frac{\tilde{r}}{1+\epsilon_2}+a} - \sqrt{a} \end{aligned} \end{equation} And q.e.d.

6.5 Fast 32-bit Fused Multiply-Add

For fast FMA simulation, I deployed a similar method as Sarrazin et al. [35]. Yet, I repurposed it to account for inexact excpetions. The idea is to first calculate the exact multiplication of $a$ and $b$ using a larger data type. Subsequently, the residual of the summation of $a \cdot b$ and $c$ is calculated using the 2Sum algorithm. But even if this summation was exact ($r_1=0$), the final result might not be representable as 32-bit FP value. Hence, another residual $r_2$ is calculated to determine the 64-bit to 32-bit rounding error. Note that $r_2$ is exact due to Sterbenz’ Lemma [39]. \begin{equation} \label{eq:fast-fma-main} \begin{gathered} d_{exact} + r = d = \circ_{32}(a \cdot b + c) \\\
r_1 = 2Sum(\circ_{64}(a \cdot b), c) \\\
r_2 = \circ_{64}(d - \circ_{64}(\circ_{64}(a \cdot b) + c)) \end{gathered} \end{equation} Finally, an approximation of the rounding error $\tilde{r}$ can be calculated, as shown in Equation \ref{eq:fast-fma-residual}: \begin{equation} \label{eq:fast-fma-residual} \begin{aligned} \tilde{r} & = r_1 + r_2 \end{aligned} \end{equation} Although the addition of $r_1$ and $r_2$ is not exact per se, it satisfies $sgn(\tilde{r})=sgn(r)$. This is enabled by gradual underflows, due to which the following property holds for two arbitrary 32-bit FP numbers: $sgn(a+b) = sgn(\circ_{32}(a + b))$.

As before, here the C/C++ code for a RUP case:

// RUP case
float d = std::fma(a, b, c);
double p = (double)a * (double)b;
double dd = p + (double)c;
double r1 = two_sum<double>(p, (double)c, dd);
double r2 = (double)d - dd;
double r = r1 + r2;
if (r != 0.) {
  inexact = true;
  if (r < 0.) {
    d = nextup(d);
    overflow = is_inf(d) ? true : overflow;
  }
  underflow = (is_subnormal(d) || is_zero(d)) ? true : underflow;
}
return d;

6.6 Fast 64-bit Operations

The previous upcast algorithms UpMul, UpDiv, UpSqrt, and also the FMA algorithm according to Sarrazin et al. [35], are all based on larger data type that can perform multiplications exactly. As mentioned earlier, these algorithms reach their limitations for 64-bit values on x64 systems. To circumvent these limitations, the fused multiply-add (FMA) instruction of the x64 ISA can be used. This instruction is formalized in the FMA3/FMA4 instruction set extensions and is part of all modern x64 processors. For example, using FMA, the residual of the UpMul algorithm can be calculated as follows:

\begin{equation} \label{eq:example-div} \begin{aligned} r’ & = \circ_{64}(a \cdot b - \circ_{64}(a \cdot b)) = \circ_{64}(c_{exact} - c) \end{aligned} \end{equation}

However, the rounding step at the end of each FMA instruction poses a problem. Although an FMA instruction calculates all intermediate results with infinite precision, the result is eventually rounded. In the example shown, it is possible that $r’$ is not representable with a 64-bit precision. One could therefore wrongly assume a value of 0, although the value is actually different from 0. Hence, $r’=r$ does not hold in all cases.

Consequently, bounds must be determined for which $r’$ is no longer representable. Since $r’$ is the direct result of the subtraction of $c$ and $c’$, we have to determine the smallest distance between these numbers, excluding 0. This distance is $|d| \geq 2^{e_c - 2p_d}$. The number of double significand bits $2p_d$ follows from the exact intermediate results of the FMA instruction. As explained previously, $2p$ significand bits are needed for the exact representation of a $p$-bit multiplication. In order to represent $r’$ as a 64-bit FP value, $e_c - 2p_d \geq e_{d,min} - p_d + 1$ must hold. A simple rearrangement leads to the following inequality: \begin{equation} \label{eq:example-div-bound} \begin{aligned} e_c \geq e_{d,min} + p_d + 1 = -1022 + 53 + 1 = -968 \end{aligned} \end{equation} If $|c|$ is less than $2^{-968}$, my method cannot be used, and the instruction has to be calculated using soft float. However, the range below $2^{-968}$ represents less than 3% of all 64-bit FP values. In practice, it’s even less, as most FP values are centered around 1. To prove this statement, I ran different 78 FP benchmarks and tracked the in- and output exponents of all 64-bit arithmetic FP instructions:

As you can, on average less than 0.1% values have an exponent less than $2^{-968}$.

A C/C++ example for the 64-bit division is given in the following code:

if (abs(a) < 4.008336720017946e-292)
    return soft::div(a, b);

double r = std::fma(c, b, -a);
if (r != 0.0) {
    inexact = true;
    underflow = (is_subnormal(c) || is_zero(c)) ? true : underflow;
}

7. Results & Discussion

7.1 Clean Room Benchmarks

In this section, I show the results of some clean room benchmarks. The goal was to assess the maximum performance of each individual instruction for soft float, floppy float (my approach), and hard float (native FP instructions). That means inputs and outputs are never subnormal, there are no data dependencies between the instructions, standard rounding is used, and there’s no DBT overhead. While floppy float and hard float aren’t really sensitive to different kinds of input data (except subnormals), the soft float is due to its control-flow-heavy calculations. In general, the input data was designed to favor optimistic paths in soft float. So, let’s take a look at the results:

As you can see, simply executing FP instructions one after another (hard float) achieves around 8500 MIPS for instructions that can be executed in one cycle (max, min, add, sub, etc.). This is explained by the FP pipeline of the host processor, which was an AMD Ryzen Threadripper 3990X in my case. Most FP instructions can use 2 of 4 FP pipes provided by the Zen 2 microarchitecture, leading to $8500 MIPS \approx 2 \cdot 4.3GHz$. Some instructions, such as division, square root, or 64-bit multiplication, require multiple cycles, which results in lower performance. Nevertheless, hard float is faster than soft and floppy float in all cases. The performance of the floppy float approach is in the range of 300-600 MIPS, and is faster than soft float by up to $5 \times$ in some operations, such as square root. For lightweight operations, such as min or max, there is no significant difference between soft- and floppy float.

7.2 My Method vs. QEMU

Since my approach is intended to accelerate FP performance in DBT simulators, a practical performance assessment is indispensable. For this purpose, I integrated my approach, the method by Cota et al. [28](QEMU’s method), and Bellard’s SoftFP [21], into MachineWare’s DBT-based RISC-V simulator SIM-V [1]. I then conducted a performance analysis using well-known FP benchmarks such as linpack, NPB, SPEC CPU 2017, and other representative workloads. The results can be found in the following graph:

In the graph, the speedups of the individual benchmarks are shown, whereby the soft float method was used as a reference baseline. All benchmarks in Subplot a) were executed with the default RNE rounding, while Subplot b) represents the same benchmarks under RUP rounding. Please not that this graph does not compare SIM-V with QEMU! It’s only QEMU’s method implemented in SIM-V! Since SIM-V uses multiple other techniques to speed up simulations, a comparison wouldn’t be fair.

As can be seen in the graph, QEMU’s method and my approach achieve a speedup of $3\times$ in a best case scenario (see Subplot a), NPB/ft.A and 508.namd). Also, in most cases, the performance of my approach is equal to the performance of QEMU’s approach when RNE rounding is used. As explained previously, my approach is only faster when underflows occur and no inexact flags are set, or when a non-default rounding mode is not used. Since most applications already set an inexact flag after a few executed instructions, the speedup gained from an accelerated inexact calculation is marginal. Also, underflows are seldom, as I could confirm with a separate instruction and data study. For example, in the case of the NPB/ft.A benchmark, not a single underflow occurred in a total of 3,875,127,289 executed fmadd instructions.

To demonstrate the advantages of my methods, I ran all benchmarks again under RUP rounding which is depicted in Subplot b). Here we can see that QEMU is slower than soft float in all cases. This can be attributed to the fact that QEMU first checks the rounding mode before resorting to soft float. My method, however, can rectify the result for most instructions and set the exception flags without using soft float. Thus, speedups of 50% over QEMU are achieved for benchmarks like linpack32. Since the speedup of my method depends on the executed instructions, we observe a heterogeneous picture of results. Moreover, the speedups under RNE cannot be used to infer the speedups under RUP. As described in previously, we do not have a method for 64-bit FMA instructions, and all presented approaches require less checks when working on 32-bit data. Hence, single precision benchmarks, such as linpack32 or machine learning applications (lenet, alexnet), achieve higher speedups in non-default rounding modes. Applications that comprise many 64-bit FMA instructions achieve low to no speedup (see NPB/bt.A and NPB/cg.A).

8. Conclusion & Outlook

In this post, I showed how floating point arithmetic is calculated in emulators/simulators, such as QEMU, gem5, or Rosetta 2. To the best of my knowledge, this post provides the most complete picture of this topic to date. But if you find more literature worth citing, let me know!

Besides just providing a related work overview, I showed how the QEMU approach can be improved to also perform well for other rounding modes. I implemented my method in MachineWare’s SIM-V RISC-V simulator and beat QEMU’s by more than 50% in the best case. For the vanilla RNE rounding mode, I couldn’t achieve any speedups for standard benchmarks. This is due to exception bits being sticky and not requiring any recalculations. I later noticed that the PowerPC has non-sticky exception flags, which requires a recalculation for every instruction. Hence, I guess my method could significantly speed up PowerPc simulations even for standard benchmarks with RNE rounding.

One important missing piece of this work are efficient algorithms for 64-bit FMA instructions. Unfortunately, these instructions occur relatively frequently, costing us a significant chunk of performance for some benchmarks. I found an interesting work of Boldo et al. [36], which provides an algorithm to calculate the residual for FMA instructions. So exactly what I need! But I wasn’t able to get it running correctly for whatever reason… Since their paper is basically 8 pages of mathematical proofs, I leave this as a problem for other people and future Niko.

If you have remarks, questions, or just want to say “hello”, feel free to write me a mail!

9. References

[1]L. Jünger, J. H. Weinstock, and R. Leupers, “SIM-V: Fast, Parallel RISC-V Simulation for Rapid Software Verification,” DVCON Europe 2022, 2022.
[2]“IEEE Standard for Binary Floating-Point Arithmetic,” ANSI/IEEE Std 754-1985, pp. 1–20, 1985, doi: 10.1109/IEEESTD.1985.82928.
[3]“IEEE Standard for Floating-Point Arithmetic,” IEEE Std 754-2008, pp. 1–70, 2008, doi: 10.1109/IEEESTD.2008.4610935.
[4]“IEEE Standard for Floating-Point Arithmetic,” IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84, 2019, doi: 10.1109/IEEESTD.2019.8766229.
[5]R. I. S. C.-V. Foundation, The RISC-V Instruction Set Manual, vol. Volume I: User-Level ISA, Document Version 20191213. 2019 [Online]. Available at: https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf
[6]N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed. USA: Society for Industrial and Applied Mathematics, 2002.
[7]I. Dooley and L. Kale, “Quantifying the interference caused by subnormal floating-point values,” Jan. 2006.
[8]S. Thakkur and T. Huff, “Internet Streaming SIMD Extensions,” Computer, vol. 32, no. 12, pp. 26–34, 1999, doi: 10.1109/2.809248.
[9]A. Waterman, “Design of the RISC-V Instruction Set Architecture,” 2016 [Online]. Available at: https://people.eecs.berkeley.edu/ krste/papers/EECS-2016-1.pdf
[10]“Wikipedia - Comparision with NaN.” [Online]. Available at: https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN
[11]D. G. Hough, “The IEEE Standard 754: One for the History Books,” Computer, vol. 52, no. 12, pp. 109–112, 2019, doi: 10.1109/MC.2019.2926614.
[12]R. I. S. C.-V. Foundation, The RISC-V Instruction Set Manual, vol. Volume I: User-Level ISA, Document Version 2.2. 2017 [Online]. Available at: https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
[13]A. Bradbury, “NaN Boxing RFC.” Mar-2017 [Online]. Available at: https://gist.github.com/asb/a3a54c57281447fc7eac1eec3a0763fa
[14]A. Bradbury, “NaN Boxing ISA-Dev Group.” Mar-2017 [Online]. Available at: https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/_r7hBlzsEd8/m/z1rjr2BaAwAJ
[15]N. Binkert et al., “The Gem5 Simulator,” SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011, doi: 10.1145/2024716.2024718. [Online]. Available at: https://doi.org/10.1145/2024716.2024718
[16]R. I. S. C.-V. Foundation, “Spike RISC-V ISA Simulator.” [Online]. Available at: https://github.com/riscv-software-src/riscv-isa-sim
[17]V. Herdt, D. Große, P. Pieper, and R. Drechsler, “AGRA Uni Bremen RISC-VP.” [Online]. Available at: https://github.com/agra-uni-bremen/riscv-vp
[18]V. Herdt, D. Große, P. Pieper, and R. Drechsler, “RISC-V based virtual prototype: An extensible and configurable platform for the system-level,” Journal of Systems Architecture, vol. 109, p. 101756, 2020, doi: https://doi.org/10.1016/j.sysarc.2020.101756. [Online]. Available at: https://www.sciencedirect.com/science/article/pii/S1383762120300503
[19]F. Bellard, “QEMU, a Fast and Portable Dynamic Translator,” in Proceedings of the Annual Conference on USENIX Annual Technical Conference, USA, 2005, p. 41.
[20]J. R. Hauser, “Berkley SoftFloat.” 1996 [Online]. Available at: https://github.com/ucb-bar/berkeley-softfloat-3
[21]F. Bellard, “SoftFP.” 2018 [Online]. Available at: https://bellard.org/softfp/
[22]C. Bertin et al., “A floating-point library for integer processors,” Proceedings of SPIE - The International Society for Optical Engineering, vol. 5559, Oct. 2004, doi: 10.1117/12.557168.
[23]M. Perotti, G. Tagliavini, S. Mach, L. Bertaccini, and L. Benini, “RVfplib: A Fast and Compact Open-Source Floating-Point Emulation Library for Tiny RISC-V Processors,” in Embedded Computer Systems: Architectures, Modeling, and Simulation, Cham, 2022, pp. 16–32.
[24]J.-M. Muller et al., Handbook of Floating-Point Arithmetic. 2010.
[25]M. Clark and B. Hoult, “rv8 - RISC-V simulator for x86-64.” [Online]. Available at: https://github.com/michaeljclark/rv8
[26]M. Clark and B. Hoult, “rv8: a high performance RISC-V to x86 binary translator,” CARRV, Oct. 2017, doi: 10.13140/RG.2.2.30957.69601.
[27]Y.-C. Guo, W. Yang, J.-Y. Chen, and J.-K. Lee, “Translating the ARM Neon and VFP Instructions in a Binary Translator,” Softw. Pract. Exper., vol. 46, no. 12, Dec. 2016.
[28]E. G. Cota and L. P. Carloni, “Cross-ISA Machine Instrumentation Using Fast and Scalable Dynamic Binary Translation,” in Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, New York, NY, USA, 2019, pp. 74–87, doi: 10.1145/3313808.3313811 [Online]. Available at: https://doi.org/10.1145/3313808.3313811
[29]T. J. Dekker, “A floating-point technique for extending the available precision,” Numerische Mathematik, vol. 18, pp. 224–242, 1971.
[30]Apple Inc., “Apple announces Mac transition to Apple silicon.” Jun-2020 [Online]. Available at: https://www.apple.com/newsroom/2020/06/apple-announces-mac-transition-to-apple-silicon/
[31]D. Johnson, “Why is Rosetta 2 fast?” [Online]. Available at: https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/
[32]Apple Inc., “Apple unleashes M1.” Nov-2020 [Online]. Available at: https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/
[33]“ARM Architecture Reference Manual.” ARM [Online]. Available at: https://developer.arm.com/documentation/ddi0487/latest
[34]Y.-P. You, T.-C. Lin, and W. Yang, “Translating AArch64 Floating-Point Instruction Set to the X86-64 Platform,” in Proceedings of the 48th International Conference on Parallel Processing: Workshops, 2019.
[35]G. Sarrazin, N. Brunie, and F. Pétrot, “Virtual Prototyping of Floating Point Units,” 2016.
[36]S. Boldo and J.-M. Muller, “Exact and Approximated Error of the FMA,” IEEE Transactions on Computers, vol. 60, no. 2, pp. 157–164, 2011, doi: 10.1109/TC.2010.139.
[37]Gala, N. and Karasek, M., “RISC-V Architecture Test.” [Online]. Available at: ttps://github.com/riscv-non-isa/riscv-arch-test
[38]O. Møller, “Quasi Double-Precision in Floating Point Addition,” BIT, vol. 5, no. 1, pp. 37–50, Mar. 1965.
[39]S. P.H., “Floating Point Computation.” Prentice Hall, 1974.