Contents
This post is an extended and completely reworked version of our paper “Efficient RISC-V-on-x64 Floating Point Simulation”. A preprint of the original paper can be downloaded here. To set expectations right from the start, I would like to answer three essential questions first.
What is this post about and is it worth reading?
This post is about floating point (FP) arithmetic in simulators/emulators.
So, if you ever wondered how simulators/emulators like QEMU or gem5 handle floating point arithmetic,
the following might be of interest for you.
Although the title says RISC-V,
the methods presented here are applicable to most other Instruction Set Architectures (ISAs) as well.
In fact, I also present a little section about Apple’s Rosetta 2 (x64-on-ARM).
Should I read the paper or this blog post?
Read this post for the reasons described in the next answer.
Why did I spend my free time rewriting something I already spent weeks on?
Blog posts are better than papers because:
How to cite?
Please prefer to cite the original paper:
@INPROCEEDINGS{zurstrassen2023,
author={Zurstraßen, Niko and Bosbach, Nils and Joseph, Jan Moritz and Jünger, Lukas and Weinstock, Jan Henrik and Leupers, Rainer},
booktitle={2023 IEEE 41st International Conference on Computer Design (ICCD)},
title={Efficient RISC-V-on-x64 Floating Point Simulation},
year={2023},
volume={},
number={},
pages={1-6},
doi={10.1109/ICCD58817.2023.00090}
}
In 2022, a colleague of mine and his friend took the plunge and founded a startup. Their flagship product is a RISC-V simulator called SIM-V [1], which can be used to simulate RISC-V systems on x64 (or other) machines. One of the key selling points is the almost native performance. The simulated system is so fast that you can interact with it like a real system.
So, how does one make a simulator go 🚀🚀🚀?
I am certainly not giving away any secrets when I reveal that the underlying technology is
Dynamic Binary Translation (DBT).
So basically the same method that is used by QEMU.
With DBT, binary instructions of the target system (RISC-V in our case) are translated into instructions of the host system (i.e. x64) at runtime and executed.
If possible, instructions are translated 1-to-1 (or at least 1-to-only-a-few), which also explains the native speed.
For example, one could simply translate a RISC-V 32-bit floating point (FP) addition `fadd.s` to an x64 FP addition `addss`.
Semantically, these two instructions seem to be identical, at least at first sight.
My colleagues thought so too and implemented it this way in their first version of SIM-V. In practice, this method actually works quite well. You can boot Linux systems with it, and execute many applications without encountering problems.
One of the few applications that doesn’t work with this method is the RISC-V Architectural Test Framework (RISCOF). Unfortunately, that’s a real showstopper, since passing these tests is required to license the RISC-V trademark. Or to quote RISCOF’s documentation:
Passing the tests and having the results approved by RISC-V International is a prerequisite to licensing the RISC-V trademarks in connection with the design.
So, passing these tests was top priority and my colleagues asked me to do an investigation. After taking a closer look at the failing tests, I could pinpoint the following 6 reasons why they failed:
In the following, I will explain each of these points in greater detail. Subsequently, I show how other simulators and how I solve these issues.
But first, I’ll explain some basics about FP arithmetic, IEEE 754, and how it is implemented in RISC-V and x64. Feel free to skip the next section if you are already familiar with these topics.
Floating point (FP) numbers are the most common way to approximate real numbers in computing.
You find them in most programming languages under names such as `float`, `double`, `f32`, or `f64`.
Due to the many ways FP arithmetic can be implemented, adhering to standards avoids a lot of problems.
This is why most software and hardware follows the IEEE 754 standard.
But standards, too, can be erroneous or incomplete, which is why there are now 3 versions:
They differ mostly in some details, which will be discussed later.
The most important number formats defined by IEEE 754 are `binary32` and `binary64`.
If you program C/C++, you already know them as `float` and `double`.
In Rust they are called `f32` and `f64`.
A FP number comprises a sign, a significand, an exponent, and a bias
with the following bit representation:
Note that the bias is implicit and fixed.
It is used to reach negative numbers in the exponent without using two’s complement.
Ultimately, the numerical value of an FP number is given by:
\begin{equation} f = (-1)^{sign} \cdot (1.s_{p-1}s_{p-2}…s_1)_2 \cdot 2^{exponent-bias} \end{equation}
In the formula $s_i$ refers to the bit at position $i$ in the significand. However, there are quite a few corner cases to represent some special values.
The first case is subnormal numbers. Whenever $exponent$ is 0, the implicit leading 1 turns into a 0. So we get:
\begin{equation} f = (-1)^{sign} \cdot (0.s_{p-1}s_{p-2}…s_1)_2 \cdot 2^{-bias} \end{equation}
Having these special cases gives us some cool mathematical properties, like additions and subtractions that never underflow. However, in many other regards like hardware complexity, some mathematical proofs, or timing side channels, it can be a pain.
Another special value is infinity. If all bits in the exponent are set and the significand is 0, the value is interpreted as $\pm \infty$.
The last special value is NaN (Not a Number), which comes in two different flavors: quiet (qNaN) and signaling (sNaN). qNaNs are used to represent non-meaningful results (e.g. $\infty-\infty$), while sNaNs are intended to be used for uninitialized variables/memory. The bit pattern of a NaN is an exponent with all bits set and a significand that is not 0. How the encodings of qNaN and sNaN differ is explained in Section “4.1 Different NaN Encoding”.
While Equation 1 is often used to introduce and understand the concept of IEEE FP numbers, the $p-1$ significand bits with an implicit leading 1 complicate mathematical proofs. A representation more suited for mathematical adventures is:
\begin{equation} \label{eq:float1} f = M \cdot 2^{e - p + 1}, \quad e=exponent-bias \end{equation}
With this representation, the significand is shifted so far that it becomes an integer value. Due to the finite number of bits in binary32 and binary64, the precision $p$, the significand $M$, and the exponent $e$ are constrained by the values given in the following table:
data type | exponent range | precision bits | significand range |
---|---|---|---|
binary32 | $ e_{f,min}=-126 \leq e_f \leq 127 = e_{f,max}$ | $p_f=24$ | $\left\lvert M_f \right\rvert \leq 2^{24}-1$ |
binary64 | $ e_{d,min}=-1022 \leq e_d \leq 1023 = e_{d,max}$ | $p_d=53$ | $\left\lvert M_d \right\rvert \leq 2^{53}-1$ |
Note that the $p$ precision bits include the implicit leading 1. For example, a binary32 value has a precision of 24 bits of which 23 bits are explicitly stored. Hence, the representation is only suitable for normal numbers! Or in other words: don’t use this model to represent subnormal numbers!
Another really painful aspect of FP numbers is rounding errors. Whenever mathematical operations, such as additions or multiplications, are performed on FP numbers, rounding errors may occur. In the literature and in this post, rounding is symbolized by the $\circ$ operator. While rounding errors are hard to avoid, most FP hardware allows you to control the sign of the error by means of rounding modes. With these modes, you can control whether the final result is rounded down, up, to the nearest number, or however you define it. The most recent IEEE 754 standard defines 5 rounding modes:
To indicate which rounding mode is used in mathematical representations, a little acronym is added to the circle operator. For example, $\circ_{RNE32}(a+b)$ corresponds to a 32-bit addition under Round Nearest, Ties to Even (RNE) rounding mode. I’m using the acronyms from the RISC-V spec [5]. In the following, if no rounding mode is given, RNE shall be assumed.
To assess the numerical impact of these errors, one can use the standard error model of FP arithmetic [6]. According to the model, the error of many arithmetic operations (+, −, /, ·, √), including underflows, can be represented as:
\begin{equation}
\label{eq:standard-error-model}
\begin{gathered}
z = (a \, \text{op}\, b) \cdot (1 + \epsilon ) + \eta = \circ(a \, \text{op}\, b) \\\
\epsilon \eta = 0, \quad |\epsilon| \leq \textbf{u}, \quad |\eta| \leq 2^{e_{min}} \cdot \textbf{u}, \quad \textbf{u} = 2^{-p}
\end{gathered}
\end{equation}
Here, $\eta$ and $\epsilon$ are used to distinguish between subnormal and normal numbers:
The relative error $\epsilon$ is bounded by the so-called unit roundoff error $\textbf{u}$. Note that this formula only works for the round-to-nearest rounding. To account for other rounding modes as well, you can use a roundoff error of $2\textbf{u}$. This is also referred to as the machine epsilon.
In this subsection, I’ll explain how FP arithmetic works on RISC-V systems. All information presented here is based on the RISC-V ISA manual [5].
In general, RISC-V is organized in so-called extensions. Each extension defines a certain set of instructions and other characteristics, which can be assembled to larger systems in a modular way. This includes FP arithmetic, which is used in the extensions F, D, Q, Zfa, Zfh, Zfhmin, Zfinx, Zhinx, and Zhinxmin. Moreover, there is a vector extension V, which also uses FP arithmetic. Vanilla 32-bit and 64-bit FP arithmetic is provided by the extensions F and D respectively.
All FP extensions mostly adhere to the latest IEEE 754 2019 standard [4].
Accordingly, there are 5 FP exceptions and 5 rounding modes.
Reading FP exceptions and setting rounding modes is achieved by reading/writing the `fcsr` register (see Figure below).
As opposed to many other ISAs, RISC-V doesn’t trigger hardware traps when encountering FP exceptions.
Hence, you cannot catch, for example, a resulting underflow without constantly checking the `fcsr` register.
Another interesting characteristic of RISC-V is the instruction-embedded rounding mode.
That means it is possible to specify an operation’s rounding mode directly in the instruction’s encoding.
However, if the instruction’s rounding mode encodes to “dynamic”, the global rounding mode from `fcsr` is used instead.
A special peculiarity that is not part of the IEEE standard is RISC-V’s hardware-assisted NaN boxing.
With NaN boxing, the upper bits of an M-bit FP register are saturated if an N-bit value is written to it with $M>N$.
Also, values smaller than FLEN (FP register width) are only considered valid if the upper bits in the register are set.
For example, if a 32-bit FP value resides in a 64-bit register, it is only considered valid if the top 32 bits are set to 1.
This means, instructions working solely on 32-bit FP values must check the upper bits when reading the operands and set them when writing back the result.
Since the whole 64-bit value encodes to a negative qNaN, there is no risk of creating valid values by accident.
One issue where the IEEE standard leaves/left too much freedom in my opinion is canonical qNaNs.
A canonical qNaN is the specific bit pattern returned by the hardware if it executed an invalid operation (e.g. 0/0).
For example, a 32-bit zero-by-zero division will result in `0x7fc00000` for 32-bit FP registers.
The same 32-bit division for 64-bit FP registers results in a NaN-boxed value of `0xffffffff7fc00000`.
But more on that later in Subsection Different Canonical qNaN Encodings.
Similar to RISC-V, FP arithmetic on x64 is also defined by extensions. Yet, the story for this ISA is a little bit more convoluted.
The first FP ISA for x64 was introduced in 1980 by the x87 extension.
This extension was succeeded by SSE in 1999, which not only provided scalar FP arithmetic but also vector instructions.
Even though SSE mostly superseded x87, today’s x64 CPUs still support the x87 extension for legacy reasons I guess.
Modern compilers like gcc primarily generate SSE instructions when it comes to scalar FP arithmetic.
There are only a few corner cases, like `long double`, for which gcc will still generate x87 code.
In 2011, Intel and AMD released the first processors including the AVX extension, which added new SIMD and scalar instructions. This was followed by AVX-512 in 2016, which adds scalar FP instructions using an instruction-encoded rounding mode. Yet AVX-512 isn’t even supported by many modern CPUs and in general doesn’t seem to be a very beloved child. Or to quote Linus Torvalds: “I hope Intel’s AVX-512 ‘dies a painful death’.” (https://www.phoronix.com/news/Linus-Torvalds-On-AVX-512).
So, after having introduced 4 different FP extensions, which one is relevant for the following? It’s not x87 due to its obsolescence, and it’s not AVX-512 due to its unpopularity. Consequently, we are left with SSE and AVX. Since SSE is the default extension for scalar FP arithmetic, the following focuses on it.
Since SSE was introduced in 1999, it adheres to the most recent IEEE standard at that time, which was IEEE 754-1985 [2].
In addition to the aforementioned functional differences of certain instructions, there are other important subtleties which need to be considered.
For example, x64 misses the RMM rounding mode, which was introduced in later standards (see Figure above).
The five FP exceptions (invalid, underflow, overflow, inexact, divide-by-zero) were already defined in the first standard matching RISC-V
in this regard.
Yet mapping the FP exceptions from host to target turned out to be one of the most difficult challenges, as shown in the subsequent section.
In addition to the five standard FP exceptions, x64 also defines a denormal flag for the detection of subnormal results.
One further peculiarity is x64’s support for treating subnormal numbers as 0 using the FTZ and DAZ flags.
Depending on the microarchitecture, the processing of subnormal numbers can reduce FPU performance by an order of magnitude or more [7].
If subnormal numbers are treated as 0, there is no risk of hampering performance.
This flush-to-zero mode was designed for 3D applications where performance is a greater concern than accuracy
[8].
In contrast to RISC-V, the x64 ISA also allows specifying which FP exceptions cause a trap.
The corresponding masking bits are selected in the FMASK field, as depicted in the Figure above.
Another difference between RISC-V and x64 is the canonical NaN encoding.
On x64 systems, the canonical NaN uses a negative sign, as opposed to the RISC-V positive sign.
That means, a 32-bit qNaN as a result of an invalid operation would be encoded as 0xffc00000.
As already mentioned in Section 2, we are facing 6 different problems when executing RISC-V instructions on x64 hosts. In the following, I provide a more detailed explanation.
For some operands, certain FP instructions cannot provide a meaningful result.
For example, when multiplying ∞ and 0, or when adding +∞ and -∞.
To indicate the occurrence of an invalid operation, a specific bit pattern has to be returned.
This pattern is referred to as a qNaN (quiet Not A Number).
There is also an sNaN (signaling Not A Number), but this is rather irrelevant in our case.
So, what does the bit pattern of a qNaN look like?
The IEEE 754 standard from 1985 defines a NaN very vaguely as a number with all exponent bits set to one,
and a non-zero significand.
The exact difference between a qNaN and an sNaN was specified in the 2008 version, with a qNaN having a leading “1” in the significand
and sNaN having a leading “0”.
So, according to the latest IEEE 754 standard, a 32-bit qNaN looks like this:
x111 1111 11xx xxxx xxxx xxxx xxxx xxxx
x = arbitrary bit
As you can see, there’s not only one qNaN, but a whole range of patterns, leaving an ISA designer with the problem of which exact pattern to return when encountering an invalid operation. Since IEEE 754 unfortunately does not give a recommendation here, we see various patterns in practice. The following extended table from [9] shows the qNaN patterns of some popular ISAs.
ISA | Sign | Significand | IEEE 754 2008 compliant |
---|---|---|---|
SPARC | 0 | 11111111111111111111111 | ✓ |
RISC-V $< v2.1$ | 0 | 11111111111111111111111 | ✓ |
MIPS | 0 | 01111111111111111111111 | ✗ |
PA-RISC | 0 | 01000000000000000000000 | ✗ |
x64 | 1 | 10000000000000000000000 | ✓ |
Alpha | 1 | 10000000000000000000000 | ✓ |
ARM64 | 0 | 10000000000000000000000 | ✓ |
PowerPC | 0 | 10000000000000000000000 | ✓ |
Loongson | 0 | 10000000000000000000000 | ✓ |
RISC-V $\geq v2.1$ | 0 | 10000000000000000000000 | ✓ |
As you can see, the qNaNs of RISC-V and x64 differ in their signs. Thus, if we were to translate RISC-V FP instructions one-to-one to x64, we’d have to check for qNaNs after each instruction. If a qNaN is encountered as a result, the sign must be inverted. In case you’d like to see the different qNaNs, execute the following code on different ISAs:
// x64: 0xffc00000
// RISC-V: 0x7fc00000
// MIPS: 0x7fbfffff
#include <iostream>
#include <ios>
int main() {
float a = 0.f;
float b = 0.f;
a /= b; // Generates a canonical qNaN.
unsigned int* c = reinterpret_cast<unsigned int *>(&a);
std::cout << std::hex << "0x" << *c << std::endl;
return 0;
}
Now to one of my favorite problems, which shows in an absurd way that even IEEE standards created by experts are not impeccable. Let’s start with a simple question: What is the maximum of an sNaN and an arbitrary number? Or expressed directly as instructions:
x64: maxss 5.f, sNaN = ?
RISC-V: fmax 5.f, sNaN = ?
The answers to this question are as numerous as they are confusing:
x64:
maxss 5.f, sNaN = sNaN
maxss sNaN, 5.f = 5.f
RISC-V <2.2:
fmax 5.f, sNaN = qNaN
fmax sNaN, 5.f = qNaN
RISC-V 2.2:
fmax 5.f, sNaN = 5.f
fmax sNaN, 5.f = 5.f
I guess the results show quite well that some instructions cannot be mapped 1-to-1.
So, why is that? The answer is interesting, but not relevant for the understanding of the rest of the post. Thus, feel free to skip the rest of this subsection.
Let’s start with the odd behavior of the x64 `maxss` instruction.
When the modern x64 floating point arithmetic was introduced as part of the SSE extension in 1999, the current
IEEE 754 standard was still from 1985.
If you look into this standard and look for guidance on maximum/minimum instructions, you find exactly… nothing!
So, here is my guess as to how Intel’s engineers made it more or less compliant.
Instead of regarding the maximum/minimum instruction as atomic, you define it using order relations.
For example, using C++ syntax, you could define it as:
a > b ? a : b;
Fortunately, we find some information about comparisons in the standard.
IEEE 754 1985 defines any comparisons with NaNs as unordered, requiring false to be returned [10].
This means, `5.f > sNaN` is false, as well as `sNaN > 5.f`.
Also, things like `sNaN == sNaN` evaluate to false.
So if every comparison with NaN is false, our maximum/minimum instruction defined by order relations will always return the second operand (b) if one or more operands are NaN.
And that’s exactly what you see with x64’s `maxss` instruction.
A few years later, the IEEE 754 2008 standard was published, which finally included a definition of the maximum/minimum operation (see subsection 5.3.1 General operations, `maxNum` and `minNum`).
According to this standard, maximum/minimum should return a qNaN when one of the operands is an sNaN.
If only one of the operands is a qNaN, the number shall be returned.
This definition was adopted by the RISC-V ISA for the `fmax`/`fmin` instructions and kept until version 2.2.
In comparison to `maxss`, these instructions are commutative, which is what a maximum/minimum operation should be in my opinion.
So apparently, the experts thought about commutativity, but a closer look reveals they forgot about associativity.
In his article The IEEE Standard 754: One for the History Books [11], the author David G. Hough confirms
that the aspect of associativity in the presence of NaNs was simply overlooked.
To show you what is meant by this, consider the following operations:
max(6.f, max(5.f, sNaN)) = max(6.f, qNaN) = 6.f
max(max(6.f, 5.f), sNaN) = max(6.f, sNaN) = qNaN
If you just follow the standard, you get different results depending on the way the operations are associated. That sounds like a possible source of trouble, so the experts rectified the definition in the IEEE 754 2019 standard.
To be more precise, they replaced `maxNum` and `minNum` with the associative operations `maximumNumber` and `minimumNumber`.
They also introduced `maximum` and `minimum`, but these are not relevant in the context of RISC-V.
These new operations simply do not turn sNaNs into qNaNs which makes them associative and commutative.
Since RISC-V tries to adhere to the IEEE 754 standard and is also not afraid to change things, `fmax` and `fmin` were adjusted in version 2.2.
So here we are. We just needed 34 years to figure out what the maximum/minimum of two values is.
Besides maximum and minimum, also other instructions like fused multiply-add and float to integer conversions show slightly different behavior. Execute the following program on x64 and RISC-V to see it with your own eyes:
#include <cfenv>
#include <cmath>
#include <iostream>
#include <limits>
template <typename T>
using nl = std::numeric_limits<T>;
int main() {
// Maximum/Minimum
float res1 = nl<float>::signaling_NaN();
float res2 = 5.f;
#ifdef __x86_64
asm volatile("maxss %1, %0" : "+x"(res1) : "x"(5.0f));
asm volatile("maxss %1, %0" : "+x"(res2) : "x"(nl<float>::signaling_NaN()));
#elif __riscv
asm volatile("fmax.s %0, %1, %2" :"=f"(res1) : "f"(5.0f) , "f"(res1));
asm volatile("fmax.s %0, %1, %2" :"=f"(res2) : "f"(nl<float>::signaling_NaN()) , "f"(res2));
#else
static_assert(false, "No architecture detected.");
#endif
std::cout << "max(sNaN, 5.f) = " << res1 << std::endl
<< "max(5.f, sNaN) = " << res2 << std::endl;
// Fused Multiply-Add
std::feclearexcept(FE_ALL_EXCEPT);
float res = std::fma(0, nl<float>::infinity(), nl<float>::quiet_NaN());
std::cout << "Invalid: " << std::fetestexcept(FE_INVALID) << std::endl;
// Float to Integer
volatile float a = 2e10;
std::cout << "(int)2e10 = " << (int) a << std::endl;
return 0;
}
On x64 the output is:
max(sNaN, 5.f) = 5
max(5.f, sNaN) = nan
Invalid: 0
(int)2e10 = -2147483648
On RISC-V you get:
max(sNaN, 5.f) = 5
max(5.f, sNaN) = 5
Invalid: 16
(int)2e10 = 2147483647
As already explained in the background section, x64 misses the “roundTiesToAway” rounding mode, which was introduced in the IEEE 754 2008 standard. So, whenever we want to simulate RISC-V FP instructions under “roundTiesToAway”, the host’s FPU cannot be used. Yet, this is a corner case, as most applications just use the default RNE rounding mode.
Now to a unique feature/clarification that was introduced in 2017 with version 2.2 of the RISC-V ISA [12]. Until version 2.2, there was no definition of how 32-bit FP values are encoded in 64-bit registers. This can lead to several problems as described in [13] and [14]. After a lively discussion, the chosen solution was a NaN boxing scheme, which was used in no other ISA at that point as far as I know (remark: in 2019 OpenRISC 1000 also adopted NaN Boxing with version 1.3). That means, if a 32-bit FP value is stored in a 64-bit FP register, the upper 32 bits are set to 1’s. Hence, the 32-bit FP value is basically a payload of a 64-bit negative qNaN.
This gives you some advantages in terms of debugging capabilities, but requires additional treatment for emulation. If you want to see NaN boxing in action, execute the following code on RISC-V and x64:
#include <iostream>
int main() {
const float a = -0.f;
const double b = -0.;
double out;
#ifdef __x86_64
// Storing the float does not touch the upper bits.
// Hence, the output is 0x8000000080000000 (-1.0609978955e-314).
asm volatile("movsd %2, %%xmm0 \n\t\
movss %1, %%xmm0 \n\t\
movsd %%xmm0, %0"
: "=x" (out) : "x" (a), "x" (b) : "xmm0");
#elif __riscv
// Output should be -qNaN due to RISC-V NaN boxing.
asm volatile("fmv.d f0, %2 \n\t\
fmv.s f0, %1 \n\t\
fmv.d %0, f0"
: "=f" (out) : "f" (a), "f" (b) : "f0");
#else
static_assert(false, "No architecture detected.");
#endif
std::cout << "out = " << out << std::endl;
return 0;
}
A feature recommended but not mandated by IEEE 754 is NaN propagation. The idea is to propagate input NaN payloads through instructions as some kind of diagnostic information. It is part of x64 and ARM, but RISC-V doesn’t mandate it due to the additional hardware costs. To see what it looks like, execute the following code on x64 and RISC-V:
// x64: 0xffc00123
// RISC-V: 0x7fc00000
#include <iostream>
int main() {
float a = 0.f;
float b;
unsigned int *ai = reinterpret_cast<unsigned int *>(&a);
unsigned int *bi = reinterpret_cast<unsigned int *>(&b);
*bi = 0xffc00123;
a += b;
std::cout << std::hex << "0x" << *ai << std::endl;
return 0;
}
Whenever FP instructions are executed, certain exceptions may occur. The IEEE 754 standard defines 5 exception flags which indicate irregularities during an instruction’s execution:
This was already defined in the first standard and hasn’t changed. So, what is the problem if RISC-V and x64 are equal in this regard? Finding a working solution isn’t the problem, but having a fast one is.
But let me begin with the naive approach, which I call FPU guards.
It involves the following steps to load and save the FP exception flags from the `mxcsr` register:
Or in C++ terms, it could look like this:
#include <cfenv>

struct fpu_guard {
    std::fenv_t host_env; // FP environment of the host/simulator
    std::fenv_t sim_env;  // FP environment of the simulated target

    void lock() {
        // Save the host environment and install the simulated one.
        std::fegetenv(&host_env);
        std::fesetenv(&sim_env);
    }

    void unlock() {
        // Save the simulated environment (including the accrued
        // exception flags), then restore the host environment.
        std::fegetenv(&sim_env);
        std::fesetenv(&host_env);
    }
};

int main() {
    fpu_guard fg;
    std::fegetenv(&fg.sim_env); // initialize the simulated environment
    volatile float a = 1.f, b = 2.f, c;
    fg.lock();
    c = a + b;
    fg.unlock();
    return 0;
}
It’s simple, maintainable, and ISA-agnostic. So why not use it? Because it is ridiculously slow. The lock guard, including the FP operation, comprises just a few instructions, so you’d expect performance in the range of 100-1000 MIPS. But what you get is merely 2-4 MIPS, even on the most modern machines.
As a computer engineer, it’s my passion to explore such mysteries, which is what I will do in the rest of this subsection.
The slow part of my code is obviously the lock guard, which is implemented by `fegetenv` and `fesetenv` from the standard library.
Consequently, analyzing the corresponding code in glibc seems to be the next logical step.
With a few minutes of research, I found the following code (which I also deconvoluted and commented a little bit) for the `fegetenv` function.
int __fegetenv (fenv_t *envp) {
// x87 state
__asm__ ("fnstenv %0\n" : "=m" (*envp));
__asm__ ("fldenv %0\n" : "=m" (*envp));
// SSE state
__asm__ ("stmxcsr %0\n" : "=m" (envp->__mxcsr));
return 0;
}
As you can see, it only comprises 3 instructions.
Two of them are responsible for the x87 part (yay, legacy), while only one is needed to fetch the `mxcsr` register.
In a profiling run, I could see the x87 part taking about 90% of the total execution time of the function.
That’s a big share, considering that x87 is an obsolete extension for which compilers, with a few exceptions, no longer generate code.
So, I decided to remove the x87 instructions and reevaluate the performance.
Now it was faster, but still far away from my expectations.
Since there’s only one remaining instruction, the case is more or less clear.
In the infinite realms of the internet, I found this cool website/document,
which analyzed the throughput and latency of all x64 instructions.
The following table summarizes it for the LDMXCSR and STMXCSR instructions (load and store of the MXCSR register).
µArch | LDMXCSR latency | STMXCSR latency | LDMXCSR rcp. throughput | STMXCSR rcp. throughput |
---|---|---|---|---|
AMD Zen 2 | - | - | 17 | 16 |
AMD Zen 3 | 13 | 13 | 20 | 15 |
AMD Zen 4 | 13 | 13 | 21 | 15 |
Intel Coffee Lake | 5 | 4 | 3 | 1 |
Intel Cannon Lake | 5 | 4 | 3 | 1 |
Intel Ice Lake | 6 | 4 | 3 | 1 |
As you can see in the table, executing these instructions is relatively costly (13 cycles latency on the AMD Zen 3/4 microarchitectures). Surprisingly, AMD also performs much worse than Intel. Since I used an AMD machine for my benchmarks, better results could have been obtained with an Intel CPU. Anyway, I ultimately wanted an approach that works well on all microarchitectures, so I decided to go for something different, as shown later.
A possible approach to hide the cost of LDMXCSR and STMXCSR is to only invoke them when the simulator switches between the generated code and the host environment. As already hinted in the FPU guard description, multiple instructions can sit between LDMXCSR and STMXCSR. I guess this allows attaining reasonable performance, but it drastically reduces code modularity. It also increases the cost of switching between simulator and simulated code. So, in the end, I took a different way.
But before I present that, the next section shows how other simulators deal with all these problems.
Whenever I code something, I try to get some inspiration from other projects first.
Or as one of my colleagues said:
“Before you code something simulation-related, ask yourself: What would QEMU do?”
Wise words to live by, so the next sections dissect the FP implementations of a few simulators, such as QEMU, rv8, and gem5.
I also present all academic works that have been published in this field to this date (2023-11-11).
Don’t worry, it’s only 3 papers.
The open-source projects gem5 [15], Spike [16], Uni Bremen RISC-V VP [17], [18] and QEMU [19] pre-v4.0.0 all use a method called soft float to simulate FP arithmetic. Note that QEMU changed to a different approach in version 4.0.0, but more on that later. The idea of soft float is to use integer arithmetic and boolean operations to mimic arbitrary FP behavior. It often comes as a C/C++ library, making it easy to integrate. For example, all simulators listed above use the open-source library Berkeley SoftFloat by J. Hauser [20], which is based on the IEEE 754 1985 standard. Soft float libraries that implement the more recent IEEE 754 2008 standard include SoftFP by F. Bellard [21], and FLIP by C.-P. Jeannerod et al. [22]. Besides generic solutions in programming languages like C, there are also architecture-optimized soft float libraries. For example, RVfplib [23] is an optimized soft float library for RISC-V systems that do not include the F or D extension.
The availability of multiple open-source libraries and the ease of use make it the most popular FP arithmetic simulation approach. If you are starting to develop your own simulator, I recommend using it for the first proof of concept. That’s also what we did at MachineWare. Yet, the performance might be somewhat disappointing. Using tens or hundreds of integer instructions to simulate one FP instruction can easily reduce your performance by that same factor. Some exact slowdown factors are provided in the results section.
If you want to enjoy the full pain of coding your own soft float library, the Handbook of Floating Point Arithmetic [24] provides you with all the necessary background information.
The open source project rv8 [25], [26] is a DBT-based, RISC-V simulator for x64 hosts.
With rv8, the RISC-V target rounding mode and exception flags are mapped 1-to-1 to the x64 host.
So, it’s basically the FPU guard approach that I explained in Subsection 4.6 Floating Point Exception Flags.
Hence, checking and setting the target exception flags is simply achieved by accessing the x64 host’s `mxcsr` register.
But besides the poor performance of FPU guards on certain AMD microarchitectures, mapping the rounding modes is also a problem.
Because x64 simply misses the RMM rounding mode (see 4.3 The Missing Rounding Mode)!
So, let’s take a look at rv8’s code to see how it solves this problem (rv8/src/asm/fpu.h:9):
inline void fenv_setrm(int rm) {
    int x86_mxcsr_val = __builtin_ia32_stmxcsr();
    x86_mxcsr_val &= ~x86_mxcsr_RC_RZ;
    switch (rm) {
        case rv_rm_rne: x86_mxcsr_val |= x86_mxcsr_RC_RN; break;
        case rv_rm_rtz: x86_mxcsr_val |= x86_mxcsr_RC_RZ; break;
        case rv_rm_rdn: x86_mxcsr_val |= x86_mxcsr_RC_DN; break;
        case rv_rm_rup: x86_mxcsr_val |= x86_mxcsr_RC_UP; break;
        case rv_rm_rmm: x86_mxcsr_val |= x86_mxcsr_RC_RN; break;
    }
    __builtin_ia32_ldmxcsr(x86_mxcsr_val);
}
In the function fenv_setrm(int rm)
the RISC-V rounding mode is loaded into the host FPU.
As you can see, RISC-V’s RMM rounding mode, which is missing on x64, is simply mapped to RNE!
This is not correct and leads to rv8 not being compliant with the official RISC-V standard.
The other problems, such as semantically different instructions or NaN boxing, are solved by rectifications in software.
Furthermore, FP instructions are not directly translated, but use an interpreter.
This interpreter falls back to standard C++ operators to implement RISC-V instructions.
For example, the following code shows the implementation of the fadd
and fmax
instructions.
P::ux exec_inst_rv32(T &dec, P &proc, P::ux pc_offset) {
    // ...
    switch (dec.op) {
        case rv_op_fadd_s:
            if (rvf) {
                fenv_setrm((fcsr >> 5) & 0b111);
                freg[dec.rd] = freg[dec.rs1] + freg[dec.rs2];
            }
            break;
        case rv_op_fmax_s:
            if (rvf) {
                freg[dec.rd] = (freg[dec.rs1] > freg[dec.rs2]) || isnan(freg[dec.rs2])
                    ? freg[dec.rs1] : freg[dec.rs2];
            }
            break;
        // ...
    }
}
As of version 4.0.0, QEMU’s slow soft float approach was replaced by the faster method of Guo et al. [27].
Initially Guo et al. tried to calculate the result of an FP instruction on the host FPU and determine the exception flags in software.
However, their way of calculating the inexact exception was so costly that ultimately no speedup compared to soft float was achieved.
Note that they could find a fast solution for additions, but more on that in Section 5.5 You et al..
After their failed initial attempt, Guo et al. noticed an obvious but important detail:
the inexact exception is “sticky” and does not need to be recalculated if it was already set.
Or in other words: If an instruction sets the inexact flag, which is very likely, it does not need to be recalculated for all following instructions.
Well, if you clear the flag you have to recalculate it, but there’s almost no software that actually does this.
So, to avoid the high costs for the inexact calculation, an FP operation is preceded by a quick check, whether the exception must be calculated at all.
An example for the square root instruction in QEMU using the method of Guo et al. is shown in the following simplified code from qemu/fpu/softfloat.c (yes, despite no longer being a pure soft float implementation, the file is still called “softfloat” ¯\_(ツ)_/¯):
static inline bool can_use_fpu(const float_status *s) {
    if (QEMU_NO_HARDFLOAT)
        return false;
    return likely((s->f_excep_flags & f_flag_inexact) &&
                  s->f_round_mode == f_round_near_even);
}

float32 float32_sqrt(float32 xa, float_status *s) {
    union_float32 ua, ur;
    ua.s = xa;
    if (unlikely(!can_use_fpu(s)))
        goto soft;

    float32_input_flush1(&ua.s, s);
    if (unlikely(!float32_is_zero_or_normal(ua.s) || float32_is_neg(ua.s)))
        goto soft;

    ur.h = sqrtf(ua.h);
    return ur.s;

soft:
    return soft_f32_sqrt(ua.s, s);
}
As you can see, the function float32_sqrt starts with a call to can_use_fpu.
Here QEMU checks whether the inexact flag must be calculated at all.
Moreover, the host FPU can only be used if target and host rounding mode are the same.
It is assumed that the default C rounding mode of RNE is used and not changed during execution.
Thus, a quick check of the target’s rounding mode suffices.
Since some target architectures like PowerPC also require a non-sticky inexact exception,
the check can be disabled at compile time by defining the macro QEMU_NO_HARDFLOAT
accordingly.
Ultimately, it’s very unlikely that we have to resort to the soft float method, which is also hinted by the compiler hint unlikely.
To also avoid setting the underflow and invalid exception, the soft float method is used if the input is negative or subnormal.
But again subnormal values as well as negative inputs for float32_sqrt
are very rare.
The idea of extending Guo’s method by checking both invalid and underflow flags was proposed by Cota et al. [28].
It was also E. G. Cota who committed the code to QEMU in 2018.
If all checks passed, which is the most probable case, the function sqrtf
is called, resulting in a sqrtss
instruction for x64 hosts.
With the new method of Guo and Cota, the performance of FP instructions could be increased by a factor of more than $2\times$ in comparison to soft float. However, this speedup is only attainable if an inexact exception occurs at some point and if the RNE rounding mode is used. Tackling the latter issue, at least for additions, Guo et al. developed a quick inexact check, which is pretty similar to the Fast2Sum algorithm by T. J. Dekker [29].
Rosetta 2 is Apple’s x64-on-ARM emulator, which was introduced in 2020 to aid the transition from x64 to ARM-based Apple Silicon [30]. Despite translating instructions from x64 to ARM, which is not the focus of this post, the underlying principle can be applied to any architecture as well. In fact, I’m currently implementing a similar thing for RISC-V, but shhhh.
Since Apple does not disclose the technical details of their products, the following statements are based on internet sources. In general, most problems of x64-to-ARM FP simulation concern non-standard behavior and cases labeled as “implementation defined”. For example, the FTZ and DAZ flags of the x64 ISA are not part of the IEEE 754 standard. These flags allow flushing subnormal outputs (FTZ) and inputs (DAZ) of an instruction to zero individually. Similarly, the ARM ISA also allows flushing numbers to zero, yet there is no way to control both input and output as on x64.
According to [31], Apple introduced an alternate FP mode to solve this problem in hardware.
By setting a certain bit in the ARM FP control register, x64 FP arithmetic can be mimicked.
While the Rosetta 2 approach allows for maximum performance, it requires full control of the ISA and silicon.
Shortly after Apple’s release of the M1 processor [32],
the first physical implementation of the alternate FP mode,
ARM officially included this mode in the ARMv8 ISA.
More specifically, it is part of the ARMv8.7 architecture extension from January 2021 [33],
where it is technically referred to as FEAT_AFP
(fun fact: rumours say that AFP might also be interpreted as Apple Floating Point 🤔).
Thus, in the future, the alternate FP mode could also find its way into the products of other manufacturers.
Interestingly, just recently I saw this article about Loongson’s LBT extension
for hardware-accelerated DBT.
The Loongson ISA manual and this article still lack important details, but I guess that parts of the additional hardware features go in a similar direction as FEAT_AFP.
As mentioned in Section 5.3 QEMU post-v4.0.0, Guo et al. [27] tried to implement software-based calculations for the inexact exception, but could only come up with a solution for additions/subtractions. Their solution looked as follows:
inexact = ((a + b) - a) < b
Guo et al. don’t mention it in their paper, but that is pretty much the so-called Fast2Sum algorithm that was introduced in 1971 by T. J. Dekker [29]. According to Dekker, the result of a rounded addition can be described by the sum of its exact value and a residual:
\begin{equation}
\label{eq:fast2sum-main}
\begin{aligned}
a + b - r = s = \circ(a + b)
\end{aligned}
\end{equation}
The residual can be calculated by rounded FP instructions as follows:
\begin{equation} \label{eq:fast2sum-residuum} \begin{aligned} r = \circ(b - \circ(s - a)) \quad \text{with} \quad |a| \geq |b| \end{aligned} \end{equation}
As mathematically proven by Dekker, the residual $r$ holds the exact rounding error of the addition of the variables $a$ and $b$. Hence, if the residual $r$ is not 0, the FP addition was inexact. Additionally, the value of the residual also determines the rounding direction of the preceding addition $\circ(a + b)$. For values greater than 0, the result of the addition was rounded down; for values less than 0, the result was rounded up. This fact wasn’t used by Guo et al. [27], but by You et al. [34] in 2019. Note that Guo et al. [27] and You et al. [34] share a common co-author. So, ultimately, a solution to emulate RUP rounding using RNE on the host might look like this:
// Fast2Sum for RUP
float c = a + b; // Result.
float x = fabs(a) > fabs(b) ? a : b;
float y = fabs(a) > fabs(b) ? b : a;
float r = y - (c - x); // Rounding error (exact - rounded).
if (r != 0) {
    inexact = true;
    if (r > 0) {
        c = nextup(c); // Next greater FP value.
        overflow = is_inf(c) ? true : overflow;
    }
}
While You et al. and Guo et al. managed to develop fast inexact checks and rounding adjustments for additions/subtractions, other arithmetic instructions remained untouched. You et al. also developed an inexact check for FMA instructions using integer-based intermediate results, but their measurements show no speedup compared to a soft float implementation. So, let’s take a look at a more successful attempt in the next section.
The approach from Sarrazin et al. [35] isn’t really about determining the inexactness of FMA, but it comes close to it. Interestingly, their work was published in 2016, which predates the unsuccessful attempt of You et al. [34] in 2019.
The group of Sarrazin faced the problem of emulating FMA instructions on systems without hardware FMA support.
So, they combined UpMul with the 2Sum algorithm to get the following equations:
\begin{equation}
\label{eq:ErrFma-residual}
\begin{gathered}
M = \circ_{64}(a \cdot b) \\\
S, T = 2Sum(M, \circ_{64}(c)) \\\
r = \circ_{32}(S) \\\
E = |S - r| \\\
\text{with} \quad \circ_{32}(a)=a, \quad \circ_{32}(b)=b, \quad \circ_{32}(c)=c
\end{gathered}
\end{equation}
The output of the 2Sum algorithm is identical to the Fast2Sum algorithm, which was presented in the previous subsection.
A more detailed discussion about the differences and performance implication is provided in the following section.
The residual $T$ (yes, suboptimal variable name) determines if the addition of $c$ and $a \cdot b$ was inexact.
This can have an impact on the rounding if $E$ is in the middle of two 32-bit FP numbers ($E=2^{e_r - p}$).
So, if $E$ is equal to $2^{e_r - p}$, you have to check $S$ and $T$, and adapt $r$ accordingly.
As you can see, that doesn’t really indicate if the calculation was inexact or not. Later in Section 6.5 Fast 32-bit Fused Multiply-Add, I show how the equations can be rearranged to fulfill that purpose.
One major disadvantage of the method by Sarrazin et al. is the dependence on larger data types. If the residual of a 32-bit FMA instruction is computed, at least 64-bit FP precision is required. Or more precisely, the larger data type needs at least $2p$ significand bits. Hence, this algorithm does not work for double precision values on x64 systems. The 80-bit precision provided by the x87 FPU cannot be used, as it does not have $2p$ significand bits.
In this section, I show which methods I used and developed to equip MachineWare’s SIM-V simulator
with an ultra-fast FP arithmetic.
As shown in the previous section, there are numerous ways to simulate FP arithmetic.
To make life easy for myself, I implemented a soft float library for the first proof of concept.
With soft float, SIM-V was able to pass the RISCOF tests, but the performance was underwhelming.
So, for the second attempt, I implemented QEMU’s method.
This already increased the speed significantly, and profiling showed that there was only limited room for optimization.
In more than 99.9% of all cases, the critical exception flags are already set and don’t need to be recalculated.
From a programmer’s point of view, certainly good - there is nothing more to do!
For a PhD student under pressure to publish, rather suboptimal - there is nothing more to research!
Ok, but what if I focus on some of the corner cases in which QEMU’s method doesn’t perform well? For instance, if the target doesn’t use RNE, QEMU always has to fall back to soft float. You et al. [34] already showed how the residual of an addition could be used to account for different rounding modes. But they didn’t propose any methods for other arithmetic instructions, such as multiplication, division, or square root.
So, in the following, I will show for all relevant arithmetic instructions, how to quickly calculate a residual that can be used to determine inexactness and perform directed rounding. I call this approach floppy float, because it’s somewhere between soft and hard float. As far as I know, the methods for division and square root haven’t been described anywhere else in literature so far. The goal of the method is to perform equally fast as QEMU for standard rounding, and outperform it for non-standard rounding.
Besides using mathematical proofs to check the validity of the approaches, all instructions were verified using the RISC-V Architecture Test [37], as well as hand-crafted tests to confirm corner cases.
NOTE
In the following I’m using a positive residual (e.g. $c_{exact} + r = \circ(a+b)$).
Hence, if $r>0$, the result was rounded up, and if $r<0$, the result was rounded down.
In my opinion it feels more intuitive this way.
As explained in Section 5.5 You et al., the work of You et al. [34] uses the Fast2Sum algorithm for the calculation of the residual $r$.
This requires two arithmetic operations, but the operands must be sorted by absolute value.
Consequently, branching instructions might be needed, which can lead to performance penalties.
As an alternative without sorted operands, O. Møller [38] proposed the 2Sum algorithm in 1965.
Similar to Dekker’s Fast2Sum algorithm, the 2Sum’s motivation was to increase accuracy in floating point calculations.
But roughly 50 years later, we found a way to use it to speed up our simulations!
Opposed to the Fast2Sum algorithm, the 2Sum algorithm does not require branching instructions, but involves more arithmetic instructions:
\begin{equation}
\label{eq:2sum-main}
\begin{gathered}
c_{exact} + r = c = \circ(a+b) \\\
a' = \circ(c-b) \quad
b' = \circ(c-a') \\\
\delta_a = \circ(a' - a) \quad
\delta_b = \circ(b' - b) \quad
r = \circ(\delta_a + \delta_b)
\end{gathered}
\end{equation}
This algorithm also exhibits some potential for instruction-level parallelism/vectorization, as the data dependency graph reveals:
In some benchmark experiments I ran, the 2Sum algorithm was ~10% faster than the Fast2Sum algorithm when working on randomized data.
If the input data is predictable, thus favorable to the branch predictor, both algorithms achieve the same performance.
Ultimately, a 32-bit FP add for RUP rounding might look like this:
// RUP case
float c = a + b;   // Result.
float ad = c - b;
float bd = c - ad;
float da = ad - a;
float db = bd - b;
float r = da + db; // Residual (rounded - exact).
if (r != 0.f) {
    inexact = true;
    if (r < 0.f) {
        c = nextup(c);
        overflow = is_inf(c) ? true : overflow;
    }
}
For the fast calculation and rounding of multiplications, I exploited one interesting property of IEEE FP numbers: multiplying two 32-bit FP values as 64-bit values always yields an exact result! Similar to addition, this allows to calculate a residual, which can be used for rounding and setting the inexact flag. For the sake of simplicity, I will call this approach UpMul from now on.
So, let’s start with some operands $a$ and $b$ as 32-bit FP values.
In a first step, these are upcasted to 64-bit values and then multiplied.
Since the number of significand bits more than doubles from 32-bit FP to 64-bit FP,
the result of the multiplication can be represented exactly.
If the exact value is subtracted from the erroneous value, the residual remains:
\begin{equation}
\label{eq:upmul-main}
\begin{gathered}
c_{exact} + r = c = \circ_{32}(a \cdot b) \\\
r = c - a \cdot b = \circ_{64}(\circ_{32}(a \cdot b) - \circ_{64}(a \cdot b))
\end{gathered}
\end{equation}
The mathematical proof is provided at the end of this section. A C/C++ implementation for the RUP rounding mode can be found in the following code:
// RUP case
float c = a * b;
double r = (double)c - (double)a * (double)b;
if (r != 0.) {
    inexact = true;
    if (r < 0.) {
        c = nextup(c); // Next greater FP value.
        overflow = is_inf(c) ? true : overflow;
    }
    underflow = (is_subnormal(c) || is_zero(c)) ? true : underflow;
}
As shown in the code, an inexact calculation has occurred if $r\neq 0$. Subsequently, the result is rectified in case the host hardware rounded it down. This could lead to an overflow, hence the result is checked for infinity. According to the RISC-V ISA, tininess is detected after rounding, requiring an underflow check after rectification. Note that underflow only occurs when the result is subnormal and inexact.
So, now let’s take a look at the mathematical proof of this method.
The formula can be derived by first showing that the multiplication of the 32-bit values
as 64-bit values is exact.
Using Equation \ref{eq:float1} the multiplication can be expressed as:
\begin{equation}
\label{eq:upmul3}
\begin{aligned}
a \cdot b = M_a \cdot M_b \cdot 2^{e_a + e_b - 2p_f + 2} =
c = M_c \cdot 2^{e_c - p_d + 1}
\end{aligned}
\end{equation}
As stated in Section 3.1 The Math, this model is not suitable for subnormal numbers.
So, how to deal with this case?
The trick is, we don’t need to consider it!
Casting 32-bit FP values to 64-bit can never lead to subnormal results.
And even the subsequent multiplication cannot lead to subnormal results.
Why is that?
The smallest subnormal 32-bit FP number is $2^{e_{f,min}- p_f + 1} = 2^{-149}$.
Multiplying the smallest subnormal 32-bit FP number with itself results in $2^{2 \cdot -149} = 2^{-298}$.
These results are still far away from the 64-bit subnormal range, which begins at $2^{e_{d,min}} = 2^{-1022}$.
GG EZ!
Next, we derive the maximum ranges of $M_c$ and $e_c$:
\begin{equation}
\label{eq:upmul5}
\begin{gathered}
|M_c| = |M_a \cdot M_b| \leq (2^{p_f}-1)^2 \leq (2^{24}-1)^2 \leq 2^{48} - 1 \leq 2 ^{p_d} - 1 \leq 2 ^{53} - 1 \\\
|e_c| = |e_a + e_b - 2p_f + p_d + 1| \leq 260 \leq |e_{d,min}|
\end{gathered}
\end{equation}
Since both $M_c$ and $e_c$ fit into the range of a double-precision value, the result of the multiplication is exact.
From Equation \ref{eq:upmul5} we can also see why $2p$ significand bits are required to represent a multiplication exactly.
As the final step, the exactness of the subtraction needs to be shown. Here I simply used Sterbenz’ Lemma [39]. According to his lemma, the subtraction of two very close FP numbers is always exact. Interesting remark: this only works if the FP number format supports subnormals. Or to express it mathematically:
\begin{equation}
\label{eq:sterbenz}
\begin{gathered}
\text{if} \quad a/2 \leq b \leq 2a \\\
\text{then} \quad \circ(b - a) = b - a
\end{gathered}
\end{equation}
Since the values of $\circ_{64}(a \cdot b)$ and $\circ_{32}(a \cdot b)$ differ by not more than a factor of two, their subtraction is exact.
For the fast division, I developed a new method called UpDiv, which I have not seen in any other work before.
Similar to the UpMul method from before, both operands must be 32-bit FP values, and the goal is to compute the residual $r$.
However, in this case, the exact determination of the residual of a division is overambitious,
as certain rational numbers cannot be represented with a finite number of significand bits.
Nevertheless, the exact value of the residual is not crucial for our endeavor.
Rather, we want to know whether there was a rounding error, and if it is positive or negative.
In mathematical terms, an approximation of the residual $\tilde{r}$ is sought, for which $sgn(\tilde{r})=sgn(r)$ is satisfied.
Such an approximation is obtained by:
\begin{equation}
\label{eq:updiv-main}
\begin{gathered}
a / b + r = c_{exact} + r = c = \circ_{32}(a / b) \\\
\tilde{r} = \circ_{64}(\circ_{64}(\circ_{32}(a / b) \cdot b) - a) \cdot sgn(b)
\end{gathered}
\end{equation}
And in terms of C/C++:
// RUP case
float c = a / b;
double r = (double)c * (double)b - (double)a;
r = signbit(b) ? -r : r;
if (r != 0.) {
inexact = true;
if (r < 0.) {
c = nextup(c); // Next greater FP value.
overflow = is_inf(c) ? true : overflow;
}
underflow = (is_subnormal(c) || is_zero(c)) ? true : underflow;
}
If you are interested in the mathematical proof, here it comes.
The equation can be derived by using the standard model of FP arithmetic extended for subnormals (see Equation \ref{eq:standard-error-model}).
According to the model, the error of the FP division, including underflow and overflow, can be represented as follows:
\begin{equation}
\label{eq:updiv3}
\begin{aligned}
\frac{a}{b} \cdot (1 + \epsilon_1 ) + \eta_1 = \circ_{32}(a/b) = a / b + r
\end{aligned}
\end{equation}
If the result of the division is upcasted to 64-bit and multiplied by the value of $b$, which is also upcasted to 64-bit, the result must be exact (see previous subsection).
This allows to calculate the approximation $\tilde{a}$ as follows:
\begin{equation}
\label{eq:updiv4}
\begin{aligned}
\tilde{a} = a + a \epsilon_1 + b \eta_1 = \circ_{64}(b \cdot \circ_{32}(a/b))
\end{aligned}
\end{equation}
Subtracting $a$ from $\tilde{a}$ yields Equation \ref{eq:updiv5}:
\begin{equation}
\label{eq:updiv5}
\begin{gathered}
z = \circ_{64}(\tilde{a} - a) = \circ_{64}(\circ_{64}(b \cdot \circ_{32}(a/b)) - a) = (a \epsilon_1 + b \eta_1)(1 + \epsilon_2) \\\
z =
\begin{cases}
b \eta_1 (1 + \epsilon_2) & \text{subn.}\\\
a \epsilon_1 (1 + \epsilon_2) & \text{else}
\end{cases}
\end{gathered}
\end{equation}
Although this subtraction can be inexact, which is described by $\epsilon_2$,
the result 0 can only be obtained if the preceding division was exact ($\epsilon_1=\eta_1=0$).
Otherwise, the sign of $z$ is directly determined by $a \epsilon_1$ or $b \eta_1$.
Next, Equation \ref{eq:updiv5} is rearranged to:
\begin{equation}
\label{eq:updiv7}
\begin{aligned}
\epsilon_1 = \frac{z}{a \cdot (1 + \epsilon_2)} \quad
\eta_1 = \frac{z}{b \cdot (1 + \epsilon_2)}
\end{aligned}
\end{equation}
Inserting Equation \ref{eq:updiv7} into Equation \ref{eq:updiv3} yields for both cases the following residual:
\begin{equation}
\label{eq:updiv8}
r = \frac{z} {b \cdot (1 + \epsilon_2)} = \frac{\circ_{64}(a - \circ_{64}(b \cdot \circ_{32}(a/b)))} {b \cdot (1 + \epsilon_2)}
\end{equation}
Therefore, the residual can only be 0 if $z$ is 0 as well.
Likewise, the sign of $r$ is directly determined by $z$ and $b$.
Consequently, we conclude $sgn(\tilde{r}) = sgn(r)$.
The calculation of a fast square root and its residual follows the same principle as the UpDiv algorithm. Hence, I named it UpSqrt. I exploit that multiplication is the inverse operation of square root, and that multiplication with larger data types is exact. The residual results according to Equation \ref{eq:upsqrt-main}:
\begin{equation}
\label{eq:upsqrt-main}
\begin{gathered}
\sqrt{a} + r = b_{exact} + r = b = \circ_{32}(\sqrt{a}) \\\
\tilde{r} = \circ_{64}(\circ_{64}(\circ_{32}(\sqrt{a})^2) - a)
\end{gathered}
\end{equation}
The proof of the algorithm is analogous to the proof of the UpDiv algorithm. Again, an approximation $\tilde{r}$ for the residual $r$ with $sgn(r) = sgn(\tilde{r})$ is sought. And again, two properties are exploited: the multiplication is exact if a larger data type is available, and the multiplication can be used as the inverse function of the actual operation. The final result is the following expression: \begin{equation} \label{eq:upsqrt2} \begin{aligned} r = \sqrt{\frac{\tilde{r}}{1+\epsilon_2}+a} - \sqrt{a} \end{aligned} \end{equation} Since the sign of $r$ only depends on $\tilde{r}$, $sgn(r) = sgn(\tilde{r})$ holds. Here’s the corresponding C/C++ code:
// RUP case
float b = sqrtf(a);
double r = (double)b * (double)b - (double)a;
if (r != 0.) {
    inexact = true;
    if (r < 0.) {
        b = nextup(b);
    }
}
And here’s the proof. According to the standard error model of FP, the 64-bit squaring of the 32-bit square root of $a$ results in: \begin{equation} \circ_{64}(\circ_{32}(\sqrt{a})^2) = \circ_{32}(\sqrt{a})^2 = (\sqrt{a} \cdot (1 + \epsilon_1))^2 = a \cdot (1 + \epsilon_1)^2 \end{equation} Note that a square root cannot produce a subnormal result (thus no $\eta$) and that a 64-bit multiplication of 32-bit values is always exact. The latter is the same property of FP that I already used in the previous two sections. Next, we subtract $a$:
\begin{equation} \label{eq:upsqrt-proof1} \begin{gathered} \tilde{r} = \circ_{64}(\circ_{32}(\sqrt{a})^2 - a) = (a \cdot (1 + \epsilon_1)^2 - a)\cdot (1 + \epsilon_2) \end{gathered} \end{equation}
And rearrange the formula:
\begin{equation} \label{eq:upsqrt-proof2} \begin{gathered} \epsilon_1 = \sqrt{\frac{\tilde{r}}{(1+\epsilon_2) \cdot a}+1} - 1 \end{gathered} \end{equation}
Inserting $\epsilon_1$ into $\sqrt{a} \cdot \epsilon_1 = r$ gives us:
\begin{equation} \label{eq:upsqrt-proof3} \begin{aligned} r = \sqrt{\frac{\tilde{r}}{1+\epsilon_2}+a} - \sqrt{a} \end{aligned} \end{equation} And q.e.d.
For fast FMA simulation, I deployed a similar method as Sarrazin et al. [35].
Yet, I repurposed it to account for inexact exceptions.
The idea is to first calculate the exact multiplication of $a$ and $b$ using a larger data type.
Subsequently, the residual of the summation of $a \cdot b$ and $c$ is calculated using the 2Sum algorithm.
But even if this summation was exact ($r_1=0$), the final result might not be representable as a 32-bit FP value.
Hence, another residual $r_2$ is calculated to determine the 64-bit to 32-bit rounding error.
Note that $r_2$ is exact due to Sterbenz’ Lemma [39].
\begin{equation}
\label{eq:fast-fma-main}
\begin{gathered}
d_{exact} + r = d = \circ_{32}(a \cdot b + c) \\\
r_1 = 2Sum(\circ_{64}(a \cdot b), c) \\\
r_2 = \circ_{64}(d - \circ_{64}(\circ_{64}(a \cdot b) + c))
\end{gathered}
\end{equation}
Finally, an approximation of the rounding error $\tilde{r}$ can be calculated, as shown in Equation \ref{eq:fast-fma-residual}:
\begin{equation}
\label{eq:fast-fma-residual}
\begin{aligned}
\tilde{r} & = r_1 + r_2
\end{aligned}
\end{equation}
Although the addition of $r_1$ and $r_2$ is not exact per se, it satisfies $sgn(\tilde{r})=sgn(r)$.
This is enabled by gradual underflows, due to which the following property holds for two arbitrary 32-bit FP numbers:
$sgn(a+b) = sgn(\circ_{32}(a + b))$.
As before, here is the C/C++ code for a RUP case:
// RUP case
float d = std::fma(a, b, c);
double p = (double)a * (double)b;
double dd = p + (double)c;
double r1 = two_sum<double>(p, (double)c, dd);
double r2 = (double)d - dd;
double r = r1 + r2;
if (r != 0.) {
    inexact = true;
    if (r < 0.) {
        d = nextup(d);
        overflow = is_inf(d) ? true : overflow;
    }
    underflow = (is_subnormal(d) || is_zero(d)) ? true : underflow;
}
return d;
The previous upcast algorithms UpMul, UpDiv, UpSqrt, and also the FMA algorithm according to Sarrazin et al. [35], are all based on a larger data type that can perform multiplications exactly. As mentioned earlier, these algorithms reach their limitations for 64-bit values on x64 systems. To circumvent these limitations, the fused multiply-add (FMA) instruction of the x64 ISA can be used. This instruction was introduced with the FMA3/FMA4 instruction set extensions, and FMA3 is part of all modern x64 processors. For example, using FMA, the residual of the UpMul algorithm can be calculated as follows:
\begin{equation} \label{eq:example-div} \begin{aligned} r’ & = \circ_{64}(a \cdot b - \circ_{64}(a \cdot b)) = \circ_{64}(c_{exact} - c) \end{aligned} \end{equation}
However, the rounding step at the end of each FMA instruction poses a problem. Although an FMA instruction calculates all intermediate results with infinite precision, the result is eventually rounded. In the example shown, it is possible that $r’$ is not representable with a 64-bit precision. One could therefore wrongly assume a value of 0, although the value is actually different from 0. Hence, $r’=r$ does not hold in all cases.
Consequently, bounds must be determined for which $r’$ is no longer representable. Since $r’$ is the direct result of the subtraction of $c$ and $c’$, we have to determine the smallest distance between these numbers, excluding 0. This distance is $|d| \geq 2^{e_c - 2p_d}$. The number of double significand bits $2p_d$ follows from the exact intermediate results of the FMA instruction. As explained previously, $2p$ significand bits are needed for the exact representation of a $p$-bit multiplication. In order to represent $r’$ as a 64-bit FP value, $e_c - 2p_d \geq e_{d,min} - p_d + 1$ must hold. A simple rearrangement leads to the following inequality: \begin{equation} \label{eq:example-div-bound} \begin{aligned} e_c \geq e_{d,min} + p_d + 1 = -1022 + 53 + 1 = -968 \end{aligned} \end{equation} If $|c|$ is less than $2^{-968}$, my method cannot be used, and the instruction has to be calculated using soft float. However, the range below $2^{-968}$ represents less than 3% of all 64-bit FP values. In practice, it’s even less, as most FP values are centered around 1. To prove this statement, I ran 78 different FP benchmarks and tracked the in- and output exponents of all 64-bit arithmetic FP instructions:
As you can see, on average less than 0.1% of all values have a magnitude below $2^{-968}$.
A C/C++ example for the 64-bit division is given in the following code:
if (std::fabs(a) < 4.008336720017946e-292) // 2^-968
    return soft::div(a, b);
double c = a / b;
double r = std::fma(c, b, -a);
if (r != 0.0) {
    inexact = true;
    underflow = (is_subnormal(c) || is_zero(c)) ? true : underflow;
}
In this section, I show the results of some clean room benchmarks. The goal was to assess the maximum performance of each individual instruction for soft float, floppy float (my approach), and hard float (native FP instructions). That means inputs and outputs are never subnormal, there are no data dependencies between the instructions, standard rounding is used, and there’s no DBT overhead. While floppy float and hard float aren’t really sensitive to different kinds of input data (except subnormals), soft float is, due to its control-flow-heavy calculations. In general, the input data was designed to favor optimistic paths in soft float. So, let’s take a look at the results:
As you can see, simply executing FP instructions one after another (hard float) achieves around 8500 MIPS for instructions that can be executed in one cycle (max, min, add, sub, etc.). This is explained by the FP pipeline of the host processor, which was an AMD Ryzen Threadripper 3990X in my case. Most FP instructions can use 2 of 4 FP pipes provided by the Zen 2 microarchitecture, leading to $8500 MIPS \approx 2 \cdot 4.3GHz$. Some instructions, such as division, square root, or 64-bit multiplication, require multiple cycles, which results in lower performance. Nevertheless, hard float is faster than soft and floppy float in all cases. The performance of the floppy float approach is in the range of 300-600 MIPS, and is faster than soft float by up to $5 \times$ in some operations, such as square root. For lightweight operations, such as min or max, there is no significant difference between soft- and floppy float.
Since my approach is intended to accelerate FP performance in DBT simulators, a practical performance assessment is indispensable. For this purpose, I integrated my approach, the method by Cota et al. [28](QEMU’s method), and Bellard’s SoftFP [21], into MachineWare’s DBT-based RISC-V simulator SIM-V [1]. I then conducted a performance analysis using well-known FP benchmarks such as linpack, NPB, SPEC CPU 2017, and other representative workloads. The results can be found in the following graph:
In the graph, the speedups of the individual benchmarks are shown, whereby the soft float method was used as a reference baseline.
All benchmarks in Subplot a) were executed with the default RNE rounding, while Subplot b) represents the same benchmarks under RUP rounding.
Please note that this graph does not compare SIM-V with QEMU!
It’s only QEMU’s method implemented in SIM-V!
Since SIM-V uses multiple other techniques to speed up simulations, a comparison wouldn’t be fair.
As can be seen in the graph, QEMU’s method and my approach achieve a speedup of $3\times$ in a best-case scenario (see Subplot a, NPB/ft.A and 508.namd). Also, in most cases, the performance of my approach is equal to the performance of QEMU’s approach when RNE rounding is used. As explained previously, my approach is only faster when underflows occur and no inexact flags are set, or when a non-default rounding mode is used. Since most applications already set an inexact flag after a few executed instructions, the speedup gained from an accelerated inexact calculation is marginal. Also, underflows are rare, as I could confirm with a separate instruction and data study. For example, in the case of the NPB/ft.A benchmark, not a single underflow occurred in a total of 3,875,127,289 executed fmadd instructions.
To demonstrate the advantages of my method, I ran all benchmarks again under RUP rounding, which is depicted in Subplot b). Here we can see that QEMU’s method is slower than soft float in all cases. This can be attributed to the fact that QEMU first checks the rounding mode before resorting to soft float. My method, however, can rectify the result for most instructions and set the exception flags without using soft float. Thus, speedups of 50% over QEMU’s method are achieved for benchmarks like linpack32. Since the speedup of my method depends on the executed instructions, we observe a heterogeneous picture of results. Moreover, the speedups under RNE cannot be used to infer the speedups under RUP. As described previously, we do not have a method for 64-bit FMA instructions, and all presented approaches require fewer checks when working on 32-bit data. Hence, single-precision benchmarks, such as linpack32 or machine learning applications (lenet, alexnet), achieve higher speedups in non-default rounding modes. Applications that comprise many 64-bit FMA instructions achieve low to no speedup (see NPB/bt.A and NPB/cg.A).
In this post, I showed how floating point arithmetic is calculated in emulators/simulators, such as QEMU, gem5, or Rosetta 2. To the best of my knowledge, this post provides the most complete picture of this topic to date. But if you find more literature worth citing, let me know!
Besides just providing a related-work overview, I showed how the QEMU approach can be improved to also perform well for other rounding modes. I implemented my method in MachineWare’s SIM-V RISC-V simulator and beat QEMU’s method by more than 50% in the best case. For the vanilla RNE rounding mode, I couldn’t achieve any speedups for standard benchmarks. This is due to exception bits being sticky and not requiring any recalculations. I later noticed that the PowerPC has non-sticky exception flags, which requires a recalculation for every instruction. Hence, I guess my method could significantly speed up PowerPC simulations even for standard benchmarks with RNE rounding.
One important missing piece of this work is efficient algorithms for 64-bit FMA instructions. Unfortunately, these instructions occur relatively frequently, costing us a significant chunk of performance in some benchmarks. I found an interesting work by Boldo et al. [36], which provides an algorithm to calculate the residual of FMA instructions. So exactly what I need! But I wasn’t able to get it running correctly for whatever reason… Since their paper is basically 8 pages of mathematical proofs, I leave this as a problem for other people and future Niko.
If you have remarks, questions, or just want to say “hello”, feel free to write me a mail!
What is the paper about?
The gist of it is a parallelized version of gem5’s atomic mode.
Note that this is for the atomic mode only!
If you are interested in the timing mode, feel free to read our sequel “parti-gem5: gem5’s Timing Mode Parallelised”, which is available on arXiv.
How fast is par-gem5?
For completely parallel benchmarks we managed to reach speedups of ~25x when simulating a 128-core ARM system on a 128-core x64 host system.
More realistic parallel benchmarks like NPB “only” attain speedups of up to ~12x.
Since par-gem5 creates a thread for each simulated CPU core, the maximum attainable speedup depends on several factors.
This includes: the number of available host threads, the number of simulated target CPUs, and the degree of parallelization in the executed benchmark.
Especially the latter is important.
If you are looking to speed up the execution of a single-core benchmark like Dhrystone, par-gem5 is probably not the right tool for you!
Is par-gem5 easy to use?
I would say it is fairly simple if you are already familiar with vanilla gem5.
You only have to set a CPU’s event queue and choose a reasonable quantum.
This can all be done in the python setup scripts with the following lines:
if args.parallel:
    print("gem5 going parallel")
    m5.ticks.fixGlobalFrequency()
    root.sim_quantum = m5.ticks.fromSeconds(m5.util.convert.anyToLatency("500us"))
    cpus = system.cpu_cluster[0].cpus
    # Note: child objects usually inherit the parent's event queue.
    if len(cpus) > 1:
        first_cpu_eq = 1
        for idx, cpu in enumerate(cpus, first_cpu_eq):
            cpu.eventq_index = idx
How accurate and reliable is par-gem5?
The parallelization approach of par-gem5 is in many regards similar to SystemC TLM-2.0’s so-called temporal decoupling.
That means, rather than having one global time as in vanilla gem5, each simulated CPU resides in its own time and occasionally synchronizes
with the rest of the system at certain barrier points.
The distance of the barrier points is determined by the aforementioned quantum.
For instance, if the quantum is set to 500µs, the maximum time two CPUs can diverge is 500µs.
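To make the barrier rule more concrete, here is a minimal Python sketch of the quantum bound (my own illustration, not par-gem5 code): a CPU may only advance its local clock while it is less than one quantum ahead of the slowest CPU.

```python
QUANTUM = 500_000  # 500 us, expressed in ns

class DecoupledCpu:
    def __init__(self):
        self.local_time = 0

    def can_advance(self, all_cpus):
        # Barrier rule: never run more than one quantum ahead of the laggard.
        return self.local_time < min(c.local_time for c in all_cpus) + QUANTUM

cpus = [DecoupledCpu() for _ in range(2)]
# Let cpu0 simulate as far as the rule allows while cpu1 is stuck at t=0.
while cpus[0].can_advance(cpus):
    cpus[0].local_time += 1_000  # one scheduling step of 1 us

skew = cpus[0].local_time - cpus[1].local_time
print(skew)  # 500000 -- the maximum divergence equals the quantum
```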
Surprisingly, the hardware and software of most modern general-purpose CPU systems is pretty resilient to a certain amount of time skew. If you do not yeet up the quantum to values like 1 second, you can boot Linux systems and run arbitrary software workloads without encountering any problems. Nevertheless, we are changing the semantics of the simulation, and this has a non-negligible impact on multiple aspects.
For instance, if CPUs are communicating with each other, certain messages may be postponed to a barrier point, which in general leads to prolonged simulation times (the time that is provided in the gem5 statistics, not the so-called wall-clock time). As shown in the paper, a quantum of 1µs seems to keep inaccuracies in a single-digit percentage while still achieving significant speedups in most benchmarks.
The different time domains are also a problem for some of gem5’s hardware models. For instance, the ARM timer model casts time differences to unsigned integers, which results in trouble if the deltas are negative. Here’s a snippet of the unfixed timer’s impact on the Linux boot timestamps.
gem5 par-gem5
[0.000385] [0.000385] Mount-cache hash table entries: 32768 [...]
[0.000396] [0.000396] Mountpoint-cache hash table entries: [...]
[0.024140] [422.828066] ASID allocator initialised with 128 entries
[0.032140] [3495.801687] Hierarchical SRCU implementation.
[0.048162] [845.656091] smp: Bringing up secondary CPUs ...
[0.080218] [5877.941435] Detected PIPT-Icache on CPU1
As you can see, at some point the timer blows up. That was a pain to debug, but we eventually managed to find the error and fix the timer model. After fixing some other issues, par-gem5 is now in a state that I would consider quite reliable. I wouldn’t launch a spacecraft with it, but it’s good enough for software development and design space exploration.
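The underlying bug class can be illustrated with a few lines of Python that mimic the unsigned cast (an illustration only, not the actual gem5 timer code):

```python
MASK64 = (1 << 64) - 1  # emulate a C uint64_t cast

def delta_as_unsigned(now, event_time):
    # The broken model effectively computed: (uint64_t)(event_time - now)
    return (event_time - now) & MASK64

print(delta_as_unsigned(100, 150))  # future event: a sane delta of 50
# A temporally decoupled CPU running ahead of the timer sees a *negative*
# delta, which wraps around to a gigantic unsigned value -> bogus timestamps.
print(delta_as_unsigned(150, 100))  # 18446744073709551566
```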
Will par-gem5 be open source?
Since par-gem5 is the result of an industry project, the source code is not going to be disclosed.
Any Questions?
Feel free to write me a mail (see About).
This is another post of my TLMBoy series, where I document the development of my equally named Game Boy emulator. In contrast to my other posts, the following sections do not deal with any “How do I implement this and that?”. Instead, I dissect and explain the 256-byte hidden boot code that helps bring up the Game Boy!
When turning on most compute systems, only a few things are guaranteed to have a certain value. The Game Boy is no exception and only guarantees the program counter register to be initialized with 0. Everything else, like the other registers, the sound processor, and the pixel processing unit, has to be initialized by the boot process.
In the case of the Game Boy, the boot code resides within a special 256-byte ROM that is mapped from 0x00 to 0xff. Interestingly, the boot ROM unmaps itself from the memory map after finishing the boot. This unmapping feature made it quite hard to reverse engineer the boot code.
The first successful reverse engineering attempt was achieved by a dude(tte) called “neviksti” in 2003. This was 14 years after the initial release of the Game Boy in 1989! According to the gbdev wiki [1], this person was actually mad enough to decap the Game Boy’s SoC and read out every single bit using a microscope. Interestingly, neviksti’s website [2] is still up today and features some cool die shots like this one:
In the following sections I’ll go through the boot code line by line and analyze it.
Furthermore, I’ll try to decompile the assembly into some C-ish code.
Of course, I’m a little bit late to the party, and a lot of people wrote some nice wrap-ups before me. Take a look at the Literature section to see what helped me write this post.
Also Nintendo themselves helped me
by putting their boot CFG (control flow graph) into a patent [3] called
“System for preventing the use of an unauthorized external memory”:
Before analyzing the code, we do of course need some assembly code to work on! My personal favorite is this [4] commented, human-readable boot rom which I will refer to in the following.
The first three instructions are some plain register initializations.
The stack pointer sp
is set to 0xfffe; register a
is set to 0; and hl
now points to the VRAM (0x9fff).
BB0:
0x000 ld sp, $fffe // init stack
0x003 xor a // efficient way for: a = 0
0x004 ld hl, $9fff // set hl to VRAM
To avoid displaying random garbage, the Game Boy has to zero-initialize its VRAM. The following three-line loop takes care of it.
BB1:
0x007 ld [hl-], a // load a into [hl], then decrement hl
0x008 bit 7, h // stop condition
0x00a jr nz, @BB1 // jump to BB1, if not zero
This quite dense code is made possible by a little bit trick. The VRAM ranges from 0x8000 to 0x9FFF; in binary, all of these addresses have a “1” at bit 15 (the MSB, which is bit 7 of register h). But the first address below 0x8000 doesn’t:
0b10000000 00000000 = 0x8000
0b01111111 11111111 = 0x7FFF
The same functionality can be achieved with the following C-Code:
for (int i = 0x9FFF; i >= 0x8000; --i) {
mem[i] = 0;
}
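To double-check which addresses the stop condition actually covers, here’s a small Python re-trace of the loop (my own sketch, not Game Boy code):

```python
def cleared_addresses():
    hl, addrs = 0x9FFF, []
    while True:
        addrs.append(hl)        # ld [hl-], a
        hl -= 1
        if not (hl >> 15) & 1:  # bit 7 of h == bit 15 of hl
            break
    return addrs

addrs = cleared_addresses()
print(hex(addrs[0]), hex(addrs[-1]), len(addrs))  # 0x9fff 0x8000 8192
```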
The next lines set up the Game Boy’s sound processor:
0x00c ld hl, rNR52 // load 0xFF26 into hl: register no 52
0x00f ld c, $11
0x011 ld a, $80
0x013 ld [hl-], a // rNR52 = $80, all sound on
0x014 ld [c], a // rNR11 = $80, wave duty 50%
0x015 inc c
0x016 ld a, $f3
0x018 ld [c], a // rNR12 = $f3, envelope settings
0x019 ld [hl-], a // rNR51 = $f3, sound output terminals
0x01a ld a, $77
0x01c ld [hl], a // rNR50 = $77, SO2 on, full volume, SO1 off, full volume
They aren’t too interesting and of minor relevance for the boot process itself. A corresponding C-Code could look like this:
mem[0xff26] = 0x80; // all sound on
mem[0xff11] = 0x80; // wave duty 50%
mem[0xff12] = 0xf3; // envelope settings
mem[0xff25] = 0xf3; // sound output terminal
mem[0xff24] = 0x77; // SO2 on, full volume, SO1 off, full volume
As a next step the background and window color palette register (BGP, at 0xff47) is set to 0b11111100, and the pointers for logo load are prepared.
0x01d ld a, $fc
0x01f ldh [rBGP], a // BGP = $fc, set up color palette
0x021 ld de, $0104 // de = cartridge header logo
0x024 ld hl, $8010 // hl = VRAM
The BGP setup can be translated as:
11 10 01 00 # value
| | | |
11 11 11 00 # mapped to
| | | |
b b b w # b=black, w=white
It’s simply a remapping of colour values for the background and window tiles. So, for example, a pixel with a value of 01 is displayed as 11, which is deep black (the reason for this mapping is explained in Subsection 2.7). The corresponding C-Code is just (ignoring the pointers):
mem[0xff47] = 0xfc; // set up BG and window colour palette
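As a sanity check, the mapping can also be decoded programmatically (a small Python sketch of the register layout, not boot ROM code):

```python
def decode_bgp(bgp):
    # BGP holds four 2-bit fields; field i gives the shade that is displayed
    # for background colour number i (0 = white, 3 = black).
    return [(bgp >> (2 * i)) & 0b11 for i in range(4)]

print(decode_bgp(0xFC))  # [0, 3, 3, 3]: colour 0 stays white, 1-3 all map to black
```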
The job of the next basic block is to load the Nintendo logo from the cartridge into the VRAM:
BB4:
0x027 ld a, [de] // for loop over cartridge logo data, de = 0x104
0x028 call $0095 // copy cartridge logo data to VRAM at $8010
0x02b call $0096
0x02e inc de
0x02f ld a, e
0x030 cp $34 // a == 0x34?
0x032 jr nz, @BB4
However, due to size constraints, the Nintendo logo is heavily compressed and needs to be decompressed by a relatively simple algorithm. That way, the 48 bytes of the compressed Nintendo logo can be inflated to 384 bytes (= 24 tiles) worth of pixel data. The corresponding C-Code looks like this:
u8 *vram = 0x8010;
for (u8 *logo = 0x0104; logo < 0x0134; ++logo) {
u8 data = *logo;
DecompressAndCopy(data, vram);
vram += 4;
DecompressAndCopy(data >> 4, vram);
vram += 4;
}
// vram will be 80d0
In the following section we will take a closer look at the decompression algorithm.
The decompression algorithm of the Game Boy is not really complex, yet the assembly is quite intricate:
// 'a' holds the next datum of the logo
DecompressAndCopy:
0x095 ld c, a // c = 76543210
0x096 ld b, $04 // loop counter
decomp_loop:
0x098 push bc
0x099 rl c
0x09b rla
0x09c pop bc
0x09d rl c
0x09f rla
0x0a0 dec b
0x0a1 jr nz, @decomp_loop
0x0a3 ld [hl+], a
0x0a4 inc hl // leave one byte blank
0x0a5 ld [hl+], a
0x0a6 inc hl // leave one byte blank
0x0a7 ret
So, let’s start with an abstract description of what the algorithm actually does. As an input the algorithm receives one byte of data (the numbers represent bit positions):
> in = 76543210
The output is then a scaled version (2x in x and y direction) distributed over 4 bytes:
> out0 = 77665544
> out1 = 77665544
> out2 = 33221100
> out3 = 33221100
I hope that this is as simple as I promised.
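To make the relation concrete, here’s a small Python model of the whole decompression (a sketch of the input/output behaviour, not the actual implementation):

```python
def decompress(byte):
    """Scale one byte of logo data 2x in both directions (4 output bytes)."""
    out = 0
    for i in range(8):
        if byte & (1 << i):
            out |= 0b11 << (2 * i)  # duplicate each bit horizontally
    hi, lo = out >> 8, out & 0xFF
    return [hi, hi, lo, lo]         # duplicate each row vertically

print([f"{b:08b}" for b in decompress(0b10110001)])
# ['11001111', '11001111', '00000011', '00000011']
```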
We now increase the difficulty and analyze the actual implementation.
The first call of the DecompressAndCopy
calculates the first two bytes of the outputs (out0, out1),
while the second call calculates the last two bytes (out2, out3).
Note that the second call uses 0x96 instead of 0x95 as an entry point, due to intermediate values still residing in register c.
To make the code more accessible, I did a systematic analysis of the decomp_loop.
In the following table each column represents an iteration of the decomp_loop
, whereby the numbers uniquely identify
the bits (C stands for carry):
instr | b = 4 | b = 3 | b = 2 | b = 1 |
---|---|---|---|---|
0x99 | c=6543210x, C=7 | c=54321076, C=6 | c=43210754, C=5 | c=32107532, C=4 |
0x9b | a=65432107, C=7 | a=43210776, C=5 | a=21077665, C=3 | a=07766554, C=1 |
0x9c | c=76543210 | c=65432107 | c=54321075 | c=43210753 |
0x9d | c=65432107, C=7 | c=54321075, C=6 | c=43210753, C=5 | c=32107531, C=4 |
0x9f | a=54321077, C=6 | a=32107766, C=4 | a=10776655, C=2 | a=77665544, C=0 |
Note how the carry is used in a very clever way to exchange bits between the c and the a register.
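For the skeptical reader, here’s a Python re-implementation of the rotate-through-carry choreography (my own model; the initial carry is assumed to be 0, and as the table shows, it never reaches a):

```python
def rl(val, carry):
    # 8-bit rotate-left through carry: bit 0 <- old carry, new carry <- old bit 7
    return ((val << 1) | carry) & 0xFF, (val >> 7) & 1

def decompress_half(a, carry=0):
    c = a                          # 0x095: ld c, a
    for _ in range(4):             # 0x096: ld b, $04 and the decomp_loop
        saved_c = c                # push bc
        _, carry = rl(c, carry)    # rl c (result discarded by the pop below)
        a, carry = rl(a, carry)    # rla
        c = saved_c                # pop bc (the carry survives!)
        c, carry = rl(c, carry)    # rl c
        a, carry = rl(a, carry)    # rla
    return a

print(f"{decompress_half(0b10110001):08b}")  # 11001111, i.e. bits 77665544
```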
Creating some functionally similar C-code may look like this:
void DecompressAndCopy(u8 data, u8 *addr) {
u8 mask0 = 0b00000001;
u8 mask1 = 0b00000011;
u8 res = 0;
for (int i = 0; i < 4; ++i) {
res |= (data & mask0) ? mask1 : 0;
mask0 <<= 1;
mask1 <<= 2;
}
*addr = res;
*(addr+2) = res;
}
The C-code above is functionally equivalent, yet barely resembles the original assembly, as there’s no way to directly utilize carry bits in C.
In contrast to the Nintendo logo, the registered trademark logo doesn’t need any decompression. Furthermore, it’s fetched from the boot ROM, not from the cartridge! Hence, it’s simply loaded into memory as follows:
0x034 ld de, $00d8 // de = boot rom data after logo
0x037 ld b, $08 // b = length of data
reg_trade:
0x039 ld a, [de]
0x03a inc de
0x03b ld [hl+], a // hl points to VRAM
0x03c inc hl
0x03d dec b
0x03e jr nz, @-$07 // 8 iterations
C-Code:
u8 *vram = 0x80d0;
for (u8 *logo = 0xd8; logo < 0xe0; ++logo) {
*vram = *logo;
vram += 2;
}
Note that, similarly to the previous section, we leave one byte blank again.
Usually each pixel displayed comprises two bits spread over different bytes.
But due to our custom color mapping (only black and white), the second bit doesn’t really
carry any information and is thus left blank.
More information about how pixel data is represented will be provided in my soon to appear PPU post.
If one were to render the tile map at this state, the following image would show up:
Most of the tilemap is just empty space, but the 25 tiles used to depict the Nintendo logo are already
more than recognizable!
Due to its memory limitations, the Game Boy doesn’t really have a pixel-wise buffer of the whole screen.
Instead, it uses a tile-based system, usually referring to 8x8-pixel tiles via a 32x32 map of byte-sized indices.
A more in-depth explanation will be provided in my yet to be written post about the PPU.
So for now this has to suffice ;)
Anyway, the decompression algorithm we already saw just drew some tiles into the tile data map.
But the information about where to draw these tiles is provided with the following lines:
0x040 ld a, $19 // select tile 25
0x042 ld [$9910], a // display tile 25 at (8,16)
0x045 ld hl, $992f // point to (9,15)
BB48:
0x048 ld c, $0c // c = 12
BB4a:
0x04a dec a
0x04b jr z, @BB55
0x04d ld [hl-], a
0x04e dec c
0x04f jr nz, @BB4a
0x051 ld l, $0f // point to tile (8,15)
0x053 jr @BB48
BB55:
The code initializes the display tiles from (9,3-15) and from (8,3-15) using a nested loop. A corresponding C code:
int a = 25;
u8 *mem = 0x9910;
*mem = a;
mem = 0x992f;
for (int j = 0; j < 2; ++j) {
for (int i = 12; i > 0; --i) {
a--;
*mem = a;
mem--;
}
mem = 0x990f;
}
At this point, the only thing yet to be configured is the PPU (Pixel Processing Unit). We could draw anything into the tile buffer, but we would never see a pixel without the display turned on. The following lines take care of that:
BB55:
0x055 ld h, a // h = 0
0x056 ld a, $64
0x058 ld d, a // d = 100
0x059 ldh [rSCY], a // scroll_y = 100
0x05b ld a, $91 // 0x91 = 0b10010001
0x05d ldh [rLCDC], a // [0xff40] = b10010001
Most of the configuration is done by instruction 0x5d. This instruction writes data into a PPU configuration register, resulting in the following setup:
1 = turn on LCD screen.
0 = window tile map 0x9800-$9bff
0 = window display off
1 = bg and window tile data = 0x8000-0x8fff
0 = bg tile map 0x9800-0x9bff
0 = obj sprite size 8*8
0 = obj sprite display off
1 = bg and window display on
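A tiny Python helper (my own sketch) makes the bit assignment explicit, reading from bit 7 down to bit 0 as listed above:

```python
FIELDS = ["lcd on", "window tile map", "window display", "tile data select",
          "bg tile map", "obj size", "obj display", "bg/window display"]

def decode_lcdc(value):
    # FIELDS[0] corresponds to bit 7, FIELDS[7] to bit 0.
    return {name: (value >> (7 - i)) & 1 for i, name in enumerate(FIELDS)}

print(decode_lcdc(0x91))
# Only "lcd on", "tile data select", and "bg/window display" are set.
```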
The Y scrolling is set up as well, with a value of 100. This value is iteratively decremented to achieve the scroll-down effect of the Nintendo logo. The C-Code is quite simple for this part:
u8* rSCY = 0xff42;
*rSCY = 100;
u8 *rLCDC = 0xff40;
*rLCDC = 0x91;
Ok, now everything is set up and it’s time to scroll down the Nintendo logo:
// h = 0
0x05f inc b // b = 1
BB60:
0x060 ld e, $02 // e = 2; 2MC
BB62:
0x062 ld c, $0c // c = 12; 2MC
BB64:
0x064 ldh a, [rLY] // a = [0xff44] vline number; 2MC
0x066 cp $90 // a == 144?; 1MC
0x068 jr nz, @BB64 // 2MC/3MC
0x06a dec c // 1MC
0x06b jr nz, @BB64 // 2MC/3MC
0x06d dec e // 1MC
0x06e jr nz, @BB62 // 2MC/3MC
0x070 ld c, $13
0x072 inc h
0x073 ld a, h
0x074 ld e, $83
0x076 cp $62
0x078 jr z, @BB80
0x07a ld e, $c1
0x07c cp $64
0x07e jr nz, @BB86
BB80:
0x080 ld a, e
0x081 ld [c], a
0x082 inc c
0x083 ld a, $87
0x085 ld [c], a
BB86:
0x086 ldh a, [rSCY]
0x088 sub b
0x089 ldh [rSCY], a // scroll_y -= 1
0x08b dec d
0x08c jr nz, @BB60
0x08e dec b
0x08f jr nz, @BBE0 // Jump to Nintendo Logo check, 0xe0
0x091 ld d, $20
0x093 jr @-$35 // BB60
However, before any configuration data of a running PPU is touched, the Game Boy needs
to make sure that the PPU isn’t rendering at the moment.
This actually very short period of idling is indicated either by a v-blank interrupt
or by an LY register (residing at 0xff44) value greater than or equal to 144.
Apparently, the Game Boy engineers chose the latter option.
They implemented a busy-waiting method that constantly polls the LY register
and compares its value against 144 (see instructions 0x64-0x68).
The code doesn’t look really obvious at first glance, so let’s take a closer look.
We’ll start at the inner loop beginning at BB64
which just waits for the v-blank register to return a 144.
Once this happens, two nested loops, from now on called the e-loop and the c-loop due to their loop variables, with loop counts of 2 and 12, are started.
Note, that in each iteration we’re still asking the v-blank register if it’s still at 144!
But how long does it keep that value?
According to the Game Boy CPU Manual [7] the v-blank register increases its value every 114 machine cycles (MC).
So, the Game Boy has 114 machine cycles worth of instructions to spend before the 144 turns into a 145.
These 114 machine cycles are more or less one iteration of the e-loop!
Here’s the calculation:
1 c-loop iteration = 2+1+2+1+3 = 9MC
12 iterations whereby the last one is only 8 cycles: 11*9+8 = 107MC
Plus e-loop part: 107+6 = 113MC
Note, that depending on the result (branch or not branch)
the jump instructions either take 3 or 2 machine cycles respectively.
After the first e-loop iteration, the Game Boy has to wait for a whole frame (~17 ms) until the LY register reads 144 again.
Therefore, the instructions from 0x60 to 0x6e can be summarized as: wait for two frames and finish with an idle PPU.
The next few instructions play some sound and most importantly: they scroll down the Nintendo logo by one pixel!
This scroll effect is achieved by changing the value of the scroll-y register. Its value determines the windows offset in pixels in y-direction.
Since this whole part is wrapped into a bigger loop (the d-loop), the Game Boy decrements the scroll-y register 100 times, scrolling the Nintendo logo down pixel by pixel.
Taking the two frames wait period into account, we arrive at roughly 3 seconds for the Nintendo logo scroll down sequence.
This pretty much complies with the real-world behaviour.
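The arithmetic behind that estimate (assuming the usual 4.194304 MHz CPU clock with 4 clocks per machine cycle, and 154 scanlines of 114 machine cycles per frame):

```python
mc_per_second = 4_194_304 / 4             # machine cycles per second
frame_seconds = 154 * 114 / mc_per_second # ~16.7 ms per frame
scroll_seconds = 100 * 2 * frame_seconds  # 100 d-loop iterations, 2 frames each
print(round(scroll_seconds, 2))  # 3.35
```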
After the logo has reached its final position, it rests there for a short period of time. This is achieved by instructions
0x08e to 0x93. These instructions reduce the scroll increment to 0 (dec b) and then run the whole d-loop again 32 times.
In the end the rendered result of my TLMBoy looks like this:
As usual, here’s the C-code of the current sequence:
int h = 0;
u8 e;  // register e doubles as the sound value
for (int d = 100; d > 0; --d) {
    // wait for 2 frames
    for (int i = 2; i > 0; --i) {       // the e-loop
        for (int j = 12; j > 0; --j) {  // the c-loop
            while (vline() != 144) {}
        }
    }
    h++;
    u16 *sound_f_low = 0xFF13;
    u16 *sound_f_high = 0xFF14;
    e = 0x83;
    if (h == 98) {
        goto BB80;
    }
    e = 0xc1;
    if (h != 100) {
        goto BB86;
    }
BB80:
    *sound_f_low = e;
    *sound_f_high = 0x87;
BB86:
    *scroll_y -= 1;
}
// let the logo rest a short time
for (int d = 32; d > 0; --d) {
    for (int i = 2; i > 0; --i) {       // the e-loop
        for (int j = 12; j > 0; --j) {  // the c-loop
            while (vline() != 144) {}
        }
    }
}
After the scroll sequence, the Game Boy verifies whether it was really a Nintendo logo that showed up on your screen.
If it’s not, the boot loader just bricks.
As explained in [8], this was Nintendo’s way of preventing unlicensed game developers from publishing games for the Game Boy. Because you cannot forbid someone to develop games for your hardware, but you can sue people for using your logo!
This check is done byte by byte from instruction 0x0e0 to 0x0ef.
The last instruction finally unloads the boot ROM by writing a 1 into address 0xFF50.
BBE0:
0x0e0 ld hl, $0104 // hl = rom cartridge header logo
0x0e3 ld de, $00a8 // de = boot rom logo
BBE6:
0x0e6 ld a, [de] // for loop over the cartridge header logo
0x0e7 inc de
0x0e8 cp [hl]
BBE9:
0x0e9 jr nz, @BBE9 // loop forever if fail
0x0eb inc hl
0x0ec ld a, l
0x0ed cp $34
0x0ef jr nz, @BBE6
0x0f1 ld b, $19
0x0f3 ld a, b
BBF4:
0x0f4 add [hl] // for loop through the rest of the header to calculate checksum, CODE XREF=CopyData+98
0x0f5 inc hl
0x0f6 dec b
0x0f7 jr nz, @BBF4
0x0f9 add [hl] // Validate against the cartridge header checksum field
BBFA:
0x0fa jr nz, @BBFA // If header checksum is invalid then loop forever
0x0fc ld a, $01
0x0fe ldh [$ff00+$50], a
C-Code
u8 *cartridge_logo = 0x104;
u8 *boot_logo = 0xa8;
for (int i = 0; i < 48; ++i) {
    if (cartridge_logo[i] != boot_logo[i]) {
        while (true) {}  // Loop forever.
    }
}
u8 *cartridge_header = 0x134;
u8 sum = 0x19;
for (int i = 0; i <= 25; ++i) {
    sum += cartridge_header[i];
}
if (sum != 0) {
    while (true) {}  // Loop forever.
}
unload_boot_rom();
All code snippets in one code box:
// (0x95-0xa7): Decompress and copy the data to VRAM.
void DecompressAndCopy(u8 data, u8 *addr) {
u8 mask0 = 0b00000001;
u8 mask1 = 0b00000011;
u8 res = 0;
for (int i = 0; i < 4; ++i) {
res |= (data & mask0) ? mask1 : 0;
mask0 <<= 1;
mask1 <<= 2;
}
*addr = res;
*(addr+2) = res;
}
void main() {
// BB1 (0x07-0x0a) : Setting up the VRAM.
u8 *mem = 0x0;
for (int i = 0x9FFF; i >= 0x8000; --i) {
mem[i] = 0;
}
// BB2 (0x0c-0x1c): Setting up the sound.
mem[0xff26] = 0x80; // All sound on.
mem[0xff11] = 0x80; // Wave duty 50%.
mem[0xff12] = 0xf3; // Envelope settings.
mem[0xff25] = 0xf3; // Sound output terminal.
mem[0xff24] = 0x77; // SO2 on, full volume, SO1 off, full volume.
// BB3 (0x1d-0x24): Init the color palette.
mem[0xff47] = 0xfc; // Set up BG and window colour palette.
// BB4 (0x27-0x32): Load the logo.
u8 *vram = 0x8010;
for (u8 *logo = 0x0104; logo < 0x0134; ++logo) {
u8 data = *logo;
DecompressAndCopy(data, vram);
vram += 4;
DecompressAndCopy(data >> 4, vram);
vram += 4;
}
// (0x34-3e): Load the registered trademark.
vram = 0x80d0;
for (u8 *logo = 0xd8; logo < 0xe0; ++logo) {
*vram = *logo;
vram += 2;
}
// (0x40-0x53): Selecting the right tiles.
int a = 25;
mem = 0x9910;
*mem = a;
mem = 0x992f;
for (int j = 0; j < 2; ++j) {
for (int i = 12; i > 0; --i) {
a--;
*mem = a;
mem--;
}
mem = 0x990f;
}
// (0x55-0x5d): Display init.
u8* rSCY = 0xff42;
*rSCY = 100;
u8 *rLCDC = 0xff40;
*rLCDC = 0x91;
// (0x5f-0x93): Showtime.
int h = 0;
u8 e;  // Register e doubles as the sound value.
for (int d = 100; d > 0; --d) {
    // Wait for 2 frames.
    for (int i = 2; i > 0; --i) {       // the e-loop
        for (int j = 12; j > 0; --j) {  // the c-loop
            while (vline() != 144) {}
        }
    }
    h++;
    u16 *sound_f_low = 0xFF13;
    u16 *sound_f_high = 0xFF14;
    e = 0x83;
    if (h == 98) {
        goto BB80;
    }
    e = 0xc1;
    if (h != 100) {
        goto BB86;
    }
BB80:
    *sound_f_low = e;
    *sound_f_high = 0x87;
BB86:
    *scroll_y -= 1;
}
// Let the logo rest a short time.
for (int d = 32; d > 0; --d) {
    for (int i = 2; i > 0; --i) {       // the e-loop
        for (int j = 12; j > 0; --j) {  // the c-loop
            while (vline() != 144) {}
        }
    }
}
// (0xe0-0xfe) Checking the logo.
u8 *cartridge_logo = 0x104;
u8 *boot_logo = 0xa8;
for (int i = 0; i < 48; ++i) {
    if (cartridge_logo[i] != boot_logo[i]) {
        while (true) {}  // Loop forever.
    }
}
u8 *cartridge_header = 0x134;
u8 sum = 0x19;
for (int i = 0; i <= 25; ++i) {
    sum += cartridge_header[i];
}
if (sum != 0) {
    while (true) {}  // Loop forever.
}
unload_boot_rom();
return;
}
Despite being a fascinating and well-designed program, the boot ROM actually leaves some room for circumventing the logo check. Since the logo is loaded twice from the cartridge (one time for the VRAM, a second time for the check), providing the right data at the right time lets you boot up the Game Boy without infringing any copyrights. This is achieved by first providing a custom logo for the scroll-down part, and then providing a Nintendo logo for the logo check. Of course, you need some custom logic in your cartridge to detect what kind of data is currently requested. Nevertheless, some companies used this exploit to sell unlicensed games (see [9]).
I hope that you enjoyed this “little” post about the Game Boy’s boot process. Even though the boot ROM is just a 256-byte program (with a significant part being just logo data), it somehow suffices to write a more-than-3000-words blog post about it. I guess this shows how much you can achieve with a little assembly, if you know how to do your job well. Especially the decompress-and-copy process is a good example of this. I doubt that any compiler could attain the same code density.
If there’s any feedback, don’t hesitate to contact me :)
[1] Gameboy Development Wiki
[2] neviksti’s website
[3] Game Boy patent
[4] Commented boot ROM
[5] Boot ROM tutorial 1 (detailed)
[6] Boot ROM tutorial 2
[7] Game Boy CPU manual
[8] History of boot ROM and logo generator
[9] Custom boot logos
Contents
In this post, I’ll cover how I implemented the GDB Serial Protocol (GDBRSP) for my Game Boy simulator TLMBoy.
For the whole code in action see gdb_server.cpp
and gdb_server.h
in my Github repo.
While I used the Game Boy as a target architecture, the principles and details presented here can be applied to every other platform as well.
In fact, you just need a GDB for your desired CPU architecture!
Since the Game Boy’s CPU (basically a Z80 clone) isn’t natively supported by GDB, I’ll show you how to get a Z80 GDB first.
If you don’t mind the extra work, you could also extend GDB by adding support for your favorite architecture.
But in the case of the Z80, someone already went ahead ;)
Before we dive into the technical details, let’s answer some simple yet important questions first:
GDB Remote Serial Protocol (GDBRSP) is the name of the protocol that GDB uses to communicate with so-called GDBstubs. The protocol defines how packets have to look and how servers and clients communicate. As a backbone usually either the TCP protocol or just a plain serial communication is employed. Extensive documentation can be found in the GDB docs
Because why don’t we just use plain GDB to debug stuff?
Imagine you’re programming a Game Boy simulator like in my case.
You end up with a piece of software (a Game Boy simulator) that executes another piece of software (for instance, Pokémon Red).
To debug your simulator, you’d probably just use GDB, which is perfectly fine.
But how do you debug the software inside the software (Pokémon Red) from the Game Boy’s perspective?
One common approach is to incorporate a so-called GDBstub into your simulator.
This stub receives messages from GDB, for example, via TCP, and translates them to simulator specific instructions as depicted in the following illustration:
Implementing this stub for your specific simulator requires some work by you, which is mainly covered in this post.
But trust me, having a GDBstub in your simulator is a really cool feature.
Because once you have your stub, you can just use the typical GDB frontend and start your debug sessions.
This is why many well-known simulators like QEMU or gem5
also have implemented their own GDBstub.
Before I explain the details on implementing a Game Boy GDBstub, let’s
take a look at how to get a GDB with Z80 (the Game Boy’s CPU) support first.
If you already have one, feel free to skip the next section.
Note: you can also use my TLMBoy’s Docker container,
which includes said Z80 GDB (start it with z80-unknown-elf-gdb
).
I guess you have probably already consulted Google searching for a Z80 GDB, which might have led you to the following GitHub repository.
However, most of this code is more than 10 years old, and compiling it is a pain in the *** if you’re using a quite recent Linux
environment.
As it happens to be, a few months ago (September 2020), some cool guy submitted a patch to the GDB team, including architecture support for Z80 CPUs and even the Game Boy’s modified version.
But as stated in the given link, it might take a while until this patch is upstream.
And I guess adding support for an antiquated architecture isn’t really the first item on the maintainers’ priority list…
So, in the meanwhile, let’s just compile it ourselves!
Fortunately, the glorious Z80 patcher provided a Github repository which can be found
here.
The next steps are just cloning the repository and building that stuff as follows:
git clone https://github.com/b-s-a/binutils-gdb.git
cd binutils-gdb
mkdir build
./configure --target=z80-unknown-elf --prefix=$(pwd)/build --exec-prefix=$(pwd)/build
make
make install
Depending on your preferences, you may want to change things like the build directory or the executable format.
Since the Game Boy doesn’t really have an executable format, I just took elf
, but other file formats like coff
should work as well.
At that point, you should find an executable Z80 GDB in the bin directory:
In this section, we’ll take a closer look at the protocol and what GDB expects from us.
As already mentioned, I want to implement a GDBstub for my Game Boy simulator.
Depending on your GDBstub, you might have to meet different design considerations at some points.
For example, my first few steps were implementing a TCP server (which is not covered in this post),
but if you’re implementing a GDBstub for some embedded device, a serial connection might be a better choice.
Anyway, let’s get down to business!
The typical GDBRSP packet uses the following pattern:
$packet-data#checksum
It comprises a “$” to indicate the beginning of a packet, some packet data, usually human-readable, and a two-digit hex checksum that is preceded by a “#”. For instance, a packet may look like this:
$m0,8#01
In this case m0,8
tells us to read 8 bytes beginning at memory location 0.
The checksum is calculated by summing up the ASCII values of each character of the packet data (“$” and “#” are excluded!)
and taking the first 8 bits of the result (corresponding to modulo 256). Or, to formulate it as C++ code:
std::string GdbServer::GetChecksumStr(const std::string &msg) {
uint checksum = 0;
for (const char& c : msg) {
checksum += static_cast<uint>(c);
}
checksum &= 0xff;
return fmt::format("{:02x}", checksum);
}
Or in python:
def GetChecksumStr(msg):
return "{:02x}".format(sum(ord(c) for c in msg) & 0xff)
In theory, verifying the checksum doesn’t make sense from our stub’s perspective as the TCP protocol already has some error detection under the hood. But for the sake of completeness, I implemented it anyway.
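For completeness, here's what such a check could look like on the receiving side. This is my own Python sketch, not the TLMBoy's actual code:

```python
def verify_packet(packet: str) -> bool:
    """Check the framing and checksum of an incoming "$data#cs" packet."""
    if len(packet) < 4 or not packet.startswith("$") or packet[-3] != "#":
        return False
    data, checksum = packet[1:-3], packet[-2:]
    # Recompute the modulo-256 sum over the payload and compare.
    return "{:02x}".format(sum(ord(c) for c in data) & 0xff) == checksum

print(verify_packet("$m0,8#01"))  # → True
```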
Checking the checksum is one thing, but how does one check whether the message is syntactically correct? The messages used by GDB are pretty simple and don't contain any nested structures. Hence, I used a big chonky regex to detect all the packets that I want to support:
std::vector<std::string> GdbServer::SplitMsg(const std::string &msg) {
static std::regex reg(
R"(^(\?)|(D)|(g))"
R"(|(c)([0-9]*))"
R"(|(G)([0-9A-Fa-f]+))"
R"(|(M)([0-9A-Fa-f]+),([0-9A-Fa-f]+):([0-9A-Fa-f]+))"
R"(|(m)([0-9A-Fa-f]+),([0-9A-Fa-f]+))"
R"(|([zZ])([0-1]),([0-9A-Fa-f]+),([0-9]))"
R"(|(qAttached)$)"
R"(|(qSupported):((?:[a-zA-Z-]+\+?;?)+))"
);
std::vector<std::string> res;
std::smatch sm;
regex_match(msg, sm, reg);
for (uint i = 1; i < sm.size(); ++i) {
if (sm[i].str() != "") {
res.push_back(sm[i].str());
}
}
return res;
}
Besides the standard packet, there is also an acknowledge packet +
and a not-acknowledge packet -
.
Every message transmitted via GDBRSP needs a response in the form of +
or -
.
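By the way, the C++ snippets in this post use a Packetify helper that wraps a response in the framing described above. Since it is never shown explicitly, here's a minimal Python equivalent (my own sketch):

```python
def packetify(data: str) -> str:
    """Wrap response data in GDBRSP framing: "$" + data + "#" + checksum."""
    return "${}#{:02x}".format(data, sum(ord(c) for c in data) & 0xff)

print(packetify("S05"))  # → $S05#b8
print(packetify(""))     # → $#00
```

Note how the empty response, which we'll need a lot later on, comes out as `$#00`.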
With that in mind, let’s take a look at some first packets that GDB sends to a stub when initiating a connection!
To do this, we first need something that plays the role of the stub's TCP server. You can program one yourself, or just use a Linux network tool like netcat. For instance:
netcat -l 1337
This starts a TCP server listening on port 1337. As a second step, GDB has to be started and connected, which can be achieved with the following commands:
z80-unknown-elf-gdb
(gdb) set arch gbz80
(gdb) set debug remote 1
(gdb) target remote localhost:1337
With set arch gbz80
, we tell GDB to switch to the modified Z80 instruction set that is used by the Game Boy.
I also added the set debug remote 1
to make GDB more verbose and provide us with some interesting insights.
The connection is finally established with target remote localhost:1337
.
If everything goes well, netcat should output the TCP messages sent by GDB.
Let’s analyze them in the next section!
The first packet which arrives at our GDBstub looks as follows:
$qSupported:multiprocess+;swbreak+;hwbreak+;qRelocInsn+;fork-events+;vfork-events+;exec-events+;vContSupported+;QThreadEvents+;no-resumed+#df
Using the gdb docs, let’s break down the message into its substantial parts.
With qSupported
(gdbdocs), GDB tries to tell us about all the fancy features it supports.
This message is not only a statement, but it’s also asking the stub about which features it supports.
So let’s take a look at the single parameters and try to contemplate which one we need:
- multiprocess: Indicates support of the multiprocess extensions. However, the Game Boy doesn't really have multiple processes, so there's no need to support it.
- swbreak: Indicates support of the software breakpoint stop reason. With a software breakpoint, you basically replace the instruction with another instruction that triggers some behavior detected by the debugger. I chose not to support this, as hardware breakpoints are a simpler alternative.
- hwbreak: Indicates support of the hardware breakpoint stop reason. Hardware breakpoints use special hardware registers that trigger some behavior if, for instance, a specified program counter value is reached. This is quite easy to implement in a simulator, so I chose to support it.
- qRelocInsn: Indicates support for relocating instructions, a feature needed for so-called tracepoints. Tracepoints aren't really interesting for us, so skip them.
- fork-events: The Game Boy doesn't have an OS. Consequently, there are no child processes (forks) to debug. Skip it.
- vfork-events: Pretty similar to fork-events. Skip it.
- exec-events: Indicates support of the Linux execve command. Again, there's not really an OS, so we'll skip that one.
- vContSupported: Indicates support for vCont. Might be useful if your system supports multiple threads, which isn't the case for the Game Boy. Skip it.
- QThreadEvents: Again thread-related stuff, which we can skip.
- no-resumed: More thread-related stuff … skipped.

So, we only support hardware breakpoints. Consequently, the answer looks like this:
$hwbreak+#0f
And the C++ part:
void GdbServer::CmdSupported(const std::vector<std::string> &msg_split) {
std::string msg_resp;
if (msg_split[1].find("hwbreak+;") != std::string::npos) {
msg_resp.append("hwbreak+;");
}
msg_resp = Packetify(msg_resp);
DBG_LOG_GDB("sending supported features");
tcp_server_.SendMsg(msg_resp.c_str());
}
In general, the minimum set of commands and features that a GDBstub needs to support is relatively small. The gdb docs state:
At a minimum, a stub is required to support the ‘?’ command to tell GDB the reason for halting, ‘g’ and ‘G’ commands for register access, and the ‘m’ and ‘M’ commands for memory access. Stubs that only control single-threaded targets can implement run control with the ‘c’ (continue) command, and if the target architecture supports hardware-assisted single-stepping, the ‘s’ (step) command. Stubs that support multi-threading targets should support the ‘vCont’ command. All other commands are optional.
After sending our response, GDB immediately sends another packet to our stub:
$vMustReplyEmpty#3a
According to the docs, this command tests how our stub responds to unknown packets (vMustReplyEmpty is deliberately not defined). The correct response to an unknown packet is an empty response:
$#00
Apparently, some older stubs would incorrectly respond with an ‘OK’ to unknown packets. To test this, vMustReplyEmpty was introduced. The C++ code looks as follows:
// With: char const *kMsgEmpty = "+$#00";
void GdbServer::CmdNotFound(const std::vector<std::string> &msg_split) {
tcp_server_.SendMsg(kMsgEmpty);
}
GDB doesn’t get tired of sending us packets and immediately continues with:
$Hg0#df
With this command, all following ‘g’ commands (read register) refer to the thread of the given thread id. However, thread id ‘0’ is a special case, as can be read in the gdb docs: A thread-id can also be a literal ‘-1’ to indicate all threads, or ‘0’ to pick any thread. Since this command is not in the minimum set, and we don’t have multiple threads, we can send an empty response (command unknown) again:
$#00
The next incoming packet is:
$qTStatus#49
GDB is asking us whether a trace experiment is currently running. Well, we’re not supporting tracing anyway, so respond empty:
$#00
With the ‘?’ packet, GDB asks for a reason why the target halted. Since we’re stopping our process once GDB connects, we have to reply with one of the responses listed in gdb docs. I felt like the following response was a good choice:
$S05#b8
Here ‘S05’ corresponds to the POSIX signal SIGTRAP. It’s the typical signal triggered when running into a software breakpoint, often leading to a halt. For instance, qemu uses the same signal in its stub. Also the guy from this cool tutorial uses SIGTRAP. Since the Game Boy doesn’t really have an OS, it doesn’t have POSIX signals either. Hence, it’s more like a dummy answer to satisfy GDB. In theory, any other signal number should work as well. The C++ code looks as follows:
void GdbServer::CmdHalted(const std::vector<std::string> &msg_split) {
std::string msg_resp = Packetify(fmt::format("S{:02x}", SIGTRAP));
cpu_->Halt();
tcp_server_.SendMsg(msg_resp.c_str());
}
GDB seems to be happy with our ‘S05’ response and sends us the following packet afterward:
$qfThreadInfo#bb
With that packet, GDB is asking us about which threads are active. We’ll just respond empty as we’re not supporting threads:
$#00
GDB is really persistent about threads and sends us the predecessor of the qfThreadInfo packet:
$qL1160000000000000000#55
Guess what we respond?
$#00
The next incoming packet is:
$Hc-1#09
This packet is similar to the ‘Hg’ packet and indicates that all following ‘c’ packets refer to all threads (-1). Let’s respond with an empty response, as we haven’t changed our opinion about threads in the meantime. The subsequent packet asks for the current thread ID:
$qC#b4
… Insert generic statement about threads here …
GDB seems to be unstoppable and proceeds with the following packet:
$qAttached#8f
Here we have to respond either with ‘1’ indicating that our remote server is attached to an existing process or with a ‘0’ indicating that the remote server created a new process itself. Depending on our answer here, we either get a kill or detach command when invoking ‘quit’. Since I want to keep the Game Boy running even when quitting GDB, the appropriate answer is ‘1’:
$1#31
The next packet received is:
$g#67
Here GDB wants to read our CPU’s registers. The documentation provides more information about the response format:
Each byte of register data is described by two hex digits. The bytes with the register are transmitted in target byte order. The size of each register and their position within the ‘g’ packet is determined by the GDB internal gdbarch functions DEPRECATED_REGISTER_RAW_SIZE and gdbarch_register_name. When reading registers from a trace frame (see Using the Collected Data), the stub may also return a string of literal ‘x’’s in place of the register data digits, to indicate that the corresponding register has not been collected; thus its value is unavailable.
This means, in order to put the correct register value in the correct place, I have to search through GDB’s source code… I feel like this is not a well-conceived solution, especially if multiple debuggers are used with each having a different ordering of the registers. It would be better if there was some kind of message to define the layout, or if the GDB team would just establish a standard per ISA.
Anyway, I followed down the function gdbarch_register_name
in z80-tdep.c
until I found the corresponding array:
// Frame 2
set_gdbarch_register_name (gdbarch, z80_register_name);
// Frame 1
/* Return the name of register REGNUM. */
static const char *
z80_register_name (struct gdbarch *gdbarch, int regnum)
{
if (regnum >= 0 && regnum < ARRAY_SIZE (z80_reg_names))
return z80_reg_names[regnum];
return NULL;
}
// Frame 0
static const char *z80_reg_names[] =
{
/* 24 bit on eZ80, else 16 bit */
"af", "bc", "de", "hl",
"sp", "pc", "ix", "iy",
"af'", "bc'", "de'", "hl'",
"ir",
/* eZ80 only */
"sps"
};
Hence, our response will start with the “af” register and then progress until the “pc” register. Any subsequent registers are omitted due to the reduced register set of the Game Boy’s Z80. Melting this into C++ code may look like this:
void GdbServer::CmdReadReg(const std::vector<std::string> &msg_split) {
std::string msg_resp;
msg_resp = fmt::format("{:04x}{:04x}{:04x}{:04x}{:04x}{:04x}{:x>{}}",
std::rotl(cpu_->reg_file.AF.val(), 8), std::rotl(cpu_->reg_file.BC.val(), 8),
std::rotl(cpu_->reg_file.DE.val(), 8), std::rotl(cpu_->reg_file.HL.val(), 8),
std::rotl(cpu_->reg_file.SP.val(), 8), std::rotl(cpu_->reg_file.PC.val(), 8),
"", 7*4);
DBG_LOG_GDB("reading general registers");
msg_resp = Packetify(msg_resp);
tcp_server_.SendMsg(msg_resp.c_str());
}
Please note that the Z80 is a little-endian system, requiring us to send the LSB first.
Hence the usage of the amazing new C++20 feature std::rotl
.
An example response may look like this one here:
$0000000000000000feff0000xxxxxxxxxxxxxxxxxxxxxxxxxxxx#77
Here only the stack pointer is initialized (SP=0xfffe) while all other registers are 0.
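The byte-swapping can be double-checked with a few lines of Python; struct's little-endian packing does the job that std::rotl does in the C++ code. This is just my own sanity-check sketch:

```python
import struct

def encode_g_response(af, bc, de, hl, sp, pc):
    """Encode the six Game Boy registers for a 'g' reply, LSB first,
    and mark the remaining registers as "not collected" (7*4 'x' chars,
    mirroring the padding in the C++ snippet above)."""
    return struct.pack("<6H", af, bc, de, hl, sp, pc).hex() + "x" * (7 * 4)

print(encode_g_response(0, 0, 0, 0, 0xfffe, 0))
# → 0000000000000000feff0000xxxxxxxxxxxxxxxxxxxxxxxxxxxx
```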
After answering more than 10 packets, GDB finally seems to be satisfied and offers me its terminal! See the debug log:
(gdb) target remote localhost:1337
Remote debugging using localhost:1337
Sending packet: $qSupported:multiprocess+;swbreak+;hwbreak+;qRelocInsn+;fork-events+;vfork-events+;exec-events+;vContSupported+;QThreadEvents+;no-resumed+#df...Ack
Packet received: swbreak+;
Packet qSupported (supported-packets) is supported
Sending packet: $vMustReplyEmpty#3a...Ack
Packet received:
Sending packet: $Hg0#df...Ack
Packet received:
Sending packet: $qTStatus#49...Ack
Packet received:
Packet qTStatus (trace-status) is NOT supported
Sending packet: $?#3f...Ack
Packet received: S05
Sending packet: $qfThreadInfo#bb...Ack
Packet received:
Sending packet: $qL1160000000000000000#55...Ack
Packet received:
Sending packet: $Hc-1#09...Ack
Packet received:
Sending packet: $qC#b4...Ack
Packet received:
Sending packet: $qAttached#8f...Ack
Packet received: 1
Packet qAttached (query-attached) is supported
warning: No executable has been specified and target does not support
determining executable automatically. Try using the "file" command.
Sending packet: $g#67...Ack
Packet received: 000000000000000000000000xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Sending packet: $qL1160000000000000000#55...Ack
Packet received:
0x00000000 in ?? ()
(gdb)
Yet we are not done, as some of the mandatory GDB commands aren’t implemented (like G, m, M, s, and c).
I think the best way to explore them is to look at them in the context of GDB terminal commands.
Hence, let’s start with some basic commands such as info registers
and then work our way up to stuff like setting breakpoints.
The command info registers
prints out the values of the CPU’s registers:
(gdb) info registers
af 0x0 [ ]
bc 0x0 0
de 0x0 0x0
hl 0x0 0x0
sp 0xfffe 0xfffe
pc 0x0 0x0
ix <unavailable>
iy <unavailable>
af' <unavailable>
bc' <unavailable>
de' <unavailable>
hl' <unavailable>
ir <unavailable>
As you might see in the debug log, there’s actually no message being sent! This is because GDB already has all the information thanks to the ‘g’ packet that was exchanged when establishing the connection.
With display/5i $pc
GDB shows us the next 5 assembly instructions:
(gdb) display/5i $pc
1: x/5i $pc
=> 0x0: Sending packet: $m0,1a#5b...Ack
Packet received: 31feffaf21ff9f32cb7c20fb2126ff0e113e8032e20c3ef3e232
Sending packet: $m1a,1a#bd...Ack
Packet received: 3e77773efce0471104012110801acd9500cd9600137bfe3420f3
Sending packet: $m34,c#63...Ack
Packet received: 11d80006081a1322230520f9
ld sp,0xfffe
0x3: xor a
0x4: ld hl,0x9fff
0x7: ld (0x7ccb),a
0xa: jr nz,0x0007
The debug log reveals that this command comprises a bunch of m
packets.
For instance, the first incoming packet looks like this:
$m0,1a#5b
A quick lookup in the docs reveals that GDB wants to read a chunk of size 0x1a from memory location 0x00. Nothing easier than that. Let’s code some reply:
void GdbServer::CmdReadMem(const std::vector<std::string> &msg_split) {
std::string msg_resp;
std::string addr_str = msg_split[1];
std::string length_str = msg_split[2];
uint addr = std::stoi(addr_str, nullptr, 16);
uint length = std::stoi(length_str, nullptr, 16);
for (uint i = 0; i < length; ++i) {
u8 data = cpu_->ReadBusDebug(addr + i);
msg_resp.append(fmt::format("{:02x}", data));
}
DBG_LOG_GDB("reading 0x" << length_str << " bytes at address 0x" << addr_str);
msg_resp = Packetify(msg_resp);
tcp_server_.SendMsg(msg_resp.c_str());
}
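As a quick sanity check, the address/length pair of such a packet can be extracted with a one-line regex. This is a toy Python sketch of what the stub's big regex does for the 'm' case:

```python
import re

def parse_read_mem(packet: str):
    """Extract (addr, length) from a GDB 'm' packet such as "$m0,1a#5b"."""
    m = re.match(r"\$m([0-9A-Fa-f]+),([0-9A-Fa-f]+)#[0-9A-Fa-f]{2}$", packet)
    return int(m.group(1), 16), int(m.group(2), 16)

print(parse_read_mem("$m0,1a#5b"))  # → (0, 26)
```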
As a next typical GDB command, we’ll take a look at si
, which is short for step instruction
and tells
our program to execute the next assembly instruction.
So, let’s just take a look at the debug log and see what happens:
(gdb) si
Sending packet: $mffe0,1a#8c...Ack
Packet received: 0000000000000000000000000000000000000000000000000000
Sending packet: $mfffa,6#62...Ack
Packet received: 000000000000
Sending packet: $m0,8#01...Ack
Packet received: 31feffaf21ff9f32
Sending packet: $m3,1#fd...Ack
Packet received: af
Sending packet: $Z0,3,8#4d...Ack
Packet received: OK
Packet Z0 (software-breakpoint) is supported
Sending packet: $vCont?#49...Ack
Packet received:
Packet vCont (verbose-resume) is NOT supported
Sending packet: $Hc0#db...Ack
Packet received:
Sending packet: $c#63...Ack
Packet received: S05
Sending packet: $g#67...Ack
Packet received: 0000000000000000feff0300xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Sending packet: $z0,3,8#6d...Ack
Packet received: OK
Sending packet: $mffe0,1a#8c...Ack
Packet received: 0000000000000000000000000000000000000000000000000000
Sending packet: $mfffa,6#62...Ack
Packet received: 000000000000
Sending packet: $qL1160000000000000000#55...Ack
Packet received:
Sending packet: $mffe0,1a#8c...Ack
Packet received: 0000000000000000000000000000000000000000000000000000
Sending packet: $mfffa,6#62...Ack
Packet received: 000000000000
As you can see, the first few packets are multiple memory reads at different addresses. These reads are issued because GDB wants to know the instructions that follow the current one. At first, I was like: “Why doesn’t GDB just read at program counter + 1?” Well, the next instruction to be executed isn’t necessarily the one at the next program counter address! For example, in the case of return instructions, GDB has to backtrack this next instruction by unwinding the call stack. This finally explains why GDB read those 0x1a bytes beginning from 0xFFE0 (the current stack pointer at that time) and the following instructions (the program counter was 0x0 at that time). Warning: There are some cases in which this command might blow up. See section Final Thoughts for more information.
The next packet sent is a ‘Z’ packet telling us to insert a software breakpoint (=0) with kind 8 at address 0x3. But… didn’t we tell GDB that we don’t support software breakpoints in the initialization phase? Well, I tried to reject that packet, but this then led to no breakpoint being inserted at all.
At this point, I was a little unsure about how to proceed and implement stuff. So, I took a look at other emulators/simulators/frameworks, namely qemu, gem5 and vcml, and they all do it the same way:
Every kind of breakpoint, be it software or hardware, is mapped onto some kind of virtual hardware breakpoint. For instance, qemu:
switch (type) {
case GDB_BREAKPOINT_SW:
case GDB_BREAKPOINT_HW:
CPU_FOREACH(cpu) {
err = cpu_breakpoint_insert(cpu, addr, BP_GDB, NULL);
if (err) {
break;
}
}
This method is quite easy to implement and avoids changing the memory’s content. We just insert a given address into a data structure, for example a set, and check in the simulator’s main loop whether we reached one of the breakpoints. This led me to the following implementation:
void GdbServer::CmdInsertBp(const std::vector<std::string> &msg_split) {
std::string msg_resp = "";
if (msg_split[1] == "0" || msg_split[1] == "1") {
msg_resp = "OK";
uint addr = std::stoi(msg_split[2], nullptr, 16);
DBG_LOG_GDB("set breakpoint at address 0x" << msg_split[2]);
bp_set_.insert(addr);
} else {
DBG_LOG_GDB("watchpoints aren't supported yet");
}
msg_resp = Packetify(msg_resp);
tcp_server_.SendMsg(msg_resp.c_str());
}
After the breakpoint was set, GDB tells us to continue execution with the ‘c’ packet. My implementation of that is quite simple:
void GdbServer::CmdContinue(std::vector<std::string> msg_split) {
cpu_->Continue();
}
Our CPU will now continue its execution until it encounters a breakpoint which is already the next instruction in case of si
.
We tell GDB about this event by sending a SIGTRAP signal:
void GdbServer::SendBpReached() {
std::string msg_resp = Packetify(fmt::format("S{:02x}", SIGTRAP));
DEBUG_LOG("GDB: sending breakpoint reached");
tcp_server_.SendMsg(msg_resp.c_str());
}
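To see how these handlers play together, here's a toy Python model of the simulator's main loop. TinyCpu and the function name are made up for illustration; the real TLMBoy runs on SystemC processes instead:

```python
class TinyCpu:
    """Minimal stand-in for the simulated CPU."""
    def __init__(self):
        self.pc = 0
        self.halted = False

    def step(self):
        self.pc += 1  # pretend every instruction is one byte long

def run_until_breakpoint(cpu, bp_set):
    # The 'Z' handler only inserted the address into bp_set; the actual
    # check happens here, before each instruction is executed.
    while not cpu.halted:
        if cpu.pc in bp_set:
            cpu.halted = True  # in the real stub: cpu_->Halt() + SendBpReached()
            return
        cpu.step()

cpu = TinyCpu()
run_until_breakpoint(cpu, {0x3})  # breakpoint from the $Z0,3,8 packet
print(hex(cpu.pc))  # → 0x3
```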
We then get asked to return the current register data (‘g’) and to remove the current breakpoint (‘z’). Removing the breakpoint is pretty much the same as inserting it, just vice versa:
void GdbServer::CmdRemoveBp(const std::vector<std::string> &msg_split) {
std::string msg_resp = "";
if (msg_split[1] == "0" || msg_split[1] == "1") {
msg_resp = "OK";
uint addr = std::stoi(msg_split[2], nullptr, 16);
DBG_LOG_GDB("removed breakpoint at address 0x" << msg_split[2]);
bp_set_.erase(addr);
} else {
DBG_LOG_GDB("watchpoints aren't supported yet");
}
msg_resp = Packetify(msg_resp);
tcp_server_.SendMsg(msg_resp.c_str());
}
After that, only a few memory reads follow, and this is it!
Nothing beats a fancy demo, so I made a video showing how you can use GDB to boot up the Game Boy with a custom logo:
In the video I used the following command to start the TLMBoy:
./tlmboy -r ../roms/tetris.bin --wait-for-gdb
To attach GDB to the simulation, use:
target remote localhost:1337
Once GDB is attached, the simulation halts at PC=0x0, and you are free to throw in some commands. In my case I want to replace the Nintendo logo with my own custom logo. The logo resides at address 0x104 and upwards, hence I replace this data:
set {char[48]} 0x104 = {0x03, 0x22, 0x09, 0x11, 0x02, 0x2e, 0x07, 0x44, \
0x02, 0x22, 0x04, 0x45, 0x01, 0x91, 0x0c, 0x00, 0x09, 0xdb, 0x00, \
0x00, 0x00, 0x00, 0x00, 0x00, 0x22, 0x30, 0x11, 0x90, 0x22, 0x20, \
0x44, 0x70, 0x22, 0x20, 0x65, 0x40, 0x11, 0x90, 0xc0, 0xc0, 0xb9, \
0x90, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00}
As seen in the video, a high-quality “CHCIKEN” logo is rendered instead of the Nintendo logo.
However, changing the boot logo results in a bricked boot process.
The logo keeps being displayed, but it doesn’t advance past this point.
This is Nintendo’s way of preventing the execution of non-licensed games
(see my boot post for more information).
So, I pressed Ctrl + C and called display/7i $pc
to examine the situation.
It can be seen that the Game Boy is stuck in the loop which compares the logo in the cartridge against
the logo in the boot ROM.
1: x/7i $pc
=> 0xe9: jr nz,0x00e9
0xeb: inc hl
0xec: ld a,l
0xed: cp 0x34
0xef: jr nz,0x00e6
0xf1: ld b,0x19
0xf3: ld a,b
The easiest way to resolve this awkward situation is to skip the check. With GDB, this can be achieved by advancing the program counter a few instructions:
set $pc = 0xfc
Alternatively, you can reload the Nintendo logo shortly before the check starts (your custom logo will remain displayed):
set {char[48]} 0x104 = {0xce, 0xed, 0x66, 0x66, 0xcc, 0x0d, 0x00, 0x0b, 0x03, 0x73, \
0x00, 0x83, 0x00, 0x0c, 0x00, 0x0d, 0x00, 0x08, 0x11, 0x1f, \
0x88, 0x89, 0x00, 0x0e, 0xdc, 0xcc, 0x6e, 0xe6, 0xdd, 0xdd, \
0xd9, 0x99, 0xbb, 0xbb, 0x67, 0x63, 0x6e, 0x0e, 0xec, 0xcc, \
0xdd, 0xdc, 0x99, 0x9f, 0xbb, 0xb9, 0x33, 0x3e}
And with that, Tetris finally starts 😊.
So, in this post I covered the basics of the GDB remote serial protocol (GDBRSP) and how one can embed it into a Game Boy emulator (or any other application). Due to the enormous scope of GDBRSP, this post just scratched the surface. Nevertheless, I hope it provides a good starting point for further adventures.
Last but not least, I still want to share some limitations, questions, and thoughts that came across my path during the development.
Let’s start with the limited debugability of the Game Boy’s ROMs.
These are basically a chunk of handcrafted assembly that doesn’t require any specific file format or underlying operating system.
Consequently, there are no such things as debug symbols or calling conventions that could be used by the debugger.
In some cases, I even observed crashes as GDB was trying to unwind call stacks that weren’t really call stacks.
For instance, if you execute step instruction
directly after connecting, GDB (or my TLMBoy) will say “goodbye”.
This is because GDB tries to determine the callstack with a stack pointer that points to 0,
leading to multiple, seemingly random reads to non-mapped addresses.
Unfortunately, there’s not much one can do about it except avoiding commands that lead to undefined behavior.
Another thing that I didn’t consider at first, but that later required some problem solving, was bank switching.
These are used to circumvent the 64 KiB limit imposed by the Game Boy’s 16-bit address bus.
With bank switching, some parts of the ROM are switched out by other parts of the ROM, which weren’t directly accessible prior to the switch.
This mechanism is triggered by writing a specific value in a specific location.
But in debug mode, I might want to write to certain locations to alter the memory’s value, not to trigger a bank switch.
So, how can I distinguish between bank switch and actual memory write?
The best solution I could come up with are so-called custom queries. These can be invoked with
monitor data
from the GDB terminal.
As the name implies, a custom query can convey a custom message that triggers a custom behavior in the stub.
Actually, this is so versatile that probably many other problems can be solved with it as well.
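Under the hood, monitor commands arrive at the stub as qRcmd packets with the command hex-encoded. Decoding them is straightforward; here's a Python sketch:

```python
def decode_monitor(packet_data: str) -> str:
    """Decode the argument of a qRcmd packet, e.g. "qRcmd,666f6f" → "foo"."""
    prefix = "qRcmd,"
    assert packet_data.startswith(prefix)
    return bytes.fromhex(packet_data[len(prefix):]).decode("ascii")

print(decode_monitor("qRcmd,666f6f"))  # → foo
```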
So, this finally concludes my post. If there’s any feedback, be it good or bad, feel free to contact me.
[1]
GDB’s online documentation. The first address to consult when questions about GDBRSP packets arise.
[2]
QEMU and GDBRSP.
[3]
gem5 and GDBRSP.
[4]
Quite old Github repository containing a GDB with Z80 support.
[5]
Discussion about the most recent Z80 GDB patch.
[6]
Most up-to-date Z80 GDB Github repository.
[7] Super detailed and user-oriented post about GDBRSP.
[8] Cool blog post about the GDBRSP.
Heyho! Welcome to the first post of my series "TLMBoy", which is about writing a Game Boy Emulator with SystemC TLM-2.0.
So, if you've always wanted to write a Game Boy emulator or learn SystemC TLM-2.0, you've come to the right place!
I guess writing a Game Boy emulator is nothing innovative (there are currently more than 2000 Game Boy emulators on Github)
and writing software with SystemC isn't exciting either.
But to the best of my knowledge, no one ever tried to combine these two!
The result of my attempt can be found in my
Github Repository.
There's no need to worry if you don't know what SystemC is, or have no clue how a Game Boy works.
The only prerequisite is some C++ knowledge, as SystemC is a library for C++.
This means you should definitely know what a pointer is, but you don't need to pull off some quadruple-singleton-polymorphic-macro C++ stunts.
Also some really basic knowledge of computer architecture is assumed.
The following tutorials will use Linux as an operating system.
However, all dependencies we are using are also available for Windows, so this should work out as well in theory.
Nevertheless, rather than running on native Windows,
I recommend installing WSL, Microsoft's compatibility layer for running Linux on Windows.
Before we begin with anything technical, I want to clarify that most of the Game Boy's technical details and internals presented in
this post series are from third-party sources. It's now more than thirty years since the release of the initial Game Boy,
and a lot of people have spent a lot of time reverse-engineering, writing emulators, or creating tests in their spare time!
Most of this work is a valuable source of information that makes programming a Game Boy emulator quite enjoyable.
The most complete summary of anything Game Boy emulator related can be found in
this Github repo (awesome-gbdev).
Besides that, I often used the gbemu Github as a reference implementation.
This open-source emulator was pretty helpful at some points, especially when I had very detailed questions about certain things.
Anyway, I guess most people know what a Game Boy is, but not many have heard of SystemC yet, so let's get started with a short introduction to SystemC.
Behind the ominous term SystemC there is actually only a simple library for C++. I have the impression that it is often sold as a language based on C++ rather than as a library.
But simply put: SystemC is just a C++ library for so-called discrete-event simulation (DES).
This kind of simulation can be used to model systems where events happen at discrete points in time (woooooosh).
For example, a traffic light could easily be modeled with SystemC.
You turn on different lights at discrete points in time and then turn them off again at another certain time.
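The core idea behind DES is just an event queue ordered by time. Stripped of all SystemC machinery, the traffic-light example boils down to a few lines of Python (my own toy sketch, not SystemC code):

```python
import heapq

def simulate_traffic_light():
    # Schedule state changes at discrete points in time,
    # then process them in chronological order.
    queue = [(0, "green"), (10, "yellow"), (12, "red")]
    heapq.heapify(queue)
    log = []
    while queue:
        time, state = heapq.heappop(queue)
        log.append((time, state))  # a real simulator would switch lights here
    return log

print(simulate_traffic_light())  # → [(0, 'green'), (10, 'yellow'), (12, 'red')]
```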
Although implementing a traffic light is feasible with SystemC, the focus of SystemC is put on modeling digital circuits like CPUs, busses, memories, and so on.
So, all the stuff a Game Boy consists of!
Some years ago, SystemC was extended by the TLM library (Transaction-Level Modelling),
which basically allowed modeling on a higher abstraction level, thus increasing the simulation performance
and making things easier to code. The current version of TLM is 2; this is why the term SystemC TLM-2.0 is often used.
We'll just keep using "SystemC" as a term for "SystemC TLM-2.0" in the following.
A mentionable sidenote about SystemC is its
standardization
by the IEEE (Institute of Electrical and Electronics Engineers, basically the Jedi High Council of electrical engineers).
Many companies in the field of electronic system-level (ESL) design have adopted this standard.
Thus, knowing SystemC might look nice on your curriculum vitae!
In my personal opinion, the best way to learn SystemC is to use it, as most concepts are quite straightforward.
To get familiar with it, I can really recommend the tutorials of Doulos and
asic-world. But you can also read a book like
SystemC: From the Ground Up if you want (I personally learned more from the tutorials).
In this section, we'll take a look at the general design of the Game Boy from a high-level perspective.
Obviously we need to know what we want to model with SystemC before we start with the actual modeling.
So, let's take a look at the following image that provides a rough overview of the Game Boy's components:
I guess the title is pretty self-explanatory ;) The most important features of this cheat sheet are:
Heyho! Here you can download my first paper AMAIX: A Generic Analytical Model for Deep Learning Accelerators free of charge. We successfully submitted this paper to the SAMOS XX conference, where it won the best paper award. I guess it will soon be published in the Springer Lecture Notes in Computer Science (LNCS).
In the next section, I'll briefly describe what we did in this paper using the kind of language I prefer (not that super duper fancy paper-language). But of course, you are also invited to read the paper ;)
As with many technical products, you want to know as early as possible in the development process how good the product you are developing actually is.
This is especially true when there are many competitors and the market is therefore highly competitive.
Such a situation can be found with so-called deep learning accelerators (DLAs) (hardware for the acceleration of AI applications).
This is favoured by the fact that an extreme growth is predicted for this market.
At their AI day in 2019, Qualcomm said that they expect a tenfold growth in revenue from 1.8 billion US dollars in 2018 to 17 billion dollars in 2025.
And this is just for AI accelerators in data centers.
Thus, many large tech-giants (actually all of them), but also small start-ups, are trying to establish a dominant position as early as possible.
If you don't believe it: here's a list with companies currently trying to engage in this market.
So, long story short: if you want to be successful in this market, you need methods to estimate your design's key performance indicators (chip area, power consumption, computing power, etc.). Well-known methods include RTL simulations (e.g. Verilog) or system-level simulations (e.g. SystemC). But if you have a Verilog or SystemC model of your DLA, your project is already in a progressed state.
A method which you can use directly after you had the initial idea for your DLA are so-called analytical models. These models try to estimate a system's KPIs using math or simple algorithms. A pen and paper or an Excel spreadsheet is everything you need to get started with them. The problem of analytical models is their extremely simplifying nature. If your system has any kind of non-determinism or dynamic behavior, the obtained results will be pretty inaccurate for sure. As most compute systems are very dynamic or include non-determinism (like caches), analytical models are usually not of great help. But how well do they work for these emerging deep learning accelerators?
The paper provides you with all the details, so here's just a summarized answer to this question: at least for our examined case study (the NVDLA), we could estimate the execution time pretty well. There are many reasons for that, so let me list a few of them:
Especially the last point is of particular interest, as it allows us to use the so-called roofline model. A significant part of the paper is about how to adapt this model for DLAs and apply it to the NVDLA. The cool thing is that this model takes some of the NVDLA's configurable hardware parameters as an input and gives you the estimated execution time as an output. If you pour this into a Python script, you can evaluate thousands of designs in a few seconds and generate nice graphs like the following, which you can use for design space exploration:
Besides design space exploration there are still some open topics which we haven't studied yet. I think that analytical models could be an interesting addition for DLA compilers. For example, many DLAs support a so called Winograd convolution which basically allows you to convolute with less operations compared to a standard convolution. But the downside is, that you need more weights leading to a higher memory bandwidth consumption. In my eyes a smart compiler could analyse the system and choose the right operation depending on the bottleneck of the system.
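The roofline idea mentioned above can be sketched in a few lines. Note that the function and the numbers below are purely illustrative and not taken from the paper or from the NVDLA:

```python
def roofline_time(num_ops, num_bytes, peak_ops_per_s, mem_bw_bytes_per_s):
    """Roofline estimate: execution time is bounded by the slower of
    compute (ops / peak performance) and memory (bytes / bandwidth)."""
    return max(num_ops / peak_ops_per_s, num_bytes / mem_bw_bytes_per_s)

# A made-up layer: 2 GOPs of compute, 400 MB of memory traffic,
# on a 1 TOPS accelerator with 1 GB/s DRAM bandwidth
t = roofline_time(2e9, 4e8, 1e12, 1e9)
print(t)  # 0.4 -> this layer is memory-bound
```

This is also where the Winograd trade-off mentioned below shows up: fewer operations lower the first term, while more weight traffic raises the second.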
Anyway, this was the "short" summary of our paper. If you have any questions, feel free to write me an e-mail (see About or use the address in the paper).
Contents
Hello there! In this post we will program a guitar tuner with Python. This project is a pure software project, so there is no soldering or tinkering involved. You just need a computer with a microphone (or an audio interface) and Python. Of course, the algorithms presented in this post are not bound to Python, so feel free to use any other language if you don't mind the additional translation (however, I recommend not using Tcl, as it is "the best-kept secret in the software industry" and we better keep it a secret, lol).
We will start with analyzing the problem at hand, which is probably a detuned guitar, and then move on to solving this problem using math and algorithms. The focus of this post lies on understanding the methods we use and what their pros and cons are. For those who want a guitar tuner in under 60 seconds: my Github repo ;)
Let's start with a really basic introduction to music theory and guitars. First, we have to define some important musical terms, as an exact distinction will avoid some ambiguities:
With these definitions in mind, we will now look at how a guitar works on a musical level. I guess most of you know this, but the "default" guitar has 6 strings, which are usually tuned in the standard tuning EADGBE, where each note refers to one of the strings. For example, the lowest string is tuned to the note E_{2}. This means that the string has a pitch of 82.41Hz, since this is how the tone E_{2} is defined. If it had a pitch of 81Hz, our guitar would be out of tune and we would have to use the tuners on the headstock to get it back in tune. Of course, all other notes can be assigned to a certain pitch as well:
Note, that for this post we assume an equal temperament
and a concert pitch of A_{4}=440Hz which covers probably 99% of modern music.
The cool thing about the equal temperament is that it defines the notes and pitches in half step fashion described by the following formula:
$$f(i) = f_0 \cdot 2^{i/12} $$
So, if you have a pitch \(f_0\), for example A_{4} at 440Hz, and you want to increase it by one half step to an A#_{4} then you have to multiply
the pitch 440Hz with \(2^{1/12}\) resulting in 466.16Hz.
We can also derive an inverse formula which tells us how many half steps \(i\) lie between an examined pitch \(f_i\) and a reference pitch \(f_0\):
$$ i = 12 \cdot \log_2 \left( \frac{f_i}{f_0} \right) $$
This also allows us to assign a note to a pitch, or at least the note which is closest to the pitch.
As you can imagine, this formula will be of particular interest for us,
because once we can extract the pitch from a guitar recording, we want to know the closest note and how far away it is.
This leads us to the following Python function
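A sketch of such a function is shown below; it is essentially the same find_closest_note that reappears in the full tuner code later in this post:

```python
import numpy as np

CONCERT_PITCH = 440
ALL_NOTES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def find_closest_note(pitch):
    """Finds the closest note for a given pitch in Hz.
    Returns the note name (e.g. A4, G#3) and that note's exact pitch."""
    # number of half steps between the pitch and the concert pitch A4
    i = int(np.round(np.log2(pitch / CONCERT_PITCH) * 12))
    closest_note = ALL_NOTES[i % 12] + str(4 + (i + 9) // 12)
    closest_pitch = CONCERT_PITCH * 2 ** (i / 12)
    return closest_note, closest_pitch

print(find_closest_note(81))  # a detuned low E string -> ('E2', 82.4...)
```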
As a next step, we need to record the guitar and determine the pitch of the audio signal. This is easier said than done, as you will see ;)
After reading the following section you hopefully know what is meant by pitch detection and which algorithms are suited for this. As already mentioned above, pitch and frequencies are not the same. This might sound abstract at first, so let's "look" at an example.
The example is a short recording of me playing the note A_{4} with a pitch of 440Hz on a guitar. The same example, but now visualized as a time/value graph, looks as follows:
As you can see the signal has a period length of roughly 2.27ms which corresponds to a frequency of 440Hz. So far so good. But you can also see that the signal is far away from being a pure tone. So, what is happening there?
To answer this question we need to make use of the so-called Discrete Fourier Transform (DFT).
It's basically the allround tool of any digital signal processing engineer.
From a mathematical point of view, it shows how a discrete signal can be decomposed into a sum of cosine functions
oscillating at different frequencies.
Or in musical terms: the DFT shows which pure tones can be found in an audio recording.
If you are interested in the mathematical details of the DFT, I recommend you to read my previous
post.
But no worries, the most important aspects will be repeated in this post.
The cool thing about the DFT is that it provides us with a so-called magnitude spectrum. For the given example it looks like this:
import scipy.io.wavfile
import matplotlib.pyplot as plt
import numpy as np
from scipy.fftpack import fft

# read the recording and compute its magnitude spectrum
sampleFreq, myRecording = scipy.io.wavfile.read("example1.wav")
freqX = np.arange(0, sampleFreq/2, sampleFreq/len(myRecording))  # frequency axis
absFreqSpectrum = abs(fft(myRecording))  # magnitude spectrum

# plot the first half of the spectrum (the second half is redundant)
plt.plot(freqX, absFreqSpectrum[:len(myRecording)//2])
plt.ylabel('|X(n)|')
plt.xlabel('frequency [Hz]')
plt.show()
On the x-axis you can see the frequencies of the pure tones, while the y-axis displays their intensity.
The spectrum reveals some interesting secrets which you couldn't see in the time domain. As expected there is a strong intensity of the pure tone at 440Hz. But there are other significant peaks at integer multiples of 440Hz. For example, 880Hz, 1320Hz, etc. If you are familiar with music you may know the name of these peaks: harmonics or overtones.
The reason for the overtones is quite simple. When you hit a guitar string you excite it to vibrate at certain frequencies. Especially frequencies which form standing waves can vibrate for a long time. These fulfill the boundary conditions that the string cannot move at the points where it is attached to the guitar (bridge and nut). Thus, multiple overtones are also excited which are all multiples of the fundamental frequency. The following GIF visualizes this:
The overall set of harmonics and how they are related is called timbre.
A timbre is what makes your guitar sound like a guitar and not like any other instrument.
This is pretty cool on the one hand, but it makes pitch detection a real challenge.
Because at this point you might already have had an idea for a guitar tuner: create a DFT spectrum, determine the frequency of the highest peak, done.
Well, for the spectrum given above this might work, but there are many cases in which you will get wrong results.
The first reason is that the fundamental frequency does not always create the highest peak.
Although it is not necessarily the highest peak, it is still the one that determines the pitch.
This is the reason why pitch detection is not just simple frequency detection!
The second reason is that the power of the guitar signal is distributed over a large frequency band.
By selecting only the highest peak, the algorithm would be very prone to narrowband noise.
In the example spectrum given above, you can see a high peak at 50Hz, which is caused by mains hum.
Although the peak is relatively high, it does not determine the overall sound impression of the recording.
Or did you feel like the 50Hz noise was very present?
The complexity of this problem has led to a number of different pitch detection algorithms. In order to choose the right algorithm, we have to think about which requirements a guitar tuner needs to fulfill. The most important requirements surely are:
In the following we will start with programming a simple maximum-frequency-peak algorithm. As already mentioned above, this method may not work very well, since the fundamental frequency is not guaranteed to always have the highest peak. However, this method is quite simple and a gentle introduction to the subject.
In the second section, a more sophisticated algorithm using the Harmonic Product Spectrum (HPS) is implemented. It is based on the simple tuner, so don't skip the first section ;)
Our first approach will be a simple guitar tuner using the DFT peak approach. Usually the DFT algorithm is applied to the whole duration of a signal. However, our guitar tuner is a real-time application where there is no concept of a "whole signal". Furthermore, as we are going to play several different notes, only the last few seconds are relevant for pitch detection. So, instead we use the so-called discrete Short-Time Fourier Transform (STFT), which is basically just the DFT applied to the most recent samples. You can imagine it as a kind of window where new samples push out the oldest samples. Note that the spectrum is now a so-called spectrogram, as it varies over time.
Before we start with programming our tuner, we have to think about some design considerations concerning the DFT algorithm. Can the DFT fulfill the requirements we proposed above?
Let's begin with the frequency range.
The DFT allows you to analyze frequencies in the range of \( f < f_s / 2 \), with \(f_s\) being the sample frequency.
Typical sound recording devices use a sampling rate of around 40kHz giving us a frequency range of \(f < 20kHz\).
This is more than enough to even capture all the overtones.
Note, that the frequency range is an inherent property of the DFT algorithm, but there is also a close relation to the
Nyquist–Shannon sampling theorem.
The theorem states that you cannot extract all the information from a signal if the highest occurring frequencies
are greater than \(f_s / 2 \). This means the DFT is already working at the theoretical limit.
As a next point, we look at the frequency resolution of the DFT, which is (for details see my DFT post): $$ f_s / N \approx 1 / t_{window} \, [\mathrm{Hz}]$$ with \(N\) being the window size in samples and \(t_{window}\) the window size in seconds. The resolution in hertz is approximately the reciprocal of the window size in seconds. So, if we have a window of 500ms, then our frequency resolution is 2Hz. This is where things become tricky, as a larger window results in a better frequency resolution but negatively affects the delay. If we consider frequency resolution, up to a certain extent, more important than delay, a window size of 1s sounds like a good choice. With this setting we achieve a frequency resolution of 1Hz.
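This tradeoff is easy to play with in code. The little helper below is not part of the tuner, just a quick sketch of the relation between window size, resolution, and delay:

```python
def dft_tradeoff(sample_freq, window_size):
    """Returns (frequency resolution in Hz, window delay in seconds)
    for a DFT window of window_size samples at the given sample frequency."""
    resolution_hz = sample_freq / window_size  # f_s / N
    delay_s = window_size / sample_freq        # t_window
    return resolution_hz, delay_s

print(dft_tradeoff(44100, 44100))  # (1.0, 1.0): 1Hz resolution, 1s delay
print(dft_tradeoff(44100, 22050))  # (2.0, 0.5): coarser resolution, less delay
```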
So far so good. If you convert all this knowledge to some code, your result might look like this:
import sounddevice as sd
import numpy as np
import scipy.fftpack
import os
# General settings
SAMPLE_FREQ = 44100 # sample frequency in Hz
WINDOW_SIZE = 44100 # window size of the DFT in samples
WINDOW_STEP = 21050 # step size of window
WINDOW_T_LEN = WINDOW_SIZE / SAMPLE_FREQ # length of the window in seconds
SAMPLE_T_LENGTH = 1 / SAMPLE_FREQ # length between two samples in seconds
windowSamples = [0 for _ in range(WINDOW_SIZE)]
# This function finds the closest note for a given pitch
# Returns: note (e.g. A4, G#3, ..), pitch of the tone
CONCERT_PITCH = 440
ALL_NOTES = ["A","A#","B","C","C#","D","D#","E","F","F#","G","G#"]
def find_closest_note(pitch):
i = int(np.round(np.log2(pitch/CONCERT_PITCH)*12))
closest_note = ALL_NOTES[i%12] + str(4 + (i + 9) // 12)
closest_pitch = CONCERT_PITCH*2**(i/12)
return closest_note, closest_pitch
# The sounddevice callback function
# Provides us with new data once WINDOW_STEP samples have been fetched
def callback(indata, frames, time, status):
global windowSamples
if status:
print(status)
if any(indata):
windowSamples = np.concatenate((windowSamples,indata[:, 0])) # append new samples
windowSamples = windowSamples[len(indata[:, 0]):] # remove old samples
magnitudeSpec = abs( scipy.fftpack.fft(windowSamples)[:len(windowSamples)//2] )
for i in range(int(62/(SAMPLE_FREQ/WINDOW_SIZE))):
magnitudeSpec[i] = 0 #suppress mains hum
maxInd = np.argmax(magnitudeSpec)
maxFreq = maxInd * (SAMPLE_FREQ/WINDOW_SIZE)
closestNote, closestPitch = find_closest_note(maxFreq)
os.system('cls' if os.name=='nt' else 'clear')
print(f"Closest note: {closestNote} {maxFreq:.1f}/{closestPitch:.1f}")
else:
print('no input')
# Start the microphone input stream
try:
with sd.InputStream(channels=1, callback=callback,
blocksize=WINDOW_STEP,
samplerate=SAMPLE_FREQ):
while True:
pass
except Exception as e:
print(str(e))
This code should work out of the box, assuming that the corresponding Python libraries are installed.
Here are some out-of-code comments which explain the individual parts in more detail:
Imports: basic imports such as numpy for the math stuff and sounddevice for capturing the microphone input.
Global settings: sample frequency, window size, and the step size of the DFT window.
find_closest_note: the function for finding the nearest note for a given pitch. See section "Guitars & Pitches" for the detailed explanation.
callback: these lines are the heart of our simple guitar tuner, so let's have a closer look.
The incoming samples are appended to an array while the old samples are removed.
Thus, a window of WINDOW_SIZE samples is obtained.
The magnitude spectrum is then obtained by using the Fast Fourier Transform.
Note that one half of the spectrum only provides redundant information.
Next, the mains hum is suppressed by simply setting all frequencies below 62Hz to zero.
This is still sufficient for a drop C tuning (C_{2}=65.4Hz).
Then the highest frequency peak is determined and used to get the closest pitch and note.
Finally, the results are printed. Depending on your operating system, a different clear function has to be called.
At the bottom of the script, the input stream is initialized and runs in an infinite loop.
Once enough data is sampled, the callback function is called.
I also made a javascript version which works directly from your browser. Note that it uses slightly different parameters. The corresponding magnitude spectrum is also visualized:
If you tried to tune your guitar using this tuner, you probably noticed that it doesn't work very well. As expected, the main problem is harmonic errors, as the overtones are often more intense than the actual fundamental frequency. A way to deal with this problem is the Harmonic Product Spectrum, as the next section will show.
In this section we will refine our simple tuner by using the so-called Harmonic Product Spectrum (HPS) which was introduced by A. M. Noll in 1969.
The idea behind it is quite simple yet clever.
The Harmonic Product Spectrum is a multiplication of \(R\) magnitude spectrums with different frequency scalings:
$$ Y(f) = \prod_{r=1}^{R} |X(fr)| $$
With \(X(f)\) being the magnitude spectrum of the signal.
I think that this is hard to explain in words, so let's take a look at a visualization for \(R=4\):
In the upper half of the visualization you can see the magnitude spectrums for the 440Hz guitar tone example.
Each with a different frequency scaling factor \(r\).
These magnitude spectrums are multiplied in a subsequent step resulting in the Harmonic Product Spectrum \(|Y(f)|\).
As the frequency scaling is always an integer number, the product vanishes for non-fundamental frequencies.
Thus, the last step is simply taking the highest peak of the HPS:
$$ f_{max} = \underset{f}{\operatorname{argmax}} \, |Y(f)| $$
For the given example the peak at 440Hz is perfectly determined.
In terms of frequency resolution and delay, the HPS tuner is pretty similar to the simple DFT tuner, as the DFT is the basis of the HPS.
However, as the HPS also uses the harmonics to determine the pitch, a higher frequency resolution can be achieved if the spectrum is
interpolated and upsampled before the HPS is computed.
Note that upsampling and interpolating does not add any information to the spectrum, but it avoids information loss, as the spectrum is effectively downsampled
when using the different frequency scalings.
Let me illustrate this by using an intuitive example.
Assume we have a DFT with a frequency resolution of 1Hz
and a peak at 1761Hz, of which we know that it is the 4th harmonic of a fundamental frequency around 440Hz.
With this information, you can calculate \(1761/4=440.25\) and conclude that the fundamental frequency is 440.25Hz rather than 440Hz.
The same principle is used by the HPS algorithm.
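Before looking at the full tuner, the core of the HPS can be sketched on a synthetic spectrum. In the example below, the second harmonic is the strongest peak, so a plain argmax picks the wrong bin, while the HPS recovers the fundamental (the helper name, bin numbers, and amplitudes are made up for illustration):

```python
import numpy as np

def harmonic_product_spectrum(magnitude_spec, num_hps):
    """Multiplies the spectrum with num_hps-1 downsampled copies of itself."""
    n = len(magnitude_spec) // num_hps
    hps = np.copy(magnitude_spec[:n])
    for r in range(2, num_hps + 1):
        hps *= magnitude_spec[: n * r : r]  # spectrum scaled by factor r
    return hps

# synthetic spectrum: fundamental at bin 60, overtones at bins 120/180/240,
# with the 2nd harmonic being the strongest peak
spec = np.zeros(1000)
for harmonic_bin, amp in zip([60, 120, 180, 240], [0.5, 1.0, 0.4, 0.3]):
    spec[harmonic_bin] = amp

print(np.argmax(spec))                                 # 120 -> wrong pitch
print(np.argmax(harmonic_product_spectrum(spec, 4)))   # 60  -> correct pitch
```

In a real recording the peaks are not perfectly aligned on integer bins, which is exactly why the actual tuner interpolates the spectrum first.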
A python version of a HPS guitar tuner may look like this:
'''
Guitar tuner script based on the Harmonic Product Spectrum (HPS)
MIT License
Copyright (c) 2021 chciken
'''
import copy
import os
import numpy as np
import scipy.fftpack
import sounddevice as sd
import time
# General settings that can be changed by the user
SAMPLE_FREQ = 48000 # sample frequency in Hz
WINDOW_SIZE = 48000 # window size of the DFT in samples
WINDOW_STEP = 12000 # step size of window
NUM_HPS = 5 # max number of harmonic product spectrums
POWER_THRESH = 1e-6 # tuning is activated if the signal power exceeds this threshold
CONCERT_PITCH = 440 # defining a1
WHITE_NOISE_THRESH = 0.2 # everything under WHITE_NOISE_THRESH*avg_energy_per_freq is cut off
WINDOW_T_LEN = WINDOW_SIZE / SAMPLE_FREQ # length of the window in seconds
SAMPLE_T_LENGTH = 1 / SAMPLE_FREQ # length between two samples in seconds
DELTA_FREQ = SAMPLE_FREQ / WINDOW_SIZE # frequency step width of the interpolated DFT
OCTAVE_BANDS = [50, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600]
ALL_NOTES = ["A","A#","B","C","C#","D","D#","E","F","F#","G","G#"]
def find_closest_note(pitch):
"""
This function finds the closest note for a given pitch
Parameters:
pitch (float): pitch given in hertz
Returns:
closest_note (str): e.g. a, g#, ..
closest_pitch (float): pitch of the closest note in hertz
"""
i = int(np.round(np.log2(pitch/CONCERT_PITCH)*12))
closest_note = ALL_NOTES[i%12] + str(4 + (i + 9) // 12)
closest_pitch = CONCERT_PITCH*2**(i/12)
return closest_note, closest_pitch
HANN_WINDOW = np.hanning(WINDOW_SIZE)
def callback(indata, frames, time, status):
"""
Callback function of the InputStream method.
That's where the magic happens ;)
"""
# define static variables
if not hasattr(callback, "window_samples"):
callback.window_samples = [0 for _ in range(WINDOW_SIZE)]
if not hasattr(callback, "noteBuffer"):
callback.noteBuffer = ["1","2"]
if status:
print(status)
return
if any(indata):
callback.window_samples = np.concatenate((callback.window_samples, indata[:, 0])) # append new samples
callback.window_samples = callback.window_samples[len(indata[:, 0]):] # remove old samples
# skip if signal power is too low
signal_power = (np.linalg.norm(callback.window_samples, ord=2)**2) / len(callback.window_samples)
if signal_power < POWER_THRESH:
os.system('cls' if os.name=='nt' else 'clear')
print("Closest note: ...")
return
# avoid spectral leakage by multiplying the signal with a hann window
hann_samples = callback.window_samples * HANN_WINDOW
magnitude_spec = abs(scipy.fftpack.fft(hann_samples)[:len(hann_samples)//2])
# suppress mains hum, set everything below 62Hz to zero
for i in range(int(62/DELTA_FREQ)):
magnitude_spec[i] = 0
# calculate average energy per frequency for the octave bands
# and suppress everything below it
for j in range(len(OCTAVE_BANDS)-1):
ind_start = int(OCTAVE_BANDS[j]/DELTA_FREQ)
ind_end = int(OCTAVE_BANDS[j+1]/DELTA_FREQ)
ind_end = ind_end if len(magnitude_spec) > ind_end else len(magnitude_spec)
avg_energy_per_freq = (np.linalg.norm(magnitude_spec[ind_start:ind_end], ord=2)**2) / (ind_end-ind_start)
avg_energy_per_freq = avg_energy_per_freq**0.5
for i in range(ind_start, ind_end):
magnitude_spec[i] = magnitude_spec[i] if magnitude_spec[i] > WHITE_NOISE_THRESH*avg_energy_per_freq else 0
# interpolate spectrum
mag_spec_ipol = np.interp(np.arange(0, len(magnitude_spec), 1/NUM_HPS), np.arange(0, len(magnitude_spec)),
magnitude_spec)
mag_spec_ipol = mag_spec_ipol / np.linalg.norm(mag_spec_ipol, ord=2) #normalize it
hps_spec = copy.deepcopy(mag_spec_ipol)
# calculate the HPS
for i in range(NUM_HPS):
tmp_hps_spec = np.multiply(hps_spec[:int(np.ceil(len(mag_spec_ipol)/(i+1)))], mag_spec_ipol[::(i+1)])
if not any(tmp_hps_spec):
break
hps_spec = tmp_hps_spec
max_ind = np.argmax(hps_spec)
max_freq = max_ind * (SAMPLE_FREQ/WINDOW_SIZE) / NUM_HPS
closest_note, closest_pitch = find_closest_note(max_freq)
max_freq = round(max_freq, 1)
closest_pitch = round(closest_pitch, 1)
callback.noteBuffer.insert(0, closest_note) # note that this is a ringbuffer
callback.noteBuffer.pop()
os.system('cls' if os.name=='nt' else 'clear')
if callback.noteBuffer.count(callback.noteBuffer[0]) == len(callback.noteBuffer):
print(f"Closest note: {closest_note} {max_freq}/{closest_pitch}")
else:
print(f"Closest note: ...")
else:
print('no input')
try:
print("Starting HPS guitar tuner...")
with sd.InputStream(channels=1, callback=callback, blocksize=WINDOW_STEP, samplerate=SAMPLE_FREQ):
while True:
time.sleep(0.5)
except Exception as exc:
print(str(exc))
The basic code has many things in common with the simple DFT tuner, but of course the algorithmic parts are pretty different. Furthermore, some signal processing methods were added in order to increase the signal quality. These methods could also be applied to the DFT tuner. In the following I will provide some comments on the code:
Signal power: first, the signal power is calculated. If there is no sound, we don't need to do the signal processing part.
Hann window: the signal is multiplied with a Hann window to reduce
spectral leakage.
Mains hum: everything below 62Hz is set to zero. This is a quite important signal enhancement.
Octave bands: the average energy per frequency is calculated for each octave band.
If the energy of a given frequency is below WHITE_NOISE_THRESH times this average, it is set to zero.
With this method we can reduce white noise or noise which is very close to white noise (note that white noise has a flat spectral distribution).
This is necessary as the HPS method does not work so well if there is a lot of white noise.
Interpolation: here the DFT spectrum is interpolated. We need to do this as we are required to downsample the spectrum in the later steps.
Imagine there is a perfect peak at a given frequency and all the frequencies next to it are zero.
If we now downsample the spectrum, there is a certain risk that this peak is simply skipped.
This can be avoided with an interpolated spectrum, as the peaks are "smeared" over a larger area.
HPS loop: the heart of the HPS algorithm. Here the frequency-scaled spectrums are multiplied.
Output: basically the same as in the DFT tuner, but with a majority vote filter.
The note is only printed if the previous notes in the ring buffer are the same.
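The interpolation argument above can be demonstrated with a tiny, artificial example: a sharp peak at an odd bin disappears when the spectrum is downsampled by two, but survives if the spectrum is first interpolated to a twice finer grid:

```python
import numpy as np

spec = np.zeros(16)
spec[5] = 1.0                    # a single sharp peak at an odd bin
print(spec[::2].max())           # 0.0 -> downsampling by 2 skips the peak

# interpolate to a 2x finer grid first (as done with NUM_HPS in the tuner)
fine = np.interp(np.arange(0, 16, 0.5), np.arange(16), spec)
print(fine[::2].max())           # 1.0 -> the peak survives the downsampling
```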
Again, I also made a javascript version of this with somewhat reduced signal enhancement, as javascript is not really made for real-time signal processing.
If you compare this tuner to the previous simple tuner, you will probably notice that it already works far more accurately. In fact, when plugging my guitar directly into the computer with an audio interface, it works perfectly. When using a simple microphone, I rarely notice some harmonic errors, but in general, tuning the guitar is possible.
Other people also sometimes observed these harmonic errors (thank you for the feedback, Valentin), so I had to investigate.
By analyzing some spectrums where the pitch was incorrectly identified, I came across a fundamental theoretical weakness of the algorithm.
If one of the overtones is missing, the fundamental frequency is eventually multiplied by zero and consequently vanishes from the HPS.
With the current implementation, this situation might also occur if one of the harmonics is so weak that it is considered white noise.
To counteract this phenomenon, I added an early exit to the HPS loop which keeps the result of the previous iteration if the product becomes all-zero (the `if not any(tmp_hps_spec): break` part in the code above).
In this post I showed how to write a guitar tuner using Python.
We first started with a simple DFT peak detection algorithm and then refined it using a Harmonic Product Spectrum approach,
which already gave us a solid guitar tuner.
In case of harsh environments or missing overtones, the HPS tuner sometimes suffers from harmonic errors,
so in the future I might make more guitar tuners using different pitch detection algorithms based on cepstrums (yes, this is correct, you are not having a stroke)
or correlation.
If you would like to add or criticize something, please contact me :)
You can do this by writing an e-mail to me (see About).
In this post we will cover one of the most important transforms of digital signal processing: the Discrete Fourier Transform (DFT).
If you don't know what a transform is, I recommend reading the introduction of my
post
about the z-transform. If you don't want to read it, I'll shortly summarize it:
the goal of most transforms is to apply some mathematical operation to data which then reveals
some information we could not see from its original representation.
In the case of the Fourier transform, a discrete signal is transformed, allowing us to see which frequencies the signal comprises.
This may sound insignificant, but it actually offers a whole world of new opportunities.
It is the essential mathematical ingredient of many applications such as guitar tuners (in my next post I'll show how to program a guitar tuner
using Python and the DFT), noise filtering, or x-ray image sharpening.
As the name may suggest, it is the discrete equivalent of the Fourier Transform.
Many lectures and courses teach the "standard" Fourier Transform first and then move on to the discrete variants.
However, for this post you don't need to know anything about the Fourier Transform.
We will start at nearly 0% and then try to derive, understand, and use the DFT.
The only prerequisite is some basic knowledge about linear algebra and complex numbers.
If you know what a matrix is and that \(e^{i x} = \cos(x) + i \cdot \sin(x) \), then you are already good to go ;)
As previously mentioned, the DFT can help us to determine which frequencies a discrete signal is made of. To make things a little bit easier, we have to regard discrete signals from a new perspective. When imagining discrete signals, people (or at least I) often think about a bunch of values spread on a time axis: However, another way of representing signals is using vectors. So, if we have the signal \((2,1)\), the corresponding vector would look like this: Each dimension corresponds to a different value in time. If the signal has more values, we need to increase the number of dimensions. I have to admit that at a certain number of dimensions, it is hard to imagine what a vector would look like (for me, this is already the case at the humble number of 4 dimensions). But the general concept of thinking of signals as vectors should be clear, I hope.
The reason why we want to regard signals as vectors is that the DFT can be interpreted as a rotation of a signal/vector! Using this interpretation has some nice advantages. First, a rotation is something familiar from our everyday lives. Second, using rotation matrices, it is quite simple to derive the inverse discrete Fourier transform.
As the DFT is some kind of rotation, we need to explore the characteristics of rotation matrices as the next step.
So, which mathematical properties does a rotation matrix have?
The answer is kind of intuitive: A rotation shall not change the length of a vector
and the angle between two vectors has to stay the same. You can also see this in the example.
No matter how you change the angles, the vectors of the chciken box keep their length and their relative angles.
Expressing this mathematically will lead us to the following relationship:
$$R^{-1} = R^T$$
This means that the inverse of the rotation matrix exists and it's simply the matrix transposed (this is really nice)!
Because generally, inverting arbitrary matrices might require some fancy algorithms, and in some cases, an inverse might not even exist.
Transposing, in contrast, is basically one of the simplest things to do.
Just mirror the elements at the diagonal, and you are done.
Note that this concept is not only valid for 2 or 3 dimensions, but for an arbitrary number of dimensions.
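These properties are easy to verify numerically. The quick sketch below checks \(R^{-1} = R^T\) and the length preservation for an ordinary 2-D rotation matrix:

```python
import numpy as np

theta = 0.7  # some arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(R.T @ R, np.eye(2)))  # True: the transpose is the inverse
v = np.array([3.0, 4.0])
print(np.isclose(np.linalg.norm(R @ v), np.linalg.norm(v)))  # True: length is preserved
```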
If you want to see the derivation:
From this property we can derive a further one: the rows of a rotation matrix are perpendicular/orthogonal. To be more exact: they are orthonormal (orthogonal and of length 1). Therefore, their dot products are 0, unless you take the dot product of a row with itself. The derivation can be found here:
As a next step, we want to show that the DFT can be interpreted as a complex rotation. So, let's take a look at the definition of the DFT, which you can find in any textbook or online article: $$ X(n)=\sum_{k=0}^{N-1} x(k) \cdot e^{-i\frac{2\pi}{N}kn} $$ Basically, we have our discrete signal \(x(k)\) with a length \(N\). This signal is multiplied with some expression and then summed up. The transformed signal is \(X(n)\) (this domain is also called the frequency domain) and has the same number of elements as its time domain counterpart \(x(k)\). Note that the common convention is to represent the frequency domain by capital letters. The nice thing about this formula is that we can also represent it by a matrix multiplication: $$ \begin{pmatrix} X(0) \\ X(1) \\ .. \end{pmatrix} = \begin{pmatrix} e^{-i\frac{2\pi}{N}(0 \cdot 0)} & e^{-i\frac{2\pi}{N}(0 \cdot 1)} & ...\\ e^{-i\frac{2\pi}{N}(1 \cdot 0)} & e^{-i\frac{2\pi}{N}(1 \cdot 1)} & ...\\ ... & ... & ...\\ \end{pmatrix} \cdot \begin{pmatrix} x(0) \\ x(1) \\ ... \end{pmatrix} $$ If that matrix is a rotation matrix, then it should also be easy to define the inverse transform. Since \(R^{-1} = R^T\) applies to rotation matrices, we can simply do the following trick: $$X = R \cdot x $$ $$R^T \cdot X = x$$ So if we are in the frequency domain, we multiply with the transposed matrix to get back into the time domain!
To prove that the matrix is a rotation matrix, we will check if the rows are orthogonal using the dot product
(see Derivation for orthogonality above to understand why we can use the dot product here):
$$
\langle r_n, r_m \rangle = r_n \cdot r_m^T = \sum_{k=0}^{N-1} e^{-i\frac{2\pi}{N}km} \cdot e^{+i\frac{2\pi}{N}kn}
= \sum_{k=0}^{N-1} e^{i\frac{2\pi}{N}k(n-m)}
$$
Note that for complex vectors, transposing in this dot product also complex conjugates the values (the sign of the imaginary part switches); this combination of transposing and conjugating is called the conjugate (Hermitian) transpose.
Next, we have to distinguish between two cases.
First we have \(m \neq n\) (two different rows).
Using the closed-form expression for a finite geometric series:
$$
\sum_{k=0}^{N-1} x^k = \frac{x^N-1}{x-1}
$$
We get:
$$
\sum_{k=0}^{N-1} e^{i\frac{2\pi}{N}k(n-m)} = \frac{e^{i2\pi(n-m)}-1}{e^{i\frac{2\pi}{N}(n-m)}-1} = 0
$$
As \(e^{i2\pi(n-m)} = 1\) for any integer \(n-m\), the numerator vanishes while the denominator does not. The dot product of two different rows is therefore zero, which means they are orthogonal. Nice!
Now we have to check the dot product of a row with itself (\(n=m\)) which can be interpreted as the squared length of the row vector:
$$
\sum_{k=0}^{N-1} e^{0} = \sum_{k=0}^{N-1} 1 = N
$$
As we can see, the squared length is not \(1\) but \(N\).
This means that the length of the row vector is \(\sqrt{N}\) times larger
than we need it to be for orthonormality (right now it is only orthogonal).
We can deal with this by simply putting a normalization factor in front of the DFT:
$$ X(n)= \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} x(k) \cdot e^{-i\frac{2\pi}{N}kn} $$
The length of a row vector is now \(1\) thanks to the normalization factor. In the literature, this is sometimes called the normalized DFT (NDFT).
While this seems more convenient from a geometrical point of view,
the version without a scaling factor is the standard DFT for digital signal processing.
Why?
According to this site, the main reason for omitting the scaling factor is to save computations when calculating the DFT.
This makes perfect sense, as digital signal processing is often about optimizing mathematical algorithms.
We will use the DFT without the scaling factor, since following conventions is always a good thing to do.
Furthermore, having only an orthogonal but not orthonormal matrix does not really have to bother us, as the next section will show.
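Before moving on, the orthogonality result can be sanity-checked numerically. A minimal sketch assuming NumPy: the Gram matrix of the unnormalized DFT rows should be \(N\) on the diagonal and \(0\) everywhere else.

```python
import numpy as np

N = 16
n = np.arange(N).reshape(-1, 1)
W = np.exp(-2j * np.pi * n * n.T / N)   # unnormalized DFT matrix, rows r_0..r_{N-1}

# Gram matrix of the rows: entry (n, m) is <r_n, r_m> = r_n . conj(r_m).
G = W @ W.conj().T

# Off-diagonal entries vanish (orthogonal rows); diagonal entries equal N,
# i.e. each row has squared length N, as derived above.
assert np.allclose(G, N * np.eye(N))
```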
In this section we will derive the inverse DFT. In many applications you want to transform your signal to the frequency domain,
do some processing, and then transform it back to the time domain.
This does, of course, require an inverse transformation.
As already explained above, we can simply take the inverse matrix of the DFT to get back to the time domain.
Using the non-normalized DFT (which is the default),
the lack of orthonormality looks like an issue at first, because we cannot simply use \(R^T=R^{-1}\).
But we can use a little trick: factor out a \(\sqrt{N}\) to make the matrix orthonormal.
Starting with the DFT as a matrix representation:
$$
\begin{pmatrix}
X(0) \\ X(1) \\ ..
\end{pmatrix}
=
\begin{pmatrix}
e^{-i\frac{2\pi}{N}(0 \cdot 0)} & e^{-i\frac{2\pi}{N}(0 \cdot 1)} & ...\\
e^{-i\frac{2\pi}{N}(0 \cdot 1)} & e^{-i\frac{2\pi}{N}(1 \cdot 1)} & ...\\
... & ... & ...\\
\end{pmatrix}
\cdot
\begin{pmatrix}
x(0) \\ x(1) \\ ...
\end{pmatrix}
$$
We can factor out a \(\sqrt{N}\):
$$
\begin{pmatrix}
X(0) \\ X(1) \\ ..
\end{pmatrix}
= \sqrt{N}
\begin{pmatrix}
\frac{1}{\sqrt{N}} e^{-i\frac{2\pi}{N}(0 \cdot 0)} & \frac{1}{\sqrt{N}} e^{-i\frac{2\pi}{N}(0 \cdot 1)} & ...\\
\frac{1}{\sqrt{N}} e^{-i\frac{2\pi}{N}(0 \cdot 1)} & \frac{1}{\sqrt{N}} e^{-i\frac{2\pi}{N}(1 \cdot 1)} & ...\\
... & ... & ...\\
\end{pmatrix}
\cdot
\begin{pmatrix}
x(0) \\ x(1) \\ ...
\end{pmatrix}
$$
Now the matrix is orthonormal, so we can use the relation \(R^T = R^{-1}\) to obtain:
$$
\frac{1}{N}
\begin{pmatrix}
e^{+i\frac{2\pi}{N}(0 \cdot 0)} & e^{+i\frac{2\pi}{N}(1 \cdot 0)} & ...\\
e^{+i\frac{2\pi}{N}(1 \cdot 0)} & e^{+i\frac{2\pi}{N}(1 \cdot 1)} & ...\\
... & ... & ...\\
\end{pmatrix}
\cdot
\begin{pmatrix}
X(0) \\ X(1) \\ ..
\end{pmatrix}
=
\begin{pmatrix}
x(0) \\ x(1) \\ ...
\end{pmatrix}
$$
This leads to the following
definition of the inverse DFT:
$$ x(k)= \frac{1}{N} \sum_{n=0}^{N-1} X(n) \cdot e^{+i\frac{2\pi}{N}kn} $$
This definition of the inverse DFT is actually quite similar to the DFT.
We just have to change the sign of the exponent and divide by \(N\).
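The symmetry between the forward and inverse transform makes a round-trip test straightforward. A small sketch assuming NumPy; `dft` and `idft` are illustrative names:

```python
import numpy as np

def dft(x):
    N = len(x)
    nk = np.outer(np.arange(N), np.arange(N))
    return np.exp(-2j * np.pi * nk / N) @ x

def idft(X):
    N = len(X)
    nk = np.outer(np.arange(N), np.arange(N))
    return np.exp(+2j * np.pi * nk / N) @ X / N   # flipped sign, scaled by 1/N

x = np.array([1.0, 2.0, 0.5, -1.0])
assert np.allclose(idft(dft(x)), x)               # round trip recovers the signal
```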
At this point, we have covered the definition of the DFT, how it can be interpreted as an n-dimensional complex rotation, and how the inverse DFT can be derived from this interpretation.
But we haven't covered yet what we actually gain from the DFT.
Why do we actually do all this?
The next section will hopefully answer this question as we explore why the DFT lets us "see" frequencies in signals.
In the following, we look at these mathematical lego bricks in more detail, as this will reveal some interesting secrets. Let's assume we have a DFT of size 4 as an example. Note the \(+i\) in the exponents: this is the synthesis (inverse) direction that rebuilds the signal from its coefficients, ignoring the \(\frac{1}{N}\) factor. Then we get the following matrix:
$$ \begin{pmatrix} e^{+i\frac{2\pi}{4}(0 \cdot 0)} & e^{+i\frac{2\pi}{4}(1 \cdot 0)} & e^{+i\frac{2\pi}{4}(2 \cdot 0)} & e^{+i\frac{2\pi}{4}(3 \cdot 0)} \\ e^{+i\frac{2\pi}{4}(0 \cdot 1)} & e^{+i\frac{2\pi}{4}(1 \cdot 1)} & e^{+i\frac{2\pi}{4}(2 \cdot 1)} & e^{+i\frac{2\pi}{4}(3 \cdot 1)} \\ e^{+i\frac{2\pi}{4}(0 \cdot 2)} & e^{+i\frac{2\pi}{4}(1 \cdot 2)} & e^{+i\frac{2\pi}{4}(2 \cdot 2)} & e^{+i\frac{2\pi}{4}(3 \cdot 2)} \\ e^{+i\frac{2\pi}{4}(0 \cdot 3)} & e^{+i\frac{2\pi}{4}(1 \cdot 3)} & e^{+i\frac{2\pi}{4}(2 \cdot 3)} & e^{+i\frac{2\pi}{4}(3 \cdot 3)} \\ \end{pmatrix} = \begin{pmatrix} \color{#338cff}{1} & \color{#3d72b8}{1} & \color{#3c5f8c}{1} & \color{#374f6e}{1} \\ \color{#338cff}{1} & \color{#3d72b8}{i} & \color{#3c5f8c}{-1} & \color{#374f6e}{-i} \\ \color{#338cff}{1} & \color{#3d72b8}{-1} & \color{#3c5f8c}{1} & \color{#374f6e}{-1} \\ \color{#338cff}{1} & \color{#3d72b8}{-i} & \color{#3c5f8c}{-1} & \color{#374f6e}{i} \\ \end{pmatrix} $$
If we depict the real part of each column as a function, we get the following graph (remember: \(Re\{ e^{i \frac{2\pi}{N}kn} \} = \cos( \frac{2\pi}{N}kn ) \)): Note that \(k\) is used as a real (non-discrete) number in the graph to emphasize the underlying function.
This means that every signal is composed of cosine waves oscillating at different frequencies (plus a DC component, which is the straight line and basically a cosine at 0 Hz). Of course, there is also an imaginary part \(i \, \sin(\frac{2\pi}{N}kn)\), but in many cases it can be omitted, as will be shown later. As the DFT coefficients scale these cosines, we can tell which frequencies are present in a signal simply by looking at the DFT coefficients! This is why the DFT lets us see frequencies in signals!
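The column-as-cosine observation is easy to verify numerically; a minimal sketch assuming NumPy:

```python
import numpy as np

N = 4
k = np.arange(N).reshape(-1, 1)     # time index (rows)
n = np.arange(N).reshape(1, -1)     # column / frequency index
W = np.exp(+2j * np.pi * k * n / N)

# The real part of column n, sampled at time k, is cos(2*pi/N * k*n):
# a cosine wave whose frequency grows with the column index.
assert np.allclose(W.real, np.cos(2 * np.pi * k * n / N))
```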
Since nothing explains things better than examples, we will now take a look at some example signals and their corresponding DFTs.
As a next step, let us look at the mathematical reasons why the imaginary part cancels out.
First, the columns of the inverse DFT matrix are complex conjugated.
So, for every column there is another column where the exponent has a switched sign.
Second, the DFT coefficients are complex conjugated for real signals such that \(X(n)=X^{*}(N-n)\).
You can also see this in the example, where we have two coefficients of magnitude 4.
Conjugate pairs must have equal magnitudes, so there is no way of having a 4 paired with a 3.
Note that this only holds for real signals, not for complex signals.
However, in most cases our signals are real, and for this whole chapter we will assume real signals in the time domain.
Only for some really fancy stuff would you use complex signals in the time domain.
These two reasons lead to many pairs such as (assuming the DFT coefficients are also real):
$$X(n) \cdot e^{i\frac{2\pi}{N}nk} + X(N-n) \cdot e^{i\frac{2\pi}{N}(N-n)k}$$
$$=X(n) \cdot e^{i\frac{2\pi}{N}nk} + X^{*}(n) \cdot e^{-i\frac{2\pi}{N}nk}$$
$$=2 X(n) \cdot \cos(\tfrac{2\pi}{N}nk)$$
At this point, let us summarize what we learned from this example.
Basically, there are two important things.
First, for real signals only one half of the DFT coefficients provides us with information; the other half can be reconstructed
by complex conjugation.
Second, for real signals the imaginary parts of the complex exponential functions cancel out. Thus, every real signal can
be reconstructed using only cosine waves!
So, I guess most of you know this, but complex numbers can be represented in a Cartesian form or in a polar form (thanks to Euler):
Both forms are actually mathematically equal and can be transformed into each other.
For example, the DFT coefficients of this example can also be written as:
$$X(n) = (8, 4 e^{i\frac{\pi}{4}} , 0, 0, 0, 0, 0, 4e^{-i\frac{\pi}{4}} )$$
The advantage of the polar representation is that we can directly see the magnitude part \(r\)
(the length of the vector) and the phase part (the angle of the vector).
Furthermore, from the polar representation we can easily derive that multiplying two complex numbers
results in their phases adding:
$$ r_1 \cdot e^{i\varphi_1} \cdot r_2 \cdot e^{i\varphi_2} = r_1 \cdot r_2 \cdot e^{i (\varphi_1 + \varphi_2) }$$
This property of adding phases will be pretty important in the following.
So let's assume that we have our DFT coefficients in a polar form and we use the inverse DFT to reconstruct a signal in the time domain.
For the given example this will be:
$$ x(k) = \frac{1}{8} \cdot 8 + \frac{1}{8} \cdot 4e^{i\frac{\pi}{4}} \cdot e^{i \frac{2\pi}{8} k}
+ \frac{1}{8} \cdot 4e^{-i\frac{\pi}{4}} \cdot e^{-i\frac{2\pi}{8} k} $$
$$ x(k) = \frac{1}{8} \cdot 8 + \frac{1}{8} \cdot 4 \cdot e^{i \left( \frac{2\pi}{8} k + \frac{\pi}{4} \right)}
+ \frac{1}{8} \cdot 4 \cdot e^{-i \left( \frac{2\pi}{8} k + \frac{\pi}{4} \right)} $$
$$ x(k) = \frac{1}{8} \cdot 8 + \frac{1}{8} \cdot 8 \cdot cos(\frac{2\pi}{8} k + \frac{\pi}{4}) $$
As you can see, the phase of the DFT coefficient first moves into the complex exponential function and finally shows up as
a time shift of the cosine, while the magnitude of the DFT coefficient tells us how to scale the cosine wave.
These two statements are pretty important, so make sure to understand and remember them.
To emphasize their importance, read them again in bold letters ;)
The magnitude of a DFT coefficient tells us how to scale the corresponding frequency.
The phase of a DFT coefficient tells us how to shift the corresponding frequency.
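The worked example above can be checked numerically. A small sketch assuming NumPy, whose `fft` follows the same sign convention as the DFT used here:

```python
import numpy as np

N = 8
k = np.arange(N)
x = 1 + np.cos(2 * np.pi * k / N + np.pi / 4)   # the signal reconstructed above

X = np.fft.fft(x)

# Coefficient 0 carries the DC part; coefficients 1 and 7 carry the shifted
# cosine as a conjugate pair with magnitude 4 and phase +/- pi/4.
assert np.isclose(X[0], 8)
assert np.isclose(X[1], 4 * np.exp(+1j * np.pi / 4))
assert np.isclose(X[7], 4 * np.exp(-1j * np.pi / 4))
assert np.allclose(np.delete(X, [0, 1, 7]), 0)
```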
At this point, we covered most of the essential theoretical aspects of the DFT.
Of course there are still some (important) topics left, but let's leave them for a future post and head over
to some applications.
Because what is the purpose of all this stuff if we don't apply it?
Whenever you obtain some results, displaying them in a meaningful way is at least as important as the results themselves.
A great way of displaying DFT coefficients are so-called spectrums.
These spectrums can be further classified as magnitude spectrums or phase spectrums.
A magnitude spectrum is created by plotting the absolute values of the DFT coefficients \(|X(n)|\) on the y-axis
against the coefficient indices \(n\) on the x-axis.
An example can look like this magnitude spectrum:
It is the spectrum of a 1-second recording of a note I played on my guitar.
Note that since most signals are real, often only one half of the DFT coefficients is displayed.
Furthermore, it is more common to plot the frequencies associated with the DFT coefficients on the x-axis instead of the coefficient indices.
You find the frequency of a coefficient by multiplying its index \(n\) with \(f_s/N\), where \(f_s\) is the sampling frequency and \(N\) is the number of samples.
In my guitar example, the sound was sampled 44100 times per second resulting in \(f_s = 44100\).
As the signal is 1 second long, the number of samples is 44100.
This results in 22050 independent DFT coefficients giving us frequencies from 0Hz to 22049Hz in 1Hz steps.
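The index-to-frequency mapping for the guitar example can be sketched in a few lines, assuming NumPy (whose `rfftfreq` helper computes exactly this mapping for real signals):

```python
import numpy as np

fs = 44100                  # sampling frequency of the guitar recording
N = 44100                   # one second of audio -> N samples

# Coefficient index n corresponds to the frequency n * fs / N.
freqs = np.arange(N) * fs / N

# NumPy ships a helper that computes this mapping for real signals.
assert np.allclose(freqs[:N // 2 + 1], np.fft.rfftfreq(N, d=1 / fs))

# With fs / N = 1, bin 440 maps to 440 Hz -- the "A" from the guitar example.
assert freqs[440] == 440.0
```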
The cool thing about magnitude spectrums is that we only use the absolute values of the DFT coefficients.
Thus, we can easily see which frequencies are apparent in a time-discrete signal!
If we zoom into the guitar example, we can see that I played the tone "A" at 440Hz:
Just by looking at the spectrum, you can also tell that I probably played a string instrument as there are many overtones at multiples of 440Hz.
But things get even more interesting: You can tell that I don't live in the United States or Canada and that I own shitty equipment just by looking at the spectrum!
If you zoom even further, you can see a peak at 50Hz:
This peak is caused by the German 50 Hz AC grid, which induces a 50 Hz noise signal called mains hum.
If I lived in the US or Canada, this peak would probably be at 60 Hz.
And things get even crazier. Just by using the spectrum, you can not only tell where I was but also when!
As I surfed through the endless realms of the Internet I found a Wikipedia article about
electrical network frequency analysis, which cited this
newspaper article.
According to the article, the German AC network frequency varies between 49.95 Hz and 50.05 Hz, which obviously also affects the frequency of the mains hum.
The trick is to have a quite long audio recording and to estimate the varying mains hum frequency over time.
(FYI: A minimum of 100s is needed for a frequency resolution of 0.01Hz. This can be calculated by using \(f_s/N\) as stated above.)
By comparing the noise signal with a network frequency database, you may obtain when the audio signal was recorded.
This is why, as of 2010, the Bavarian police records the network frequency 24/7.
Of course you have to create multiple spectrums over time and see how the frequency changes. So, if you have a 10-minute audio recording,
you may take several 100s intervals and determine their spectrum.
In this case, we speak of a short-time Fourier transform (STFT), and since the spectrum varies over time, the result is called a spectrogram.
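The STFT idea can be sketched in a few lines. This is a minimal illustration, not production DSP code (real implementations also apply a window function such as Hann to reduce spectral leakage); the window length, hop size, and test signal are arbitrary choices for the demo:

```python
import numpy as np

# Minimal STFT sketch: slice the signal into windows and DFT each slice.
def stft(x, win_len, hop):
    starts = range(0, len(x) - win_len + 1, hop)
    return np.array([np.fft.rfft(x[s:s + win_len]) for s in starts])

fs = 1000                                   # made-up sampling rate for the demo
t = np.arange(2 * fs) / fs                  # 2 s test signal ...
x = np.sin(2 * np.pi * 50 * t)              # ... containing a steady 50 Hz "hum"

S = stft(x, win_len=500, hop=250)           # one spectrum per time slice
freqs = np.fft.rfftfreq(500, d=1 / fs)      # bin -> frequency mapping
peak = freqs[np.abs(S).argmax(axis=1)]      # strongest frequency in each slice

assert np.all(peak == 50.0)                 # the hum shows up in every slice
```

Tracking how `peak` drifts over the slices is, in essence, what the network-frequency analysis described above does.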
You may want to create some spectrograms/spectrums yourself, so take a look at this little Javascript application I wrote:
If you click on the "Start" button, a DFT is applied to the audio signal captured from your device's microphone. Note that everything is processed locally on your device and no data is sent to some fishy servers (I'm cooler than google, lol). The DFT coefficients are displayed as a magnitude spectrum, giving you information about how much of each frequency can be found in the signal. Try to sing, or play some tones if you have an instrument. You will see peaks in the spectrum for the corresponding pitches, possibly including some overtones. This does not work for mayonnaise, as it is not an instrument...
As mentioned above, there are also so-called phase spectrums. To create a phase spectrum, we use the same principles for the x-axis as explained above, but the y-axis now represents the phase of the DFT coefficients \(\arg\{X(n)\}\). However, with the topics covered in this article, I could not find a good use case for it :( If you know some applications, please don't hesitate to contact me.