<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="https://www.chciken.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.chciken.com/" rel="alternate" type="text/html" /><updated>2026-04-17T16:47:30+00:00</updated><id>https://www.chciken.com/feed.xml</id><title type="html">chciken</title><subtitle>This is my website :)</subtitle><entry><title type="html">ROM Hacking a Game Boy Game</title><link href="https://www.chciken.com/game/boy/2026/03/21/rom-hacking-a-game-boy-game.html" rel="alternate" type="text/html" title="ROM Hacking a Game Boy Game" /><published>2026-03-21T14:46:44+00:00</published><updated>2026-03-21T14:46:44+00:00</updated><id>https://www.chciken.com/game/boy/2026/03/21/rom-hacking-a-game-boy-game</id><content type="html" xml:base="https://www.chciken.com/game/boy/2026/03/21/rom-hacking-a-game-boy-game.html"><![CDATA[<p>This post covers the ROM hacking of Game Boy games.
It is designed as a comprehensive, in-depth guide that covers everything from simple to advanced ROM hacks as well as the closely related reverse engineering.
As a practical example, this guide is accompanied by the Game Boy game “Disney’s The Jungle Book” from 1994 - a game on which I spent more than 200 hours of reverse-engineering and ROM hacking.
You are probably already familiar with the term “ROM hacking” (why else would you be here?), but anyway, let’s start at the very beginning…</p>

<p><strong>What is ROM hacking?</strong><br />
ROM hacking refers to the process of changing software for legacy consoles (NES, SNES, Game Boy, etc.).
It is also often referred to as ROM patching, or ROM modding.
The term ROM (Read-Only Memory) is derived from the read-only cartridge memory used by these older consoles.
The reasons for ROM hacking are manifold: maybe you want to add cheats, maybe you want to create a translation for a game that was only released in Japanese,
maybe you want to add new levels - the ways of altering your favorite game are endless.</p>

<p><strong>Do I need to be an experienced programmer?</strong><br />
The difficulty of your ROM hack primarily depends on your ambitions.
Simple hacks can be learned and understood by beginners with little programming experience in a few minutes.
More elaborate projects may require good knowledge of the Game Boy CPU’s assembly language (Z80-like), the Game Boy’s hardware, and algorithms in general.
Some examples of easy and hard projects are provided in the following.</p>

<p><strong>Is there anything I need?</strong><br />
Besides the ROM that you want to hack (more on that shortly), there is not much you need.
I highly recommend a Linux system for ROM hacking, but many of the tools used here work on other operating systems as well.
If you are on Windows, you can simply use WSL2 to get a Linux environment.</p>

<!-- However, while many PC games provide interfaces for user-created mods,
games for old consoles reside on a read-only cartridge and were thus not intended to be changed by a user.
This is why ROM hacking is so hacky. -->

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }

  #classic-png {
    border: 2px solid #006fa2;
  }

  #classic-changed-png {
    border: 2px solid #006fa2;
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
    <li><a href="#1-overview">1. Overview</a></li>
    <li><a href="#2-expectation-management">2. Expectation Management</a></li>
    <li><a href="#3-reverse-engineering">3. Reverse Engineering</a>
      <ul>
      <li><a href="#31-effort-estimate">3.1 Effort Estimate</a></li>
      <li><a href="#32-separating-data-and-instructions">3.2 Separating Data and Instructions</a></li>
      <li><a href="#33-labeling-code">3.3 Labeling Code</a></li>
      <li><a href="#34-the-role-of-ai">3.4 The Role of AI</a></li>
      </ul>
    </li>
    <li><a href="#4-rom-hacking">4. ROM Hacking</a>
    <ul>
      <li><a href="#41-simple-hack">4.1 Simple Hack</a></li>
      <li><a href="#42-medium-difficulty-hack">4.2 Medium-difficulty Hack</a></li>
    </ul>
    </li>
    <li><a href="#5-conclusion">5. Conclusion</a></li>
  </ul>
</div>

<h2 id="1-overview">1. Overview</h2>

<p>Before starting with ROM hacks, you need the content of a Game Boy cartridge in a computer-readable form.
There are two ways of getting there.</p>

<p>First, you can use a cartridge reader, such as <a href="https://www.gbxcart.com/">GBxCart</a>.
Among other things, it allows you to dump the bits and bytes of a Game Boy cartridge directly on your PC using a USB connection.</p>

<p>Second, if you don’t want to spend money on a cartridge reader, you may resort to certain websites to download a ROM file.
But beware: if you do not own the downloaded game, this is illegal in many jurisdictions. And even if you own it, some countries may still regard downloading/distributing ROMs as a copyright violation.</p>

<p>So, assuming you somehow got a ROM file, let’s quickly talk about expectation management and difficulty of ROM hacks.</p>

<h2 id="2-expectation-management">2. Expectation Management</h2>

<p>Before starting your own ROM hacking project it is <strong>extremely important</strong> to reflect on your goals.
Depending on the difficulty of your goal, a project can be concluded within minutes or it might take thousands of hours!
Here are two examples from opposite ends of the spectrum.</p>

<p>Imagine you want to hack a ROM, such that the player has an infinite number of lives.
In most games, the number of lives is located somewhere in the Game Boy’s RAM at a fixed address.
If you can prevent the game from decrementing the value at said address, then you have an infinite number of lives - congrats!
But how does one get the address of the number of lives?
The easiest way to find it would be to use the game’s original source code and search for variables like <code class="language-plaintext highlighter-rouge">NumLives</code>, <code class="language-plaintext highlighter-rouge">number_of_lives</code>, and so forth.
With the source at hand, locating such points of interest is usually a matter of minutes, especially for Game Boy games, which are relatively small.
Unfortunately, most developers do not publish a game’s source code, so you either rely on a community-based reverse-engineering project
or you start your own reverse-engineering project.
Judging from <a href="https://github.com/gbdev/awesome-gbdev">awesome-gbdev</a>, there are around 15 reverse-engineered Game Boy games (also called disassemblies) available.
This is a small fraction of the <a href="https://en.wikipedia.org/wiki/List_of_Game_Boy_games">more than 1000</a> released Game Boy games.
So, if your to-be-hacked game isn’t one of the most popular games, you likely have to do the reverse-engineering yourself.
But luckily, for small changes, it is not always necessary to have the full source code.
I will show some techniques later, which usually achieve the goal in minutes or hours.</p>

<p>Ok, that was an “easy” project.
Now let’s maximize the difficulty of your goal by saying you want to add an extra level to a game.
This is the point at which I would say: It’s practically impossible without having the source code.
Because first, you need to understand how the game stores, loads, and handles levels.
Next, you need to design your own level, which might be very tedious work if you don’t write aids like a level editor.
Lastly, you need to add the level to the game and recompile it.
With Game Boy games often being heavily optimized for size, the resulting spaghetti code quickly breaks when adding a few bits here and there.
So, you might end up fixing tons of other things as well.
Including the reverse-engineering, your project may easily require 1000+ hours, even if you are an experienced programmer.</p>

<p>Since reverse-engineering is likely a fundamental part of your ROM hacking project, the next section highlights it in greater detail.
If you already have the source code, feel free to skip the next section.</p>

<!-- | Source code available | Minor changes     | Major changes   |
|-----------------------|-------------------|-----------------|
| Yes                   | Minutes           | 1-10 hours      |
| No                    | Minutes to hours  | 10-1000 hours   | -->

<h2 id="3-reverse-engineering">3. Reverse Engineering</h2>

<p>When software developers create software, they usually code something in a high-level programming language (C, C++, Rust, etc.).
Once the code is ready to be deployed or tested, it is compiled into something a computer can execute.
Unless you are working with interpreted languages (like Python),
the compiled result is a so-called <em>binary</em>, comprising bytes ready to be fed into your CPU.</p>

<p>The goal of software reverse engineering is to reverse this process by taking a binary and transforming it back into source code.
Unfortunately, most important meta-information (variable and function names, comments, code layout, etc.) gets lost during the compilation process.
So, it is nearly impossible to recreate the original source.
But that is not necessarily a bad thing: maybe your reverse-engineered source code is better than the original,
and if you plan to make it publicly available, releasing the original source is not an option anyway due to copyright protection.</p>

<p>Let us now take a look at what this reverse-engineering process looks like for the Game Boy.
Assuming you have the ROM file of your favorite game available, mapping the machine code in the ROM file to human-readable <a href="https://en.wikipedia.org/wiki/Assembly_language">assembly language</a>
is actually very simple.
Just use the open-source tool <a href="https://github.com/mattcurrie/mgbdis">mgbdis</a>.
It’s a Python script that converts your ROM file into several <code class="language-plaintext highlighter-rouge">.asm</code> files and a Makefile.
The assembly files can be converted back to an executable ROM file by executing the Makefile.
Note that the Makefile uses <a href="https://github.com/gbdev/rgbds">rgbds</a>, which needs to be installed on your system.
Here’s how to do it in detail on Linux:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;</span> <span class="nb">cd </span>mgbdis
<span class="o">&gt;</span> python3 mgbdis.py jungle_book.gb
<span class="o">&gt;</span> <span class="nb">cd </span>disassembly/
<span class="o">&gt;</span> <span class="nb">ls
</span>bank_000.asm  bank_002.asm  bank_004.asm  bank_006.asm  game.asm  hardware.inc
bank_001.asm  bank_003.asm  bank_005.asm  bank_007.asm  gfx/      Makefile
<span class="o">&gt;</span> make
</code></pre></div></div>

<p>If you now look into the <code class="language-plaintext highlighter-rouge">.asm</code> files, you find a lot of Game Boy (Z80-like) assembly.
But how does one get from that to a typical high-level language like C or C++?
Well, I have good news and bad news:
The good news is that we don’t have to lift the assembly to a high-level programming language, as most Game Boy games were programmed in assembly.
The bad news is that we have to deal with assembly.
Since we have already arrived at our target programming language, we can now focus on the core tasks of reverse engineering:</p>
<ul>
  <li>Giving names to labels and variables</li>
  <li>Resolving magic numbers</li>
  <li>Separating data and code segments</li>
  <li>Writing macros if you are feeling fancy.</li>
</ul>

<p>Before diving into the details of reverse-engineering,
I want to give you a rough idea of how much effort is involved.</p>

<h3 id="31-effort-estimate">3.1 Effort Estimate</h3>

<p>Although reverse engineering is a lot of fun, it can also be a lot of work.
To give you an intuition of how much work may be involved, consider my case of reverse-engineering “The Jungle Book”:</p>

<p>Using the Linux tool <a href="https://github.com/AlDanial/cloc">cloc</a>, I count roughly 22,000 software lines of code (SLOC).
Some of that is data, but most big data chunks are separated in external files.
I have semantically labeled about 90% of these 22,000 lines.
I didn’t take exact measurements, but on average I need about 1 hour per 100 lines of code.
Overall, it means that I spent more than 150 hours on reverse-engineering this game, which matches my gut feeling very well.
Including tools and these blog posts I write, my total time spent on this game likely exceeds 200 hours.</p>
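<p>As a quick sanity check, the back-of-the-envelope estimate can be written out in a few lines of Python.
This is only a sketch using the rough figures quoted above; the variable names are made up for illustration.</p>

```python
# Rough figures quoted in the text above.
total_sloc = 22_000        # lines counted by cloc
labeled_fraction = 0.9     # share of lines with semantic labels
lines_per_hour = 100       # rough labeling speed

labeled_lines = total_sloc * labeled_fraction
hours = labeled_lines / lines_per_hour
print(f"{labeled_lines:.0f} labeled lines, about {hours:.0f} hours")
# prints: 19800 labeled lines, about 198 hours
```

<p>That lands comfortably above the 150 hours mentioned above, so the numbers are at least consistent with each other.</p>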

<p>Ultimately, the time spent reverse-engineering a game probably scales roughly linearly with its size.
So, I took a look at some open-source Game Boy reverse-engineering projects to see how many SLOC and how much effort they involve.
The effort estimate simply applies my own rate of about 1 hour per 100 lines (10 hours per kSLOC) to each project.
Note that many projects aren’t completely reverse-engineered, which may lead to some data being included in the SLOC count.
Here’s the list:</p>

<table>
  <thead>
    <tr>
      <th>Game</th>
      <th>Cartridge Size</th>
      <th>kSLOC</th>
      <th>Release</th>
      <th>Contributors</th>
      <th>Effort estimate</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://github.com/vinheim3/tetris-gb-disasm">Tetris</a></td>
      <td>32 KiB</td>
      <td>13.5</td>
      <td>1989</td>
      <td>1</td>
      <td>135 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/kaspermeerts/supermarioland.git">Super Mario Land</a></td>
      <td>64 KiB</td>
      <td>14</td>
      <td>1989</td>
      <td>2</td>
      <td>140 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/not-chciken/jungle-book-gb-disassembly">The Jungle Book</a></td>
      <td>128 KiB</td>
      <td>22</td>
      <td>1994</td>
      <td>1</td>
      <td>220 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/daid/FFA-Disassembly.git">Final Fantasy Adventure</a></td>
      <td>256 KiB</td>
      <td>79</td>
      <td>1991</td>
      <td>4</td>
      <td>790 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/huderlem/kirbydreamland">Kirby’s Dream Land</a></td>
      <td>256 KiB</td>
      <td>89</td>
      <td>1992</td>
      <td>2</td>
      <td>890 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/froggestspirit/marioland2">Super Mario Land 2</a></td>
      <td>512 KiB</td>
      <td>37.5</td>
      <td>1992</td>
      <td>1</td>
      <td>375 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/froggestspirit/mmania">Mole Mania</a></td>
      <td>512 KiB</td>
      <td>74.5</td>
      <td>1996</td>
      <td>1</td>
      <td>745 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/CelestialAmber/DKGBDisasm">Donkey Kong</a></td>
      <td>512 KiB</td>
      <td>103</td>
      <td>1994</td>
      <td>1</td>
      <td>1030 hours</td>
    </tr>
    <tr>
      <td><a href="https://github.com/pret/pokered">Pokémon Red</a></td>
      <td>1024 KiB</td>
      <td>150</td>
      <td>1996</td>
      <td>55</td>
      <td>1500 hours</td>
    </tr>
  </tbody>
</table>

<p>As you can see, with only 22 kSLOC, my reverse-engineering project is rather in the lower half in terms of complexity.
Other projects, like Pokémon for instance, comprise more than 100 kSLOC!
So, I guess the two main takeaways of this subsection are:</p>
<ul>
  <li>Even if you are 10x faster than I am: Fully reverse-engineering a Game Boy game is likely on the order of tens to hundreds of hours of work.</li>
  <li>The amount of work correlates with cartridge size. For an easy project, maybe consider an older and smaller game.</li>
</ul>
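<p>To put a number on the second point, one can compute the correlation between cartridge size and kSLOC for the nine projects in the table.
The following is a minimal sketch; the helper <code class="language-plaintext highlighter-rouge">pearson</code> is written by hand for illustration, and with only nine data points the result should not be over-interpreted.</p>

```python
import math

# (cartridge size in KiB, kSLOC) taken from the table above.
projects = [
    (32, 13.5), (64, 14), (128, 22), (256, 79), (256, 89),
    (512, 37.5), (512, 74.5), (512, 103), (1024, 150),
]

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

r = pearson(projects)
print(f"correlation between cartridge size and kSLOC: {r:.2f}")
```

<p>The coefficient comes out positive, but outliers such as Super Mario Land 2 (large cartridge, comparatively few lines) show that cartridge size is only a rough proxy.</p>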

<p>After this quick effort estimate, let’s get into the nitty-gritty.</p>

<h3 id="32-separating-data-and-instructions">3.2 Separating Data and Instructions</h3>

<p>Assuming you successfully executed <a href="https://github.com/mattcurrie/mgbdis">mgbdis</a> as shown above, we now take a closer look at the generated assembly files.
For instance, when disassembling the Jungle Book game, I get files which look like this:</p>

<pre><code class="language-z80">    ld c, $0a
    ld hl, $c507
    ld de, $c511

jr_007_40b6:
    ld a, [hl+]
    ld [de], a
    inc de
    dec c
    jr nz, jr_007_40b6
</code></pre>

<p>That looks like solid assembly code.
On closer inspection, it copies $0a (10) bytes from address $c507 to address $c511.
Of course, the generated code uses placeholder labels and there is no semantic information, but the generated code looks meaningful.</p>
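<p>If you are new to assembly, it can help to replay such a snippet by hand.
Here is a small Python sketch that mimics the loop instruction by instruction; the memory contents are made up, and the model ignores everything else the CPU does (flags, timing, banking).</p>

```python
# Toy model of the loop above: 64 KiB of address space as a bytearray.
memory = bytearray(0x10000)
memory[0xC507:0xC511] = bytes(range(1, 11))  # made-up source data

c = 0x0A      # ld c, $0a    (loop counter)
hl = 0xC507   # ld hl, $c507 (source pointer)
de = 0xC511   # ld de, $c511 (destination pointer)

while True:
    a = memory[hl]      # ld a, [hl+]  (load, then increment hl)
    hl += 1
    memory[de] = a      # ld [de], a
    de += 1             # inc de
    c -= 1              # dec c
    if c == 0:          # jr nz, jr_007_40b6  (loop while c is non-zero)
        break

print(memory[0xC511:0xC51B].hex())
# prints: 0102030405060708090a
```

<p>After ten iterations the source pointer has reached $c511, so source and destination ranges sit right next to each other without overlapping.</p>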

<p>Now to another excerpt from the same file:</p>

<pre><code class="language-z80">    nop
    nop
    nop
    sub d
    ld a, h
    inc b
    ld hl, sp+$03
    db $fc
    ld b, $f9
    rrca
</code></pre>

<p>That one looks a bit weird.
Is the code really executing 3 consecutive <a href="https://en.wikipedia.org/wiki/NOP_(code)">nop</a> operations?
Is it really incrementing register b just to overwrite it with $f9?</p>

<p>No, what we actually have here is data.
If we are not providing a <a href="https://github.com/mattcurrie/mgbdis?tab=readme-ov-file#symbol-files">symbol file</a> to <a href="https://github.com/mattcurrie/mgbdis">mgbdis</a>, it will simply assume that the whole cartridge comprises instructions.
However, there is also data to consider.
In fact, for most games the majority of the ROM is occupied by data used for things like sprites, sound tracks, maps, and so forth.
That leaves one with the problem of effectively separating data and instructions.
Unlike modern file formats like <a href="https://en.wikipedia.org/wiki/Executable_and_Linkable_Format">ELF</a> on Linux or <a href="https://en.wikipedia.org/wiki/.exe">.exe</a> on Windows,
the Game Boy ROMs are just binary blobs without any metadata that helps to distinguish between data and instructions.
One option is, of course, the manual approach shown above: inspect the assembly and decide whether it makes sense or not.
If you want to do it in a more automated way, I can recommend two approaches.</p>

<p>The first one is to run the game in an emulator and record which parts of the cartridge it reads as data and which parts it executes.
I implemented such a feature in my Game Boy emulator <a href="https://github.com/not-chciken/TLMBoy">TLMBoy</a>.
Some other emulators, like <a href="https://mattcurrie.com/bdm/">Beaten Dying Moon</a>, support that as well.
The result is a symbol file with data and instruction sections, which can be fed to <a href="https://github.com/mattcurrie/mgbdis">mgbdis</a>.
For this approach to work sufficiently well, every byte of the cartridge needs to be touched at least once,
either by loads/stores or by executing it.
In practice, this requires a 100% playthrough that may take multiple hours depending on the game.</p>
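<p>The bookkeeping behind this approach is simple: remember for every touched address whether it was executed or read, then merge neighbors of the same kind into ranges.
The sketch below illustrates the idea with a hypothetical access log; it does not reproduce the actual output format of TLMBoy or the symbol-file syntax expected by mgbdis.</p>

```python
def to_ranges(accesses):
    """Merge {address: kind} into sorted (start, end_exclusive, kind) runs."""
    ranges = []
    for addr in sorted(accesses):
        kind = accesses[addr]
        if ranges and ranges[-1][1] == addr and ranges[-1][2] == kind:
            start, _, _ = ranges[-1]
            ranges[-1] = (start, addr + 1, kind)   # extend the current run
        else:
            ranges.append((addr, addr + 1, kind))  # start a new run
    return ranges

# Hypothetical log: the emulator executed $0150-$0153 and read $0154-$0156.
log = {0x150: "code", 0x151: "code", 0x152: "code", 0x153: "code",
       0x154: "data", 0x155: "data", 0x156: "data"}

for start, end, kind in to_ranges(log):
    print(f"${start:04x}-${end - 1:04x} {kind}")
# prints:
# $0150-$0153 code
# $0154-$0156 data
```

<p>Addresses that never show up in the log simply remain gaps between the ranges, which is exactly why a complete playthrough matters.</p>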

<p>Another method requires <a href="https://github.com/radareorg/radare2">radare2</a>, which is an extremely useful tool for reverse-engineering in general.
The following method only works for strings, but when reverse-engineering, every little aid is welcome.
Open your ROM with <a href="https://github.com/radareorg/radare2">radare2</a> as follows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>radare2 jungle_book.gb
</code></pre></div></div>
<p>Now simply type <code class="language-plaintext highlighter-rouge">izzq</code> to list all the strings <a href="https://github.com/radareorg/radare2">radare2</a> can find in your binary:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[0x00000100]&gt; izzq
...
0x172ee 7 6 `h0H0(
0x172f7 6 5 X0X08
0x1733d 6 5 X0X0(
0x1736c 5 4 `h04
0x17415 6 5 X H08
0x1746a 5 4 &lt;| `
0x175d1 23 22  !LICENSED BY NINTENDO
0x175e8 9 8 PRESENTS
0x175f4 37 36 1994 THE WALT\r   DISNEY COMPANY\r\r
0x1761b 98 97 1994 VIRGIN\r    INTERACTIVE\r   ENTERTAINMENT\r\rDEVELOPED BY EUROCOM\r\rPRESS START TO BEGIN\r  LEVEL
0x1767d 10 9  NORMAL
0x17687 9 8 PRACTICE
0x17690 14 13 JUNGLE BY DAY
0x1769e 15 14 THE GREAT TREE
0x176ad 13 12  DAWN PATROL
0x176ba 13 12 BY THE RIVER
...
</code></pre></div></div>
<p>What you can see above are strings radare2 identified and the corresponding addresses and sizes of the strings.
As you can see for the Jungle Book game, radare2 identifies a lot of false positives (e.g., <code class="language-plaintext highlighter-rouge">X0X08</code> and <code class="language-plaintext highlighter-rouge">X H08</code> are unlikely to be strings).
But occasionally it finds some candidates that are very likely to be strings.
As shown above, radare2 identified the strings of the start screen, as well as the strings for the level names.</p>

<p>The relatively high number of false positives can be explained by radare2’s string-detection algorithm.
Basically, it just looks for printable characters with a minimum length.
Since <code class="language-plaintext highlighter-rouge">izz</code> searches the whole file, you get a lot of “strings” just by chance.
If you want to reduce the number of false positives, you may want to play with the minimum string length.
For instance, increase the minimum length to 10 by executing:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>e bin.minstr<span class="o">=</span>10
</code></pre></div></div>
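<p>To see why such a heuristic produces false positives, here is a minimal Python sketch of the idea (runs of printable ASCII with a minimum length).
It is not radare2’s actual implementation, which handles more encodings and characters; the function name and the sample bytes are made up.</p>

```python
def find_strings(blob, min_len=4):
    """Yield (offset, text) for runs of printable ASCII of at least min_len bytes."""
    printable = range(0x20, 0x7F)
    start = None
    for offset, byte in enumerate(blob + b"\x00"):  # sentinel ends a trailing run
        if byte in printable:
            if start is None:
                start = offset
        else:
            if start is not None and offset - start >= min_len:
                yield start, blob[start:offset].decode("ascii")
            start = None

# Made-up bytes: tile-like noise followed by a real string.
sample = b"\x00X0X08\xff\x01PRESENTS\x00"
print(list(find_strings(sample, min_len=4)))
# prints: [(1, 'X0X08'), (8, 'PRESENTS')]
print(list(find_strings(sample, min_len=8)))
# prints: [(8, 'PRESENTS')]
```

<p>Raising the minimum length filters out the short accidental match, at the price of also dropping genuine strings that are shorter than the threshold.</p>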

<h3 id="33-labeling-code">3.3 Labeling Code</h3>

<p>Probably the most important, but also the most time-consuming and challenging, part of reverse-engineering is <em>labeling</em>.
Labeling involves replacing the disassembler-generated placeholder labels with semantic labels as well as assigning variable names to memory addresses.
Or in other words: <a href="https://martinfowler.com/bliki/TwoHardThings.html">naming things</a>
(which is one of the two hardest problems in computer science!).
Ultimately, the goal of labeling is simply to make your code more accessible to humans.
If you are not too strict with the definition, writing comments or documentation can also be regarded as some kind of labeling.</p>

<p>So, how does one identify labels and variable names?
Well, that is the tricky part - there is no golden approach that will lead you to results.
Instead, it is a combination of reading the code, debugging, and coming up with creative ideas.
Furthermore, to find the name of a label or a variable, there is often no direct path.
Rather, you have to solve other parts first and sometimes that brings you to your goal without actively working towards it.
I feel like it’s a bit similar to Sudoku, where finding the number of a field is usually achieved by finding the number of other fields first.</p>

<p>To give you an example of what reverse-engineering looks like,
let us try to find out where the variable for the number of lives in “The Jungle Book” is.
As you can see from the following screenshot, the player starts with 6 lives:</p>

<div style="text-align:center">
  <img src="/assets/rom_hacking/jb_screenshot.png" alt="Screenshot of The Jungle Book Level 1" width="50%" />
</div>
<p><br /></p>

<p>Hence, somewhere in the code the number 6 needs to be written to a memory address, most likely by loading it into a register and then storing that register.
In assembly, this probably looks as follows ($1234 is just an example address):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ld a, $06
ld [$1234], a
</code></pre></div></div>
<p>Using a Regex (<code class="language-plaintext highlighter-rouge">ld\s\w,\s\$06\n\s+ld\s\[\$\w+\]</code>), you can scan the code to get the following candidates:
$c13d, $c14c, $c15f, $c1b7, $c1fc, $c501.</p>
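<p>If your editor does not support multi-line regex search, a few lines of Python do the job as well.
The sketch below uses the regex from above, extended with one capture group so that only the address is returned; the assembly snippet is made up for illustration.</p>

```python
import re

# The regex from the text, with a capture group around the address.
PATTERN = re.compile(r"ld\s\w,\s\$06\n\s+ld\s\[\$(\w+)\]")

# Made-up snippet in the style of the mgbdis output.
asm = """\
    ld a, $06
    ld [$c1b7], a
    ld a, $03
    ld [$c200], a
"""

candidates = sorted(set(PATTERN.findall(asm)))
print(candidates)
# prints: ['c1b7']
```

<p>To scan a real disassembly, read each <code class="language-plaintext highlighter-rouge">.asm</code> file into a string and apply the same pattern to it.</p>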

<p>Next, start the game using a phenomenal debugger called <a href="https://github.com/drhelius/Gearboy">Gearboy</a>. This Game Boy debugger gives you introspection into every tiny bit of the Game Boy including the Game Boy’s working RAM, which is what we are interested in.
Because if any of the aforementioned addresses holds the number of lives,
it should be “6” during the game’s execution.
Furthermore, it should decrement if the player is losing a life.
Of all address candidates, only one showed this behavior: $c1b7.
Here’s a screenshot showing the memory content using Gearboy’s memory editor:</p>

<div style="text-align:center">
  <img src="/assets/rom_hacking/gearboy_memory_view.png" alt="Using Gearboy's memory viewer" width="70%" />
</div>
<p><br /></p>

<p>You can now assume with high confidence that $c1b7 holds the player’s number of lives.
In the source code, this can be annotated by creating a variable for this address
and a constant to replace the magic number “6”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def CurrentLives EQU $c1b7
def NUM_LIVES EQU 6
</code></pre></div></div>
<p>Pretty much all occurrences of $c1b7 in the source code can now be replaced by <code class="language-plaintext highlighter-rouge">CurrentLives</code>,
and you can move on to label other parts of the code.</p>

<p>With increasing progress, labeling gets harder and easier at the same time.
It gets harder because you run out of low-hanging fruit like the example in this subsection.
But it also gets easier, because identifying variables provides more context.
The next section is something I’d rather skip, but unfortunately it works too well: AI for reverse engineering.</p>

<h3 id="34-the-role-of-ai">3.4 The Role of AI</h3>

<p>With AI becoming more pervasive in a programmer’s everyday life, I was wondering how well AI performs in reverse-engineering.
I did some tests with ChatGPT at the start of the project in 2024 with mediocre outcomes (I guess training LLMs on Z80 assembly wasn’t the highest priority).
However, quite a bit has happened in the meantime.
<!-- My last experiences with ChatGPT and coding Game Boy assembly where mediocre at best. I guess there is just not much data to train these networks on. -->
To evaluate the performance of my favourite LLMs, I provided them with some snippets of not-yet-reverse-engineered code whose purpose I already knew.</p>

<p>In particular, I used the <code class="language-plaintext highlighter-rouge">DecompressData</code> function of the Jungle Book game, which you can find at the very end of <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/src/bank_000.asm">this file</a>. The function uses an <a href="https://en.wikipedia.org/wiki/LZ77_and_LZ78">LZ77</a>-like algorithm to decompress data from the game’s ROM.
Although the function is relatively small and neatly confined, its hacky implementation was a pain to reverse engineer.
Fully reverse-engineering and understanding it took me multiple hours.</p>

<p>So, can AI do my work in less than a minute?
Here are their answers to <a href="/assets/rom_hacking/request.txt">this request</a> and a subsequent request asking for annotated code with semantic labels:</p>

<hr />

<p>GPT-5: Short version: it’s a backwards LZ-style decompressor. <a href="/assets/rom_hacking/chatgpt.txt">Annotated code</a>.</p>

<p>Gemini 2.5 Pro: This Z80 assembly code implements a data decompression routine, likely a variant of the LZ77 algorithm. <a href="/assets/rom_hacking/gemini.txt">Annotated code</a>.</p>

<p>Claude Sonnet 4.5: This is a data decompression function written in Game Boy (Z80-like) assembly language. It appears to implement a variant of LZ77 or LZSS compression, which is commonly used in Game Boy games to save ROM space. <a href="/assets/rom_hacking/sonnet.txt">Annotated code</a>.</p>

<hr />

<p>I was quite impressed to see that all LLMs managed to identify the data decompression algorithm.
While Sonnet 4.5 and GPT-5 remained kind of fuzzy concerning the exact algorithm, Gemini 2.5 Pro hit the nail on the head.
Also, the annotated code of all three LLMs was spot on.
So, hours of human work done in a matter of seconds.
Even though this seemed super impressive, keep in mind that my project is open source and that it might have been part of the training dataset.</p>

<p>Since it worked so much better than my early attempts in 2024, I decided to use LLMs as a tool from that point on.
After a few more hours with LLM-guided reverse-engineering, I have to admit: It can be useful.
It was a bit hit-and-miss sometimes, but if you thoroughly evaluate the generated answers/code,
it can give you a nice performance boost.
I really wonder where it will be at in a few years.</p>

<h2 id="4-rom-hacking">4. ROM Hacking</h2>

<h3 id="41-simple-hack">4.1 Simple Hack</h3>

<p>Assuming you now have the source code of the game, or at least you know the addresses of certain variables,
it’s now time to perform the actual ROM hack.
A ROM hack can be performed directly by altering the underlying code, or indirectly by using cheat modules like
<a href="https://en.wikipedia.org/wiki/Game_Genie">Game Genie</a>
or <a href="https://en.wikipedia.org/wiki/GameShark">GameShark</a>.</p>

<p>Cheat modules do not alter the game’s binary directly. Instead, whenever the game tries to read from a given address, the cheat module intercepts the read and replaces the response with a predefined one.
In my opinion it’s the easiest way to perform a ROM hack. Also, most emulators support them.</p>

<p>Note: the following code format is the <em>GameShark / Pro Action Replay</em> style used by many Game Boy emulators (Game Genie codes work differently and have a different format).
These cheats typically force a value into a RAM address:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>01VVAAAA
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">VV</code> is the 8-bit value, and <code class="language-plaintext highlighter-rouge">AAAA</code> is the target address encoded in little-endian order (low byte first).</p>
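<p>Building such a code by hand is easy to get wrong because of the byte order, so here is a small Python sketch of the encoding.
The function name is made up, and the example address is the lives counter $c1b7 found in Section 3.3; I use it purely to illustrate the format, not as a tested cheat.</p>

```python
def gameshark_code(value, address, code_type=0x01):
    """Build an 8-digit code: type, value, then the address low byte first."""
    addr_le = address.to_bytes(2, "little")
    return bytes([code_type, value]).hex() + addr_le.hex()

# Hypothetical example: force the value $06 into $c1b7.
print(gameshark_code(0x06, 0xC1B7))
# prints: 0106b7c1
```

<p>Note how the address $c1b7 shows up as <code class="language-plaintext highlighter-rouge">b7c1</code> at the end of the code.</p>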

<p>So, let’s create our own cheat code to make the player in “The Jungle Book” invincible.
In the <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/src/bank_000.asm">reverse-engineered source code of the game</a>
you find a function called <code class="language-plaintext highlighter-rouge">ReceiveDamage</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; $197d: Input: "c" = damage to receive.
ReceiveDamage::
    ld a, [InvincibilityTimer]  ; a = [$c189]
    or a
    ret nz                      ; Not receiving damage if invincible.
    ...
</code></pre></div></div>

<p>As you can see from the code, the function returns early if a variable named <code class="language-plaintext highlighter-rouge">InvincibilityTimer</code> is non-zero.
So, let’s just set this variable to 1 by using the following cheat code: <code class="language-plaintext highlighter-rouge">010189c1</code>.
Note that the address is encoded in little-endian format.
Playing the game with this cheat code confirmed that it actually works:</p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="50%" height="50%">
    <source src="/assets/rom_hacking/jb_cheat.webm" type="video/webm" />
  </video>
</div>

<p>Mowgli isn’t really impressed by getting attacked by boars and mosquitoes (or whatever these dots are supposed to represent).
Alternatively, if you don’t want to use cheat codes, you can replace the <code class="language-plaintext highlighter-rouge">ret nz</code> by an unconditional return <code class="language-plaintext highlighter-rouge">ret</code> and recompile the game.
In this particular case, both instructions are 1 byte long, so it’s a safe in-place patch.</p>
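<p>If you prefer patching the ROM file directly instead of recompiling, the same change boils down to overwriting a single byte.
The following Python sketch demonstrates this on a stand-in byte sequence rather than on the real ROM.
The opcodes ($c0 for <code class="language-plaintext highlighter-rouge">ret nz</code>, $c9 for <code class="language-plaintext highlighter-rouge">ret</code>) are those of the Game Boy CPU; the helper name is made up.
For the real game, the listing above suggests the byte sits at offset $1981 (function start $197d, plus three bytes for the load and one for <code class="language-plaintext highlighter-rouge">or a</code>), but this assumes your dump matches the disassembly, so verify the old byte before overwriting it.</p>

```python
RET_NZ = 0xC0  # opcode of `ret nz`
RET = 0xC9     # opcode of `ret`

def patch_byte(rom, offset, expected, replacement):
    """Overwrite one byte in place, refusing if the old byte is unexpected."""
    if rom[offset] != expected:
        raise ValueError(f"unexpected byte at ${offset:04x}: ${rom[offset]:02x}")
    rom[offset] = replacement

# Stand-in for the start of ReceiveDamage: ld a, [$c189]; or a; ret nz
rom = bytearray([0xFA, 0x89, 0xC1, 0xB7, 0xC0])
patch_byte(rom, 4, RET_NZ, RET)
print(rom.hex())
# prints: fa89c1b7c9
```

<p>The check against the expected byte is cheap insurance: if the offset is off by one, the script stops instead of silently corrupting the ROM.</p>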

<p>But beware: if you change the code in general, you might run into two problems.
The first one is missing space in the cartridge.
If you add bytes, the usually densely packed ROM banks might exceed their 16 KiB limit.
The second one is relocation.
If you add or remove bytes, all code after your change moves to a different address, which leads to problems with position-dependent code.</p>
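<p>To get a feeling for these constraints, here is a small Python sketch that maps a bank and CPU address to an offset in the ROM file and checks whether additional bytes still fit into a bank.
It assumes the usual layout with bank 0 fixed at <code class="language-plaintext highlighter-rouge">$0000-$3FFF</code> and all other banks switched in at <code class="language-plaintext highlighter-rouge">$4000-$7FFF</code>; the helper names and the numbers in the demo are illustrative only.</p>

```python
BANK_SIZE = 0x4000  # 16 KiB per ROM bank


def rom_offset(bank: int, address: int) -> int:
    """Map a (bank, CPU address) pair to an offset in the ROM file."""
    if bank == 0:
        window = range(0x0000, 0x4000)  # bank 0 is fixed at $0000-$3FFF
    else:
        window = range(0x4000, 0x8000)  # other banks are switched in at $4000-$7FFF
    if address not in window:
        raise ValueError(f"${address:04x} is outside the window of bank {bank}")
    return bank * BANK_SIZE + (address - window.start)


def fits(used_bytes: int, extra_bytes: int) -> bool:
    """True if a bank with `used_bytes` can take `extra_bytes` more."""
    return used_bytes + extra_bytes <= BANK_SIZE


print(hex(rom_offset(0, 0x197D)))  # ReceiveDamage in bank 0
print(hex(rom_offset(5, 0x491C)))  # some address in bank 5
print(fits(0x3FF0, 32))            # hypothetical, almost full bank
```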

<h3 id="42-medium-difficulty-hack">4.2 Medium-difficulty Hack</h3>

<p>After this initial simple ROM hack, I want to show a more elaborate example by replacing the boar enemy with a Goomba
from “Super Mario Land 2: 6 Golden Coins”.
Adding instead of replacing is not really an option, because the cartridge is already filled to the brim.
So, first of all, we have to get the Goomba sprites from Super Mario Land 2.
Luckily, someone already reverse-engineered the game, including its sprites.
Download the <a href="https://github.com/froggestspirit/marioland2.git">source code here</a>
and take a look at the file <code class="language-plaintext highlighter-rouge">gfx/enemies/classic.2bpp</code>.
This 896-byte 2-bits-per-pixel file is where the Goomba sprites live.
To visualize the sprite data, use rgbgfx (which is part of <a href="https://github.com/gbdev/rgbds">rgbds</a>) with the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rgbgfx <span class="nt">--reverse</span> 2 <span class="nt">-o</span> classic.2bpp classic.png
</code></pre></div></div>
<p>The output PNG should look like this:</p>

<div style="text-align:center">
  <img src="/assets/rom_hacking/classic.png" id="classic-png" alt="Sprite palette of classic enemies in Super Mario Land" width="4%" />
</div>
<p><br /></p>

<p>It takes some imagination, but the Goomba sprites can be spotted at the top of the PNG.</p>
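<p>In case you wonder what is inside such a 2bpp file: every 8x8 tile takes 16 bytes, and each row of a tile is stored as two bytes, one per bitplane.
The following Python sketch decodes a single tile; the function name and the test pattern are made up, and it is not meant as a replacement for rgbgfx.</p>

```python
def decode_tile(tile: bytes) -> list[list[int]]:
    """Decode one 16-byte 2bpp tile into 8 rows of 8 color indices (0-3)."""
    if len(tile) != 16:
        raise ValueError("a 2bpp tile is exactly 16 bytes")
    rows = []
    for y in range(8):
        low, high = tile[2 * y], tile[2 * y + 1]  # two bitplanes per row
        row = []
        for x in range(8):
            bit = 7 - x  # the leftmost pixel is stored in bit 7
            row.append(((high >> bit) & 1) * 2 + ((low >> bit) & 1))
        rows.append(row)
    return rows


# 896 bytes / 16 bytes per tile = number of tiles in classic.2bpp
print(896 // 16)

# Made-up test pattern: four rows of color 1, then four rows split into colors 2 and 1.
tile = bytes([0xFF, 0x00] * 4 + [0x0F, 0xF0] * 4)
for row in decode_tile(tile):
    print("".join(".+#@"[color] for color in row))
```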

<p>And this is where the first problem already emerges.
The Game Boy’s Pixel Processing Unit (PPU) can handle two kinds of sprites: 8x16 and 8x8.
Of course, “The Jungle Book” uses 8x16 sprites while “Super Mario Land 2” uses 8x8 sprites (at least for the Goombas).
Also, “Super Mario Land 2” uses some mirroring tricks that we cannot really use in “The Jungle Book”,
the colors need some adjustment, and the tileset has to be trimmed to fit into the cartridge.
So, a little <a href="/assets/rom_hacking/convert.py">Python script</a> and a bit of tweaking are needed:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">cp </span>classic.2bpp GoombaSprites.2bpp
<span class="nb">truncate</span> <span class="nt">-s</span> 736 GoombaSprites.2bpp
rgbgfx <span class="nt">--reverse</span> 2 <span class="nt">-o</span> GoombaSprites.2bpp GoombaSprites.png
./convert.py
rgbgfx <span class="nt">-o</span> GoombaSprites.2bpp <span class="nt">-c</span> <span class="nv">dmg</span><span class="o">=</span>d8 GoombaSprites_swapped.png
</code></pre></div></div>
<p>To save you the hassle, here’s the new palette:</p>

<div style="text-align:center">
  <img src="/assets/rom_hacking/GoombaSprites_swapped.png" id="classic-changed-png" alt="Changed sprite palette of classic enemies in Super Mario Land" width="4%" />
</div>
<p><br /></p>
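<p>Why does the sprite mode matter for the tile order?
In 8x16 mode, the PPU ignores bit 0 of a sprite’s tile index and always draws two consecutive tiles on top of each other.
The Python sketch below illustrates this, together with one possible reordering of a 16x16 block.
The source layout (top-left, top-right, bottom-left, bottom-right) is an assumption for illustration and not necessarily what the linked <code class="language-plaintext highlighter-rouge">convert.py</code> does.</p>

```python
def tiles_for_8x16(tile_index: int) -> tuple[int, int]:
    """Tile numbers (top, bottom) the PPU uses for one sprite in 8x16 mode."""
    top = tile_index & 0xFE  # bit 0 of the tile index is ignored
    return top, top | 0x01


def to_8x16_order(block: list[bytes]) -> list[bytes]:
    """Reorder a 16x16 block from TL, TR, BL, BR (assumed) to TL, BL, TR, BR."""
    top_left, top_right, bottom_left, bottom_right = block
    return [top_left, bottom_left, top_right, bottom_right]


print(tiles_for_8x16(5))
print(to_8x16_order([b"TL", b"TR", b"BL", b"BR"]))
```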

<p>Next, we need to alter the source code of the Jungle Book game.
Specifically, we need to add the new sprite palette as <code class="language-plaintext highlighter-rouge">gfx/GoombaSprites.2bpp</code>
and change the places where the file is included.</p>

<p>In <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/src/bank_005.asm">Bank 5</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; $491c: Replaces the 736 bytes of BoarSprites.2bpp.
GoombaSprites::
    INCBIN "gfx/GoombaSprites.2bpp"
</code></pre></div></div>

<p>And in <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/src/bank_004.asm">Bank 4</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; $7f72: Upper two bits of each pointer + 5 determines ROM bank.
ObjectSpritePointers::
    ; ROM bank 5
    MakeObjSpritePtr 5, AssetSprites                    ; $07
    MakeObjSpritePtr 5, SittingMonkeySprites            ; $08
    MakeObjSpritePtr 5, GoombaSprites                   ; $09
</code></pre></div></div>
<p>Note that you could also just overwrite <code class="language-plaintext highlighter-rouge">gfx/BoarSprites.2bpp</code> with the Goomba’s sprites,
but for this post we go with the clean approach.
Now the game already loads the Goomba’s sprites, but this is far from sufficient, as several other things need to be defined as well.
This includes the frames of an object’s animation.
To let our Goomba walk, we first need to define the size of each animation frame:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NumObjectSprites::
.Unknown0:          db $11
.Goomba:            db $22, $22, $22, $22
</code></pre></div></div>
<p>Since Goombas aren’t into crazy gymnastics, every animation frame simply has a size of 2x2 tiles.
In theory, 2x1 would also suffice (remember: each sprite is 8x16), but with the anchor of the object being relatively high up,
the Goomba would hover above the ground.
Also, the number of animation frames is hardcoded to 4, as we are taking the object slot of the boar.
Next, the pixel offsets of the object sprites need to be defined:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjSpritePixelOffsets::
.Unknown0:         db   0,   0
.Goomba:           db   0,   1,   0,   1,   0,   1,   0,   1
</code></pre></div></div>
<p>These offsets are useful if an object is jumping or dancing, but our short-legged friends don’t have much to offer in that regard.
Just a little offset in the Y direction suffices to align the Goomba perfectly with the ground.
As a last step, the actual animation needs to be defined.
Since there are 4 animation frames, 4 pointers to index sets are needed:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjAnimationIndicesPtr::
.Unknown:         dw $0000
.Goomba:          dw $0019, $0011, $0019, $0011
</code></pre></div></div>
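<p>As you can see, we let our Goomba alternate between two different frames.
Each pointer is an offset into <code class="language-plaintext highlighter-rouge">ObjAnimationIndices</code>; for example, <code class="language-plaintext highlighter-rouge">$0019</code> refers to <code class="language-plaintext highlighter-rouge">.Ind019</code>:</p>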
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ObjAnimationIndices::
.Ind000:          db $02
.Ind001:          db $04, $06, $08, $0a
.Ind005:          db $14, $16, $18, $1a
.Ind009:          db $0c, $0e, $08, $0a, $1c, $1e, $20, $22
.Ind011:          db $02, $02, $04, $0a, $24, $26, $28, $22
.Ind019:          db $02, $02, $06, $08, $02, $02, $02, $02
</code></pre></div></div>
<p>Each of these entries is a set of values that refer to tiles in the given sprite palette.
Note that the actual tile index is calculated by <code class="language-plaintext highlighter-rouge">(value - 4) / 2</code>.
If a value is 2, the corresponding tile will be empty.</p>
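<p>To double-check the two Goomba frames, the lookup can be replayed in a few lines of Python.
This is only a sketch of the calculation described above: the table is copied from the listing, the helper names are made up, and reading 8 entries per frame is an assumption based on the row lengths of the listing.</p>

```python
EMPTY = 2  # per the text above, a value of 2 marks an empty tile

# Bytes of ObjAnimationIndices as listed above; the labels encode the hex offset.
OBJ_ANIMATION_INDICES = bytes([
    0x02,                                            # .Ind000
    0x04, 0x06, 0x08, 0x0A,                          # .Ind001
    0x14, 0x16, 0x18, 0x1A,                          # .Ind005
    0x0C, 0x0E, 0x08, 0x0A, 0x1C, 0x1E, 0x20, 0x22,  # .Ind009
    0x02, 0x02, 0x04, 0x0A, 0x24, 0x26, 0x28, 0x22,  # .Ind011
    0x02, 0x02, 0x06, 0x08, 0x02, 0x02, 0x02, 0x02,  # .Ind019
])


def tile_index(value: int):
    """Apply (value - 4) / 2; returns None for the empty marker."""
    if value == EMPTY:
        return None
    return (value - 4) // 2


def frame_tiles(offset: int, count: int = 8) -> list:
    """Resolve `count` entries starting at `offset` (count=8 is an assumption)."""
    values = OBJ_ANIMATION_INDICES[offset:offset + count]
    return [tile_index(value) for value in values]


print(frame_tiles(0x19))  # first Goomba frame
print(frame_tiles(0x11))  # second Goomba frame
```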

<p>And that’s already it.
If you don’t want to change the code yourself,
here’s my <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/tree/dev-goomba">dev-goomba branch</a> with all the aforementioned changes.
Now recompile the game, launch it, and see what happens:</p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="50%" height="50%">
    <source src="/assets/rom_hacking/jb_with_goomba.webm" type="video/webm" />
  </video>
</div>

<p>With the appearance now being defined, it’s time to give our Goomba a new hitbox - in theory.
In practice, the game defines multiple static hitboxes, and some objects share the same hitbox:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HitBoxData::
    db  -4, -12,  4,  -4   ;  $1 = Projectiles
    db  -6, -12,  6,   0   ;  $2 = Pineapple, diamond, ...
    db  -8, -16,  8,   0   ;  $3 = Sitting monkey
    db  -8, -26,  8,   0   ;  $4 = Walking monkey, standing monkey
    db -10, -32, 10,   0   ;  $5 = Cobra
    db -12, -18, 12,   0   ;  $6 = Boar/Goomba, porcupine, armadillo
</code></pre></div></div>

<p>If we changed the Goomba’s hitbox, we would also change the porcupine’s and armadillo’s hitbox.
We could add another hitbox, but Bank 1 is already completely full,
which means we would slowly descend into rewriting the game, and that is more than I am willing to do for this post.</p>
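<p>To make the numbers in <code class="language-plaintext highlighter-rouge">HitBoxData</code> a bit more tangible, here is a Python sketch of an overlap test.
It assumes that the four bytes are the left, top, right, and bottom offsets relative to the object’s anchor, which matches the values but is not confirmed by the disassembly shown here.
The positions in the demo are made up, and the game’s actual collision routine may work differently.</p>

```python
from typing import NamedTuple


class HitBox(NamedTuple):
    # Assumed field order: offsets relative to the object's anchor, in pixels.
    left: int
    top: int
    right: int
    bottom: int


PROJECTILE = HitBox(-4, -12, 4, -4)    # entry $1 from HitBoxData
BOAR_GOOMBA = HitBox(-12, -18, 12, 0)  # entry $6 from HitBoxData


def overlaps(a: HitBox, ax: int, ay: int, b: HitBox, bx: int, by: int) -> bool:
    """Axis-aligned overlap test for two hitboxes anchored at (ax, ay) and (bx, by)."""
    return (ax + a.left <= bx + b.right and bx + b.left <= ax + a.right
            and ay + a.top <= by + b.bottom and by + b.top <= ay + a.bottom)


# Width and height of the shared boar/Goomba hitbox.
print(BOAR_GOOMBA.right - BOAR_GOOMBA.left, BOAR_GOOMBA.bottom - BOAR_GOOMBA.top)
print(overlaps(BOAR_GOOMBA, 100, 100, PROJECTILE, 110, 100))
print(overlaps(BOAR_GOOMBA, 100, 100, PROJECTILE, 130, 100))
```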

<h2 id="5-conclusion">5 Conclusion</h2>

<p>So, that’s about it.
If you have any corrections or additions, feel free to <a href="/about">send me an email</a> :)</p>]]></content><author><name></name></author><category term="Game" /><category term="Boy" /><summary type="html"><![CDATA[This post covers the ROM hacking of Game Boy games. It is designed as a comprehensive, in-depth guide that covers everything from simple to advanced ROM hacks as well as the closely related reverse engineering. As a practical example, this guide is accompanied by the Game Boy game “Disney’s The Jungle Book” from 1994 - a game on which I spent more than 200 hours of reverse-engineering and ROM hacking. You are probably already familiar with the term “ROM hacking” (why else would you be here?), but anyway, let’s start at the very beginning…]]></summary></entry><entry><title type="html">TLMBoy: The Audio Processing Unit (APU) - Wave Channel</title><link href="https://www.chciken.com/tlmboy/2025/04/22/gameboy-apu-wave.html" rel="alternate" type="text/html" title="TLMBoy: The Audio Processing Unit (APU) - Wave Channel" /><published>2025-04-22T11:41:44+00:00</published><updated>2025-04-22T11:41:44+00:00</updated><id>https://www.chciken.com/tlmboy/2025/04/22/gameboy-apu-wave</id><content type="html" xml:base="https://www.chciken.com/tlmboy/2025/04/22/gameboy-apu-wave.html"><![CDATA[<p>In this part of my Game Boy simulator post series, I will cover the details of the wave channel of the so-called Audio Processing Unit (APU).
Unlike the other channels of the APU (noise and square), the wave channel allows you to customize the output sound.
Well, it only provides 32 samples with a 4-bit resolution, which makes it more suitable for custom waveforms than for anything else.
In theory, you can also constantly rewrite the 32 samples in order to play back any kind of recording.
For instance, this <a href="https://www.youtube.com/watch?v=1lzHfLYzyRM">video</a> provides examples of voice playbacks in Game Boy games.
Playing back voices with a 4-bit resolution on a Game Boy speaker sounds horrible, but things like the super crappy “PIKACHU” from Pokémon Yellow also have their own appeal.</p>

<p>There is already plenty of information available about the Game Boy’s hardware.
The following sources helped me a lot to write this post and my Game Boy simulator:</p>

<p><a href="https://dn790000.ca.archive.org/0/items/GameBoyProgManVer1.1/GameBoyProgManVer1.1.pdf">Official Game Boy Programming Manual</a> <br />
<a href="https://gbdev.gg8.se/wiki/articles/Gameboy_sound_hardware">Game Boy Development Wiki</a> <br />
<a href="http://marc.rawer.de/Gameboy/Docs/GBCPUman.pdf">Game Boy CPU Manual</a> <br />
<a href="https://gbdev.io/pandocs/Audio.html">Pan Docs (my favorite source)</a></p>

<p>Unlike the technical documentation from above, this post follows a more example-driven approach.
So, rather than getting lost in every tiny obscure behavior, I first highlight the general principles of the wave channel, which is then followed by some practical examples on how games made use of it.
<!-- I also provide a [test ROM](https://github.com/not-chciken/gb-square-test), which can be used for testing in emulator/simulator development. --></p>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }

  .slider-container {
    display: flex;
    gap: 13px; /* No gap between sliders */
  }

  .slider-wrapper {
    padding: 0;
    margin: 0;
    display: flex;
    align-items: center;
    justify-content: center;
    height: 200px;
    width: 10px; /* Almost no wrapper width */
  }

  .slider {
    transform: rotate(-90deg) translateX(100px);
    width: 150px;
    height: 10px;
    padding: 0;
    margin: 0;
    margin-top: 200px; /* Push the rotated slider visually up */
  }

  table {
    tr, th, td {
      padding: 0px !important;
    }
  }

</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#overview">Overview</a>
    <ul>
      <li><a href="#nr30-dac-enable">NR30: DAC Enable</a></li>
      <li><a href="#nr31-length-timer">NR31: Length Timer</a></li>
      <li><a href="#nr32-volume">NR32: Volume</a></li>
      <li><a href="#nr33-frequency-lsb">NR33: Frequency LSB</a></li>
      <li><a href="#nr34-channel-control--frequency-msb">NR34: Channel Control &amp; Frequency MSB</a></li>
    </ul>
  </li>
  <li><a href="#wave-simulator">Wave Simulator</a></li>
  <li><a href="#examples">Examples</a>
    <ul>
      <li><a href="#tetris">Tetris</a></li>
      <li><a href="#the-legend-of-zelda-links-awakening">The Legend of Zelda: Link's Awakening</a></li>
      <li><a href="#super-mario-land">Super Mario Land</a></li>
      <li><a href="#the-jungle-book">The Jungle Book</a></li>
    </ul>
  </li>
  </ul>
</div>

<h2 id="overview">Overview</h2>

<p>Similar to other units of the Game Boy (DMA, Pixel Processing Unit, etc.), communication with the APU is facilitated by
<a href="https://en.wikipedia.org/wiki/Memory-mapped_I/O_and_port-mapped_I/O">memory-mapped I/O</a>.
That means if you want to tell the APU something, you just write to certain memory-mapped registers,
while information about the APU’s current status can be retrieved by reading these registers.
For the wave channel, the following registers and addresses are relevant:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Address</th>
      <th>Bits</th>
      <th>Function</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NR30</td>
      <td>0xFF1A</td>
      <td><code class="language-plaintext highlighter-rouge">D--- ----</code></td>
      <td>DAC enable</td>
    </tr>
    <tr>
      <td>NR31</td>
      <td>0xFF1B</td>
      <td><code class="language-plaintext highlighter-rouge">LLLL LLLL</code></td>
      <td>Length load</td>
    </tr>
    <tr>
      <td>NR32</td>
      <td>0xFF1C</td>
      <td><code class="language-plaintext highlighter-rouge">-VV- ----</code></td>
      <td>Volume</td>
    </tr>
    <tr>
      <td>NR33</td>
      <td>0xFF1D</td>
      <td><code class="language-plaintext highlighter-rouge">FFFF FFFF</code></td>
      <td>Frequency LSB</td>
    </tr>
    <tr>
      <td>NR34</td>
      <td>0xFF1E</td>
      <td><code class="language-plaintext highlighter-rouge">TL-- -FFF</code></td>
      <td>Trigger, length enable, frequency MSB</td>
    </tr>
    <tr>
      <td>Wave pattern</td>
      <td>0xFF30</td>
      <td><code class="language-plaintext highlighter-rouge">SSSS SSSS</code></td>
      <td>Sample 0, Sample 1</td>
    </tr>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>Wave pattern</td>
      <td>0xFF3F</td>
      <td><code class="language-plaintext highlighter-rouge">SSSS SSSS</code></td>
      <td>Sample 30, Sample 31</td>
    </tr>
  </tbody>
</table>

<p>The next section highlights their function in greater detail.</p>
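<p>As the table shows, each byte of the wave pattern holds two 4-bit samples.
Here is a minimal Python sketch of how 32 samples end up in the 16 bytes starting at 0xFF30.
The function name is made up, the triangle wave is just an example, and the sketch relies on the first sample of each pair sitting in the upper nibble, as documented in Pan Docs.</p>

```python
WAVE_RAM_START = 0xFF30


def pack_wave(samples: list[int]) -> bytes:
    """Pack 32 four-bit samples into 16 bytes, first sample in the upper nibble."""
    if len(samples) != 32 or any(not 0 <= s <= 15 for s in samples):
        raise ValueError("expected 32 samples in the range 0..15")
    return bytes(samples[i] * 16 + samples[i + 1] for i in range(0, 32, 2))


# Example pattern: a triangle wave rising from 0 to 15 and falling back to 0.
triangle = list(range(16)) + list(range(15, -1, -1))
for offset, byte in enumerate(pack_wave(triangle)):
    print(f"${WAVE_RAM_START + offset:04X}: ${byte:02X}")
```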

<h2 id="wave-channel">Wave Channel</h2>

<p>First, the very technical definition of the wave channel registers before we head to some examples.</p>

<h3 id="nr30-dac-enable">NR30: DAC Enable</h3>
<p>NR30 switches the digital-to-analog converter (DAC) of the wave channel on or off.</p>

<p><strong>[0:6] 7-bit unused</strong>: Unused.<br />
<strong>[7:7] 1-bit DAC enable</strong>: 0 → DAC (and therefore sound) is turned off. 1 → DAC (and therefore sound) is turned on.</p>

<h3 id="nr31-length-timer">NR31: Length Timer</h3>
<p><strong>[0:7] 8-bit length timer</strong>:
Write-only.
The 8 bits are interpreted as an unsigned number ranging from 0 to 255.
This number determines the length of the sound: length = (256-value)*(1/256) seconds.
So, the shortest sound is 1/256 second, while the longest is 1 second.
Note that the “256-value” part leads to some counterintuitive behavior:
when writing 0, you get the longest possible length, and when writing 255, you get the shortest possible length.
If you want indefinite sustain, clear Bit 6 in register NR34.</p>
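<p>The inverted relationship is easiest to see with a few numbers.
Here is a tiny Python sketch of the formula above; the function name is made up, and 42 is just an arbitrary value in between.</p>

```python
def length_seconds(nr31: int) -> float:
    """Sound length for a given NR31 value: (256 - value) / 256 seconds."""
    if not 0 <= nr31 <= 255:
        raise ValueError("NR31 holds an 8-bit value")
    return (256 - nr31) / 256


for value in (0, 42, 255):
    print(value, length_seconds(value))
```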

<h3 id="nr32-volume">NR32: Volume</h3>
<p><strong>[0:4] 5-bit unused</strong>: Unused.<br />
<strong>[5:6] 2-bit volume</strong>: Controls sound volume: 00 → 0%; 01 → 100%; 10 → 50%; 11 → 25%.<br />
<strong>[7:7] 1-bit unused</strong>: Unused.</p>
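<p>The percentages follow from how the volume is commonly described for the hardware (see Pan Docs): the 4-bit sample is shifted to the right by 0, 1, or 2 bits, or muted entirely.
A rough Python sketch under that assumption, with made-up names:</p>

```python
# Volume code (bits 5-6 of NR32) -> right shift applied to each sample.
VOLUME_SHIFT = {0b00: None, 0b01: 0, 0b10: 1, 0b11: 2}  # None = muted


def apply_volume(sample: int, nr32: int) -> int:
    """Scale a 4-bit sample according to the volume bits of NR32."""
    shift = VOLUME_SHIFT[(nr32 >> 5) & 0b11]
    return 0 if shift is None else sample >> shift


for nr32 in (0b0000_0000, 0b0010_0000, 0b0100_0000, 0b0110_0000):
    print(f"{nr32:08b}", apply_volume(15, nr32))
```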

<h3 id="nr33-frequency-lsb">NR33: Frequency LSB</h3>
<p><strong>[0:7] 8-bit frequency lower bits</strong>:
The frequency comprises 11 bits in total (see NR34 as well).
The wave channel uses a non-exposed, 11-bit counter that increases every time it is clocked.
After 2047 it overflows, generates a signal, and is set to the value of NR33 and NR34.
The resulting sample rate is: 2,097,152/(2048-frequency).
Hence, the lowest sample rate is 1024 Hz and the highest one is 2,097,152 Hz.
Note that this is the rate at which individual samples of the wave pattern are processed.</p>
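<p>Written as a small Python sketch (function and constant names are made up), the formula and its two extremes look like this:</p>

```python
APU_CLOCK_HZ = 2_097_152


def sample_rate(frequency: int) -> float:
    """Sample rate in Hz for an 11-bit frequency value (NR33 plus NR34)."""
    if not 0 <= frequency <= 2047:
        raise ValueError("the frequency value has 11 bits")
    return APU_CLOCK_HZ / (2048 - frequency)


print(sample_rate(0))     # lowest sample rate
print(sample_rate(2047))  # highest sample rate
```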

<h3 id="nr34-channel-control--frequency-msb">NR34: Channel Control &amp; Frequency MSB</h3>
<p><strong>[0:2] 3-bit frequency upper bits</strong>: Upper 3 bits of the 11-bit frequency value. See NR33.<br />
<strong>[3:5] 3-bit unused</strong>: Unused.<br />
<strong>[6:6] 1-bit length enable</strong>:
0 → Sound is produced continuously, regardless of the length data in NR31.
1 → Sound is generated only for the time period set by the length data in NR31.<br />
<strong>[7:7] 1-bit trigger (write-only)</strong>: Writing 1 to this bit causes the following things:
The wave channel is enabled.
If the length timer has expired, it is reset.
The volume is reloaded from NR32.
The wave RAM index is reset, but the sample buffer is not refilled!</p>
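<p>Putting NR33 and NR34 together, the fields can be extracted as in the following Python sketch.
The function name and the dictionary keys are made up; the register values in the demo are the ones from the Tetris main theme example further below.</p>

```python
def decode_nr33_nr34(nr33: int, nr34: int) -> dict:
    """Split NR33/NR34 into the 11-bit frequency and the two control bits."""
    return {
        "frequency": ((nr34 & 0b111) << 8) | nr33,
        "length_enable": bool(nr34 & 0b0100_0000),
        "trigger": bool(nr34 & 0b1000_0000),
    }


print(decode_nr33_nr34(0b0000_1010, 0b1000_0110))
```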

<h2 id="wave-simulator">Wave Simulator</h2>

<p>Here’s a JavaScript-based wave channel simulator.
Using the table and sliders below, you can define individual settings and listen to the sound they would create on the Game Boy.
Note that the simulator repeats every 2 seconds.
Predefined setups of some games are provided in the next section.</p>

<script type="text/javascript" src="/assets/gameboy_apu/wave_simulator.js"></script>

<table id="wave-simulator-table">
  <tr>
    <th>Register</th>
    <th>Setting</th>
  </tr>
  <tr>
    <td>NR30: DAC Enable</td>
    <td>
      DAC Enable
      <input type="checkbox" id="dac-enable" name="dac-enable" checked="" />
    </td>
  </tr>
  <tr>
    <td>NR31: Length Timer</td>
    <td>
      Length
      <input type="number" id="wave-length" name="wave-length" value="42" min="0" max="255" /><br />
    </td>
  </tr>
  <tr>
    <td>NR32: Volume</td>
    <td>
      Volume
      <select name="wave-volume" id="wave-volume">
        <option value="0.0">0 / 0%</option>
        <option value="1.0">1 / 100%</option>
        <option value="0.5">2 / 50%</option>
        <option value="0.25">3 / 25%</option>
      </select>
    </td>
  </tr>
  <tr>
    <td>NR33/NR34: Frequency</td>
    <td>
      Frequency
      <input type="number" id="wave-frequency" name="wave-frequency" value="1800" min="0" max="2047" />
    </td>
  </tr>
  <tr>
    <td>NR34: Channel Control</td>
    <td>
      Length enable:
      <input type="checkbox" id="length-enable" name="length-enable" /><br />
      <button type="button" id="play-wave">Play/Stop</button>
    </td>
  </tr>
</table>

<div class="slider-container" id="sliders">
    <!-- Sliders will be injected here -->
  </div>
<p><br />
Sample values:<br /></p>
<table id="value-table">
  <tr id="header-row"></tr>
  <tr id="value-row"></tr>
</table>

<!-- Besides an implementation in Javascript, I also wrote the same application for the Game Boy:

<div style="text-align:center">
  <img id="lfsr-register" src="/assets/gameboy_apu/screenshot_gb_square_test.png"
  alt="Square Test Screenshot"
  width="40%"/>
</div> <br> -->

<!-- The source code and the corresponding ROM can be found in [this GitHub repository](https://github.com/not-chciken/gb-square-test). -->

<h2 id="examples">Examples</h2>

<p>In the following, examples of the wave channel in games are provided.
To get these examples I used the <a href="https://github.com/drhelius/Gearboy">Gearboy</a> emulator.
Click on “Use this setup” to load the wave simulator with the corresponding setup.
Note that all settings are just recordings at one point in time.
While most games use the same wave pattern throughout a theme or song, frequency and volume do change frequently.</p>

<h3 id="tetris">Tetris</h3>

<p>In the very well-known main theme of Tetris the following setting of the wave channel can be found:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Main theme

NR30: 1000 0000 -&gt; DAC enabled
NR31: 0000 0000 -&gt; maximum length, but irrelevant due to NR34
NR32: 0010 0000 -&gt; volume = 100%
NR33: 0000 1010 -&gt; frequency (1546) = 4178 Hz sample rate
NR34: 1000 0110 -&gt; trigger, length disabled

Wave samples:
0123 4567 89AB CDEF FEDC BA98 7654 3210
</code></pre></div></div>

<p><button type="button" id="tetris-main-theme" onclick="TetrisMainTheme()">Use this setup</button>
<script>
  function TetrisMainTheme() {
    document.getElementById("slider0").value = 0;
    document.getElementById("slider1").value = 1;
    document.getElementById("slider2").value = 2;
    document.getElementById("slider3").value = 3;
    document.getElementById("slider4").value = 4;
    document.getElementById("slider5").value = 5;
    document.getElementById("slider6").value = 6;
    document.getElementById("slider7").value = 7;
    document.getElementById("slider8").value = 8;
    document.getElementById("slider9").value = 9;
    document.getElementById("slider10").value = 10;
    document.getElementById("slider11").value = 11;
    document.getElementById("slider12").value = 12;
    document.getElementById("slider13").value = 13;
    document.getElementById("slider14").value = 14;
    document.getElementById("slider15").value = 15;
    document.getElementById("slider16").value = 15;
    document.getElementById("slider17").value = 14;
    document.getElementById("slider18").value = 13;
    document.getElementById("slider19").value = 12;
    document.getElementById("slider20").value = 11;
    document.getElementById("slider21").value = 10;
    document.getElementById("slider22").value = 9;
    document.getElementById("slider23").value = 8;
    document.getElementById("slider24").value = 7;
    document.getElementById("slider25").value = 6;
    document.getElementById("slider26").value = 5;
    document.getElementById("slider27").value = 4;
    document.getElementById("slider28").value = 3;
    document.getElementById("slider29").value = 2;
    document.getElementById("slider30").value = 1;
    document.getElementById("slider31").value = 0;
    DispatchChanges();
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
    document.getElementById("wave-frequency").value = 1546;
  }
</script></p>

<p>While this setting is used mostly unchanged throughout the theme, the actual frequency configuration changes from note to note.
In the given example, we have a frequency value of 1546, resulting in a sample rate of 4178 Hz.
Dividing this by the 32 samples of the wave pattern gives us a frequency of about 130 Hz, which corresponds to a C3 note.
A relatively low note, but no surprise as the wave channel is used for the bass.
As a waveform, a pretty vanilla triangle wave is used.
Click on “Use this setup” to get an immediate visual representation of such a waveform.</p>
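<p>If you want to redo this calculation for other register values, here is a short Python sketch.
It assumes equal temperament with A4 = 440 Hz, and the function names are made up.</p>

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def wave_tone_hz(frequency: int, periods_in_pattern: int = 1) -> float:
    """Tone of the wave channel: sample rate divided by the 32 samples of the pattern."""
    sample_rate = 2_097_152 / (2048 - frequency)
    return sample_rate / 32 * periods_in_pattern


def nearest_note(hz: float) -> str:
    """Name of the equal-tempered note closest to `hz` (A4 = 440 Hz)."""
    midi = round(69 + 12 * math.log2(hz / 440.0))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"


tone = wave_tone_hz(1546)
print(round(tone, 1), nearest_note(tone))
```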

<p>In the opening theme of Tetris (not the famous one, but the one you hear on the start screen), I found the following sine-like waveform:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Opening theme

NR30: 1000 0000 -&gt; DAC enable
NR31: 0000 0000 -&gt; maximum length, but irrelevant due to NR34
NR32: 0010 0000 -&gt; volume = 100%
NR33: 1100 0100 -&gt; frequency (1732) = 6636 Hz sample rate
NR34: 1000 0110 -&gt; trigger, length disabled

Wave samples:
1123 5678 9998 7667 9ADF FEC9 8542 1131
</code></pre></div></div>

<p><button type="button" id="tetris-opening-theme" onclick="TetrisOpeningTheme()">Use this setup</button>
<script>
  function TetrisOpeningTheme() {
    document.getElementById("slider0").value = 1;
    document.getElementById("slider1").value = 1;
    document.getElementById("slider2").value = 2;
    document.getElementById("slider3").value = 3;
    document.getElementById("slider4").value = 5;
    document.getElementById("slider5").value = 6;
    document.getElementById("slider6").value = 7;
    document.getElementById("slider7").value = 8;
    document.getElementById("slider8").value = 9;
    document.getElementById("slider9").value = 9;
    document.getElementById("slider10").value = 9;
    document.getElementById("slider11").value = 8;
    document.getElementById("slider12").value = 7;
    document.getElementById("slider13").value = 6;
    document.getElementById("slider14").value = 6;
    document.getElementById("slider15").value = 7;
    document.getElementById("slider16").value = 9;
    document.getElementById("slider17").value = 10;
    document.getElementById("slider18").value = 13;
    document.getElementById("slider19").value = 15;
    document.getElementById("slider20").value = 15;
    document.getElementById("slider21").value = 14;
    document.getElementById("slider22").value = 12;
    document.getElementById("slider23").value = 9;
    document.getElementById("slider24").value = 8;
    document.getElementById("slider25").value = 5;
    document.getElementById("slider26").value = 4;
    document.getElementById("slider27").value = 2;
    document.getElementById("slider28").value = 1;
    document.getElementById("slider29").value = 1;
    document.getElementById("slider30").value = 3;
    document.getElementById("slider31").value = 1;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1732;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }
</script></p>

<p>Just as in the other Tetris theme, the wave channel is again used as a bass.
From the 6636 Hz sample rate, you can derive a 207 Hz tone, which corresponds to a G#3.
Only the waveform differs from before:
it looks like two overlapping sine waves.</p>

<h3 id="the-legend-of-zelda-links-awakening">The Legend of Zelda: Link’s Awakening</h3>

<p>In The Legend of Zelda: Link’s Awakening, I found the following setting in the <a href="https://www.youtube.com/watch?v=GAMdutjuIMA">intro of the game</a>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Intro

NR30: 1000 0000 -&gt; DAC enable
NR31: 0000 0000 -&gt; maximum length, irrelevant due to NR34
NR32: 0010 0000 -&gt; volume = 100%
NR33: 0100 1110 -&gt; frequency (1102) = 2216 Hz sample rate
NR34: 1000 0100 -&gt; trigger, length disable

Wave samples:
9999 9999 0000 0000 9999 9999 0000 0000
</code></pre></div></div>

<p><button type="button" id="links-awakening-intro" onclick="LinksAwakeningIntro()">Use this setup</button>
<script>
  function LinksAwakeningIntro() {
    document.getElementById("slider0").value = 9;
    document.getElementById("slider1").value = 9;
    document.getElementById("slider2").value = 9;
    document.getElementById("slider3").value = 9;
    document.getElementById("slider4").value = 9;
    document.getElementById("slider5").value = 9;
    document.getElementById("slider6").value = 9;
    document.getElementById("slider7").value = 9;
    document.getElementById("slider8").value = 0;
    document.getElementById("slider9").value = 0;
    document.getElementById("slider10").value = 0;
    document.getElementById("slider11").value = 0;
    document.getElementById("slider12").value = 0;
    document.getElementById("slider13").value = 0;
    document.getElementById("slider14").value = 0;
    document.getElementById("slider15").value = 0;
    document.getElementById("slider16").value = 9;
    document.getElementById("slider17").value = 9;
    document.getElementById("slider18").value = 9;
    document.getElementById("slider19").value = 9;
    document.getElementById("slider20").value = 9;
    document.getElementById("slider21").value = 9;
    document.getElementById("slider22").value = 9;
    document.getElementById("slider23").value = 9;
    document.getElementById("slider24").value = 0;
    document.getElementById("slider25").value = 0;
    document.getElementById("slider26").value = 0;
    document.getElementById("slider27").value = 0;
    document.getElementById("slider28").value = 0;
    document.getElementById("slider29").value = 0;
    document.getElementById("slider30").value = 0;
    document.getElementById("slider31").value = 0;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1102;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }
</script></p>

<p>You can already see from the data that this is just a simple square wave.
From the sample rate of 2216 Hz, we can derive that the 32-sample wave pattern repeats at 69 Hz.
But notice that the 32 samples contain two periods of the square wave!
Hence, the perceived tone is a C#3 (138 Hz) and the wave channel takes the role of a bass.</p>
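<p>This doubling can also be detected automatically.
The following Python sketch looks for the shortest repeating block in the 32 samples and derives the tone from it; the function name is made up, and the sample data is the square wave from the listing above.</p>

```python
def pattern_period(samples: list[int]) -> int:
    """Length of the shortest block that, when repeated, yields the whole pattern."""
    n = len(samples)
    for p in range(1, n + 1):
        if n % p == 0 and samples == samples[:p] * (n // p):
            return p
    return n


intro = ([9] * 8 + [0] * 8) * 2          # wave samples of the intro
period = pattern_period(intro)
sample_rate = 2_097_152 / (2048 - 1102)  # frequency value 1102
print(period, round(sample_rate / period, 1))
```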

<p>After the cinematic intro, you get into the main menu where the <a href="https://www.youtube.com/watch?v=EreHPNJHn18">iconic TLoZ</a> theme plays.
Here, the wave channel is used in a similar way as before.
However, this time there is only one period of a square wave in the wave pattern.
Also note that in this case the volume of the wave channel is frequently changed to model some decay.
While other channels (like noise and square) provide an envelope function for this effect, this feature is absent in the wave channel.
The sample rate of 4434 Hz results in a C#3 (138 Hz).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NR30: 1000 0000 -&gt; DAC enable
NR31: 0000 0000 -&gt; maximum length, but irrelevant due to NR34
NR32: 0110 0000 -&gt; volume = 25%
NR33: 0010 0111 -&gt; frequency (1575) -&gt; 4434 Hz sample rate
NR34: 1000 0110 -&gt; trigger, length disabled

Wave samples:
8888 8888 8888 8888 0000 0000 0000 0000
</code></pre></div></div>

<p><button type="button" id="links-awakening-main-theme" onclick="LinksAwakeningMainTheme()">Use this setup</button>
<script>
  function LinksAwakeningMainTheme() {
    document.getElementById("slider0").value = 8;
    document.getElementById("slider1").value = 8;
    document.getElementById("slider2").value = 8;
    document.getElementById("slider3").value = 8;
    document.getElementById("slider4").value = 8;
    document.getElementById("slider5").value = 8;
    document.getElementById("slider6").value = 8;
    document.getElementById("slider7").value = 8;
    document.getElementById("slider8").value = 8;
    document.getElementById("slider9").value = 8;
    document.getElementById("slider10").value = 8;
    document.getElementById("slider11").value = 8;
    document.getElementById("slider12").value = 8;
    document.getElementById("slider13").value = 8;
    document.getElementById("slider14").value = 8;
    document.getElementById("slider15").value = 8;
    document.getElementById("slider16").value = 0;
    document.getElementById("slider17").value = 0;
    document.getElementById("slider18").value = 0;
    document.getElementById("slider19").value = 0;
    document.getElementById("slider20").value = 0;
    document.getElementById("slider21").value = 0;
    document.getElementById("slider22").value = 0;
    document.getElementById("slider23").value = 0;
    document.getElementById("slider24").value = 0;
    document.getElementById("slider25").value = 0;
    document.getElementById("slider26").value = 0;
    document.getElementById("slider27").value = 0;
    document.getElementById("slider28").value = 0;
    document.getElementById("slider29").value = 0;
    document.getElementById("slider30").value = 0;
    document.getElementById("slider31").value = 0;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1575;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }
</script></p>

<h3 id="super-mario-land">Super Mario Land</h3>

<p>In the <a href="https://www.youtube.com/watch?v=ZexyCYXoYHM">Super Mario Land overworld theme</a> I found the following setting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Theme
NR30: 1000 0000 -&gt; DAC enable
NR31: 0000 0000 -&gt; maximum length, but irrelevant due to NR34
NR32: 0010 0000 -&gt; volume = 100%
NR33: 0000 0110 -&gt; frequency (1798) -&gt; 8389 Hz sample rate
NR34: 1000 0111 -&gt; trigger, length disabled

Wave samples:
0123 5678 9998 7667 9ADF FEC9 8542 1100
</code></pre></div></div>
<p>Surprisingly, it’s almost the same wave pattern as in the Tetris opening theme.
Also, similar to all examples before, the wave channel again takes the role of a bass.
In the given setting, the played sound is at 262 Hz, resulting in a C4.</p>
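<p>To go from a frequency to a note name such as C4, the usual equal-temperament formula with A4 = 440 Hz can be used. The following sketch is my own helper (the name <code class="language-plaintext highlighter-rouge">nearestNoteName</code> is hypothetical) and simply rounds to the nearest semitone.</p>

```javascript
const NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"];

// Nearest equal-tempered note (A4 = 440 Hz, MIDI note 69) for a frequency in Hz.
function nearestNoteName(frequencyHz) {
  const midi = Math.round(69 + 12 * Math.log2(frequencyHz / 440));
  const octave = Math.floor(midi / 12) - 1;
  return NOTE_NAMES[midi % 12] + octave;
}

// Sample rate of 8389 Hz divided by the 32 samples of one pattern.
console.log(nearestNoteName(8389 / 32));
```

<p>Since real Game Boy frequencies rarely hit a note exactly, rounding to the nearest semitone is a pragmatic choice here.</p>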

<p><button type="button" id="super-mario-land-theme" onclick="SuperMarioLandTheme()">Use this setup</button>
<script>
  function SuperMarioLandTheme() {
    document.getElementById("slider0").value = 0;
    document.getElementById("slider1").value = 1;
    document.getElementById("slider2").value = 2;
    document.getElementById("slider3").value = 3;
    document.getElementById("slider4").value = 5;
    document.getElementById("slider5").value = 6;
    document.getElementById("slider6").value = 7;
    document.getElementById("slider7").value = 8;
    document.getElementById("slider8").value = 9;
    document.getElementById("slider9").value = 9;
    document.getElementById("slider10").value = 9;
    document.getElementById("slider11").value = 8;
    document.getElementById("slider12").value = 7;
    document.getElementById("slider13").value = 6;
    document.getElementById("slider14").value = 6;
    document.getElementById("slider15").value = 7;
    document.getElementById("slider16").value = 9;
    document.getElementById("slider17").value = 10;
    document.getElementById("slider18").value = 13;
    document.getElementById("slider19").value = 15;
    document.getElementById("slider20").value = 15;
    document.getElementById("slider21").value = 14;
    document.getElementById("slider22").value = 12;
    document.getElementById("slider23").value = 9;
    document.getElementById("slider24").value = 8;
    document.getElementById("slider25").value = 5;
    document.getElementById("slider26").value = 4;
    document.getElementById("slider27").value = 2;
    document.getElementById("slider28").value = 1;
    document.getElementById("slider29").value = 1;
    document.getElementById("slider30").value = 0;
    document.getElementById("slider31").value = 0;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1798;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }
</script></p>

<h3 id="the-jungle-book">The Jungle Book</h3>

<p>During my work on reverse-engineering “The Jungle Book” for the Game Boy, I also stumbled across the settings for the game’s wave channel.
In total, there are 3 different wave sample settings, which can be found in <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/src/bank_007.asm">this file</a> (see <code class="language-plaintext highlighter-rouge">WaveSampleData</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Theme
NR30: 1000 0000 -&gt; DAC enable
NR31: 0000 0000 -&gt; maximum length, but irrelevant due to NR34
NR32: 0010 0000 -&gt; volume = 100%
NR33: 0011 1011 -&gt; frequency (1339) -&gt; 2958 Hz sample rate
NR34: 1000 0101 -&gt; trigger, length disabled

Wave samples:
0123 4567 89AB CDEF EDCB A987 6543 2100  ; Config0: Triangle wave
0124 68AB CDEF FFFF FFFF FEDC BA86 4210  ; Config1: Clipped sine
7CEE EFFF FFEE EDC9 6321 1110 0001 1137  ; Config2: Noisy and rounded square wave
</code></pre></div></div>

<p><button type="button" id="jungle-book-config-0" onclick="JungleBookConfig0()">Use Config0</button>
<button type="button" id="jungle-book-config-1" onclick="JungleBookConfig1()">Use Config1</button>
<button type="button" id="jungle-book-config-2" onclick="JungleBookConfig2()">Use Config2</button></p>
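<p>The setup scripts on this page assign all 32 sliders one by one. As a sketch of a more compact alternative, the listing format used above can be parsed directly. Both function names are hypothetical. The slider IDs and <code class="language-plaintext highlighter-rouge">DispatchChanges()</code> are the ones used by the scripts on this page, so the second function only works in the browser.</p>

```javascript
// Turn a listing such as "0123 4567 ..." into an array of 32 numbers (0 to 15).
function parseWaveSamples(listing) {
  const digits = listing.replace(/\s+/g, "");
  if (!/^[0-9a-fA-F]{32}$/.test(digits)) {
    throw new Error("expected exactly 32 hex digits");
  }
  return Array.from(digits, function (digit) {
    return parseInt(digit, 16);
  });
}

// Write the samples into the simulator's sliders (browser only, not called here).
function applyWaveSamples(samples) {
  samples.forEach(function (sample, index) {
    document.getElementById("slider" + index).value = sample;
  });
  DispatchChanges(); // same notification call the setup scripts on this page use
}

const triangle = parseWaveSamples("0123 4567 89AB CDEF EDCB A987 6543 2100");
console.log(triangle.join(" "));
```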

<script>
  function JungleBookConfig0() {
    document.getElementById("slider0").value = 0x0;
    document.getElementById("slider1").value = 0x1;
    document.getElementById("slider2").value = 0x2;
    document.getElementById("slider3").value = 0x3;
    document.getElementById("slider4").value = 0x4;
    document.getElementById("slider5").value = 0x5;
    document.getElementById("slider6").value = 0x6;
    document.getElementById("slider7").value = 0x7;
    document.getElementById("slider8").value = 0x8;
    document.getElementById("slider9").value = 0x9;
    document.getElementById("slider10").value = 0xA;
    document.getElementById("slider11").value = 0xB;
    document.getElementById("slider12").value = 0xC;
    document.getElementById("slider13").value = 0xD;
    document.getElementById("slider14").value = 0xE;
    document.getElementById("slider15").value = 0xF;
    document.getElementById("slider16").value = 0xE;
    document.getElementById("slider17").value = 0xD;
    document.getElementById("slider18").value = 0xC;
    document.getElementById("slider19").value = 0xB;
    document.getElementById("slider20").value = 0xA;
    document.getElementById("slider21").value = 0x9;
    document.getElementById("slider22").value = 0x8;
    document.getElementById("slider23").value = 0x7;
    document.getElementById("slider24").value = 0x6;
    document.getElementById("slider25").value = 0x5;
    document.getElementById("slider26").value = 0x4;
    document.getElementById("slider27").value = 0x3;
    document.getElementById("slider28").value = 0x2;
    document.getElementById("slider29").value = 0x1;
    document.getElementById("slider30").value = 0x0;
    document.getElementById("slider31").value = 0x0;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1339;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }

  function JungleBookConfig1() {
    document.getElementById("slider0").value = 0x0;
    document.getElementById("slider1").value = 0x1;
    document.getElementById("slider2").value = 0x2;
    document.getElementById("slider3").value = 0x4;
    document.getElementById("slider4").value = 0x6;
    document.getElementById("slider5").value = 0x8;
    document.getElementById("slider6").value = 0xA;
    document.getElementById("slider7").value = 0xB;
    document.getElementById("slider8").value = 0xC;
    document.getElementById("slider9").value = 0xD;
    document.getElementById("slider10").value = 0xE;
    document.getElementById("slider11").value = 0xF;
    document.getElementById("slider12").value = 0xF;
    document.getElementById("slider13").value = 0xF;
    document.getElementById("slider14").value = 0xF;
    document.getElementById("slider15").value = 0xF;
    document.getElementById("slider16").value = 0xF;
    document.getElementById("slider17").value = 0xF;
    document.getElementById("slider18").value = 0xF;
    document.getElementById("slider19").value = 0xF;
    document.getElementById("slider20").value = 0xF;
    document.getElementById("slider21").value = 0xE;
    document.getElementById("slider22").value = 0xD;
    document.getElementById("slider23").value = 0xC;
    document.getElementById("slider24").value = 0xB;
    document.getElementById("slider25").value = 0xA;
    document.getElementById("slider26").value = 0x8;
    document.getElementById("slider27").value = 0x6;
    document.getElementById("slider28").value = 0x4;
    document.getElementById("slider29").value = 0x2;
    document.getElementById("slider30").value = 0x1;
    document.getElementById("slider31").value = 0x0;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1339;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }

  function JungleBookConfig2() {
    document.getElementById("slider0").value = 0x7;
    document.getElementById("slider1").value = 0xc;
    document.getElementById("slider2").value = 0xe;
    document.getElementById("slider3").value = 0xe;
    document.getElementById("slider4").value = 0xe;
    document.getElementById("slider5").value = 0xf;
    document.getElementById("slider6").value = 0xf;
    document.getElementById("slider7").value = 0xf;
    document.getElementById("slider8").value = 0xf;
    document.getElementById("slider9").value = 0xf;
    document.getElementById("slider10").value = 0xe;
    document.getElementById("slider11").value = 0xe;
    document.getElementById("slider12").value = 0xe;
    document.getElementById("slider13").value = 0xd;
    document.getElementById("slider14").value = 0xc;
    document.getElementById("slider15").value = 0x9;
    document.getElementById("slider16").value = 0x6;
    document.getElementById("slider17").value = 0x3;
    document.getElementById("slider18").value = 0x2;
    document.getElementById("slider19").value = 0x1;
    document.getElementById("slider20").value = 0x1;
    document.getElementById("slider21").value = 0x1;
    document.getElementById("slider22").value = 0x1;
    document.getElementById("slider23").value = 0x0;
    document.getElementById("slider24").value = 0x0;
    document.getElementById("slider25").value = 0x0;
    document.getElementById("slider26").value = 0x0;
    document.getElementById("slider27").value = 0x1;
    document.getElementById("slider28").value = 0x1;
    document.getElementById("slider29").value = 0x1;
    document.getElementById("slider30").value = 0x3;
    document.getElementById("slider31").value = 0x7;
    DispatchChanges();
    document.getElementById("wave-frequency").value = 1339;
    document.getElementById("wave-volume").value = 0.5;
    document.getElementById("wave-simulator-table").scrollIntoView();
  }
</script>]]></content><author><name></name></author><category term="TLMBoy" /><summary type="html"><![CDATA[In this part of my Game Boy simulator post series, I will cover the details of the wave channel of the so-called Audio Processing Unit (APU). Unlike the other channels of the APU (noise and square), the wave channel allows you to customize the output sound. Well, it only provides 32 samples with a 4-bit resolution, which makes it rather suitable for custom wave forms than anything else. In theory, you can also constantly rewrite the 32 samples in order to play back any kind of recording. For instance, this video provides examples of voice playbacks in Game Boy games. Playing back voices with a 4-bit resolution on a Game Boy speaker sounds horrible, but things like the super crappy “PIKACHU” from Pokémon Yellow also have their own appeal.]]></summary></entry><entry><title type="html">TLMBoy: The Audio Processing Unit (APU) - Square Channel</title><link href="https://www.chciken.com/tlmboy/2025/03/28/gameboy-apu-square.html" rel="alternate" type="text/html" title="TLMBoy: The Audio Processing Unit (APU) - Square Channel" /><published>2025-03-28T14:51:44+00:00</published><updated>2025-03-28T14:51:44+00:00</updated><id>https://www.chciken.com/tlmboy/2025/03/28/gameboy-apu-square</id><content type="html" xml:base="https://www.chciken.com/tlmboy/2025/03/28/gameboy-apu-square.html"><![CDATA[<p>In this part of my Game Boy simulator post series, I will cover the details of the square channel of the so-called Audio Processing Unit (APU).
Unlike modern hardware, the Game Boy cannot (or is not supposed to) play back sample-based audio recordings.
Rather, the APU has 4 different channels that act like instruments controlled by notes and dynamics.
Two of these channels are square channels, which are the focus of this post.
These channels generate square waves at given frequencies, allowing you to play notes just like an instrument.
The first square channel also has some extra frequency-shifting features, which can be used to create various kinds of sounds.</p>

<p>When it comes to information about the Game Boy’s hardware, there’s already plenty of information available.
The following sources helped me a lot to write this post and my Game Boy simulator:</p>

<p><a href="https://dn790000.ca.archive.org/0/items/GameBoyProgManVer1.1/GameBoyProgManVer1.1.pdf">Official Game Boy Programming Manual</a> <br />
<a href="https://gbdev.gg8.se/wiki/articles/Gameboy_sound_hardware">Game Boy Development Wiki</a> <br />
<a href="http://marc.rawer.de/Gameboy/Docs/GBCPUman.pdf">Game Boy CPU Manual</a> <br />
<a href="https://gbdev.io/pandocs/Audio.html">Pan Docs (my favorite source)</a></p>

<p>Unlike the technical documentation from above, this post follows a more example-driven approach.
So, rather than getting lost in every tiny obscure behavior, I first highlight the general principles of the square channel, which is then followed by some practical examples on how games made use of it.
I also provide a <a href="https://github.com/not-chciken/gb-square-test">test ROM</a>, which can be used for testing in emulator/simulator development.</p>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#overview">Overview</a>
    <ul>
      <li><a href="#nr10-channel-sweep">NR10: Channel Sweep</a></li>
      <li><a href="#nr11-channel-length-timer--duty">NR11: Length Timer &amp; Duty</a></li>
      <li><a href="#nr12-channel-volume--envelope">NR12: Envelope</a></li>
      <li><a href="#nr13-frequency-lsb">NR13: Frequency LSB</a></li>
      <li><a href="#nr14-channel-control--frequency-msb">NR14: Channel Control &amp; Frequency MSB</a></li>
    </ul>
  </li>
  <li><a href="#square-simulator">Square Simulator</a></li>
  <li><a href="#examples">Examples</a>
    <ul>
      <li><a href="#boot">Boot</a></li>
      <li><a href="#super-mario-land">Super Mario Land</a></li>
    </ul>
  </li>
  </ul>
</div>

<h2 id="overview">Overview</h2>

<p>Similar to other units of the Game Boy (DMA, Pixel Processing Unit, etc.), communication with the APU is facilitated by
<a href="https://en.wikipedia.org/wiki/Memory-mapped_I/O_and_port-mapped_I/O">memory-mapped I/O</a>.
That means if you want to tell the APU something, you simply write to certain memory-mapped registers,
while information about the APU’s current status can be retrieved by reading these registers.
For the two square channels, the following registers are relevant:</p>

<p>Square 1:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Address</th>
      <th>Bits</th>
      <th>Function</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NR10</td>
      <td>0xFF10</td>
      <td><code class="language-plaintext highlighter-rouge">-PPP NSSS</code></td>
      <td>Sweep period, negate, shift</td>
    </tr>
    <tr>
      <td>NR11</td>
      <td>0xFF11</td>
      <td><code class="language-plaintext highlighter-rouge">DDLL LLLL</code></td>
      <td>Wave duty, length load</td>
    </tr>
    <tr>
      <td>NR12</td>
      <td>0xFF12</td>
      <td><code class="language-plaintext highlighter-rouge">VVVV EPPP</code></td>
      <td>Init volume, envelope mode, envelope period</td>
    </tr>
    <tr>
      <td>NR13</td>
      <td>0xFF13</td>
      <td><code class="language-plaintext highlighter-rouge">FFFF FFFF</code></td>
      <td>Frequency LSB</td>
    </tr>
    <tr>
      <td>NR14</td>
      <td>0xFF14</td>
      <td><code class="language-plaintext highlighter-rouge">TL-- -FFF</code></td>
      <td>Trigger, length enable, frequency MSB</td>
    </tr>
  </tbody>
</table>

<p>Square 2:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Address</th>
      <th>Bits</th>
      <th>Function</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NR21</td>
      <td>0xFF16</td>
      <td><code class="language-plaintext highlighter-rouge">DDLL LLLL</code></td>
      <td>Wave duty, length load</td>
    </tr>
    <tr>
      <td>NR22</td>
      <td>0xFF17</td>
      <td><code class="language-plaintext highlighter-rouge">VVVV EPPP</code></td>
      <td>Init volume, envelope mode, envelope period</td>
    </tr>
    <tr>
      <td>NR23</td>
      <td>0xFF18</td>
      <td><code class="language-plaintext highlighter-rouge">FFFF FFFF</code></td>
      <td>Frequency LSB</td>
    </tr>
    <tr>
      <td>NR24</td>
      <td>0xFF19</td>
      <td><code class="language-plaintext highlighter-rouge">TL-- -FFF</code></td>
      <td>Trigger, length enable, frequency MSB</td>
    </tr>
  </tbody>
</table>

<p>Since Square 2 has a subset of the features of Square 1, the following only covers the registers of Square 1.
Except for the missing sweep features (sweep period, negate, shift), Square 2 works in the same way.</p>

<h2 id="square-channels">Square Channels</h2>

<p>The square wave channel allows you to play - big surprise - square waves.
This kind of wave truly defines the Game Boy’s chiptune sound.
And since this channel is so important, the Game Boy allows you to play two square waves at the same time!
While many synthesizers allow you to heavily modify the square wave, the Game Boy’s options are very limited in that regard.
Anyway, let’s first go through the technical definition of the square channel registers before heading to some examples.</p>

<h3 id="nr10-channel-sweep">NR10: Channel Sweep</h3>
<p>The sweep unit can be used to change the frequency of the square wave over time.
This is primarily used to model sound effects, such as hopping on a Goomba in Super Mario Land.</p>

<p><strong>[0:2] 3-bit step</strong>: Each iteration, a new frequency is calculated as: F(i+1) = F(i) ± F(i) / 2^step, where the sign is given by the direction bit.
This value is also written back to registers NR13 and NR14!<br />
<strong>[3:3] 1-bit direction</strong>: 0 → frequency increase, 1 → frequency decrease.<br />
<strong>[4:6] 3-bit period</strong>: The sweep is updated every (period * 7.8 milliseconds). A value of 0 disables the sweep.</p>
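<p>A single sweep iteration can be sketched as follows. The function names are hypothetical, and I assume the division is an integer division (a right shift in hardware), so the fractional part is dropped. The 7.8 milliseconds correspond to a 128 Hz tick.</p>

```javascript
// One sweep iteration on the 11-bit frequency value.
// Assumption: F / 2^step is an integer division, like a right shift.
function sweepStep(frequencyValue, step, decrease) {
  const delta = Math.floor(frequencyValue / Math.pow(2, step));
  return decrease ? frequencyValue - delta : frequencyValue + delta;
}

// Time between two sweep iterations in milliseconds (0 means sweep disabled).
function sweepIntervalMs(period) {
  return period * (1000 / 128);
}

console.log(sweepStep(1024, 2, false)); // increase: 1024 + 256
console.log(sweepStep(1024, 2, true));  // decrease: 1024 - 256
console.log(sweepIntervalMs(5));
```

<p>Note that a large step value yields a small change per iteration, because the current value is divided by 2^step.</p>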

<h3 id="nr11-channel-length-timer--duty">NR11: Channel Length Timer &amp; Duty</h3>
<p><strong>[0:5] 6-bit initial length timer</strong>:
These bits are write-only; reading them back returns 1s.
The 6 bits are interpreted as an unsigned number ranging from 0 to 63.
This number determines the length of the sound: length = (64-value)*(1/256) seconds.
So, the shortest sound is 1/256 second, while the longest is 1/4 second.
Note that the “64-value” part leads to some counterintuitive behavior.
When writing 0, you get the longest possible length, and when writing 63, you get the shortest possible length.
If you want indefinite sustain, clear bit 6 in register NR14.<br />
<strong>[6:7] 2-bit duty cycle</strong>: Determines the duty cycle of the square wave (00 → 12.5%, 01 → 25%, 10 → 50%, 11 → 75%).
Note that 25% and 75% give the same audible impression when the square wave is played without the other channels.</p>
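<p>The following sketch splits an NR11 byte into its two fields and applies the length formula from above. The helper names are my own and only serve as an illustration.</p>

```javascript
// Split an NR11 byte (layout DDLL LLLL) into its two fields.
function decodeNR11(byte) {
  return {
    duty: (byte >> 6) & 0x03,
    length: byte & 0x3F,
  };
}

// Sound length in seconds for a 6-bit length value (0 to 63).
function lengthSeconds(lengthValue) {
  return (64 - lengthValue) / 256;
}

const fields = decodeNR11(0b10000000);
console.log(fields.duty, fields.length, lengthSeconds(fields.length));
```

<p>The example byte selects duty setting 2 (50%) with a length value of 0, which is the longest possible length.</p>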

<h3 id="nr12-channel-volume--envelope">NR12: Channel Volume &amp; Envelope</h3>
<p><strong>[0:2] 3-bit envelope update period</strong>: The envelope ticks at 64 Hz, and the channel’s volume is updated every Nth tick, where N is the 3-bit value.
So, the fastest possible update rate is 64 Hz (N = 1), while the slowest is about 9.1 Hz (N = 7). A value of 0 disables the envelope.<br />
<strong>[3:3] 1-bit envelope mode</strong>: 0 → decrement volume, 1 → increment volume.<br />
<strong>[4:7] 4-bit initial volume</strong>: Starting volume representing values between 0-15. You can read these bits but the hardware does not update them!</p>
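<p>Decoding NR12 works in the same spirit. Again, this is just a sketch with hypothetical helper names. It follows the bit layout <code class="language-plaintext highlighter-rouge">VVVV EPPP</code> from the register table.</p>

```javascript
const ENVELOPE_TICK_HZ = 64;

// Split an NR12 byte (layout VVVV EPPP) into its three fields.
function decodeNR12(byte) {
  return {
    initialVolume: (byte >> 4) & 0x0F,
    increment: ((byte >> 3) & 0x01) === 1,
    period: byte & 0x07,
  };
}

// Volume updates per second; 0 when the envelope is disabled.
function envelopeUpdateRateHz(period) {
  return period === 0 ? 0 : ENVELOPE_TICK_HZ / period;
}

const envelope = decodeNR12(0b11110011);
console.log(envelope, envelopeUpdateRateHz(envelope.period));
```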

<h3 id="nr13-frequency-lsb">NR13: Frequency LSB</h3>
<p><strong>[0:7] 8-bit frequency lower bits</strong>: The frequency comprises 11 bits in total (see NR14).
The square channel uses a non-exposed, 11-bit counter that increases every time it is clocked.
After 2047 it overflows, generates a signal, and is reloaded with the value of NR13 and NR14.
The resulting frequency is: 131,072/(2048-frequency) Hz.
Hence, the lowest frequency is 64 Hz and the highest is 131,072 Hz, which is far beyond the range of human hearing.</p>
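<p>The frequency formula and its inverse are easy to express in code. The function names below are hypothetical. The inverse simply rounds to the nearest representable value and clamps it to the 11-bit range.</p>

```javascript
// Tone frequency in Hz for an 11-bit frequency value (0 to 2047).
function squareFrequencyHz(frequencyValue) {
  return 131072 / (2048 - frequencyValue);
}

// Closest 11-bit frequency value for a desired tone in Hz.
function squareFrequencyValue(targetHz) {
  const value = Math.round(2048 - 131072 / targetHz);
  return Math.min(2047, Math.max(0, value));
}

console.log(squareFrequencyHz(0), squareFrequencyHz(2047));
console.log(squareFrequencyValue(440));
```

<p>Because only 2048 values are available, most musical notes can only be approximated, and the error grows towards higher frequencies.</p>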

<h3 id="nr14-channel-control--frequency-msb">NR14: Channel Control &amp; Frequency MSB</h3>
<p><strong>[0:2] 3-bit frequency upper bits</strong>: Upper 3 bits of the 11-bit frequency value. See NR13.<br />
<strong>[6:6] 1-bit length enable</strong>:
0 → Regardless of the length data in NR11 sound can be produced consecutively.
1 → Sound is generated during the time period set by the length data in NR11.
After this period the sound 1 ON flag (bit 0 of NR52) is reset.<br />
<strong>[7:7] 1-bit trigger (write-only)</strong>: Writing 1 to this bit causes the following things:
The square channel is enabled. If the length timer has expired, it is reset. The envelope timer is reset.
The volume is set to the initial volume from NR12.
The period divider is set to the contents of NR13 and NR14.
On Square 1, the sweep unit is restarted as well.</p>
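<p>Going the other way, an 11-bit frequency value can be split into the two bytes that are written to NR13 and NR14. This is a minimal sketch with a hypothetical function name. It always sets the trigger bit, since that is what starts the sound.</p>

```javascript
// Build the NR13 and NR14 bytes for a trigger write.
function encodeFrequencyRegisters(frequencyValue, lengthEnable) {
  const nr13 = frequencyValue & 0xFF;          // lower 8 bits
  const upper = (frequencyValue >> 8) & 0x07;  // upper 3 bits
  const nr14 = 0x80 | (lengthEnable ? 0x40 : 0x00) | upper; // bit 7 = trigger
  return { nr13: nr13, nr14: nr14 };
}

const regs = encodeFrequencyRegisters(1923, false);
console.log(regs.nr13.toString(2).padStart(8, "0"));
console.log(regs.nr14.toString(2).padStart(8, "0"));
```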

<h2 id="square-simulator">Square Simulator</h2>

<p>Here’s a JavaScript-based square channel simulator.
In the table below, you can set up individual fields and listen to the sound they would create on the Game Boy.
Note that the simulator repeats every 2 seconds.
Predefined setups of some games are provided in the next section.</p>

<script type="text/javascript" src="/assets/gameboy_apu/square_simulator.js"></script>

<table id="square-simulator-table">
  <tr>
    <th>Register</th>
    <th>Setting</th>
  </tr>
  <tr>
    <td>NR10: Channel Sweep</td>
    <td>
      Step
      <input type="number" id="square-sweep-step" name="square-sweep-step" value="0" min="0" max="7" /><br />
      Direction
      <select name="square-sweep-direction" id="square-sweep-direction">
        <option value="0">Increase</option>
        <option value="1">Decrease</option>
      </select><br />
      Period
      <input type="number" id="square-sweep-period" name="square-sweep-period" value="0" min="0" max="7" />
    </td>
  </tr>
  <tr>
    <td>NR11: Length Timer</td>
    <td>
      Length
      <input type="number" id="square-length" name="square-length" value="42" min="0" max="63" /><br />
      Duty
      <select name="square-duty" id="square-duty">
        <option value="0.125">0 / 12.5%</option>
        <option value="0.25">1 / 25%</option>
        <option value="0.5">2 / 50%</option>
        <option value="0.75">3 / 75%</option>
      </select>
    </td>
  </tr>
  <tr>
    <td>NR12: Envelope</td>
    <td>
      Volume
      <input type="number" id="square-volume" name="square-volume" value="10" min="0" max="15" /><br />
      Envelope mode
      <select name="envelope-mode" id="envelope-mode">
        <option value="0">Decrement</option>
        <option value="1">Increment</option>
      </select><br />
      Envelope period
      <input type="number" id="envelope-period" name="envelope-period" value="0" min="0" max="7" />
    </td>
  </tr>
  <tr>
    <td>NR13/NR14: Square Frequency</td>
    <td>
      Frequency
      <input type="number" id="square-frequency" name="square-frequency" value="500" min="0" max="2047" />
    </td>
  </tr>
  <tr>
    <td>NR14: Channel Control</td>
    <td>
      Length enable:
      <input type="checkbox" id="length-enable" name="length-enable" /><br />
      <button type="button" id="play-square">Play/Stop</button>
    </td>
  </tr>
</table>

<p>Besides an implementation in JavaScript, I also wrote the same application for the Game Boy:</p>

<div style="text-align:center">
  <img id="lfsr-register" src="/assets/gameboy_apu/screenshot_gb_square_test.png" alt="Square Test Screenshot" width="40%" />
</div>
<p><br /></p>

<p>The source code and the corresponding ROM can be found in <a href="https://github.com/not-chciken/gb-square-test">this GitHub repository</a>.</p>

<h2 id="examples">Examples</h2>

<p>In the following, examples of the square channel in real-world software are provided.
Click on “Use this setup” to load the square simulator with the corresponding setup.</p>

<h3 id="boot">Boot</h3>

<p>One simple, yet iconic example of the square wave is the <a href="https://www.youtube.com/watch?v=jCfPojZ_xLw">Game Boy’s boot sound</a>.
From examining the boot code (see also my <a href="/tlmboy/2022/05/02/gameboy-boot.html">boot ROM post</a>), I found the two following register settings:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NR10: 0000 0000 -&gt; No sweep
NR11: 1000 0000 -&gt; Duty cycle 50%, length irrelevant due to NR14
NR12: 1111 0011 -&gt; Full volume, decrement volume, update envelope every 3 envelope ticks
NR13: 1000 0011 -&gt; Frequency: 1048.576 Hz (C6)
NR14: 1000 0111 -&gt; Trigger, sound indefinite length, period upper 3 bits
</code></pre></div></div>

<p><button type="button" id="game-boy-boot-c6-setup" onclick="GameBoyBootC6Setup()">Use this setup</button>
<script>
  function GameBoyBootC6Setup() {
    document.getElementById("square-sweep-period").value = 0;
    document.getElementById("square-sweep-direction").value = 0;
    document.getElementById("square-sweep-step").value = 0;
    document.getElementById("square-duty").value = 0.5;
    document.getElementById("square-length").value = 0;
    document.getElementById("square-volume").value = 15;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 3;
    document.getElementById("square-frequency").value = 1923;
    document.getElementById("length-enable").checked = false;
    document.getElementById("square-simulator-table").scrollIntoView();
  }
</script></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NR10: 0000 0000 -&gt; No sweep
NR11: 1000 0000 -&gt; Duty cycle 50%, length irrelevant due to NR14
NR12: 1111 0011 -&gt; Full volume, decrement volume, update envelope every 3 envelope ticks
NR13: 1100 0001 -&gt; Frequency: 2080.50 Hz (C7)
NR14: 1000 0111 -&gt; Trigger, sound indefinite length, period upper 3 bits
</code></pre></div></div>

<p><button type="button" id="game-boy-boot-c7-setup" onclick="GameBoyBootC7Setup()">Use this setup</button>
<script>
  function GameBoyBootC7Setup() {
    document.getElementById("square-sweep-period").value = 0;
    document.getElementById("square-sweep-direction").value = 0;
    document.getElementById("square-sweep-step").value = 0;
    document.getElementById("square-duty").value = 0.5;
    document.getElementById("square-length").value = 0;
    document.getElementById("square-volume").value = 15;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 3;
    document.getElementById("square-frequency").value = 1985;
    document.getElementById("length-enable").checked = false;
    document.getElementById("square-simulator-table").scrollIntoView();
  }
</script></p>

<p>So, a very simple square with 50% duty and no fancy sweep settings.
The sound starts at full volume and is then decremented every 3 envelope ticks.
If I did the math correctly, that should correspond to a length of ~0.7s until the volume reaches 0.
Note that the Game Boy plays two sounds to get this “bling bling.”
First a C6, which is only played for four frames (~66 milliseconds), and then a C7, which is played for the full duration of ~0.7 seconds.</p>
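<p>The ~0.7 seconds can be double-checked with a short sketch: a decrementing envelope lowers the volume by one every <code class="language-plaintext highlighter-rouge">period</code> ticks of the 64 Hz envelope clock. The function name is hypothetical.</p>

```javascript
// Seconds until a decrementing envelope reaches volume 0.
function envelopeFadeSeconds(initialVolume, period) {
  if (period === 0) {
    return Infinity; // envelope disabled, the volume never changes
  }
  return initialVolume * period / 64;
}

console.log(envelopeFadeSeconds(15, 3)); // boot sound: volume 15, period 3
```

<p>With 15 volume steps and 3 ticks per step, this gives 45 ticks at 64 Hz, which matches the estimate above.</p>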

<h3 id="super-mario-land">Super Mario Land</h3>

<p>In Super Mario Land, I found a few examples that make use of the square channel’s sweep setting to model sound effects.
Note that a sound effect may involve multiple subsequent settings.
In the following, only single settings are provided.</p>

<p>When jumping on a Goomba, you get the following setting:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NR10: 0101 0111 -&gt; step = 7, frequency increase, sweep period = 5
NR11: 1000 0000 -&gt; duty = 2 (50%), length = 0 (irrelevant due to NR14)
NR12: 0110 0010 -&gt; envelope period = 2, decrement volume, volume = 6
NR13: 0000 0110 -&gt; frequency = 1798
NR14: 1000 0111 -&gt; trigger, sound indefinite length
</code></pre></div></div>

<p><button type="button" id="super-mario-land-goomba-hop" onclick="SuperMarioLandGoombaHop()">Use this setup</button>
<script>
  function SuperMarioLandGoombaHop() {
    document.getElementById("square-sweep-period").value = 5;
    document.getElementById("square-sweep-direction").value = 0;
    document.getElementById("square-sweep-step").value = 7;
    document.getElementById("square-duty").value = 0.5;
    document.getElementById("square-length").value = 0;
    document.getElementById("square-volume").value = 6;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 2;
    document.getElementById("square-frequency").value = 1798;
    document.getElementById("length-enable").checked = false;
    document.getElementById("square-simulator-table").scrollIntoView();
  }
</script></p>
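<p>To get a feeling for what the sweep does to the Goomba setting, the first few iterations can be simulated. This is a sketch under two assumptions: the division is an integer division, and the channel is switched off once the value exceeds 2047 (which is how Pan Docs describes the overflow check). The function names are hypothetical.</p>

```javascript
// Frequency values produced by an increasing sweep, starting at startValue.
function sweepSequence(startValue, step, iterations) {
  const values = [startValue];
  let current = startValue;
  for (let i = 0; i < iterations; i++) {
    current = current + Math.floor(current / Math.pow(2, step));
    if (current > 2047) {
      break; // assumed overflow behavior: the hardware switches the channel off
    }
    values.push(current);
  }
  return values;
}

// Tone frequency in Hz for an 11-bit frequency value.
function toHz(frequencyValue) {
  return 131072 / (2048 - frequencyValue);
}

const goomba = sweepSequence(1798, 7, 4);
console.log(goomba.map(function (v) { return toHz(v).toFixed(1) + " Hz"; }).join(", "));
```

<p>With a sweep period of 5, one such iteration happens roughly every 39 milliseconds, so the pitch rises in small, quick steps.</p>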

<p>Parts of the sound when taking a mushroom are very similar to the Goomba sound:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NR10: 0010 0111 -&gt; sweep step = 7 , increase frequency, sweep period = 2
NR11: 1000 0000 -&gt; length = 0, duty = 2
NR12: 0110 0010 -&gt; envelope period = 2, decrement volume, volume = 6
NR13: 0111 0010 -&gt; frequency = 1650
NR14: 1000 0110 -&gt; trigger, sound indefinite length
</code></pre></div></div>

<p><button type="button" id="super-mario-shroom-consumed" onclick="SuperMarioLandShroomConsumed()">Use this setup</button>
<script>
  function SuperMarioLandShroomConsumed() {
    document.getElementById("square-sweep-period").value = 2;
    document.getElementById("square-sweep-direction").value = 0;
    document.getElementById("square-sweep-step").value = 7;
    document.getElementById("square-duty").value = 0.5;
    document.getElementById("square-length").value = 0;
    document.getElementById("square-volume").value = 6;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 2;
    document.getElementById("square-frequency").value = 1650;
    document.getElementById("length-enable").checked = false;
    document.getElementById("square-simulator-table").scrollIntoView();
  }
</script></p>]]></content><author><name></name></author><category term="TLMBoy" /><summary type="html"><![CDATA[In this part of my Game Boy simulator post series, I will cover the details of the square channel of the so-called Audio Processing Unit (APU). Unlike modern hardware, the Game Boy cannot (or is not supposed to) play back sample-based audio recordings. Rather, the APU has 4 different channels that act like instruments controlled by notes and dynamics. Two of these channels are square channels, which are the focus of this post. These channels generate square waves at given frequencies allowing you to play notes just like an instrument. The first square channel also has some extra frequency-shifting features, which can be used to create various kinds of sounds.]]></summary></entry><entry><title type="html">TLMBoy: The Audio Processing Unit (APU) - Noise Channel</title><link href="https://www.chciken.com/tlmboy/2025/03/24/gameboy-apu-noise.html" rel="alternate" type="text/html" title="TLMBoy: The Audio Processing Unit (APU) - Noise Channel" /><published>2025-03-24T12:22:44+00:00</published><updated>2025-03-24T12:22:44+00:00</updated><id>https://www.chciken.com/tlmboy/2025/03/24/gameboy-apu-noise</id><content type="html" xml:base="https://www.chciken.com/tlmboy/2025/03/24/gameboy-apu-noise.html"><![CDATA[<p>In this part of my Game Boy simulator post series, I will cover the details of the noise channel of the so-called Audio Processing Unit (APU).
Unlike modern hardware, the Game Boy cannot (or is not supposed to) play back sample-based audio recordings.
Rather, the APU has 4 different channels that act like instruments controlled by notes and dynamics.
One of these channels is the noise channel.
It’s quite versatile and can be used to resemble snares, hi-hats, explosions, or even waves washing up on shore.</p>

<p>When it comes to the Game Boy’s hardware, there’s already plenty of information available.
The following sources helped me a lot in writing this post and my Game Boy simulator:</p>

<p><a href="https://dn790000.ca.archive.org/0/items/GameBoyProgManVer1.1/GameBoyProgManVer1.1.pdf">Official Game Boy Programming Manual</a> <br />
<a href="https://gbdev.gg8.se/wiki/articles/Gameboy_sound_hardware">Game Boy Development Wiki</a> <br />
<a href="http://marc.rawer.de/Gameboy/Docs/GBCPUman.pdf">Game Boy CPU Manual</a> <br />
<a href="https://gbdev.io/pandocs/Audio.html">Pan Docs (my favorite source)</a></p>

<p>To avoid writing yet another technical reference like the sources above, this post follows a more example-driven approach.
So, rather than getting lost in every tiny obscure behavior, I first highlight the general principles of the noise channel, followed by some practical examples of how games made use of it.
I also provide a <a href="https://github.com/not-chciken/gb-noise-test">test ROM</a>, which can be used for testing in emulator/simulator development.</p>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#overview">Overview</a>
    <ul>
      <li><a href="#nr41-length-timer">NR41: Length Timer</a></li>
      <li><a href="#nr42-envelope">NR42: Envelope</a></li>
      <li><a href="#nr43-noise-shape">NR43: Noise Shape</a></li>
      <li><a href="#nr44-channel-control">NR44: Channel Control</a></li>
    </ul>
  </li>
  <li><a href="#noise-simulator">Noise Simulator</a></li>
  <li><a href="#examples">Examples</a>
    <ul>
      <li><a href="#tetris">Tetris</a></li>
      <li><a href="#super-mario-land">Super Mario Land</a></li>
      <li><a href="#bomberman-gb">Bomberman GB</a></li>
      <li><a href="#the-legend-of-zelda-links-awakening">The Legend of Zelda: Link's Awakening</a></li>
    </ul>
  </li>
  </ul>
</div>

<h2 id="overview">Overview</h2>

<p>Similar to other units of the Game Boy (DMA, Pixel Processing Unit, etc.), communication with the APU is facilitated by
<a href="https://en.wikipedia.org/wiki/Memory-mapped_I/O_and_port-mapped_I/O">memory-mapped I/O</a>.
That means if you want to tell the APU something, you just write a value into certain memory-mapped registers,
while information about the APU’s current status can be retrieved by reading these registers.
For the noise channel, the following 4 registers are relevant:</p>

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Address</th>
      <th>Bits</th>
      <th>Function</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>NR41</td>
      <td>0xFF20</td>
      <td><code class="language-plaintext highlighter-rouge">--LL LLLL</code></td>
      <td>Length load</td>
    </tr>
    <tr>
      <td>NR42</td>
      <td>0xFF21</td>
      <td><code class="language-plaintext highlighter-rouge">VVVV APPP</code></td>
      <td>Starting volume, envelope mode, period</td>
    </tr>
    <tr>
      <td>NR43</td>
      <td>0xFF22</td>
      <td><code class="language-plaintext highlighter-rouge">SSSS WDDD</code></td>
      <td>Clock shift, width mode of LFSR, divisor code</td>
    </tr>
    <tr>
      <td>NR44</td>
      <td>0xFF23</td>
      <td><code class="language-plaintext highlighter-rouge">TL-- ----</code></td>
      <td>Trigger, length enable</td>
    </tr>
  </tbody>
</table>
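
<p>The bit patterns in the table can be read mechanically. The following Python snippet is a minimal sketch of that idea. The helper name <code>extract_fields</code> is made up for illustration and is not part of any Game Boy tooling. It walks over a pattern such as <code>VVVV APPP</code> from the most significant bit to the least significant bit and collects the bits that belong to each letter:</p>

```python
def extract_fields(pattern, value):
    """Split an 8-bit register value into fields named by a pattern like 'VVVV APPP'."""
    letters = pattern.replace(" ", "")
    assert len(letters) == 8, "pattern must describe exactly 8 bits"
    fields = {}
    for position, letter in enumerate(letters):
        if letter == "-":                      # unused bit
            continue
        bit = (value >> (7 - position)) & 1    # the pattern is written MSB first
        fields[letter] = fields.get(letter, 0) * 2 + bit
    return fields


print(extract_fields("VVVV APPP", 0b1010_0001))  # {'V': 10, 'A': 0, 'P': 1}
print(extract_fields("TL-- ----", 0b1100_0000))  # {'T': 1, 'L': 1}
```

<p>The first call decodes an NR42 value into starting volume 10, decrement mode, and period 1. The second call shows an NR44 value with both the trigger and the length enable bit set.</p>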

<p>In the following, these registers are described in greater detail.</p>

<h3 id="nr41-length-timer">NR41: Length Timer</h3>

<p>This register is used to control the length of a sound.
In musical terms, this refers to a note’s <a href="https://en.wikipedia.org/wiki/Duration_(music)">duration</a>.
The register has only one field, controlling the length of a sound as follows:</p>

<p><strong>[0:5] 6-bit length load</strong>: Write-only on real hardware, so reading the register does not return the written value.
The 6 bits are interpreted as an unsigned number ranging from 0 to 63.
This number determines the length of the sound: length = (64-value)*(1/256) seconds.
So, the shortest sound is 1/256 second, while the longest is 1/4 second.
Note that the “64-value” part leads to some counterintuitive behavior.
When writing 0, you get the longest possible length, and when writing 63, you get the shortest possible length.
If you want indefinite sustain, clear bit 6 (length enable) in register NR44.</p>
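
<p>To make the counterintuitive “64-value” relation tangible, here is a small Python sketch of the length formula. The function name <code>noise_length_seconds</code> is only chosen for illustration:</p>

```python
def noise_length_seconds(nr41):
    """Return the sound length in seconds for a raw NR41 value."""
    length_load = nr41 & 0x3F          # only bits 0-5 are used
    return (64 - length_load) / 256    # one length step lasts 1/256 s


print(noise_length_seconds(0))    # 0.25 (longest sound)
print(noise_length_seconds(63))   # 0.00390625 (shortest sound, 1/256 s)
```

<p>Keep in mind that the result only matters if the length enable bit in NR44 is set.</p>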

<h3 id="nr42-envelope">NR42: Envelope</h3>

<p>The envelope register is used to control the volume <a href="https://en.wikipedia.org/wiki/Envelope_(music)">envelope</a> of a sound.
The volume envelope describes how the volume of a sound changes over time.
For example, a decreasing envelope can be used to mimic some kind of decay as with pianos or guitars.
The register comprises 3 fields:</p>

<p><strong>[0:2] 3-bit envelope period</strong>:
Interpreted as a 3-bit unsigned integer (0-7).
Determines how often the envelope is updated.
A decrement/increment happens every period*(1/64) seconds.
Writing 0 disables the envelope.<br />
<strong>[3:3] 1-bit envelope add mode</strong>:
0 → decrement volume, 1 → increment volume.<br />
<strong>[4:7] 4-bit starting volume of the envelope</strong>:
Represents a starting volume between 0-15.
The value is incremented/decremented depending on the envelope mode and period.
Important: You can read these bits but the hardware does not update them!</p>
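
<p>The interplay of the three fields can be sketched in a few lines of Python. This is a simplified model and not the code of my simulator: the function names are made up, and I assume that the volume simply stops changing once it reaches 0 or 15.</p>

```python
def decode_nr42(nr42):
    """Split a raw NR42 value into (starting volume, increment flag, period)."""
    volume = (nr42 >> 4) & 0x0F     # bits 4-7
    increment = bool(nr42 & 0x08)   # bit 3
    period = nr42 & 0x07            # bits 0-2
    return volume, increment, period


def envelope_steps(nr42):
    """Return (time in seconds, volume) pairs until the volume stops changing."""
    volume, increment, period = decode_nr42(nr42)
    steps = [(0.0, volume)]
    if period == 0:                 # a period of 0 disables the envelope
        return steps
    delta = 1 if increment else -1
    tick = 0
    while volume + delta in range(16):   # assumption: volume saturates at 0 and 15
        volume += delta
        tick += 1
        steps.append((tick * period / 64, volume))
    return steps


# Starting volume 10, decrement, period 1: show the first three steps
for time_s, volume in envelope_steps(0b1010_0001)[:3]:
    print(f"{time_s:.4f} s -> volume {volume}")
```

<p>The returned list describes the volume staircase over time, which is exactly the shape you hear as a decay or a fade-in.</p>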

<h3 id="nr43-noise-shape">NR43: Noise Shape</h3>

<p>This register is used to control the noise’s shape/color.
Depending on the settings you can create everything from white noise to high-pitched metallic sounds.
To create pseudo-random numbers for the noise channel, the Game Boy uses a <a href="https://en.wikipedia.org/wiki/Linear-feedback_shift_register">Linear-Feedback Shift Register (LFSR)</a>:</p>

<div style="text-align:center">
  <img id="lfsr-register" src="/assets/gameboy_apu/lfsr_register.svg" alt="LFSR Register" width="70%" />
</div>
<p><br /></p>

<p>Each time the LFSR is ticked, it performs the following three steps:</p>
<ol>
  <li>The two lowest bits (bit 0 and bit 1) are XORed and the result is negated (i.e., an XNOR).</li>
  <li>The negated result is put into the now-empty high bit (either bit 15 or bit 7, depending on the mode).</li>
  <li>All bits are shifted right by one.</li>
</ol>

<p>The frequency at which new values are generated is: 4.194304 MHz / (divider &lt;&lt; shift),
where divider is in {8, 16, 32, 48, 64, 80, 96, 112} and shift is between 0 and 13.
Hence, the highest frequency is 524,288 Hz and the lowest frequency is roughly 4.57 Hz.
Divider, width mode, and clock shift are derived from the following fields:</p>

<p><strong>[0:2] 3-bit divider</strong>: Interpreted as a 3-bit unsigned integer (0-7) that selects the divider: code 0 → 8, any other code n → 16*n. See formula.<br />
<strong>[3:3] 1-bit width mode</strong>: Width of the LFSR. 0 → 15 bit, 1 → 7 bit.<br />
<strong>[4:7] 4-bit clock shift</strong>: Interpreted as a 4-bit unsigned integer (0-15). See formula.
According to the programming manual, the values 14 and 15 are illegal. Interestingly, this constraint is not mentioned in all of the documentation you can find online.</p>
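
<p>The formula and the three fields can be combined into a short Python sketch. The function names are my own choice for this example. The mapping from the 3-bit code to the divider (code 0 gives 8, every other code gives 16 times the code) follows from the set of dividers listed above.</p>

```python
CPU_CLOCK_HZ = 4_194_304   # 4.194304 MHz


def decode_nr43(nr43):
    """Split a raw NR43 value into (clock shift, LFSR width, divider)."""
    shift = (nr43 >> 4) & 0x0F          # bits 4-7
    width = 7 if nr43 & 0x08 else 15    # bit 3
    code = nr43 & 0x07                  # bits 0-2
    divider = 8 if code == 0 else 16 * code
    return shift, width, divider


def lfsr_frequency_hz(nr43):
    """Rate at which the LFSR produces new values."""
    shift, _, divider = decode_nr43(nr43)
    return CPU_CLOCK_HZ / (divider * 2 ** shift)   # same as divider shifted left by shift


print(lfsr_frequency_hz(0b0000_0000))            # 524288.0 (highest rate)
print(round(lfsr_frequency_hz(0b0101_0111), 1))  # 1170.3
print(round(lfsr_frequency_hz(0b1101_0111), 2))  # 4.57 (lowest legal rate)
```

<p>Note that the sketch does not reject the illegal shift values 14 and 15. It just evaluates the formula for them.</p>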

<p>The LFSR has some interesting properties that I want to highlight in greater detail.
For instance, in 7-bit mode, the generated pattern repeats every 127 cycles.
In case you want to see it yourself, use this Python script:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/python
</span>
<span class="n">IND</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">NUM_SAMPLES</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">reg</span> <span class="o">=</span> <span class="mi">0</span>

<span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Cycle</span><span class="se">\t</span><span class="s">LFSR</span><span class="se">\t</span><span class="s">Waveform Output</span><span class="sh">"</span><span class="p">)</span>

<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">NUM_SAMPLES</span><span class="p">):</span>
  <span class="n">val</span> <span class="o">=</span> <span class="ow">not</span> <span class="p">((</span><span class="n">reg</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">^</span> <span class="p">((</span><span class="n">reg</span> <span class="o">&gt;&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mi">1</span><span class="p">))</span>
  <span class="n">reg</span> <span class="o">|=</span> <span class="p">(</span><span class="n">val</span> <span class="o">&lt;&lt;</span> <span class="n">IND</span><span class="p">)</span>
  <span class="n">reg</span> <span class="o">&gt;&gt;=</span> <span class="mi">1</span>
  <span class="n">output</span> <span class="o">=</span> <span class="n">reg</span> <span class="o">&amp;</span> <span class="mi">1</span>
  <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="se">\t</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">{0:07b}</span><span class="se">\t</span><span class="sh">"</span><span class="p">.</span><span class="nf">format</span><span class="p">(</span><span class="n">reg</span><span class="p">),</span> <span class="n">output</span><span class="p">)</span>
</code></pre></div></div>

<p>It generates the following table:</p>

<table>
  <thead>
    <tr>
      <th>Cycle</th>
      <th>Value</th>
      <th>Waveform Output</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>01000000</td>
      <td>0</td>
    </tr>
    <tr>
      <td>1</td>
      <td>01100000</td>
      <td>0</td>
    </tr>
    <tr>
      <td>2</td>
      <td>01110000</td>
      <td>0</td>
    </tr>
    <tr>
      <td>3</td>
      <td>01111000</td>
      <td>0</td>
    </tr>
    <tr>
      <td>4</td>
      <td>01111100</td>
      <td>0</td>
    </tr>
    <tr>
      <td>5</td>
      <td>01111110</td>
      <td>0</td>
    </tr>
    <tr>
      <td>6</td>
      <td>00111111</td>
      <td>1</td>
    </tr>
    <tr>
      <td>7</td>
      <td>01011111</td>
      <td>1</td>
    </tr>
    <tr>
      <td>…</td>
      <td>…</td>
      <td>…</td>
    </tr>
    <tr>
      <td>127</td>
      <td>01000000</td>
      <td>0</td>
    </tr>
  </tbody>
</table>

<p>Interestingly, the only state a 7-bit LFSR never reaches is the one with all bits set to 1.
This is for a good reason, as such a state would lock the LFSR permanently, resulting in only 1s being generated for the output.
In practice, it’s actually possible to end up in such a situation when switching from 15-bit mode to 7-bit mode at the right moment.</p>
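
<p>Both properties, the 127-step cycle and the lock-up, can be checked with a few lines of Python. The sketch below reuses the update rule from the script above, the helper names are only illustrative:</p>

```python
def lfsr_step(reg, width=7):
    """Advance the LFSR by one tick using the three steps described above."""
    feedback = 1 - ((reg ^ (reg >> 1)) & 1)   # XOR of bits 0 and 1, then negated
    reg |= feedback * 2 ** width              # put the result into the empty high bit
    return reg >> 1                           # shift all bits right by one


def cycle_length(start, width=7):
    """Count the ticks until the LFSR returns to its starting state."""
    reg, ticks = lfsr_step(start, width), 1
    while reg != start:
        reg, ticks = lfsr_step(reg, width), ticks + 1
    return ticks


print(cycle_length(0b0000000))   # 127
print(cycle_length(0b1111111))   # 1, the LFSR is stuck
```

<p>Starting from all zeros, the register needs 127 ticks to come back. Starting from all ones, it maps onto itself immediately, which is the locked state.</p>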

<h3 id="nr44-channel-control">NR44: Channel Control</h3>
<p>The channel control register only comprises two 1-bit fields:</p>

<p><strong>[6:6] 1-bit length enable</strong>:
0 → Regardless of the length data in NR41, sound is produced continuously.
1 → Sound is generated during the time period set by the length data in NR41.
After this period the sound 4 ON flag (bit 3 of NR52) is reset.<br />
<strong>[7:7] 1-bit trigger (write-only)</strong>: Writing 1 to this bit causes the following things:
The noise channel is enabled. If the length timer has expired, it is reset.
Envelope timer is reset.
Volume is set to contents of NR42 initial volume.
LFSR bits are reset.</p>
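
<p>As a rough sketch, a write to NR44 could be modeled like this in Python. The class <code>NoiseChannel</code> and its attributes are hypothetical and heavily simplified, so this is not the actual implementation of my simulator. In particular, reloading the length timer with 64 steps, restarting the envelope timer with the period from NR42, and resetting the LFSR to 0 are assumptions of this example.</p>

```python
from dataclasses import dataclass


@dataclass
class NoiseChannel:
    """Hypothetical, simplified channel state (not the actual TLMBoy class)."""
    nr42: int = 0              # last value written to NR42
    enabled: bool = False
    length_enable: bool = False
    length_timer: int = 0      # remaining length steps of 1/256 s each
    envelope_timer: int = 0
    volume: int = 0
    lfsr: int = 0

    def write_nr44(self, value):
        self.length_enable = bool(value & 0x40)    # bit 6
        if not value & 0x80:                       # bit 7 clear: no trigger requested
            return
        self.enabled = True
        if self.length_timer == 0:                 # expired: assumed reload to the full 64 steps
            self.length_timer = 64
        self.envelope_timer = self.nr42 & 0x07     # assumed: restart with the period from NR42
        self.volume = self.nr42 >> 4               # starting volume from NR42
        self.lfsr = 0                              # assumed reset value, as in the script above


channel = NoiseChannel(nr42=0b1111_0100, lfsr=0x1234)
channel.write_nr44(0b1000_0000)
print(channel.enabled, channel.volume, channel.length_timer, channel.lfsr)  # True 15 64 0
```

<p>Writing a value without the trigger bit only updates the length enable flag and leaves the rest of the channel untouched.</p>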

<h2 id="noise-simulator">Noise Simulator</h2>

<p>After all these technical details, it’s time for some practical evaluation.
To get a better understanding of how the individual registers and their fields play together in practice, I wrote a JavaScript-based noise simulator.
In the table below, you can set up the individual fields and listen to the sound they would create on the Game Boy. Note that the simulator repeats every 2 seconds.
Predefined setups of some games are provided in the next section.</p>

<script type="text/javascript" src="/assets/gameboy_apu/noise_simulator.js"></script>

<table id="noise-simulator-table">
  <tr>
    <th>Register</th>
    <th>Setting</th>
  </tr>
  <tr>
    <td>NR41: Length Timer</td>
    <td>
      Length
      <input type="number" id="noise-length" name="noise-length" value="42" min="0" max="63" />
    </td>
  </tr>
  <tr>
    <td>NR42: Envelope</td>
    <td>
      Volume
      <input type="number" id="noise-volume" name="noise-volume" value="10" min="0" max="15" /><br />
      Envelope mode
      <select name="envelope-mode" id="envelope-mode">
        <option value="0">Decrement</option>
        <option value="1">Increment</option>
      </select><br />
      Envelope period
      <select name="envelope-period" id="envelope-period">
        <option value="0">0</option>
        <option value="1">1</option>
        <option value="2">2</option>
        <option value="3">3</option>
        <option value="4">4</option>
        <option value="5">5</option>
        <option value="6">6</option>
        <option value="7">7</option>
      </select>
    </td>
  </tr>
  <tr>
    <td>NR43: Noise Shape</td>
    <td>
      Divisor
      <select name="noise-divisor" id="noise-divisor">
        <option value="8">8</option>
        <option value="16">16</option>
        <option value="32">32</option>
        <option value="48">48</option>
        <option value="64">64</option>
        <option value="80">80</option>
        <option value="96">96</option>
        <option value="112">112</option>
      </select><br />
      Shift
      <select name="noise-shift" id="noise-shift">
        <option value="0">0</option>
        <option value="1">1</option>
        <option value="2">2</option>
        <option value="3">3</option>
        <option value="4">4</option>
        <option value="5">5</option>
        <option value="6">6</option>
        <option value="7">7</option>
        <option value="8">8</option>
        <option value="9">9</option>
        <option value="10">10</option>
        <option value="11">11</option>
        <option value="12">12</option>
        <option value="13">13</option>
      </select><br />
      LFSR Width
      <select name="noise-lfsr-width" id="noise-lfsr-width">
        <option value="7">7</option>
        <option value="15">15</option>
      </select><br />
      Resulting LFSR Sample Rate:
      <span id="lfsr-sample-rate">
        524288 Hz
      </span>
    </td>
  </tr>
  <tr>
    <td>NR44: Channel Control</td>
    <td>
      Length enable:
      <input type="checkbox" id="length-enable" name="length-enable" /><br />
      <button type="button" id="play-noise">Play/Stop</button>
    </td>
  </tr>
</table>

<p>Since only having this in JavaScript is lame, I also wrote the same application for the Game Boy:</p>

<div style="text-align:center">
  <img id="lfsr-register" src="/assets/gameboy_apu/screenshot_gb_noise_test.png" alt="Noise Test Screenshot" width="40%" />
</div>
<p><br /></p>

<p>The source code and the corresponding ROM can be found in <a href="https://github.com/not-chciken/gb-noise-test">this GitHub repository</a>.</p>

<h2 id="examples">Examples</h2>

<p>To see the noise channel in action, I tried to find out how different games make use of this channel.
Here’s what I found.</p>

<h3 id="tetris">Tetris</h3>

<p>In the <a href="https://www.youtube.com/watch?v=6MsXiUnHHqU&amp;list=PLKkxnBwFOJGIu3XSOHYW4r9dFyaoC9zNW&amp;index=1">title theme of Tetris</a>, I found the following two settings:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tetris Hi-hat:
NR41: 0011 1010 -&gt; 58 -&gt; length = 6/256 s (~23 ms)
NR42: 1010 0001 -&gt; Envelope: Decrement every 1/64s (~16ms), starting from volume 10
NR43: 0000 0000 -&gt; 15-bit LFSR, divisor 8, shift 0 (524,288 Hz)
NR44: 1100 0000 -&gt; Sound according to length
</code></pre></div></div>
<p><button type="button" id="tetris-hihat-button" onclick="TetrisHihatSetup()">Use this setup</button>
<script>
  function TetrisHihatSetup() {
    document.getElementById("noise-length").value = 58;
    document.getElementById("noise-volume").value = 10;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 1;
    document.getElementById("noise-divisor").value = 8;
    document.getElementById("noise-shift").value = 0;
    document.getElementById("noise-lfsr-width").value = 15;
    document.getElementById("length-enable").checked = true;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Tetris Snare:
NR41: 0010 1001 -&gt; 41 -&gt; length = 23/256 s (~90 ms)
NR42: 1011 0001 -&gt; Envelope: Decrement every 1/64s (~16ms), starting from volume 11
NR43: 0000 0001 -&gt; 15-bit LFSR, divisor 16, shift 0 (262,144 Hz)
NR44: 1100 0000 -&gt; Sound according to length
</code></pre></div></div>
<p><button type="button" id="tetris-snare-button" onclick="TetrisSnareSetup()">Use this setup</button>
<script>
  function TetrisSnareSetup() {
    document.getElementById("noise-length").value = 41;
    document.getElementById("noise-volume").value = 11;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 1;
    document.getElementById("noise-divisor").value = 16;
    document.getElementById("noise-shift").value = 0;
    document.getElementById("noise-lfsr-width").value = 15;
    document.getElementById("length-enable").checked = true;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>

<p>The first setting is used for a hi-hat-like sound, while the other one is used for a snare.
As you can see and hear, they are not too different.
The snare has a slightly higher starting volume, a greater length, and uses a higher divisor.
All of that doesn’t really change the color of the sound but makes the snare more dominant compared to the hi-hat.</p>

<h3 id="super-mario-land">Super Mario Land</h3>

<p>When defeating a Bombshell Koopa in Super Mario Land, the noise channel is used to create a sound of an explosion.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bombshell Koopa Explosion:
NR41: 0000 0000 -&gt; 0 -&gt; length = 64/256 s (250 ms)
NR42: 1111 0100 -&gt; Envelope: Decrement every 4/64s (~62.5ms), starting from volume 15
NR43: 0101 0111 -&gt; 15-bit LFSR, divisor 112, shift 5 (1170.3 Hz)
NR44: 1000 0000 -&gt; Sound not according to length
</code></pre></div></div>

<p><button type="button" id="koopa-explosion-button" onclick="KoopaExplosionSetup()">Use this setup</button>
<script>
  function KoopaExplosionSetup() {
    document.getElementById("noise-length").value = 0;
    document.getElementById("noise-volume").value = 15;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 4;
    document.getElementById("noise-divisor").value = 112;
    document.getElementById("noise-shift").value = 5;
    document.getElementById("noise-lfsr-width").value = 15;
    document.getElementById("length-enable").checked = false;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>

<p>In comparison to the settings in Tetris, Super Mario Land does not use the length register, as this would limit the sound to 250 ms at most.
By relying only on the envelope, it takes roughly 1 second (15 decrements of 62.5 ms each) for the sound to go from volume 15 to volume 0.
With an LFSR sample rate of 1170.29 Hz, the noise is also very chiptune-like.</p>

<p>So far, all examples used a 15-bit LFSR. This is not very surprising, as a 7-bit LFSR provides very little randomness.
In fact, the 127-cycle repetition gives it a metallic high-pitched sound for higher LFSR sample rates, which is very far away from being white noise.
This can be heard in parts of the sound that are played when defeating a Fighter Fly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fighter Fly Defeated:
NR41: 0000 0000 -&gt; 0 -&gt; length = 64/256 s (250 ms)
NR42: 0010 1100 -&gt; Envelope: Increase every 4/64s (~62.5 ms), starting from volume 2
NR43: 0001 1110 -&gt; 7-bit LFSR, divisor 96, shift 1  (21,845.3 Hz)
NR44: 1000 0000 -&gt; Sound not according to length
</code></pre></div></div>

<p><button type="button" id="fighter-fly-button" onclick="FighterFlySetup()">Use this setup</button>
<script>
  function FighterFlySetup() {
    document.getElementById("noise-length").value = 0;
    document.getElementById("noise-volume").value = 2;
    document.getElementById("envelope-mode").value = 1;
    document.getElementById("envelope-period").value = 4;
    document.getElementById("noise-divisor").value = 96;
    document.getElementById("noise-shift").value = 1;
    document.getElementById("noise-lfsr-width").value = 7;
    document.getElementById("length-enable").checked = false;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>

<p>Note that this is only a part of the sound when defeating a Fighter Fly. Some of the registers are altered after a short period of time to make the sound more insect-like.</p>

<h3 id="bomberman-gb">Bomberman GB</h3>

<p>For lower LFSR sample rates, the sound of a 7-bit LFSR gets “noisier” and somewhat approximates the sound of a 15-bit LFSR.
Nevertheless, the 127-cycle repetition leads to some kind of reverb effect.
The exploding bombs in Bomberman GB are a good example of an explosion sound with a touch of reverb.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Bomb explosion:
NR41: 1111 0111 -&gt; 55 -&gt; length = 9/256 s (~35 ms)
NR42: 1110 0101 -&gt; Envelope: Decrease every 5/64s (~78.1 ms), starting from volume 14
NR43: 0110 1011 -&gt; 7-bit LFSR, divisor 48, shift 6  (1,365.3 Hz)
NR44: 1000 0000 -&gt; Sound not according to length
</code></pre></div></div>
<p><button type="button" id="bomberman-bomb-button" onclick="BombermanBombSetup()">Use this setup</button>
<script>
  function BombermanBombSetup() {
    document.getElementById("noise-length").value = 55;
    document.getElementById("noise-volume").value = 14;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 5;
    document.getElementById("noise-divisor").value = 48;
    document.getElementById("noise-shift").value = 6;
    document.getElementById("noise-lfsr-width").value = 7;
    document.getElementById("length-enable").checked = false;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>
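
<p>The difference between the metallic Fighter Fly sound and the reverb-like Bomberman explosion becomes clearer when you compute how often the 127-step pattern of the 7-bit LFSR repeats per second. The following Python sketch does this for both settings. Treating the repetition rate as the audible pitch or flutter is a simplification on my part, and the function name is only chosen for this example.</p>

```python
CPU_CLOCK_HZ = 4_194_304   # 4.194304 MHz
PATTERN_LENGTH = 127       # cycle length of the 7-bit LFSR


def repetition_rate_hz(divider, shift):
    """How often the 7-bit LFSR pattern repeats per second."""
    sample_rate = CPU_CLOCK_HZ / (divider * 2 ** shift)
    return sample_rate / PATTERN_LENGTH


print(round(repetition_rate_hz(96, 1), 1))   # Fighter Fly: 172.0
print(round(repetition_rate_hz(48, 6), 2))   # Bomberman bomb: 10.75
```

<p>With divisor 96 and shift 1, the pattern repeats about 172 times per second, which lies well within the audible range and explains the tonal, metallic character. With divisor 48 and shift 6, it repeats only about 11 times per second, so the repetition is perceived as a rhythmic flutter rather than as a pitch.</p>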

<h3 id="the-legend-of-zelda-links-awakening">The Legend Of Zelda: Link’s Awakening</h3>

<p>Another example showcasing the great versatility of the noise channel can be found in the <a href="https://www.youtube.com/watch?v=GAMdutjuIMA">intro of The Legend Of Zelda: Link’s Awakening</a>.
Here, a fading white noise sound is used to mimic the waves washing up on shore.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Wave fading in:
NR41: 0000 0000 -&gt; 0 -&gt; length = 64/256 s (250 ms)
NR42: 0000 1111 -&gt; Envelope: Increase every 7/64s (~109.4 ms), starting from volume 0
NR43: 0011 0000 -&gt; 15-bit LFSR, divisor 8, shift 3  (65,536 Hz)
NR44: 1000 0000 -&gt; Sound not according to length
</code></pre></div></div>
<p><button type="button" id="wave-fading-in-button" onclick="WaveFadingInSetup()">Use this setup</button>
<script>
  function WaveFadingInSetup() {
    document.getElementById("noise-length").value = 0;
    document.getElementById("noise-volume").value = 0;
    document.getElementById("envelope-mode").value = 1;
    document.getElementById("envelope-period").value = 7;
    document.getElementById("noise-divisor").value = 8;
    document.getElementById("noise-shift").value = 3;
    document.getElementById("noise-lfsr-width").value = 15;
    document.getElementById("length-enable").checked = false;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Wave fading out:
NR41: 0000 0000 -&gt; 0 -&gt; length = 64/256 s (250 ms)
NR42: 0110 0111 -&gt; Envelope: Decrease every 7/64s (~109.4 ms), starting from volume 6
NR43: 0000 0011 -&gt; 15-bit LFSR, divisor 48, shift 0  (87,381.3 Hz)
NR44: 1000 0000 -&gt; Sound not according to length
</code></pre></div></div>
<p><button type="button" id="wave-fading-out-button" onclick="WaveFadingOutSetup()">Use this setup</button>
<script>
  function WaveFadingOutSetup() {
    document.getElementById("noise-length").value = 0;
    document.getElementById("noise-volume").value = 6;
    document.getElementById("envelope-mode").value = 0;
    document.getElementById("envelope-period").value = 7;
    document.getElementById("noise-divisor").value = 48;
    document.getElementById("noise-shift").value = 0;
    document.getElementById("noise-lfsr-width").value = 15;
    document.getElementById("length-enable").checked = false;
    document.getElementById("noise-simulator-table").scrollIntoView();
  }
</script></p>]]></content><author><name></name></author><category term="TLMBoy" /><summary type="html"><![CDATA[In this part of my Game Boy simulator post series, I will cover the details of the noise channel of the so-called Audio Processing Unit (APU). Unlike modern hardware, the Game Boy cannot (or is not supposed to) play back sample-based audio recordings. Rather, the APU has 4 different channels that act like instruments controlled by notes and dynamics. One of these channels is the noise channel. It’s quite versatile and can be used to resemble snares, hi-hats, explosions, or even waves washing up on shore.]]></summary></entry><entry><title type="html">The Jungle Book (Game Boy) : A Complete Guide</title><link href="https://www.chciken.com/gaming/2024/10/27/jungle-book-gameboy.html" rel="alternate" type="text/html" title="The Jungle Book (Game Boy) : A Complete Guide" /><published>2024-10-27T14:22:44+00:00</published><updated>2024-10-27T14:22:44+00:00</updated><id>https://www.chciken.com/gaming/2024/10/27/jungle-book-gameboy</id><content type="html" xml:base="https://www.chciken.com/gaming/2024/10/27/jungle-book-gameboy.html"><![CDATA[<p>Now to a project into which I invested way too much time: A Complete Guide for the Game Boy’s “The Jungle Book” game from 1994.
By complete I mean two things.</p>

<p>First, a very detailed guide on how to play through the game. To the best of my knowledge, there is no such guide available on the internet.
In fact, there seems to be only very little information about the game at all.</p>

<p>Second, my journey of reverse engineering the game.
In order to understand every bit of the game, I reverse engineered the game and created a disassembly.
Many of the results, such as the level maps, were used for the walkthrough guide.
The <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/">GitHub repository</a> is available as open source.</p>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }

  .video-border {
    border: 1px solid #aaa;
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#the-game">The Game</a></li>
  <li><a href="#history">History</a></li>
  <li><a href="#basic-game-facts">Basic Game Facts</a>
      <ul>
      <li><a href="#gameplay">Gameplay</a></li>
      <li><a href="#controls">Controls</a></li>
      <li><a href="#items">Items</a></li>
      <li><a href="#enemies">Enemies</a></li>
      <li><a href="#cheats">Cheats</a></li>
    </ul>
  </li>
  <li><a href="#the-levels">The Levels</a>
    <ul>
      <li><a href="#level-1-jungle-by-day">Level 1 (JUNGLE BY DAY)</a></li>
      <li><a href="#level-2-the-great-tree">Level 2 (THE GREAT TREE)</a></li>
      <li><a href="#level-3-dawn-patrol">Level 3 (DAWN PATROL)</a></li>
      <li><a href="#level-4-by-the-river">Level 4 (BY THE RIVER)</a></li>
      <li><a href="#level-5-in-the-river">Level 5 (IN THE RIVER)</a></li>
      <li><a href="#level-6-tree-village">Level 6 (TREE VILLAGE)</a></li>
      <li><a href="#level-7-ancient-ruins">Level 7 (ANCIENT RUINS)</a></li>
      <li><a href="#level-8-falling-ruins">Level 8 (FALLING RUINS)</a></li>
      <li><a href="#level-9-jungle-by-night">Level 9 (JUNGLE BY NIGHT)</a></li>
      <li><a href="#level-10-the-wastelands">Level 10 (THE WASTELANDS)</a></li>
      <li><a href="#level-11-bonus">Level 11 (Bonus)</a></li>
      <li><a href="#level-12-transition">Level 12 (Transition)</a></li>
    </ul>
  </li>
  <li><a href="#putting-it-all-together">Putting It All Together</a></li>
  <li><a href="#the-reverse-engineering-process">The Reverse Engineering Process</a></li>
  <li><a href="#bugs-and-glitches">Bugs And Glitches</a>
    <ul>
      <li><a href="#weapon-damage-glitch">Weapon Damage Glitch</a></li>
      <li><a href="#teleport-glitch">Teleport Glitch</a></li>
      <li><a href="#enemy-point-glitch">Enemy Point Glitch</a></li>
    </ul>
  </li>
  <li><a href="#conclusion">Conclusion</a></li>
  </ul>
</div>

<h2 id="the-game">The Game</h2>
<p>The game “The Jungle Book” was actually released for multiple platforms in 1994 with the Game Boy version being the technically most limited.
It is a very classic platformer that doesn’t really have much to offer from a gameplay perspective.
There’s just running, jumping, and defeating enemies - everything underpinned with rather sluggish controls and awkward hit boxes.
The graphics are quite neat for a Game Boy game, but the frequently dropping frame rate is really stressful for the eye.
So, overall a pretty mediocre 90s Game Boy game.
Although the UK-based video game magazine <a href="https://retrocdn.net/images/d/dd/CVG_UK_150.pdf">“Computer and Video Games” Issue 150</a> from May 1994 gave it a solid 87/100 score (see page 91):</p>
<div style="text-align:center">
    <img id="cvg-150" src="/assets/jungle_book/cvg150.png" alt="Computer and Video Games Issue 150" width="50%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/cvg150.png', '_blank');" />
</div>
<p><br />
I guess the publishers/developers bet everything on the Disney license - a typical pathology of franchise games.
The only outstanding thing was the game’s insane difficulty.
It’s still etched into my mind how I was never able to get past Level 2.
In the 90s I was still very young and far from the peak of my gaming skills, so almost 25 years later (and exactly 30 years after the game’s release!) I decided to update my conclusion.
Even with more experience and skill I have to admit: the game is hard.
While the levels become manageable with some training, the lack of save states is really annoying.
In order to complete the game, you need to finish the 10 levels without having the chance to save even once.
With the former speedrun record in practice mode already requiring <a href="https://www.speedrun.com/the_jungle_book_gb">27 minutes</a>, I really can’t imagine how little children are supposed to beat the game.
But if you really have the perseverance and motivation to beat the game, having a plan helps a lot.
Because most of the time, your objective is to collect gems, which are sprinkled across the map.
If you know where these gems are, the game becomes much easier.
And this is why I wrote this guide.
So, if you also want to overcome your childhood trauma of an unbeaten Jungle Book Game Boy game, you have come to the right place.</p>

<p>Below, I first list general details about the game, followed by a per-level guide.
For every level, I created three different maps from the ROM.
First, the plain map extracted from the ROM.
Second, an annotated map with gem and enemy locations.
Third, a map for a practice-mode speedrun.</p>

<h2 id="history">History</h2>
<p>Since mere gameplay facts about a 30-year-old Game Boy game are perhaps a little too dull,
I added a chapter about the hopefully exciting story of the game’s development and some other curious facts.
As with most things, a fair starting point might be the <a href="https://en.wikipedia.org/wiki/The_Jungle_Book_(video_game)">Wikipedia article of the game</a>.</p>

<p>Wikipedia mentions that Virgin Games (later renamed to Virgin Interactive Entertainment) started the development of the Genesis/Mega Drive version in 1993, and the game was intended to be delivered in the same year.
It is not clear from the Wikipedia article why Virgin Games developed this particular game.
However, if you look at the <a href="https://en.wikipedia.org/wiki/Virgin_Interactive_Entertainment">Virgin Games release list</a>,
you quickly get the impression that Virgin Games worked through one franchise after the next.
So far, they had published games including <a href="https://en.wikipedia.org/wiki/Disney%27s_Aladdin_(Sega_Genesis_video_game)">Aladdin</a>,
<a href="https://en.wikipedia.org/wiki/The_Terminator_(Sega_video_game)">The Terminator</a>,
<a href="https://en.wikipedia.org/wiki/Dune_(video_game)">Dune</a>, <a href="https://en.wikipedia.org/wiki/Alien_3_(video_game)">Alien</a>, and <a href="https://en.wikipedia.org/wiki/Global_Gladiators">McDonald’s</a>.
I guess “The Jungle Book” game was just next on the list.
However, the development lead <a href="https://en.wikipedia.org/wiki/David_Perry_(game_developer)">David Perry</a>, along with most of his team, left Virgin Games during the game’s development.
Subsequently, the Genesis version was completed by <a href="https://en.wikipedia.org/wiki/Eurocom">Eurocom Entertainment Software</a>.
They probably completed the Game Boy version too, as the starting screen and credits screen mention Eurocom as the developer:</p>

<div style="text-align:center">
    <img id="jb-start-screen" src="/assets/jungle_book/jb_starting_screen.png" alt="Jungle Book Starting Screen" width="30%" />
    <img id="jb-credit-screen" src="/assets/jungle_book/jb_credit_screen.png" alt="Jungle Book Credits Screen" width="30%" />
</div>
<p><br /></p>

<p>After its completion, the game could finally be found in stores in 1994, where it was released for Genesis, Master System, SNES, NES, and Game Boy.
Interestingly, the packaging, the cartridge number (DMG-J7-USA, DMG-J7-USA-1, DMG-J7-EUR, …) as well as the instruction booklet differed between regions.
Furthermore, the game was rereleased in 1997 (at least the packaging implies that) as part of “Disney’s Classic”.
Here’s my collection of different versions:</p>

<div style="text-align:center">
    <img id="dmg-j7-eur-cartridge" src="/assets/jungle_book/dmg-j7-eur-cartridge.jpg" alt="DMG-J7-EUR Cartridge" width="30%" />
    <img id="dmg-j7-noe-cartridge" src="/assets/jungle_book/dmg-j7-noe-cartridge.jpg" alt="DMG-J7-NOE Cartridge" width="30%" />
    <img id="dmg-j7-usa-cartridge" src="/assets/jungle_book/dmg-j7-usa-cartridge.jpg" alt="DMG-J7-USA Cartridge" width="30%" />
</div>

<div style="text-align:center">
    <img id="dmg-j7-eur-package-front" src="/assets/jungle_book/dmg-j7-eur-package-front.jpg" alt="DMG-J7-EUR Package" width="30%" />
    <img id="dmg-j7-noe-package-front" src="/assets/jungle_book/dmg-j7-noe-package-front.jpg" alt="DMG-J7-NOE Package" width="30%" />
    <img id="dmg-j7-usa-package-front" src="/assets/jungle_book/dmg-j7-usa-package-front.jpg" alt="DMG-J7-USA Package" width="30%" />
</div>

<div style="text-align:center">
    <img id="dmg-j7-eur-package-back" src="/assets/jungle_book/dmg-j7-eur-package-back.jpg" alt="DMG-J7-EUR Package" width="30%" />
    <img id="dmg-j7-noe-package-back" src="/assets/jungle_book/dmg-j7-noe-package-back.jpg" alt="DMG-J7-NOE Package" width="30%" />
    <img id="dmg-j7-usa-package-back" src="/assets/jungle_book/dmg-j7-usa-package-back.jpg" alt="DMG-J7-USA Package" width="30%" />
</div>
<p><br /></p>

<p>Despite looking different, all games use the same binary with English language output.
Interestingly, the US version (DMG-J7-USA) stands out in many regards.
First of all, the cartridge design is very… black.
I don’t know if Metallica’s black album was used as inspiration but the black design is really off and does not fit the cheerful setting of Jungle Book at all.
Second, the instruction booklet is a fever dream:</p>
<embed src="/assets/jungle_book/dmg-j7-usa-instruction-booklet.pdf" width="100%" />

<p>As you can see at first glance, the quality of the backgrounds and screenshots used is miserable (it’s not my scanner’s fault 😉).
At first, I thought I had bought a fake copy, but after some investigation, I came to a different conclusion:
Whoever designed the instruction booklet didn’t have sufficient information or wasn’t very committed.
Because the poor picture quality is not the only flaw.
If you read through the instruction booklet, you quickly realize that many of the things mentioned, such as 3 difficulties, don’t seem to apply.
The screenshots shown don’t match the game either.
But if you know the NES version of the game, you will notice that the content of the <a href="https://www.retrogames.cz/manualy/NES/NintendoNESJungleBook.pdf">NES instruction booklet</a> has been adopted here without much thought.
The European version is better in that regard and does not contain any major flaws (sorry, I only have it in German):</p>
<embed src="/assets/jungle_book/dmg-j7-noe-booklet.pdf" width="100%" />

<p>But here too, the quality of the sprites used is poor (why is Mowgli’s head missing a piece?!).
While the European booklet is okay, the packaging mentions three difficulty levels, which is clearly wrong (the game only has a practice and a normal mode).</p>

<p>As a next point, let us move to the game’s reception.
When the game was released, the internet was still in its infancy, so any reviews from that time come from gaming magazines.
In total, I found five reviews from five magazines. Here’s what they say:</p>

<hr />
<p>UK-based video game magazine <a href="https://retrocdn.net/images/d/dd/CVG_UK_150.pdf">Computer and Video Games Issue 150</a> from May 1994.<br />
Score: 87/100.<br />
Pro: Gem system gives the game some exploration depth. Nice graphics.<br />
Con: Enemies are too weak.<br />
<!-- Game release in June. Game has 12 levels. "Well-placed continues which prevent Jungle being as frustrating as some similar platform games."
"The exploration, action and humour are well-thought-out."
SNES and MegaDrive release in July. --></p>

<p>A German article from <a href="https://www.kultboy.com/index.php?site=t&amp;id=19378">Video Games</a> from June 1994.<br />
Score: 80/100.<br />
Pro: Good graphics. Good animations. Diverse and detailed levels.<br />
Con: Background too lavish.<br />
<!-- Again, 12 levels. No release date mentioned. --></p>

<p>Another German article from <a href="https://total.seppatoni.ch/ausgabe-05-94/">Total!</a> from April 1994.
I couldn’t find a copy online, so I bought an original print from eBay.<br />
Score: 2. They use a grading system similar to the German school grading system. With a “2”, the game is among the best 3 games out of 12 in the magazine’s issue.<br />
Pro: Nice animation and soundtrack.<br />
Con: Level codes missing.<br />
<!-- Again, 12 levels. Release date July. --></p>

<p>An article from the British <a href="https://www.reddit.com/r/retrogamingmagazines/comments/jdl9fe/the_jungle_book_gameboy_review_from_total/#lightbox">Total! Nintendo Magazine</a> from April 1994.<br />
Score: 90/100.<br />
Pro: Good graphics. Good animations. Positive feeling of control.<br />
Con: More contrasting background would be great. The game overreaches itself.<br />
<!-- One of the best games you'll get for your handheld. 12 levels. Release in March. --></p>

<p>A French article from <a href="https://www.game-boy-database.com/tests/885/super%20power%20preview.jpg">SUPER POWER</a> from March 1994.
My French is a bit rusty, so I asked ChatGPT to translate the article.
It’s not really a review but more like a description with some visual impressions.
Apparently, there’s a more detailed review following (“À suivre…” -&gt; “To be continued…”), but I wasn’t able to find anything.
<!-- Roughly 10 levels ("dizaine").  Also no rating. Release in April. --></p>

<div style="text-align:center">
    <img id="cvg-150" src="/assets/jungle_book/cvg150.png" alt="Computer and Video Games Issue 150" width="15%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/cvg150.png', '_blank');" />
    <img id="video-games-review" src="/assets/jungle_book/games_review.jpg" alt="Video Games 6/94" width="15%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/games_review.jpg', '_blank');" />
    <img id="total-94-jb" src="/assets/jungle_book/total94_jungle_book.jpg" alt="Total! Das unabhängige Magazin 5/1994" width="15%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/total94_jungle_book.jpg', '_blank');" />
    <img id="total-nintendo-magazine" src="/assets/jungle_book/total_nintendo_magazin_1.webp" alt="Total Nintendo Magazine Issue 28 April 1994" width="15%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/total_nintendo_magazin_1.webp', '_blank');" />
    <img id="cvg-150" src="/assets/jungle_book/total_nintendo_magazin_2.webp" alt="Total Nintendo Magazine Issue 28 April 1994" width="15%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/total_nintendo_magazin_2.webp', '_blank');" />
    <img id="super-power-94" src="/assets/jungle_book/super_power_preview_jb.jpg" alt="Super Power March 1994" width="15%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/super_power_preview_jb.jpg', '_blank');" />
</div>

<hr />

<p><br />
On average, this gives a score of (87/100 + 80/100 + 90/100) / 3 = 86/100.
So, a pretty good score. But maybe too good?
I mean the game is nice to look at (given a constant frame rate), and it’s at least a bit of fun, but an average 86/100 score somehow feels unjustified.
Also, many of the negative points can be interpreted as positive points in disguise.
Like the background being too lavish, or the game trying to overreach itself.
Another thing that is slightly off is the number of mentioned levels.
Four of the magazines wrote about the game having 12 levels.
Well, in theory there are 12 levels, but one is a bonus level, and another one is just a transition animation.
Especially the latter cannot really be counted as a level.
Also, the mentioned number of continues is off in two magazines (magazines say 2 and 3 or 4; the actual number is 6),
as well as the size of the cartridge (1 MiB mentioned; actual ROM size is 128 KiB).
Overall, it feels like someone sent these magazines some predetermined scores and information.</p>

<p>But well, maybe my feeling is just wrong and I’m the only person who doesn’t consider the game to be highly outstanding.
That’s why I tried to find independent user reviews on the net.
You can’t find many user-written reviews of the game, but I was able to find three. Here are their conclusions:</p>

<hr />

<p><a href="https://gamefaqs.gamespot.com/gameboy/585767-disneys-the-jungle-book/reviews/173387">User review 1</a> from 2022 gave it a 7/10.<br />
Pro: Good graphics.<br />
Con: Not much variation among levels. Stiff controls. Awkward floor hitbox. Annoying leaps of faith. Time limit too strict.</p>

<p><a href="https://www.freezenet.ca/review-jungle-book-game-boy/">User review 2</a> from 2020 gave it 48/100.<br />
Pro: Good graphics.<br />
Con: Not knowing what to do. No hints for the gems. Awkward movement. Too much blurring.</p>

<p><a href="https://gamefaqs.gamespot.com/gameboy/585767-disneys-the-jungle-book/reviews/104735">User review 3</a> from 2006 gave it a 5/10.<br />
Pro: Nothing.<br />
Con: Not knowing what to do. Annoying when the last item cannot be found.</p>

<hr />

<p>The user reviews all agree on the same facts: the game’s graphics are nice to look at, but the awkward gem collecting system with the stiff controls ruins the game.
On average, this gives a score of (70/100 + 48/100 + 50/100) / 3 = 56/100.
So, quite a contrast to the 86/100 average score of the gaming magazines.</p>
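<p>For anyone who wants to re-check the two averages, here is a minimal Python sketch.
It only uses the scores listed above; the user ratings given out of 10 are scaled to a 100-point scale, and the helper name <code>average_score</code> is my own choice for illustration.</p>

```python
def average_score(scores: list[int]) -> int:
    """Return the arithmetic mean of the scores, rounded to the nearest integer."""
    return round(sum(scores) / len(scores))


# Scores on a 100-point scale, taken from the reviews listed above.
magazine_scores = [87, 80, 90]  # CVG, Video Games, Total! (UK)
user_scores = [70, 48, 50]      # 7/10 and 5/10 scaled to 100

print("Magazines:", average_score(magazine_scores))
print("Users:", average_score(user_scores))
```

<p>The magazine average is 85.67 before rounding, so the 86/100 above is a rounded value.</p>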

<p>I’m currently doing further investigations and will update this section from time to time.
But for now, that’s it!</p>

<!-- Also a nice website about reverse engineering some parts of the game: https://agorbz.substack.com/p/i-wrote-a-patch-to-beat-an-awful -->

<h2 id="basic-game-facts">Basic Game Facts</h2>

<h3 id="gameplay">Gameplay</h3>
<ul>
  <li>The game comprises 10 levels that have to be beaten in order to reach the credits screen.</li>
  <li>You cannot save. If you want to beat the game, you have to beat all levels in one session.</li>
  <li>In the start menu, you can set the game to practice mode (it should rather be “easy mode” IMHO) by pressing SELECT. I highly recommend this mode.</li>
  <li>In order to finish a level, you need to collect all gems (7 in practice mode and 10 in normal mode). Level 8 (FALLING RUINS) is the only exception: here, collecting a single gem is sufficient.</li>
  <li>Some levels additionally require you to meet/defeat characters such as Kaa or Baloo.</li>
  <li>You have 5 minutes to finish a level.</li>
  <li>Mowgli has 52 health points.</li>
  <li>All enemies deal 4 damage per hit in normal mode and 2 damage in practice mode.</li>
  <li>Water deals continuous damage, independent of the chosen mode.</li>
  <li>You have 6 lives.</li>
  <li>You have 4 continues in normal mode and 6 continues in practice mode.</li>
</ul>
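<p>The health and damage numbers above translate directly into the number of enemy hits Mowgli can take.
The following Python sketch does the arithmetic; it assumes that no grapes are collected, that no water damage is involved, and that the health bar is empty once it reaches zero.
The function name is hypothetical and not taken from the game’s code.</p>

```python
import math

MAX_HEALTH = 52  # Mowgli's health points, as listed above
ENEMY_DAMAGE = {"normal": 4, "practice": 2}  # damage per enemy hit in each mode


def hits_to_drain_health(mode: str, health: int = MAX_HEALTH) -> int:
    """Return how many enemy hits are needed to bring `health` down to zero."""
    return math.ceil(health / ENEMY_DAMAGE[mode])


for mode in ENEMY_DAMAGE:
    print(f"{mode}: {hits_to_drain_health(mode)} hits")
```

<p>Under these assumptions, it takes 13 hits in normal mode and 26 hits in practice mode to drain a full health bar.</p>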

<h3 id="controls">Controls</h3>
<ul>
  <li>Press SELECT in the start menu to toggle difficulty modes.</li>
  <li>Press SELECT in the game to switch between different items/weapons.</li>
  <li>Press START to (un)pause the game.</li>
  <li>Use the D-pad to control Mowgli.</li>
  <li>Press A to jump.</li>
  <li>Press B to shoot projectiles and run faster.</li>
  <li>If you press A+B+START+SELECT, the game will be restarted.</li>
</ul>

<h3 id="items">Items</h3>
<p>The following items can be found across the map or are dropped by enemies:</p>
<ul>
  <li>Boomerang <img id="item-boomerang" src="/assets/jungle_book/item_boomerang.svg" style="display:inline; height:2em;" />: A boomerang that can be used as a weapon.</li>
  <li>Double banana <img id="item-double-banana" src="/assets/jungle_book/item_double_banana.svg" style="display:inline; height:2em;" />: A double banana that can be used as a weapon.</li>
  <li>Extra life <img id="item-extra-life" src="/assets/jungle_book/item_extra_life.svg" style="display:inline; height:2em;" />: Collect Mowgli’s head to get an extra life.</li>
  <li>Extra time <img id="item-extra-time" src="/assets/jungle_book/item_extra_time.svg" style="display:inline; height:2em;" />: Gives some extra time when collected (1 minute in normal, 2 minutes in practice).</li>
  <li>Extra level <img id="item-extra-level" src="/assets/jungle_book/item_extra_level.svg" style="display:inline; height:2em;" />: If you collect the shovel, there will be a bonus level before the next level to collect additional items.</li>
  <li>Flower <img id="item-grapes" src="/assets/jungle_book/item_checkpoint.svg" style="display:inline; height:2em;" />: Activates a checkpoint when walking through the flower.</li>
  <li>Gem <img id="item-gem" src="/assets/jungle_book/item_diamond.svg" style="display:inline; height:2em;" />: Collect gems to beat a level.</li>
  <li>Grapes <img id="item-grapes" src="/assets/jungle_book/item_grapes.svg" style="display:inline; height:2em;" />: Fills up your health bar when collected.</li>
  <li>Leaf <img id="item-leaf" src="/assets/jungle_book/item_leaf.svg" style="display:inline; height:2em;" />: Can only be collected in the bonus level. Grants an additional continue.</li>
  <li>Medicine man mask <img id="medicine-man-mask" src="/assets/jungle_book/item_mask.svg" style="display:inline; height:2em;" />: Grants invulnerability if selected as a weapon.</li>
  <li>Pineapple <img id="item-pineapple" src="/assets/jungle_book/item_pineapple.svg" style="display:inline; height:2em;" />: Just gives some extra points.</li>
  <li>Stones <img id="item-leaf" src="/assets/jungle_book/item_stones.svg" style="display:inline; height:2em;" />: Stones that can be used as a weapon.</li>
</ul>

<h3 id="weapons">Weapons</h3>
<p>There are 5 different weapons/items the player can use:</p>
<ul>
  <li>Banana (Index 0): Default weapon. Unlimited.</li>
  <li>Double Bananas (Index 1): 0 by default. Dropped by enemies.</li>
  <li>Boomerang (Index 2): 0 by default. Dropped by enemies.</li>
  <li>Stones (Index 3): 0 by default. Dropped by enemies.</li>
  <li>Mask (Index 4): 0 by default. Dropped by enemies. Grants you invincibility for a given time. During invincibility you shoot your default bananas.</li>
</ul>

<p>The damage of the weapons (except for the default banana) is calculated as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>damage = (weapon_index * 2 + 1) * (NormalMode ? 1 : 2)
</code></pre></div></div>

<p>Or summarized in a table:</p>

<table>
  <thead>
    <tr>
      <th>Weapon</th>
      <th>Normal Mode Damage</th>
      <th>Practice Mode Damage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Banana</td>
      <td>3</td>
      <td>6</td>
    </tr>
    <tr>
      <td>Double Banana</td>
      <td>3</td>
      <td>6</td>
    </tr>
    <tr>
      <td>Boomerang</td>
      <td>5</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Stones</td>
      <td>7</td>
      <td>14</td>
    </tr>
  </tbody>
</table>

<p>Note that the double banana may hit a target twice (once with each individual banana) leading to twice the damage.</p>
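<p>The formula can also be written as a small Python function.
This is only a sketch for reproducing the table rows that the formula covers (weapon indices 1 to 3); the names <code>WEAPON_INDEX</code> and <code>weapon_damage</code> are made up for illustration and do not appear in the game’s code.
The default banana is left out, since the formula does not apply to it.</p>

```python
# Weapon indices as listed above; the default banana (index 0) is excluded
# because the formula does not apply to it.
WEAPON_INDEX = {"Double Banana": 1, "Boomerang": 2, "Stones": 3}


def weapon_damage(weapon_index: int, normal_mode: bool) -> int:
    """Damage per hit: (weapon_index * 2 + 1), doubled in practice mode."""
    return (weapon_index * 2 + 1) * (1 if normal_mode else 2)


for name, index in WEAPON_INDEX.items():
    normal = weapon_damage(index, normal_mode=True)
    practice = weapon_damage(index, normal_mode=False)
    print(f"{name}: {normal} (normal), {practice} (practice)")
```

<p>The values are per projectile, so a double banana that connects with both bananas deals this damage twice.</p>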

<h3 id="enemies">Enemies</h3>

<p>During your adventure, you will face many different enemies.
Compared to platformers like Super Mario Land, the enemies in this game are not really an immediate threat to your health.
In normal mode, collisions with enemies deal 4 damage, and in practice mode it is 2.
With 52 starting health points, it takes 13 hits in normal mode (or 26 in practice mode) to drain your health bar, so you can take quite a few hits before having a problem.
As with most Game Boy platformers, the enemies are quite bland and essentially boil down to moving obstacles.
For the sake of completeness of this guide, let me give you a list of all non-boss enemies in this game (the enemy animations were extracted with my <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/utils/animation_extractor.py">animation extractor</a>):</p>

<p><img id="armadillo" src="/assets/jungle_book/sprites/armadillo_animation.webp" alt="Armadillo" width="60pt" />
<strong>Armadillo</strong>: An enemy that can curl up to avoid taking damage. Works similarly to the porcupine.</p>

<p><img id="bat" src="/assets/jungle_book/sprites/bat_animation.webp" alt="Bat" width="60pt" />
<strong>Bat</strong>: A flying obstacle.</p>

<p><img id="boar" src="/assets/jungle_book/sprites/boar_animation.webp" alt="Boar" width="60pt" />
<strong>Boar</strong>: A running obstacle.</p>

<p><img id="cobra" src="/assets/jungle_book/sprites/cobra_animation.webp" alt="Cobra" width="60pt" />
<strong>Cobra</strong>: A standing obstacle that shoots projectiles.</p>

<p><img id="crawling_snake" src="/assets/jungle_book/sprites/crawling_snake_animation.webp" alt="Crawling snake" width="60pt" />
<strong>Crawling snake</strong>: Unkillable, crawling enemy. Similar to the lizard.</p>

<p><img id="Crocodile" src="/assets/jungle_book/sprites/crocodile_animation.webp" alt="Crocodile" width="60pt" />
<strong>Crocodile</strong>: Can be used as a platform. But beware: it might open its mouth.</p>

<p><img id="fish" src="/assets/jungle_book/sprites/fish_animation.webp" alt="Fish" width="60pt" />
<strong>Fish</strong>: A jumping, unkillable obstacle. In Level 5, you can use it to skip the floating Baloo part.</p>

<p><img id="flying-bird" src="/assets/jungle_book/sprites/flying_bird_animation.webp" alt="Flying Bird" width="60pt" />
<strong>Flying Bird</strong>: A flying obstacle.</p>

<p><img id="frog" src="/assets/jungle_book/sprites/frog_animation.webp" alt="Frog" width="60pt" />
<strong>Frog</strong>: Can jump and shoot projectiles.</p>

<p><img id="hippo" src="/assets/jungle_book/sprites/hippo_animation.webp" alt="Hippo" width="60pt" />
<strong>Hippo</strong>: Floating Baloo can collide with these.</p>

<p><img id="lizzard" src="/assets/jungle_book/sprites/lizzard_animation.webp" alt="Lizard" width="60pt" />
<strong>Lizard</strong>: Unkillable, crawling enemy. Similar to the crawling snake.</p>

<p><img id="monkey" src="/assets/jungle_book/sprites/walking_monkey_animation.webp" alt="Monkey" width="60pt" />
<strong>Monkey</strong>: The most frequent enemy in the game. May walk, stand, or hang down.</p>

<p><img id="mosquito" src="/assets/jungle_book/sprites/mosquito_animation.webp" alt="Mosquito" width="60pt" />
<strong>Mosquito</strong>: I guess this represents some kind of mosquito swarm. Often placed near lianas.</p>

<p><img id="porcupine" src="/assets/jungle_book/sprites/porcupine_animation.webp" alt="Porcupine" width="60pt" />
<strong>Porcupine</strong>: An enemy that can curl up to avoid taking damage. Works similarly to the armadillo.</p>

<p><img id="scorpion" src="/assets/jungle_book/sprites/scorpion_animation.webp" alt="Scorpion" width="60pt" />
<strong>Scorpion</strong>: Walking enemy that shoots projectiles.</p>

<h3 id="cheats">Cheats</h3>

<p>After searching on the internet and reverse engineering large parts of the game, I’m almost certain: there are no intentional cheats in this game.
This is a bit of a bummer, especially when comparing the Game Boy game against its other versions, but also no big surprise.
I guess cramming cheat features into the small 128 KiB cartridge wasn’t deemed too important.</p>

<p>But luckily there’s a way to retroactively add cheats to a game using cheat modules, like Game Genie or Game Shark.
The working principle of these cheat modules is quite simple: they set a given memory address to a given value.
The address and value are provided by you to the cheat modules using cheat codes.
Since attributes like health or number of projectiles often stick to a static address, it’s very simple to manipulate them.
You just need to know where they are located in the memory.
Since I was already reverse-engineering the game, getting these memory locations was a byproduct.
Here are some Game Shark codes that I created (remove the comments before pasting them):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>019985c1  ; Infinite double bananas
019986c1  ; Infinite boomerangs
019987c1  ; Infinite stones
019988c1  ; Infinite invincibility
0134b8c1  ; Infinite health
0109c3c1  ; Infinite time
0109b7c1  ; Infinite lives
</code></pre></div></div>
<p>Note that you don’t necessarily need a real Game Shark or Game Genie. Many popular emulators also support these codes.</p>
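<p>To see which address and value a code refers to, you can take it apart with a few lines of Python.
The sketch below assumes the commonly documented 8-digit Game Boy Game Shark layout: one byte for the code type, one byte for the value, and the 16-bit address stored low byte first.
The class and function names are hypothetical.</p>

```python
from typing import NamedTuple


class GameSharkCode(NamedTuple):
    code_type: int  # first byte (0x01 in all codes above)
    value: int      # byte that gets written
    address: int    # 16-bit target address


def decode_gameshark(code: str) -> GameSharkCode:
    """Decode an 8-digit code, assuming the layout: type, value, address low, address high."""
    code = code.split(";")[0].strip()  # drop a trailing "; comment" if present
    if len(code) != 8:
        raise ValueError(f"expected 8 hex digits, got {code!r}")
    code_type, value, addr_lo, addr_hi = bytes.fromhex(code)
    return GameSharkCode(code_type, value, (addr_hi << 8) | addr_lo)


health = decode_gameshark("0134b8c1  ; Infinite health")
print(f"type={health.code_type:#04x} value={health.value} address={health.address:#06x}")
```

<p>Under this assumption, the health code writes the value 52 (0x34) to address 0xC1B8, which matches Mowgli’s 52 health points and lies in the Game Boy’s work RAM.</p>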

<h2 id="the-levels">The Levels</h2>
<h3 id="level-1-jungle-by-day">Level 1 (JUNGLE BY DAY)</h3>

<p>The first and probably simplest level takes place in the jungle by day.
I think there is no real association with the movie’s plot and it just serves as an introduction.
The gimmick of this level is a catapult.
With a tool-assisted replay, the level can be beaten in 0:42.</p>

<p>width x height (in pixels): 3072 x 512</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>1x boomerang</li>
  <li>1x double banana</li>
  <li>2x extra lives</li>
  <li>2x mask</li>
  <li>0x stones</li>
  <li>2x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
</ul>

<div style="text-align:center">
    <img id="lvl1-map" src="/assets/jungle_book/lvl1map_clean.png" alt="Level 1 map" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl1map_clean.png', '_blank');" />
</div>
<p><br /></p>
<div style="text-align:center">
    <img id="lvl1-map-annotated" src="/assets/jungle_book/lvl1map_annotated.svg" alt="Level 1 map with annotations" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl1map_annotated.svg', '_blank');" />
</div>
<p><br /></p>
<div style="text-align:center">
    <img id="lvl1-map-speedrun" src="/assets/jungle_book/lvl1map_speedrun.svg" alt="Level 1 map speedrun" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl1map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level1.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-2-the-great-tree">Level 2 (THE GREAT TREE)</h3>

<p>The second level takes place at the Great Tree, in which Kaa resides.
In the game, there’s pretty much no plot, but I guess this is the point in the movie where Mowgli and Bagheera meet Kaa for the first time.
After collecting all gems you still have to defeat Kaa in some kind of boss battle at the end of the level.
The gimmick of this level is a kind of elevator inside the tree’s trunk.
With a tool-assisted replay, the level can be beaten in 1:00.</p>

<p>width x height (in pixels): 768 x 2048</p>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
  <li>Defeat Kaa</li>
</ul>

<div style="text-align:center">
    <img id="lvl2-map" src="/assets/jungle_book/lvl2map_clean.png" alt="Level 2 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl2map_clean.png', '_blank');" />

    <img id="lvl2-map-annotated" src="/assets/jungle_book/lvl2map_annotated.svg" alt="Level 2 map with annotations" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl2map_annotated.svg', '_blank');" />

    <img id="lvl2-map-speedrun" src="/assets/jungle_book/lvl2map_speedrun.svg" alt="Level 2 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl2map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level2.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-3-dawn-patrol">Level 3 (DAWN PATROL)</h3>

<p>After the encounter with Kaa, Mowgli meets Colonel Hathi and his dawn patrol.
The dawn patrol also represents this level’s gimmick: A walking elephant herd that can be used as a platform.
With a tool-assisted replay, the level can be beaten in 0:50.</p>

<p>width x height (in pixels): 5376 x 320</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>1x boomerang</li>
  <li>0x double banana</li>
  <li>0x extra lives</li>
  <li>1x mask</li>
  <li>0x stones</li>
  <li>0x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
</ul>

<div style="text-align:center">
<img id="lvl3-map-annotated" src="/assets/jungle_book/lvl3map_clean.png" alt="Level 3 map" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl3map_clean.png', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center">
<img id="lvl3-map-annotated" src="/assets/jungle_book/lvl3map_annotated.svg" alt="Level 3 map with annotations" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl3map_annotated.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center">
<img id="lvl3-map-speedrun" src="/assets/jungle_book/lvl3map_speedrun.svg" alt="Level 3 map speedrun" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl3map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level3.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-4-by-the-river">Level 4 (BY THE RIVER)</h3>

<p>This is the first level where you encounter water.
Being in the water progressively reduces your health until you die.
Unfortunately, the invincibility mask does not work against water.
I would say that this is one of the harder levels as falling into the water may happen frequently.
At the end of this level, you have to defeat Baloo.
With a tool-assisted replay, the level can be beaten in 1:02.</p>

<p>width x height (in pixels): 4096 x 512</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>2x boomerang</li>
  <li>0x double banana</li>
  <li>1x extra lives</li>
  <li>0x mask</li>
  <li>3x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
  <li>Defeat Baloo</li>
</ul>

<div style="text-align:center">
<img id="lvl4-map-annotated" src="/assets/jungle_book/lvl4map_clean.png" alt="Level 4 map" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl4map_clean.png', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center">
<img id="lvl4-map-annotated" src="/assets/jungle_book/lvl4map_annotated.svg" alt="Level 4 map with annotations" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl4map_annotated.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center">
<img id="lvl4-map-speedrun" src="/assets/jungle_book/lvl4map_speedrun.svg" alt="Level 4 map speedrun" width="99%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl4map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level4.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-5-in-the-river">Level 5 (IN THE RIVER)</h3>

<p>Loosely following the plot of the movie, Mowgli is floating down the river on Baloo.
Interestingly, you can shorten the level significantly by using one of the fish at the beginning of the level to push you onto a platform.
This avoids floating down the whole river and saves more than a minute.
With a tool-assisted replay, the level can be beaten in 0:39.</p>

<p>width x height (in pixels): 1792 x 1024</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>1x boomerang</li>
  <li>2x double banana</li>
  <li>1x extra lives</li>
  <li>1x mask</li>
  <li>0x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
</ul>

<div style="text-align:center">
    <img id="lvl5-map" src="/assets/jungle_book/lvl5map_clean.png" alt="Level 5 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl5map_clean.png', '_blank');" />

    <img id="lvl5-map-annotated" src="/assets/jungle_book/lvl5map_annotated.svg" alt="Level 5 map with annotations" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl5map_annotated.svg', '_blank');" />

    <img id="lvl5-map-speedrun" src="/assets/jungle_book/lvl5map_speedrun.svg" alt="Level 5 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl5map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level5.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-6-tree-village">Level 6 (TREE VILLAGE)</h3>

<p>Next, Mowgli is in the tree village, where he has to defeat the monkeys.
This level is relatively easy; its gimmick is a set of teleporting tree houses.
With a tool-assisted replay, the level can be beaten in 1:10.</p>

<p>width x height (in pixels): 2048 x 1024</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>1x boomerang</li>
  <li>0x double banana</li>
  <li>0x extra lives</li>
  <li>0x mask</li>
  <li>2x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
  <li>Defeat the monkeys</li>
</ul>

<div style="text-align:center">
    <img id="lvl6-map" src="/assets/jungle_book/lvl6map_clean.png" alt="Level 6 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl6map_clean.png', '_blank');" />

    <img id="lvl6-map-annotated" src="/assets/jungle_book/lvl6map_annotated.svg" alt="Level 6 map with annotations" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl6map_annotated.svg', '_blank');" />

    <img id="lvl6-map-speedrun" src="/assets/jungle_book/lvl6map_speedrun.svg" alt="Level 6 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl6map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level6.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-7-ancient-ruins">Level 7 (ANCIENT RUINS)</h3>
<p>This is again a rather easy level, with teleporting doors as its gimmick.
With a tool-assisted replay, the level can be beaten in 0:35.</p>

<p>width x height (in pixels): 2048 x 1024</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>2x boomerang</li>
  <li>2x double banana</li>
  <li>1x extra lives</li>
  <li>1x mask</li>
  <li>0x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
</ul>

<div style="text-align:center">
    <img id="lvl7-map" src="/assets/jungle_book/lvl7map_clean.png" alt="Level 7 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl7map_clean.png', '_blank');" />

    <img id="lvl7-map-annotated" src="/assets/jungle_book/lvl7map_annotated.svg" alt="Level 7 map with annotations" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl7map_annotated.svg', '_blank');" />

    <img id="lvl7-map-speedrun" src="/assets/jungle_book/lvl7map_speedrun.svg" alt="Level 7 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl7map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level7.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-8-falling-ruins">Level 8 (FALLING RUINS)</h3>

<p>This level stands out because your primary objective is to jump up the falling stones.
At the end of this stage, a single gem and a fight with King Louie await Mowgli.
During the boss fight, King Louie occasionally drops items, with the shovel (bonus level) being one of them.
While jumping from stone to stone is relatively easy, some parts of this level require you to jump onto stones without seeing them.
This is where having a map comes in handy.
With a tool-assisted replay, the level can be beaten in 1:31.</p>

<p>width x height (in pixels): 1056 x 1728</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>2x boomerang</li>
  <li>2x double banana</li>
  <li>1x extra lives</li>
  <li>1x mask</li>
  <li>0x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect the single gem</li>
  <li>Defeat King Louie</li>
</ul>

<div style="text-align:center">
    <img id="lvl8-map" src="/assets/jungle_book/lvl8map_clean.png" alt="Level 8 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl8map_clean.png', '_blank');" />

    <img id="lvl8-map-annotated" src="/assets/jungle_book/lvl8map_annotated.svg" alt="Level 8 map with annotations" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl8map_annotated.svg', '_blank');" />

    <img id="lvl8-map-speedrun" src="/assets/jungle_book/lvl8map_speedrun.svg" alt="Level 8 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl8map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level8.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-9-jungle-by-night">Level 9 (JUNGLE BY NIGHT)</h3>

<p>This level takes place in a setting similar to the first level’s, but at night.
That also seems to be this level’s “gimmick”.
There isn’t really anything worth noting, except for a platform that seems to be unreachable (see “?” in the annotated version).
Finally something really interesting! What secrets might be hidden there? Maybe some easter egg? Or an alternative ending?
Since I was already reverse engineering the game, I looked for an easy way to get up there.
I chose to replace all normal jumps with catapult jumps and yeet myself up there.
In the code, this can be achieved by simply replacing <code class="language-plaintext highlighter-rouge">JUMP_DEFAULT</code> with <code class="language-plaintext highlighter-rouge">JUMP_CATAPULT</code>.
With catapult jumps throwing me through the level, I finally arrived at the mysterious platform, and I found… a walking monkey, which drops a health package.
So, I guess this unreachable platform is just a flaw in the level’s design…
With a tool-assisted replay, the level can be beaten in 0:55.</p>

<p>width x height (in pixels): 2048 x 1024</p>

<p>Items:</p>
<ul>
  <li>1x bonus level</li>
  <li>2x boomerang</li>
  <li>3x double banana</li>
  <li>1x extra lives</li>
  <li>2x mask</li>
  <li>0x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect all gems</li>
</ul>

<div style="text-align:center">
    <img id="lvl9-map" src="/assets/jungle_book/lvl9map_clean.png" alt="Level 9 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl9map_clean.png', '_blank');" />

    <img id="lvl9-map-annotated" src="/assets/jungle_book/lvl9map_annotated.svg" alt="Level 9 map with annotations" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl9map_annotated.svg', '_blank');" />

    <img id="lvl9-map-speedrun" src="/assets/jungle_book/lvl9map_speedrun.svg" alt="Level 9 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl9map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level9.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-10-the-wastelands">Level 10 (THE WASTELANDS)</h3>

<p>This is the final level in which you have to defeat Shere Khan.
Besides some fire on the ground, there’s nothing particularly special.
Just be careful with the last checkpoint as it may soft lock you.
With a tool-assisted replay, the level can be beaten in 0:44.</p>

<p>width x height (in pixels): 2048 x 1024</p>

<p>Items:</p>
<ul>
  <li>0x bonus level</li>
  <li>2x boomerang</li>
  <li>0x double banana</li>
  <li>1x extra lives</li>
  <li>1x mask</li>
  <li>2x stones</li>
  <li>1x time</li>
</ul>

<p>Passing Criteria:</p>
<ul>
  <li>Collect the single gem</li>
  <li>Defeat Shere Khan</li>
</ul>

<div style="text-align:center">
  <img id="lvl10-map" src="/assets/jungle_book/lvl10map_clean.png" alt="Level 10 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl10map_clean.png', '_blank');" />
  <img id="lvl10-map-annotated" src="/assets/jungle_book/lvl10map_annotated.svg" alt="Level 10 map annotated" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl10map_annotated.svg', '_blank');" />
  <img id="lvl10-map-speedrun" src="/assets/jungle_book/lvl10map_speedrun.svg" alt="Level 10 map speedrun" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl10map_speedrun.svg', '_blank');" />
</div>
<p><br /></p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="70%" height="70%">
    <source src="/assets/jungle_book/level10.webm" type="video/webm" />
  </video>
</div>

<h3 id="level-11-bonus">Level 11 (Bonus)</h3>
<p>This is the bonus level, which can be reached by collecting a shovel in a regular level.
I labeled it “Level 11” due to the game internally encoding it as the 11th level.
The point of this level is to gear up Mowgli with all sorts of weapons, extra lives, and continues.
However, the actual type of items is randomly determined, which is annotated by a “?” in the annotated version of the level’s map.
The level finishes when all eight items have been collected or when the time runs out.</p>

<p>width x height (in pixels): 768 x 640</p>

<p>Items:</p>
<ul>
  <li>8x random item</li>
</ul>

<div style="text-align:center">
  <img id="lvl11-map" src="/assets/jungle_book/lvl11map_clean.png" alt="Level 11 map" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl11map_clean.png', '_blank');" />
  <img id="lvl11-map-annotated" src="/assets/jungle_book/lvl11map_annotated.svg" alt="Level 11 map annotated" width="32%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl11map_annotated.svg', '_blank');" />
</div>
<p><br /></p>

<h3 id="level-12-transition">Level 12 (Transition)</h3>

<p>This is the transition “level” which is used in between levels.
It’s not really a level, but the game internally encodes it as the 12th level.
You cannot move here, and there are several animations playing depending on what you collected in the previous level.
Usually you only see the left part of the level, but after finishing Level 10, the camera moves to the right and reveals the girl from the nearby village.
Nothing really special happens and after a few seconds the credits are shown.</p>

<p>width x height (in pixels): 320 x 160</p>

<div style="text-align:center">
  <img id="lvl12-map" src="/assets/jungle_book/lvl12map_clean.png" alt="Level 12 map" width="50%" style="cursor: pointer" onclick="window.open('/assets/jungle_book/lvl12map_clean.png', '_blank');" />
</div>
<p><br /></p>

<h2 id="putting-it-all-together">Putting It All Together</h2>

<p>Since I was way too invested in the game, doing a speedrun was the next logical step.
Here’s my attempt, which I also submitted to <a href="https://www.speedrun.com/the_jungle_book_gb">www.speedrun.com</a>.
To comply with the speedrun rules, I played the <a href="https://store.steampowered.com/app/1636800/The_Jungle_Book_and_MORE_Aladdin_Pack/">Disney Classics</a> version.</p>

<div style="text-align:center" class="video-border">
  <video controls="" preload="none" width="90%" height="70%">
    <source src="/assets/jungle_book/jb_speedrun_wr.webm" type="video/webm" />
  </video>
</div>

<p>I fucked up a few times, but I still managed to get the first place 😎</p>

<div style="text-align:center">
  <img id="speedrun-list" src="/assets/jungle_book/speedrun_list.png" alt="Speedrun list" width="80%" />
</div>
<p><br /></p>

<h2 id="the-reverse-engineering-process">The Reverse Engineering Process</h2>

<p>In this section, I highlight the details of extracting the level maps from the game.
All code references are taken from the corresponding <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/">GitHub repository</a>.</p>

<p>When I initially planned to extract the maps from this game, I was like: “That’s going to be easy, I just need to find the right memory location and copy the data.”
Well, it turns out I was wrong, as the game uses quite a few methods to cram the maps into the 128 KiB of the cartridge.
To understand how much compression is needed, let us do some basic calculations.
Level 1 (JUNGLE BY DAY) has a size of 3072 x 512 pixels.
With two bits per pixel, that would be 3072 x 512 x 2 / 8 = 393,216 bytes, or 384 KiB, of data.
That is exactly three times the size of the entire cartridge (128 KiB).
And that is just one of 10 levels.
So, what are the tricks here?</p>
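
<p>The calculation above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch; the helper name <code class="language-plaintext highlighter-rouge">raw_level_bytes</code> is made up for this illustration and does not appear in the repository.</p>

```python
# Back-of-the-envelope size of one uncompressed level on the Game Boy.
BITS_PER_PIXEL = 2            # four shades per pixel
CARTRIDGE_BYTES = 128 * 1024  # 128 KiB ROM, as stated in the text


def raw_level_bytes(width_px, height_px):
    """Bytes needed to store a level pixel by pixel at 2 bits per pixel."""
    return width_px * height_px * BITS_PER_PIXEL // 8


level1 = raw_level_bytes(3072, 512)
print(level1 // 1024, "KiB")     # Level 1 stored as a plain bitmap
print(level1 / CARTRIDGE_BYTES)  # how many cartridges that would fill
```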

<p>The first “trick” is the Game Boy’s tile-based rendering.
Instead of providing the data for the whole screen pixel by pixel, as you would with a framebuffer, you provide tiles of 8x8 pixels and pointers to these tiles.
The idea is to reuse tiles across the screen, which saves enormous amounts of memory.
So, the first step of the reverse engineering process was finding out where the data of the tiles resides in the ROM.</p>
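
<p>As a side note on the tile format: on the Game Boy, each 8x8 tile occupies 16 bytes, two bytes per row, where the first byte holds the low bit and the second byte the high bit of every pixel’s 2-bit color. A minimal Python sketch of decoding a single tile could look like this (the name <code class="language-plaintext highlighter-rouge">decode_tile</code> is hypothetical and not taken from the repository):</p>

```python
def decode_tile(tile_bytes):
    """Decode one 16-byte Game Boy tile into 8 rows of 8 color indices (0-3)."""
    if len(tile_bytes) != 16:
        raise ValueError("a 2bpp tile is exactly 16 bytes")
    rows = []
    for y in range(8):
        low = tile_bytes[2 * y]       # bit plane 0 of this row
        high = tile_bytes[2 * y + 1]  # bit plane 1 of this row
        row = []
        for x in range(8):
            bit = 7 - x               # leftmost pixel is the most significant bit
            color = ((high >> bit) % 2) * 2 + ((low >> bit) % 2)
            row.append(color)
        rows.append(row)
    return rows


# One row with the bytes 0x3C and 0x7E, followed by seven empty rows.
example = decode_tile(bytes([0x3C, 0x7E] + [0] * 14))
print(example[0])
```

<p>Decoding all tiles of a level this way and arranging them in a grid gives a tile sheet that can be saved as an image.</p>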

<p>It took me a while, but I managed to find an array that holds pointers to the tiles for every level:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>; $409a: A 4-tuple per level (vram pointer0, pointer to compressed data0, vram pointer1, pointer to compressed data1)
; The first pointer points to data for the general level setting (jungle, tree, ruins, etc.).
; The second pointer points to data for level-specific stuff (catapult, elephants, etc.).
CompressedMapBgTilesBasePtr::
    dw $9000, CompressedMapBgTiles1, $96c0, CompressedMapBgTiles10 ; Level 1: JUNGLE BY DAY
    dw $9000, CompressedMapBgTiles2, $96d0, CompressedMapBgTiles20 ; Level 2: THE GREAT TREE
    dw $9000, CompressedMapBgTiles1, $96c0, CompressedMapBgTiles30 ; Level 3: DAWN PATROL
    ...
</code></pre></div></div>
<p>As already mentioned in the code’s comments, each level has a basic tile palette, such as a plain jungle setting, and some special level-specific tiles, such as the catapult.
Unfortunately, the data does not reside in the ROM as plain tile palettes.
Instead, it is compressed, and the game runs a decompression algorithm to bring it into a usable structure.
Other games, such as <a href="https://www.huderlem.com/blog/posts/carrot-crazy-2/">Looney Tunes: Carrot Crazy</a>, used similar means.
After reverse engineering the code and reading up on compression algorithms, I eventually found out that the developers used the <a href="https://en.wikipedia.org/wiki/LZ77_and_LZ78#LZ77">LZ77 algorithm</a>.
If you are interested in the implementation of the algorithm, search the source code for <code class="language-plaintext highlighter-rouge">DecompressData</code>.
Using the LZ77 algorithm, the 1728 bytes of tile data for the first level are compressed to 1247 bytes.
That is a space saving of 27.8%, which is something, but not that much.
After rewriting the algorithm in Python, I managed to extract the basic and special tile palettes.
For instance, the combined basic and special tiles for the first level (JUNGLE BY DAY) look like this:</p>

<p><img src="/assets/jungle_book/lvl1_basic_and_special.png" alt="image" /></p>

<p>Note that for some levels some special cases arise, but this is basically the gist of it.</p>
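
<p>To give a rough idea of what LZ77 decompression does, here is a generic Python sketch. It is explicitly <em>not</em> the game’s bitstream format (for that, see <code class="language-plaintext highlighter-rouge">DecompressData</code> in the repository); the token representation and the name <code class="language-plaintext highlighter-rouge">lz77_decompress</code> are assumptions made purely for illustration.</p>

```python
def lz77_decompress(tokens):
    """Expand a list of LZ77 tokens into bytes.

    A token is either an int (a literal byte) or a (distance, length) tuple
    that copies `length` bytes starting `distance` bytes back in the output.
    """
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):
            out.append(token)
            continue
        distance, length = token
        if distance not in range(1, len(out) + 1):
            raise ValueError("back-reference points before the start of the output")
        start = len(out) - distance
        for i in range(length):
            # Byte-wise copy, so overlapping references (length greater than
            # distance) repeat the bytes that were just written.
            out.append(out[start + i])
    return bytes(out)


demo = lz77_decompress([0xAA, 0xBB, (2, 6), 0xCC])
print(demo.hex())

# Space saving for the tile data of the first level, using the numbers above.
saving = 1 - 1247 / 1728
print(f"{saving:.1%}")
```

<p>Back-references are what make repetitive tile data shrink: a run of identical or repeating bytes collapses into a single (distance, length) pair.</p>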

<p>After obtaining the tiles, the next step is to obtain the indices, also called the tile map.
These indices are simple 8-bit integers indicating which tile is placed at each position.
At first, I thought the levels would use a simple 2D array.
Had I calculated the size of such an array at the beginning, I would have seen that this idea does not work out.
With the first level having a size of 3072 x 512 pixels, you would need (3072/8) x (512/8) = 24,576 bytes for the indices.
As the other levels have a similar size, putting 10 levels like that into a 128 KiB cartridge does not really work.</p>
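
<p>To make this concrete, the following sketch sums up the size of a plain one-byte-per-tile map for the levels whose dimensions are listed in the level overview (Levels 2 and 3 are left out of this sketch). The helper name <code class="language-plaintext highlighter-rouge">flat_tile_map_bytes</code> is made up for this illustration.</p>

```python
TILE = 8                      # pixels per tile edge
CARTRIDGE_BYTES = 128 * 1024  # 128 KiB ROM

# Level sizes in pixels (width, height) as listed in the level overview.
LEVEL_SIZES = {
    1: (3072, 512),
    4: (4096, 512),
    5: (1792, 1024),
    6: (2048, 1024),
    7: (2048, 1024),
    8: (1056, 1728),
    9: (2048, 1024),
    10: (2048, 1024),
}


def flat_tile_map_bytes(width_px, height_px):
    """One byte per 8x8 tile for a plain 2D tile map."""
    return (width_px // TILE) * (height_px // TILE)


total = sum(flat_tile_map_bytes(w, h) for w, h in LEVEL_SIZES.values())
print(flat_tile_map_bytes(*LEVEL_SIZES[1]))  # Level 1 alone
print(total, "of", CARTRIDGE_BYTES)          # eight levels vs. the whole ROM
```

<p>Even without Levels 2 and 3, the plain tile maps alone would need roughly 240 KiB, so they would not fit into the cartridge even if it contained nothing else.</p>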

<p>So, I had to do some more reverse engineering.
The conclusion was that the game uses tiles to create meta tiles (with 16 x 16 pixels, or 2 x 2 tiles).
And these meta tiles are again used to create bigger meta tiles (with 32 x 32 pixels, or 4 x 4 tiles).
Here are the 2x2 and 4x4 meta tiles of the first level (maybe open them in a new tab and zoom in):</p>

<p><img src="/assets/jungle_book/lvl1_2x2.png" alt="image" /></p>

<p><img src="/assets/jungle_book/lvl1_4x4.png" alt="image" /></p>

<p>As nicely described in <a href="https://www.huderlem.com/blog/posts/carrot-crazy-3/">this post</a>, other games seem to use a similar concept as well.
Using these big meta tiles, the indices of the first level are stored in a 2D array, only requiring (3072/32) x (512/32) = 1536 bytes!
Of course you need some data to construct small and big meta tiles, but reusing all kinds of tiles across levels also helps to reduce the memory footprint.</p>
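
<p>The two-stage lookup can be sketched in Python as follows. The table layouts are assumptions for the sake of the example: every meta tile is modeled as a tuple of four child indices in the order top-left, top-right, bottom-left, bottom-right, and the names <code class="language-plaintext highlighter-rouge">expand_level</code>, <code class="language-plaintext highlighter-rouge">meta2x2</code>, and <code class="language-plaintext highlighter-rouge">meta4x4</code> are hypothetical. The actual layout in the ROM may differ.</p>

```python
def expand_level(level_map, meta4x4, meta2x2):
    """Expand a grid of 4x4 meta-tile indices into a grid of 8x8 tile indices.

    Assumed layout: each meta tile is a tuple of four child indices ordered
    top-left, top-right, bottom-left, bottom-right.
    """
    rows = len(level_map) * 4
    cols = len(level_map[0]) * 4
    tiles = [[0] * cols for _ in range(rows)]
    for my, map_row in enumerate(level_map):
        for mx, big_index in enumerate(map_row):
            for q, small_index in enumerate(meta4x4[big_index]):
                qy, qx = divmod(q, 2)      # quadrant inside the 4x4 meta tile
                for t, tile_index in enumerate(meta2x2[small_index]):
                    ty, tx = divmod(t, 2)  # position inside the 2x2 meta tile
                    y = my * 4 + qy * 2 + ty
                    x = mx * 4 + qx * 2 + tx
                    tiles[y][x] = tile_index
    return tiles


# Tiny made-up example: two 2x2 meta tiles, one 4x4 meta tile, a 1x2 level.
meta2x2 = [(0, 1, 2, 3), (4, 4, 4, 4)]
meta4x4 = [(0, 1, 1, 0)]
grid = expand_level([[0, 0]], meta4x4, meta2x2)
for row in grid:
    print(row)

# Size of the Level 1 index array when using 32x32-pixel meta tiles.
level1_indices = (3072 // 32) * (512 // 32)
print(level1_indices)
```

<p>In this toy example, a level map of just two bytes expands to 32 tile indices; the same mechanism is what lets the first level get by with 1536 bytes of indices.</p>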

<p>If you want to extract the maps yourself, feel free to use the <a href="https://github.com/not-chciken/jungle-book-gb-disassembly/blob/master/utils/level_renderer.py">Python script</a> I wrote.</p>

<h2 id="bugs-and-glitches">Bugs And Glitches</h2>
<p>During the reverse-engineering process as well as my speedrunning attempts, I came across curious design choices or even bugs and glitches.
Here are my findings.</p>

<h3 id="weapon-damage-glitch">Weapon Damage Glitch</h3>
<p>Once a projectile hits an enemy, the game calculates the damage an enemy receives with the following code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
ld a, [WeaponActive]            ; Glitch: The active weapon is not necessarily the weapon that was shot! The damage calculation is broken!
add a                           ; a = 2 * a
jr nz, .NonDefaultBanana
ld a, DAMAGE_BANANA             ; a = 2
.NonDefaultBanana:
inc a                           ; a += 1
ld d, a
ld a, [DifficultyMode]          ; normal = 0, practice = 1
or a
jr z, .NormalMode
sla d                           ; Projectiles deal 2x damage in practice mode.
...
</code></pre></div></div>
<p>First, the game loads the actively selected weapon, which is encoded as follows:
banana (0), double banana (1), boomerang (2), stones (3).
This value is then multiplied by 2, except for the default banana, whose value is simply set to 2.
Then the value is increased by one, which yields a damage of 3 for both banana types, 5 for the boomerang, and 7 for the stones.
Finally, the damage is multiplied by 2 if the game is played in practice mode.</p>

<p>This implementation is somewhat glitchy, because the projectile hitting an enemy does not necessarily belong to the active weapon!
So, if you change the weapon using SELECT while a projectile is flying, the flying projectile inherits the damage of the newly active weapon.
For instance, if you fire a double banana and quickly switch to stones, the damage of the bananas is based on the stone damage, allowing you to deal 2x14=28 damage in a single shot (practice mode assumed).
Note that switching to a weapon requires at least one projectile of that kind.</p>
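
<p>The assembly snippet can be mirrored in a few lines of Python to tabulate the damage values. This is a model of the calculation shown above, not code from the repository; the function name <code class="language-plaintext highlighter-rouge">projectile_damage</code> is made up, and the glitch example assumes, as the 2x14 figure suggests, that a double banana shot results in two hits.</p>

```python
BANANA, DOUBLE_BANANA, BOOMERANG, STONES = range(4)
DAMAGE_BANANA = 2


def projectile_damage(weapon_active, practice_mode):
    """Mirror the assembly: the damage depends only on the *active* weapon."""
    a = 2 * weapon_active  # add a
    if a == 0:             # jr nz, .NonDefaultBanana is not taken
        a = DAMAGE_BANANA  # ld a, DAMAGE_BANANA
    a += 1                 # inc a
    if practice_mode:
        a *= 2             # sla d
    return a


names = ["banana", "double banana", "boomerang", "stones"]
for weapon, name in enumerate(names):
    print(name, projectile_damage(weapon, False), projectile_damage(weapon, True))

# The glitch: fire a double banana, then switch to stones before the
# projectiles hit. Both hits are evaluated with the stones value.
glitched_total = 2 * projectile_damage(STONES, practice_mode=True)
print(glitched_total)
```

<p>A fix would have to store the weapon type with each projectile when it is fired, instead of reading <code class="language-plaintext highlighter-rouge">WeaponActive</code> at the time of the hit.</p>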

<h3 id="teleport-glitch">Teleport Glitch</h3>
<p>When using portals to teleport, the player’s position seems to be changed immediately, while the view of the window follows an animation.
During the animation, the player cannot move unless there is a liana directly under the targeted portal.
This liana can be grabbed by pressing down during the animation.
Such a scenario can be found in Level 6 (TREE VILLAGE) and allows you to move forward while the animation is still playing.
If you go far enough, Mowgli can be placed out of bounds, putting the game into a glitchy state.</p>

<h3 id="enemy-point-glitch">Enemy Point Glitch</h3>
<p>Hitting an enemy with a projectile grants you 50 points and subtracts the projectile’s damage from the enemy’s health.
However, when you get too far away from an enemy, the game unloads the enemy from RAM.
During this process, the decreased health is not stored!
Hence, when you re-enter the spawning zone of the enemy, it respawns with full health, allowing you to hit it again and collect more points.</p>
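
<p>The mechanism can be illustrated with a toy model in Python. Everything here is hypothetical (the class, the health value, and the damage per hit); only the 50 points per hit and the load/unload behavior are taken from the description above.</p>

```python
POINTS_PER_HIT = 50  # from the description above


class Enemy:
    """Toy model: health lives only in the loaded instance, not in the spawn data."""

    def __init__(self, full_health):
        self.full_health = full_health
        self.health = None  # None means "not loaded"

    def spawn(self):
        self.health = self.full_health  # always restored from the spawn data

    def unload(self):
        self.health = None  # the reduced health is thrown away

    def hit(self, damage):
        self.health -= damage
        return POINTS_PER_HIT


enemy = Enemy(full_health=10)  # hypothetical health value
score = 0
for _ in range(3):  # approach, hit once, walk away, repeat
    enemy.spawn()
    score += enemy.hit(3)
    enemy.unload()
print(score)
```

<p>Because the enemy never stays loaded long enough to be defeated, the loop can in principle be repeated for as long as the level timer allows.</p>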

<h2 id="conclusion">Conclusion</h2>
<p>Thanks for reading this post :)
Please <a href="/about">write me a mail</a> if you have any corrections, additions, or simply an interesting story about the game.</p>]]></content><author><name></name></author><category term="Gaming" /><summary type="html"><![CDATA[Now to a project into which I invested way too much time: A Complete Guide for the Game Boy’s “The Jungle Book” game from 1994. By complete I mean two things.]]></summary></entry><entry><title type="html">The Optimal Quantum of Temporal Decoupling</title><link href="https://www.chciken.com/simulation/2023/11/14/the-optimal-quantum.html" rel="alternate" type="text/html" title="The Optimal Quantum of Temporal Decoupling" /><published>2023-11-14T16:25:44+00:00</published><updated>2023-11-14T16:25:44+00:00</updated><id>https://www.chciken.com/simulation/2023/11/14/the-optimal-quantum</id><content type="html" xml:base="https://www.chciken.com/simulation/2023/11/14/the-optimal-quantum.html"><![CDATA[<script>
  window.MathJax = {
  loader: {load: ['[tex]/ams']},
  tex: {
    packages: {'[+]': ['ams']},
    tags: 'ams',
    inlineMath: [['$', '$']]
  }
};
</script>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }
</style>

<script id="MathJax-script" async="" src="/assets/common/mathjax/mathjax-3.2.2.js"></script>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#1-introduction">1. Introduction</a></li>
  <li><a href="#2-what-is-temporal-decoupling">2. What is Temporal Decoupling?</a></li>
  <li><a href="#3-the-story">3. The Story</a></li>
  <li><a href="#4-analytical-models">4. Analytical Models</a>
    <ul>
      <li><a href="#41-a-speedup-model">4.1 A Speedup Model</a></li>
      <li><a href="#42-an-accuracy-model">4.2 An Accuracy Model</a></li>
    </ul>
  </li>
  <li><a href="#5-practical-assesment">5. Practical Assessment</a>
      <ul>
        <li><a href="#51-speedupaccuracy-experiments">5.1 Speedup/Accuracy Experiments</a></li>
        <li><a href="#52-qualitative-accuracy">5.2 Qualitative Accuracy</a></li>
      </ul>
  </li>
  <li><a href="#6-conclusion">6. Conclusion</a></li>
  <li><a href="#7-related-work">7. Related Work</a></li>
  <li><a href="#8-references">8. References</a></li>
  </ul>
</div>

<h2 id="1-introduction">1. Introduction</h2>

<p>This post is an extended and completely reworked version of our paper “The Optimal Quantum of Temporal Decoupling”,
which I presented at the <em>29th Asia and South Pacific Design Automation Conference 2024</em>.
The preprint version of the paper can be downloaded <a href="/assets/optimal_quantum/the-optimal-quantum-preprint.pdf">here 🗎</a>.
A big “thank you” goes to <a href="https://github.com/RubBra">Ruben</a> for doing the hard work behind this paper.</p>

<p>The idea of this work is to shed more light on <em>Temporal Decoupling</em> (<abbr title="Temporal Decoupling">TD</abbr>) in <em>Electronic System Level</em> (<abbr title="Electronic System Level">ESL</abbr>) simulations.
More specifically, we embarked on the quest to find and understand the <em>optimal quantum</em>.
In contrast to the paper, this post focuses more on SystemC-based examples.
Hence, some basic knowledge of SystemC is required to understand the rest of this post.
For everything else, including temporal decoupling itself, we provide a gentle introduction.
This directly leads us to the first question:</p>

<h2 id="2-what-is-temporal-decoupling">2. What is Temporal Decoupling?</h2>
<p>Temporal Decoupling (<abbr title="Temporal Decoupling">TD</abbr>) is a modeling style that aims at speeding up (SystemC) simulations.
The principles behind <abbr title="Temporal Decoupling">TD</abbr> are best explained with a minimal example.</p>

<p>Let’s suppose we want to model a very simple <abbr title="System on a Chip">SoC</abbr> comprising 2 CPUs.
In terms of SystemC/C++, the system might look like this (download the cpp file <a href="/assets/optimal_quantum/simple.cpp">here</a>):</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">"systemc.h"</span><span class="cp">
</span>
<span class="k">struct</span> <span class="nc">Cpu</span> <span class="o">:</span> <span class="k">public</span> <span class="n">sc_module</span> <span class="p">{</span>
  <span class="n">SC_HAS_PROCESS</span><span class="p">(</span><span class="n">Cpu</span><span class="p">);</span>

  <span class="kt">void</span> <span class="kr">thread</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
      <span class="c1">// Do stuff...</span>
      <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">name</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">": "</span> <span class="o">&lt;&lt;</span> <span class="n">sc_time_stamp</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
      <span class="n">wait</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SC_NS</span><span class="p">);</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="n">Cpu</span><span class="p">(</span><span class="n">sc_module_name</span> <span class="n">name</span><span class="p">)</span> <span class="o">:</span> <span class="n">sc_module</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">SC_THREAD</span><span class="p">(</span><span class="kr">thread</span><span class="p">);</span>
  <span class="p">}</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="nc">Soc</span> <span class="o">:</span> <span class="k">public</span> <span class="n">sc_module</span> <span class="p">{</span>
  <span class="n">SC_HAS_PROCESS</span><span class="p">(</span><span class="n">Soc</span><span class="p">);</span>
  <span class="n">Cpu</span> <span class="n">cpu0</span><span class="p">,</span> <span class="n">cpu1</span><span class="p">;</span>

  <span class="n">Soc</span><span class="p">(</span><span class="n">sc_module_name</span> <span class="n">name</span><span class="p">)</span> <span class="o">:</span> <span class="n">sc_module</span><span class="p">(</span><span class="n">name</span><span class="p">),</span> <span class="n">cpu0</span><span class="p">(</span><span class="s">"cpu0"</span><span class="p">),</span> <span class="n">cpu1</span><span class="p">(</span><span class="s">"cpu1"</span><span class="p">)</span> <span class="p">{</span>
  <span class="p">}</span>
<span class="p">};</span>

<span class="kt">int</span> <span class="nf">sc_main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
  <span class="n">Soc</span> <span class="n">soc</span><span class="p">(</span><span class="s">"soc"</span><span class="p">);</span>
  <span class="n">sc_start</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">SC_NS</span><span class="p">);</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As you can see, the two CPUs are repeatedly calling <code class="language-plaintext highlighter-rouge">wait</code> with a nanosecond delay in their thread, resulting in an effective clock speed of 1 GHz.
Usually, the “Do stuff…” part executes the current instruction of the CPU,
but for the sake of simplicity this is not modeled.
Thus, the example exhibits a typical SystemC loosely-timed (LT) style, in which each instruction executes in one cycle.
To see everything in action, execute the program above to get the following output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>soc.cpu0: 0 s
soc.cpu1: 0 s
soc.cpu0: 1 ns
soc.cpu1: 1 ns
soc.cpu0: 2 ns
soc.cpu1: 2 ns
[...]
</code></pre></div></div>
<p>The output also reveals that the SystemC kernel first executes the cycle of “cpu0”, while then executing the cycle of “cpu1”.
While there’s actually nothing wrong with this kind of modeling, the performance of the simulation might be somewhat disappointing.
Using this very simple example from above, I achieve at most 12 <abbr title="Millions Instructions Per Second">MIPS</abbr> on my Intel i5-8265U (click <a href="/assets/optimal_quantum/simple_bm.cpp">here</a> for a benchmark version).
For sure, it’s not the latest and greatest CPU, but 12 <abbr title="Millions Instructions Per Second">MIPS</abbr> is nothing!
Especially, if you consider that the program doesn’t even do anything.
With other simulators, such as QEMU, I can easily crack 1000 <abbr title="Millions Instructions Per Second">MIPS</abbr>. <br />
I know, it’s a bold comparison, but I’ve heard people preferring QEMU-based simulations over SystemC-based simulations because “SystemC is so slow”.<br />
And that leads us to a very important question: Why is SystemC “so slow”?</p>

<p>Well, SystemC per se is not slow.
In the given example, it’s rather the frequent use of <code class="language-plaintext highlighter-rouge">wait</code> that cripples the simulation’s performance,
because whenever <code class="language-plaintext highlighter-rouge">wait</code> is called, the SystemC kernel switches to the context of the other <code class="language-plaintext highlighter-rouge">SC_THREAD</code>.
While <code class="language-plaintext highlighter-rouge">wait</code> enables some kind of coroutine semantics, each of these context switches comes at a relatively high price in terms of performance.</p>

<p>And this is where the idea of <em>Temporal Decoupling</em> (<abbr title="Temporal Decoupling">TD</abbr>) begins.
Instead of doing a context switch for each cycle, we just let a CPU run for multiple cycles before switching to the other thread. <br />
In other words: one CPU can run ahead of time, temporally decoupling it from the rest of the system.
Again, concepts are best explained by examples, so let’s look at the initial code, but now incorporating <abbr title="Temporal Decoupling">TD</abbr>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nc">Cpu</span> <span class="o">:</span> <span class="k">public</span> <span class="n">sc_module</span> <span class="p">{</span>
  <span class="n">SC_HAS_PROCESS</span><span class="p">(</span><span class="n">Cpu</span><span class="p">);</span>
  <span class="n">tlm_utils</span><span class="o">::</span><span class="n">tlm_quantumkeeper</span> <span class="n">qk</span><span class="p">;</span>

  <span class="kt">void</span> <span class="kr">thread</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">if</span> <span class="p">(</span><span class="n">qk</span><span class="p">.</span><span class="n">need_sync</span><span class="p">())</span>
        <span class="n">qk</span><span class="p">.</span><span class="n">sync</span><span class="p">();</span>
      <span class="c1">// Do stuff..</span>
      <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">name</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="s">" current time:"</span> <span class="o">&lt;&lt;</span> <span class="n">qk</span><span class="p">.</span><span class="n">get_current_time</span><span class="p">()</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
      <span class="n">qk</span><span class="p">.</span><span class="n">inc</span><span class="p">(</span><span class="n">sc_time</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">SC_NS</span><span class="p">));</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="n">Cpu</span><span class="p">(</span><span class="n">sc_module_name</span> <span class="n">name</span><span class="p">)</span> <span class="o">:</span> <span class="n">sc_module</span><span class="p">(</span><span class="n">name</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">SC_THREAD</span><span class="p">(</span><span class="kr">thread</span><span class="p">);</span>
    <span class="n">qk</span><span class="p">.</span><span class="n">reset</span><span class="p">();</span>
  <span class="p">}</span>
<span class="p">};</span>


<span class="k">struct</span> <span class="nc">Soc</span> <span class="o">:</span> <span class="k">public</span> <span class="n">sc_module</span> <span class="p">{</span>
  <span class="n">SC_HAS_PROCESS</span><span class="p">(</span><span class="n">Soc</span><span class="p">);</span>
  <span class="n">Cpu</span> <span class="n">cpu0</span><span class="p">,</span> <span class="n">cpu1</span><span class="p">;</span>

  <span class="n">Soc</span><span class="p">(</span><span class="n">sc_module_name</span> <span class="n">name</span><span class="p">)</span> <span class="o">:</span> <span class="n">sc_module</span><span class="p">(</span><span class="n">name</span><span class="p">),</span> <span class="n">cpu0</span><span class="p">(</span><span class="s">"cpu0"</span><span class="p">),</span> <span class="n">cpu1</span><span class="p">(</span><span class="s">"cpu1"</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">tlm_utils</span><span class="o">::</span><span class="n">tlm_quantumkeeper</span><span class="o">::</span><span class="n">set_global_quantum</span><span class="p">(</span><span class="n">sc_time</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SC_NS</span><span class="p">));</span>
  <span class="p">}</span>
<span class="p">};</span>

<span class="kt">int</span> <span class="nf">sc_main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">*</span> <span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
  <span class="n">Soc</span> <span class="n">soc</span><span class="p">(</span><span class="s">"soc"</span><span class="p">);</span>
  <span class="n">sc_start</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="n">SC_NS</span><span class="p">);</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here, a few new things are introduced.
First, there is:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tlm_utils</span><span class="o">::</span><span class="n">tlm_quantumkeeper</span><span class="o">::</span><span class="n">set_global_quantum</span><span class="p">(</span><span class="n">sc_time</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="n">SC_NS</span><span class="p">));</span>
</code></pre></div></div>
<p>This static function sets the so-called <em>quantum</em>.
The quantum is simply the maximum time a thread can run ahead of the rest of the system.
So, in the given example, a quantum of 2 nanoseconds allows the thread to simulate 2 cycles before switching to another thread.
In the CPU threads, you now also find:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">qk</span><span class="p">.</span><span class="n">need_sync</span><span class="p">())</span>
    <span class="n">qk</span><span class="p">.</span><span class="n">sync</span><span class="p">();</span>
</code></pre></div></div>
<p>This simply checks if the thread has exhausted its quantum, and if so, syncs up with the rest of the system.
To advance the time, you don’t call <code class="language-plaintext highlighter-rouge">wait</code> anymore but <code class="language-plaintext highlighter-rouge">qk.inc(sc_time(1, SC_NS))</code>.</p>
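<p>To make the bookkeeping behind these three calls more tangible, here is a minimal sketch in plain C++ without SystemC.
<code class="language-plaintext highlighter-rouge">ToyQuantumKeeper</code> and <code class="language-plaintext highlighter-rouge">count_syncs</code> are hypothetical names made up for this illustration; they only mimic the idea of a local time offset and are not the actual <code class="language-plaintext highlighter-rouge">tlm_utils::tlm_quantumkeeper</code> implementation.</p>

```cpp
#include <cassert>
#include <cstdint>

// Toy stand-in for a quantum keeper (hypothetical, NOT the tlm_utils class).
// All times are plain integers in nanoseconds.
struct ToyQuantumKeeper {
  std::uint64_t quantum_ns;          // maximum allowed run-ahead
  std::uint64_t global_time_ns = 0;  // time known to the rest of the system
  std::uint64_t local_time_ns = 0;   // offset this thread has run ahead
  std::uint64_t syncs = 0;           // number of synchronizations so far

  explicit ToyQuantumKeeper(std::uint64_t quantum) : quantum_ns(quantum) {
    assert(quantum > 0);
  }

  // Advance only the local offset; no context switch happens here.
  void inc(std::uint64_t delta_ns) { local_time_ns += delta_ns; }

  // True once the local offset has used up the whole quantum.
  bool need_sync() const { return local_time_ns >= quantum_ns; }

  // Fold the local offset into the global time. In SystemC, this is the
  // point where the thread would call wait() and yield to the kernel.
  void sync() {
    global_time_ns += local_time_ns;
    local_time_ns = 0;
    ++syncs;
  }
};

// Runs the same loop as the Cpu thread for a fixed number of 1 ns cycles
// and returns how often the thread had to synchronize.
std::uint64_t count_syncs(std::uint64_t cycles, std::uint64_t quantum_ns) {
  ToyQuantumKeeper qk(quantum_ns);
  for (std::uint64_t i = 0; i < cycles; ++i) {
    if (qk.need_sync())
      qk.sync();
    // Do stuff..
    qk.inc(1);
  }
  return qk.syncs;
}
```

<p>For 1000 cycles, <code class="language-plaintext highlighter-rouge">count_syncs</code> returns 999, 499, and 9 for quanta of 1 ns, 2 ns, and 100 ns, so the number of synchronization points shrinks roughly in proportion to the quantum.</p>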

<p>Ultimately, the <abbr title="Temporal Decoupling">TD</abbr> simulation generates the following output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>soc.cpu1 current time:0 s
soc.cpu1 current time:1 ns
soc.cpu0 current time:0 s
soc.cpu0 current time:1 ns
soc.cpu1 current time:2 ns
soc.cpu1 current time:3 ns
soc.cpu0 current time:2 ns
soc.cpu0 current time:3 ns
...
</code></pre></div></div>
<p>As you can see, we now managed to cut the number of context switches in half with a quantum of 2 ns.
Using even higher quanta like 100 ns, the simulation speed could be increased to 120 <abbr title="Millions Instructions Per Second">MIPS</abbr> on my computer!<br />
<strong>That means, the SystemC simulation is now 10x faster than without <abbr title="Temporal Decoupling">TD</abbr>!</strong> <br />
This observation is in line with the SystemC language reference manual <a class="citation" href="#systemcStandard">[1]</a>,
which also describes a potential speedup of up to 10x when using <abbr title="Temporal Decoupling">TD</abbr>.
Ez pz, problem solved… you may think.</p>

<p>Well, as so often in life, there’s no free lunch, and unfortunately, this also applies to <abbr title="Temporal Decoupling">TD</abbr>.
Since some threads might advance into the future, we are changing the semantics of the simulation.
This opens the door to a whole new universe of things that may go wrong and impact the functionality/accuracy of simulations.
So, finding an “optimal” quantum that yields the best compromise between performance and accuracy is one of the key challenges when using <abbr title="Temporal Decoupling">TD</abbr>.
And that is where the story of this post begins!</p>

<h2 id="3-the-story">3. The Story</h2>
<p>As part of an industry project, my institute developed a faster version of the simulator gem5.
We managed to speed up gem5 by more than 20x by employing some kind of parallel temporal decoupling.
It’s basically the same principle as above, but instead of simulating the quanta one after another, we are doing everything in parallel.
After a few months of development, we eventually shipped the first version of the simulator to our industry partner.</p>

<p>Much to our surprise, they said it didn’t work.
So, we had a joint debug session and eventually figured out the reason: the quantum was set to 1 second.
That’s a completely absurd value.
It’s like ordering water in a restaurant and suddenly the waiter starts to flood the restaurant.<br />
In order to have a working simulation, you need quanta like 1µs or 10µs, not 1s.</p>

<p>But I guess it was my fault, as I had told them to increase the quantum if they wanted more performance.
I mean, that’s not wrong, but I should also have told them that an increased quantum may impact accuracy or even functionality.
Moreover, I could have just provided some example values.</p>

<p>So I thought, maybe there’s some literature that could explain the relation between quantum and accuracy in more detail.
At that point, even we had little understanding and just chose our quanta by observation.
Or in other words: the simulation is fast and doesn’t crash? That’s a good quantum.
Well, every work I found provided the same fuzzy explanation and used the same empirical methods that we employed ourselves.
To give you some examples:</p>

<hr />
<p><br />
J. Engblom <a class="citation" href="#engblom2018">[2]</a>:
“Time quantum lengths of 10k to 1M cycles are needed to maximize <abbr title="Virtual Platform">VP</abbr> performance.
Most of the time, software functionality and correctness are unaffected by <abbr title="Temporal Decoupling">TD</abbr>, and <strong>the default should be to use long time quanta</strong>.”</p>

<p>Ryckbosch et al. <a class="citation" href="#ryckbosch2012">[3]</a>:
“We set the simulation window to 10ms and the simulation quantum to 100ms in all of our experiments.
We experimentally evaluated different values for the simulation window and quantum, and we found the above values to be effective.”</p>

<p>J. Joy <a class="citation" href="#joy2020">[4]</a>:
“Increasing the quantum can cause a thread to run for a longer time, thus reducing the context switching overhead.
<strong>This increases the simulation speed, but at the cost of accuracy.</strong>”</p>

<p>Jünger et al. <a class="citation" href="#juenger2021">[5]</a>:
“To increase performance, <strong>the quantum should be as large as possible</strong> to reduce context switching.
<strong>However, a large quantum reduces simulation accuracy</strong>, as events may be handled too late. Therefore, deploying <abbr title="Temporal Decoupling">TD</abbr> is not trivial.”</p>

<hr />
<p><br />
Apparently, they all paint the same picture of more quantum, more speed, but less accuracy:</p>

<p><br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/performance_accuracy_mental_model.svg" alt="Performance vs. Accuracy Mental Model" width="60%" />
</div>
<p><br /></p>

<p>However, a quantitative relation is missing in all of the mentioned works.
Sure, some of the works provide speedup/quantum graphs, but they stick to observations rather than explanations.
Fortunately, for me as a PhD student, these kinds of unresolved mysteries offer the perfect opportunity to write a paper.
So, in the next few subsections, I’ll try to bring some light into the darkness by using analytical models to describe speedup and accuracy.</p>

<h2 id="4-analytical-models">4. Analytical Models</h2>
<p>Analytical models are a popular approach in computer science/engineering to describe complex systems by simple mathematical means.
Some famous examples include:
<a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s law</a> <a class="citation" href="#amdahl1967">[6]</a>,
<a href="https://en.wikipedia.org/wiki/Gustafson%27s_law">Gustafson’s law</a> <a class="citation" href="#gustafson1998">[7]</a>,
or the <a href="https://en.wikipedia.org/wiki/Roofline_model">Roofline model</a> <a class="citation" href="#williams2009">[8]</a>.
Often the goal is not to describe something 100% accurately, but to find a parsimonious yet evocative model.
Or in the words of George Box: <a href="https://en.wikipedia.org/wiki/All_models_are_wrong">“All models are wrong, but some are useful”</a>.
With a similar thought in mind, the following subsections introduce analytical models for performance and accuracy prediction in
temporally-decoupled simulations.</p>

<h3 id="41-a-speedup-model">4.1 A Speedup Model</h3>
<p>In this subsection, a speedup model for <abbr title="Temporal Decoupling">TD</abbr> simulations is introduced.
As already mentioned before, the speedup of a <abbr title="Temporal Decoupling">TD</abbr> simulation is attained by reducing the number of the simulator’s context switches.
Thus, for an ideal simulation without any context switches,
the execution time ($T_{ideal}$) is simply given by the sum of the execution times $T_i$ of all $K$ simulation segments:
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/timing_ideal.svg" alt="Timing of an ideal SystemC simulation without context-switching costs" width="60%" />
</div>
<p><br />
Or in mathematical terms:
\begin{equation} \label{eq:6}
T_{ideal} = \sum_{i=1}^{K} T_ {i}
\end{equation}</p>

<p>Practically, there are context switches (CS) between the individual simulation segments, leading to a prolonged execution time $T_{real}$:
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/timing_td.svg" alt="Timing of a SystemC simulation with context-switching costs" width="60%" />
</div>
<p><br />
This can be modelled by an abstract, relative overhead $O_c$:
\begin{equation} \label{eq:7}
  T_{real} = T_{ideal} \cdot (1 + O_c)
\end{equation}</p>

<p>This overhead is almost inversely proportional to the chosen quantum ($t_{\Delta q}$).
Because if we double the quantum, we almost halve the number of context switches.
Note it’s “almost” because of the process at the end, which doesn’t really have a context switch.
Since most real-world simulations have way more than just a handful of context switches, this last missing context switch is negligible.
We’re also assuming that the quantum is larger than the average event distance.
For example, using quanta below 1 ns for a CPU system with a 1 ns clock cycle wouldn’t result in any changes.
But again, for most real-world scenarios this assumption should hold.</p>

<p>Using an inverse relation between quantum and overhead, the resulting formula is:</p>

<p>\begin{equation} \label{eq:8}
  T_{real} = T_{ideal} \cdot \left(1 + \frac{O_c'}{t_{\Delta q}} \right)
\end{equation}</p>

<p>Now we are left with an overhead factor $O_c'$.
This factor can be determined by curve fitting or running two reference simulations.
For the latter the formula is:</p>

<p>\begin{equation} \label{eq:9}
  \begin{split}
  \frac{T_{real}(t_{\Delta q_1})}{T_{real}(t_{\Delta q_2})} = \frac{1 + \frac{O_c'}{t_{\Delta q_1}}}{1 + \frac{O_c'}{t_{\Delta q_2}}}
  \Rightarrow O_c' = \frac{T_{real}(t_{\Delta q_1}) - T_{real}(t_{\Delta q_2})}{\frac{T_{real}(t_{\Delta q_2})}{t_{\Delta q_1}} - \frac{T_{real}(t_{\Delta q_1})}{t_{\Delta q_2}} }
  \end{split}
\end{equation}</p>

<p>To accurately determine the factor $O_c'$, we recommend choosing low quanta, for which the context switching time is a significant fraction of the total simulation time.
This overhead factor also has a tangible meaning.
For example, a factor $O_c' = 15\,\mathrm{ns}$ implies that at a quantum of 15 ns half of the execution time is spent in context switching.</p>

<p>Ultimately, the speedup can be formulated as:
\begin{equation} \label{eq:10}
  S(t_{\Delta q}) = \frac{T_{ideal}}{T_{real}} = \frac{t_{\Delta q}}{t_{\Delta q} + O_c'}
\end{equation}</p>

<p>Note that this equation always yields values smaller than 1.
We chose this design for several reasons.
First, the goal of <abbr title="Temporal Decoupling">TD</abbr> is to reduce the number of context switches, which are just a performance-degrading environmental effect.
Hence, <abbr title="Temporal Decoupling">TD</abbr> doesn’t really make simulations faster, but it prevents them from being slowed down.<br />
Second, with this representation, it is very easy to see, how close you are to the theoretical optimum.
For example, if the speedup is already at 0.99, increasing the quantum will not yield any significant performance increase.</p>
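<p>The two-point formula for the overhead factor and the speedup equation above can be sketched as two small helper functions.
The names <code class="language-plaintext highlighter-rouge">overhead_factor</code> and <code class="language-plaintext highlighter-rouge">speedup</code> are hypothetical and chosen only for this illustration; the measured execution times can be in any unit, as long as both use the same one.</p>

```cpp
#include <cassert>

// Overhead factor O_c' from two reference runs (two-point formula).
// q1_ns, q2_ns: the two quanta in ns; t1, t2: measured host execution times.
double overhead_factor(double q1_ns, double t1, double q2_ns, double t2) {
  assert(q1_ns > 0.0 && q2_ns > 0.0 && q1_ns != q2_ns);
  const double denominator = t2 / q1_ns - t1 / q2_ns;
  assert(denominator != 0.0);
  return (t1 - t2) / denominator;
}

// Relative speedup S = t_q / (t_q + O_c'), which always stays below 1.
double speedup(double quantum_ns, double overhead_ns) {
  assert(quantum_ns > 0.0 && overhead_ns >= 0.0);
  return quantum_ns / (quantum_ns + overhead_ns);
}
```

<p>As a synthetic check with made-up numbers: for $T_{ideal} = 1$ and an overhead factor of 15 ns, the model predicts execution times of 16 at a quantum of 1 ns and 4 at a quantum of 5 ns.
Feeding these two points into <code class="language-plaintext highlighter-rouge">overhead_factor(1.0, 16.0, 5.0, 4.0)</code> recovers approximately 15 ns, and <code class="language-plaintext highlighter-rouge">speedup(15.0, 15.0)</code> yields 0.5.</p>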

<p>To give a first visual impression of the model, I ran an experiment with the system from the <a href="#2-what-is-temporal-decoupling">2. What is Temporal Decoupling?</a> section.
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/performance_simple.svg" alt="Analytical vs. measured performance for a very simple SystemC simulation" width="85%" />
</div>
<p><br /></p>

<p>In the given graph, the model’s prediction is depicted in orange, while the measurement is represented by the blue line.
Using the formula approach, an overhead factor of $O_c' = 10.95\,\mathrm{ns}$ was determined.
If you want to conduct this experiment on your own, feel free to use the <a href="/assets/optimal_quantum/td_simple_bm.cpp">benchmark</a>
and the <a href="/assets/optimal_quantum/td_simple_bm.py">corresponding Python script</a> for the graph.
More extensive experiments are presented in Section <a href="#51-speedupaccuracy-experiments">5.1 Speedup/Accuracy Experiments</a>.
In the next subsection, the second important aspect of <abbr title="Temporal Decoupling">TD</abbr> is discussed: accuracy.</p>

<h3 id="42-an-accuracy-model">4.2 An Accuracy Model</h3>
<p>While the aspect of speedup was very clearly defined, the term “accuracy” (or “inaccuracy”) can be understood in multiple ways.
First of all, “accuracy” can be categorized into qualitative and quantitative aspects.</p>

<p>Qualitative inaccuracy includes all effects that can hardly be expressed as a metric and lead to changed simulation semantics.
For example, if <abbr title="Temporal Decoupling">TD</abbr> leads to the crash of a program, you observed qualitative inaccuracy.</p>

<p>Quantitative accuracy, on the other hand, is something that can be meaningfully captured in numbers.
For example, it can be the accuracy of interrupt timings, cache hit rates, memory bandwidth, simulation time, etc.
Since some simulations offer hundreds of simulation statistics, the question arises of which one to pick.
For our model and experiments, we only chose the target simulation time as a representative measure of accuracy.
This statistic is present in all SystemC simulations and it may capture the influence of various other factors.
Ultimately, a simulation user must individually consider which particular simulation statistics are relevant.</p>

<p>As before, we tried to develop an analytical model to predict and understand accuracy.
Of course this model is limited to quantitative accuracy, because the point of qualitative accuracy is its non-numerical nature.
We’re also only modeling target simulation time for the aforementioned reasons.
So, the first step in the model design was to think about which situations in <abbr title="Temporal Decoupling">TD</abbr> could lead to a changed target simulation time.
Well, there are actually a few situations with different outcomes, but we thought that the most prevalent one is <em>process communication</em>.
In practice, this covers cases like two target CPUs communicating over shared memory.
Let’s stick to this example and take a look at the following visualization:</p>

<p><br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/temporal_decoupling_communication.svg" alt="Process to process communication in temporally-decoupled simulations." width="70%" />
</div>
<p><br /></p>

<p>In the given example, Process 2 wants to send a message to Process 1.
For the bidirectional case, Process 2 also expects a response from Process 1.
The numbers in the white circles indicate the order in which the processes were executed as this
leads to different outcomes.
The example also assumes that Process 2 starts with the communication in the middle of its quantum.
Let’s dissect the individual cases one by one to get a better understanding.</p>

<p>For unidirectional communication, there are 2 subcases:
Process 2 gets executed first, leading to Process 1 receiving the message $t_{\Delta q}/2$ earlier
compared to a non-<abbr title="Temporal Decoupling">TD</abbr> simulation.
In the vice versa case, the message is received later by $t_{\Delta q}/2$.
If both cases are assumed to be equally likely, there should be no change in target simulation time on average.
One may argue about the different semantic impacts of receiving data later or earlier, but let’s not overcomplicate things and head to the next case.</p>

<p>For bidirectional communication, there are 3 different subcases:<br />
Case 1: Process 2 first, then Process 1, leads to a delay of $t_{\Delta q}/2$.<br />
Case 2: Process 1 first, Process 2 second and third, Process 1 fourth, leads to a delay of $3t_{\Delta q}/2$.<br />
Case 3: Process 1 first, Process 2 second, Process 1 third, Process 2 fourth, leads to a delay of $t_{\Delta q}/2$.<br />
As you can see, all cases lead to a prolonged communication, which ultimately may lead to a prolonged
target simulation time if the communication is on the program’s critical path.
We can also see that this extended time depends linearly on the quantum.
So far, the model assumed a communication in the middle of a quantum, which may be a little bit too simple.
To make it more accurate, we modeled communications as randomly occurring events, leading us to the Poisson distribution: the time $X$ until the next communication is then exponentially distributed with rate $r$.
The average incurred prolongation time per quantum (Case 1 and Case 3) can then be calculated as follows:</p>

<p>\begin{equation} \label{eq10}
  \begin{split}
  t_d &amp; = t_{\Delta q} - E(X | X \leq t_{\Delta q}) P(X \leq t_{\Delta q}) - t_{\Delta q} P(X &gt; t_{\Delta q}) \\
      &amp; = t_{\Delta q} - \int_{0}^{t_{\Delta q}} r t e^{-r t} \,dt - \int_{t_{\Delta q}}^{\infty} r t_{\Delta q} e^{-r t} \,dt \\
      &amp; = t_{\Delta q} - \frac{1 - e^{-r t_{\Delta q}}}{r}
  \end{split}
\end{equation}
This results in the relative timing inaccuracy of:
\begin{equation} \label{eq11}
  I = \frac{t_{\Delta q}}{t_{\Delta q} - t_d} - 1 =   \frac{r \cdot t_{\Delta q}}{1 - e^{-r t_{\Delta q}}} - 1 \approx r \cdot t_{\Delta q}
\end{equation}
With $r$ being the rate of cross-scheduled events per time unit.
The result is a hockey stick curve, which can be approximated by a simple linear curve (note that Case 2 yields a similar result).
This linear curve is in stark contrast to the sigmoidal speedup model.
While the attainable speedup eventually saturates, the inaccuracy continues to increase indefinitely.
This underpins why the choice of the optimal quantum is so essential.</p>
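<p>The two regimes of this curve can be made explicit with a short expansion (a sketch that only assumes the inaccuracy formula above, with the shorthand $x = r \cdot t_{\Delta q}$ introduced here for brevity).
For $x \ll 1$, the series $e^{-x} \approx 1 - x + x^2/2$ gives $1 - e^{-x} \approx x (1 - x/2)$, while for $x \gg 1$ the exponential term vanishes:
\begin{equation}
  I(x) = \frac{x}{1 - e^{-x}} - 1 \approx
  \begin{cases}
    x/2 &amp; \text{for } x \ll 1 \\
    x - 1 &amp; \text{for } x \gg 1
  \end{cases}
\end{equation}
Both regimes are linear in the quantum, with a slope between $r/2$ and $r$, which supports approximating the whole curve by a single linear term.</p>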

<p>Without specifying the linear factor in particular, the inaccuracy equation can also be written as:
\begin{equation} \label{eq12}
I = \alpha \cdot t_{\Delta q}
\end{equation}
The factor $\alpha$ can be determined by two reference simulations or curve fitting.</p>
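<p>One possible way to do the curve fitting is a least-squares fit of a line through the origin, sketched below.
The function name <code class="language-plaintext highlighter-rouge">fit_alpha</code> is hypothetical, and the choice of a plain least-squares fit is an assumption of this sketch, not necessarily the method used for the experiments.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Least-squares fit of I = alpha * t_q through the origin.
// quanta[i] is a quantum, inaccuracy[i] the relative inaccuracy measured for it.
double fit_alpha(const std::vector<double>& quanta,
                 const std::vector<double>& inaccuracy) {
  assert(!quanta.empty() && quanta.size() == inaccuracy.size());
  double sum_qi = 0.0;  // sum of t_q * I
  double sum_qq = 0.0;  // sum of t_q^2
  for (std::size_t i = 0; i < quanta.size(); ++i) {
    sum_qi += quanta[i] * inaccuracy[i];
    sum_qq += quanta[i] * quanta[i];
  }
  assert(sum_qq > 0.0);
  return sum_qi / sum_qq;
}
```

<p>With two made-up reference points, an inaccuracy of 0.02 at a quantum of 10 and 0.04 at a quantum of 20, the fit yields $\alpha \approx 0.002$ per time unit.
With only two reference simulations this is the simplest case of the same fit; more points just make it more robust against outliers.</p>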

<h2 id="5-practical-assesment">5. Practical Assessment</h2>

<p>To see whether the model can stand the test of practice, running some simulations is a necessity.
All following simulations were executed on an AMD Ryzen Threadripper 3990X (64 physical cores/128 logical cores) host system.</p>

<h3 id="51-speedupaccuracy-experiments">5.1 Speedup/Accuracy Experiments</h3>
<p>This is currently under construction.
<!-- As a reference implementation for sequential TD simulation, we used the open-source ARMv8 TLM-2.0-based <a class="citation" href="#systemcStandard">[1]</a> \gls{vp} avp64 <a class="citation" href="#armv8vp">[9]</a>
and a RISC-V \gls{vp} based on MachineWare's \emph{SIM-V}~<a class="citation" href="#simvpaper">[10]</a>.
%Its backend deploys QEMU~<a class="citation" href="#bellard2005">[11]</a> allowing for near-native execution speeds.
% To allow comparison with par-gem5, both VPs are executed with a clock speed of 1 Ghz.
To allow comparison with par-gem5, the CPUs of both VPs are clocked at 1 Ghz.
As benchmarks we chose Dhrystone, NPB, STREAM, and an operating system boot.
All benchmarks were executed on a buildroot-configured Linux system.
The results in Fig.~\ref{fig:gem5-quant-speedup-benchmarks} show a similar trend of diminishing speedup returns
and linearly increasing inaccuracy.
Compared to gem5 the speedup begins to saturate at values greater than 10µs.
This can be attributed to the efficient dynamic binary translation backend of the simulators and the relatively huge overhead of context switches, which is also reflected in the factor $O_c'$.
For example, the Dhrystone benchmark attains a value of $O_c'=393ns$, which means the execution of 393 cycles takes just as long as one SystemC context switch.
This example shows why TD has become a staple of \gls{esl} simulations.
Simulated with a quantum of 10µs, Dhrystone needs about 10 minutes of host simulation time.
Without TD, the simulation would require more than 2 days.
In contrast to the speedup, the inaccuracy behaves unpredictably at first and only changes to a linear growth from a quantum of 10µs.
The CPU is particularly responsible for low quanta, which can end its quantum earlier in sequential TD (see Section~\ref{sec:background}).
Thus, an increasing quantum does not necessarily lead to increasing inaccuracy, as can be seen in the example of the \emph{SIM-V} boot process (Fig.~\ref{fig:gem5-quant-speedup-benchmarks} f). --></p>

<h3 id="52-qualitative-accuracy">5.2 Qualitative Accuracy</h3>

<p>Now to one of my favorite subsections: qualitative accuracy.
As already mentioned, this concerns all effects that change the semantics of the simulation and can hardly be captured in numbers.
That means, without <abbr title="Temporal Decoupling">TD</abbr> a simulation did A and with <abbr title="Temporal Decoupling">TD</abbr> it suddenly does B.
To start with a tangible example, take a look at the following Linux boot timestamps that we obtained from default gem5 and our proprietary version with <abbr title="Temporal Decoupling">TD</abbr>:
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/linux_boot_timer.svg" alt="Linux boot timestamps of a default and a TD simulation in gem5" width="90%" />
</div>
<p><br />
In the <abbr title="Temporal Decoupling">TD</abbr> simulation, the timestamps suddenly jump to extremely high numbers and occasionally even jump back in time.
Obviously, something went wrong here, with <abbr title="Temporal Decoupling">TD</abbr> probably being the culprit. But what exactly happened?
After spending way too much time debugging, we ultimately found the problem
in gem5’s implementation of the ARM virtual count <code class="language-plaintext highlighter-rouge">CNTVCT_EL0</code> register.
This register holds an increasing count value, which is later used by Linux to derive the timestamps.
When fetching the register, the current value is calculated from the time difference between the current and the last access.
However, in <abbr title="Temporal Decoupling">TD</abbr> simulations some simulation threads can run ahead of time.
That means the last access may have a higher timestamp, resulting in a negative delta.
Since gem5 stores this delta in an unsigned integer, exploding values are the consequence.
Or to summarize this in a slide from my ASP-DAC presentation:
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/asp_dac_linux_timer.png" alt="Slide from ASP-DAC 2024 presentation about the timer" width="80%" />
</div>
<p><br />
The solution for this problem is quite simple: restrict deltas to be greater than or equal to zero.
After that, we were finally able to boot Linux using temporally-decoupled gem5.
Interestingly, J. Engblom <a class="citation" href="#engblom2022">[12]</a> observed the same issue completely independently of us.
He also proposes a restriction to deltas greater than or equal to zero as a solution.</p>
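<p>The failure mode and the fix can be illustrated with a small toy model.
Note that <code class="language-plaintext highlighter-rouge">ToyCounter</code>, its members, and the tick scaling are hypothetical and made up for this sketch; this is not gem5’s actual implementation of the register.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter model, NOT gem5's actual code. The count advances
// by the simulated time that elapsed since the previous register access.
struct ToyCounter {
  std::uint64_t ticks_per_count;  // assumed scaling: simulator ticks per increment
  std::uint64_t count = 0;
  std::uint64_t last_access_ticks = 0;

  explicit ToyCounter(std::uint64_t tpc) : ticks_per_count(tpc) {
    assert(tpc > 0);
  }

  // Returns the counter value for an access at time `now_ticks`.
  // With clamp == true, a negative delta is treated as zero.
  std::uint64_t read(std::uint64_t now_ticks, bool clamp) {
    if (clamp && now_ticks < last_access_ticks)
      return count;  // accessor lags behind: nothing to add
    // Unsigned subtraction wraps around if now_ticks < last_access_ticks.
    const std::uint64_t delta_ticks = now_ticks - last_access_ticks;
    count += delta_ticks / ticks_per_count;
    last_access_ticks = now_ticks;
    return count;
  }
};
```

<p>If one thread reads the counter at tick 1000 and a thread that lags behind reads it at tick 900, the unclamped variant adds a wrapped-around delta and returns a value above $10^{18}$, whereas the clamped variant keeps returning the previous count.</p>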

<p>The second type of observed error arises from delayed communication between simulation objects.
As previously explained, events or messages from one process to another may only become apparent at the beginning of a quantum.
This leads to a communication latency that grows quasi-proportionally with the quantum.
This communication latency could also be observed when executing a multi-threaded NPB benchmark with avp64 <a class="citation" href="#armv8vp">[9]</a>,
where the synchronization of threads was delayed by <abbr title="Temporal Decoupling">TD</abbr>.
Well, in theory this delay was avoidable, because thread synchronization is usually achieved by putting a waiting CPU into a low-power state.
For ARM this could be a WFI instruction.
Whenever the simulation encounters such an instruction, it could terminate the quantum early to increase performance and accuracy.
Unfortunately, due to a bug in avp64, the WFI instruction was executed as a NOP.
Note that such behavior is actually allowed according to the ARM reference manual, which is why WFI instructions are usually guarded by a spin loop executing NOPs.
For large quanta, this leads to an interesting effect: The total number of instructions executed increases, causing the speedup measured in host execution time to decrease.
However, the speedup of the simulator measured in <abbr title="Millions Instructions Per Second">MIPS</abbr> stagnates or even increases since NOPs are easy to simulate.
As shown in the following figures, first effects are already visible at $t_{\Delta q}&gt;1ms$:
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/systemc_quantitative_inaccuracy.svg" alt="Comparing normalized MIPS, speedup, and instructions for the NPB IS benchmark running on the avp64 SystemC VP" width="70%" />
</div>
<p><br />
At $t_{\Delta q}&gt;100ms$, more than half of the time is spent in spin loops.
To conclude, if someone is selling a simulator that can achieve a lot of <abbr title="Millions Instructions Per Second">MIPS</abbr>, it may actually be executing NOPs.</p>

<p>In addition to the effects on simulation performance, throughput or functionality of peripherals can also be affected by delayed communication.
As an example, we executed the iperf3 <a class="citation" href="#iperf3">[13]</a> benchmark in avp64 with the <abbr title="Virtual Platform">VP</abbr> as a server and the host system as a client.
In our configuration, the benchmark determines the maximum throughput of a TCP-based connection between a server and a client.
As shown in the following figure, the throughput rapidly decreases from 2690 Mbit/s at $t_{\Delta q}=1µs$ to 77 Mbit/s at $t_{\Delta q}=100µs$:
<br /></p>
<div style="text-align:center">
<img src="/assets/optimal_quantum/iperf_benchmark.svg" alt="Executing the iperf benchmark using the VP as the server and the host system as a client. The throughput uses the simulation time as a reference." width="80%" />
</div>
<p><br />
This performance drop can be explained by the implementation of the OpenCores Ethernet device <em>ETHOC</em> <a class="citation" href="#ethoc">[14]</a>, which is used in avp64.
The device uses one thread each for sending and receiving Ethernet frames, and each of these threads is executed only once per quantum.
Thus, only one Ethernet frame can be received per quantum, which limits the maximum achievable throughput.
Ultimately, this can affect the data rate to such an extent that timeouts of the network driver watchdog occur.
The choice to send/receive only one packet per quantum is probably due to the fact that <abbr title="Temporal Decoupling">TD</abbr> was not properly taken into account during the device implementation.
It would be more accurate to calculate the number of packets to be processed once per quantum based on the elapsed time.
Since the respective thread is still activated once per quantum, there would be no performance loss.</p>
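<p>A back-of-the-envelope sketch of both ideas, assuming 1500-byte frames and a 100 Mbit/s link (neither value was measured in our setup, and both function names are made up for illustration): the first function gives the throughput ceiling for one frame per quantum, the second the number of frames a time-aware device model could process per activation.</p>

```python
def max_throughput_bps(frame_bytes, quantum_ns):
    """Upper bound on throughput if exactly one frame is handled per quantum."""
    return frame_bytes * 8 * 10**9 // quantum_ns


def frames_due(elapsed_ns, link_bps, frame_bytes):
    """Frames a link of the given speed could have delivered in elapsed_ns."""
    return elapsed_ns * link_bps // (frame_bytes * 8 * 10**9)


# One 1500-byte frame per 100 µs quantum caps the rate at 120 Mbit/s.
print(max_throughput_bps(1500, 100_000) // 10**6, "Mbit/s")
# A 100 Mbit/s link could deliver 8 such frames within a 1 ms quantum.
print(frames_due(1_000_000, 100_000_000, 1500), "frames")
```

<p>Under these assumptions, the ceiling of 120 Mbit/s at a 100µs quantum lies above the 77 Mbit/s measured, so the one-frame-per-quantum explanation is at least plausible.</p>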

<p>With this explanation, a steadily decreasing throughput would be expected, but we saw that the value stagnates from a quantum of 100µs onwards.
The explanation for this can be found in Linux’s NAPI, which is responsible for interrupt handling of network devices.
When the system receives an Ethernet frame, an interrupt is generated, which leads to the execution of an Interrupt Service Routine (ISR) as in most systems.
However, since network connections can transfer considerable amounts of data, the resulting interrupts can have a significant impact on the performance of the system.
Therefore, after receiving an interrupt, NAPI masks the corresponding interrupt and switches to a poll mode for a certain time, waiting for more packets to accumulate.
Only after a certain time has elapsed does it switch back to interrupt mode, and a WFI instruction is executed.
If implemented correctly, the execution of a WFI instruction leads to an early termination of the quantum, allowing the reception of the next Ethernet frame.</p>

<h2 id="6-conclusion">6. Conclusion</h2>
<ul>
  <li>More quantum, more speed, less accuracy
    <ul>
      <li>Diminishing performance returns</li>
      <li>Inaccuracy grows linearly</li>
    </ul>
  </li>
  <li>Temporal decoupling may break your simulation
    <ul>
      <li>Many ways</li>
      <li>Temporal decoupling aware design</li>
      <li>gem5 timer fix</li>
      <li>Ethernet adapter fix</li>
    </ul>
  </li>
</ul>

<h2 id="7-related-work">7. Related Work</h2>
<p>The “Related Work” section comes at the end because the motivation for this paper was a lack of related work.
Anyway, here’s a list of works/websites that I consider related to our paper:</p>

<p><strong>What is temporal decoupling?</strong><br /></p>
<ul>
  <li><a href="https://www.embecosm.com/appnotes/ean1/html/ch09s01.html">EMBECOSM</a></li>
  <li><a href="https://www.doulos.com/knowhow/systemc/tlm-20/example-5-temporal-decoupling-multiple-initiators-and-targets/">DOULOS Examples</a></li>
</ul>

<p><strong>Interesting works about temporal decoupling</strong> (from more relevant to less relevant)<br /></p>
<ul>
  <li><a href="https://www.diva-portal.org/smash/get/diva2:1530379/FULLTEXT01.pdf">Evaluating Temporal Decoupling in a Virtual Platform, Jinju Joy, 2020</a></li>
  <li><a href="https://jakob.engbloms.se/archives/3467">Some Notes on Temporal Decoupling, Jakob Engblom, 2022</a> <a class="citation" href="#engblom2022">[12]</a></li>
  <li><a href="https://dvcon-proceedings.org/wp-content/uploads/temporal-decoupling-are-fast-and-correct-mutually-exclusive.pdf">Temporal Decoupling – Are “Fast” and “Correct” Mutually Exclusive?, Jakob Engblom, 2018</a> <a class="citation" href="#engblom2018">[2]</a></li>
  <li>Optimizing Temporal Decoupling using Event Relevance, Jünger et al., 2021, <a class="citation" href="#optimizingtemporal">[15]</a></li>
  <li>Temporal decoupling with error-bounded predictive quantum control, Glaser et al., 2015 <a class="citation" href="#glaser2015">[16]</a></li>
  <li>Speculative Temporal Decoupling Using fork(), Jung et al., 2019, <a class="citation" href="#spectempfork">[17]</a></li>
  <li>Efficient Parallel Transaction Level Simulation by Exploiting Temporal Decoupling, Khaligh et al., 2009 <a class="citation" href="#khaligh2009">[18]</a></li>
</ul>

<p><strong>Analytical models and computer simulation</strong><br /></p>
<ul>
  <li>Cost/Performance of a Parallel Computer Simulator, Falsafi et al., 1994, <a class="citation" href="#falsafi1994">[19]</a></li>
  <li>A Comparison of Two Approaches to Parallel Simulation of Multiprocessors, Over et al.,  2007, <a class="citation" href="#over2007">[20]</a></li>
</ul>

<h2 id="8-references">8. References</h2>
<ol class="bibliography"><li><span id="systemcStandard">[1]“IEEE Standard for Standard SystemC Language Reference Manual,” <i>IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005)</i>, 2012, doi: 10.1109/IEEESTD.2012.6134619. </span></li>
<li><span id="engblom2018">[2]J. Engblom, “Temporal Decoupling – Are ‘Fast’ and ‘Correct’ Mutually Exclusive?,” in <i>DVCon Europe</i>, 2018. </span></li>
<li><span id="ryckbosch2012">[3]F. Ryckbosch, S. Polfliet, and L. Eeckhout, “VSim: Simulating Multi-Server Setups at near Native Hardware Speed,” <i>ACM Trans. Archit. Code Optim.</i>, Jan. 2012. </span></li>
<li><span id="joy2020">[4]J. Joy, “Evaluating Temporal Decoupling in a Virtual Platform.” 2020 [Online]. Available at: https://www.diva-portal.org/smash/get/diva2:1530379/FULLTEXT01.pdf</span></li>
<li><span id="juenger2021">[5]L. Jünger, A. Belke, and R. Leupers, “Software-defined Temporal Decoupling in Virtual Platforms,” in <i>2021 IEEE 34th International System-on-Chip Conference (SOCC)</i>, 2021, pp. 40–45, doi: 10.1109/SOCC52499.2021.9739242. </span></li>
<li><span id="amdahl1967">[6]G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities,” in <i>Proceedings of the April 18-20, 1967, Spring Joint Computer Conference</i>, 1967. </span></li>
<li><span id="gustafson1998">[7]J. L. Gustafson, “Reevaluating Amdahl’s Law,” <i>Commun. ACM</i>, vol. 31, no. 5, 1988. </span></li>
<li><span id="williams2009">[8]S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” <i>Commun. ACM</i>, vol. 52, no. 4, Apr. 2009. </span></li>
<li><span id="armv8vp">[9]“ARMv8 Virtual Platform (AVP64).” [Online]. Available at: https://github.com/aut0/avp64</span></li>
<li><span id="simvpaper">[10]L. Jünger, J. H. Weinstock, and R. Leupers, “SIM-V: Fast, Parallel RISC-V Simulation for Rapid Software Verification,” <i>DVCON Europe</i>, 2022. </span></li>
<li><span id="bellard2005">[11]F. Bellard, “QEMU, a Fast and Portable Dynamic Translator,” 2005, pp. 41–46. </span></li>
<li><span id="engblom2022">[12]J. Engblom, “Some Notes on Temporal Decoupling.” 2022 [Online]. Available at: https://jakob.engbloms.se/archives/3467</span></li>
<li><span id="iperf3">[13]“iperf3 benchmark.” [Online]. Available at: https://software.es.net/iperf/</span></li>
<li><span id="ethoc">[14]“OpenCores Ethernet MAC 10/100 Mbps.” [Online]. Available at: https://opencores.org/projects/ethmac</span></li>
<li><span id="optimizingtemporal">[15]L. Jünger, C. Bianco, K. Niederholtmeyer, D. Petras, and R. Leupers, “Optimizing Temporal Decoupling using Event Relevance,” in <i>ASP-DAC</i>, 2021. </span></li>
<li><span id="glaser2015">[16]G. Glaser, G. Nitsche, and E. Hennig, “Temporal decoupling with error-bounded predictive quantum control,” in <i>FDL</i>, 2015. </span></li>
<li><span id="spectempfork">[17]M. Jung, F. Schnicke, M. Damm, T. Kuhn, and N. Wehn, “Speculative Temporal Decoupling Using fork(),” in <i>DATE</i>, 2019, doi: 10.23919/DATE.2019.8714823. </span></li>
<li><span id="khaligh2009">[18]R. Salimi Khaligh and M. Radetzki, “Efficient Parallel Transaction Level Simulation by Exploiting Temporal Decoupling,” in <i>Analysis, Architectures and Modelling of Embedded Systems</i>, Berlin, Heidelberg, 2009, pp. 149–158. </span></li>
<li><span id="falsafi1994">[19]B. Falsafi and D. A. Wood, “Cost/Performance of a Parallel Computer Simulator,” in <i>Proceedings of the Eighth Workshop on Parallel and Distributed Simulation</i>, 1994. </span></li>
<li><span id="over2007">[20]A. Over, B. Clarke, and P. Strazdins, “A Comparison of Two Approaches to Parallel Simulation of Multiprocessors,” <i>Performance Analysis of Systems and Software, IEEE International Symmposium on</i>, vol. 0, pp. 12–22, Apr. 2007, doi: 10.1109/ISPASS.2007.363732. </span></li></ol>]]></content><author><name></name></author><category term="Simulation" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Efficient RISC-V-on-x64 Floating Point Simulation</title><link href="https://www.chciken.com/simulation/2023/11/12/fast-floating-point-simulation.html" rel="alternate" type="text/html" title="Efficient RISC-V-on-x64 Floating Point Simulation" /><published>2023-11-12T09:55:44+00:00</published><updated>2023-11-12T09:55:44+00:00</updated><id>https://www.chciken.com/simulation/2023/11/12/fast-floating-point-simulation</id><content type="html" xml:base="https://www.chciken.com/simulation/2023/11/12/fast-floating-point-simulation.html"><![CDATA[<script>
  window.MathJax = {
  loader: {load: ['[tex]/ams']},
  tex: {
    packages: {'[+]': ['ams']},
    tags: 'ams',
    inlineMath: [['$', '$']]
  }
};

</script>

<script id="MathJax-script" async="" src="/assets/common/mathjax/mathjax-3.2.2.js"></script>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#1-introduction">1. Introduction</a></li>
  <li><a href="#2-the-story">2. The Story</a></li>
  <li><a href="#3-floating-point-basics">3. Floating Point Basics</a>
    <ul>
      <li><a href="#31-the-math">3.1 The Math</a></li>
      <li><a href="#32-risc-v-floating-point">3.2 RISC-V Floating Point</a></li>
      <li><a href="#33-x64-floating-point">3.3 x64 Floating Point</a></li>
    </ul>
  </li>
  <li><a href="#4-the-problems">4. The Problems</a>
    <ul>
      <li><a href="#41-different-canonical-qnan-encodings">4.1 Different Canonical qNaN Encodings</a></li>
      <li><a href="#42-different-instruction-semantics">4.2 Different Instruction Semantics</a></li>
      <li><a href="#43-the-missing-rounding-mode">4.3 The Missing Rounding Mode</a></li>
      <li><a href="#44-nan-boxing">4.4 NaN Boxing</a></li>
      <li><a href="#45-nan-propagation">4.5 NaN Propagation</a></li>
      <li><a href="#46-floating-point-exception-flags">4.6 Floating Point Exception Flags</a></li>
    </ul>
  </li>
  <li><a href="#5-how-other-simulators-work">5. How Other Simulators Work</a>
      <ul>
        <li><a href="#51-soft-float">5.1 Soft Float</a></li>
        <li><a href="#52-rv8">5.2 rv8</a></li>
        <li><a href="#53-qemu-post-v400">5.3 QEMU post-v4.0.0</a></li>
        <li><a href="#54-rosetta-2">5.4 Rosetta 2</a></li>
        <li><a href="#55-dolphin">5.5 Dolphin</a></li>
        <li><a href="#56-virtual-console">5.6 Virtual Console</a></li>
        <li><a href="#57-libriscv">5.7 libriscv</a></li>
        <li><a href="#58-you-et-al">5.8 You et al.</a></li>
        <li><a href="#59-sarrazin-et-al">5.9 Sarrazin et al.</a></li>
      </ul>
  </li>
  <li><a href="#6-methods">6. Methods</a>
      <ul>
        <li><a href="#61-fast-additionsubtraction">6.1 Fast Addition/Subtraction</a></li>
        <li><a href="#62-fast-32-bit-multiplication">6.2 Fast 32-bit Multiplication</a></li>
        <li><a href="#63-fast-32-bit-division">6.3 Fast 32-bit Division</a></li>
        <li><a href="#64-fast-32-bit-square-root">6.4 Fast 32-bit Square Root</a></li>
        <li><a href="#65-fast-32-bit-fused-multiply-add">6.5 Fast 32-bit Fused Multiply-Add</a></li>
        <li><a href="#66-fast-64-bit-operations">6.6 Fast 64-bit Operations</a></li>
      </ul>
  </li>
  <li><a href="#7-results--discussion">7. Results &amp; Discussion</a>
      <ul>
        <li><a href="#71-clean-room-benchmarks">7.1 Clean Room Benchmarks</a></li>
        <li><a href="#72-my-method-vs-qemu">7.2 My Method vs. QEMU</a></li>
      </ul>
  </li>
  <li><a href="#8-conclusion--outlook">8. Conclusion &amp; Outlook</a></li>
  <li><a href="#9-references">9. References</a></li>
  </ul>
</div>

<h2 id="1-introduction">1. Introduction</h2>
<p>This post is an extended and completely reworked version of our paper “Efficient RISC-V-on-x64 Floating Point Simulation”.
A preprint version of the original paper can be downloaded <a href="/assets/fast_floating_point_simulation/fast-float-paper-preprint.pdf">here</a>.
In order to guide expectations right from the start, I would like to answer three essential questions first.</p>

<p><strong>What is this post about and is it worth reading?</strong> <br />
This post is about floating point (<abbr title="Floating Point">FP</abbr>) arithmetic in simulators/emulators.
So, if you ever wondered how simulators/emulators like QEMU or gem5 handle floating point arithmetic,
the following might be of interest to you.
Although the title says RISC-V,
the methods presented here are applicable to most other Instruction Set Architectures (<abbr title="Instruction Set Architectures">ISAs</abbr>) as well.
In fact, I also present a little section about Apple’s Rosetta 2 (x64-on-ARM) and the Wii/GameCube emulator Dolphin (PowerPC-on-x64, PowerPC-on-ARM).</p>

<p><strong>Should I read the paper or this blog post?</strong> <br />
Read this post for the reasons described in the next answer.</p>

<p><strong>Why did I spend my free time rewriting something I already spent weeks on?</strong> <br />
Blog posts are better than papers because:</p>
<ul>
  <li>I don’t have to appeal to reviewers</li>
  <li>No page limit</li>
  <li>Additional material (data, videos, code, etc.)</li>
</ul>

<p><strong>How to cite?</strong> <br />
Please prefer to cite the original paper:</p>
<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@INPROCEEDINGS</span><span class="p">{</span><span class="nl">zurstrassen2023</span><span class="p">,</span>
  <span class="na">author</span><span class="p">=</span><span class="s">{Zurstraßen, Niko and Bosbach, Nils and Joseph, Jan Moritz and Jünger, Lukas and Weinstock, Jan Henrik and Leupers, Rainer}</span><span class="p">,</span>
  <span class="na">booktitle</span><span class="p">=</span><span class="s">{2023 IEEE 41st International Conference on Computer Design (ICCD)}</span><span class="p">,</span>
  <span class="na">title</span><span class="p">=</span><span class="s">{Efficient RISC-V-on-x64 Floating Point Simulation}</span><span class="p">,</span>
  <span class="na">year</span><span class="p">=</span><span class="s">{2023}</span><span class="p">,</span>
  <span class="na">volume</span><span class="p">=</span><span class="s">{}</span><span class="p">,</span>
  <span class="na">number</span><span class="p">=</span><span class="s">{}</span><span class="p">,</span>
  <span class="na">pages</span><span class="p">=</span><span class="s">{1-6}</span><span class="p">,</span>
  <span class="na">doi</span><span class="p">=</span><span class="s">{10.1109/ICCD58817.2023.00090}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="2-the-story">2. The Story</h2>
<p>In 2022, a colleague of mine and his friend mustered the courage to found a <a href="https://www.machineware.de/">startup</a>.
Their flagship product is a RISC-V simulator called <a href="https://www.machineware.de/pages/products.html"><em>SIM-V</em></a>
<a class="citation" href="#simv2022">[1]</a>,
which can be used to simulate RISC-V systems on x64 (or other) machines.
One of the key selling points is the almost native performance.
The simulated system is so fast that you can interact with it like a real system.</p>

<p>So, how does one make a simulator go 🚀🚀🚀? <br />
I am certainly not giving away any secrets when I reveal that the underlying technology is
<a href="https://en.wikipedia.org/wiki/Binary_translation"><abbr title="Dynamic Binary Translation">Dynamic Binary Translation</abbr> (DBT)</a>.
So basically the same method that is used by QEMU.
With DBT, binary instructions of the target system (RISC-V in our case) are translated into instructions of the host system (e.g. x64) at runtime and executed.
If possible, instructions are translated 1-to-1 (or at least 1-to-only-a-few), which also explains the native speed.
For example, one could simply translate a RISC-V 32-bit floating point (<abbr title="Floating Point">FP</abbr>) addition <code class="language-plaintext highlighter-rouge">fadd.s</code> to an x64 <abbr title="Floating Point">FP</abbr> addition <code class="language-plaintext highlighter-rouge">addss</code>.
Semantically, these two instructions seem to be identical, at least at first sight.</p>

<p>My colleagues thought so too and implemented it this way in their first version of SIM-V.
In practice, this method actually works quite well.
You can boot Linux systems with it, and execute many applications without encountering problems.</p>

<p>One of the few applications that doesn’t work with this method is the <a href="https://github.com/riscv-software-src/riscof">RISC-V Architectural Test Framework (<abbr title="RISC-V Architectural Test Framework">RISCOF</abbr>)</a>.
Unfortunately, that’s a real showstopper, since passing these tests is required to license the RISC-V trademark.
Or to quote <a href="https://riscof.readthedocs.io/en/stable/intro.html#intent-of-the-architectural-test-suite"><abbr title="RISC-V Architectural Test Framework">RISCOF</abbr>’s documentation</a>:</p>

<p><em>Passing the tests and having the results approved by RISC-V International is a prerequisite to licensing the RISC-V trademarks in connection with the design.</em></p>

<p>So, passing these tests was top priority and my colleagues asked me to do an investigation.
After taking a closer look at the failing tests, I could pinpoint the following 6 reasons why they failed:</p>

<ol>
  <li>Different NaN encodings</li>
  <li>Different instruction semantics</li>
  <li>x64’s missing <abbr title="Round to Nearest, Ties to Maximum Magnitude">RMM</abbr> rounding mode</li>
  <li>NaN boxing</li>
  <li>Floating Point exception flags</li>
  <li>NaN Propagation</li>
</ol>

<p>In the following, I will explain each of these points in greater detail.
Subsequently, I show how other simulators solve these issues and how I solve them.</p>

<p>But first, I’ll explain some basics about <abbr title="Floating Point">FP</abbr> arithmetic, IEEE 754, and how it is implemented in RISC-V and x64.
Feel free to skip the next section if you are already familiar with these topics.</p>

<!-- If you are new to this topic (like I was), this is quite surprising.
Because as far as I knew, FP arithmetic was standardized in 1985 by [IEEE 754](https://en.wikipedia.org/wiki/IEEE_754)
and both RISC-V and x64 are adhering to it.
Well, it turns out, that this standard was revised in multiple times (2008, 2019) and some parts of it leave room for implementation-defined behavior.
Moreover, RISC-V and x64 add some ISA-specific details which is not included in IEEE 754. -->

<h2 id="3-floating-point-basics">3. Floating Point Basics</h2>
<h3 id="31-the-math">3.1 The Math</h3>
<p>Floating point (<abbr title="Floating Point">FP</abbr>) numbers are the most common way to approximate real numbers in computing.
You find them in most programming languages with names such as <code class="language-plaintext highlighter-rouge">float</code>, <code class="language-plaintext highlighter-rouge">double</code>, <code class="language-plaintext highlighter-rouge">f32</code> or <code class="language-plaintext highlighter-rouge">f64</code>.
Due to the many ways <abbr title="Floating Point">FP</abbr> arithmetic can be implemented, adhering to standards avoids a lot of problems.
This is why most software and hardware follows the <em>IEEE 754</em> standard.
But standards can also be erroneous or incomplete, which is why there are now 3 versions:</p>
<ul>
  <li>IEEE 754 1985, 20 pages <a class="citation" href="#ieee7541985">[2]</a></li>
  <li>IEEE 754 2008, 70 pages <a class="citation" href="#ieee7542008">[3]</a></li>
  <li>IEEE 754 2019, 84 pages <a class="citation" href="#ieee7542019">[4]</a></li>
</ul>

<p>They differ mostly in some details, which will be discussed later.</p>

<p>The most important number formats defined by IEEE 754 are <code class="language-plaintext highlighter-rouge">binary32</code> and <code class="language-plaintext highlighter-rouge">binary64</code>.
If you program C/C++, you already know them as <code class="language-plaintext highlighter-rouge">float</code> and <code class="language-plaintext highlighter-rouge">double</code>.
In Rust they are called <code class="language-plaintext highlighter-rouge">f32</code> and <code class="language-plaintext highlighter-rouge">f64</code>.
An <abbr title="Floating Point">FP</abbr> number comprises a sign, a significand, an exponent, and a bias
with the following bit representation:</p>
<div style="text-align:center">
<img src="/assets/fast_floating_point_simulation/fp_value.svg" alt="ieee754-binary3264" width="99%" />
</div>
<p><br />
Note that the bias is implicit and fixed.
It is used to reach negative numbers in the exponent without using two’s complement.
Ultimately, the numerical value of an <abbr title="Floating Point">FP</abbr> number is given by:</p>

<p>\begin{equation}
f = (-1)^{sign} \cdot (1.s_{p-1}s_{p-2}…s_1)_2 \cdot 2^{exponent-bias}
\end{equation}</p>

<p>In the formula, $s_i$ refers to the bit at position $i$ in the significand.
However, there are quite a few corner cases to represent some special values.</p>

<p>The first case is <a href="https://en.wikipedia.org/wiki/Subnormal_number">subnormal numbers</a>.
Whenever $exponent$ is 0, the implicit leading 1 turns into a 0 and the scale is fixed at $2^{1-bias}$, the same scale as the smallest normal numbers.
So we get:</p>

<p>\begin{equation}
f = (-1)^{sign} \cdot (0.s_{p-1}s_{p-2}…s_1)_2 \cdot 2^{1-bias}
\end{equation}</p>

<p>Having these special cases gives us some cool mathematical properties, like additions and subtractions that never underflow.
However, in many other regards like hardware complexity, some mathematical proofs, or timing side channels, it can be a pain.</p>

<p>Another special value is infinity. If all bits in the exponent are set and the significand is 0, the value is interpreted as  $\pm \infty$.</p>

<p>The last special value is <a href="https://en.wikipedia.org/wiki/NaN#Quiet_NaN">NaN (Not a Number)</a>,
which comes in two different flavors: quiet (qNaN) and signaling (sNaN).
qNaNs are used to represent non-meaningful results (e.g. $\infty-\infty$), while sNaNs are intended to be used for uninitialized variables/memory.
The bit pattern of a NaN is an exponent with all bits set and a significand that is not 0.
How the encodings of qNaN and sNaN differ is explained in Section <a href="#41-different-canonical-qnan-encodings">“4.1 Different Canonical qNaN Encodings”</a>.</p>
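<p>The cases above can be put together in a small binary32 decoder. This is a minimal Python sketch for illustration. The names <code class="language-plaintext highlighter-rouge">decode_binary32</code> and <code class="language-plaintext highlighter-rouge">host_float</code> are made up here, exact rationals are used to avoid rounding in the decoder itself, and subnormals are scaled by $2^{1-bias}$ as IEEE 754 prescribes.</p>

```python
import struct
from fractions import Fraction

BIAS = 127           # binary32 exponent bias
FRACTION_BITS = 23   # explicitly stored significand bits


def decode_binary32(bits):
    """Return the exact value of a binary32 bit pattern, or a label for inf/NaN."""
    sign = -1 if bits >> 31 else 1
    exponent = (bits >> FRACTION_BITS) & 0xFF
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    if exponent == 0xFF:
        if fraction == 0:
            return "+inf" if sign > 0 else "-inf"
        return "NaN"
    if exponent == 0:
        # Subnormal (or zero): no implicit leading 1, exponent fixed at 1 - bias.
        return sign * Fraction(fraction, 1 << FRACTION_BITS) * Fraction(2) ** (1 - BIAS)
    # Normal number: implicit leading 1.
    significand = 1 + Fraction(fraction, 1 << FRACTION_BITS)
    return sign * significand * Fraction(2) ** (exponent - BIAS)


def host_float(bits):
    """Cross-check helper: let the host reinterpret the same bits as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for pattern in (0x3F800000, 0xC0490FDB, 0x00000001, 0x7F800000, 0x7FC00000):
    print(f"{pattern:#010x} -> {decode_binary32(pattern)}")
```

<p>Note that this sketch does not distinguish $+0$ from $-0$ and does not tell qNaNs from sNaNs.</p>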

<p>While Equation 1 is often used to introduce and understand the concept of IEEE <abbr title="Floating Point">FP</abbr> numbers,
the $p-1$ significand bits with an implicit leading 1 complicate mathematical proofs.
A representation more suited for mathematical adventures is:</p>

<p>\begin{equation}
\label{eq:float1}
f = M \cdot 2^{e - p + 1}, \quad e=exponent-bias
\end{equation}</p>

<p>With this representation, the significand is shifted so far that it becomes an integer value.
Due to the finite number of bits in binary32 and binary64, the precision $p$, the significand $M$, and the exponent $e$ are constrained by the values given
in the following table:</p>

<table>
  <thead>
    <tr>
      <th>data type</th>
      <th>exponent range</th>
      <th>precision bits</th>
      <th>significand range</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>binary32</td>
      <td>$ e_{f,min}=-126 \leq e_f \leq 127 = e_{f,max}$</td>
      <td>$p_f=24$</td>
      <td>$\left\lvert M_f \right\rvert  \leq 2^{24}-1$</td>
    </tr>
    <tr>
      <td>binary64</td>
      <td>$ e_{d,min}=-1022 \leq e_d \leq 1023 = e_{d,max}$</td>
      <td>$p_d=53$</td>
      <td>$\left\lvert M_d \right\rvert \leq 2^{53}-1$</td>
    </tr>
  </tbody>
</table>

<p>Note that the $p$ precision bits include the implicit leading 1. For example, a binary32 value has a precision of
24 bits of which 23 bits are explicitly stored.
Hence, the representation is only suitable for normal numbers!
Or in other words: don’t use this model to represent subnormal numbers!</p>
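<p>To make the integer-significand view tangible, here is a small Python sketch that splits a binary64 value into $M$ and $e$. It relies on Python floats being binary64, and the helper name <code class="language-plaintext highlighter-rouge">integer_significand</code> is an illustrative choice. As stated above, it is only meaningful for normal, non-zero numbers.</p>

```python
import math
from fractions import Fraction

P_DOUBLE = 53  # precision of binary64, including the implicit leading 1


def integer_significand(x):
    """Split a normal, non-zero binary64 value into (M, e) with x = M * 2**(e - p + 1)."""
    m, e2 = math.frexp(x)          # x = m * 2**e2 with 0.5 <= |m| < 1
    M = int(m * 2**P_DOUBLE)       # exact: scaling by a power of two loses no bits
    e = e2 - 1                     # shift to the 1.xxx convention used above
    return M, e


M, e = integer_significand(0.1)
# Reassemble the value exactly to confirm the decomposition.
assert Fraction(M) * Fraction(2) ** (e - P_DOUBLE + 1) == Fraction(0.1)
print(M, e)
```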

<p>Another really painful aspect of <abbr title="Floating Point">FP</abbr> numbers is rounding errors.
Whenever mathematical operations, such as additions or multiplications, are performed on <abbr title="Floating Point">FP</abbr> numbers, rounding
errors may occur.
In literature and this post, rounding is symbolized by the $\circ$ operator.
While rounding errors are hard to avoid, most <abbr title="Floating Point">FP</abbr> hardware allows you to control the sign of the error by means of <em>rounding modes</em>.
With these modes, you can control whether the final result is rounded down, up, to the nearest number, or however you define it.
The most recent IEEE 754 standard defines 5 rounding modes:</p>
<ul>
  <li>roundTiesToEven (mandatory)</li>
  <li>roundTiesToAway (introduced in 2008, not mandatory)</li>
  <li>roundTowardPositive (mandatory)</li>
  <li>roundTowardNegative (mandatory)</li>
  <li>roundTowardZero (mandatory)</li>
</ul>

<p>To indicate which rounding mode is used in mathematical representations, a little acronym is added to the circle operator.
For example, $\circ_{RNE32}(a+b)$ corresponds to a 32-bit addition under the Round to Nearest, Ties to Even (<abbr title="Round to Nearest, Ties to Even">RNE</abbr>) rounding mode.
I’m using the acronyms from the RISC-V spec <a class="citation" href="#riscv2019">[5]</a>.
In the following, if no rounding mode is given, <abbr title="Round to Nearest, Ties to Even">RNE</abbr> shall be assumed.</p>
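<p>The difference between the five modes is easiest to see when rounding to integers instead of to $p$-bit significands. The following Python sketch does exactly that as a simplified stand-in. The function name is made up, and the mode names are the RISC-V acronyms (RNE, RMM, RUP, RDN, RTZ) for the five IEEE modes listed above.</p>

```python
import math
from fractions import Fraction

MODES = ("RNE", "RMM", "RUP", "RDN", "RTZ")


def round_to_integer(x, mode):
    """Round an exact rational x to an integer under one of the five modes."""
    if mode not in MODES:
        raise ValueError(f"unknown rounding mode: {mode}")
    lower, upper = math.floor(x), math.ceil(x)
    if lower == upper:                 # already representable, nothing to round
        return lower
    if mode == "RDN":                  # roundTowardNegative
        return lower
    if mode == "RUP":                  # roundTowardPositive
        return upper
    if mode == "RTZ":                  # roundTowardZero
        return lower if x > 0 else upper
    if x - lower != upper - x:         # not a tie: both nearest modes agree
        return lower if x - lower < upper - x else upper
    if mode == "RNE":                  # roundTiesToEven
        return lower if lower % 2 == 0 else upper
    return upper if x > 0 else lower   # RMM: roundTiesToAway


for mode in MODES:
    print(mode, round_to_integer(Fraction(5, 2), mode), round_to_integer(Fraction(-5, 2), mode))
```

<p>The two nearest modes only differ on exact ties such as $\pm 2.5$, which is why a missing RMM mode can go unnoticed for a long time.</p>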

<p>To assess the numerical impact of these errors, one can use the <em>standard error model of <abbr title="Floating Point">FP</abbr> arithmetic</em> <a class="citation" href="#higham2002">[6]</a>.
According to the model, the error of many arithmetic operations (+, −, /, ·, √), including underflows,
can be represented as:</p>

<p>\begin{equation}
  \label{eq:standard-error-model}
  \begin{gathered}
  z = (a \, \text{op}\, b) \cdot (1 + \epsilon ) + \eta = \circ(a \, \text{op}\, b) \\
  \epsilon \eta = 0, \quad |\epsilon| \leq \textbf{u}, \quad |\eta| \leq 2^{e_{min}} \cdot \textbf{u}, \quad \textbf{u} = 2^{-p}
  \end{gathered}
\end{equation}</p>

<p>Here, $\eta$ and $\epsilon$ are used to distinguish between subnormal and normal numbers:</p>
<ul>
  <li>Normal number: $\eta=0$</li>
  <li>Subnormal number: $\epsilon=0$</li>
</ul>

<p>The relative error $\epsilon$ is bounded by the so-called <em>unit roundoff</em> error $\textbf{u}$.
Note that this bound only holds for round-to-nearest rounding.
To account for other rounding modes as well, you can use a roundoff error of $2\textbf{u}$.
This is also referred to as the <a href="https://en.wikipedia.org/wiki/Machine_epsilon">machine epsilon</a>.</p>
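<p>The bound $|\epsilon| \leq \textbf{u}$ can be checked empirically. The sketch below assumes that Python floats are binary64 values added under round-to-nearest, which holds on common platforms. It compares the rounded sum with the exact rational sum, and the helper name is an illustrative choice.</p>

```python
from fractions import Fraction

U = Fraction(1, 2**53)  # unit roundoff of binary64: u = 2**-p with p = 53


def relative_error(a, b):
    """Relative error of the rounded double addition a + b versus the exact sum."""
    exact = Fraction(a) + Fraction(b)   # exact rational arithmetic, must be non-zero
    rounded = Fraction(a + b)           # what the FPU returns
    return abs(rounded - exact) / abs(exact)


for a, b in ((0.1, 0.2), (1.0, 2.0**-60), (1e16, 1.0)):
    eps = relative_error(a, b)
    print(f"{a} + {b}: |eps| / u = {float(eps / U):.3f}")
    assert eps <= U
```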

<h3 id="32-risc-v-floating-point">3.2 RISC-V Floating Point</h3>
<p>In this subsection, I’ll explain how <abbr title="Floating Point">FP</abbr> arithmetic works on RISC-V systems.
All information presented here is based on the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> manual <a class="citation" href="#riscv2019">[5]</a>.</p>

<p>In general, RISC-V is organized in so-called <em>extensions</em>.
Each extension defines a certain set of instructions and other characteristics, which can be assembled into larger systems in a modular way.
This includes <abbr title="Floating Point">FP</abbr> arithmetic, which is used in the extensions <em>F</em>, <em>D</em>, <em>Q</em>, <em>Zfa</em>, <em>Zfh</em>, <em>Zfhmin</em>, <em>Zfinx</em>, <em>Zhinx</em>, and <em>Zhinxmin</em>.
Moreover, there is a vector extension <em>V</em>, which also uses <abbr title="Floating Point">FP</abbr> arithmetic.
Vanilla 32-bit and 64-bit <abbr title="Floating Point">FP</abbr> arithmetic is provided by the extensions <em>F</em> and <em>D</em> respectively.</p>

<p>All <abbr title="Floating Point">FP</abbr> extensions mostly adhere to the latest IEEE 754 2019 standard <a class="citation" href="#ieee7542019">[4]</a>.
Accordingly, there are 5 <abbr title="Floating Point">FP</abbr> exceptions and 5 rounding modes.
Reading <abbr title="Floating Point">FP</abbr> exceptions and setting rounding modes is achieved by reading/writing the <code class="language-plaintext highlighter-rouge">fcsr</code> register
(see <a href="#fcsr-risc-v-x64">Figure below</a>).<br />
Unlike many other <abbr title="Instruction Set Architectures">ISAs</abbr>, RISC-V doesn’t trigger hardware traps when encountering <abbr title="Floating Point">FP</abbr> exceptions.
Hence, you cannot catch, for example, a resulting underflow without constantly checking the <code class="language-plaintext highlighter-rouge">fcsr</code> register.<br />
Another interesting characteristic of RISC-V is the instruction-embedded rounding mode.
That means it is possible to specify an operation’s rounding mode directly in the instruction’s encoding.
However, if the instruction’s rounding mode encodes to “dynamic”, a global rounding mode from <code class="language-plaintext highlighter-rouge">fcsr</code> is used instead.<br />
A special peculiarity that is not part of the IEEE standard is RISC-V’s hardware-assisted NaN boxing.
With NaN boxing, the upper bits of an M-bit <abbr title="Floating Point">FP</abbr> register are set to 1 if an N-bit value is written to it with $M&gt;N$.
Also, values smaller than FLEN (<abbr title="Floating Point">FP</abbr> register width) are only considered valid if the upper bits in the register are set.
For example, if a 32-bit <abbr title="Floating Point">FP</abbr> value resides in a 64-bit register, it is only considered valid if the top 32 bits are set to 1.
This means, instructions working solely on 32-bit <abbr title="Floating Point">FP</abbr> values must check the upper bits when reading the operands and set them when writing back the result.
Since the whole 64-bit value encodes to a negative qNaN, there is no risk of creating valid values by accident. <br />
One issue where the IEEE standard leaves/left too much freedom, in my opinion, is canonical qNaNs.
A canonical qNaN is the specific bit pattern returned by the hardware if it executes an invalid operation (e.g. 0/0).
For example, a 32-bit zero-by-zero division will result in <code class="language-plaintext highlighter-rouge">0x7fc00000</code> for 32-bit <abbr title="Floating Point">FP</abbr> registers.
The same 32-bit division for 64-bit <abbr title="Floating Point">FP</abbr> registers results in a NaN-boxed value of <code class="language-plaintext highlighter-rouge">0xffffffff7fc00000</code>.
But more on that later in Subsection <a href="#41-different-canonical-qnan-encodings">Different Canonical qNaN Encodings</a>.</p>

<p><br /></p>
<div style="text-align:center">
<img src="/assets/fast_floating_point_simulation/fp_registers_riscv_x86.svg" alt="fcsr-risc-v-x64" width="80%" id="fcsr-risc-v-x64" />
</div>

<h3 id="33-x64-floating-point">3.3 x64 Floating Point</h3>
<p>Similar to RISC-V, <abbr title="Floating Point">FP</abbr> arithmetic on x64 is also defined by extensions.
Yet, the story for this <abbr title="Instruction Set Architecture">ISA</abbr> is a little bit more convoluted.</p>

<p>The first <abbr title="Floating Point">FP</abbr> <abbr title="Instruction Set Architecture">ISA</abbr> for the x86 family was introduced in 1980 with the <a href="https://en.wikipedia.org/wiki/X87">x87 extension</a>.
This extension was succeeded by SSE in 1999, which not only provided scalar <abbr title="Floating Point">FP</abbr> arithmetic but also vector instructions.
Even though SSE mostly superseded x87, today’s x64 CPUs still support the x87 extension, for legacy reasons I guess.
Modern compilers like <em>gcc</em> primarily generate SSE instructions when it comes to scalar <abbr title="Floating Point">FP</abbr> arithmetic.
There are only a few corner cases like <code class="language-plaintext highlighter-rouge">long double</code>, for which gcc will still generate x87 code.</p>

<p>In 2011, Intel and AMD released the first processors including the AVX extension, which had new SIMD and scalar instructions.
This was followed by AVX-512 in 2016, which adds scalar <abbr title="Floating Point">FP</abbr> instructions using an instruction-encoded rounding mode.
Yet AVX-512 isn’t even supported by many modern CPUs and in general doesn’t seem to be a very beloved child.
Or to quote Linus Torvalds: <a href="https://www.phoronix.com/news/Linus-Torvalds-On-AVX-512">“I hope Intel’s AVX-512 ‘dies a painful death’.”</a>.</p>

<p>So, after having introduced 4 different <abbr title="Floating Point">FP</abbr> extensions, which one is relevant for the following?
It’s not x87 due to its obsolescence, and it’s not AVX-512 due to its unpopularity.
Consequently, we are left with SSE and AVX.
Since SSE is the default extension when using gcc, the rest of this section describes how <abbr title="Floating Point">FP</abbr> works for SSE.</p>

<p>Since SSE was introduced in 1999, it mostly adheres to the most recent IEEE standard at that time, which was IEEE 754-1985 <a class="citation" href="#ieee7541985">[2]</a>.
That means that, unlike RISC-V, x64 lacks the <abbr title="Round to Nearest, Ties to Maximum Magnitude">RMM</abbr> rounding mode, which was introduced in later standards (see <a href="#fcsr-risc-v-x64">Figure above</a>). <br />
The first standard already defined the five <abbr title="Floating Point">FP</abbr> exceptions (invalid, underflow, overflow, inexact, divide-by-zero).
So, x64 is equal to RISC-V in that regard.
Surprisingly, mapping the <abbr title="Floating Point">FP</abbr> exceptions from host to target turned out to be one of the most difficult challenges, as shown in the subsequent section. <br />
As already teased above, x64 <strong>mostly</strong> adheres to the IEEE 754 standard.
SSE didn’t really change anything in the specification, but it added a few features.
For instance, x64 also defines a denormal flag for the detection of subnormal operands. Also, x64 allows subnormal numbers to be treated as 0 using the <abbr title="Flush To Zero">FTZ</abbr> and <abbr title="Denormals As Zero">DAZ</abbr> flags.
This matters because, depending on the microarchitecture, the processing of subnormal numbers can reduce your <abbr title="Floating Point Unit">FPU</abbr>’s performance by 10-100x <a class="citation" href="#dooley2006">[7]</a>!
If you just map subnormal numbers to 0, you may lose some precision, but there’s no risk of a severe performance drop.
This flush-to-zero mode was designed for 3D applications where performance is a greater concern than accuracy
<a class="citation" href="#thakkur1999">[8]</a>. <br />
Besides defining a lot of <abbr title="Floating Point">FP</abbr> stuff, IEEE 754 still leaves some room for implementation-defined behavior.
One such thing is trapping <abbr title="Floating Point">FP</abbr> exceptions, which may or may not be present on a system.
In that regard, the x64 <abbr title="Instruction Set Architecture">ISA</abbr> takes a hybrid approach that allows you to specify which <abbr title="Floating Point">FP</abbr> exceptions cause a trap.
The corresponding masking bits are selected in the FMASK field, as depicted in the <a href="#fcsr-risc-v-x64">Figure above</a>. <br />
Another implementation-defined difference between RISC-V and x64 is the canonical NaN encoding.
On x64 systems, the canonical qNaN uses a negative sign, while RISC-V uses a positive sign.
That means that on x64, a 32-bit qNaN resulting from an invalid operation is encoded as <code class="language-plaintext highlighter-rouge">0xffc00000</code>.</p>

<h2 id="4-the-problems">4 The Problems</h2>

<p>As already mentioned in <a href="#2-the-story">Section 2</a>, we are facing 6 different problems when executing RISC-V instructions on x64 hosts.
In the following, I provide a more detailed explanation for each of them.</p>

<h3 id="41-different-canonical-qnan-encodings">4.1 Different Canonical qNaN Encodings</h3>
<p>For some operands, certain <abbr title="Floating Point">FP</abbr> instructions cannot provide a meaningful result.
For example, when multiplying ∞ and 0, or when adding +∞ and -∞.
To indicate the occurrence of an invalid operation, a specific bit pattern has to be returned.
This pattern is referred to as a <a href="https://en.wikipedia.org/wiki/NaN#Quiet_NaN">qNaN (quiet Not A Number)</a>.
There is also an <a href="https://en.wikipedia.org/wiki/NaN#Signaling_NaN">sNaN (signaling Not A Number)</a>, but this is rather irrelevant in our case.
So, what does the bit pattern of a qNaN look like?<br />
The IEEE 754 standard from 1985 defines a NaN very vaguely as a number with all exponent bits set to one,
and a non-zero significand.
The exact difference between a qNaN and an sNaN was specified in the 2008 version, with a qNaN having a leading “1” in the significand
and sNaN having a leading “0”.
So, according to the latest IEEE 754 standard, a 32-bit qNaN looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x111 1111 11xx xxxx xxxx xxxx xxxx xxxx
x = arbitrary bit
</code></pre></div></div>

<p>As you can see, there’s not only one qNaN, but a whole range of patterns, leaving an <abbr title="Instruction Set Architecture">ISA</abbr> designer with the problem
of which exact pattern to return when encountering an invalid operation.
Since IEEE 754 unfortunately does not give a recommendation here, we see various patterns in practice.
The following extended table from <a class="citation" href="#waterman2016">[9]</a> shows the qNaN patterns of some popular <abbr title="Instruction Set Architectures">ISAs</abbr>.</p>

<div style="text-align:center">
<table style="width:90%;margin-left: auto; margin-right: auto;">
    <tr>
        <th>ISA</th>
        <th>Sign</th>
        <th>Significand</th>
        <th>IEEE 754 2008 compliant</th>
    </tr>
    <tr>
        <td>SPARC</td>
        <td>0</td>
        <td>11111111111111111111111</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>RISC-V $&lt; v2.1$</td>
        <td>0</td>
        <td>11111111111111111111111</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>MIPS</td>
        <td>0</td>
        <td>01111111111111111111111</td>
        <td>✗</td>
    </tr>
    <tr>
        <td>PA-RISC</td>
        <td>0</td>
        <td>01000000000000000000000</td>
        <td>✗</td>
    </tr>
    <tr>
        <td>x64</td>
        <td>1</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>Alpha</td>
        <td>1</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>ARM64</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>PowerPC</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>Loongson</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>RISC-V $\geq v2.1$</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
</table>
</div>

<p>As you can see, the canonical qNaNs of RISC-V and x64 differ in their sign.
Thus, if we were to translate RISC-V <abbr title="Floating Point">FP</abbr> instructions one-to-one to x64, we’d have to check for qNaNs after each instruction.
If a qNaN is encountered as a result, its sign must be inverted.
In case you’d like to see the different qNaNs, execute the following code on different <abbr title="Instruction Set Architectures">ISAs</abbr>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// x64:    0xffc00000</span>
<span class="c1">// RISC-V: 0x7fc00000</span>
<span class="c1">// MIPS:   0x7fbfffff</span>
<span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;ios&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="kt">float</span> <span class="n">a</span> <span class="o">=</span> <span class="mf">0.</span><span class="n">f</span><span class="p">;</span>
  <span class="kt">float</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">0.</span><span class="n">f</span><span class="p">;</span>
  <span class="n">a</span> <span class="o">/=</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// Generates a canonical qNaN.</span>

  <span class="kt">unsigned</span> <span class="kt">int</span><span class="o">*</span> <span class="n">c</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">a</span><span class="p">);</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">hex</span> <span class="o">&lt;&lt;</span> <span class="s">"0x"</span> <span class="o">&lt;&lt;</span> <span class="o">*</span><span class="n">c</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="42-different-instruction-semantics">4.2 Different Instruction Semantics</h3>
<p>Now to one of my favorite problems, which shows in an absurd way that even IEEE standards created by experts are not impeccable.
Let’s start with a simple question: What is the maximum of an sNaN and an arbitrary number?
Or expressed directly as instructions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x64: maxss 5.f, sNaN = ?
RISC-V: fmax  5.f, sNaN = ?
</code></pre></div></div>
<p>The answers to this question are as numerous as they are confusing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>x64:
  maxss 5.f, sNaN = sNaN
  maxss sNaN, 5.f = 5.f
RISC-V &lt;2.2:
  fmax  5.f, sNaN = qNaN
  fmax  sNaN, 5.f = qNaN
RISC-V 2.2:
  fmax  5.f, sNaN = 5.f
  fmax  sNaN, 5.f = 5.f
</code></pre></div></div>
<p>I guess the results show quite well that some instructions cannot be mapped 1-to-1.</p>

<p>So, why is that?
The answer is interesting, but not relevant for the understanding of the rest of the post.
Thus, feel free to skip the rest of this subsection.</p>

<p>Let’s start with the odd behavior of the x64 <code class="language-plaintext highlighter-rouge">maxss</code> instruction.
When the modern x64 <abbr title="Floating Point">FP</abbr> arithmetic was introduced as part of the SSE extension in 1999, the current
IEEE 754 standard was still from 1985.
If you look into this standard and look for guidance on maximum/minimum instructions, you find exactly… nothing!
So, here is my guess how Intel’s engineers made it more or less compliant.
Instead of regarding the maximum/minimum instruction as atomic, you define it using order relations.
For example, using C++ syntax, you could define it as:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">&gt;</span> <span class="n">b</span> <span class="o">?</span> <span class="n">a</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
</code></pre></div></div>
<p>Fortunately, we find some information about comparisons in the standard.
IEEE 754 1985 defines any comparisons with NaNs as unordered, requiring false to be returned <a class="citation" href="#wikipedianan">[10]</a>.
This means, <code class="language-plaintext highlighter-rouge">5.f &gt; sNaN</code> is false, as well as <code class="language-plaintext highlighter-rouge">sNaN &gt; 5.f</code>.
Also things like <code class="language-plaintext highlighter-rouge">sNaN == sNaN</code> evaluate to false.
So if every comparison with NaN is false, our maximum/minimum instruction defined by order relations will always return the second operand (b)
if one or more operands are NaN.
And that’s exactly what you see with x64’s <code class="language-plaintext highlighter-rouge">maxss</code> instruction.</p>

<p>A few years later, the IEEE 754 2008 standard was published, which finally included a definition of the maximum/minimum operation
(see subsection 5.3.1 General operations, <code class="language-plaintext highlighter-rouge">maxNum</code> and <code class="language-plaintext highlighter-rouge">minNum</code>).
According to this standard, maximum/minimum should return a qNaN when one of the operands is an sNaN.
If only one of the operands is a qNaN, the number shall be returned.
This definition was adopted by the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> for the <code class="language-plaintext highlighter-rouge">fmax</code>/<code class="language-plaintext highlighter-rouge">fmin</code> instruction and kept until version 2.2.
In comparison to <code class="language-plaintext highlighter-rouge">maxss</code>, this instruction is commutative, which is what a maximum/minimum operation should be in my opinion.
So apparently, the experts thought about commutativity, but a closer look reveals they forgot about associativity.
In his article <em>The IEEE Standard 754: One for the History Books</em> <a class="citation" href="#hough2019">[11]</a> the author David G. Hough confirms
that the aspect of associativity in the presence of NaNs was simply overlooked.
To show you what is meant by this, consider the following operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>max(6.f, max(5.f, sNaN)) = max(6.f, qNaN) = 6.f
max(max(6.f, 5.f), sNaN) = max(6.f, sNaN) = qNaN
</code></pre></div></div>
<p>If you just follow the standard, you get different results depending on the way the operations are associated.
That sounds like a possible source of trouble, so the experts rectified the definition in the IEEE 754 2019 standard.</p>

<p>To be more precise, they replaced <code class="language-plaintext highlighter-rouge">maxNum</code> and <code class="language-plaintext highlighter-rouge">minNum</code> with the associative operations
<code class="language-plaintext highlighter-rouge">maximumNumber</code> and <code class="language-plaintext highlighter-rouge">minimumNumber</code>.
They also introduced <code class="language-plaintext highlighter-rouge">maximum</code> and <code class="language-plaintext highlighter-rouge">minimum</code>, but these are not relevant in the context of RISC-V.
These new operations simply do not turn sNaNs into qNaNs, which makes them associative and commutative.
Since RISC-V tries to adhere to the IEEE 754 standard and is also not afraid to change things,
the <code class="language-plaintext highlighter-rouge">fmax</code> and <code class="language-plaintext highlighter-rouge">fmin</code> instructions were adjusted in version 2.2.
So here we are. We just needed 34 years to figure out what the maximum/minimum of two values is.</p>

<p>Besides maximum and minimum, other instructions like fused multiply-add and float-to-integer conversions also show slightly different behavior.
Execute the following program on x64 and RISC-V to see it with your own eyes:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cfenv&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;cmath&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;limits&gt;</span><span class="cp">
</span>
<span class="k">template</span> <span class="o">&lt;</span><span class="k">typename</span> <span class="nc">T</span><span class="p">&gt;</span>
<span class="k">using</span> <span class="n">nl</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">numeric_limits</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="c1">// Maximum/Minimum</span>
  <span class="kt">float</span> <span class="n">res1</span> <span class="o">=</span> <span class="n">nl</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">signaling_NaN</span><span class="p">();</span>
  <span class="kt">float</span> <span class="n">res2</span> <span class="o">=</span> <span class="mf">5.</span><span class="n">f</span><span class="p">;</span>
<span class="cp">#ifdef __x86_64
</span>  <span class="c1">// AT&amp;T syntax: the destination %0 is both read and written, hence "+x".</span>
  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"maxss %1, %0"</span> <span class="o">:</span> <span class="s">"+x"</span><span class="p">(</span><span class="n">res1</span><span class="p">)</span> <span class="o">:</span> <span class="s">"x"</span><span class="p">(</span><span class="mf">5.0</span><span class="n">f</span><span class="p">));</span>
  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"maxss %1, %0"</span> <span class="o">:</span> <span class="s">"+x"</span><span class="p">(</span><span class="n">res2</span><span class="p">)</span> <span class="o">:</span> <span class="s">"x"</span><span class="p">(</span><span class="n">nl</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">signaling_NaN</span><span class="p">()));</span>
<span class="cp">#elif __riscv
</span>  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"fmax.s %0, %1, %2"</span> <span class="o">:</span><span class="s">"=f"</span><span class="p">(</span><span class="n">res1</span><span class="p">)</span> <span class="o">:</span> <span class="s">"f"</span><span class="p">(</span><span class="mf">5.0</span><span class="n">f</span><span class="p">)</span> <span class="p">,</span> <span class="s">"f"</span><span class="p">(</span><span class="n">res1</span><span class="p">));</span>
  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"fmax.s %0, %1, %2"</span> <span class="o">:</span><span class="s">"=f"</span><span class="p">(</span><span class="n">res2</span><span class="p">)</span> <span class="o">:</span> <span class="s">"f"</span><span class="p">(</span><span class="n">nl</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">signaling_NaN</span><span class="p">())</span> <span class="p">,</span> <span class="s">"f"</span><span class="p">(</span><span class="n">res2</span><span class="p">));</span>
<span class="cp">#else
</span>  <span class="k">static_assert</span><span class="p">(</span><span class="nb">false</span><span class="p">,</span> <span class="s">"No architecture detected."</span><span class="p">);</span>
<span class="cp">#endif
</span>  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"max(sNaN, 5.f) = "</span> <span class="o">&lt;&lt;</span> <span class="n">res1</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span>
            <span class="o">&lt;&lt;</span> <span class="s">"max(5.f, sNaN) = "</span> <span class="o">&lt;&lt;</span> <span class="n">res2</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="c1">// Fused Multiply-Add</span>
  <span class="n">std</span><span class="o">::</span><span class="n">feclearexcept</span><span class="p">(</span><span class="n">FE_ALL_EXCEPT</span><span class="p">);</span>
  <span class="kt">float</span> <span class="n">res</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">fma</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">nl</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">infinity</span><span class="p">(),</span> <span class="n">nl</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">quiet_NaN</span><span class="p">());</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Invalid: "</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">fetestexcept</span><span class="p">(</span><span class="n">FE_INVALID</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="c1">// Float to Integer</span>
  <span class="k">volatile</span> <span class="kt">float</span> <span class="n">a</span> <span class="o">=</span> <span class="mf">2e10</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"(int)2e10 = "</span> <span class="o">&lt;&lt;</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span> <span class="n">a</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>On x64 the output is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>max(sNaN, 5.f) = 5
max(5.f, sNaN) = nan
Invalid: 0
(int)2e10 = -2147483648
</code></pre></div></div>
<p>On RISC-V you get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>max(sNaN, 5.f) = 5
max(5.f, sNaN) = 5
Invalid: 16
(int)2e10 = 2147483647
</code></pre></div></div>


<h3 id="43-the-missing-rounding-mode">4.3 The Missing Rounding Mode</h3>
<p>As already explained in the background section, x64 lacks the “roundTiesToAway” rounding mode, which was introduced in the IEEE 754 2008 standard.
So, whenever we want to simulate RISC-V <abbr title="Floating Point">FP</abbr> instructions under “roundTiesToAway”, the host’s <abbr title="Floating Point Unit">FPU</abbr> cannot be used.
Yet, this is a corner case, as most applications just use the default <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding mode.</p>

<h3 id="44-nan-boxing">4.4 NaN Boxing</h3>
<p>Now to a unique feature/clarification that was introduced in 2017 with version 2.2 of the RISC-V <abbr title="Floating Point">FP</abbr> extensions <a class="citation" href="#riscv2017">[12]</a>.
Until version 2.2, there was no definition of how 32-bit <abbr title="Floating Point">FP</abbr> values are encoded in 64-bit registers.
This can lead to several problems as described in <a class="citation" href="#nan-box-rfc">[13]</a> and <a class="citation" href="#nan-box-google">[14]</a>.
After a lively discussion, the chosen solution was a NaN boxing scheme, which was used in no other <abbr title="Instruction Set Architecture">ISA</abbr> at that point as far as I know
(remark: in 2019 OpenRISC 1000 also adopted NaN Boxing with version 1.3).
That means, if a 32-bit <abbr title="Floating Point">FP</abbr> value is stored in a 64-bit <abbr title="Floating Point">FP</abbr> register, the upper 32 bits are set to 1’s.
Hence, the 32-bit <abbr title="Floating Point">FP</abbr> value is basically a payload of a 64-bit negative qNaN.</p>

<p>This gives you some advantage in terms of debugging capabilities, but requires additional treatment for emulation.
If you want to see NaN boxing in action, execute the following code on RISC-V and x64:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">const</span> <span class="kt">float</span> <span class="n">a</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.</span><span class="n">f</span><span class="p">;</span>
  <span class="k">const</span> <span class="kt">double</span> <span class="n">b</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.</span><span class="p">;</span>
  <span class="kt">double</span> <span class="n">out</span><span class="p">;</span>

<span class="cp">#ifdef __x86_64
</span>  <span class="c1">// Storing the float does not touch the upper bits.</span>
  <span class="c1">// Hence, the output is 0x8000000080000000 (-1.0609978955e-314).</span>
  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"movsd %2, %%xmm0 </span><span class="se">\n\t</span><span class="s">\
                movss %1, %%xmm0 </span><span class="se">\n\t</span><span class="s">\
                movsd %%xmm0, %0"</span>
                <span class="o">:</span> <span class="s">"=x"</span> <span class="p">(</span><span class="n">out</span><span class="p">)</span> <span class="o">:</span> <span class="s">"x"</span> <span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"x"</span> <span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">:</span> <span class="s">"xmm0"</span><span class="p">);</span>
<span class="cp">#elif __riscv
</span>  <span class="c1">// Output should be -qNaN due to RISC-V NaN boxing.</span>
  <span class="k">asm</span> <span class="k">volatile</span><span class="p">(</span><span class="s">"fmv.d f0, %2 </span><span class="se">\n\t</span><span class="s">\
                fmv.s f0, %1 </span><span class="se">\n\t</span><span class="s">\
                fmv.d %0, f0"</span>
                <span class="o">:</span> <span class="s">"=f"</span> <span class="p">(</span><span class="n">out</span><span class="p">)</span> <span class="o">:</span> <span class="s">"f"</span> <span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="s">"f"</span> <span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">:</span> <span class="s">"f0"</span><span class="p">);</span>
<span class="cp">#else
</span>  <span class="k">static_assert</span><span class="p">(</span><span class="nb">false</span><span class="p">,</span> <span class="s">"No architecture detected."</span><span class="p">);</span>
<span class="cp">#endif
</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"out = "</span> <span class="o">&lt;&lt;</span> <span class="n">out</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="45-nan-propagation">4.5 NaN Propagation</h3>
<p>A feature recommended but not mandated by IEEE 754 is NaN propagation.
The idea is to propagate the NaN payloads of the inputs through an instruction as some kind of diagnostic information.
It is part of x64 and ARM, but RISC-V doesn’t mandate it due to additional hardware costs.
To see what it looks like, execute the following code on x64 and RISC-V:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// x64:    0xffc00123</span>
<span class="c1">// RISC-V: 0x7fc00000</span>
<span class="cp">#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="kt">float</span> <span class="n">a</span> <span class="o">=</span> <span class="mf">0.</span><span class="n">f</span><span class="p">;</span>
  <span class="kt">float</span> <span class="n">b</span><span class="p">;</span>
  <span class="kt">unsigned</span> <span class="kt">int</span> <span class="o">*</span><span class="n">ai</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">a</span><span class="p">);</span>
  <span class="kt">unsigned</span> <span class="kt">int</span> <span class="o">*</span><span class="n">bi</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o">&lt;</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="o">*&gt;</span><span class="p">(</span><span class="o">&amp;</span><span class="n">b</span><span class="p">);</span>
  <span class="o">*</span><span class="n">bi</span> <span class="o">=</span> <span class="mh">0xffc00123</span><span class="p">;</span>
  <span class="n">a</span> <span class="o">+=</span> <span class="n">b</span><span class="p">;</span>

  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">hex</span> <span class="o">&lt;&lt;</span> <span class="s">"0x"</span> <span class="o">&lt;&lt;</span> <span class="o">*</span><span class="n">ai</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

</code></pre></div></div>

<h3 id="46-floating-point-exception-flags">4.6 Floating Point Exception Flags</h3>
<p>Whenever <abbr title="Floating Point">FP</abbr> instructions are executed, certain exceptions may occur.
The IEEE 754 standard defines 5 exception flags which indicate irregularities during an instruction’s execution:</p>
<ul>
  <li>invalid (e.g.: $\infty-\infty = qNaN$)</li>
  <li>underflow (e.g.: $(1.5046328 \times 10^{-36})^2=0$)</li>
  <li>overflow (e.g.: $(1.5845633 \times 10^{29})^2=\infty$)</li>
  <li>inexact (e.g.: $0.00390625+65536=65536$)</li>
  <li>divide-by-zero (e.g.: $1/+0 = \infty$)</li>
</ul>
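
<p>To see these flags in action, the following sketch reproduces the five examples from the list with the standard <code>&lt;cfenv&gt;</code> interface.
The names <code>flags_raised</code>, <code>kInf</code>, and <code>check_flag_examples</code> are made up for illustration.
The sketch assumes a host that evaluates <code>float</code> arithmetic in single precision (e.g. x64 with SSE) and default, non-trapping exception handling.</p>

```cpp
#include <cassert>
#include <cfenv>
#include <limits>

const float kInf = std::numeric_limits<float>::infinity();

// Computes `a op b` at run time and returns the exception flags it raised.
// Hypothetical helper for illustration; op is one of '+', '-', '*', '/'.
int flags_raised(char op, float a, float b) {
  // volatile keeps the compiler from evaluating the operation at compile time
  volatile float x = a, y = b, r = 0.0f;
  std::feclearexcept(FE_ALL_EXCEPT);
  switch (op) {
    case '+': r = x + y; break;
    case '-': r = x - y; break;
    case '*': r = x * y; break;
    case '/': r = x / y; break;
  }
  return std::fetestexcept(FE_ALL_EXCEPT);
}

// The five examples from the list above.
void check_flag_examples() {
  assert(flags_raised('-', kInf, kInf) & FE_INVALID);
  assert(flags_raised('*', 1.5046328e-36f, 1.5046328e-36f) & FE_UNDERFLOW);
  assert(flags_raised('*', 1.5845633e29f, 1.5845633e29f) & FE_OVERFLOW);
  assert(flags_raised('+', 0.00390625f, 65536.0f) & FE_INEXACT);
  assert(flags_raised('/', 1.0f, 0.0f) & FE_DIVBYZERO);
}
```

<p>Note that underflow and overflow usually come together with inexact, so the returned value can contain more than one flag.</p>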

<p>This was already defined in the first standard and hasn’t changed.
So, what is the problem if RISC-V and x64 are equal in this regard?
Finding a working solution isn’t the problem, but having a fast one is.</p>

<p>But let me begin with the naive approach, which I call <em><abbr title="Floating Point Unit">FPU</abbr> guards</em>.
It involves the following steps to load and save the <abbr title="Floating Point">FP</abbr> exception flags from the <code class="language-plaintext highlighter-rouge">mxcsr</code> register:</p>
<ol>
  <li>Save host <abbr title="Floating Point Unit">FPU</abbr> state</li>
  <li>Load target <abbr title="Floating Point Unit">FPU</abbr> state</li>
  <li>Execute target <abbr title="Floating Point">FP</abbr> instruction(s)</li>
  <li>Save target <abbr title="Floating Point Unit">FPU</abbr> state</li>
  <li>Load host <abbr title="Floating Point Unit">FPU</abbr> state</li>
</ol>

<p>Or in C++ terms, it could look like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;cfenv&gt;</span><span class="cp">
</span>
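<span class="k">struct</span> <span class="nc">fpu_guard</span> <span class="p">{</span>
  <span class="n">std</span><span class="o">::</span><span class="n">fenv_t</span> <span class="n">host_env</span><span class="p">;</span>    <span class="c1">// host FPU state, saved while target code runs</span>
  <span class="n">std</span><span class="o">::</span><span class="n">fenv_t</span> <span class="n">target_env</span><span class="p">;</span>  <span class="c1">// FPU state of the simulated target</span>

  <span class="n">fpu_guard</span><span class="p">()</span> <span class="p">{</span> <span class="n">std</span><span class="o">::</span><span class="n">fegetenv</span><span class="p">(</span><span class="o">&amp;</span><span class="n">target_env</span><span class="p">);</span> <span class="p">}</span>  <span class="c1">// target starts from the current state</span>

  <span class="kt">void</span> <span class="n">lock</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">fegetenv</span><span class="p">(</span><span class="o">&amp;</span><span class="n">host_env</span><span class="p">);</span>    <span class="c1">// 1. save host FPU state</span>
    <span class="n">std</span><span class="o">::</span><span class="n">fesetenv</span><span class="p">(</span><span class="o">&amp;</span><span class="n">target_env</span><span class="p">);</span>  <span class="c1">// 2. load target FPU state</span>
  <span class="p">}</span>

  <span class="kt">void</span> <span class="nf">unlock</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">std</span><span class="o">::</span><span class="n">fegetenv</span><span class="p">(</span><span class="o">&amp;</span><span class="n">target_env</span><span class="p">);</span>  <span class="c1">// 4. save target FPU state</span>
    <span class="n">std</span><span class="o">::</span><span class="n">fesetenv</span><span class="p">(</span><span class="o">&amp;</span><span class="n">host_env</span><span class="p">);</span>    <span class="c1">// 5. load host FPU state</span>
  <span class="p">}</span>
<span class="p">};</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="n">fpu_guard</span> <span class="n">fg</span><span class="p">;</span>
  <span class="kt">float</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="o">=</span> <span class="mf">1.</span><span class="n">f</span><span class="p">,</span> <span class="n">c</span> <span class="o">=</span> <span class="mf">2.</span><span class="n">f</span><span class="p">;</span>

  <span class="n">fg</span><span class="p">.</span><span class="n">lock</span><span class="p">();</span>
  <span class="n">a</span> <span class="o">=</span> <span class="n">b</span> <span class="o">+</span> <span class="n">c</span><span class="p">;</span>  <span class="c1">// 3. execute target FP instruction(s)</span>
  <span class="n">fg</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>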
</code></pre></div></div>

<p>It’s simple, maintainable, and <abbr title="Instruction Set Architecture">ISA</abbr>-agnostic. So why not use it?
Because it is ridiculously slow.
The lock guard, including the <abbr title="Floating Point">FP</abbr> operation, comprises just a few instructions, so you’d expect a performance
in the range of 100-1000 MIPS.
But what you get is merely 2-4 MIPS, even on the most modern machines.</p>

<p>As a computer engineer, it’s my passion to explore such mysteries, which is what I will do in the rest of this subsection.
The slow part of my code is obviously the lock guard, which is implemented by <code class="language-plaintext highlighter-rouge">fegetenv</code> and <code class="language-plaintext highlighter-rouge">fesetenv</code> from the standard library.
Consequently, analyzing the corresponding code in glibc seems to be the next logical step.
With a few minutes of research, I found the <a href="https://github.com/bminor/glibc/blob/master/sysdeps/x86_64/fpu/fegetenv.c">following code</a>
(which I also deconvoluted and commented a little bit) for the <code class="language-plaintext highlighter-rouge">fegetenv</code> function.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">__fegetenv</span> <span class="p">(</span><span class="n">fenv_t</span> <span class="o">*</span><span class="n">envp</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// x87 state: fnstenv masks all x87 exceptions as a side effect,</span>
  <span class="c1">// so the stored environment is loaded right back</span>
  <span class="n">__asm__</span> <span class="p">(</span><span class="s">"fnstenv %0</span><span class="se">\n</span><span class="s">"</span> <span class="o">:</span> <span class="s">"=m"</span> <span class="p">(</span><span class="o">*</span><span class="n">envp</span><span class="p">));</span>
  <span class="n">__asm__</span> <span class="p">(</span><span class="s">"fldenv %0</span><span class="se">\n</span><span class="s">"</span> <span class="o">:</span> <span class="o">:</span> <span class="s">"m"</span> <span class="p">(</span><span class="o">*</span><span class="n">envp</span><span class="p">));</span>

  <span class="c1">// SSE state</span>
  <span class="n">__asm__</span> <span class="p">(</span><span class="s">"stmxcsr %0</span><span class="se">\n</span><span class="s">"</span> <span class="o">:</span> <span class="s">"=m"</span> <span class="p">(</span><span class="n">envp</span><span class="o">-&gt;</span><span class="n">__mxcsr</span><span class="p">));</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As you can see, it only comprises 3 instructions.
Two of them are responsible for the x87 part (yey, legacy), while only one is needed to fetch the <code class="language-plaintext highlighter-rouge">mxcsr</code> register.
In a profiling run, I could see the x87 part taking about 90% of the total execution time of the function.
That’s a big share, considering that x87 is an obsolete extension for which compilers, with a few exceptions, no longer generate code. <br />
So, I decided to remove the x87 instructions and reevaluate the performance.
Now it was faster, but still far away from my expectations.
Since there’s only one remaining instruction, the culprit is more or less clear.
In the infinite realms of the internet, I found this cool <a href="https://www.agner.org/optimize/instruction_tables.pdf">website/document</a>,
which lists the throughput and latency of all x64 instructions.
The following table summarizes it for the LDMXCSR and STMXCSR instructions (load and store of the MXCSR register).</p>

<table style="width:80%" class="center">
  <tr>
    <td>µArch</td>
    <td colspan="2">Latency (cycles)</td>
    <td colspan="2">Reciprocal Throughput (cycles)</td>
  </tr>
  <tr>
    <td>  </td>
    <td> LDMXCSR</td>
    <td> STMXCSR</td>
    <td> LDMXCSR</td>
    <td> STMXCSR</td>
  </tr>
  <tr>
    <td> AMD Zen 2</td>
    <td> - </td>
    <td> - </td>
    <td> 17 </td>
    <td> 16 </td>
  </tr>
  <tr>
    <td> AMD Zen 3</td>
    <td> 13 </td>
    <td> 13 </td>
    <td> 20 </td>
    <td> 15 </td>
  </tr>
  <tr>
    <td> AMD Zen 4</td>
    <td> 13 </td>
    <td> 13 </td>
    <td> 21 </td>
    <td> 15 </td>
  </tr>
  <tr>
    <td> Intel Coffee Lake</td>
    <td> 5 </td>
    <td> 4 </td>
    <td> 3 </td>
    <td> 1 </td>
  </tr>
  <tr>
    <td> Intel Cannon Lake</td>
    <td> 5 </td>
    <td> 4 </td>
    <td> 3 </td>
    <td> 1 </td>
  </tr>
  <tr>
    <td> Intel Ice Lake</td>
    <td> 6 </td>
    <td> 4 </td>
    <td> 3 </td>
    <td> 1 </td>
  </tr>
</table>

<p>As you can see in the table, executing these instructions is relatively costly (13 cycles of latency on AMD Zen 3 and Zen 4).
Surprisingly, AMD also performs much worse than Intel.
Since I used an AMD machine for my benchmarks, better results could have been obtained with an Intel CPU.
Anyway, I ultimately wanted an approach that works well on all microarchitectures, so I decided to go for something different as shown later.</p>

<p>A possible approach to hide the cost of LDMXCSR and STMXCSR is to only invoke them when the simulator switches between the generated code and
the host environment. As already hinted in the <abbr title="Floating Point Unit">FPU</abbr> guard description, multiple instructions can sit between LDMXCSR and STMXCSR.
I guess this allows for reasonable performance, but it drastically reduces code modularity.
It also increases the cost of switching between simulator and simulated code.
So, in the end, I took a different route.</p>
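
<p>For reference, the batching idea could be sketched roughly as follows with the portable <code>&lt;cfenv&gt;</code> interface.
The names <code>run_block</code> and <code>demo_run_block</code> are hypothetical, and the "translated block" is just a loop of additions.
It is also an assumption of this sketch that the target starts every block from the default <abbr title="Floating Point Unit">FPU</abbr> state.</p>

```cpp
#include <cassert>
#include <cfenv>
#include <vector>

// Sums `values` as one "translated block" and returns the flags raised by it.
// The host FPU state is saved once before and restored once after the block,
// no matter how many FP operations the block contains.
int run_block(const std::vector<float>& values, float& sum) {
  std::fenv_t host_env;
  std::fegetenv(&host_env);   // leave the host environment
  std::fesetenv(FE_DFL_ENV);  // assumption: target starts from the default state

  volatile float acc = 0.0f;  // volatile: force the additions to happen right here
  for (float v : values) acc = acc + v;
  sum = acc;

  int target_flags = std::fetestexcept(FE_ALL_EXCEPT);
  std::fesetenv(&host_env);   // re-enter the host environment
  return target_flags;
}

void demo_run_block() {
  std::feclearexcept(FE_ALL_EXCEPT);
  std::feraiseexcept(FE_DIVBYZERO);         // pretend the host raised a flag earlier
  float sum = 0.0f;
  int flags = run_block({65536.0f, 0.00390625f, 1.0f}, sum);
  assert(flags & FE_INEXACT);               // raised inside the block
  assert(sum == 65537.0f);
  assert(std::fetestexcept(FE_DIVBYZERO));  // the host flag survived
  assert(!std::fetestexcept(FE_INEXACT));   // the target flag did not leak to the host
}
```

<p>The cost of the two environment switches is now amortized over the whole block, but every exit from the block back into simulator code has to go through the restore.</p>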

<p>But before I present that, the next section shows how other simulators deal with all these problems.</p>

<h2 id="5-how-other-simulators-work">5. How Other Simulators Work</h2>
<p>Whenever I code something, I try to get some inspiration from other projects first.
Or as one of my colleagues said:
“Before you code something simulation-related, ask yourself: What would QEMU do?” <br />
Wise words to live by, so the next sections dissect the <abbr title="Floating Point">FP</abbr> implementations of a few simulators, such as QEMU, rv8, and gem5.
I also present all academic works that have been published in this field to this date (2023-11-11).
Don’t worry, it’s only 3 papers.</p>

<h3 id="51-soft-float">5.1 Soft Float</h3>
<p>The open-source projects
gem5 <a class="citation" href="#gem52011">[15]</a>,
Spike <a class="citation" href="#spike2022">[16]</a>,
Uni Bremen RISC-V VP <a class="citation" href="#riscvvp-bremen">[17], [18]</a>,
Whisper <a class="citation" href="#whispergithub">[19]</a>,
Bochs <a class="citation" href="#bochsgithub">[20], [21]</a>,
rvsim <a class="citation" href="#rvsimgithub">[22]</a>,
<!-- TODO: Another example: https://github.com/SDL-Hercules-390/hyperion -->
and QEMU <a class="citation" href="#bellard2005">[23]</a> pre-v4.0.0,
all use a method called <em>soft float</em> to simulate <abbr title="Floating Point">FP</abbr> arithmetic.
Note that QEMU changed to a different approach in version 4.0.0, but more on that later.
The idea of soft float is to use integer arithmetic and boolean operations to mimic arbitrary <abbr title="Floating Point">FP</abbr> behavior.
It often comes as a C/C++ library, making it easy to integrate.
For example, all simulators listed above use the open-source library <em>Berkeley SoftFloat</em> by J. Hauser <a class="citation" href="#hauser1996">[24]</a>,
which is based on the IEEE 754-1985 standard.
Soft float libraries that implement the more recent IEEE 754-2008 standard include
<em>SoftFP</em> by F. Bellard <a class="citation" href="#bellard2018">[25]</a>,
and <em>FLIP</em> by C.-P. Jeannerod et al. <a class="citation" href="#flip2004">[26]</a>.
Besides generic solutions in programming languages like C, there are also architecture-optimized soft float libraries.
For example, <em>RVfplib</em> <a class="citation" href="#perotti2022">[27]</a> is an optimized soft float library for RISC-V systems that do not include the F or D extension.</p>

<p>The availability of multiple open-source libraries and the ease of use make it the most popular <abbr title="Floating Point">FP</abbr> arithmetic simulation approach.
If you are starting to develop your own simulator, I recommend using it for the first proof of concept.
That’s also what we did at MachineWare.
Yet, the performance might be somewhat disappointing.
Using tens or hundreds of integer instructions to simulate one <abbr title="Floating Point">FP</abbr> instruction can easily reduce your performance by that same factor.
Some exact slowdown factors are provided in the <a href="#hard-floppy-soft">results section</a>.</p>
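
<p>To give a feeling for what soft float code looks like, here is a heavily reduced sketch of a single-precision multiplication that only uses integer operations.
It is not taken from any of the libraries above, and all names (<code>soft_mul_f32</code>, <code>soft_result</code>, <code>to_bits</code>) are made up.
It assumes that both inputs and the result are normal numbers, always rounds to nearest with ties to even, and only tracks the inexact flag.
A real library additionally has to handle zeros, subnormals, infinities, NaNs, overflow, underflow, and the other rounding modes.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct soft_result {
  std::uint32_t bits;  // binary32 encoding of the product
  bool inexact;        // true if the product had to be rounded
};

// Reinterprets a float as its binary32 bit pattern.
std::uint32_t to_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Multiplies two binary32 values with integer arithmetic only.
// Assumption: inputs and result are normal numbers, rounding mode is RNE.
soft_result soft_mul_f32(std::uint32_t a, std::uint32_t b) {
  std::uint32_t sign = (a ^ b) & 0x80000000u;
  std::int32_t exp = std::int32_t((a >> 23) & 0xff) + std::int32_t((b >> 23) & 0xff) - 127;
  std::uint64_t ma = (a & 0x7fffffu) | 0x800000u;  // 24-bit significand with implicit 1
  std::uint64_t mb = (b & 0x7fffffu) | 0x800000u;
  std::uint64_t prod = ma * mb;                    // lies in [2^46, 2^48)

  // Normalize so that the leading 1 of the product sits at bit 47.
  if (prod & (1ull << 47)) exp += 1; else prod <<= 1;

  std::uint32_t mant = std::uint32_t(prod >> 24);        // upper 24 bits: significand
  std::uint32_t rest = std::uint32_t(prod & 0xffffffu);  // lower 24 bits: discarded part

  // Round to nearest, ties to even.
  if (rest > 0x800000u || (rest == 0x800000u && (mant & 1u))) {
    mant += 1;
    if (mant == 0x1000000u) { mant >>= 1; exp += 1; }  // rounding carried into the next binade
  }
  return {sign | (std::uint32_t(exp) << 23) | (mant & 0x7fffffu), rest != 0};
}

void demo_soft_mul() {
  soft_result exact = soft_mul_f32(to_bits(1.5f), to_bits(2.5f));
  assert(exact.bits == to_bits(3.75f) && !exact.inexact);
  soft_result rounded = soft_mul_f32(to_bits(1.1f), to_bits(3.3f));
  assert(rounded.bits == to_bits(1.1f * 3.3f) && rounded.inexact);  // matches the host FPU
}
```

<p>Even this stripped-down version needs a dozen or so integer operations for what is a single instruction on the host <abbr title="Floating Point Unit">FPU</abbr>.</p>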

<p>If you want to enjoy the full pain of coding your own soft float library,
the <em>Handbook of Floating Point Arithmetic</em> <a class="citation" href="#muller2010">[28]</a> provides you with all the necessary background information.</p>

<h3 id="52-rv8">5.2 rv8</h3>
<p>The open-source project rv8 <a class="citation" href="#rv8">[29], [30]</a> is a DBT-based RISC-V simulator for x64 hosts.
With rv8, the RISC-V target rounding mode and exception flags are mapped 1-to-1 to the x64 host.
So, it’s basically the <abbr title="Floating Point Unit">FPU</abbr> guard approach that I explained in Subsection <a href="#46-floating-point-exception-flags">4.6 Floating Point Exception Flags</a>.
Hence, checking and setting the target exception flags is simply achieved by accessing the x64 host’s <code class="language-plaintext highlighter-rouge">mxcsr</code> register.
But besides the poor performance of <abbr title="Floating Point Unit">FPU</abbr> guards on certain AMD microarchitectures, mapping the rounding modes is also a problem.
Because x64 simply misses the <abbr title="Round to Nearest, Ties to Maximum Magnitude">RMM</abbr> rounding mode (see <a href="#43-the-missing-rounding-mode">4.3 The Missing Rounding Mode</a>)!
So, let’s take a look at rv8’s code to see how it solves this problem (<a href="https://github.com/michaeljclark/rv8/blob/834259098a5c182874aac97d82a164d144244e1a/src/asm/fpu.h#L175">rv8/src/asm/fpu.h</a>:9):</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">inline</span> <span class="kt">void</span> <span class="nf">fenv_setrm</span><span class="p">(</span><span class="kt">int</span> <span class="n">rm</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">x86_mxcsr_val</span> <span class="o">=</span> <span class="n">__builtin_ia32_stmxcsr</span><span class="p">();</span>
    <span class="n">x86_mxcsr_val</span> <span class="o">&amp;=</span> <span class="o">~</span><span class="n">x86_mxcsr_RC_RZ</span><span class="p">;</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">rm</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">rv_rm_rne</span><span class="p">:</span> <span class="n">x86_mxcsr_val</span> <span class="o">|=</span> <span class="n">x86_mxcsr_RC_RN</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="n">rv_rm_rtz</span><span class="p">:</span> <span class="n">x86_mxcsr_val</span> <span class="o">|=</span> <span class="n">x86_mxcsr_RC_RZ</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="n">rv_rm_rdn</span><span class="p">:</span> <span class="n">x86_mxcsr_val</span> <span class="o">|=</span> <span class="n">x86_mxcsr_RC_DN</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="n">rv_rm_rup</span><span class="p">:</span> <span class="n">x86_mxcsr_val</span> <span class="o">|=</span> <span class="n">x86_mxcsr_RC_UP</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="n">rv_rm_rmm</span><span class="p">:</span> <span class="n">x86_mxcsr_val</span> <span class="o">|=</span> <span class="n">x86_mxcsr_RC_RN</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">__builtin_ia32_ldmxcsr</span><span class="p">(</span><span class="n">x86_mxcsr_val</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the function <code class="language-plaintext highlighter-rouge">fenv_setrm(int rm)</code> the RISC-V rounding mode is loaded into the host <abbr title="Floating Point Unit">FPU</abbr>.
As you can see, the missing rounding mode <abbr title="Round to Nearest, Ties to Maximum Magnitude">RMM</abbr> of x64 is simply mapped to <abbr title="Round to Nearest, Ties to Even">RNE</abbr>!
This is not correct and leads to rv8 not being compliant with the official RISC-V standard.</p>
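
<p>The two modes only differ for results that lie exactly halfway between two representable numbers, which the following sketch illustrates.
The function <code>add_rmm</code> is a hypothetical helper and not part of rv8.
It assumes that the sum of both inputs is exactly representable as a <code>double</code> (true as long as their exponents are not too far apart) and that the host uses its default rounding mode.</p>

```cpp
#include <cassert>
#include <cmath>

// Adds two floats and rounds the result with RMM (ties away from zero).
// Assumption: double(a) + double(b) is exact, host rounding mode is RNE.
float add_rmm(float a, float b) {
  double exact = double(a) + double(b);
  float rne = float(exact);           // host conversion rounds with RNE
  double err = exact - double(rne);   // signed rounding error of the RNE result
  if (err == 0.0) return rne;         // exact result, rounding mode is irrelevant

  // The representable neighbor on the other side of the exact result.
  float other = std::nextafter(rne, err > 0.0 ? INFINITY : -INFINITY);
  bool tie = std::fabs(err) == std::fabs(double(other) - exact);
  if (tie && std::fabs(other) > std::fabs(rne)) return other;
  return rne;
}

void demo_rmm() {
  const float half_ulp = std::ldexp(1.0f, -24);    // half an ulp of 1.0f
  const float next = std::nextafter(1.0f, 2.0f);   // 1 + 2^-23
  float rne = 1.0f + half_ulp;
  assert(rne == 1.0f);                             // RNE: tie goes to the even significand
  assert(add_rmm(1.0f, half_ulp) == next);         // RMM: tie goes away from zero
  assert(add_rmm(-1.0f, -half_ulp) == -next);
  assert(add_rmm(1.0f, 2.0f) == 3.0f);             // non-tie cases are unaffected
}
```

<p>So a target program that selects <abbr title="Round to Nearest, Ties to Maximum Magnitude">RMM</abbr> and hits such a tie gets a result that is off by one unit in the last place when the mode is silently replaced by <abbr title="Round to Nearest, Ties to Even">RNE</abbr>.</p>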

<p>The other problems, such as semantically different instructions or NaN boxing, are solved by rectifications in software.
Furthermore, <abbr title="Floating Point">FP</abbr> instructions are not directly translated, but use an interpreter.
This interpreter falls back to standard C++ operators to implement RISC-V instructions.
For example, the following code shows the implementation of the <code class="language-plaintext highlighter-rouge">fadd</code> and <code class="language-plaintext highlighter-rouge">fmax</code> instructions.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">P</span><span class="o">::</span><span class="n">ux</span> <span class="nf">exec_inst_rv32</span><span class="p">(</span><span class="n">T</span> <span class="o">&amp;</span><span class="n">dec</span><span class="p">,</span> <span class="n">P</span> <span class="o">&amp;</span><span class="n">proc</span><span class="p">,</span> <span class="n">P</span><span class="o">::</span><span class="n">ux</span> <span class="n">pc_offset</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// ...</span>
    <span class="k">switch</span> <span class="p">(</span><span class="n">dec</span><span class="p">.</span><span class="n">op</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">case</span> <span class="n">rv_op_fadd_s</span><span class="p">:</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">rvf</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">fenv_setrm</span><span class="p">((</span><span class="n">fcsr</span> <span class="o">&gt;&gt;</span> <span class="mi">5</span><span class="p">)</span> <span class="o">&amp;</span> <span class="mb">0b111</span><span class="p">);</span>
                <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rd</span><span class="p">]</span> <span class="o">=</span> <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs1</span><span class="p">]</span> <span class="o">+</span> <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs2</span><span class="p">];</span>
            <span class="p">}</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="k">case</span> <span class="n">rv_op_fmax_s</span><span class="p">:</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">rvf</span><span class="p">)</span> <span class="p">{</span>
                <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rd</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs1</span><span class="p">]</span> <span class="o">&gt;</span> <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs2</span><span class="p">])</span> <span class="o">||</span> <span class="n">isnan</span><span class="p">(</span><span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs2</span><span class="p">])</span>
                              <span class="o">?</span> <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs1</span><span class="p">]</span> <span class="o">:</span> <span class="n">freg</span><span class="p">[</span><span class="n">dec</span><span class="p">.</span><span class="n">rs2</span><span class="p">];</span>
            <span class="p">}</span>
            <span class="k">break</span><span class="p">;</span>
        <span class="c1">// ...</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="53-qemu-post-v400">5.3 QEMU post-v4.0.0</h3>
<p>As of version 4.0.0, QEMU’s slow soft float approach was replaced by the faster method of Guo et al. <a class="citation" href="#guo2016">[31]</a>.
Initially Guo et al. tried to calculate the result of an <abbr title="Floating Point">FP</abbr> instruction on the host <abbr title="Floating Point Unit">FPU</abbr> and determine the exception flags in software.
However, their way of calculating the inexact exception was so costly that ultimately no speedup compared to soft float was achieved.
Note that they could find a fast solution for additions, but more on that in Section <a href="#58-you-et-al">5.8 You et al.</a>. <br />
After their failed initial attempt, Guo et al. noticed an obvious but important detail:
the inexact exception is “sticky” and does not need to be recalculated if it was already set.
Or in other words: If an instruction sets the inexact flag, which is very likely, it does not need to be recalculated for all following instructions.
Well, if you clear the flag you have to recalculate it, but there’s almost no software that actually does this.
So, to avoid the high cost of the inexact calculation, an <abbr title="Floating Point">FP</abbr> operation is preceded by a quick check of whether the exception must be calculated at all.
<!-- Because with most ISAs, such as RISC-V or x64, the inexact exception is sticky.
That means, if an instruction has generated an inexact result, the inexact exception remains set, even if subsequent instructions produce exact results.
Since most applications do not clear the inexact exception and tend to generate an inexact result at some point,
it can be assumed that in most cases the inexact exception is already set.
Hence, there's no need to recalculate it for every instruction. --></p>

<p>An example for the square root instruction in QEMU using the method of Guo et al. is shown in the following, simplified, code from
<a href="https://github.com/qemu/qemu/blob/9ba37026fcf6b7f3f096c0cca3e1e7307802486b/fpu/softfloat.c#L4553C4-L4553C4">qemu/fpu/softfloat.c</a>:
(yes, despite not being a mere soft float implementation, the file is called “softfloat” ¯\_(ツ)_/¯ )</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="n">bool</span> <span class="nf">can_use_fpu</span><span class="p">(</span><span class="k">const</span> <span class="n">float_status</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">QEMU_NO_HARDFLOAT</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">false</span><span class="p">;</span>

    <span class="k">return</span> <span class="n">likely</span><span class="p">(</span><span class="n">s</span><span class="o">-&gt;</span><span class="n">f_excep_flags</span> <span class="o">&amp;</span> <span class="n">f_flag_inexact</span> <span class="o">&amp;&amp;</span> <span class="n">s</span><span class="o">-&gt;</span><span class="n">f_round_mode</span> <span class="o">==</span> <span class="n">f_round_near_even</span><span class="p">);</span>
<span class="p">}</span>

<span class="n">float32</span> <span class="nf">float32_sqrt</span><span class="p">(</span><span class="n">float32</span> <span class="n">xa</span><span class="p">,</span> <span class="n">float_status</span> <span class="o">*</span><span class="n">s</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">union_float32</span> <span class="n">ua</span><span class="p">,</span> <span class="n">ur</span><span class="p">;</span>

    <span class="n">ua</span><span class="p">.</span><span class="n">s</span> <span class="o">=</span> <span class="n">xa</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">can_use_fpu</span><span class="p">(</span><span class="n">s</span><span class="p">)))</span>
        <span class="k">goto</span> <span class="n">soft</span><span class="p">;</span>

    <span class="n">float32_input_flush1</span><span class="p">(</span><span class="o">&amp;</span><span class="n">ua</span><span class="p">.</span><span class="n">s</span><span class="p">,</span> <span class="n">s</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">float32_is_zero_or_normal</span><span class="p">(</span><span class="n">ua</span><span class="p">.</span><span class="n">s</span><span class="p">)</span> <span class="o">||</span> <span class="n">float32_is_neg</span><span class="p">(</span><span class="n">ua</span><span class="p">.</span><span class="n">s</span><span class="p">)))</span>
        <span class="k">goto</span> <span class="n">soft</span><span class="p">;</span>

    <span class="n">ur</span><span class="p">.</span><span class="n">h</span> <span class="o">=</span> <span class="n">sqrtf</span><span class="p">(</span><span class="n">ua</span><span class="p">.</span><span class="n">h</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">ur</span><span class="p">.</span><span class="n">s</span><span class="p">;</span>

    <span class="nl">soft:</span> <span class="k">return</span> <span class="n">soft_f32_sqrt</span><span class="p">(</span><span class="n">ua</span><span class="p">.</span><span class="n">s</span><span class="p">,</span> <span class="n">s</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>As you can see, the function <code class="language-plaintext highlighter-rouge">float32_sqrt</code> starts with a call to <code class="language-plaintext highlighter-rouge">can_use_fpu</code>.
Here QEMU checks whether the inexact flag must be calculated at all.
Moreover, the host <abbr title="Floating Point Unit">FPU</abbr> can only be used if target and host rounding mode are the same.
It is assumed that the default C rounding mode of <abbr title="Round to Nearest, Ties to Even">RNE</abbr> is used and not changed during execution.
Thus, a quick check of the target’s rounding mode suffices.
Since some target architectures like PowerPC require a non-sticky inexact exception,
the hard float path can be disabled at compile time by defining the macro <code class="language-plaintext highlighter-rouge">QEMU_NO_HARDFLOAT</code> accordingly.
Ultimately, it’s very unlikely that we have to resort to the soft float method, which is also hinted by the branch hint <code class="language-plaintext highlighter-rouge">unlikely</code>.</p>

<p>To also avoid setting the underflow and invalid exceptions, the soft float method is used if the input is negative or subnormal.
But again, subnormal values as well as negative inputs for <code class="language-plaintext highlighter-rouge">float32_sqrt</code> are very rare.
The idea of extending Guo’s method by checking both invalid and underflow flags was proposed by Cota et al. <a class="citation" href="#cota2019">[32]</a>.
It was also <a href="https://github.com/cota">E. G. Cota</a> who <a href="https://github.com/qemu/qemu/commit/a94b783952cc493cb241aabb1da8c7a830385baa">committed</a> the code to QEMU in 2018.
If all checks pass, which is the most probable case, the function <code class="language-plaintext highlighter-rouge">sqrtf</code> is called, resulting in a <code class="language-plaintext highlighter-rouge">sqrtss</code> instruction for x64 hosts.</p>

<p>With the new method of Guo and Cota, the performance of <abbr title="Floating Point">FP</abbr> instructions could be increased by a factor of more than $2\times$ in comparison to soft float.
However, this speedup is only attainable if an inexact exception occurs at some point and if the <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding mode is used.
Tackling the former issue, at least for additions,
Guo et al. developed a quick inexact check, which is pretty similar to the Fast2Sum algorithm by T. J. Dekker <a class="citation" href="#dekker1971">[33]</a>.</p>
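
<p>For illustration, this is what an inexact check based on the classic Fast2Sum algorithm could look like.
Note that this is plain Fast2Sum and not necessarily the exact check of Guo et al.; the names <code>add_with_inexact</code> and <code>sum_result</code> are made up.
The sketch assumes finite inputs, no overflow, the <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding mode, and a host that evaluates <code>float</code> arithmetic in single precision.</p>

```cpp
#include <cassert>
#include <cmath>
#include <utility>

struct sum_result {
  float sum;     // a + b, rounded by the host FPU
  bool inexact;  // true if rounding discarded something
};

// Adds two floats on the host FPU and derives the inexact flag in software.
// Assumption: finite inputs, no overflow, RNE rounding.
sum_result add_with_inexact(float a, float b) {
  if (std::fabs(a) < std::fabs(b)) std::swap(a, b);  // Fast2Sum requires |a| >= |b|
  // volatile keeps the compiler from simplifying (a + b) - a to b
  volatile float s = a + b;  // rounded sum
  volatile float z = s - a;  // the part of b that made it into s
  volatile float t = b - z;  // the part of b that was rounded away
  float sum = s;
  float lost = t;
  return {sum, lost != 0.0f};
}

void demo_fast2sum() {
  sum_result r1 = add_with_inexact(0.00390625f, 65536.0f);
  assert(r1.sum == 65536.0f && r1.inexact);   // the inexact example from Section 4.6
  sum_result r2 = add_with_inexact(1.0f, 2.0f);
  assert(r2.sum == 3.0f && !r2.inexact);
}
```

<p>The check costs two extra <abbr title="Floating Point">FP</abbr> operations and a comparison per addition, which is cheap compared to a full soft float addition.</p>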

<h3 id="54-rosetta-2">5.4 Rosetta 2</h3>
<p>Rosetta 2 is Apple’s x64-on-ARM emulator, which was introduced in 2020 to aid the transition from x64 to ARM-based Apple Silicon
<a class="citation" href="#rosetta22020">[34]</a>.
Although it translates instructions from x64 to ARM, which is not the focus of this post,
the underlying principle can be applied to other architectures as well.
In fact, I’m currently implementing a similar thing for RISC-V, but shhhh.</p>

<p>Since Apple does not disclose the technical details of their products, the following statements are based on internet sources.
In general, most problems of x64-to-ARM <abbr title="Floating Point">FP</abbr> simulation concern non-standard behavior and cases labeled as “implementation defined”.
For example, the <abbr title="Flush To Zero">FTZ</abbr> and <abbr title="Denormals As Zero">DAZ</abbr> flags of the x64 <abbr title="Instruction Set Architecture">ISA</abbr> are not part of the IEEE 754 standard.
These flags allow flushing the input and output of an instruction to zero individually.
Similarly, the ARM <abbr title="Instruction Set Architecture">ISA</abbr> also allows flushing numbers to zero, yet there is no way to control both input and output as on x64.</p>

<p>According to <a class="citation" href="#rosetta2022">[35]</a>, Apple introduced an alternate <abbr title="Floating Point">FP</abbr> mode to solve this problem in hardware.
By setting a certain bit in the ARM <abbr title="Floating Point">FP</abbr> control register, x64 <abbr title="Floating Point">FP</abbr> arithmetic can be mimicked.
While the Rosetta 2 approach allows for maximum performance, it requires full control of the <abbr title="Instruction Set Architecture">ISA</abbr> and silicon.
Shortly after Apple’s release of the M1 processor <a class="citation" href="#applem12020">[36]</a>,
the first physical implementation of the alternate <abbr title="Floating Point">FP</abbr> mode,
ARM officially included this mode in the ARMv8 <abbr title="Instruction Set Architecture">ISA</abbr>.
More specifically, it is part of the ARMv8.7 architecture extension from January 2021 <a class="citation" href="#armreference2022">[37]</a>
and is technically referred to as <code class="language-plaintext highlighter-rouge">FEAT_AFP</code> (fun fact: rumours say that AFP might also be interpreted as Apple Floating Point 🤔).
Thus, in the future, the alternate <abbr title="Floating Point">FP</abbr> mode could also find its way into the products of other manufacturers.</p>

<p>Interestingly, just recently I saw this <a href="https://www.phoronix.com/news/LoongArch-LBT-Linux-6.6">article</a> about Loongson’s <abbr title="Loongson Binary Translation">LBT</abbr> extension
for hardware-accelerated DBT.
The Loongson <abbr title="Instruction Set Architecture">ISA</abbr> manual and this article still lack important details, but I guess that parts of the additional hardware features
go in a similar direction as <code class="language-plaintext highlighter-rouge">FEAT_AFP</code>.</p>

<h3 id="55-dolphin">5.5 Dolphin</h3>
<p><a href="https://dolphin-emu.org/">Dolphin</a> is an open-source Wii and GameCube emulator.
Both consoles use a PowerPC CPU, which adheres to IEEE 754 and even adds some features beyond that.
Since GameCube and Wii accompanied my childhood, understanding how Dolphin handles <abbr title="Floating Point">FP</abbr> was initially more of a personal matter.
But it turned out to be actually interesting, because it provides some real-world examples
where not adhering to the architecture’s <abbr title="Floating Point">FP</abbr> specs might break things.</p>

<p>In general, Dolphin translates most PowerPC <abbr title="Floating Point">FP</abbr> instructions to your host’s instructions, ignoring all the pain points like correct NaNs or exception flags.
That allows for super fast emulation, the most important concern for gaming console emulation.
But it turns out that a handful of games actually rely on correct <abbr title="Floating Point">FP</abbr> emulation.
So, let’s take a look at two interesting cases.</p>

<p>The first case concerns correct qNaN generation.
As shown in Subsection <a href="#41-different-canonical-qnan-encodings">Different Canonical qNaN Encodings</a>, x86 generates negative canonical qNaNs while PowerPC generates positive qNaNs.
Apparently, the game “Dragon Ball: Revenge of King Piccolo” relies on positive qNaNs, otherwise the following happens
(video from the <a href="https://dolphin-emu.org/blog/2015/07/01/dolphin-progress-report-june-2015/#40-6704-optionally-emulate-accurate-nans-by-flacs">progress report June 2015</a>):</p>

<div style="text-align:center">
 <video controls="">
  <source src="/assets/fast_floating_point_simulation/CostlyWaterloggedBubblefish.mp4" type="video/mp4" />
</video>
</div>

<p>As you can see, 2 of the 5 enemies land behind the field. Unfortunately, you have to defeat all of them to progress.
To solve this bug, the variable <code class="language-plaintext highlighter-rouge">m_accurate_nans</code> was introduced by Tillmann Karras (<a href="https://github.com/dolphin-emu/dolphin/commit/aec38466d9c68a735da61786053030d4b333bcf0">commit here</a>).
It only enables accurate qNaN generation for games like “Dragon Ball: Revenge of King Piccolo”, to not unnecessarily cripple the performance of other games.</p>
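<p>To make the idea more tangible, here is a minimal sketch of what such a fix-up could look like.
This is not Dolphin’s actual implementation; the helper names <code class="language-plaintext highlighter-rouge">float_bits</code> and <code class="language-plaintext highlighter-rouge">add_with_positive_qnan</code> are made up for illustration, and I only cover a single-precision addition.
Whenever the host creates a fresh NaN, the result is replaced by the positive canonical qNaN:</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: returns the raw IEEE 754 bit pattern of a float.
inline std::uint32_t float_bits(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return u;
}

// Hypothetical helper: single-precision addition with a PowerPC-style result.
// If the host generates a fresh NaN (no NaN among the inputs), the positive
// canonical qNaN 0x7FC00000 is returned instead of whatever the host produced
// (an x86 host would produce the negative 0xFFC00000).
inline float add_with_positive_qnan(float a, float b) {
  float result = a + b;
  if (std::isnan(result) && !std::isnan(a) && !std::isnan(b)) {
    const std::uint32_t positive_qnan = 0x7FC00000u;
    std::memcpy(&result, &positive_qnan, sizeof(result));
  }
  return result;
}
```

<p>The additional NaN checks on every instruction are exactly the kind of overhead that you don’t want to pay for games that never look at the sign of a NaN.</p>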

<p>The second case is about correct <abbr title="Floating Point">FP</abbr> exception handling.
Similar to x86, PowerPC allows trapping on <abbr title="Floating Point">FP</abbr> exceptions.
However, this wasn’t modelled in Dolphin, because it would be costly to simulate and supposedly no game used this feature.
Well, it turns out there are two games that actually rely on proper division-by-zero exceptions.
The whole story is told in the <a href="https://dolphin-emu.org/blog/2021/11/13/dolphin-progress-report-september-and-october-2021/#50-15330-raise-program-exceptions-on-floating-point-exceptions-by-josjuice">progress report September and October 2021</a>,
but let me just give you a short TLDR.</p>

<p>The games that rely on this feature (“True Crime: New York City” and “Call of Duty: Finest Hour”) weren’t developed for GameCube but ported from a PlayStation 2 version
by a studio called <a href="https://en.wikipedia.org/wiki/Exakt_Entertainment">Exakt Entertainment</a>.
On PlayStation 2, a division by zero would yield the largest positive floating point number, while the GameCube (and also x86) follows the IEEE standard and generates infinity.
Since a normal number and infinity behave completely differently, their processing in subsequent instructions would lead to NaNs, which would then lead to the game crashing.
As a simple solution, the studio came up with the following idea: whenever there is a division by zero, the code traps and rectifies the result in the exception handler.
While this works for real hardware, Dolphin didn’t support <abbr title="Floating Point">FP</abbr> exception traps.</p>
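<p>The effect of such an exception handler can be sketched in a few lines.
Note that this is only my illustration of the idea and not the code of the games: the function name <code class="language-plaintext highlighter-rouge">div_with_ps2_fixup</code> is hypothetical, and keeping the sign of the IEEE result is an assumption on my side.</p>

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for the trap handler: patches an IEEE 754 division
// result so that a division by zero yields the largest finite float instead
// of infinity. Keeping the sign of the IEEE result is my assumption.
inline float div_with_ps2_fixup(float a, float b) {
  const float ieee = a / b;  // IEEE 754: x / 0 gives +-infinity for x != 0.
  if (b == 0.0f && std::isinf(ieee)) {
    return std::copysign(std::numeric_limits<float>::max(), ieee);
  }
  return ieee;
}
```

<p>On real hardware, the check is free as long as no exception occurs, because the trap does the work. An emulator without trap support has to perform it explicitly.</p>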

<p>To fix this issue, the emulator resorts to an interpreter mode, where each floating point instruction is a function call.
Here, checking for divisions by zero and other stuff is simply handled by C++ code.
However, the really interesting things like calculating inexact flags don’t seem to be there.
In fact, the code is quite confusing and at some points the flags are just randomly set to zero for no apparent reason at all:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">Interpreter</span><span class="o">::</span><span class="n">fmulx</span><span class="p">(</span><span class="n">Interpreter</span><span class="o">&amp;</span> <span class="n">interpreter</span><span class="p">,</span> <span class="n">UGeckoInstruction</span> <span class="n">inst</span><span class="p">)</span> <span class="p">{</span>
  <span class="p">...</span>
  <span class="n">ppc_state</span><span class="p">.</span><span class="n">fpscr</span><span class="p">.</span><span class="n">FI</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>  <span class="c1">// are these flags important?</span>
  <span class="n">ppc_state</span><span class="p">.</span><span class="n">fpscr</span><span class="p">.</span><span class="n">FR</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I love these kinds of situations where I’m like: either I’m missing something completely obvious, or the code doesn’t make sense at all.
Fortunately, this <a href="https://github.com/dolphin-emu/dolphin/pull/7141">pull request</a> from 2018 helped me to regain my confidence: the code doesn’t make sense.
But as mentioned in the pull request, maybe there was a reason for it, so it’s better not to touch it.</p>

<h3 id="56-virtual-console">5.6 Virtual Console</h3>
<p>Let’s stick with simulators for Nintendo consoles but this time from Nintendo itself: the Virtual Console.
Among others, the Virtual Console allows you to play N64 games on your Wii or Wii U.
I couldn’t find much about its inner simulation engine, but there is a really interesting <abbr title="Floating Point">FP</abbr> bug that is actually used in the Super Mario 64 A Button Challenge (beating the game without pressing the A button).
The bug in action is shown in the following video:</p>
<iframe width="100%" src="https://www.youtube.com/watch?v=TiUwJSOCbYE"> </iframe>
<p>So why is the platform moving upwards?
The oscillating height of the platform is implemented by code that looks like this:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="n">time</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">y_pos</span> <span class="p">{</span><span class="o">-</span><span class="mf">3065.</span><span class="n">f</span><span class="p">};</span>
<span class="p">...</span>
<span class="n">y_pos</span> <span class="o">-=</span> <span class="n">std</span><span class="o">::</span><span class="n">sin</span><span class="p">(</span><span class="n">time</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.58</span><span class="p">;</span>
<span class="n">time</span> <span class="o">+=</span> <span class="mh">0x100</span><span class="p">;</span>
</code></pre></div></div>

<p>So simply a sine that is subtracted from the platform’s position.
But in the code, there’s a little “mistake”. Can you spot it?
It may not be obvious, but the double value of <code class="language-plaintext highlighter-rouge">0.58</code> is probably not what the programmer intended.
Rather, a single-precision value of <code class="language-plaintext highlighter-rouge">0.58f</code> would be a better fit.
Because in the double case, the result of <code class="language-plaintext highlighter-rouge">std::sin(time)</code> will be cast to a double, then the multiplication will be executed with double precision, and the final result is converted to float and stored in <code class="language-plaintext highlighter-rouge">y_pos</code>.
A lot of unnecessary casting, but nothing that should lead to serious problems.
Unless your simulator has a bug in its <abbr title="Floating Point">FP</abbr> casting operations…
As thoroughly explained in the <a href="https://dolphin-emu.org/blog/2018/07/06/dolphin-progress-report-june-2018/">Dolphin progress report from 2018</a>,
the Virtual Console does not use round-to-nearest for double-to-float conversions but round-to-zero.
Hence, a rounding error will accumulate over time that pushes the platform towards 0.
With rounding errors usually being very small in comparison to the calculated number, it takes multiple hours for the platform to rise any substantial distance.</p>
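<p>If you want to play around with this effect, the following sketch mimics the faulty conversion on a host that rounds to nearest.
It is not the Virtual Console’s code: the helper names are made up, and the angle scaling of 256 frames per oscillation is simply an assumption to get a periodic movement.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: converts a double to float with round-toward-zero,
// built on top of the host's default round-to-nearest conversion.
inline float to_float_toward_zero(double d) {
  float f = static_cast<float>(d);  // Round-to-nearest on the host.
  if (std::fabs(static_cast<double>(f)) > std::fabs(d)) {
    f = std::nextafter(f, 0.0f);  // Rounded away from zero: step back.
  }
  return f;
}

// Runs the platform update for `steps` frames and returns the final height.
// The angle scaling (256 frames per oscillation) is an assumption of mine.
inline float simulate_platform(int steps, bool buggy_conversion) {
  const double two_pi = 6.283185307179586;
  float y_pos = -3065.0f;
  for (int frame = 0; frame < steps; ++frame) {
    const double angle = two_pi * (frame % 256) / 256.0;
    // Double-precision intermediate result, as caused by the 0.58 literal.
    const double wide = static_cast<double>(y_pos) - std::sin(angle) * 0.58;
    y_pos = buggy_conversion ? to_float_toward_zero(wide)
                             : static_cast<float>(wide);
  }
  return y_pos;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">simulate_platform(25600, true)</code> and <code class="language-plaintext highlighter-rouge">simulate_platform(25600, false)</code> should show that the platform ends up higher with the buggy conversion, because every truncation nudges the negative height a tiny bit towards 0.</p>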

<h3 id="57-libriscv">5.7 libriscv</h3>
<p><a href="https://github.com/fwsGonzo/libriscv">libriscv</a> is a RISC-V userspace emulator library.
Since it’s a library, it’s marketed as being easy to integrate and configure.
With currently 489 stars on GitHub (2024-06-06), its popularity gets close to rv8, so I think it’s worth being covered in this post.
Note that all of the following refers to v1.3.</p>

<p>In contrast to other simulators, libriscv has some toggles that allow configuring the accuracy/performance of the simulation.
For instance, simulation of the fcsr is disabled by default (e.g., <code class="language-plaintext highlighter-rouge">option(RISCV_FCSR   "Enable FCSR emulation" OFF)</code>).
Also, things like NaN boxing are disabled by default.
But since I’m interested in accurate simulations, I took a closer look at the accurate paths of libriscv.</p>

<p>One interesting thing I noticed is the modeling of the <abbr title="Floating Point">FP</abbr> exception flags.
So, let’s take a look at the following code excerpt from <code class="language-plaintext highlighter-rouge">rvf_instr.cpp</code> with an <abbr title="Floating Point">FP</abbr> addition as an example.</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Sets the RISC-V fcsr flags.</span>
<span class="k">static</span> <span class="kt">void</span> <span class="nf">fsflags</span><span class="p">(</span><span class="n">CPU</span><span class="o">&lt;</span><span class="n">W</span><span class="o">&gt;&amp;</span> <span class="n">cpu</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">double</span> <span class="n">exact</span><span class="p">,</span> <span class="n">T</span><span class="o">&amp;</span> <span class="n">inexact</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="k">constexpr</span> <span class="p">(</span><span class="n">fcsr_emulation</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">auto</span><span class="o">&amp;</span> <span class="n">fcsr</span> <span class="o">=</span> <span class="n">cpu</span><span class="p">.</span><span class="n">registers</span><span class="p">().</span><span class="n">fcsr</span><span class="p">();</span>
    <span class="n">fcsr</span><span class="p">.</span><span class="n">fflags</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">isnan</span><span class="p">(</span><span class="n">exact</span><span class="p">)</span> <span class="o">||</span> <span class="n">std</span><span class="o">::</span><span class="n">isnan</span><span class="p">(</span><span class="n">inexact</span><span class="p">))</span> <span class="p">{</span>
      <span class="n">fcsr</span><span class="p">.</span><span class="n">fflags</span> <span class="o">|=</span> <span class="mi">16</span><span class="p">;</span>
      <span class="k">if</span> <span class="k">constexpr</span> <span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">==</span> <span class="mi">4</span><span class="p">)</span>
        <span class="o">*</span><span class="p">(</span><span class="kt">int32_t</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">inexact</span> <span class="o">=</span> <span class="n">CANONICAL_NAN_F32</span><span class="p">;</span>
      <span class="k">else</span>
        <span class="o">*</span><span class="p">(</span><span class="kt">int64_t</span> <span class="o">*</span><span class="p">)</span><span class="o">&amp;</span><span class="n">inexact</span> <span class="o">=</span> <span class="n">CANONICAL_NAN_F64</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
      <span class="k">if</span> <span class="p">(</span><span class="n">exact</span> <span class="o">!=</span> <span class="n">inexact</span><span class="p">)</span> <span class="n">fcsr</span><span class="p">.</span><span class="n">fflags</span> <span class="o">|=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// "Accurate" floating point addition.</span>
<span class="n">FLOAT_INSTR</span><span class="p">(</span><span class="n">FADD</span><span class="p">,</span> <span class="p">[]</span> <span class="p">(</span><span class="k">auto</span><span class="o">&amp;</span> <span class="n">cpu</span><span class="p">,</span> <span class="n">rv32i_instruction</span> <span class="n">instr</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">const</span> <span class="n">rv32f_instruction</span> <span class="n">fi</span> <span class="p">{</span> <span class="n">instr</span> <span class="p">};</span>
  <span class="k">auto</span><span class="o">&amp;</span> <span class="n">dst</span> <span class="o">=</span> <span class="n">cpu</span><span class="p">.</span><span class="n">registers</span><span class="p">().</span><span class="n">getfl</span><span class="p">(</span><span class="n">fi</span><span class="p">.</span><span class="n">R4type</span><span class="p">.</span><span class="n">rd</span><span class="p">);</span>
  <span class="k">auto</span><span class="o">&amp;</span> <span class="n">rs1</span> <span class="o">=</span> <span class="n">cpu</span><span class="p">.</span><span class="n">registers</span><span class="p">().</span><span class="n">getfl</span><span class="p">(</span><span class="n">fi</span><span class="p">.</span><span class="n">R4type</span><span class="p">.</span><span class="n">rs1</span><span class="p">);</span>
  <span class="k">auto</span><span class="o">&amp;</span> <span class="n">rs2</span> <span class="o">=</span> <span class="n">cpu</span><span class="p">.</span><span class="n">registers</span><span class="p">().</span><span class="n">getfl</span><span class="p">(</span><span class="n">fi</span><span class="p">.</span><span class="n">R4type</span><span class="p">.</span><span class="n">rs2</span><span class="p">);</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">fi</span><span class="p">.</span><span class="n">R4type</span><span class="p">.</span><span class="n">funct2</span> <span class="o">==</span> <span class="mh">0x0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// float32</span>
    <span class="n">dst</span><span class="p">.</span><span class="n">set_float</span><span class="p">(</span><span class="n">rs1</span><span class="p">.</span><span class="n">f32</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="n">rs2</span><span class="p">.</span><span class="n">f32</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
    <span class="n">fsflags</span><span class="p">(</span><span class="n">cpu</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)(</span><span class="n">rs1</span><span class="p">.</span><span class="n">f32</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">+</span> <span class="p">(</span><span class="kt">double</span><span class="p">)(</span><span class="n">rs2</span><span class="p">.</span><span class="n">f32</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">dst</span><span class="p">.</span><span class="n">f32</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span> <span class="c1">// Nope, don't do it like this!!!</span>
  <span class="p">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">fi</span><span class="p">.</span><span class="n">R4type</span><span class="p">.</span><span class="n">funct2</span> <span class="o">==</span> <span class="mh">0x1</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// float64</span>
    <span class="n">dst</span><span class="p">.</span><span class="n">f64</span> <span class="o">=</span> <span class="n">rs1</span><span class="p">.</span><span class="n">f64</span> <span class="o">+</span> <span class="n">rs2</span><span class="p">.</span><span class="n">f64</span><span class="p">;</span>
    <span class="n">fsflags</span><span class="p">(</span><span class="n">cpu</span><span class="p">,</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">double</span><span class="p">)(</span><span class="n">rs1</span><span class="p">.</span><span class="n">f64</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="kt">long</span> <span class="kt">double</span><span class="p">)(</span><span class="n">rs2</span><span class="p">.</span><span class="n">f64</span><span class="p">),</span> <span class="n">dst</span><span class="p">.</span><span class="n">f64</span><span class="p">);</span>
  <span class="p">}</span>
  <span class="p">...</span>
<span class="p">}</span> <span class="p">...</span> <span class="p">)</span>
</code></pre></div></div>

<p>To set the <abbr title="Floating Point">FP</abbr> exception flags, <abbr title="Floating Point">FP</abbr> instructions call the function <code class="language-plaintext highlighter-rouge">fsflags</code>.
One can quickly see that this function only handles the inexact and the invalid case.
In general, other exception flags like division-by-zero, underflow or overflow seem to be missing in libriscv.
Anyway, let’s take a look at a particular flag they are apparently modeling: the inexact <abbr title="Floating Point">FP</abbr> exception flag.
As you can see, the function takes an exact value and an inexact value as arguments.
The latter stems from the actual executed instruction.
If <code class="language-plaintext highlighter-rouge">exact != inexact</code>, then the instruction was inexact, and the corresponding flag has to be set.
So, how do you calculate an exact value, for example, for an <abbr title="Floating Point">FP</abbr> addition?
Well, apparently you just upcast values to the next larger datatype and perform the addition (see: <code class="language-plaintext highlighter-rouge">Nope, don't do it like this!!!</code> in the code).
You can be sure that this addition was exact…</p>

<p><strong>No, please don’t do it this way! It’s not exact!</strong></p>

<p>As shown <a href="#62-fast-32-bit-multiplication">later</a>, this may work for multiplications, but for other arithmetic instructions you need other methods!
Especially for the square root instruction, it should appear natural that calculating something like $\sqrt{2}$ exactly may require quite a few bits.
You may be able to correctly determine some inexact cases, but by far not all of them.
If you want to see how libriscv fails to determine some cases, compile and execute the following C++ program:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Compile with: g++ -static inexact.cpp -o inexact.rv64</span>
<span class="c1">// Execute with: yes "" | DEBUG=TRUE ./rvlinux inexact.rv64 | grep Inexact</span>

<span class="cp">#include</span> <span class="cpf">&lt;cfenv&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;climits&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;cmath&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;iostream&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="k">volatile</span> <span class="kt">float</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">;</span>

  <span class="n">std</span><span class="o">::</span><span class="n">feclearexcept</span><span class="p">(</span><span class="n">FE_ALL_EXCEPT</span><span class="p">);</span>
  <span class="n">a</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">numeric_limits</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">max</span><span class="p">();</span>
  <span class="n">b</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">numeric_limits</span><span class="o">&lt;</span><span class="kt">float</span><span class="o">&gt;::</span><span class="n">denorm_min</span><span class="p">();</span>
  <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Inexact hard case: "</span> <span class="o">&lt;&lt;</span>  <span class="p">(</span><span class="kt">bool</span><span class="p">)</span><span class="n">std</span><span class="o">::</span><span class="n">fetestexcept</span><span class="p">(</span><span class="n">FE_INEXACT</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="n">std</span><span class="o">::</span><span class="n">feclearexcept</span><span class="p">(</span><span class="n">FE_ALL_EXCEPT</span><span class="p">);</span>
  <span class="n">a</span> <span class="o">=</span> <span class="mf">3.0000002384185791015625</span><span class="n">f</span><span class="p">;</span>
  <span class="n">b</span> <span class="o">=</span> <span class="mf">3.</span><span class="n">f</span><span class="p">;</span>
  <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
  <span class="n">std</span><span class="o">::</span><span class="n">cout</span> <span class="o">&lt;&lt;</span> <span class="s">"Inexact easy case: "</span> <span class="o">&lt;&lt;</span>  <span class="p">(</span><span class="kt">bool</span><span class="p">)</span><span class="n">std</span><span class="o">::</span><span class="n">fetestexcept</span><span class="p">(</span><span class="n">FE_INEXACT</span><span class="p">)</span> <span class="o">&lt;&lt;</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>On any non-broken computer/simulator, both cases should be inexact.
With libriscv, the hard case is not detected as inexact.</p>
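<p>You don’t even need a RISC-V toolchain to see why the hard case slips through.
The following sketch boils the upcast idea down to a single host function (the name <code class="language-plaintext highlighter-rouge">inexact_by_upcast</code> is mine, and the function is a simplification, not libriscv’s code).
It assumes a host with IEEE 754 float/double arithmetic and no <code class="language-plaintext highlighter-rouge">-ffast-math</code>.</p>

```cpp
#include <cassert>
#include <limits>

// Hypothetical reduction of the upcast idea: compares the float sum with the
// same sum computed in double and reports "inexact" if they differ.
inline bool inexact_by_upcast(float a, float b) {
  const float narrow = a + b;
  const double wide = static_cast<double>(a) + static_cast<double>(b);
  return wide != static_cast<double>(narrow);
}
```

<p>For the easy case, the double sum holds the exact result and the check works.
For the hard case, the exact sum of the largest float and the smallest subnormal would need far more than the 53 significand bits of a double, so the double addition itself is rounded, both values compare equal, and the inexact result goes unnoticed.</p>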

<h3 id="58-you-et-al">5.8 You et al.</h3>
<p>As mentioned in Section <a href="#53-qemu-post-v400">5.3 QEMU post-v4.0.0</a>,
Guo et al. <a class="citation" href="#guo2016">[31]</a> tried to implement software-based calculations for the inexact exception, but could only come up with a solution for additions/subtractions.
Their solution looked as follows:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">inexact</span> <span class="o">=</span> <span class="p">((</span><span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span> <span class="o">-</span> <span class="n">a</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">b</span>
</code></pre></div></div>
<p>Guo et al. don’t mention it in their paper, but that is pretty much the so-called <em>Fast2Sum</em> algorithm that was introduced in 1971
by T. J. Dekker <a class="citation" href="#dekker1971">[33]</a>.
According to Dekker, the result of a rounded addition can be described by the sum of its exact value and a residual:</p>

<p>\begin{equation}
  \label{eq:fast2sum-main}
  \begin{gathered}
  a + b - r = s = \circ(a + b) \\\<br />
  r = \circ(b - \circ(s - a)) \quad with : |a|&gt;|b|
  \end{gathered}
\end{equation}</p>

<p>The residual can be calculated by rounded <abbr title="Floating Point">FP</abbr> instructions as follows:</p>

<p>\begin{equation}
  \label{eq:fast2sum-residuum}
  \begin{aligned}
  r = \circ(b - \circ(s - a)) \quad with : |a|&gt;|b|
  \end{aligned}
\end{equation}</p>

<p>As mathematically proven by Dekker, the residual $r$ holds the exact rounding error of the addition of the variables $a$ and $b$.
Hence, if the residual $r$ is not 0, the <abbr title="Floating Point">FP</abbr> addition was inexact.
Additionally, the value of the residual also determines the rounding direction of the preceding addition $\circ(a + b)$.
For values greater than 0, the result of the addition was rounded down; for values less than 0, the result was rounded up.
This fact wasn’t used by Guo et al. <a class="citation" href="#guo2016">[31]</a>,
but by You et al. <a class="citation" href="#you2019">[38]</a> in 2019.
Note that Guo et al. <a class="citation" href="#guo2016">[31]</a> and You et al. <a class="citation" href="#you2019">[38]</a> share a co-author.
So, ultimately, a solution to emulate <abbr title="Round Up">RUP</abbr> rounding using <abbr title="Round to Nearest, Ties to Even">RNE</abbr> on the host might look like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Fast2Sum for RUP</span>
<span class="kt">float</span> <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// Result.</span>
<span class="kt">float</span> <span class="n">x</span> <span class="o">=</span> <span class="n">fabs</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">fabs</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">?</span> <span class="n">a</span> <span class="o">:</span> <span class="n">b</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">y</span> <span class="o">=</span> <span class="n">fabs</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">&gt;</span> <span class="n">fabs</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">?</span> <span class="n">b</span> <span class="o">:</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">r</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="p">(</span><span class="n">c</span> <span class="o">-</span> <span class="n">x</span><span class="p">);</span> <span class="c1">// Rounding error.</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">nextup</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Next greater FP value.</span>
    <span class="n">overflow</span> <span class="o">=</span> <span class="n">is_inf</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">overflow</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
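<p>For experimenting, here is a self-contained variant of the snippet above.
It is only a sketch under a few assumptions: the struct and function names are mine, the inputs are finite, the host uses <abbr title="Round to Nearest, Ties to Even">RNE</abbr>, the code is compiled without <code class="language-plaintext highlighter-rouge">-ffast-math</code>, and <code class="language-plaintext highlighter-rouge">std::nextafter</code> stands in for <code class="language-plaintext highlighter-rouge">nextup</code>.
Like the snippet above, it does not report an overflow that already happens in the host addition itself.</p>

```cpp
#include <cassert>
#include <cmath>
#include <limits>

struct RupResult {
  float value;    // Sum rounded towards +infinity.
  bool inexact;   // True if the exact sum is not representable.
  bool overflow;  // True if the adjustment stepped up to +infinity.
};

// Hypothetical helper: emulates a round-up addition on an RNE host with
// Fast2Sum. Assumes finite inputs and no -ffast-math.
inline RupResult add_round_up(float a, float b) {
  const float c = a + b;  // Host result, rounded to nearest.
  const bool a_is_larger = std::fabs(a) > std::fabs(b);
  const float x = a_is_larger ? a : b;  // Operand with the larger magnitude.
  const float y = a_is_larger ? b : a;
  const float r = y - (c - x);  // Rounding error of the host addition.
  RupResult result{c, r != 0.0f, false};
  if (r > 0.0f) {
    // The host rounded down, so the round-up result is the next greater float.
    result.value = std::nextafter(c, std::numeric_limits<float>::infinity());
    result.overflow = std::isinf(result.value);
  }
  return result;
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">add_round_up(1.0f, 1e-10f)</code> returns the float right above 1 and reports an inexact result, whereas <code class="language-plaintext highlighter-rouge">add_round_up(1.0f, -1e-10f)</code> keeps the host result of 1 because it was already rounded up.</p>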
<p>While You et al. and Guo et al. managed to develop fast inexact checks and rounding adjustments for additions/subtractions,
most other arithmetic instructions remained untouched.
They did develop an inexact check for <abbr title="Fused Multiply-Add">FMA</abbr> instructions using integer-based intermediate results,
but their measurements show no speedup compared to a soft float implementation.
So, let’s take a look at a more successful attempt in the next section.</p>

<h3 id="59-sarrazin-et-al">5.9 Sarrazin et al.</h3>
<p>The approach from Sarrazin et al. <a class="citation" href="#sarrazin2016">[39]</a> isn’t really about determining the inexactness of <abbr title="Fused Multiply-Add">FMA</abbr>, but it comes close to it.
Interestingly, their work was published in 2016, which predates the unsuccessful attempt of You et al. <a class="citation" href="#you2019">[38]</a> in 2019.</p>

<p>The group of Sarrazin faced the problem of emulating <abbr title="Fused Multiply-Add">FMA</abbr> instructions on systems without hardware <abbr title="Fused Multiply-Add">FMA</abbr> support.
So, they combined UpMul with the 2Sum algorithm to get the following equations:
\begin{equation}
  \label{eq:ErrFma-residual}
  \begin{gathered}
    M = \circ_{64}(a \cdot b) \\\<br />
    S,T = 2Sum(M, \circ_{64}(c)) \\\<br />
    r = \circ_{32}(S) \\\<br />
    E = ||S-r|| \\\<br />
    with : \circ_{32}(a)=a, \quad \circ_{32}(b)=b, \quad \circ_{32}(c)=c
  \end{gathered}
\end{equation}
The output of the 2Sum algorithm is identical to that of the Fast2Sum algorithm, which was presented in the previous subsection.
A more detailed discussion about the differences and performance implications is provided in the following section.
The residual $T$ (yes, suboptimal variable name) determines if the addition of $c$ and $a \cdot b$ was inexact.
This can have an impact on the rounding if $S$ lies exactly in the middle of two 32-bit <abbr title="Floating Point">FP</abbr> numbers ($E=2^{e_r - p}$).
So, if $E$ is equal to $2^{e_r - p}$, you have to check $S$ and $T$, and adapt $r$ accordingly.</p>
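<p>In case you have never seen 2Sum, here is a sketch of its textbook, branch-free formulation, which is usually attributed to Knuth and Møller.
I can’t say whether Sarrazin et al. use exactly this variant, so take the struct and function names as my own, and note that the code assumes <abbr title="Round to Nearest, Ties to Even">RNE</abbr>, no overflow, and no <code class="language-plaintext highlighter-rouge">-ffast-math</code>.</p>

```cpp
#include <cassert>
#include <cmath>

struct SumWithResidual {
  double sum;       // S: the rounded sum.
  double residual;  // T: the rounding error, so that a + b equals S + T.
};

// Branch-free 2Sum in its textbook form. Unlike Fast2Sum, it does not
// require the operands to be ordered by magnitude.
// Assumes RNE, no overflow, and no -ffast-math style reassociation.
inline SumWithResidual two_sum(double a, double b) {
  const double s = a + b;
  const double a_virtual = s - b;          // Part of s that stems from a.
  const double b_virtual = s - a_virtual;  // Part of s that stems from b.
  const double a_error = a - a_virtual;
  const double b_error = b - b_virtual;
  return {s, a_error + b_error};
}
```

<p>The price for getting rid of the magnitude comparison is three additional <abbr title="Floating Point">FP</abbr> operations compared to Fast2Sum.</p>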

<p>As you can see, that doesn’t really indicate if the calculation was inexact or not.
Later in Section <a href="#65-fast-32-bit-fused-multiply-add">6.5 Fast 32-bit Fused Multiply-Add</a>,
I show how the equations can be rearranged to fulfill that purpose.</p>

<p>One major disadvantage of the method by Sarrazin et al. is the dependence on larger data types.
If the residual of a 32-bit <abbr title="Fused Multiply-Add">FMA</abbr> instruction is computed, at least 64-bit <abbr title="Floating Point">FP</abbr> precision is required.
Or more precisely, the larger data type needs at least $2p$ significand bits.
Hence, this algorithm does not work for double precision values on x64 systems.
The 80-bit precision provided by the x87 <abbr title="Floating Point Unit">FPU</abbr> cannot be used, as it does not have $2p$ significand bits.</p>
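<p>The $2p$ requirement can be checked at compile time with <code class="language-plaintext highlighter-rouge">std::numeric_limits</code>.
The helper below is a hypothetical sketch of mine that only looks at the significand width and ignores the exponent range.</p>

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: true if Wide has at least 2p significand bits, where
// p is the significand width of Narrow. The exponent range is not checked.
template <typename Narrow, typename Wide>
constexpr bool has_double_width_significand() {
  return std::numeric_limits<Wide>::digits >=
         2 * std::numeric_limits<Narrow>::digits;
}

static_assert(has_double_width_significand<float, double>(),
              "double (53 bits) covers 2 * 24 bits");
static_assert(!has_double_width_significand<double, double>(),
              "a type never covers twice its own width");
```

<p>With the x87 format as <code class="language-plaintext highlighter-rouge">long double</code> (64 significand bits), <code class="language-plaintext highlighter-rouge">has_double_width_significand&lt;double, long double&gt;()</code> evaluates to false, since $2 \cdot 53 = 106$ bits would be needed.
On platforms where <code class="language-plaintext highlighter-rouge">long double</code> is a 128-bit IEEE format (113 significand bits), the check passes.</p>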


<h2 id="6-methods">6. Methods</h2>
<p>In this section, I show which methods I used and developed to equip MachineWare’s SIM-V simulator
with an ultra-fast <abbr title="Floating Point">FP</abbr> arithmetic.
As shown in the previous section, there are numerous ways to simulate <abbr title="Floating Point">FP</abbr> arithmetic.
To make life easy for myself, I implemented a soft float library for the first proof of concept.
With soft float, SIM-V was able to pass the <abbr title="RISC-V Architectural Test Framework">RISCOF</abbr> tests, but the performance was underwhelming.
So, for the second attempt, I implemented QEMU’s method.
This already increased the speed significantly, and profiling showed that there was only limited room for further optimization.
In more than 99.9% of all cases, the critical exception flags are already set and don’t need to be recalculated. <br />
From a programmer’s point of view, that is certainly good - there is nothing more to do! <br />
For a PhD student under pressure to publish, it is rather suboptimal - there is nothing more to research!</p>

<p>Ok, but what if I focus on some of the corner cases in which QEMU’s method doesn’t perform well?
For instance, if the target doesn’t use <abbr title="Round to Nearest, Ties to Even">RNE</abbr>, QEMU always has to fall back to soft float.
You et al. <a class="citation" href="#you2019">[38]</a> already showed how the residual of an addition could be used to account for different rounding modes.
But they didn’t propose any methods for other arithmetic instructions, such as multiplication, division, or square root.</p>

<p>So, in the following, I will show how to quickly calculate, for all relevant arithmetic instructions, a residual that can be used
to determine inexactness and perform directed rounding.
I call this approach <em>floppy float</em>, because it’s somewhere between soft and hard float.
As far as I know, the methods for division and square root haven’t been described anywhere else in the literature so far.
The goal of the method is to be as fast as QEMU’s method for standard rounding, and to outperform it for non-standard rounding.</p>

<p>Besides using mathematical proofs to check the validity of the approaches,
all instructions were verified using the <em>RISC-V Architecture Test</em> <a class="citation" href="#riscv-arch-test">[41]</a>,
as well as hand-crafted tests to confirm corner cases.</p>

<p><strong>NOTE</strong><br />
In the following I’m using a positive residual (e.g. $c_{exact} + r  = \circ(a+b)$).
Hence, if $r&gt;0$, the result was rounded up, and if $r&lt;0$, the result was rounded down.
In my opinion it feels more intuitive this way.</p>
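<p>To make the sign convention tangible, here is a minimal sketch that computes the positive residual of a 32-bit addition by brute force in 64-bit arithmetic.
The function name <code>add_residual</code> is my own choice for this illustration, and the sketch assumes that the host rounds with RNE and that the exponents of both operands differ by less than 29, so that the exact sum fits into a double.</p>

```cpp
#include <cmath>

// Hypothetical helper for illustration: residual r = c - c_exact of a 32-bit
// addition. Assumption: the exponents of a and b differ by less than 29, so
// the exact sum fits into the 53 significand bits of a double.
double add_residual(float a, float b) {
  float c = a + b;                         // Rounded 32-bit result (host RNE).
  double c_exact = (double)a + (double)b;  // Exact under the assumption above.
  return (double)c - c_exact;              // r > 0: rounded up, r < 0: rounded down.
}
```

<p>For example, $1 + 2^{-24}$ is a tie that RNE resolves downwards to 1, so the residual is $-2^{-24}$, while $1 + 3 \cdot 2^{-24}$ is rounded up to $1 + 2^{-22}$ with a residual of $+2^{-24}$.</p>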

<h3 id="61-fast-additionsubtraction">6.1 Fast Addition/Subtraction</h3>
<p>As explained in Section <a href="#58-you-et-al">5.8 You et al.</a>, the work of You et al. <a class="citation" href="#you2019">[38]</a> uses the Fast2Sum algorithm for the calculation of the residual $r$.
This requires two arithmetic operations, but the operands must be sorted by absolute value.
Consequently, branching instructions might be needed, which can lead to performance penalties.
As an alternative without sorted operands, O. Møller <a class="citation" href="#moller1965">[42]</a> proposed the <em>2Sum</em> algorithm in 1965.
Similar to Dekker’s Fast2Sum algorithm, the motivation behind 2Sum was to increase the accuracy of floating-point calculations.
But roughly 50 years later, we found a way to use it to speed up our simulations!
In contrast to the Fast2Sum algorithm, the 2Sum algorithm does not require branching instructions, but involves more arithmetic instructions:
\begin{equation}
  \label{eq:2sum-main}
  \begin{gathered}
    c_{exact} + r = c  = \circ(a+b)  \\\<br />
    a' = \circ(c-b) \quad
    b' = \circ(c-a') \\\<br />
    \delta_a = \circ(a' - a) \quad
    \delta_b = \circ(b' - b) \quad
    r = \circ(\delta_a + \delta_b)
  \end{gathered}
\end{equation}</p>

<p>This algorithm also exhibits some potential for instruction-level parallelism/vectorization, as the data dependency graph reveals:</p>
<div style="text-align:center">
<img src="/assets/fast_floating_point_simulation/2sum_data_dependencies.svg" alt="2Sum data dependencies" width="30%" />
</div>
<p><br />
In some benchmark experiments I ran, the 2Sum algorithm was ~10% faster than the Fast2Sum algorithm when working on randomized data.
If the input data is predictable, thus favorable to the branch predictor, both algorithms achieve the same performance.
Ultimately, a 32-bit <abbr title="Floating Point">FP</abbr> add for <abbr title="Round Up">RUP</abbr> rounding might look like this:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RUP case</span>
<span class="kt">float</span> <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// Result.</span>
<span class="kt">float</span> <span class="n">ad</span> <span class="o">=</span> <span class="n">c</span> <span class="o">-</span> <span class="n">b</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">bd</span> <span class="o">=</span> <span class="n">c</span> <span class="o">-</span> <span class="n">ad</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">da</span> <span class="o">=</span> <span class="n">ad</span> <span class="o">-</span> <span class="n">a</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">db</span> <span class="o">=</span> <span class="n">bd</span> <span class="o">-</span> <span class="n">b</span><span class="p">;</span>
<span class="kt">float</span> <span class="n">r</span> <span class="o">=</span> <span class="n">da</span> <span class="o">+</span> <span class="n">db</span><span class="p">;</span> <span class="c1">// Residual.</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mf">0.</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="mf">0.</span><span class="n">f</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// We accidentally rounded down and have to rectify the result.</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">nextup</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Next greater FP value.</span>
    <span class="n">overflow</span> <span class="o">=</span> <span class="p">(</span><span class="n">c</span> <span class="o">==</span> <span class="n">infinity</span><span class="p">)</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">overflow</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
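<p>The <code>nextup</code> helper used above is not spelled out in this post.
One possible way to write it with the C++ standard library is sketched below; the names <code>nextup</code> and <code>nextdown</code> are assumptions for illustration, and the real implementation may well use bit manipulation instead.</p>

```cpp
#include <cmath>
#include <limits>

// Assumed helper: next greater representable 32-bit FP value.
inline float nextup(float x) {
  return std::nextafter(x, std::numeric_limits<float>::infinity());
}

// Assumed counterpart for round-down style corrections.
inline float nextdown(float x) {
  return std::nextafter(x, -std::numeric_limits<float>::infinity());
}
```

<p>Stepping up from the largest finite value yields infinity, which is why the overflow flag is checked right after the correction.</p>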

<p>I had initially chosen the 2Sum algorithm for this work, but extensive tests later revealed severe problems with overflows.
For example, for two 16-bit <abbr title="Floating Point">FP</abbr> values, assume an addition of -48.f16 and 65504.f16 (largest positive finite number).
The rounded result of this addition is 65472.f16, which is inexact with a residual of 16:
\begin{equation}
  \label{eq:twosum-broken-1}
  \begin{gathered}
  c_{exact} + r = c  = \circ(a+b) \\\<br />
  65456 + 16 = 65472 = \circ_{16}(65504-48)
  \end{gathered}
\end{equation}
In the intermediate calculations of the 2Sum algorithm, the value of $c$ leads to infinite values:
\begin{equation}
  \label{eq:twosum-broken-2}
  \begin{gathered}
  a' = \circ(c-b) \\\<br />
  \infty = \circ_{16}(65472 + 48)
  \end{gathered}
\end{equation}
Unfortunately, this leads to the residual being a qNaN:
\begin{equation}
  \label{eq:twosum-broken-3}
  \begin{gathered}
  r = \circ(\delta_a + \delta_b) \\\<br />
  qNaN = \circ_{16}(\infty - \infty)
  \end{gathered}
\end{equation}
So, ultimately I chose the Fast2Sum instead of the 2Sum algorithm.</p>
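<p>The overflow problem can be reproduced with 32-bit values as well.
The following sketch scales the 16-bit example to single precision (largest finite value plus $-1.5$ units in the last place), since standard C++ has no portable 16-bit FP type.
Both function names are my own, the Fast2Sum variant uses an explicit swap for sorting, and both use the positive residual convention $r = c - c_{exact}$.</p>

```cpp
#include <cmath>
#include <utility>

// 2Sum residual (branch-free), r = c - c_exact.
float two_sum_residual(float a, float b) {
  float c = a + b;
  float ad = c - b;   // May overflow if c and b have opposite signs.
  float bd = c - ad;
  return (ad - a) + (bd - b);
}

// Fast2Sum residual, sketched with an explicit sort so that |a| >= |b|.
float fast_two_sum_residual(float a, float b) {
  if (std::fabs(a) < std::fabs(b)) std::swap(a, b);
  float c = a + b;
  float z = c - a;    // The part of b that made it into c.
  return z - b;       // r = c - c_exact.
}
```

<p>With $a = (2^{24}-1) \cdot 2^{104}$ and $b = -1.5 \cdot 2^{104}$, the 2Sum variant runs into $\infty - \infty$ and returns a NaN, whereas the Fast2Sum variant returns the expected residual of $2^{103}$.</p>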

<h3 id="62-fast-32-bit-multiplication">6.2 Fast 32-bit Multiplication</h3>
<p>For the fast calculation and rounding of multiplications, I exploited one interesting property of IEEE <abbr title="Floating Point">FP</abbr> numbers:
multiplying two 32-bit <abbr title="Floating Point">FP</abbr> values as 64-bit values always yields an exact result!
Similar to addition, this allows us to calculate a residual, which can be used for rounding and setting the inexact flag.
For the sake of simplicity, I will call this approach <em>UpMul</em> from now on.</p>

<p>So, let’s start with some operands $a$ and $b$ as 32-bit <abbr title="Floating Point">FP</abbr> values.
In a first step, these are upcasted to 64-bit values and then multiplied.
Since the number of significand bits more than doubles from 32-bit <abbr title="Floating Point">FP</abbr> (24 bits) to 64-bit <abbr title="Floating Point">FP</abbr> (53 bits),
the result of the multiplication can be represented exactly.
If the exact value is subtracted from the rounded 32-bit value, the residual remains:
\begin{equation}
  \label{eq:upmul-main}
  \begin{gathered}
    c_{exact} + r  = c = a \cdot b + r = \circ_{32}(a \cdot b) \\\<br />
    r = a \cdot b  +  r - (a \cdot b) = \circ_{64}(\circ_{32}(a \cdot b) - \circ_{64}(a \cdot b) )
  \end{gathered}
\end{equation}</p>

<p>The mathematical proof is provided at the end of this section.
A C/C++ implementation for the <abbr title="Round Up">RUP</abbr> rounding mode can be found in the following code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RUP case</span>
<span class="kt">float</span> <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">c</span> <span class="o">-</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// We accidentally rounded down and have to rectify the result.</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">nextup</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Next greater FP value.</span>
    <span class="n">overflow</span> <span class="o">=</span> <span class="n">is_inf</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">overflow</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="n">underflow</span> <span class="o">=</span> <span class="p">(</span><span class="n">is_subnormal</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">||</span> <span class="n">is_zero</span><span class="p">(</span><span class="n">c</span><span class="p">))</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">underflow</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As shown in the code, an inexact calculation has occurred if $r\neq 0$.
Subsequently, the result is rectified in case the host hardware rounded it down.
This could lead to an overflow, hence the result is checked for infinity.
According to the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr>, tininess is detected after rounding, requiring an underflow check after rectification.
Note that underflow only occurs when the result is subnormal and inexact.</p>
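<p>The same residual also serves the other directed rounding modes.
As a rough sketch, here is what a round-down (RDN) variant of UpMul could look like.
The struct and function names are made up for this example, the host is assumed to round with RNE, and the handling of NaN or infinite operands as well as the overflow and underflow flags is left out for brevity.</p>

```cpp
#include <cmath>
#include <limits>

// Illustrative result type, not part of the original code.
struct MulResult {
  float value;
  bool inexact;
};

// Sketch of UpMul for round-down (RDN), assuming finite operands.
MulResult mul_rdn(float a, float b) {
  float c = a * b;                               // Host result (RNE assumed).
  double r = (double)c - (double)a * (double)b;  // Residual r = c - c_exact.
  MulResult res{c, r != 0.};
  if (r > 0.) {  // The host rounded up, so step to the next smaller FP value.
    res.value = std::nextafter(c, -std::numeric_limits<float>::infinity());
  }
  return res;
}
```

<p>For instance, $(1 + 2^{-23}) \cdot 1.5$ is a tie that RNE rounds up to $1.5 + 2^{-22}$, so the sketch steps back down to $1.5 + 2^{-23}$.</p>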

<p>So, now let’s take a look at the mathematical proof of this method.
The formula can be derived by first showing that the multiplication of the 32-bit values
as 64-bit values is exact.
Using Equation \ref{eq:float1} the multiplication can be expressed as:
\begin{equation}
    \label{eq:upmul3}
    \begin{aligned}
      a \cdot b  = M_a \cdot M_b \cdot 2^{e_a + e_b - 2p_f + 2} =
      c          = M_c \cdot 2^{e_c - p_d + 1}
    \end{aligned}
\end{equation}
As stated in Section <a href="#31-the-math">3.1 The Math</a>, this model is not suitable for subnormal numbers.
So, how to deal with this case?
The trick is, we don’t need to consider it! <br />
Casting 32-bit <abbr title="Floating Point">FP</abbr> values to 64 bit can never lead to subnormal results. <br />
And even the following multiplication cannot lead to subnormal results. <br />
Why is that? <br />
The smallest subnormal 32-bit <abbr title="Floating Point">FP</abbr> number is $2^{e_{f,min}- p_f + 1} = 2^{-149}$.
Multiplying the smallest subnormal 32-bit <abbr title="Floating Point">FP</abbr> number with itself results in $2^{2 \cdot -149} = 2^{-298}$.
This result is still far away from the 64-bit subnormal range, which begins at $2^{e_{d,min}} = 2^{-1022}$.
GG EZ!</p>

<p>Next, we derive the maximum ranges of $M_c$ and $e_c$:
\begin{equation}
\label{eq:upmul5}
  \begin{gathered}
      |M_c|  = |M_a \cdot M_b| \leq (2^{p_f}-1)^2 \leq (2^{24}-1)^2 \leq 2^{48} - 1 \leq 2 ^{p_d} - 1 \leq 2 ^{53} - 1 \\\<br />
      |e_c| = |e_a + e_b - 2p_f + p_d + 1| \leq 260 \leq |e_{d,min}|
  \end{gathered}
\end{equation}
Since both $M_c$ and $e_c$ fit into the range of a double-precision value, the result of the multiplication is exact.
From Equation \ref{eq:upmul5} we can also see why $2p$ significand bits are required to represent a multiplication exactly.</p>
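<p>As a small worked example of these bounds (the operands are my own choice and not part of the proof), take $a = b = 1 + 2^{-23}$, that is $M_a = M_b = 2^{23} + 1$ and $e_a = e_b = 0$:
\begin{equation}
  \begin{gathered}
    M_c = (2^{23}+1)^2 = 2^{46} + 2^{24} + 1 \leq 2^{53} - 1 \\\<br />
    a \cdot b = M_c \cdot 2^{-46} = 1 + 2^{-22} + 2^{-46}
  \end{gathered}
\end{equation}
The product needs 47 significand bits, which fit into a 64-bit <abbr title="Floating Point">FP</abbr> value but not into a 32-bit one.
Consequently, $\circ_{32}(a \cdot b) = 1 + 2^{-22}$, and the residual is $r = -2^{-46}$.</p>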

<p>As the final step, the exactness of the subtraction needs to be shown.
Here I simply used <a href="https://en.wikipedia.org/wiki/Sterbenz_lemma">Sterbenz’ Lemma</a> <a class="citation" href="#sterbenz1974">[43]</a>.
According to this lemma, the subtraction of two sufficiently close <abbr title="Floating Point">FP</abbr> numbers is always exact.
Interesting remark: this only works if the <abbr title="Floating Point">FP</abbr> number format supports subnormals.
Or to express it mathematically:</p>

<p>\begin{equation}
\label{eq:sterbenz}
  \begin{gathered}
      \text{if} \quad a/2 \leq b \leq 2a \\\<br />
      \text{then} \quad \circ(b - a) = b - a
  \end{gathered}
\end{equation}</p>

<p>Since the values of $\circ_{64}(a \cdot b)$ and $\circ_{32}(a \cdot b)$ differ by no more than a factor of 2, their subtraction is exact.</p>

<h3 id="63-fast-32-bit-division">6.3 Fast 32-bit Division</h3>
<p>For the fast division, I developed a new method called <em>UpDiv</em>, which, as far as I know, has not appeared in any other work before.
Similar to the UpMul method from before, both operands must be 32-bit <abbr title="Floating Point">FP</abbr> values, and the goal is to compute the residual $r$.
However, in this case, the exact determination of the residual of a division is overambitious,
as certain rational numbers cannot be represented with a finite number of significand bits.
Nevertheless, the exact value of the residual is not crucial for our endeavor.
Rather, we want to know whether there was a rounding error, and if it is positive or negative.
In mathematical terms, an approximation of the residual $\tilde{r}$ is sought, for which $sgn(\tilde{r})=sgn(r)$ is satisfied.
Such an approximation is obtained by:
\begin{equation}
  \label{eq:updiv-main}
  \begin{gathered}
    a / b + r = c_{exact} + r = c = \circ_{32}(a / b) \\\<br />
    \tilde{r}  = \circ_{64}(\circ_{64}(\circ_{32}(a / b) \cdot b) - a) \cdot sgn(b)
  \end{gathered}
\end{equation}</p>

<p>And in terms of C/C++:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RUP case</span>
<span class="kt">float</span> <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">/</span> <span class="n">b</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">c</span> <span class="o">*</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">b</span> <span class="o">-</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">a</span><span class="p">;</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">signbit</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">?</span> <span class="o">-</span><span class="n">r</span> <span class="o">:</span> <span class="n">r</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// We accidentally rounded down and have to rectify the result.</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">nextup</span><span class="p">(</span><span class="n">c</span><span class="p">);</span> <span class="c1">// Next greater FP value.</span>
    <span class="n">overflow</span> <span class="o">=</span> <span class="n">is_inf</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">overflow</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="n">underflow</span> <span class="o">=</span> <span class="p">(</span><span class="n">is_subnormal</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">||</span> <span class="n">is_zero</span><span class="p">(</span><span class="n">c</span><span class="p">))</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">underflow</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
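<p>To see the sign flip for negative divisors in action, here is a self-contained sketch of the same idea.
The names <code>DivResult</code> and <code>div_rup</code> are hypothetical, the host is assumed to round with RNE, and the sketch only covers finite, non-zero operands without overflow or underflow handling.</p>

```cpp
#include <cmath>
#include <limits>

// Illustrative result type, not part of the original code.
struct DivResult {
  float value;
  bool inexact;
};

// Sketch of UpDiv for round-up (RUP), assuming finite, non-zero operands.
DivResult div_rup(float a, float b) {
  float c = a / b;                               // Host result (RNE assumed).
  double r = (double)c * (double)b - (double)a;  // Carries the sign of r * b.
  if (std::signbit(b)) r = -r;                   // Undo the sign of b.
  DivResult res{c, r != 0.};
  if (r < 0.) {  // The host rounded down, so step to the next greater FP value.
    res.value = std::nextafter(c, std::numeric_limits<float>::infinity());
  }
  return res;
}
```

<p>For $1 / (-3)$, the host result lies slightly below $-1/3$, so the sketch moves it one step up, to the smallest 32-bit value that is not less than the exact quotient.</p>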

<p>If you are interested in the mathematical proof, here it comes.</p>

<p>The equation can be derived by using the standard model of <abbr title="Floating Point">FP</abbr> arithmetic extended for subnormals (see Equation \ref{eq:standard-error-model}).
According to the model, the error of the <abbr title="Floating Point">FP</abbr> division, including underflow and overflow, can be represented as follows:
\begin{equation}
  \label{eq:updiv3}
  \begin{aligned}
  \frac{a}{b} \cdot (1 + \epsilon_1 ) + \eta_1 = \circ_{32}(a/b) = a / b + r
  \end{aligned}
\end{equation}
If the result of the division is upcast to 64 bits and multiplied by the value of $b$, which is also upcast to 64 bits, the result must be exact (see previous subsection).
This allows us to calculate the approximation $\tilde{a}$ as follows:
\begin{equation}
  \label{eq:updiv4}
  \begin{aligned}
  \tilde{a} = a + a \epsilon_1 + b \eta_1 = \circ_{64}(b \cdot \circ_{32}(a/b))
  \end{aligned}
\end{equation}
Subtracting $a$ from $\tilde{a}$ yields Equation \ref{eq:updiv5}:
\begin{equation}
  \label{eq:updiv5}
  \begin{gathered}
  z = \circ_{64}(\tilde{a} - a) = \circ_{64}(\circ_{64}(b \cdot \circ_{32}(a/b)) - a) = (a \epsilon_1 + b \eta_1)(1 + \epsilon_2) \\\<br />
  z =
  \begin{cases}
    b \eta_1 (1 + \epsilon_2) &amp; subn.\\\<br />
    a \epsilon_1 (1 + \epsilon_2) &amp; else
  \end{cases}
  \end{gathered}
\end{equation}
Although this subtraction can be inexact, which is described by $\epsilon_2$,
the result 0 can only be obtained if the preceding division was exact ($\epsilon_1=\eta_1=0$).
Otherwise, the sign of $z$ is directly determined by $a \epsilon_1$ or $b \eta_1$.
Next, Equation \ref{eq:updiv5} is rearranged to:
\begin{equation}
  \label{eq:updiv7}
  \begin{aligned}
  \epsilon_1 = \frac{z}{a \cdot (1 + \epsilon_2)} \quad
  \eta_1 = \frac{z}{b \cdot (1 + \epsilon_2)}
  \end{aligned}
\end{equation}
Inserting Equation \ref{eq:updiv7} into Equation \ref{eq:updiv3} yields for both cases the following residual:
\begin{equation}
  \label{eq:updiv8}
  r = \frac{z} {b \cdot (1 + \epsilon_2)} =  \frac{\circ_{64}(\circ_{64}(b \cdot \circ_{32}(a/b)) - a)} {b \cdot (1 + \epsilon_2)}
\end{equation}
Therefore, the residual can only be 0 if $z$ is 0 as well.
Likewise, the sign of $r$ is directly determined by $z$ and $b$.
Consequently, we conclude $sgn(\tilde{r}) = sgn(r)$.</p>

<h3 id="64-fast-32-bit-square-root">6.4 Fast 32-bit Square Root</h3>
<p>The calculation of a fast square root and its residual follows the same principle as the UpDiv algorithm.
Hence, I named it <em>UpSqrt</em>.
I exploit that squaring, i.e., a multiplication, is the inverse operation of the square root, and that multiplication with larger data types is exact.
The residual is obtained according to Equation \ref{eq:upsqrt-main}:</p>

<p>\begin{equation}
  \label{eq:upsqrt-main}
  \begin{gathered}
    \sqrt{a} + r = b_{exact} + r = b = \circ_{32}(\sqrt{a}) \\\<br />
    \tilde{r} = \circ_{64}(\circ_{64}(\circ_{32}(\sqrt{a})^2) - a)
  \end{gathered}
\end{equation}</p>

<p>The proof of the algorithm is analogous to the proof of the UpDiv algorithm.
Again, an approximation $\tilde{r}$ for the residual $r$ with $sgn(r) = sgn(\tilde{r})$ is sought.
And again, two properties are exploited: the multiplication is exact if a larger data type is available,
and the multiplication can be used as the inverse function of the actual operation.
The final result is the following expression:
\begin{equation}
  \label{eq:upsqrt2}
  \begin{aligned}
    r = \sqrt{\frac{\tilde{r}}{1+\epsilon_2}+a} - \sqrt{a}
  \end{aligned}
\end{equation}
Since the sign of $r$ is only dependent on $\tilde{r}$, $sgn(r) = sgn(\tilde{r})$ holds.
Here’s the corresponding C/C++ code:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RUP case</span>
<span class="kt">float</span> <span class="n">b</span> <span class="o">=</span> <span class="n">sqrt</span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">b</span> <span class="o">*</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">b</span> <span class="o">-</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">a</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// We accidentally rounded down and have to rectify the result.</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">nextup</span><span class="p">(</span><span class="n">b</span><span class="p">);</span> <span class="c1">// Next greater FP value.</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>And here’s the proof. According to the standard error model of <abbr title="Floating Point">FP</abbr>, the 64-bit multiplication of the 32-bit square root of $a$ with itself results in:
\begin{equation}
\circ_{64}(\circ_{32}(\sqrt{a})^2) = \circ_{32}(\sqrt{a})^2 = (\sqrt{a} \cdot (1 + \epsilon_1))^2
\end{equation}
Note that a square root cannot produce a subnormal result (thus no $\eta$) and that a 64-bit multiplication of 32-bit values is always exact.
The latter is the same property of <abbr title="Floating Point">FP</abbr> that I already used in the previous two sections.
Next, we subtract $a$:</p>

<p>\begin{equation}
  \label{eq:upsqrt-proof1}
  \begin{gathered}
    \tilde{r} = \circ_{64}(\circ_{32}(\sqrt{a})^2 - a) = ((\sqrt{a} \cdot (1 + \epsilon_1))^2 - a)\cdot (1 + \epsilon_2)
  \end{gathered}
\end{equation}</p>

<p>And rearrange the formula:</p>

<p>\begin{equation}
  \label{eq:upsqrt-proof2}
  \begin{gathered}
    \epsilon_1 = \sqrt{\frac{\tilde{r}}{(1+\epsilon_2) \cdot a}+1} - 1
  \end{gathered}
\end{equation}</p>

<p>Inserting $\epsilon_1$ into $\sqrt{a} \cdot \epsilon_1 = r$ gives us:</p>

<p>\begin{equation}
  \label{eq:eq:upsqrt-proof3}
  \begin{aligned}
    r = \sqrt{\frac{\tilde{r}}{1+\epsilon_2}+a} - \sqrt{a}
  \end{aligned}
\end{equation}
And q.e.d.</p>
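<p>The sign claim can also be checked on a few concrete inputs.
The helper below, whose name is an assumption for this illustration, only returns the approximated residual $\tilde{r}$ and assumes a finite, non-negative input as well as a correctly rounded host square root.</p>

```cpp
#include <cmath>

// Hypothetical helper: UpSqrt residual approximation for a finite a >= 0.
// Its sign tells whether the 32-bit square root was rounded up or down.
double upsqrt_residual(float a) {
  float b = std::sqrt(a);                     // 32-bit host result (RNE assumed).
  return (double)b * (double)b - (double)a;   // The product is exact in 64 bits.
}
```

<p>For $a = 2$ the 32-bit square root lies below $\sqrt{2}$, so $\tilde{r}$ is negative; for $a = 5$ it lies above $\sqrt{5}$, so $\tilde{r}$ is positive; and for $a = 4$ the result is exact with $\tilde{r} = 0$.</p>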

<h3 id="65-fast-32-bit-fused-multiply-add">6.5 Fast 32-bit Fused Multiply-Add</h3>
<p>For fast <abbr title="Fused Multiply-Add">FMA</abbr> simulation, I deployed a method similar to that of Sarrazin et al. <a class="citation" href="#sarrazin2016">[39]</a>.
Yet, I repurposed it to account for inexact exceptions.
The idea is to first calculate the exact multiplication of $a$ and $b$ using a larger data type.
Subsequently, the residual of the summation of $a \cdot b$ and $c$ is calculated using the 2Sum algorithm.
But even if this summation was exact ($r_1=0$), the final result might not be representable as 32-bit <abbr title="Floating Point">FP</abbr> value.
Hence, another residual $r_2$  is calculated to determine the 64-bit to 32-bit rounding error.
Note that $r_2$ is exact due to Sterbenz’ Lemma <a class="citation" href="#sterbenz1974">[43]</a>.
\begin{equation}
  \label{eq:fast-fma-main}
  \begin{gathered}
    d_{exact} + r = d = \circ_{32}(a \cdot b + c) \\\<br />
    r_1 = 2Sum(\circ_{64}(a \cdot b),  c) \\\<br />
    r_2  = \circ_{64}(d - \circ_{64}(\circ_{64}(a \cdot b) + c))
  \end{gathered}
\end{equation}
Finally, an approximation of the rounding error $\tilde{r}$ can be calculated, as shown in Equation \ref{eq:fast-fma-residual}:
\begin{equation}
  \label{eq:fast-fma-residual}
  \begin{aligned}
    \tilde{r} &amp; = r_1 + r_2
  \end{aligned}
\end{equation}
Although the addition of $r_1$ and $r_2$ is not exact per se, it satisfies $sgn(\tilde{r})=sgn(r)$.
This is enabled by gradual underflows, due to which the following property holds for two arbitrary 32-bit <abbr title="Floating Point">FP</abbr> numbers:
$sgn(a+b) = sgn(\circ_{32}(a + b))$.</p>

<p>As before, here is the C/C++ code for a <abbr title="Round Up">RUP</abbr> case:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// RUP case</span>
<span class="kt">float</span> <span class="n">d</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">fma</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">);</span>
<span class="kt">double</span> <span class="n">p</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">dd</span> <span class="o">=</span> <span class="n">p</span> <span class="o">+</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">c</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">r1</span> <span class="o">=</span> <span class="n">two_sum</span><span class="o">&lt;</span><span class="kt">double</span><span class="o">&gt;</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">c</span><span class="p">,</span> <span class="n">dd</span><span class="p">);</span>
<span class="kt">double</span> <span class="n">r2</span> <span class="o">=</span> <span class="p">(</span><span class="kt">double</span><span class="p">)</span><span class="n">d</span> <span class="o">-</span> <span class="n">dd</span><span class="p">;</span>
<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="n">r1</span> <span class="o">+</span> <span class="n">r2</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">&lt;</span> <span class="mf">0.</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// We accidentally rounded down and have to rectify the result.</span>
    <span class="n">d</span> <span class="o">=</span> <span class="n">nextup</span><span class="p">(</span><span class="n">d</span><span class="p">);</span> <span class="c1">// Next greater FP value.</span>
    <span class="n">overflow</span> <span class="o">=</span> <span class="n">is_inf</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">overflow</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="n">underflow</span> <span class="o">=</span> <span class="p">(</span><span class="n">is_subnormal</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="o">||</span> <span class="n">is_zero</span><span class="p">(</span><span class="n">d</span><span class="p">))</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">underflow</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">d</span><span class="p">;</span>
</code></pre></div></div>
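<p>The <code>two_sum</code> helper called above is not shown in this post.
A plausible shape, assuming it takes both summands plus the already computed sum and returns the positive residual, is sketched below together with a small wrapper (<code>fma_residual</code>, a name made up for this example) that only computes $\tilde{r}$.</p>

```cpp
#include <cmath>

// Assumed shape of the helper: residual of c = a + b, with r = c - c_exact.
template <typename T>
T two_sum(T a, T b, T c) {
  T ad = c - b;
  T bd = c - ad;
  T da = ad - a;
  T db = bd - b;
  return da + db;
}

// Residual approximation of a 32-bit FMA; only its sign and zero-ness matter.
double fma_residual(float a, float b, float c) {
  float d = std::fma(a, b, c);               // 32-bit host FMA (RNE assumed).
  double p = (double)a * (double)b;          // Exact in 64 bits.
  double dd = p + (double)c;                 // May be inexact, captured by r1.
  double r1 = two_sum<double>(p, (double)c, dd);
  double r2 = (double)d - dd;                // 64-bit to 32-bit rounding error.
  return r1 + r2;
}
```

<p>With $a = b = 1 + 2^{-23}$ and $c = 0$, the summation is exact ($r_1 = 0$), and the whole residual of $-2^{-46}$ stems from $r_2$.
With $c = 2^{30}$ instead, the 64-bit summation already loses the $2^{-46}$ term, so both $r_1$ and $r_2$ contribute.</p>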

<h3 id="66-fast-64-bit-operations">6.6 Fast 64-bit Operations</h3>
<p>The previous upcast algorithms UpMul, UpDiv, UpSqrt, and also the <abbr title="Fused Multiply-Add">FMA</abbr> algorithm according to Sarrazin et al.
<a class="citation" href="#sarrazin2016">[39]</a>,
are all based on a larger data type that can perform multiplications exactly.
As mentioned earlier, these algorithms reach their limitations for 64-bit values on x64 systems.
To circumvent these limitations, the fused multiply-add (<abbr title="Fused Multiply-Add">FMA</abbr>) instruction of the x64 <abbr title="Instruction Set Architecture">ISA</abbr> can be used.
This instruction is specified in the FMA3/FMA4 instruction set extensions, and FMA3 is available on practically all modern x64 processors.
For example, using <abbr title="Fused Multiply-Add">FMA</abbr>, the residual of the UpMul algorithm can be calculated as follows:</p>

<p>\begin{equation}
  \label{eq:example-div}
  \begin{aligned}
    r’ &amp; = \circ_{64}(a \cdot b - \circ_{64}(a \cdot b)) = \circ_{64}(c_{exact} - c)
  \end{aligned}
\end{equation}</p>

<p>However, the rounding step at the end of each <abbr title="Fused Multiply-Add">FMA</abbr> instruction poses a problem.
Although an <abbr title="Fused Multiply-Add">FMA</abbr> instruction calculates all intermediate results with infinite precision, the result is eventually rounded.
In the example shown, the exact residual may not be representable as a 64-bit <abbr title="Floating Point">FP</abbr> value.
In that case, $r’$ is rounded to 0, and one would wrongly assume an exact result although the actual residual is different from 0.
Hence, $r’=r$ does not hold in all cases.</p>
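
<p>The effect can be reproduced without an <abbr title="Fused Multiply-Add">FMA</abbr> unit.
The following Python sketch is only an illustration under stated assumptions: <code>fractions.Fraction</code> serves as a stand-in for the infinitely precise intermediate result, <code>float()</code> models the final rounding step, and the helper name <code>fma_residual</code> as well as the inputs are made up for this example.</p>

```python
from fractions import Fraction


def fma_residual(a: float, b: float):
    """Return (r_rounded, r_exact) for the residual a*b - round(a*b)."""
    c = a * b  # Product rounded to 64 bit by the host (RNE).
    r_exact = Fraction(a) * Fraction(b) - Fraction(c)  # Infinitely precise residual.
    r_rounded = float(r_exact)  # Final rounding step, as an FMA would do it.
    return r_rounded, r_exact


a = 1.0 + 2.0**-52
# c is around 2^-1000: the exact residual 2^-1104 is rounded to 0.
print(fma_residual(a, 2.0**-1000 * a))
# c is around 2^-968: the exact residual 2^-1072 is still representable.
print(fma_residual(a, 2.0**-968 * a))
```

<p>In the first call, the rounded residual is 0 although the multiplication was inexact, which is exactly the case that has to be excluded.</p>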

<p>Consequently, bounds must be determined for which $r’$ is no longer representable.
Since $r’$ is the direct result of the subtraction of $c_{exact}$ and $c$, we have to determine the smallest nonzero distance between these two numbers.
This distance is $|d| \geq 2^{e_c - 2p_d}$, where $e_c$ is the exponent of $c$ and $p_d = 53$ is the number of significand bits of a 64-bit <abbr title="Floating Point">FP</abbr> value.
The doubled significand width $2p_d$ follows from the exact intermediate results of the <abbr title="Fused Multiply-Add">FMA</abbr> instruction.
As explained previously, $2p$ significand bits are needed for the exact representation of a $p$-bit multiplication.
In order to represent $r’$ as a 64-bit <abbr title="Floating Point">FP</abbr> value, $e_c - 2p_d \geq e_{d,min} - p_d + 1$ must hold.
A simple rearrangement leads to the following inequality:
\begin{equation}
  \label{eq:example-div-bound}
  \begin{aligned}
    e_c \geq e_{d,min} + p_d + 1  = -1022 + 53 + 1  = -968
  \end{aligned}
\end{equation}
If $|c|$ is less than $2^{-968}$, my method cannot be used, and the instruction has to be calculated using soft float.
However, the range below $2^{-968}$ represents less than 3% of all 64-bit <abbr title="Floating Point">FP</abbr> values.
In practice, it’s even less, as most <abbr title="Floating Point">FP</abbr> values are centered around 1.
To back up this statement, I ran 78 different <abbr title="Floating Point">FP</abbr> benchmarks and tracked the input and output exponents of all 64-bit arithmetic <abbr title="Floating Point">FP</abbr> instructions:</p>
<div style="text-align:center">
<img src="/assets/fast_floating_point_simulation/exp_dist_overlay.svg" alt="Exponent distribution in FP benchmarks" width="95%" />
</div>
<p><br />
As you can see, on average less than 0.1% of the values have a magnitude below $2^{-968}$.</p>
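
<p>The numbers above can be double-checked with a few lines of Python.
This is just a worked calculation of the inequality, not part of the actual implementation. The variable names are mine, and the share is counted per exponent field, assuming every field holds the same number of values.</p>

```python
import math

E_MIN = -1022  # Smallest normal exponent of a 64-bit FP value.
P = 53         # Significand bits (including the hidden bit).

bound = E_MIN + P + 1               # Exponent bound from the inequality.
threshold = math.ldexp(1.0, bound)  # 2^bound as a double.

# Exponent fields 0..54 lie below the bound (field 0 holds zeros and subnormals).
fields_below = bound + 1023
share = fields_below / 2047         # 2047 exponent fields hold finite values.

print(bound, threshold, f"{share:.2%}")
```

<p>The resulting threshold is the value used as a constant in the division example, and the share stays below the 3% mentioned above.</p>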

<p>A C/C++ example for the 64-bit division is given in the following code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">abs</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">4.008336720017946e-292</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">soft</span><span class="o">::</span><span class="n">div</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">);</span>

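<span class="kt">double</span> <span class="n">c</span> <span class="o">=</span> <span class="n">a</span> <span class="o">/</span> <span class="n">b</span><span class="p">;</span> <span class="c1">// Native division on the host.</span>
<span class="kt">double</span> <span class="n">r</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">fma</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="o">-</span><span class="n">a</span><span class="p">);</span>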
<span class="k">if</span> <span class="p">(</span><span class="n">r</span> <span class="o">!=</span> <span class="mf">0.0</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">inexact</span> <span class="o">=</span> <span class="nb">true</span><span class="p">;</span>
  <span class="n">underflow</span> <span class="o">=</span> <span class="p">(</span><span class="n">is_subnormal</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="o">||</span> <span class="n">is_zero</span><span class="p">(</span><span class="n">c</span><span class="p">))</span> <span class="o">?</span> <span class="nb">true</span> <span class="o">:</span> <span class="n">underflow</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
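
<p>To see how the sign of the residual can be used to rectify a result, here is a small Python sketch for a division under <abbr title="Round Up">RUP</abbr> rounding.
It is a simplified model and not the code used in the simulator: the exact residual is computed with <code>fractions.Fraction</code> instead of an <abbr title="Fused Multiply-Add">FMA</abbr>, the function name <code>div_round_up</code> is hypothetical, and overflows, underflows, NaNs, and divisions by zero are ignored. <code>math.nextafter</code> requires Python 3.9 or newer.</p>

```python
import math
from fractions import Fraction


def div_round_up(a: float, b: float):
    """Divide with RNE on the host, then rectify the result to round-up (RUP)."""
    c = a / b  # Host result, rounded to nearest.
    residual = Fraction(a) - Fraction(c) * Fraction(b)  # Exact a - c*b.
    inexact = residual != 0
    # c lies below the exact quotient iff residual and b share the same sign.
    if inexact and (residual > 0) == (b > 0):
        c = math.nextafter(c, math.inf)  # Next greater FP value.
    return c, inexact


print(div_round_up(1.0, 3.0))  # Inexact, host result was rounded down.
print(div_round_up(1.0, 4.0))  # Exact, nothing to rectify.
```

<p>For 1/3, the host rounds down, so the sketch moves to the next greater value. For 1/4, the residual is 0 and the inexact flag stays clear.</p>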

<h2 id="7-results--discussion">7. Results &amp; Discussion</h2>

<h3 id="71-clean-room-benchmarks">7.1 Clean Room Benchmarks</h3>
<p>In this section, I show the results of some clean room benchmarks.
The goal was to assess the maximum performance of each individual instruction
for soft float, floppy float (my approach), and hard float (native <abbr title="Floating Point">FP</abbr> instructions).
That means inputs and outputs are never subnormal, there are no data dependencies between the instructions, standard rounding is used,
and there’s no DBT overhead.
While floppy float and hard float aren’t really sensitive to different kinds of input data (except subnormals),
soft float is, because of its control-flow-heavy calculations.
In general, the input data was designed to favor the optimistic paths in soft float.
So, let’s take a look at the results:</p>

<div style="text-align:center">
<img src="/assets/fast_floating_point_simulation/sf_hf_ff_comparison.svg" alt="HF vs. FF vs. SF" width="80%" />
</div>

<p>As you can see, simply executing <abbr title="Floating Point">FP</abbr> instructions one after another (hard float) achieves around 8500 MIPS for instructions that can be executed in one cycle (max, min, add, sub, etc.).
This is explained by the <abbr title="Floating Point">FP</abbr> pipeline of the host processor, which was an AMD Ryzen Threadripper 3990X in my case.
Most <abbr title="Floating Point">FP</abbr> instructions can use 2 of 4 <abbr title="Floating Point">FP</abbr> pipes provided by the Zen 2 microarchitecture, leading to $8500\,\text{MIPS} \approx 2 \cdot 4.3\,\text{GHz}$.
Some instructions, such as division, square root, or 64-bit multiplication, require multiple cycles, which results in lower performance.
Nevertheless, hard float is faster than soft and floppy float in all cases.
The performance of the floppy float approach is in the range of 300-600 MIPS, and is faster than soft float by up to $5 \times$ in some operations, such as square root.
For lightweight operations, such as min or max, there is no significant difference between soft and floppy float.</p>

<h3 id="72-my-method-vs-qemu">7.2 My Method vs. QEMU</h3>
<p>Since my approach is intended to accelerate <abbr title="Floating Point">FP</abbr> performance in DBT simulators,
a practical performance assessment is indispensable.
For this purpose, I integrated my approach, the method by Cota et al. <a class="citation" href="#cota2019">[32]</a> (QEMU’s method),
and Bellard’s SoftFP <a class="citation" href="#bellard2018">[25]</a>
into MachineWare’s DBT-based RISC-V simulator SIM-V <a class="citation" href="#simv2022">[1]</a>.
I then conducted a performance analysis using well-known <abbr title="Floating Point">FP</abbr> benchmarks
such as linpack, NPB, SPEC CPU 2017, and other representative workloads.
The results can be found in the following graph:</p>
<div id="hard-floppy-soft" style="text-align:center">
<img src="/assets/fast_floating_point_simulation/hard_vs_fast_combined.svg" alt="QEMU vs. my method." width="75%" />
</div>
<p><br />
In the graph, the speedups of the individual benchmarks are shown relative to the soft float method, which serves as the baseline.
All benchmarks in Subplot a) were executed with the default <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding, while Subplot b) represents the same benchmarks under <abbr title="Round Up">RUP</abbr> rounding.
Please note that this graph does not compare SIM-V with QEMU!
It’s only QEMU’s method implemented in SIM-V!
Since SIM-V uses multiple other techniques to speed up simulations, a comparison wouldn’t be fair.</p>

<p>As can be seen in the graph, QEMU’s method and my approach achieve a speedup of $3\times$ in a best-case scenario (see NPB/ft.A and 508.namd in Subplot a).
Also, in most cases, the performance of my approach is equal to the performance of QEMU’s approach when <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding is used.
As explained previously, my approach is only faster when underflows occur, when no inexact flag has been set yet, or when a non-default rounding mode is used.
Since most applications already set an inexact flag after a few executed instructions, the speedup gained from an accelerated inexact calculation is marginal.
Also, underflows are rare, as I could confirm with a separate <a href="/risc-v/2023/08/06/evaluation-riscv-fd.html#63-subnormal-numbers--underflows">instruction and data study</a>.
For example, in the case of the NPB/ft.A benchmark, not a single underflow occurred in a total of 3,875,127,289 executed fmadd instructions.</p>
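
<p>Why a sticky inexact flag leaves so little to gain can be sketched in a few lines of Python.
This is a toy model under my own assumptions and not the logic of QEMU or SIM-V: the function name <code>run_multiplies</code> is hypothetical, and the exactness check uses <code>fractions.Fraction</code> instead of a residual computed by the host.</p>

```python
from fractions import Fraction


def run_multiplies(pairs):
    """Multiply all pairs and count how often exactness had to be checked."""
    inexact = False      # Sticky flag: once set, it stays set.
    residual_checks = 0
    for a, b in pairs:
        c = a * b        # Host multiplication.
        if not inexact:  # Only needed while the flag is still clear.
            residual_checks += 1
            inexact = Fraction(a) * Fraction(b) != Fraction(c)
    return inexact, residual_checks


print(run_multiplies([(2.0, 3.0), (0.1, 0.1), (0.1, 0.3), (1.5, 1.5)]))
```

<p>After the first inexact result (the second multiplication here), no further checks are required, no matter how many instructions follow.</p>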

<p>To demonstrate the advantages of my method, I ran all benchmarks again under <abbr title="Round Up">RUP</abbr> rounding, which is depicted in Subplot b).
Here we can see that QEMU’s method is slower than soft float in all cases.
This can be attributed to the fact that QEMU first checks the rounding mode before resorting to soft float.
My method, however, can rectify the result for most instructions and set the exception flags without using soft float.
Thus, speedups of 50% over QEMU are achieved for benchmarks like linpack32.
Since the speedup of my method depends on the executed instructions, we observe a heterogeneous picture of results.
Moreover, the speedups under <abbr title="Round to Nearest, Ties to Even">RNE</abbr> cannot be used to infer the speedups under <abbr title="Round Up">RUP</abbr>.
As described previously, we do not have a method for 64-bit <abbr title="Fused Multiply-Add">FMA</abbr> instructions,
and all presented approaches require fewer checks when working on 32-bit data.
Hence, single-precision benchmarks, such as linpack32 or machine learning applications (lenet, alexnet), achieve higher speedups in non-default rounding modes.
Applications that comprise many 64-bit <abbr title="Fused Multiply-Add">FMA</abbr> instructions achieve low to no speedup (see NPB/bt.A and NPB/cg.A).</p>

<!-- TODO: semihosting stuff? See assets -->
<h2 id="8-conclusion--outlook">8. Conclusion &amp; Outlook</h2>
<p>In this post, I showed how floating point arithmetic is calculated in emulators/simulators, such as QEMU, gem5, or Rosetta 2.
To the best of my knowledge, this post provides the most complete picture of this topic to date.
But if you find more literature worth citing, let me know!</p>

<p>Besides just providing a related work overview, I showed how the QEMU approach can be
improved to also perform well for other rounding modes.
I implemented my method in MachineWare’s SIM-V RISC-V simulator and beat QEMU’s method by more than 50% in the best case.
For the vanilla <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding mode, I couldn’t achieve any speedups for standard benchmarks.
This is because the exception bits are sticky: once a flag is set, it does not have to be recalculated.
I later noticed that PowerPC has non-sticky exception flags, which require a recalculation for every instruction.
Hence, I guess my method could significantly speed up PowerPC simulations even for standard benchmarks with <abbr title="Round to Nearest, Ties to Even">RNE</abbr> rounding.</p>

<p>One important missing piece of this work is an efficient algorithm for 64-bit <abbr title="Fused Multiply-Add">FMA</abbr> instructions.
Unfortunately, these instructions occur relatively frequently, costing us a significant chunk of performance for some benchmarks.
I found an interesting work by Boldo and Muller <a class="citation" href="#boldo2011">[40]</a>, which provides an algorithm to calculate the residual for <abbr title="Fused Multiply-Add">FMA</abbr> instructions.
So exactly what I need!
But I wasn’t able to get it running correctly for whatever reason…
Since their paper is basically 8 pages of mathematical proofs, I leave this as a problem for other people and future Niko.</p>

<p>If you have remarks, questions, or just want to say “hello”, feel free to write me a <a href="/about/">mail</a>!</p>

<h2 id="9-references">9. References</h2>

<ol class="bibliography"><li><span id="simv2022">[1]L. Jünger, J. H. Weinstock, and R. Leupers, “SIM-V: Fast, Parallel RISC-V Simulation for Rapid Software Verification,” <i>DVCON Europe 2022</i>, 2022. </span></li>
<li><span id="ieee7541985">[2]“IEEE Standard for Binary Floating-Point Arithmetic,” <i>ANSI/IEEE Std 754-1985</i>, pp. 1–20, 1985, doi: 10.1109/IEEESTD.1985.82928. </span></li>
<li><span id="ieee7542008">[3]“IEEE Standard for Floating-Point Arithmetic,” <i>IEEE Std 754-2008</i>, pp. 1–70, 2008, doi: 10.1109/IEEESTD.2008.4610935. </span></li>
<li><span id="ieee7542019">[4]“IEEE Standard for Floating-Point Arithmetic,” <i>IEEE Std 754-2019 (Revision of IEEE 754-2008)</i>, pp. 1–84, 2019, doi: 10.1109/IEEESTD.2019.8766229. </span></li>
<li><span id="riscv2019">[5]RISC-V Foundation, <i>The RISC-V Instruction Set Manual</i>, Volume I: User-Level ISA, Document Version 20191213. 2019 [Online]. Available at: https://riscv.org/wp-content/uploads/2019/12/riscv-spec-20191213.pdf</span></li>
<li><span id="higham2002">[6]N. J. Higham, <i>Accuracy and Stability of Numerical Algorithms</i>, 2nd ed. USA: Society for Industrial and Applied Mathematics, 2002. </span></li>
<li><span id="dooley2006">[7]I. Dooley and L. Kale, “Quantifying the interference caused by subnormal floating-point values,” Jan. 2006. </span></li>
<li><span id="thakkur1999">[8]S. Thakkur and T. Huff, “Internet Streaming SIMD Extensions,” <i>Computer</i>, vol. 32, no. 12, pp. 26–34, 1999, doi: 10.1109/2.809248. </span></li>
<li><span id="waterman2016">[9]A. Waterman, “Design of the RISC-V Instruction Set Architecture,” 2016 [Online]. Available at: https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.pdf</span></li>
<li><span id="wikipedianan">[10]“Wikipedia - Comparison with NaN.” [Online]. Available at: https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN</span></li>
<li><span id="hough2019">[11]D. G. Hough, “The IEEE Standard 754: One for the History Books,” <i>Computer</i>, vol. 52, no. 12, pp. 109–112, 2019, doi: 10.1109/MC.2019.2926614. </span></li>
<li><span id="riscv2017">[12]RISC-V Foundation, <i>The RISC-V Instruction Set Manual</i>, Volume I: User-Level ISA, Document Version 2.2. 2017 [Online]. Available at: https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf</span></li>
<li><span id="nan-box-rfc">[13]A. Bradbury, “NaN Boxing RFC.” Mar-2017 [Online]. Available at: https://gist.github.com/asb/a3a54c57281447fc7eac1eec3a0763fa</span></li>
<li><span id="nan-box-google">[14]A. Bradbury, “NaN Boxing ISA-Dev Group.” Mar-2017 [Online]. Available at: https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/_r7hBlzsEd8/m/z1rjr2BaAwAJ</span></li>
<li><span id="gem52011">[15]N. Binkert <i>et al.</i>, “The Gem5 Simulator,” <i>SIGARCH Comput. Archit. News</i>, vol. 39, no. 2, pp. 1–7, Aug. 2011, doi: 10.1145/2024716.2024718. [Online]. Available at: https://doi.org/10.1145/2024716.2024718</span></li>
<li><span id="spike2022">[16]RISC-V Foundation, “Spike RISC-V ISA Simulator.” [Online]. Available at: https://github.com/riscv-software-src/riscv-isa-sim</span></li>
<li><span id="riscvvp-bremen">[17]V. Herdt, D. Große, P. Pieper, and R. Drechsler, “AGRA Uni Bremen RISC-VP.” [Online]. Available at: https://github.com/agra-uni-bremen/riscv-vp</span></li>
<li><span id="herdt2020">[18]V. Herdt, D. Große, P. Pieper, and R. Drechsler, “RISC-V based virtual prototype: An extensible and configurable platform for the system-level,” <i>Journal of Systems Architecture</i>, vol. 109, p. 101756, 2020, doi: https://doi.org/10.1016/j.sysarc.2020.101756. [Online]. Available at: https://www.sciencedirect.com/science/article/pii/S1383762120300503</span></li>
<li><span id="whispergithub">[19]“Whisper Github Repository.” CHIPS Alliance [Online]. Available at: https://github.com/chipsalliance/VeeR-ISS</span></li>
<li><span id="bochsgithub">[20]K. P. Lawton, “Bochs Github Repository.” [Online]. Available at: https://github.com/bochs-emu/Bochs</span></li>
<li><span id="lawton1996bochs">[21]K. P. Lawton, “Bochs: A Portable PC Emulator For Unix/X,” <i>Linux Journal</i>, vol. 1996, no. 29es, pp. 7–es, 1996. </span></li>
<li><span id="rvsimgithub">[22]S. Kochen, “rvsim.” [Online]. Available at: https://github.com/stephank/rvsim</span></li>
<li><span id="bellard2005">[23]F. Bellard, “QEMU, a Fast and Portable Dynamic Translator,” in <i>Proceedings of the Annual Conference on USENIX Annual Technical Conference</i>, USA, 2005, p. 41. </span></li>
<li><span id="hauser1996">[24]J. R. Hauser, “Berkeley SoftFloat.” 1996 [Online]. Available at: https://github.com/ucb-bar/berkeley-softfloat-3</span></li>
<li><span id="bellard2018">[25]F. Bellard, “SoftFP.” 2018 [Online]. Available at: https://bellard.org/softfp/</span></li>
<li><span id="flip2004">[26]C. Bertin <i>et al.</i>, “A floating-point library for integer processors,” <i>Proceedings of SPIE - The International Society for Optical Engineering</i>, vol. 5559, Oct. 2004, doi: 10.1117/12.557168. </span></li>
<li><span id="perotti2022">[27]M. Perotti, G. Tagliavini, S. Mach, L. Bertaccini, and L. Benini, “RVfplib: A Fast and Compact Open-Source Floating-Point Emulation Library for Tiny RISC-V Processors,” in <i>Embedded Computer Systems: Architectures, Modeling, and Simulation</i>, Cham, 2022, pp. 16–32. </span></li>
<li><span id="muller2010">[28]J.-M. Muller <i>et al.</i>, <i>Handbook of Floating-Point Arithmetic</i>. 2010. </span></li>
<li><span id="rv8">[29]M. Clark and B. Hoult, “rv8 - RISC-V simulator for x86-64.” [Online]. Available at: https://github.com/michaeljclark/rv8</span></li>
<li><span id="clark2017">[30]M. Clark and B. Hoult, “rv8: a high performance RISC-V to x86 binary translator,” <i>CARRV</i>, Oct. 2017, doi: 10.13140/RG.2.2.30957.69601. </span></li>
<li><span id="guo2016">[31]Y.-C. Guo, W. Yang, J.-Y. Chen, and J.-K. Lee, “Translating the ARM Neon and VFP Instructions in a Binary Translator,” <i>Softw. Pract. Exper.</i>, vol. 46, no. 12, Dec. 2016. </span></li>
<li><span id="cota2019">[32]E. G. Cota and L. P. Carloni, “Cross-ISA Machine Instrumentation Using Fast and Scalable Dynamic Binary Translation,” in <i>Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments</i>, New York, NY, USA, 2019, pp. 74–87, doi: 10.1145/3313808.3313811 [Online]. Available at: https://doi.org/10.1145/3313808.3313811</span></li>
<li><span id="dekker1971">[33]T. J. Dekker, “A floating-point technique for extending the available precision,” <i>Numerische Mathematik</i>, vol. 18, pp. 224–242, 1971. </span></li>
<li><span id="rosetta22020">[34]Apple Inc., “Apple announces Mac transition to Apple silicon.” Jun-2020 [Online]. Available at: https://www.apple.com/newsroom/2020/06/apple-announces-mac-transition-to-apple-silicon/</span></li>
<li><span id="rosetta2022">[35]D. Johnson, “Why is Rosetta 2 fast?” [Online]. Available at: https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/</span></li>
<li><span id="applem12020">[36]Apple Inc., “Apple unleashes M1.” Nov-2020 [Online]. Available at: https://www.apple.com/newsroom/2020/11/apple-unleashes-m1/</span></li>
<li><span id="armreference2022">[37]“ARM Architecture Reference Manual.” ARM [Online]. Available at: https://developer.arm.com/documentation/ddi0487/latest</span></li>
<li><span id="you2019">[38]Y.-P. You, T.-C. Lin, and W. Yang, “Translating AArch64 Floating-Point Instruction Set to the X86-64 Platform,” in <i>Proceedings of the 48th International Conference on Parallel Processing: Workshops</i>, 2019. </span></li>
<li><span id="sarrazin2016">[39]G. Sarrazin, N. Brunie, and F. Pétrot, “Virtual Prototyping of Floating Point Units,” 2016. </span></li>
<li><span id="boldo2011">[40]S. Boldo and J.-M. Muller, “Exact and Approximated Error of the FMA,” <i>IEEE Transactions on Computers</i>, vol. 60, no. 2, pp. 157–164, 2011, doi: 10.1109/TC.2010.139. </span></li>
<li><span id="riscv-arch-test">[41]N. Gala and M. Karasek, “RISC-V Architecture Test.” [Online]. Available at: https://github.com/riscv-non-isa/riscv-arch-test</span></li>
<li><span id="moller1965">[42]O. Møller, “Quasi Double-Precision in Floating Point Addition,” <i>BIT</i>, vol. 5, no. 1, pp. 37–50, Mar. 1965. </span></li>
<li><span id="sterbenz1974">[43]P. H. Sterbenz, “Floating Point Computation.” Prentice Hall, 1974. </span></li></ol>]]></content><author><name></name></author><category term="Simulation" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">par-gem5: Parallelizing gem5’s Atomic Mode</title><link href="https://www.chciken.com/simulation/2023/11/11/par-gem5.html" rel="alternate" type="text/html" title="par-gem5: Parallelizing gem5’s Atomic Mode" /><published>2023-11-11T09:15:44+00:00</published><updated>2023-11-11T09:15:44+00:00</updated><id>https://www.chciken.com/simulation/2023/11/11/par-gem5</id><content type="html" xml:base="https://www.chciken.com/simulation/2023/11/11/par-gem5.html"><![CDATA[<p>Most important things first: download the preprint of our paper <a href="/assets/par_gem5/par-gem5-preprint.pdf"><em>par-gem5: Parallelizing gem5’s Atomic Mode</em> here</a>.</p>

<p><strong>What is the paper about?</strong> <br />
The gist of it is a parallelized version of gem5’s atomic mode.
Note that this is for the atomic mode only!
If you are interested in the timing mode, feel free to read our sequel <a href="https://arxiv.org/pdf/2308.09445.pdf"><em>parti-gem5: gem5’s Timing Mode Parallelised</em></a>, which is available on arXiv.</p>

<p><strong>How fast is par-gem5?</strong> <br />
For completely parallel benchmarks we managed to reach speedups of ~25x when simulating a 128-core ARM system on a 128-core x64 host system.
More realistic parallel benchmarks like <a href="https://en.wikipedia.org/wiki/NAS_Parallel_Benchmarks">NPB</a> “only” attain speedups of up to ~12x.
Since par-gem5 creates a thread for each simulated CPU core, the maximum attainable speedup depends on several factors.
This includes: the number of available host threads, the number of simulated target CPUs, and the degree of parallelization in the executed benchmark.
Especially the latter is important.
If you are looking to speed up the execution of a single-core benchmark like Dhrystone, par-gem5 is probably not the right tool for you!</p>

<p><strong>Is par-gem5 easy to use?</strong> <br />
I would say it is fairly simple if you are already familiar with vanilla gem5.
You only have to set a CPU’s event queue and choose a reasonable quantum.
This can all be done in the Python setup scripts with the following lines:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">args</span><span class="p">.</span><span class="n">parallel</span><span class="p">:</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">gem5 going parallel</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">m5</span><span class="p">.</span><span class="n">ticks</span><span class="p">.</span><span class="nf">fixGlobalFrequency</span><span class="p">()</span>
    <span class="n">root</span><span class="p">.</span><span class="n">sim_quantum</span> <span class="o">=</span> <span class="n">m5</span><span class="p">.</span><span class="n">ticks</span><span class="p">.</span><span class="nf">fromSeconds</span><span class="p">(</span><span class="n">m5</span><span class="p">.</span><span class="n">util</span><span class="p">.</span><span class="n">convert</span><span class="p">.</span><span class="nf">anyToLatency</span><span class="p">(</span><span class="sh">"</span><span class="s">500us</span><span class="sh">"</span><span class="p">))</span>
    <span class="n">cpus</span> <span class="o">=</span> <span class="n">system</span><span class="p">.</span><span class="n">cpu_cluster</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">cpus</span>
    <span class="c1"># Note: child objects usually inherit the parent's event queue.
</span>    <span class="k">if</span> <span class="nf">len</span><span class="p">(</span><span class="n">cpus</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">first_cpu_eq</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">cpu</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">cpus</span><span class="p">,</span> <span class="n">first_cpu_eq</span><span class="p">):</span>
            <span class="n">cpu</span><span class="p">.</span><span class="n">eventq_index</span> <span class="o">=</span> <span class="n">idx</span>
</code></pre></div></div>

<p><strong>How accurate and reliable is par-gem5?</strong> <br />
The parallelization approach of par-gem5 is in many regards similar to SystemC TLM-2.0’s so-called <em>temporal decoupling</em>.
That means, rather than having one global time as in vanilla gem5, each simulated CPU resides in its own time and occasionally synchronizes
with the rest of the system at certain barrier points.
The distance of the barrier points is determined by the aforementioned <em>quantum</em>.
For instance, if the quantum is set to 500µs, the maximum time two CPUs can diverge is 500µs.</p>
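
<p>The bound on the time skew can be illustrated with a toy model in Python.
It is only a sketch and has nothing to do with the actual par-gem5 implementation: the function <code>simulate</code> is hypothetical, times are integers in µs, and a random interleaving stands in for CPU threads running in parallel.</p>

```python
import random


def simulate(num_cpus: int, quantum_us: int, end_us: int, seed: int = 0) -> int:
    """Advance all CPUs quantum by quantum and return the largest observed time skew."""
    rng = random.Random(seed)
    local_time = [0] * num_cpus  # Each CPU lives in its own time.
    max_skew = 0
    for barrier in range(quantum_us, end_us + 1, quantum_us):
        running = list(range(num_cpus))
        while running:
            cpu = rng.choice(running)  # Arbitrary interleaving of the CPU threads.
            step = rng.randint(1, quantum_us)
            local_time[cpu] = min(local_time[cpu] + step, barrier)
            max_skew = max(max_skew, max(local_time) - min(local_time))
            if local_time[cpu] == barrier:
                running.remove(cpu)  # This CPU now waits at the barrier.
    return max_skew


print(simulate(num_cpus=4, quantum_us=500, end_us=5000))
```

<p>No matter how the CPUs are interleaved, the returned skew never exceeds the quantum, because no CPU can pass a barrier before all others have reached it.</p>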

<p>Surprisingly, the hardware and software of most modern general purpose CPU systems is pretty resilient to a certain amount of time skew.
If you do not yeet up the quantum to values like 1 second, you can boot Linux systems and run arbitrary software workloads without encountering any problems.
Nevertheless, we are changing the semantics of the simulation and this has a non-negligible impact on multiple aspects.</p>

<p>For instance, if CPUs are communicating with each other, certain messages may be postponed to a barrier point, which in general leads to prolonged simulation times
(the time that is provided in the gem5 statistics, not the so-called wall clock time).
As shown in the paper, a quantum of 1µs seems to keep inaccuracies in the single-digit percentage range while still achieving significant speedups in most benchmarks.</p>

<p>The different time domains are also a problem for some of gem5’s hardware models.
For instance, the ARM timer model casts time differences to unsigned integers, which may result in trouble if the deltas are negative.
Here’s a snippet of the unfixed timer’s impact on the Linux boot timestamps.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gem5       par-gem5
[0.000385] [0.000385]     Mount-cache hash table entries: 32768 [...]
[0.000396] [0.000396]     Mountpoint-cache hash table entries: [...]
[0.024140] [422.828066]   ASID allocator initialised with 128 entries
[0.032140] [3495.801687]  Hierarchical SRCU implementation.
[0.048162] [845.656091]   smp: Bringing up secondary CPUs ...
[0.080218] [5877.941435]  Detected PIPT-Icache on CPU1
</code></pre></div></div>
<p>As you can see, at some point the timer blows up.
That was a pain to debug, but we eventually managed to find the error and fix the timer model.
After fixing some other issues, par-gem5 is now in a state that I would consider quite reliable.
I would not launch a spacecraft with it, but it’s good enough for software development and design space exploration.</p>
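
<p>The kind of timer bug described above can be illustrated with a short Python sketch.
It is not gem5’s actual timer code: the helper <code>unsigned_delta</code> is hypothetical, and the 64-bit width is an assumption made for the example.</p>

```python
MASK64 = (1 << 64) - 1  # Emulates a 64-bit unsigned integer.


def unsigned_delta(now: int, then: int) -> int:
    """Time difference as a 64-bit unsigned value, as a C/C++ cast would produce."""
    return (now - then) & MASK64


print(unsigned_delta(1_000, 900))  # Reference lies in the past: small positive delta.
print(unsigned_delta(900, 1_000))  # Reference lies in the future: huge bogus delta.
```

<p>With a single global time, the second case never happens. With one time domain per CPU, a slightly negative delta wraps around to an enormous value.</p>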

<p><strong>Will par-gem5 be open source?</strong> <br />
Since par-gem5 is the result of an industry project, the source code is not going to be disclosed.</p>

<p><strong>Any Questions?</strong> <br />
Feel free to write me a mail (see <a href="/about">About</a>).</p>]]></content><author><name></name></author><category term="Simulation" /><summary type="html"><![CDATA[Most important things first: download the preprint of our paper par-gem5: Parallelizing gem5’s Atomic Mode here.]]></summary></entry><entry><title type="html">Evaluation of the RISC-V Floating Point Extensions F/D</title><link href="https://www.chciken.com/risc-v/2023/08/06/evaluation-riscv-fd.html" rel="alternate" type="text/html" title="Evaluation of the RISC-V Floating Point Extensions F/D" /><published>2023-08-06T09:55:44+00:00</published><updated>2023-08-06T09:55:44+00:00</updated><id>https://www.chciken.com/risc-v/2023/08/06/evaluation-riscv-fd</id><content type="html" xml:base="https://www.chciken.com/risc-v/2023/08/06/evaluation-riscv-fd.html"><![CDATA[<script>
  window.MathJax = {
  tex: {
    loader: {load: ['[tex]/ams']},
    tex: {packages: {'[+]': ['ams']}},
    tags: 'ams',
    inlineMath: [['$', '$']]
  }
};

</script>

<script id="MathJax-script" async="" src="/assets/common/mathjax/mathjax-3.2.2.js"></script>

<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }

  .center {
    margin-left: auto;
    margin-right: auto;
  }

  td,th {
    font-size: 13px
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#1-introduction">1. Introduction</a></li>
  <li><a href="#2-story--motivation">2. Story &amp; Motivation</a></li>
  <li><a href="#3-history--background">3. History &amp; Background</a>
    <ul>
      <li><a href="#31-risc-v-history-and-basics">3.1 RISC-V History and Basics</a></li>
      <li><a href="#32-the-instructions">3.2 The Instructions</a></li>
      <li><a href="#33-the-registers">3.3 The Registers</a></li>
      <li><a href="#34-the-canonical-qnan">3.4 The Canonical qNaN</a></li>
      <li><a href="#35-nan-boxing">3.5 NaN Boxing</a></li>
    </ul>
  </li>
  <li><a href="#4-methods">4. Methods</a>
    <ul>
      <li><a href="#41-the-applications">4.1 The Applications</a></li>
      <li><a href="#42-the-virtual-platform">4.2 The Virtual Platform</a></li>
    </ul>
  </li>
  <li><a href="#5-related-work">5. Related Work</a></li>
  <li><a href="#6-results--discussion">6. Results &amp; Discussion</a>
    <ul>
      <li><a href="#61-instruction-distribution">6.1 Instruction Distribution</a></li>
      <li><a href="#62-more-on-fclass">6.2 More on FCLASS</a></li>
      <li><a href="#63-subnormal-numbers--underflows">6.3 Subnormal Numbers &amp; Underflows</a></li>
      <li><a href="#64-exponent-distribution">6.4 Exponent Distribution</a></li>
      <li><a href="#65-mantissa-distribution">6.5 Mantissa Distribution</a></li>
      <li><a href="#66-rounding-modes">6.6 Rounding Modes</a></li>
    </ul>
  </li>
  <li><a href="#7-conclusion--outlook">7. Conclusion &amp; Outlook</a></li>
  <li><a href="#8-references">8. References</a></li>
  </ul>
</div>

<h2 id="1-introduction">1. Introduction</h2>
<p>This post is an extended and remastered version of our paper “Evaluation of the RISC-V Floating Point Extensions”.
Feel free to download the preprint version <a href="/assets/riscv_eval/evaluation-of-rv-float-preprint.pdf">here</a>.
The paper and also this post basically comprise two parts.</p>

<p>First, I summarize the history of the RISC-V <abbr title="Floating Point">FP</abbr> extensions F and D.
Additionally, I highlight the RISC-V design rationale and compare it qualitatively against ARM64 and x64.</p>

<p>The second part is a practical evaluation of the RISC-V <abbr title="Floating Point">FP</abbr> extensions F and D.
I used a modified RISC-V <abbr title="Virtual Platform">VP</abbr> to track aspects like the number of executed instructions, distribution of in-/output data,
usage of rounding modes, etc.
Much in the spirit of RISC-V, I provide the <a href="/assets/risc-v_floating_point/traces.zip">data</a> as open access.
Feel free to draw your own conclusions and write me a <a href="/about/">mail</a> if I missed something.</p>

<h2 id="2-story--motivation">2. Story &amp; Motivation</h2>
<p>In 2022 a friend and his colleague asked me to help them with the implementation of fast floating point arithmetic in their RISC-V simulator <em>SIM-V</em> <a class="citation" href="#simv2022">[1]</a>.
Just recently, I wrote a <a href="/simulation/2023/11/12/fast-floating-point-simulation.html">post</a> about it.
As described in the post, every <abbr title="Instruction Set Architecture">ISA</abbr> from ARM64 to RISC-V has its own interpretation of how floating point works.
They don’t differ in major things, but there are many minor aspects where one <abbr title="Instruction Set Architecture">ISA</abbr> does A while the other does B.
<!-- This makes cross-platform floating point simulation a real pain. -->
Ironically, most <abbr title="Instruction Set Architectures">ISAs</abbr> follow the IEEE 754 floating point standard, which was specifically designed to avoid this kind of fragmentation.</p>

<p>Anyway, at one point I wondered why there are so many differences despite having a standard.
Or regarding this from an even higher perspective: How does one design the <abbr title="Floating Point">FP</abbr> part of an <abbr title="Instruction Set Architecture">ISA</abbr>?
Which instructions do you implement? Which data formats do you support? Why should you (not) adhere to IEEE 754? <br />
To quench my thirst for knowledge, I embarked on a semi-successful literature journey.
The prevailing <abbr title="Instruction Set Architectures">ISAs</abbr>, such as x64 and ARM64, are in the hands of big companies.
Hence, they do not disclose any details about their design decisions. <br />
Since RISC-V is an open standard that embraces open discussions,
you find way more information.
Unfortunately, this information is spread around multiple sources.
There is a RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> dev Google group <a class="citation" href="#risc-v-isa-dev">[2]</a>,
a RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> manual GitHub repository <a class="citation" href="#risc-v-isa-manual-repo">[3]</a>,
a RISC-V working groups mailing list <a class="citation" href="#riscv-mailing-lists">[4]</a>,
a RISC-V workshop <a class="citation" href="#riscv-workshop-2015">[5]</a>,
and there are scientific publications <a class="citation" href="#risc-v-geneology">[6], [7]</a>.
One of the goals of this post is to summarize the most important points of all these sources,
with a focus on the floating point extensions F and D.</p>

<p>Additionally, I evaluate these extensions using a large sample of <abbr title="Floating Point">FP</abbr> benchmarks,
because design decisions often seem to be motivated by anecdotal evidence.
But if you look for literature or data, you don’t find anything at all.
So, with this post I also want to provide some data to inform future discussions.</p>

<h2 id="3-history--background">3. History &amp; Background</h2>

<h3 id="31-risc-v-history-and-basics">3.1 RISC-V History and Basics</h3>
<p>To cover a wide range of applications, such as embedded systems or high performance computing,
the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> provides several so-called <em>extensions</em>.
Each of these extensions describes a set of properties, like instructions or registers, which can be assembled to larger systems in a modular way.
This includes the F/D extensions, which extend RISC-V systems with 32-bit and 64-bit <abbr title="Floating Point">FP</abbr> arithmetic respectively.
The extensions for 16-bit (Zfh) and 128-bit <abbr title="Floating Point">FP</abbr> arithmetic (Q) are not considered in this work due to their relatively low popularity in general programming.</p>

<p>Unlike many other extensions, the F and D extensions were already introduced in the first version of the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> manual <a class="citation" href="#risc-isa-manual-2011">[8]</a> in 2011.
This is a bit unfortunate, as there was never a public debate about the F/D extensions’ design.
Judging from the <abbr title="Instruction Set Architecture">ISA</abbr> manual, it looks like these were contributed by <a href="http://www.jhauser.us/">John Hauser</a>.
Or to directly quote the manual <a class="citation" href="#risc-isa-manual-2011">[8]</a>: <em>“John Hauser contributed to the floating-point <abbr title="Instruction Set Architecture">ISA</abbr> definition.”</em><br />
To a large extent, the extensions implement features as mandated in IEEE 754 <a class="citation" href="#waterman2016">[7]</a>:
<em>“RISC-V’s F extension adds single-precision floating-point support, compliant with the 2008 revision of the IEEE 754 standard
for floating-point arithmetic.”</em>
Following the IEEE 754 standard is not the worst idea, and it already defines most parts of the <abbr title="Floating Point">FP</abbr> extensions.
So, in the following subsections I show how RISC-V implements IEEE 754 and how it compares to other <abbr title="Instruction Set Architectures">ISAs</abbr> in that regard.
The properties of the F extension can be transferred 1-to-1 to D except for the bit width.
<!-- The F extension adds 32 FP registers, 1 FP CSR (fcsr), and 29 new instructions to RISC-V systems. --></p>

<h3 id="32-the-instructions">3.2 The Instructions</h3>
<p>The heart of any <abbr title="Instruction Set Architecture">ISA</abbr> is its instructions.
While there were still significant differences between implementations in the early days of computing,
today’s prevalent <abbr title="Floating Point">FP</abbr> <abbr title="Instruction Set Architectures">ISAs</abbr> are similar to a large extent.
This is mainly due to IEEE 754, which specifies the <abbr title="Floating Point">FP</abbr> formats and instructions to be supported by a conforming <abbr title="Instruction Set Architecture">ISA</abbr>.
RISC-V also follows the IEEE 754-2008 standard <a class="citation" href="#ieee754-2008">[9]</a>.
Well, that’s what they say, but more on that in a few seconds.
<!-- In theory RISC-V isn't really compliant with IEEE 754 --></p>

<p>The following table lists all RISC-V F instructions next to their counterparts in x64 and ARM64.
It also shows each instruction’s IEEE 754-2019 status (r=recommended, m=mandated, -=no direct counterpart):</p>

<table>
  <thead>
    <tr>
      <th>x64 <abbr title="Streaming SIMD Extensions">SSE</abbr> <abbr title="Fused Multiply-Add">FMA</abbr></th>
      <th>ARM64</th>
      <th>RISC-V</th>
      <th>IEEE 754-2019</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MOVSS</td>
      <td>LDR</td>
      <td>FLW</td>
      <td>(m) copy(s1)</td>
    </tr>
    <tr>
      <td>MOVSS</td>
      <td>STR</td>
      <td><abbr title="Floating Point 32-bit store">FSW</abbr></td>
      <td>(m) copy(s1)</td>
    </tr>
    <tr>
      <td>VFMADDxxxSS</td>
      <td>FMADD</td>
      <td>FMADD.S</td>
      <td>(m) fusedMultiplyAdd(s1, s2, s3)</td>
    </tr>
    <tr>
      <td>VFMSUBxxxSS</td>
      <td>FMSUB</td>
      <td>FMSUB.S</td>
      <td>(-)</td>
    </tr>
    <tr>
      <td>VFNMADDxxxSS</td>
      <td>FNMADD</td>
      <td>FNMADD.S</td>
      <td>(-)</td>
    </tr>
    <tr>
      <td>VFNMSUBxxxSS</td>
      <td>FNMSUB</td>
      <td>FNMSUB.S</td>
      <td>(-)</td>
    </tr>
    <tr>
      <td>ADDSS</td>
      <td>FADD</td>
      <td>FADD.S</td>
      <td>(m) addition(s1, s2)</td>
    </tr>
    <tr>
      <td>SUBSS</td>
      <td>FSUB</td>
      <td>FSUB.S</td>
      <td>(m) subtraction(s1, s2)</td>
    </tr>
    <tr>
      <td>MULSS</td>
      <td>FMUL</td>
      <td>FMUL.S</td>
      <td>(m) multiplication(s1, s2)</td>
    </tr>
    <tr>
      <td>DIVSS</td>
      <td>FDIV</td>
      <td>FDIV.S</td>
      <td>(m) division(s1, s2)</td>
    </tr>
    <tr>
      <td>SQRTSS</td>
      <td>FSQRT</td>
      <td>FSQRT.S</td>
      <td>(m) squareRoot(s1)</td>
    </tr>
    <tr>
      <td>MOVSS (<a href="#sign-injection">1</a>)</td>
      <td>FMOV (<a href="#sign-injection">1</a>)</td>
      <td>FSGNJ.S (<a href="#sign-injection">1</a>) (FMV.S)</td>
      <td>(m) copy(s1)</td>
    </tr>
    <tr>
      <td>XORPS (<a href="#sign-injection">1</a>)</td>
      <td>FNEG (<a href="#sign-injection">1</a>)</td>
      <td>FSGNJN.S (<a href="#sign-injection">1</a>) (FNEG.S)</td>
      <td>(m) negate(s1)</td>
    </tr>
    <tr>
      <td>ANDPS (<a href="#sign-injection">1</a>)</td>
      <td>FABS (<a href="#sign-injection">1</a>)</td>
      <td>FSGNJX.S (<a href="#sign-injection">1</a>) (FABS.S)</td>
      <td>(m) abs(s1)</td>
    </tr>
    <tr>
      <td>MAXSS (<a href="#maximum-minimum">5</a>)</td>
      <td>FMAX (<a href="#maximum-minimum">5</a>)</td>
      <td>FMAX.S (<a href="#maximum-minimum">5</a>)</td>
      <td>(r) maximumNumber(s1, s2)</td>
    </tr>
    <tr>
      <td>MINSS (<a href="#maximum-minimum">5</a>)</td>
      <td>FMIN (<a href="#maximum-minimum">5</a>)</td>
      <td>FMIN.S (<a href="#maximum-minimum">5</a>)</td>
      <td>(r) minimumNumber(s1, s2)</td>
    </tr>
    <tr>
      <td>CVTSS2SI (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT*S (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT.W.S (<a href="#conversions-and-rounding">2</a>)</td>
      <td>(m) convertToInteger(s1)</td>
    </tr>
    <tr>
      <td>CVTSS2SI (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT*U (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT.WU.S (<a href="#conversions-and-rounding">2</a>)</td>
      <td>(m) convertToInteger(s1)</td>
    </tr>
    <tr>
      <td>MOVD</td>
      <td>FMOV</td>
      <td>FMV.X.W</td>
      <td>(m) copy(s1)</td>
    </tr>
    <tr>
      <td>UCOMISS (<a href="#comparisons">3</a>)</td>
      <td>FCMP (<a href="#comparisons">3</a>)</td>
      <td>FEQ.S (<a href="#comparisons">3</a>)</td>
      <td>(m) compare(Quiet|Signaling)Equal(s1,s2)</td>
    </tr>
    <tr>
      <td>UCOMISS (<a href="#comparisons">3</a>)</td>
      <td>FCMPE (<a href="#comparisons">3</a>)</td>
      <td>FLT.S (<a href="#comparisons">3</a>)</td>
      <td>(m) compare(Quiet|Signaling)Less(s1,s2)</td>
    </tr>
    <tr>
      <td>UCOMISS (<a href="#comparisons">3</a>)</td>
      <td>FCMPE (<a href="#comparisons">3</a>)</td>
      <td>FLE.S (<a href="#comparisons">3</a>)</td>
      <td>(m) compare(Quiet|Signaling)LessEqual(s1,s2)</td>
    </tr>
    <tr>
      <td>- (<a href="#classification">4</a>)</td>
      <td>- (<a href="#classification">4</a>)</td>
      <td>FCLASS.S (<a href="#classification">4</a>)</td>
      <td>(m) class(s1)</td>
    </tr>
    <tr>
      <td>CVTSI2SS</td>
      <td>SCVTF</td>
      <td>FCVT.S.W</td>
      <td>(m) convertFromInt(s1)</td>
    </tr>
    <tr>
      <td>CVTSI2SS</td>
      <td>UCVTF</td>
      <td>FCVT.S.WU</td>
      <td>(m) convertFromInt(s1)</td>
    </tr>
    <tr>
      <td>MOVD</td>
      <td>FMOV</td>
      <td>FMV.W.X</td>
      <td>(m) copy(s1)</td>
    </tr>
    <tr>
      <td>CVTSS2SI (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT*S (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT.L.S (<a href="#conversions-and-rounding">2</a>)</td>
      <td>(m) convertToInteger(s1)</td>
    </tr>
    <tr>
      <td>- (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT*U (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT.LU.S (<a href="#conversions-and-rounding">2</a>)</td>
      <td>(m) convertToInteger(s1)</td>
    </tr>
    <tr>
      <td>CVTSI2SS (<a href="#conversions-and-rounding">2</a>)</td>
      <td>SCVTF (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT.S.L (<a href="#conversions-and-rounding">2</a>)</td>
      <td>(m) convertFromInt(s1)</td>
    </tr>
    <tr>
      <td>- (<a href="#conversions-and-rounding">2</a>)</td>
      <td>UCVTF (<a href="#conversions-and-rounding">2</a>)</td>
      <td>FCVT.S.LU (<a href="#conversions-and-rounding">2</a>)</td>
      <td>(m) convertFromInt(s1)</td>
    </tr>
  </tbody>
</table>

<p>As can be seen in the table, the majority of RISC-V instructions are mandated by IEEE 754 and are consequently also present in x64 and ARM64.
<!-- Yet there are some subtle differences, which are explained in the following. -->
<!-- But first, let me explain why RISC-V does (not) conform to IEEE 754, and why this standard is broken in my humble opinion. --></p>

<!-- First of all, IEEE 754 is paywalled. For the cheap price of 100$ you can get access to a digital version: -->
<!-- <div style="text-align:center"> -->
<!--<img src="/assets/riscv_eval/ieee754-price.png"
alt="IEEE 754 pricing" width="95%"/>
</div> <br> -->
<!-- It's mildly infuriating to make such an important standard basically inaccessible.
But ok, let's focus on technical aspects. -->
<!--
Besides the absurd pricing, another issue are the absurd instruction requirements.
As you could already see in the table, some instructions are mandated, while others are only recommended. -->
<p>What the table doesn’t tell you is which instructions are mandated but not implemented by RISC-V or other <abbr title="Instruction Set Architectures">ISAs</abbr>.
To just name a few instructions mandated by IEEE 754 but not implemented by RISC-V <a class="citation" href="#ieee754-2019">[10]</a>:</p>
<ul>
  <li>The closest <abbr title="Floating Point">FP</abbr> numbers (similar to <a href="https://en.cppreference.com/w/cpp/numeric/math/nextafter">std::nextafter</a>): nextUp(s1), nextDown(s1)</li>
  <li>A division’s remainder: remainder(s1, s2)</li>
  <li>Hex character conversion: convertFromHexCharacter(s1), convertToHexCharacter(s1)</li>
  <li>All kinds of comparisons: greater, greater equal, not greater, not equal, not less.</li>
  <li>Conformance predicates: is754version1985(void), is754version2008(void)</li>
  <li>Classification instructions: isSignMinus(s1), isNormal(s1), isZero(s1), isSubnormal(s1), isInfinite(s1), isSignaling(s1), isCanonical(s1), radix(s1), totalOrder(s1,s2), totalOrderMag(s1,s2)</li>
  <li>Logarithmic stuff: logB(s1), scaleB(s1, format)</li>
</ul>

<p>And this is just the mandated stuff.
There are many more unimplemented instructions which fall into the category of <a href="https://en.wikipedia.org/wiki/IEEE_754#Recommended_operations">“recommended”</a>.
So, what the f is happening here? How can RISC-V (and also the other <abbr title="Instruction Set Architectures">ISAs</abbr>) be compliant with IEEE 754 if they don’t implement all mandated instructions?
Also, how did the RISC-V designers decide which instructions to implement and which not?<br />
Since the literature didn’t help me to answer these questions, I wrote a mail to the RISC-V <abbr title="Floating Point">FP</abbr> contributor <a href="http://www.jhauser.us/">John Hauser</a>.
Much to my surprise, he took the time to answer my stupid questions. Thanks for that!
Anyway, here’s an excerpt from our conversation:</p>

<hr />
<p><br />
Niko: <em>I see many mandated instructions, which aren’t implemented in RISC-V. …</em></p>

<p>John: <em>The IEEE 754 Standard mandates that certain operations be supported, but it does not mandate that each operation be implemented by a single processor machine instruction.  A sequence of multiple machine instructions is a valid implementation, and that extends even to complete software subroutines, which is how many operations such as remainder and binary-decimal conversion are implemented, not only for RISC-V but for many other processors as well.</em></p>

<p>Niko: <em>What was the rationale for the choice of floating point instructions?</em></p>

<p>John: <em>Actually, I had little involvement in choosing the floating-point instructions for RISC-V.</em>
<em>However, I believe the choice was shaped largely by the use of floating-point in “typical” programs, probably starting with the SPEC benchmarks and the GCC libraries.</em></p>

<hr />
<p><br /></p>

<p>As you can see (and as I confirmed with the standard), an implementation of IEEE 754 does not necessarily have to be in hardware.
It can also be in software or in a combination of both.
But nevertheless, labeling RISC-V as compliant doesn’t really make sense.
It’s rather the software running on top of RISC-V that makes it compliant.
Also, following this line of argument, every basic microcontroller could be IEEE 754 compliant if you just have the right software.</p>
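
<p>To illustrate how a mandated operation can live in software, here is a minimal Python sketch of <code class="language-plaintext highlighter-rouge">nextUp</code> for 32-bit floats using only integer operations on the bit pattern.
The function names are hypothetical and the code is not taken from any RISC-V library; it only assumes the standard binary32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits) and inputs that are representable as binary32.</p>

```python
import struct


def f32_to_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Interpret an unsigned 32-bit integer as a binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def next_up_f32(x: float) -> float:
    """Software sketch of IEEE 754 nextUp for binary32 (hypothetical helper)."""
    bits = f32_to_bits(x)
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF and mantissa != 0:
        return x                        # NaN stays NaN
    if bits == 0x7F800000:
        return x                        # nextUp(+inf) is +inf
    if (bits & 0x7FFFFFFF) == 0:
        return bits_to_f32(0x00000001)  # +0 and -0 go to the smallest subnormal
    if (bits >> 31) == 0:
        return bits_to_f32(bits + 1)    # positive: step away from zero
    return bits_to_f32(bits - 1)        # negative: step towards zero


print(next_up_f32(1.0))                 # 1.0 plus one unit in the last place
```

<p>On a real system, the same few integer instructions would operate on a value moved out of an <abbr title="Floating Point">FP</abbr> register (FMV.X.W) and back (FMV.W.X).</p>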

<p>Since we can just choose our instructions as we like, the next logical question is: Which instructions do you implement in hardware?
As John said, usage of <abbr title="Floating Point">FP</abbr> instructions and library functions in benchmarks like SPEC may have shaped the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr>.
If this theory is correct, I should see a broad utilization of all RISC-V instructions in SPEC and probably other benchmarks.
So, why not check if this is really the case?
In section <a href="#61-instruction-distribution">6.1 Instruction Distribution</a>, this theory is put to the test of practice!
But first, the subsequent subsections explain some further peculiarities that distinguish the RISC-V <abbr title="Floating Point">FP</abbr> <abbr title="Instruction Set Architecture">ISA</abbr> from other <abbr title="Instruction Set Architectures">ISAs</abbr>.</p>

<!-- Due to these bloated requirements, there's no fully IEEE 754 compliant ISA as far as I know.
ISAs rather implement a subset of the required instructions.
So, which instruction do you pick?
Intuitively, we can certainly all agree on frequent instructions, such additions or multiplications,
but what about exotic ones like FCLASS?
I couldn't find any information about that, so I decided to run my own experiments and draw my own conclusions in the next section.
But first I have some remarks on individual instructions given by their index in the list. -->

<p><em>1) Sign Injection</em> <br id="sign-injection" />
The three sign injection instructions (FSGNJ, FSGNJN, FSGNJX) were contributed by J. Hauser <a class="citation" href="#risc-isa-manual-2017">[11]</a>
and are unique to RISC-V <a class="citation" href="#risc-v-geneology">[6]</a>.
Their main goal is to implement the operations <code class="language-plaintext highlighter-rouge">copy</code> (FMV in RISC-V), <code class="language-plaintext highlighter-rouge">negate</code> (FNEG in RISC-V), <code class="language-plaintext highlighter-rouge">abs</code> (FABS in RISC-V), and <code class="language-plaintext highlighter-rouge">copySign</code>, which are mandated by the IEEE 754-2019 standard <a class="citation" href="#ieee754-2019">[10]</a>.
Each instruction copies the value of rs1 into rd, except for the sign bit, which is determined as follows:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">FSGNJ rd, rs1, rs2</code>: Sign of rs2. Implements <code class="language-plaintext highlighter-rouge">copy</code> if rs1=rs2, and <code class="language-plaintext highlighter-rouge">copySign</code> otherwise.</li>
  <li><code class="language-plaintext highlighter-rouge">FSGNJN rd, rs1, rs2</code>: Inverted sign of rs2. Implements <code class="language-plaintext highlighter-rouge">negate</code> if rs1=rs2.</li>
  <li><code class="language-plaintext highlighter-rouge">FSGNJX rd, rs1, rs2</code>: XOR of the signs of rs1 and rs2. Implements <code class="language-plaintext highlighter-rouge">abs</code> if rs1=rs2.</li>
</ul>

<p>On x64 systems, the operations <code class="language-plaintext highlighter-rouge">negate</code> and <code class="language-plaintext highlighter-rouge">abs</code> are implemented using XOR and AND instructions with a corresponding bitmask.
For example, <code class="language-plaintext highlighter-rouge">abs</code> uses a mask to zero out the sign bit: <code class="language-plaintext highlighter-rouge">ANDPS  reg, [mask]</code>.</p>
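
<p>The three sign injection variants can be sketched in a few lines of Python operating on raw 32-bit patterns.
This is only an illustration of the semantics described above; the function names mirror the instruction mnemonics, but the helpers themselves are hypothetical.</p>

```python
import struct

SIGN = 0x80000000   # sign bit of a binary32 value
REST = 0x7FFFFFFF   # exponent and mantissa bits


def f32_to_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_f32(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fsgnj(rs1: int, rs2: int) -> int:
    """Exponent and mantissa from rs1, sign from rs2."""
    return (rs1 & REST) | (rs2 & SIGN)


def fsgnjn(rs1: int, rs2: int) -> int:
    """Exponent and mantissa from rs1, inverted sign of rs2."""
    return (rs1 & REST) | ((rs2 ^ SIGN) & SIGN)


def fsgnjx(rs1: int, rs2: int) -> int:
    """Exponent and mantissa from rs1, sign is the XOR of both signs."""
    return rs1 ^ (rs2 & SIGN)


a = f32_to_bits(-1.5)
b = f32_to_bits(2.0)
print(bits_to_f32(fsgnj(a, a)))    # copy:     -1.5
print(bits_to_f32(fsgnjn(a, a)))   # negate:    1.5
print(bits_to_f32(fsgnjx(a, a)))   # abs:       1.5
print(bits_to_f32(fsgnj(b, a)))    # copySign: -2.0
```

<p>With rs1=rs2, the XOR of a sign bit with itself is always 0, which is why FSGNJX yields the absolute value.</p>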

<p><em>2) Conversions and Rounding</em> <br id="conversions-and-rounding" />
For every possible conversion from integer to float and vice versa, RISC-V as well as ARM64 provide the required instructions as mandated by the IEEE 754 standard.
The standard also mentions 5 different rounding modes for these instructions.
Both ARM64 and RISC-V allow this rounding mode to be encoded directly in the instruction.</p>

<p>ARM64 is an interesting case. In theory, there is an rmode field which dictates the rounding direction
(see this <a href="https://developer.arm.com/documentation/ddi0596/2021-12/Index-by-Encoding/Data-Processing----Scalar-Floating-Point-and-Advanced-SIMD?lang=en#float2int">link</a>).
However, it only has 2 bits, which makes encoding 5 rounding modes impossible.
So, “ties to even” and “ties away” share the same rmode value and differ in other aspects of the encoding
(00 = ties to even or ties away, 01 = plus infinity, 10 = minus infinity, 11 = toward zero).</p>

<p>In RISC-V, the rounding mode is given by a dedicated 3-bit field (rm) in an <abbr title="Floating Point">FP</abbr> instruction’s encoding.
Hence, we have:</p>
<ul>
  <li>000: Round to Nearest, ties to Even</li>
  <li>001: Round towards Zero</li>
  <li>010: Round Down</li>
  <li>011: Round Up</li>
  <li>100: Round to Nearest, ties to Max Magnitude</li>
  <li>101: Reserved for future use</li>
  <li>110: Reserved for future use</li>
  <li>111: Dynamic - use rounding mode from fcsr</li>
</ul>

<p>There’s even space for two more rounding modes in case IEEE 754 decides to bother us with new inventions.</p>
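
<p>The effect of the rm encodings listed above can be sketched in Python by mapping each value to the matching rounding constant of the standard <code class="language-plaintext highlighter-rouge">decimal</code> module.
This is a rough model of a float-to-integer conversion only: the helper names are made up, the position of the rm field (bits 14:12) follows the RISC-V encoding, and NaNs, infinities, and saturation on overflow are deliberately ignored.</p>

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# rm encoding -> rounding behavior (names in comments follow the list above)
RM_TO_ROUNDING = {
    0b000: ROUND_HALF_EVEN,  # round to nearest, ties to even
    0b001: ROUND_DOWN,       # round towards zero
    0b010: ROUND_FLOOR,      # round down (towards minus infinity)
    0b011: ROUND_CEILING,    # round up (towards plus infinity)
    0b100: ROUND_HALF_UP,    # round to nearest, ties to max magnitude
}
DYNAMIC = 0b111


def rm_field(instruction: int) -> int:
    """Extract the 3-bit rm field (bits 14:12) from a 32-bit instruction word."""
    return (instruction >> 12) & 0b111


def round_to_integer(value: float, rm: int, frm: int = 0b000) -> int:
    """Round value to an integer according to rm; frm models the fcsr setting."""
    if rm == DYNAMIC:
        rm = frm
    if rm not in RM_TO_ROUNDING:
        raise ValueError("reserved rounding mode")
    return int(Decimal(value).to_integral_value(rounding=RM_TO_ROUNDING[rm]))


for rm in sorted(RM_TO_ROUNDING):
    print(format(rm, "03b"), round_to_integer(2.5, rm), round_to_integer(-2.5, rm))
```

<p>The tie cases 2.5 and -2.5 are chosen on purpose, because they are where the two “nearest” modes and the three directed modes visibly disagree.</p>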

<p>A similar approach can be found in AVX-512, where it is also possible to encode the rounding mode in the instruction.
Without AVX-512, the rounding mode on x64 systems has to be set in the <abbr title="Floating Point">FP</abbr> <abbr title="Control and Status Register">CSR</abbr> (mxcsr).
In addition, x64 lacks instructions to convert from unsigned 64-bit integers to float and vice versa.</p>

<p><em>3) Comparisons</em><br id="comparisons" />
While RISC-V provides comparisons, such as equal (FEQ) or less than (FLT), as instructions that write their result directly into an integer register,
ARM64 and x64 take a different approach.
Here, instructions such as FCMP and UCOMISS set flags in status registers, which subsequent instructions then evaluate.</p>

<p><em>4) Classification</em><br id="classification" />
An instruction which cannot be found in ARM64 and x64 is FCLASS.
The instruction classifies an <abbr title="Floating Point">FP</abbr> number into one of the classes shown in the following table and returns the result using a <a href="https://en.wikipedia.org/wiki/One-hot">one-hot encoding</a>, i.e., exactly one bit of rd is set:</p>

<!-- | rd | meaning    | rd | meaning    |
|----|------------|----|------------|
| 0  | $-\infty$  | 5  | +subnormal |
| 1  | -normal    | 6  | +normal    |
| 2  | -subnormal | 7  | $+\infty$  |
| 3  | $-0$       | 8  | sNaN       |
| 4  | $+0$       | 9  | qNaN       | -->

<div style="text-align:center">
<table style="width:50%;margin-left: auto; margin-right: auto;">
    <tr>
        <th>rd bit</th>
        <th>meaning</th>
        <th>rd bit</th>
        <th>meaning</th>
    </tr>
    <tr>
        <td>0</td>
        <td>$-\infty$</td>
        <td>5</td>
        <td>+subnormal</td>
    </tr>
    <tr>
        <td>1</td>
        <td>-normal</td>
        <td>6</td>
        <td>+normal</td>
    </tr>
    <tr>
        <td>2</td>
        <td>-subnormal</td>
        <td>7</td>
        <td>$+\infty$</td>
    </tr>
    <tr>
        <td>3</td>
        <td>$-0$</td>
        <td>8</td>
        <td>sNaN</td>
    </tr>
    <tr>
        <td>4</td>
        <td>$+0$</td>
        <td>9</td>
        <td>qNaN</td>
    </tr>
</table>
</div>

<p>This allows software to react quickly to the classification by ANDing the result with a bitmask.
The operation is recommended, but not mandated, by IEEE 754-1985 <a class="citation" href="#ieee754-1985">[12]</a>, where it is referred to as <code class="language-plaintext highlighter-rouge">Class(x)</code>.
With IEEE 754-2008 <a class="citation" href="#ieee754-2008">[9]</a>, it was redeclared as mandatory and renamed to <code class="language-plaintext highlighter-rouge">class(x)</code>.
I searched through old IEEE 754 meeting minutes for quite a while, but I couldn’t find anything about the rationale for this decision.
Please write me a mail if you know more!</p>
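
<p>The one-hot result and the bitmask check can be sketched in Python as follows.
The function <code class="language-plaintext highlighter-rouge">fclass_s</code> is a hypothetical software model of FCLASS.S that works on the raw 32-bit pattern and uses the bit positions from the table above; the mask names are made up for this example.</p>

```python
import struct


def f32_to_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fclass_s(bits: int) -> int:
    """Return the one-hot class mask of a binary32 bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:
        if mantissa == 0:
            index = 0 if sign else 7                    # -inf / +inf
        else:
            index = 9 if mantissa & 0x400000 else 8     # qNaN / sNaN
    elif exponent == 0:
        if mantissa == 0:
            index = 3 if sign else 4                    # -0 / +0
        else:
            index = 2 if sign else 5                    # -subnormal / +subnormal
    else:
        index = 1 if sign else 6                        # -normal / +normal
    return 1 << index


# A single AND answers questions that span several classes at once.
NAN_MASK = (1 << 8) | (1 << 9)
SUBNORMAL_MASK = (1 << 2) | (1 << 5)

print(bool(fclass_s(0x7FC00000) & NAN_MASK))            # True: quiet NaN
print(bool(fclass_s(f32_to_bits(1.0)) & NAN_MASK))      # False: normal number
print(bool(fclass_s(0x00000001) & SUBNORMAL_MASK))      # True: smallest subnormal
```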

<p>The classification instruction can be found in other <abbr title="Instruction Set Architectures">ISAs</abbr> as well,
including Intel i960 (CLASS{R/RL}) <a class="citation" href="#80960-programmers-manual">[13]</a>,
LoongArch (FCLASS.{S/D}) <a class="citation" href="#loongarch-reference-manual">[14]</a>,
IA-64 (FCLASS) <a class="citation" href="#ia64-developers-manual">[15]</a>, and
MIPS64 (CLASS.{S/D}, since release 6) <a class="citation" href="#mips-reference-manual">[16]</a>.
It is also present as FXAM <a class="citation" href="#x86-developers-manual">[17]</a> in Intel’s 80-bit x87 extension, the predecessor of <abbr title="Streaming SIMD Extensions">SSE</abbr>.
Interestingly, Intel decided to remove this instruction from all subsequent extensions. <br />
Some architectures like PowerPC <a class="citation" href="#powerpc-reference-manual">[18]</a> or OpenRISC 1000 <a class="citation" href="#openrisc1000-arch-manual-2019">[19]</a> implement <code class="language-plaintext highlighter-rouge">class</code> in an implicit way.
With PowerPC, for example, a classification of the result is stored in the FPRF field of the FPSCR register after each <abbr title="Floating Point">FP</abbr> instruction.</p>

<p>The purpose of the FCLASS instruction is to allow software to react to unusual outputs from other <abbr title="Floating Point">FP</abbr> instructions with relatively low cycle overhead.
In <a class="citation" href="#waterman2016">[7]</a> A. Waterman argues that library routines often branch on outputs like NaNs.
However, without a designated instruction, this check can take “many more instructions”.
To what extent cycles are saved is not mentioned.
The article also lacks information about how often <code class="language-plaintext highlighter-rouge">class</code> is used in practice, and which exact outputs trigger branching.
To fill this gap, I decided to run some experiments of my own.
The results are presented in Section <a href="#6-results--discussion">6 Results &amp; Discussion</a>.</p>

<p><em>5) Maximum/Minimum</em><br id="maximum-minimum" />
What is the maximum of a numerical value and a signaling NaN?
Right, it depends!<br />
Depending on the used IEEE 754 standard, you might end up with different answers.
With the new IEEE 754-2019 standard, RISC-V unflinchingly changed its definition to incorporate some bug fixes.
ARM64 and x64 didn’t, so their maximum/minimum isn’t really the same as RISC-V’s.
If you want to learn more about the maximum/minimum mess-up, take a look at my other <a href="/simulation/2023/04/07/fast-floating-point-simulation.html#42-different-instruction-semantics">blog post</a>.</p>

<h3 id="33-the-registers">3.3 The Registers</h3>
<p>In addition to the general purpose registers, the RISC-V F extension adds 32 dedicated <abbr title="Floating Point">FP</abbr> registers with a bit width of FLEN=32 (FLEN=64 for D).
During the development of RISC-V, a unified register file was initially considered,
but a separate register file was ultimately chosen for the following reasons <a class="citation" href="#waterman2016">[7]</a>:</p>
<ul>
  <li>Some types do not align with the architecture. For example, using the D extension on an RV32 system.</li>
  <li>Separate registers allow for recoded formats (an internal representation that accelerates the handling of subnormal numbers <a class="citation" href="#hardfloat-recoding">[20]</a>).
This plays an important role later in Section <a href="#63-subnormal-numbers--underflows">6.3 Subnormal Numbers &amp; Underflows</a>.</li>
  <li>There are more addressable registers (the instruction implicitly selects a set of registers).</li>
  <li>Natural register file banking simplifies the implementation of superscalar designs.</li>
</ul>

<p>As explained in <a class="citation" href="#waterman2016">[7]</a>, a separate register file comes with the following drawbacks:</p>
<ul>
  <li>Register pressure increases unless the number of registers is increased. <a href="https://thinkingeek.com/2020/06/20/forgotten-memories-1/">Soft spilling</a> can be used to mitigate this issue.</li>
  <li>Context switching time might increase due to additional register saves. To mitigate this issue, RISC-V introduced dirty flags. Registers are only saved if their content changed.</li>
</ul>

<p>Besides general purpose <abbr title="Floating Point">FP</abbr> registers, the F extension also adds a <abbr title="Control and Status Register">CSR</abbr> to configure rounding modes and indicate <abbr title="Floating Point">FP</abbr> exceptions
(see <a href="#riscv-v-fp-registers">Figure</a> below).
The exceptions do not cause traps, which facilitates non-speculative out-of-order execution <a class="citation" href="#waterman2016">[7]</a>.</p>

<div style="text-align:center">
<img id="riscv-v-fp-registers" src="/assets/riscv_eval/fp_registers_riscv.svg" alt="RISC-V FP registers" width="65%" />
</div>
<p><br /></p>

<h3 id="34-the-canonical-qnan">3.4 The Canonical qNaN</h3>
<p>The <abbr title="Floating Point">FP</abbr> standard according to IEEE 754 reserves part of the encoding space for a so-called NaN.
A NaN either represents the result of an invalid operation (qNaN) or an uninitialized value (sNaN).
According to IEEE 754, a NaN is encoded by a value which has all exponent bits set to 1 and a non-zero mantissa.
The encoding difference between a qNaN and an sNaN was specified in IEEE 754-2008, stating that the MSB of the mantissa
functions as a quiet bit.
The lax definition of the non-zero mantissa allows additional information, called the <em>payload</em>, to be encoded in a NaN.
For instance, you could use the payload to encode why the operation failed.
But IEEE 754 fails to further elaborate how this should work in detail, so in practice,
I’m not aware of any relevant <abbr title="Instruction Set Architecture">ISA</abbr> implementing this feature.
That means, whenever you trigger an invalid operation on x64 or RISC-V, the same <em>canonical</em> qNaN is returned for every kind of invalid operation.<br />
But what does it look like? <br />
Since IEEE 754 doesn’t exactly specify the bit encoding of a canonical qNaN, the inevitable happened.
We are now left with different canonical qNaNs among <abbr title="Instruction Set Architectures">ISAs</abbr>:</p>

<!-- | ISA                | Sign | Significand               |
|--------------------|------|---------------------------|
| SPARC              |0     | 11111111111111111111111   |
| MIPS               |0     | 01111111111111111111111   |
| RISC-V $< v2.1$    |0     | 01111111111111111111111   |
| PA-RISC            |0     | 01000000000000000000000   |
| x64                |1     | 10000000000000000000000   |
| Alpha              |1     | 10000000000000000000000   |
| ARM                |0     | 10000000000000000000000   |
| PowerPc            |0     | 10000000000000000000000   |
| Loongson           |0     | 10000000000000000000000   |
| RISC-V $\geq v2.1$ |0     | 10000000000000000000000   | -->

<div style="text-align:center">
<table style="width:80%;margin-left: auto; margin-right: auto;">
    <tr>
        <th>ISA</th>
        <th>Sign</th>
        <th>Significand</th>
        <th>IEEE 754-2008 compliant</th>
    </tr>
    <tr>
        <td>SPARC</td>
        <td>0</td>
        <td>11111111111111111111111</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>RISC-V F $&lt; v2.1$</td>
        <td>0</td>
        <td>11111111111111111111111</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>MIPS</td>
        <td>0</td>
        <td>01111111111111111111111</td>
        <td>✗</td>
    </tr>
    <tr>
        <td>PA-RISC</td>
        <td>0</td>
        <td>01000000000000000000000</td>
        <td>✗</td>
    </tr>
    <tr>
        <td>x64</td>
        <td>1</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>Alpha</td>
        <td>1</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>ARM64</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>PowerPC</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>Loongson</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
    <tr>
        <td>RISC-V F $\geq v2.1$</td>
        <td>0</td>
        <td>10000000000000000000000</td>
        <td>✓</td>
    </tr>
</table>
</div>

<p>As you can see in the table, RISC-V initially started with a SPARC-like canonical qNaN.
However, the encoding was changed to ARM64’s NaN, as stated at the 3rd RISC-V Workshop <a class="citation" href="#riscv-workshop-2015">[5]</a> in 2016.
This change eventually found its way into the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> manual version 2.1 <a class="citation" href="#risc-isa-manual-2016">[21]</a>. <br />
So, why did they change it? <br />
According to A. Waterman  <a class="citation" href="#waterman2016">[7]</a>, the new encoding was chosen based on the following arguments:</p>
<ul>
  <li>It is the same NaN as used in ARM64 and Java.</li>
  <li>Clearing bits has lower hardware cost than setting bits.</li>
  <li>It is the only qNaN that cannot be generated by quieting an sNaN.</li>
</ul>

<p>The reason behind the third argument is to distinguish propagated from generated NaNs in case of an input sNaN.
Yet, this remains a rather hypothetical argument, as the RISC-V standard does not mandate NaN propagation.</p>
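<p>The third argument can be checked with a few lines of C. The following is a minimal sketch, assuming that an sNaN is quieted by setting the most significant fraction bit while keeping the rest of the payload (the usual hardware approach, but an assumption here). All function names are made up for this illustration.</p>

```c
#include <stdint.h>

#define F32_EXP_MASK  0x7f800000u /* exponent bits; all ones for NaN and infinity */
#define F32_FRAC_MASK 0x007fffffu /* 23 fraction bits */
#define F32_QUIET_BIT 0x00400000u /* most significant fraction bit: 1 = quiet */

/* A NaN has all exponent bits set and a non-zero fraction. */
int is_nan(uint32_t bits) {
  return (bits & F32_EXP_MASK) == F32_EXP_MASK && (bits & F32_FRAC_MASK) != 0;
}

/* An sNaN is a NaN whose quiet bit is cleared. */
int is_snan(uint32_t bits) {
  return is_nan(bits) && (bits & F32_QUIET_BIT) == 0;
}

/* Assumed quieting rule: set the quiet bit and keep the remaining payload. */
uint32_t quiet_nan(uint32_t bits) {
  return bits | F32_QUIET_BIT;
}

/* Exhaustively checks whether quieting any 32-bit sNaN yields `target`. */
int reachable_by_quieting(uint32_t target) {
  for (uint32_t sign = 0; sign <= 1; sign++) {
    /* Fractions 1 .. 0x3fffff are exactly the sNaN payloads. */
    for (uint32_t frac = 1; frac < F32_QUIET_BIT; frac++) {
      uint32_t snan = (sign << 31) | F32_EXP_MASK | frac;
      if (quiet_nan(snan) == target) {
        return 1;
      }
    }
  }
  return 0;
}
```

<p>Under this assumption, <code class="language-plaintext highlighter-rouge">reachable_by_quieting(0x7fc00000)</code> returns 0, because an sNaN always carries at least one non-zero payload bit below the quiet bit. The SPARC-like encoding 0x7fffffff, in contrast, is reachable by quieting 0x7fbfffff.</p>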

<h3 id="35-nan-boxing">3.5 NaN Boxing</h3>
<p>On 2017-03-19, A. Waterman opened a GitHub issue <a class="citation" href="#nan-box-github-issue">[22]</a>, remarking that the undefined behavior of <abbr title="Floating Point">FP</abbr> load and store instructions might lead to problems.
At that time, storing <abbr title="Floating Point">FP</abbr> values narrower than FLEN did not have a specified memory layout.
For example, if a RISC-V system with F and D extensions loads a 32-bit <abbr title="Floating Point">FP</abbr> value into register f0, and subsequently stores the register using the <abbr title="Floating Point 64-bit store">FSD</abbr> instruction, there is no defined memory layout.
It is only guaranteed that loading the value back from the same address restores the intended value.</p>

<p>The undefined memory layouts can be problematic in multiple scenarios, as pointed out by A. Bradbury in his RFC <a class="citation" href="#nan-box-rfc">[23]</a> on 2017-03-23.
For example, when migrating tasks on a heterogeneous SoC, each core could interpret the <abbr title="Floating Point">FP</abbr> register file dump differently.
To solve this problem, A. Bradbury proposed multiple solutions, which were then discussed in the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr>-Dev group <a class="citation" href="#nan-box-google">[24]</a>.
Among the most favored and <abbr title="Instruction Set Architecture">ISA</abbr>-compliant approaches were:</p>
<ul>
  <li>Store 32-bit <abbr title="Floating Point">FP</abbr> values in the lower half of a 64-bit register. This approach is used by ARM64.</li>
  <li>Cast 32-bit <abbr title="Floating Point">FP</abbr> values to 64 bit and perform appropriate rounding and masking whenever 32-bit operations are used. Implemented in POWER6 and Alpha.</li>
  <li>Encapsulate 32-bit <abbr title="Floating Point">FP</abbr> values in a 64-bit <abbr title="Floating Point">FP</abbr> NaN. Not seen in any architecture before.</li>
</ul>

<p>After discussing arguments of all approaches, the NaN-boxing scheme was ultimately chosen as the solution and added to the specification on 2017-04-13 <a class="citation" href="#nan-box-github-issue">[22]</a>.
This scheme sets all upper bits to 1 when working on <abbr title="Floating Point">FP</abbr> data that is narrower than the architecture’s <abbr title="Floating Point">FP</abbr> register width FLEN.
If the aforementioned RISC-V system loads a 32-bit <abbr title="Floating Point">FP</abbr> value, e.g. $2.5$, into register f0, the lower 32 bits of the register represent the <abbr title="Floating Point">FP</abbr> value, while the upper 32 bits of f0 are all set to 1.
Hence, the register f0 reads as 0xffffffff40200000, which is a negative qNaN when interpreted as a 64-bit value.
Additionally, a 32-bit value is only considered valid if all upper bits are set.
Otherwise, the value is treated as the canonical qNaN.</p>
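<p>To make the scheme more tangible, here is a minimal C sketch of boxing and unboxing for FLEN = 64. The helper names are hypothetical, and the unboxing rule follows my reading of the RISC-V specification, where an improperly boxed input is treated as the canonical NaN 0x7fc00000.</p>

```c
#include <stdint.h>
#include <string.h>

#define NAN_BOX_MASK      0xffffffff00000000ull /* upper 32 bits of a 64-bit FP register */
#define F32_CANONICAL_NAN 0x7fc00000u           /* RISC-V canonical 32-bit qNaN */

/* Reinterprets a float as its raw bit pattern without violating aliasing rules. */
static uint32_t f32_bits(float value) {
  uint32_t bits;
  memcpy(&bits, &value, sizeof bits);
  return bits;
}

/* Boxing: place the 32-bit value in the lower half and fill the upper half with ones. */
uint64_t nan_box(uint32_t f32) {
  return NAN_BOX_MASK | f32;
}

/* A register holds a valid 32-bit value only if all upper bits are set. */
int is_nan_boxed(uint64_t reg) {
  return (reg & NAN_BOX_MASK) == NAN_BOX_MASK;
}

/* Unboxing: return the lower half if properly boxed, else the canonical NaN. */
uint32_t nan_unbox(uint64_t reg) {
  return is_nan_boxed(reg) ? (uint32_t)reg : F32_CANONICAL_NAN;
}
```

<p>With these helpers, <code class="language-plaintext highlighter-rouge">nan_box(f32_bits(2.5f))</code> yields 0xffffffff40200000, the register content from the example above.</p>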

<p>This approach allows for additional debug information, which is not available in other <abbr title="Instruction Set Architectures">ISAs</abbr>.
With most <abbr title="Instruction Set Architectures">ISAs</abbr>, an <abbr title="Floating Point">FP</abbr> register file dump does not allow you to infer the currently stored data types.
However, with NaN boxing, the presence of all-ones upper bits allows you to determine the data type with high certainty,
because these special NaN values cannot be produced by standard arithmetic instructions, as NaN propagation is not mandated by RISC-V.
Yet, there is a risk of confusion with dynamically interpreted languages, which often use software-based NaN boxing for encoding data types.</p>

<p>While NaN boxing might look useful at first glance, it increases fragmentation among <abbr title="Instruction Set Architectures">ISAs</abbr> and complicates cross-platform simulation/emulation.
As explained in my recent <a href="/simulation/2023/04/07/fast-floating-point-simulation.html#44nan-boxing">post about fast RISC-V <abbr title="Floating Point">FP</abbr> simulation</a>,
NaN boxing is one of 6 reasons why simulating RISC-V <abbr title="Floating Point">FP</abbr> instructions on x64 is complicated and slow.
So, with NaN boxing only providing very hypothetical benefits but causing real issues, I personally think that the RISC-V designers took a wrong turn in that regard.
Simply using the lower half of the register, like ARM, would have been a better choice.</p>

<p>Lastly, and maybe as an interesting remark, OpenRISC 1000 also adopted NaN Boxing in 2019 with version 1.3 <a class="citation" href="#openrisc1000-arch-manual-2019">[19]</a>.</p>

<h2 id="4-methods">4 Methods</h2>
<p>After the first survey-like part, it is now time for the RISC-V <abbr title="Floating Point">FP</abbr> extensions F/D to stand the test of practice.
The goal was to get a general picture of instruction/data distribution and how often certain cases arise.
Since real hardware is not really suited for this, I extended MachineWare’s RISC-V simulator SIM-V with a profiling <abbr title="Floating Point Unit">FPU</abbr>.
I then executed a bunch of applications.
For both the applications and SIM-V, I provide a more in-depth explanation in the following two subsections.</p>

<h3 id="41-the-applications">4.1 The Applications</h3>
<p>The main criterion for the selection of the applications was the use of <abbr title="Floating Point">FP</abbr> instructions.
Once I found an application with at least a few <abbr title="Floating Point">FP</abbr> instructions, I included it in my list.
In total, I ran 78 applications, which are given in the list below.</p>

<p>Another concern was that the applications should cover a variety of scenarios.
From high-performance computing (linpack <a class="citation" href="#linpack">[25]</a>, NPB <a class="citation" href="#npb">[26]</a>)
over machine learning (OpenNN <a class="citation" href="#opennn-examples">[27]</a>)
to graphics computation (glmark2 <a class="citation" href="#glmark2">[28]</a>);
a large spectrum of different use cases is reflected in the chosen applications.
This also includes applications written in different programming languages,
because different peculiarities in the <abbr title="Floating Point">FP</abbr> arithmetic can arise depending on the language.
Therefore, I selected benchmarks in
C++ (FinanceBench <a class="citation" href="#financebench">[29]</a>),
Erlang (smallpt-erlang <a class="citation" href="#smallpt">[30]</a>),
Fortran (NPB <a class="citation" href="#npb">[26]</a>),
Java (SciMark 2.0 <a class="citation" href="#scimark">[31]</a>),
JavaScript (Octane 2.0 <a class="citation" href="#octane-benchmark">[32]</a>),
Python (NumPy <a class="citation" href="#numpy-benchmarks">[33]</a>),
and other programming languages.</p>

<p>In total, the 78 benchmarks executed more than 80 trillion instructions (80,653,539,756,271) of which more than 16 trillion (16,824,921,642,417)
were part of the F/D extensions. The instruction distribution and other interesting stuff are presented in the next section.</p>

<div>
  <table style="width:25%; float: left;">
    <tr><th> OpenNN <a class="citation" href="#opennn-examples">[27]</a></th></tr>
    <tr><td> (35) iris_plant </td></tr>
    <tr><td> (36) breast_cancer </td></tr>
    <tr><td> (37) simple_approx </td></tr>
    <tr><td> (38) simple_class </td></tr>
    <tr><td> (39) logical_operations </td></tr>
    <tr><td> (40) airfoil </td></tr>
    <tr><td> (41) mnist </td></tr>
    <tr><td> (42) outlier_detection </td></tr>
  </table>
 <table style="width:25%; float: left; border-collapse: collapse;">
  <tr><th>SPEC CPU 2017 <a class="citation" href="#spec-cpu-2017">[34]</a></th> </tr>
  <tr><td> (1) 503.bwaves </td> </tr>
  <tr><td> (2) 507.cactuBSSN </td> </tr>
  <tr><td> (3) 508.namd </td> </tr>
  <tr><td> (4) 510.parest </td></tr>
  <tr><td> (5) 511.povray </td></tr>
  <tr><td> (6) 519.lbm </td></tr>
  <tr><td> (7) 527.cam4 </td></tr>
  <tr><td> (8) 538.imagick </td></tr>
  <tr><td> (9) 544.nab </td></tr>
  <tr><td> (10) 549.fotonik3d </td></tr>
  <tr><td> (11) 554.roms </td></tr>
</table>
<table style="width:25%; float: left;">
  <tr><th> Other </th></tr>
  <tr><td> (66) fbench <a class="citation" href="#fbench">[35]</a> </td></tr>
  <tr><td> (67) ffbench <a class="citation" href="#ffbench">[36]</a> </td></tr>
  <tr><td> (68) linpack32 <a class="citation" href="#linpack">[25]</a> </td></tr>
  <tr><td> (69) linpack64 <a class="citation" href="#linpack">[25]</a> </td></tr>
  <tr><td> (70) whetstone <a class="citation" href="#whetstone">[37]</a> </td></tr>
  <tr><td> (71) stream <a class="citation" href="#streambenchmark">[38]</a> </td></tr>
  <tr><td> (72) lenet-infer </td></tr>
  <tr><td> (73) alexnet-train </td></tr>
  <tr><td> (74) cray <a class="citation" href="#c-ray">[39]</a> </td></tr>
  <tr><td> (75) aobench <a class="citation" href="#aobench">[40]</a> </td></tr>
  <tr><td> (76) glxgears  </td></tr>
  <tr><td> (77) himeno <a class="citation" href="#himeno-benchmark">[41]</a> </td></tr>
  <tr><td> (78) SciMark 2.0 <a class="citation" href="#scimark">[31]</a> </td></tr>
</table>
<table style="width:25%; float: left; border-collapse: collapse;">
  <tr> <th> glmark2 <a class="citation" href="#glmark2">[28]</a></th> </tr>
  <tr><td> (18) buffer </td></tr>
  <tr><td> (19) build </td></tr>
  <tr><td> (20) bump </td></tr>
  <tr><td> (21) clear </td></tr>
  <tr><td> (22) conditionals </td></tr>
  <tr><td> (23) desktop </td></tr>
  <tr><td> (24) effect2d </td></tr>
  <tr><td> (25) function </td></tr>
  <tr><td> (26) ideas </td></tr>
  <tr><td> (27) jellyfish </td></tr>
  <tr><td> (28) loop </td></tr>
  <tr><td> (29) pulsar </td></tr>
  <tr><td> (30) refract </td></tr>
  <tr><td> (31) shading </td></tr>
  <tr><td> (32) shadow </td></tr>
  <tr><td> (33) terrain </td></tr>
  <tr><td> (34) texture </td></tr>
</table>
<table style="width:30%; float: left;">
  <tr><th> CoreMark-PRO 2.0 <a class="citation" href="#coremark-pro">[42]</a></th></tr>
  <tr><td> (50) loops-all-mid-10k </td></tr>
  <tr><td> (51) linear_alg-mid-100x100 </td></tr>
  <tr><td> (52) nnet_test </td></tr>
  <tr><td> (53) radix2-big-64k </td></tr>
</table>
<table style="width:25%; float: left;">
  <tr><th> NPB <a class="citation" href="#npb">[26]</a></th></tr>
  <tr><td> (12) NPB.bt.A</td></tr>
  <tr><td> (13) NPB.cg.A</td></tr>
  <tr><td> (14) NPB.ep.A</td></tr>
  <tr><td> (15) NPB.ft.A</td></tr>
  <tr><td> (16) NPB.mg.A</td></tr>
  <tr><td> (17) NPB.sp.A</td></tr>
</table>
<table style="width:20%; float: left;">
  <tr><th> mibench <a class="citation" href="#mibench">[43]</a></th></tr>
  <tr><td> (60) basicmath </td></tr>
  <tr><td> (61) susan </td></tr>
  <tr><td> (62) qsort </td></tr>
  <tr><td> (63) lame </td></tr>
  <tr><td> (64) rsynth </td></tr>
  <tr><td> (65) fft </td></tr>
</table>
<table style="width:25%; float: left;">
  <tr><th> smallpt <a class="citation" href="#smallpt">[30]</a></th></tr>
  <tr><td> (54) smallpt-c </td></tr>
  <tr><td> (55) smallpt-cpp </td></tr>
  <tr><td> (56) smallpt-java </td></tr>
  <tr><td> (57) smallpt-erlang </td></tr>
  <tr><td> (58) smallpt-numpy </td></tr>
  <tr><td> (59) smallpt-python </td></tr>
</table>
<table style="width:25%; float: left; border-collapse: collapse;">
  <tr><th> NumPy <a class="citation" href="#numpy-benchmarks">[33]</a></th></tr>
  <tr><td> (48) linalg </td></tr>
  <tr><td> (49) scalar </td></tr>
</table>
<table style="width:25%;  float: left;">
  <tr><th> Octane 2.0 <a class="citation" href="#octane-benchmark">[32]</a></th></tr>
  <tr><td> (46) raytrace </td></tr>
  <tr><td> (47) navierstoke </td></tr>
</table>
<table style="width:25%; border-collapse: collapse;">
  <tr><th> FinanceBench <a class="citation" href="#financebench">[29]</a></th></tr>
  <tr><td> (43) Black Scholes </td></tr>
  <tr><td> (44) Bonds </td></tr>
  <tr><td> (45) Monte Carlo </td></tr>
</table>
</div>
<p><br /></p>

<h3 id="42-the-virtual-platform">4.2 The Virtual Platform</h3>
<p>To execute the aforementioned 78 applications, I used <a href="https://www.machineware.de/pages/products.html">MachineWare’s RISC-V simulator SIM-V</a> <a class="citation" href="#simvpaper">[44]</a>.
The simulator was part of a Virtual Platform (<abbr title="Virtual Platform">VP</abbr>) configured to model an RV64IMAFDC system with 4 GB of main memory.
For most benchmarks, the <abbr title="Virtual Platform">VP</abbr> runs an Ubuntu 22.04 operating system.
Some benchmarks run on a minimal buildroot-configured Linux.
The <abbr title="Virtual Platform">VP</abbr> was modified to track the number of executed instructions and other data of interest.</p>

<p>To avoid accidentally tracking boot or other non-benchmark instructions, the <abbr title="Virtual Platform">VP</abbr> was extended with semihosting instructions that reset and dump the statistics.
That means the statistics were reset before the execution of each benchmark and dumped after the execution finished.
In contrast to compiler-based annotations, as for example in gcov <a class="citation" href="#gcov">[45]</a>, a <abbr title="Virtual Platform">VP</abbr>-based approach allows tracking every detail, ranging from instructions in the kernel to closed-source libraries.</p>
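<p>The reset/dump mechanism can be pictured roughly as follows. This is only a sketch with hypothetical names and categories; it is not SIM-V’s actual implementation, which tracks far more detail.</p>

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical instruction categories for illustration. */
enum insn_kind { INSN_INT, INSN_FP_LOAD, INSN_FP_STORE, INSN_FP_ARITH, INSN_NUM_KINDS };

struct profile {
  uint64_t count[INSN_NUM_KINDS];
};

/* Would be triggered by the "reset" semihosting instruction before a benchmark. */
void profile_reset(struct profile *p) {
  memset(p, 0, sizeof *p);
}

/* Would be called by the simulator for every executed instruction. */
void profile_record(struct profile *p, enum insn_kind kind) {
  p->count[kind]++;
}

/* Number of all recorded instructions. */
uint64_t profile_total(const struct profile *p) {
  uint64_t total = 0;
  for (int kind = 0; kind < INSN_NUM_KINDS; kind++) {
    total += p->count[kind];
  }
  return total;
}

/* Number of recorded FP instructions, i.e. everything except INSN_INT. */
uint64_t profile_fp(const struct profile *p) {
  return profile_total(p) - p->count[INSN_INT];
}

/* Would be triggered by the "dump" semihosting instruction after a benchmark. */
void profile_dump(const struct profile *p, FILE *out) {
  fprintf(out, "total=%" PRIu64 " fp=%" PRIu64 "\n", profile_total(p), profile_fp(p));
}
```

<p>Everything recorded before <code class="language-plaintext highlighter-rouge">profile_reset</code>, such as boot code, is discarded, so the dump only reflects the benchmark itself.</p>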

<p>To really track every tiny detail, <em>softpipe</em> was configured as the system’s graphics driver.
With <em>softpipe</em>, the CPU also executes tasks that are usually offloaded to the GPU.</p>

<p>If you want to conduct such a study on your own, you can probably use an open-source simulator like gem5, Spike, or QEMU.
But please beware, none of them are currently able to track <abbr title="Floating Point">FP</abbr> details.
So, you’d have to implement this first.
For performance reasons, I’d recommend implementing this in QEMU.
QEMU also uses callbacks for <abbr title="Floating Point">FP</abbr> instructions, which should make it relatively easy to add this feature.</p>

<h2 id="5-related-work">5 Related Work</h2>
<p>Maybe it is a bit unusual to have the related work section at this point,
but I thought it made sense to place it after explaining the methodologies.
Similar to the structure of this post, it is divided into two parts.
First, I provide literature about the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> design.
Second, I present papers about assessing the characteristics of applications with regard to the host <abbr title="Instruction Set Architecture">ISA</abbr>.</p>

<p>As already mentioned in Section <a href="#2-story--motivation">2. Story &amp; Motivation</a>, information about the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> design is spread everywhere -
there is a RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> dev Google group <a class="citation" href="#risc-v-isa-dev">[2]</a>,
a RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> manual GitHub repository <a class="citation" href="#risc-v-isa-manual-repo">[3]</a>,
a RISC-V working groups mailing list <a class="citation" href="#riscv-mailing-lists">[4]</a>,
and a RISC-V workshop <a class="citation" href="#riscv-workshop-2015">[5]</a>.
Furthermore, there are some publications/books from the RISC-V authors themselves:</p>
<ul>
  <li><em>RISC-V Geneology</em>  <a class="citation" href="#risc-v-geneology">[6]</a> by T. Chen and D. Patterson, 2016</li>
  <li><em>Design of the RISC-V Instruction Set Architecture</em>  <a class="citation" href="#waterman2016">[7]</a> by A. Waterman, 2016</li>
  <li><em>The RISC-V Reader: An Open Architecture Atlas</em> <a class="citation" href="#patterson2017">[46]</a> by A. Waterman and D. Patterson, 2017</li>
</ul>

<p>If you want to know why certain aspects of RISC-V are designed the way they are, I can recommend <em>Design of the RISC-V Instruction Set Architecture</em> and <em>The RISC-V Reader: An Open Architecture Atlas</em>.
While these publications already provide many explanations, they are far from complete.
Moreover, at least for the <abbr title="Floating Point">FP</abbr> part, many of the arguments are of qualitative nature. Not much is backed by actual data or evidence.</p>

<p>And this is where this work begins.
Of course, if someone had already done such an analysis for the <abbr title="Floating Point">FP</abbr> extensions, I wouldn’t have done it.
I’m also not aware of literature specifically analyzing the <abbr title="Floating Point">FP</abbr> parts of other <abbr title="Instruction Set Architectures">ISAs</abbr>.
If you widen the scope and look for papers that assess aspects like instruction distributions,
you will be more successful.
In the literature, two approaches are commonly used to assess instruction distributions.</p>

<p>The static analysis approach, as used by <a class="citation" href="#x86-inst-distribution">[47], [48]</a>, simply assesses the instruction occurrences in the binary.
However, the results obtained from this method can be misleading, as the number of occurrences does not necessarily indicate how often an instruction is actually executed.
Moreover, this approach reaches its limitations for self-modifying code and dynamically interpreted languages.</p>

<p>A more accurate and less constrained approach is dynamic analysis, as used in <a class="citation" href="#bosbach2023">[49], [50], [43]</a>.
In dynamic analysis, the instruction distribution is directly obtained from the execution of the benchmark itself.
This can be achieved by counting instructions in a simulator or by using compiler annotations.
The latter has the disadvantage of only counting instructions in the application’s user mode.</p>
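<p>The difference between the two approaches can be illustrated with a toy model. The sketch below assumes a program made of basic blocks with known execution counts; the types and names are invented for this example.</p>

```c
#include <stddef.h>
#include <stdint.h>

enum op { OP_FLW, OP_FADD, OP_FSW, OP_FCLASS, OP_NUM };

/* A basic block: its instructions and how often it was executed. */
struct block {
  enum op ops[4];
  size_t num_ops;
  uint64_t exec_count;
};

/* Static analysis: every occurrence in the binary counts once. */
void count_static(const struct block *blocks, size_t n, uint64_t hist[OP_NUM]) {
  for (size_t i = 0; i < OP_NUM; i++) {
    hist[i] = 0;
  }
  for (size_t b = 0; b < n; b++) {
    for (size_t j = 0; j < blocks[b].num_ops; j++) {
      enum op current = blocks[b].ops[j];
      hist[current] += 1;
    }
  }
}

/* Dynamic analysis: every occurrence counts as often as its block was executed. */
void count_dynamic(const struct block *blocks, size_t n, uint64_t hist[OP_NUM]) {
  for (size_t i = 0; i < OP_NUM; i++) {
    hist[i] = 0;
  }
  for (size_t b = 0; b < n; b++) {
    for (size_t j = 0; j < blocks[b].num_ops; j++) {
      enum op current = blocks[b].ops[j];
      hist[current] += blocks[b].exec_count;
    }
  }
}
```

<p>For a hot loop body executed 1000 times and an error path that never runs, the static histogram weights both equally, while the dynamic histogram reports 1000 versus 0.</p>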

<p>Ultimately, the instruction distribution should reflect what is executed on the user’s system, including operating system, drivers,
and other aspects, which are indirectly related to the executed benchmark.
To obtain results that encompass all executed instructions and side effects,
a simulator-based approach, as utilized by my colleague <a href="https://www.linkedin.com/in/nils-bosbach/">N. Bosbach</a> <a class="citation" href="#bosbach2023">[49], [43]</a>, proves to be one of the few viable methods.
This is why the experiments were conducted using a profiling RISC-V simulator.</p>

<h2 id="6-results--discussion">6 Results &amp; Discussion</h2>

<h3 id="61-instruction-distribution">6.1 Instruction Distribution</h3>
<p>In this subsection, I present and discuss the results of <abbr title="Floating Point">FP</abbr> instruction distributions in the applications.
Note that I treat 32-bit and 64-bit instructions as one entity.
For example, FLX refers to both FLW (32 bit) and FLD (64 bit).
I also clustered the conversion functions partially.
FCVT.I.F refers to float-to-integer conversions,
FCVT.F.I to integer-to-float conversions,
and FCVT.F.F to float-to-float conversions.</p>
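<p>Such a clustering could be implemented along the following lines. This is a sketch covering only the clusters named above; the function name and the handling of all other mnemonics are assumptions, not the actual evaluation scripts. It relies on the RISC-V convention <code class="language-plaintext highlighter-rouge">fcvt.&lt;dst&gt;.&lt;src&gt;</code>, where <code class="language-plaintext highlighter-rouge">s</code>/<code class="language-plaintext highlighter-rouge">d</code> denote FP formats and <code class="language-plaintext highlighter-rouge">w</code>/<code class="language-plaintext highlighter-rouge">wu</code>/<code class="language-plaintext highlighter-rouge">l</code>/<code class="language-plaintext highlighter-rouge">lu</code> denote integer formats.</p>

```c
#include <stddef.h>
#include <string.h>

/* In F/D mnemonics, 's' and 'd' denote single and double precision. */
static int is_fp_format(char c) {
  return c == 's' || c == 'd';
}

/* Maps a lowercase mnemonic to its cluster name, or NULL if it is not clustered here. */
const char *cluster_mnemonic(const char *mnemonic) {
  if (strcmp(mnemonic, "flw") == 0 || strcmp(mnemonic, "fld") == 0) {
    return "FLX";
  }
  if (strcmp(mnemonic, "fsw") == 0 || strcmp(mnemonic, "fsd") == 0) {
    return "FSX";
  }
  if (strncmp(mnemonic, "fcvt.", 5) == 0) {
    const char *dst = mnemonic + 5;       /* destination format, e.g. "w" in fcvt.w.s */
    const char *dot = strchr(dst, '.');   /* separator in front of the source format */
    if (dot == NULL) {
      return NULL;
    }
    int dst_is_fp = is_fp_format(dst[0]);
    int src_is_fp = is_fp_format(dot[1]);
    if (dst_is_fp && src_is_fp) {
      return "FCVT.F.F";
    }
    if (dst_is_fp) {
      return "FCVT.F.I"; /* integer-to-float */
    }
    if (src_is_fp) {
      return "FCVT.I.F"; /* float-to-integer */
    }
  }
  return NULL;
}
```

<p>For example, <code class="language-plaintext highlighter-rouge">fcvt.w.s</code> and <code class="language-plaintext highlighter-rouge">fcvt.lu.d</code> both end up in FCVT.I.F, while <code class="language-plaintext highlighter-rouge">fcvt.d.s</code> is counted as FCVT.F.F.</p>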

<p>So let’s start with the general results before we move on to the individual benchmarks.
The following graph depicts the instruction distribution accumulated over all benchmarks:</p>
<div style="text-align:center">
<img src="/assets/riscv_eval/general_fp_hist_clustered.svg" alt="Average FP instruction distribution over all benchmarks" width="70%" />
</div>
<p><br />
As you can see, the general trend looks like an exponential distribution.
I also put an ideal exponential distribution in the graph (orange line) and it fits surprisingly well.
Surprisingly well with one outlier: FCLASS, which only occurs once every 13,812 <abbr title="Floating Point">FP</abbr> instructions. But more on that in a few sentences.</p>

<p>Besides that, we also observe a few instructions making up the majority of all executed instructions.
For example, the instructions FLX (32%) and FSX (17%) sum up to nearly 50% of all executed <abbr title="Floating Point">FP</abbr> instructions.
This is in line with the observations of other people.
In an <a href="https://www.youtube.com/watch?v=Nb2tebYAaOA?t=06m22s">interview with Lex Fridman</a>, Jim Keller, the <abbr title="Instruction Set Architecture">ISA</abbr>-god himself, said: “90% of the execution is on 25 opcodes.” <br />
The contribution of each application to the overall instructions can be inferred from the <a href="#fp-inst-share">left figure below</a>.</p>

<p>As a next step, let us look at the relative distributions for each individual benchmark.
A heatmap depicting the relative distribution of <abbr title="Floating Point">FP</abbr> instructions per benchmark can be found in the <a href="#fp-inst-heatmap">right figure below</a>.
As already seen in the accumulated distribution, <abbr title="Floating Point">FP</abbr> store and load instructions are the most prevalent instructions in nearly every benchmark.
This stands in contrast to instructions such as FNMADD, FMIN, FMAX, or FCLASS, which are often not even executed once (gray boxes).
The latter in particular is present in only 12 out of 78 benchmarks.
This raises the question whether such an instruction should be part of a RISC <abbr title="Instruction Set Architecture">ISA</abbr>.
To answer this question, you need to consider many aspects, such as the context of the instruction, possible alternatives, and the impact on performance/hardware cost/encoding space.
And this is where the next subsection begins!</p>

<div style="text-align:center">
  <img id="fp-inst-share" src="/assets/riscv_eval/bm_inst_bar.svg" alt="Relative and absolute number of F and D FP instructions" width="43%" style="margin:10px;" />
  <img id="fp-inst-heatmap" src="/assets/riscv_eval/fp_inst_dist_clustered.svg" alt="Heatmap of relative FP instruction distribution per benchmark" width="40%" style="margin:10px;" />
</div>
<p><br /></p>

<h3 id="62-more-on-fclass">6.2 More on FCLASS</h3>
<p>As shown before, the FCLASS instruction occurs infrequently, with many applications not using it even once.
The benchmark <em>glmark2-bump</em> attains the highest relative value, with 0.0909% of all instructions being FCLASS.
Besides being present in all glmark2 benchmarks, it also occurs in FinanceBench and 507.cactuBSSN.
Since FCLASS can appear in different contexts, I investigated the reasons for its use in the applications.</p>

<p>For all(!) applications, I could trace all(!) usages of the FCLASS instruction back to glibc’s <code class="language-plaintext highlighter-rouge">fmax</code>/<code class="language-plaintext highlighter-rouge">fmin</code> functions.
The <a href="https://github.com/lattera/glibc/blob/master/sysdeps/riscv/rvf/s_fmaxf.c">corresponding C implementation</a> for 32-bit <abbr title="Floating Point">FP</abbr> is depicted in the following code:</p>

<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">__fmaxf</span><span class="p">(</span><span class="kt">float</span> <span class="n">x</span><span class="p">,</span> <span class="kt">float</span> <span class="n">y</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">float</span> <span class="n">r</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">((</span><span class="n">_FCLASS</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">|</span> <span class="n">_FCLASS</span> <span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="o">&amp;</span> <span class="n">_FCLASS_SNAN</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">;</span>

  <span class="k">asm</span> <span class="p">(</span><span class="s">"fmax.s %0, %1, %2"</span> <span class="o">:</span> <span class="s">"=f"</span><span class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="o">:</span> <span class="s">"f"</span><span class="p">(</span><span class="n">x</span><span class="p">),</span> <span class="s">"f"</span><span class="p">(</span><span class="n">y</span><span class="p">));</span>
  <span class="k">return</span> <span class="n">r</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Here, you would intuitively expect only a RISC-V <code class="language-plaintext highlighter-rouge">fmax</code> instruction, yet there are additional checks for sNaNs.
This is due to RISC-V adhering to the IEEE 754 standard from 2019 in that regard, where the maximum of an sNaN and a numerical value must return the latter.
In <code class="language-plaintext highlighter-rouge">glibc</code>, however, this operation has to return a qNaN, making it compliant with older IEEE 754 standards.
To rectify this mismatch, additional checks and treatment for sNaNs are needed.
As explained by David G. Hough <a class="citation" href="#hough2019">[51]</a>, converting sNaN to qNaN in minimum/maximum functions, as in glibc and older IEEE 754 standards,
was a bug in the specification and entails awkward mathematical properties.
The bug fix from IEEE 754-2019 is not yet present in glibc.
And I’m not sure if it ever will be present.</p>
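<p>The mismatch becomes clearer when both behaviors are modeled on raw bit patterns. The following sketch is my own simplified model, not code from glibc or any simulator: <code class="language-plaintext highlighter-rouge">fmax_riscv</code> mimics the <code class="language-plaintext highlighter-rouge">fmax.s</code> instruction, and <code class="language-plaintext highlighter-rouge">fmax_glibc</code> mimics the wrapper shown above, assuming that <code class="language-plaintext highlighter-rouge">x + y</code> with an sNaN operand yields the canonical qNaN on RISC-V.</p>

```c
#include <stdint.h>
#include <string.h>

#define F32_CANONICAL_NAN 0x7fc00000u

static int is_nan(uint32_t bits) {
  return (bits & 0x7f800000u) == 0x7f800000u && (bits & 0x007fffffu) != 0;
}

static int is_snan(uint32_t bits) {
  return is_nan(bits) && (bits & 0x00400000u) == 0;
}

/* Only used for non-NaN values, so no signaling NaN ever travels as a float. */
static float as_float(uint32_t bits) {
  float value;
  memcpy(&value, &bits, sizeof value);
  return value;
}

/* Model of fmax.s: a single NaN operand (quiet or signaling) is ignored. */
uint32_t fmax_riscv(uint32_t x, uint32_t y) {
  if (is_nan(x) && is_nan(y)) {
    return F32_CANONICAL_NAN;
  }
  if (is_nan(x)) {
    return y;
  }
  if (is_nan(y)) {
    return x;
  }
  float fx = as_float(x);
  float fy = as_float(y);
  if (fx == fy) {
    return x & y; /* identical bits, or -0.0 vs +0.0 where +0.0 wins */
  }
  return fx > fy ? x : y;
}

/* Model of glibc's wrapper: any sNaN operand forces a qNaN result. */
uint32_t fmax_glibc(uint32_t x, uint32_t y) {
  if (is_snan(x) || is_snan(y)) {
    return F32_CANONICAL_NAN; /* stands in for "return x + y" */
  }
  return fmax_riscv(x, y);
}
```

<p>For the sNaN 0x7f800001 and the value 1.0 (0x3f800000), <code class="language-plaintext highlighter-rouge">fmax_riscv</code> returns 1.0, whereas <code class="language-plaintext highlighter-rouge">fmax_glibc</code> returns a qNaN. For qNaN inputs, both functions agree.</p>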

<p>Other C standard libraries, such as <em>musl</em> <a class="citation" href="#musl">[52]</a> or <em>Newlib</em> <a class="citation" href="#newlib">[53]</a>,
directly map <code class="language-plaintext highlighter-rouge">fmax</code> and <code class="language-plaintext highlighter-rouge">fmin</code> to the underlying <abbr title="Instruction Set Architecture">ISA</abbr> implementations inheriting their NaN-handling characteristics.
That means, if the applications are linked against musl or Newlib instead of glibc, the number of executed FCLASS instructions can be reduced to 0.
Or in other words, using this approach, FCLASS does not occur once in 78 benchmarks executing trillions of instructions. <br />
Also, just recently the RISC-V “Zfa” (Additional Floating-Point Instructions) extension was specified.
This extension provides backward-compatible maximum and minimum instructions (FMINM, FMAXM), allowing us to implement glibc’s <code class="language-plaintext highlighter-rouge">fmax</code> and <code class="language-plaintext highlighter-rouge">fmin</code> without FCLASS.</p>

<p>Anyway, let us assume we might want to remove this instruction from the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr>.
This means that, at least in some places, we have to replace the FCLASS instruction with other instructions that achieve the same semantics.
The important question is: Do we need 1, 10, or 100 instructions to mimic the same behavior?
Interestingly, in the case of FCLASS, it is probably not necessary to aim for a bit-exact reproduction.
As mentioned by A. Waterman <a class="citation" href="#waterman2016">[7]</a>, the purpose of FCLASS is to branch if exceptional values, such as NaN,
are encountered.
The code below shows a typical assembly context for detecting sNaN using FCLASS:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// fclass sNaN example</span>
<span class="n">fclass</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">andi</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0x100</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">snan</span>
</code></pre></div></div>
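<p>For reference, the mask 0x100 in the snippet above selects bit 8 of the FCLASS result, which the RISC-V specification assigns to signaling NaNs. The following C function is a software model of <code class="language-plaintext highlighter-rouge">fclass.s</code> operating on the raw 32-bit pattern; the function name is made up, while the bit assignment follows the specification.</p>

```c
#include <stdint.h>

/* Software model of fclass.s: returns a 10-bit mask with exactly one bit set. */
uint32_t fclass_s(uint32_t bits) {
  uint32_t sign = bits >> 31;
  uint32_t exponent = (bits >> 23) & 0xffu;
  uint32_t fraction = bits & 0x007fffffu;

  if (exponent == 0xffu) {
    if (fraction == 0) {
      return sign ? 1u << 0 : 1u << 7;              /* -inf : +inf */
    }
    return (fraction & 0x00400000u) ? 1u << 9 : 1u << 8; /* qNaN : sNaN */
  }
  if (exponent == 0) {
    if (fraction == 0) {
      return sign ? 1u << 3 : 1u << 4;              /* -0 : +0 */
    }
    return sign ? 1u << 2 : 1u << 5;                /* negative : positive subnormal */
  }
  return sign ? 1u << 1 : 1u << 6;                  /* negative : positive normal */
}
```

<p>A check such as <code class="language-plaintext highlighter-rouge">fclass_s(bits) &amp; 0x100</code> then corresponds to the <code class="language-plaintext highlighter-rouge">andi</code>/<code class="language-plaintext highlighter-rouge">bnez</code> pair in the assembly.</p>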

<p>As can be seen, a typical check for a certain <abbr title="Floating Point">FP</abbr> type using FCLASS requires 3 instructions.
First, FCLASS returns the value type in a one-hot encoding, then the type of interest is extracted by bitmasking, and finally a branch is taken depending on the previous result.
So, <a href="https://github.com/alexanderthiem">Alexander</a> and I tried our best and coded some FCLASS-less alternatives, as shown in the following code:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// generic zero         // positive zero             // negative zero</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>          <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span>  <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>              <span class="n">fneg</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">p</span><span class="o">-</span><span class="n">zero</span>           <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">zero</span>                                     <span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">n</span><span class="o">-</span><span class="n">zero</span>

<span class="c1">// generic NaN          // quiet NaN                 // signaling NaN</span>
<span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>               <span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">nan</span>         <span class="n">lui</span> <span class="n">x2</span><span class="p">,</span> <span class="mh">0x7fc00</span>              <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span>
                        <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>               <span class="n">bexti</span> <span class="n">x2</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="mi">22</span>
                        <span class="n">beq</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">qnan</span>          <span class="n">or</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>
                                                     <span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">snan</span>

<span class="c1">// generic infinity     // positive infinity         // negative infinity</span>
<span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f1</span><span class="p">,</span> <span class="n">inf</span>           <span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f1</span><span class="p">,</span> <span class="n">inf</span>                <span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0xff800</span>
<span class="n">fabs</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>           <span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>             <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x1</span>
<span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">p</span><span class="o">-</span><span class="n">inf</span>            <span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">inf</span>                                      <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">n</span><span class="o">-</span><span class="n">inf</span>

<span class="c1">// generic normal       // positive normal           // negative normal</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>          <span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f1</span><span class="p">,</span> <span class="n">inf</span>                <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">t0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">lui</span> <span class="n">x2</span><span class="p">,</span> <span class="mh">0x7f800</span>         <span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f2</span><span class="p">,</span> <span class="n">min</span>                <span class="n">bgtz</span> <span class="n">t0</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>
<span class="n">and</span> <span class="n">x3</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">x1</span>          <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f1</span>             <span class="n">lui</span> <span class="n">t1</span><span class="p">,</span> <span class="mh">0x7f800</span>
<span class="n">beqz</span> <span class="n">x3</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">not</span><span class="o">-</span><span class="n">norm</span>    <span class="n">fle</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f2</span><span class="p">,</span> <span class="n">f0</span>             <span class="n">and</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t1</span>
<span class="n">beq</span> <span class="n">x3</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">not</span><span class="o">-</span><span class="n">norm</span> <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>               <span class="n">beqz</span> <span class="n">t0</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>
                        <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">normal</span>           <span class="n">beq</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>

<span class="c1">// generic subnormal    // positive subnormal        // negative subnormal</span>
<span class="n">fabs</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>           <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>               <span class="n">fneg</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>          <span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f2</span><span class="p">,</span> <span class="n">min</span>                <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>
<span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f2</span><span class="p">,</span> <span class="n">min</span>           <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>             <span class="n">fli</span><span class="p">.</span><span class="n">s</span> <span class="n">f2</span><span class="p">,</span> <span class="n">min</span>
<span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f2</span>             <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>
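<span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f2</span>        <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>               <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f2</span>
<span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>          <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">subn</span>             <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">subn</span>                                     <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">subn</span>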
</code></pre></div></div>

<p>With Standard extensions only:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// generic zero         // positive zero             // negative zero</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>          <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span>  <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>              <span class="n">fneg</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">p</span><span class="o">-</span><span class="n">zero</span>           <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">zero</span>                                     <span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">n</span><span class="o">-</span><span class="n">zero</span>

<span class="c1">// generic NaN          // quiet NaN                 // signaling NaN</span>
<span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>               <span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">nan</span>         <span class="n">lui</span> <span class="n">x2</span><span class="p">,</span> <span class="mh">0x7fc00</span>              <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span>
                        <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>               <span class="n">lui</span> <span class="n">x3</span><span class="p">,</span> <span class="mh">0x00400</span>
                        <span class="n">beq</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">qnan</span>          <span class="n">and</span> <span class="n">x3</span><span class="p">,</span> <span class="n">x3</span><span class="p">,</span> <span class="n">x2</span>
                                                     <span class="n">or</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x3</span>
                                                     <span class="n">beqz</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">snan</span>

<span class="c1">// generic infinity     // positive infinity         // negative infinity</span>
<span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0x7f800</span>         <span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0x7f800</span>              <span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0xff800</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x1</span>          <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x1</span>               <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x1</span>
<span class="n">fabs</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>           <span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>             <span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">feq</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">p</span><span class="o">-</span><span class="n">inf</span>            <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">n</span><span class="o">-</span><span class="n">inf</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">inf</span>

<span class="c1">// generic normal       // positive normal           // negative normal</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span>           <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">t0</span><span class="p">,</span> <span class="n">f0</span>              <span class="n">fmv</span><span class="p">.</span><span class="n">x</span><span class="p">.</span><span class="n">w</span> <span class="n">t0</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">lui</span> <span class="n">x2</span><span class="p">,</span> <span class="mh">0x7f800</span>          <span class="n">bltz</span> <span class="n">t0</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>           <span class="n">bgtz</span> <span class="n">t0</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>
<span class="n">and</span> <span class="n">x3</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">x1</span>           <span class="n">lui</span> <span class="n">t1</span><span class="p">,</span> <span class="mh">0x7f800</span>             <span class="n">lui</span> <span class="n">t1</span><span class="p">,</span> <span class="mh">0x7f800</span>
<span class="n">beqz</span> <span class="n">x3</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">not</span><span class="o">-</span><span class="n">norm</span>     <span class="n">and</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t1</span>              <span class="n">and</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t1</span>
<span class="n">beq</span> <span class="n">x3</span><span class="p">,</span> <span class="n">x2</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">not</span><span class="o">-</span><span class="n">norm</span>  <span class="n">beqz</span> <span class="n">t0</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>           <span class="n">beqz</span> <span class="n">t0</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>
                         <span class="n">beq</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>        <span class="n">beq</span> <span class="n">t0</span><span class="p">,</span> <span class="n">t1</span><span class="p">,</span> <span class="n">not</span><span class="o">-</span><span class="n">norm</span>

<span class="c1">// generic subnormal    // positive subnormal        // negative subnormal</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>          <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>               <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f1</span><span class="p">,</span> <span class="n">x0</span>
<span class="n">fabs</span><span class="p">.</span><span class="n">s</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f0</span>           <span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0x00800</span>              <span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0x80800</span>
<span class="n">lui</span> <span class="n">x1</span><span class="p">,</span> <span class="mh">0x00800</span>         <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f2</span><span class="p">,</span> <span class="n">x1</span>               <span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f2</span><span class="p">,</span> <span class="n">x1</span>
<span class="n">fmv</span><span class="p">.</span><span class="n">w</span><span class="p">.</span><span class="n">x</span> <span class="n">f2</span><span class="p">,</span> <span class="n">x1</span>          <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>             <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f1</span>
<span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x1</span><span class="p">,</span> <span class="n">f1</span><span class="p">,</span> <span class="n">f0</span>        <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f2</span>             <span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f2</span><span class="p">,</span> <span class="n">f0</span>
<span class="n">flt</span><span class="p">.</span><span class="n">s</span> <span class="n">x2</span><span class="p">,</span> <span class="n">f0</span><span class="p">,</span> <span class="n">f2</span>        <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>               <span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>
<span class="n">and</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x1</span><span class="p">,</span> <span class="n">x2</span>          <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">subn</span>             <span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">subn</span>
<span class="n">bnez</span> <span class="n">x1</span><span class="p">,</span> <span class="n">is</span><span class="o">-</span><span class="n">subn</span>
</code></pre></div></div>

<p>The first code block includes instructions from the B and Zfa extensions, which may not be available on many systems.
The second block therefore uses only instructions from the standard extensions.
To test the functionality of the code, I embedded it in a C++ test environment, which you can download <a href="/assets/riscv_eval/class.cpp">here</a>.</p>

<p>Interestingly, without FCLASS some cases need even fewer instructions (see positive zero or generic NaN).
For example, we can exploit the fact that an equality comparison involving a NaN always returns false, so comparing a value with itself detects a NaN in a single instruction.
Like FCLASS, all instructions used in the code are lightweight and do not require any data memory accesses.</p>
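<p>The same tricks carry over to a higher-level language. The following C++ sketch mirrors a few of the single-precision sequences above. The helper names (<code>bits_of</code>, <code>float_of</code>, <code>is_nan</code>, and so on) are made up for illustration and are not taken from the test environment linked above.</p>

```cpp
#include <cstdint>
#include <cstring>

// Reinterprets a binary32 value as its raw bit pattern (the role of fmv.x.w).
inline std::uint32_t bits_of(float x) {
    std::uint32_t b;
    std::memcpy(&b, &x, sizeof b);
    return b;
}

// Reinterprets a raw bit pattern as a binary32 value (the role of fmv.w.x).
inline float float_of(std::uint32_t b) {
    float x;
    std::memcpy(&x, &b, sizeof x);
    return x;
}

// Generic NaN: a NaN is the only value that compares unequal to itself.
inline bool is_nan(float x) { return x != x; }

// Quiet NaN: all exponent bits and the quiet bit (bit 22) are set.
inline bool is_quiet_nan(float x) {
    return (bits_of(x) & 0x7fc00000u) == 0x7fc00000u;
}

// Positive zero: the bit pattern is all zeros.
inline bool is_positive_zero(float x) { return bits_of(x) == 0u; }

// Generic infinity: without the sign bit, only the exponent mask remains.
inline bool is_infinity(float x) {
    return (bits_of(x) & 0x7fffffffu) == 0x7f800000u;
}

// Generic subnormal: exponent field is zero, mantissa is non-zero.
inline bool is_subnormal(float x) {
    const std::uint32_t b = bits_of(x);
    return (b & 0x7f800000u) == 0u && (b & 0x007fffffu) != 0u;
}
```

<p>Here <code>is_nan</code> corresponds to the <code>feq.s</code> self-comparison, while the other helpers play the role of <code>fmv.x.w</code> followed by integer masking.</p>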

<p>So, suppose we removed FCLASS from the <abbr title="Instruction Set Architecture">ISA</abbr>/<abbr title="Floating Point Unit">FPU</abbr>. How much hardware would that save?
Fortunately, the hardware expert <a href="https://www.linkedin.com/in/lennart-reimann-4016191b3/">Lennart</a>
was there to help me synthesize designs.
Using Synopsys ASIP Designer and a 28nm/32nm TSMC standard cell library, he designed a 3-stage RV32IMF processor with and without FCLASS.
Ultimately, the FCLASS instruction accounted for ~0.25% of the <abbr title="Floating Point Unit">FPU</abbr>’s area, excluding the register file.
That’s not much, but it is still roughly 35 times its relative execution share of 0.0072%.</p>

<p>To conclude, I recommend reconsidering the role of FCLASS in the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr>.
Personally, I think the best place for FCLASS is the fairly recent “Zfa” extension (additional <abbr title="Floating Point">FP</abbr> instructions).
That way, it is not part of the basic <abbr title="Floating Point">FP</abbr> instruction set, but if you need the fancier corner-case <abbr title="Floating Point">FP</abbr> functionality, you can still add it with “Zfa”.
I also believe Intel came to the same conclusion, which is why <abbr title="Floating Point">FP</abbr>-related extensions after x87 do not include this instruction.
It’s also not present in ARM64, which I interpret as another argument for this conclusion.</p>

<h3 id="63-subnormal-numbers--underflows">6.3 Subnormal Numbers &amp; Underflows</h3>
<p>Now to one of the most controversial features of the IEEE 754 standard <a class="citation" href="#kahan1998">[54]</a>:
<a href="https://en.wikipedia.org/wiki/Subnormal_number">subnormal numbers</a> and <a href="https://en.wikipedia.org/wiki/Arithmetic_underflow">gradual underflows</a>.
On the one hand, subnormal numbers bring numerically advantageous properties like Sterbenz’ lemma <a class="citation" href="#sterbenz1973">[55]</a>,
on the other hand, they increase hardware cost, and their implementation is considered the most challenging task in <abbr title="Floating Point Unit">FPU</abbr> design <a class="citation" href="#schwarz2005">[56]</a>.
As shown by numerous works, handling subnormal numbers can reduce an <abbr title="Floating Point Unit">FPU</abbr>’s attainable throughput by more than $100\times$ <a class="citation" href="#dooley2006">[57], [58], [59]</a>.</p>

<p>Due to this possible performance degradation, Intel introduced the so-called <abbr title="Flush To Zero">FTZ</abbr> mode with the release of <abbr title="Streaming SIMD Extensions">SSE</abbr> in 1999 <a class="citation" href="#thakkur1999">[60]</a>.
This mode flushes subnormal numbers to zero, which increases the performance of applications with non-critical accuracy requirements, such as real-time 3D applications.
Such a mode is also present in ARM64 (FPCR.FZ), but you won’t find it in RISC-V!</p>
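<p>To make the effect of such a mode concrete, here is a small software model in C++. It is only an illustration: the function names are hypothetical, and real hardware performs the flush inside the <abbr title="Floating Point Unit">FPU</abbr>. On x86, for instance, flushing subnormal results (FTZ) and treating subnormal inputs as zero (DAZ) are two separate control bits, which this sketch simply combines.</p>

```cpp
#include <cmath>

// Software model of flush-to-zero: a subnormal becomes a zero of the same sign,
// every other value passes through unchanged.
inline float flush_to_zero(float x) {
    return (std::fpclassify(x) == FP_SUBNORMAL) ? std::copysign(0.0f, x) : x;
}

// Multiplication in which subnormal inputs and a subnormal result are flushed.
inline float mul_ftz(float a, float b) {
    return flush_to_zero(flush_to_zero(a) * flush_to_zero(b));
}
```

<p>With this model, multiplying $2^{-100}$ by $2^{-40}$ yields zero instead of the subnormal $2^{-140}$, which is exactly the accuracy that is traded for speed.</p>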

<p>How often subnormal numbers and underflows occur in practice is not stated in any of the aforementioned works.
Other works also provide only anecdotal evidence and statements like “gradual underflows are uncommon” <a class="citation" href="#kahan1997">[61]</a>.
So, let me remedy this using the profiling <abbr title="Virtual Platform">VP</abbr>.
The following graph depicts the relative share of underflows for applications with at least one underflow:</p>
<div style="text-align:center">
  <img id="fp-inst-heatmap" src="/assets/riscv_eval/underflow_dist.svg" alt="Applications with the highest share of underflows" width="70%" style="margin:10px;" />
</div>
<p>The results confirm that underflows and subnormals are the exception rather than the norm.
Out of 78 benchmarks, 59 did not raise a single underflow exception or have a single subnormal in-/output operand.
The highest share of underflows occurs in MiBench susan with 0.48% of all arithmetic <abbr title="Floating Point">FP</abbr> instructions underflowing.
Accumulated over all benchmarks, underflows occurred once every 7992 arithmetic <abbr title="Floating Point">FP</abbr> instructions, with subnormal in-/outputs every 3875/4427 operands.
Hence, only a fraction of <abbr title="Floating Point">FP</abbr> applications would benefit from an <abbr title="Flush To Zero">FTZ</abbr> mode.
How much performance can be gained ultimately depends on the hardware implementation and the application.</p>

<p>To get a rough idea, you can run the subnormal arithmetic evaluation benchmark by Dooley et al. <a class="citation" href="#dooley2006">[57]</a>.
On my x64 laptop (Intel(R) Core(TM) i5-8265U CPU), I get a slow-to-fast factor of 11.28.
To test some RISC-V hardware, I ran the same benchmark on StarFive’s VisionFive 2.
Surprisingly, the results showed no performance degradation due to subnormal arithmetic!
It even handles subnormal arithmetic faster than the laptop I’m currently using to write this blog post.
So why is that?</p>

<p>I cannot say this with 100% confidence, but my guess is that the underlying VisionFive 2 <abbr title="Floating Point Unit">FPU</abbr> is Berkeley’s HardFloat <a class="citation" href="#berkley-hardfloat">[62]</a>
or at least some derivative of it.
This <abbr title="Floating Point Unit">FPU</abbr> uses a special recoded format <a class="citation" href="#hardfloat-recoding">[20]</a>, enabled by RISC-V’s separate registers for <abbr title="Floating Point">FP</abbr> arithmetic,
to facilitate fast subnormal calculation.
How did I come to this conclusion?
StarFive’s VisionFive 2 uses an SoC called JH7110,
which incorporates multiple U74 cores from SiFive.
Andrew Waterman and Yunsup Lee, the founding members of SiFive, are among the <a href="https://github.com/ucb-bar/berkeley-hardfloat/graphs/contributors">top contributors to this project</a>.</p>

<p>Ultimately, the decision not to endow RISC-V with an <abbr title="Flush To Zero">FTZ</abbr> mode, as in ARM64 or x64, seems reasonable in my opinion.</p>

<h3 id="64-exponent-distribution">6.4 Exponent Distribution</h3>
<p>Although the IEEE 754 binary floating point is the most widespread approximation of real numbers in computing, other formats can be considered as well.
An often discussed alternative is the <em>posit</em> format introduced by the famous computer scientist J. L. Gustafson in 2017 <a class="citation" href="#gustafson2017">[63]</a>.
In contrast to IEEE 754’s quasi-uniform accuracy, posit exhibits a tapered accuracy centered around 1, which is qualitatively depicted in the following figure:</p>
<div style="text-align:center">
  <img id="fp-inst-heatmap" src="/assets/riscv_eval/float_posit_accuracy.svg" alt="Qualitative depiction of floating point and posit accuracy." width="40%" style="margin:10px;" />
</div>
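<p>The tapering can also be put into numbers. The sketch below computes how many fraction bits a posit has left for a value of magnitude $2^e$, based on the usual posit layout of sign bit, regime, exponent bits, and fraction. The function name and its parameters are hypothetical, and the choice of two exponent bits for a 32-bit posit in the note that follows is an assumption (it matches the 2022 posit standard). Values beyond the largest or smallest representable posit are not modeled.</p>

```cpp
#include <algorithm>

// Number of fraction bits of a posit with n bits and es exponent bits
// for a value of magnitude 2^e.
inline int posit_fraction_bits(int n, int es, int e) {
    const int m = 1 << es;  // number of exponent values covered by one regime step
    // Regime value k = floor(e / 2^es), written out so that negative e rounds down.
    const int k = (e >= 0) ? e / m : -((-e + m - 1) / m);
    // The regime is a run of identical bits plus one terminating bit.
    const int regime_len = (k >= 0) ? k + 2 : -k + 1;
    // Sign bit, regime, and exponent bits come first; the rest is fraction.
    return std::max(0, n - 1 - regime_len - es);
}
```

<p>For $n = 32$ and $es = 2$, this gives 27 fraction bits at $e = 0$ and at least 23 bits (the fraction width of IEEE 754 binary32) for $-20 \le e \le 19$. Outside that range, the posit has fewer fraction bits than the 32-bit float.</p>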
<p><br />
According to many works, most values in practical applications are centered around 1.
Consequently, posit should accumulate less error in many benchmarks.
Or to provide some quotes:</p>
<ul>
  <li>“Close to the number 1, posits have better precision than floating point. This is useful because numbers close to 1 are very common.” <a class="citation" href="#loyc2019">[64]</a></li>
  <li>“Posits have superior accuracy in the range near one, where most computations occur.” <a class="citation" href="#wikipedia-unum">[65]</a></li>
  <li>“Worst-case precision is highest where the most common numbers are, in the center of the range of possible exponents.” <a class="citation" href="#gustafson22017">[66]</a></li>
  <li>“For the most common values in the range of about 0.01 to 100, posits have higher accuracy than IEEE floats and bfloats, but less accuracy outside this dynamic range.” <a class="citation" href="#guntoro2020">[67]</a></li>
</ul>

<p>Interestingly, none of the sources mentioned substantiates the claimed centering around 1 with data.
Instead, it is only inferred from the observed lower rounding error of posit.</p>

<p>So it is time to bring some light into the darkness with the profiling <abbr title="Floating Point Unit">FPU</abbr>!
To do this, I recorded the exponent distribution of the inputs and outputs of all arithmetic 64-bit instructions.
After executing all 78 applications, the following picture emerged (the blue line represents the average, while each of the faint colors is an individual benchmark):</p>
<div style="text-align:center">
  <img id="exp-dist-overlay" src="/assets/riscv_eval/exp_dist_overlay_linear.svg" alt="64-bit exponent distribution of the 78 benchmarks (linear Y axis)" width="100%" style="margin:10px;" />
</div>
<p><br />
Please note that only the exponents of subnormal and normal numbers were assessed, i.e. NaNs and infinities were excluded.
As you can see, most applications, and also the average, are indeed centered around a magnitude of $2^{0} = 1$ with a Gaussian-like distribution.
In that regard, the results speak for posit.
To get some more differentiated conclusions, I redrew the graph with a logarithmic Y axis:</p>
<div style="text-align:center">
  <img id="exp-dist-overlay" src="/assets/riscv_eval/exp_dist_overlay.svg" alt="64-bit exponent distribution of the 78 benchmarks (logarithmic Y axis)" width="100%" style="margin:10px;" />
</div>
<p><br />
This graph reveals a distribution that is skewed towards smaller exponents.
So maybe some sort of negatively shifted exponent could help prevent underflows without risking too many infinities 🤔.
I guess someone has already tried this, but I couldn’t find any literature on the topic.</p>
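<p>As an illustration of how such an exponent histogram could be collected, here is a minimal C sketch. It is not the code of the profiling FPU: the names <code class="language-plaintext highlighter-rouge">unbiased_exponent</code>, <code class="language-plaintext highlighter-rouge">record_exponent</code>, and <code class="language-plaintext highlighter-rouge">exp_hist</code> are hypothetical, and both skipping zeros and counting all subnormals in the lowest bin are assumptions.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One bin per unbiased exponent of a finite, non-zero double: -1022 .. 1023. */
#define EXP_MIN (-1022)
#define EXP_MAX 1023
static uint64_t exp_hist[EXP_MAX - EXP_MIN + 1];

/* Writes the unbiased exponent of x to *exp_out and returns true.
 * Returns false for NaN, infinity, and zero, which are not binned. */
bool unbiased_exponent(double x, int *exp_out) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                 /* well-defined type punning */
    unsigned field = (unsigned)((bits >> 52) & 0x7ff);
    uint64_t fraction = bits & 0xfffffffffffffULL;  /* lower 52 bits */

    if (field == 0x7ff) return false;               /* NaN or infinity */
    if (field == 0 && fraction == 0) return false;  /* +0 or -0 (assumption: skipped) */

    /* Assumption: all subnormals are counted in the lowest bin (-1022). */
    *exp_out = (field == 0) ? EXP_MIN : (int)field - 1023;
    return true;
}

/* Adds one sample to the histogram if x has a binnable exponent. */
void record_exponent(double x) {
    int e;
    if (unbiased_exponent(x, &e)) {
        assert(e >= EXP_MIN && e <= EXP_MAX);
        exp_hist[e - EXP_MIN]++;
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">record_exponent</code> on every input and output of an arithmetic instruction would fill <code class="language-plaintext highlighter-rouge">exp_hist</code>, which can then be normalized and plotted per benchmark.</p>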

<p>To conclude, just looking at the topic from a mathematical point of view, posit seems to be a better number representation for the majority of the applications.
Maybe some unofficial RISC-V extensions, like Xposit <a class="citation" href="#mallasen2022">[68]</a>, might find their way into the official specification one day.</p>

<h3 id="65-mantissa-distribution">6.5 Mantissa Distribution</h3>
<p>RISC-V and most other <abbr title="Instruction Set Architectures">ISAs</abbr> use a radix of 2 for their <abbr title="Floating Point">FP</abbr> arithmetic.
But why not use a radix of 3, 4, or 10?
While radix 10 has some advantages when it comes to representing the numbers of everyday human life,
the highest average accuracy is achieved with radix 2.
If you are interested in the deeper theoretical background of this conclusion,
I can highly recommend the <em>Handbook of Floating-Point Arithmetic</em>  <a class="citation" href="#handbookoffloat2010">[69]</a>.</p>

<p>One important assumption in proving the superiority of radix 2 is a logarithmic mantissa distribution.
At least from a theoretical perspective, this assumption is fine.
As shown by R. W. Hamming <a class="citation" href="#hamming1970">[70]</a>, arithmetic operations transform various mantissa input distributions into a logarithmic distribution.
But how about a practical assessment?</p>

<p>The following graph depicts the mantissa distribution for all benchmarks.
Again, the blue line represents the average, while the faint colors represent individual benchmarks.
Note that I sorted the mantissas into 256 bins.</p>

<div style="text-align:center">
  <img id="fp-inst-heatmap" src="/assets/riscv_eval/mant_dist_overlay_linear.svg" alt="Mantissa distribution of the benchmarks (linear Y axis)" width="100%" style="margin:10px;" />
</div>
<p><br /></p>

<p>The linear graph is not really meaningful, so here’s the same data with a logarithmic Y axis.</p>

<div style="text-align:center">
  <img id="fp-inst-heatmap" src="/assets/riscv_eval/mant_dist_overlay.svg" alt="Mantissa distribution of the benchmarks (logarithmic Y axis)" width="100%" style="margin:10px;" />
</div>
<p><br /></p>

<p>I also added an ideal logarithmic distribution, which is represented by the thick orange line.
Except for some outliers here and there, the ideal distribution comes really close to the measurements.</p>
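<p>For reference, such an ideal curve can be written down explicitly. The following is a sketch under two assumptions that are not stated in the measurements themselves: the mantissa $m$ is normalized to $[1, 2)$, and the 256 bins have equal width and are indexed by $k$. With the logarithmic density, the probability $P_k$ of a mantissa falling into bin $k$ is:</p>

$$p(m) = \frac{1}{m \ln 2}, \qquad m \in [1, 2)$$

$$P_k = \int_{1 + k/256}^{1 + (k+1)/256} \frac{\mathrm{d}m}{m \ln 2} = \log_2 \frac{257 + k}{256 + k}, \qquad k = 0, \dots, 255$$

<p>Under these assumptions, the first bin ($P_0 \approx 0.0056$) is about twice as likely as the last one ($P_{255} \approx 0.0028$), and the sum telescopes to $\log_2 \frac{512}{256} = 1$.</p>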

<p>To conclude, choosing radix 2 doesn’t seem to be the worst decision.</p>

<h3 id="66-rounding-modes">6.6 Rounding Modes</h3>
<p>Whenever <abbr title="Floating Point">FP</abbr> stuff is computed, rounding errors might occur.
There’s not really a way to avoid them, but at least we can steer them in one direction or the other.
This can be achieved by means of <em>rounding modes</em>, of which the IEEE 754 standard defines the following:</p>
<ul>
  <li>roundTiesToEven (mandatory)</li>
  <li>roundTiesToAway (introduced in 2008, not mandatory)</li>
  <li>roundTowardPositive (mandatory)</li>
  <li>roundTowardNegative (mandatory)</li>
  <li>roundTowardZero (mandatory)</li>
</ul>

<p>I guess the names are quite self-explanatory. For example, roundTowardPositive will always round a value towards positive infinity.
The most common rounding mode for arithmetic is roundTiesToEven.
With that rounding mode, the result is always rounded to the nearest representable value.
If two representable values are equally near, the one with an even least significant digit is chosen.</p>

<p>Following the IEEE 754 standard, RISC-V also implements these five rounding modes.
As already mentioned in Subsubsection <a href="#conversions-and-rounding">2) Conversions and Rounding</a> and Subsection <a href="#33-the-registers">3.3 The Registers</a>,
there are two ways to make use of rounding modes.</p>

<p>The first one is to specify the rounding mode in the instruction itself.
Many F/D instructions have a dedicated 3-bit field for that as shown in the following excerpt from the RISC-V <abbr title="Instruction Set Architecture">ISA</abbr> manual <a class="citation" href="#risc-isa-manual-2016">[21]</a>:</p>

<div style="text-align:center">
  <img id="fp-inst-heatmap" src="/assets/riscv_eval/riscv-rm-encoding.jpg" alt="Rounding mode encoding of the RISC-V F/D instructions." width="90%" style="margin:10px;" />
</div>
<p><br /></p>

<p>The second option is to specify “dynamic” in the instruction’s rounding mode field, in which case the instruction uses the rounding mode stored in the register <a href="#riscv-v-fp-registers">FPCSR</a>.</p>

<p>So, why have two ways when one suffices?
As described in <em>Design of the RISC-V Instruction Set Architecture</em>  <a class="citation" href="#waterman2016">[7]</a>,
the design of the rounding modes follows that of most programming languages.
For instance, in C++ you can set a dynamic rounding mode for subsequent arithmetic floating point operations with <code class="language-plaintext highlighter-rouge">std::fesetround</code>.
This is pretty much the way FPCSR works.
But additionally, you have non-dynamic parts.
For example, casting a float value to an integer always uses roundTowardZero.
So, in that case, having the rounding mode statically encoded in the instruction is beneficial.</p>

<p>But how often does each case arise?
Again, I couldn’t find any literature, so I consulted my profiling <abbr title="Virtual Platform">VP</abbr>.
Using the <abbr title="Virtual Platform">VP</abbr>, I tracked the rounding modes under which each instruction was executed.
For the conversion instructions (float to int, int to float, etc.), the following distribution emerged:</p>
<ul>
  <li>roundTiesToEven: 0.843</li>
  <li>roundTowardZero: 0.045</li>
  <li>roundTowardNegative: 0.056</li>
  <li>roundTowardPositive: 0.056</li>
  <li>roundTiesToAway: 4.92e-05</li>
</ul>

<p>As you can see, roundTiesToEven is the most frequent rounding mode, while roundTiesToAway is rarely seen.</p>

<p>Now to arithmetic instructions (addition, multiplication, etc.):</p>
<ul>
  <li>roundTiesToEven: 1.0</li>
  <li>roundTowardZero: 0.0</li>
  <li>roundTowardNegative: 0.0</li>
  <li>roundTowardPositive: 0.0</li>
  <li>roundTiesToAway: 0.0</li>
</ul>

<p>Yes, you read that correctly.
Out of 7,290,823,332,047 arithmetic <abbr title="Floating Point">FP</abbr> instructions, not a single one used a non-default rounding mode!
So, why is that?
Or the better question is: Why would you use a non-default rounding mode?
roundTiesToEven already gives you the smallest error, so there’s not much reason to change it.</p>

<p>One of the very few applications of non-default rounding modes is <a href="https://en.wikipedia.org/wiki/Interval_arithmetic">interval arithmetic</a>.
Using interval arithmetic, you try to determine an upper and a lower bound for your result.
For example, when adding two numbers, the lower bound is given by roundTowardNegative, while the upper bound is given by roundTowardPositive.
The correct result is somewhere in between.
An implementation of interval arithmetic in C++ is the <a href="https://github.com/boostorg/interval">Boost interval library</a> <a class="citation" href="#boost-interval">[71]</a>.
Besides interval arithmetic, I couldn’t find any compelling reasons for non-default rounding in arithmetic instructions.</p>

<p>Ultimately, judging from my data alone, having a statically encoded rounding mode in arithmetic <abbr title="Floating Point">FP</abbr> instructions doesn’t make sense.
If I’m missing an important aspect, please contact me!</p>

<h2 id="7-conclusion--outlook">7 Conclusion &amp; Outlook</h2>
<p>In this work, I showed how a modified RISC-V <abbr title="Virtual Platform">VP</abbr> can be used to analyze the characteristics of the RISC-V <abbr title="Floating Point">FP</abbr> extensions F and D.
In total, the <abbr title="Virtual Platform">VP</abbr> executed more than 16 trillion <abbr title="Floating Point">FP</abbr> instructions from 78 applications, precisely tracking the distribution of <abbr title="Floating Point">FP</abbr> instructions, <abbr title="Floating Point">FP</abbr> mantissas, and <abbr title="Floating Point">FP</abbr> exponents, as well as the frequency of underflows.</p>

<p>Overall, I think the F/D extension is well thought out, but if I had the chance to redesign it from scratch, I’d reconsider the following things:</p>
<ul>
  <li>The FCLASS instruction seemed to be heavily underutilized. Maybe the “Zfa” extension is a more appropriate place for it.</li>
  <li>Non-default rounding modes for arithmetic <abbr title="Floating Point">FP</abbr> instructions are extremely rare. Maybe the static rounding mode encoding in the instruction can be removed.</li>
</ul>

<p>Besides the RISC-V-specific things, I learned the following about <abbr title="Floating Point">FP</abbr> in practice:</p>
<ul>
  <li>Most <abbr title="Floating Point">FP</abbr> data is centered around a magnitude of 1</li>
  <li>Underflows are rare</li>
  <li>Loads and stores seem to be the most common <abbr title="Floating Point">FP</abbr> operations</li>
  <li>Having IEEE 754 is nice, but 2 revisions and lax definitions have led to significant fragmentation among <abbr title="Instruction Set Architectures">ISAs</abbr></li>
  <li>Most <abbr title="Instruction Set Architectures">ISAs</abbr> don’t really fully adhere to IEEE 754 because it mandates too many instructions</li>
</ul>

<p>One major <abbr title="Instruction Set Architecture">ISA</abbr> characteristic not analyzed in this work is the optimal number of registers.
Here, the <abbr title="Virtual Platform">VP</abbr> could be modified to track the register pressure of <abbr title="Floating Point">FP</abbr> registers during the execution.
But this post is already long enough, so maybe I will address it in future work.</p>

<p>If you find any bugs or typos, or have any remarks, feel free to send me a <a href="/about/">mail</a>.
I also welcome any kind of discussion 🙂.</p>

<h2 id="8-references">8 References</h2>
<ol class="bibliography"><li><span id="simv2022">[1]L. Jünger, J. H. Weinstock, and R. Leupers, “SIM-V: Fast, Parallel RISC-V Simulation for Rapid Software Verification,” <i>DVCON Europe 2022</i>, 2022. </span></li>
<li><span id="risc-v-isa-dev">[2]“RISC-V ISA Dev Google Group.” [Online]. Available at: https://groups.google.com/a/groups.riscv.org/g/isa-dev</span></li>
<li><span id="risc-v-isa-manual-repo">[3]“RISC-V ISA Manual Github Repository.” [Online]. Available at: https://github.com/riscv/riscv-isa-manual</span></li>
<li><span id="riscv-mailing-lists">[4]“RISC-V Working Groups Mailing List.” [Online]. Available at: https://lists.riscv.org/g/main</span></li>
<li><span id="riscv-workshop-2015">[5]K. Asanovic, “3rd RISC-V Workshop: RISC-V Updates.” Jan-2016 [Online]. Available at: https://riscv.org/wp-content/uploads/2016/01/Tues1000-RISCV-20160105-Updates.pdf</span></li>
<li><span id="risc-v-geneology">[6]T. Chen and D. A. Patterson, “RISC-V Geneology,” <i>EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-6</i>, 2016. </span></li>
<li><span id="waterman2016">[7]A. Waterman, “Design of the RISC-V Instruction Set Architecture,” 2016. </span></li>
<li><span id="risc-isa-manual-2011">[8]A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, “The RISC-V Instruction Set Manual, Volume I: Base User-Level ISA, Version 1,” <i>EECS Department, UC Berkeley, Tech. Rep. UCB/EECS-2011-62</i>, vol. 116, 2011. </span></li>
<li><span id="ieee754-2008">[9]“IEEE Standard for Floating-Point Arithmetic,” <i>IEEE Std 754-2008</i>. IEEE, 2008. </span></li>
<li><span id="ieee754-2019">[10]“IEEE Standard for Floating-Point Arithmetic,” <i>IEEE Std 754-2019 (Revision of IEEE 754-2008)</i>. IEEE, 2019. </span></li>
<li><span id="risc-isa-manual-2017">[11]A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.2,” 2017. </span></li>
<li><span id="ieee754-1985">[12]“IEEE Standard for Binary Floating-Point Arithmetic,” <i>ANSI/IEEE Std 754-1985</i>. IEEE, 1985. </span></li>
<li><span id="80960-programmers-manual">[13]Intel, “80960KB Programmer’s Reference Manual.” </span></li>
<li><span id="loongarch-reference-manual">[14]“LoongArch Reference Manual Volume 1: Basic Architecture.” </span></li>
<li><span id="ia64-developers-manual">[15]Intel, “Intel® IA-64 Architecture Software Developer’s Manual Volume 3: Instruction Set Reference.” 2000. </span></li>
<li><span id="mips-reference-manual">[16]MIPS, “MIPS® Architecture For Programmers Volume II-A: The MIPS64® Instruction Set Reference Manual Revision 6.05.” 2016. </span></li>
<li><span id="x86-developers-manual">[17]Intel, “Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture.” 2016. </span></li>
<li><span id="powerpc-reference-manual">[18]IBM, “PowerPC User Instruction Set Architecture Book I Version 2.01.” 2003. </span></li>
<li><span id="openrisc1000-arch-manual-2019">[19]OPENRISC.IO, “OpenRISC 1000 Architecture Manual - Architecture Version 1.3.” 2019 [Online]. Available at: https://raw.githubusercontent.com/openrisc/doc/master/openrisc-arch-1.3-rev1.pdf</span></li>
<li><span id="hardfloat-recoding">[20]J. R. Hauser, “HardFloat Recoding.” [Online]. Available at: www.jhauser.us/arithmetic/HardFloat-1/doc/HardFloat-Verilog.html</span></li>
<li><span id="risc-isa-manual-2016">[21]A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, “The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.1,” California Univ Berkeley Dept of Electrical Engineering and Computer Sciences, 2016. </span></li>
<li><span id="nan-box-github-issue">[22]A. Waterman, “NaN Boxing Github Issue.” [Online]. Available at: https://github.com/riscv/riscv-isa-manual/issues/30</span></li>
<li><span id="nan-box-rfc">[23]A. Bradbury, “NaN Boxing RFC.” [Online]. Available at: https://gist.github.com/asb/a3a54c57281447fc7eac1eec3a0763fa</span></li>
<li><span id="nan-box-google">[24]A. Bradbury, “NaN Boxing ISA-Dev Group.” Mar-2017 [Online]. Available at: https://groups.google.com/a/groups.riscv.org/g/isa-dev/c/_r7hBlzsEd8/m/z1rjr2BaAwAJ</span></li>
<li><span id="linpack">[25]“linpack.” [Online]. Available at: https://www.netlib.org/linpack/</span></li>
<li><span id="npb">[26]“NAS Parallel Benchmarks.” [Online]. Available at: https://www.nas.nasa.gov/software/npb.html</span></li>
<li><span id="opennn-examples">[27]“OpenNN Examples.” [Online]. Available at: https://github.com/Artelnics/opennn/tree/master/examples</span></li>
<li><span id="glmark2">[28]“glmark2.” [Online]. Available at: https://github.com/glmark2/glmark2</span></li>
<li><span id="financebench">[29]“FinanceBench.” [Online]. Available at: https://github.com/cavazos-lab/FinanceBench</span></li>
<li><span id="smallpt">[30]“smallpt.” [Online]. Available at: https://github.com/matt77hias/smallpt</span></li>
<li><span id="scimark">[31]“SciMark 2.0.” [Online]. Available at: https://math.nist.gov/scimark2/</span></li>
<li><span id="octane-benchmark">[32]“Octane 2.0.” [Online]. Available at: https://github.com/chromium/octane</span></li>
<li><span id="numpy-benchmarks">[33]“NumPy benchmarks.” [Online]. Available at: https://github.com/numpy/numpy/tree/main/benchmarks</span></li>
<li><span id="spec-cpu-2017">[34]“SPEC CPU 2017.” [Online]. Available at: https://spec.org/cpu2017/</span></li>
<li><span id="fbench">[35]J. Walker, “fbench.” [Online]. Available at: https://www.fourmilab.ch/fbench/fbench.html</span></li>
<li><span id="ffbench">[36]J. Walker, “ffbench.” [Online]. Available at: https://www.fourmilab.ch/fbench/ffbench.html</span></li>
<li><span id="whetstone">[37]“whetstone.” [Online]. Available at: https://netlib.org/benchmark/whetstone.c</span></li>
<li><span id="streambenchmark">[38]“STREAM benchmark.” [Online]. Available at: https://www.cs.virginia.edu/stream/</span></li>
<li><span id="c-ray">[39]“c-ray.” [Online]. Available at: https://github.com/jtsiomb/c-ray</span></li>
<li><span id="aobench">[40]S. Fujita, “aobench.” [Online]. Available at: https://github.com/syoyo/aobench</span></li>
<li><span id="himeno-benchmark">[41]“Himeno Benchmark.” [Online]. Available at: https://github.com/kowsalyaChidambaram/Himeno-Benchmark</span></li>
<li><span id="coremark-pro">[42]“CoreMark®-PRO.” [Online]. Available at: https://github.com/eembc/coremark-pro</span></li>
<li><span id="mibench">[43]M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, “MiBench: A free, commercially representative embedded benchmark suite,” in <i>Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538)</i>, 2001, pp. 3–14, doi: 10.1109/WWC.2001.990739. </span></li>
<li><span id="simvpaper">[44]L. Jünger, J. H. Weinstock, and R. Leupers, “SIM-V: Fast, Parallel RISC-V Simulation for Rapid Software Verification,” <i>DVCON Europe 2022</i>. </span></li>
<li><span id="gcov">[45]“gcov.” [Online]. Available at: https://gcc.gnu.org/onlinedocs/gcc/Gcov.html</span></li>
<li><span id="patterson2017">[46]D. Patterson and A. Waterman, <i>The RISC-V Reader: An Open Architecture Atlas</i>, 1st ed. Strawberry Canyon, 2017. </span></li>
<li><span id="x86-inst-distribution">[47]A. Akshintala, B. Jain, C.-C. Tsai, M. Ferdman, and D. E. Porter, “X86-64 Instruction Usage among C/C++ Applications,” in <i>Proceedings of the 12th ACM International Conference on Systems and Storage</i>, New York, NY, USA, 2019, pp. 68–79, doi: 10.1145/3319647.3325833 [Online]. Available at: https://doi.org/10.1145/3319647.3325833</span></li>
<li><span id="ibrahim2010">[48]A. H. Ibrahim, M. B. Abdelhalim, H. Hussein, and A. Fahmy, “Analysis of x86 instruction set usage for Windows 7 applications,” in <i>2010 2nd International Conference on Computer Technology and Development</i>, 2010, pp. 511–516, doi: 10.1109/ICCTD.2010.5645851. </span></li>
<li><span id="bosbach2023">[49]N. Bosbach, L. Jünger, R. Pelke, N. Zurstraßen, and R. Leupers, “Entropy-Based Analysis of Benchmarks for Instruction Set Simulators,” in <i>RAPIDO2023: Proceedings of the DroneSE and RAPIDO: System Engineering for constrained embedded systems</i>, New York, NY, USA, 2023, pp. 54–59, doi: 10.1145/3579170.3579267. </span></li>
<li><span id="huang1998">[50]“Analysis of X86 Instruction Set Usage for DOS/Windows Applications and Its Implication on Superscalar Design,” in <i>Proceedings of the International Conference on Computer Design</i>, USA, 1998, p. 566. </span></li>
<li><span id="hough2019">[51]D. G. Hough, “The IEEE Standard 754: One for the History Books,” <i>Computer</i>, vol. 52, no. 12, pp. 109–112, 2019, doi: 10.1109/MC.2019.2926614. </span></li>
<li><span id="musl">[52]“musl.” [Online]. Available at: https://musl.libc.org/</span></li>
<li><span id="newlib">[53]“Newlib.” [Online]. Available at: https://sourceware.org/newlib/</span></li>
<li><span id="kahan1998">[54]W. M. Kahan and C. Severance, “An Interview with the Old Man of Floating-Point.” [Online]. Available at: https://people.eecs.berkeley.edu/~wkahan/ieee754status/754story.html</span></li>
<li><span id="sterbenz1973">[55]P. H. Sterbenz, “Floating-point computation,” 1973. </span></li>
<li><span id="schwarz2005">[56]E. M. Schwarz, M. Schmookler, and S. D. Trong, “FPU implementations with denormalized numbers,” <i>IEEE Transactions on Computers</i>, vol. 54, no. 7, pp. 825–836, 2005, doi: 10.1109/TC.2005.118. </span></li>
<li><span id="dooley2006">[57]I. Dooley and L. Kale, “Quantifying the interference caused by subnormal floating-point values,” Jan. 2006. </span></li>
<li><span id="bjorndalen2006">[58]J. Bjørndalen and O. Anshus, “Trusting Floating Point Benchmarks - Are Your Benchmarks Really Data Independent?,” 2006, pp. 178–188, doi: 10.1007/978-3-540-75755-9_23. </span></li>
<li><span id="wittmann2015">[59]M. Wittmann, T. Zeiser, G. Hager, and G. Wellein, “Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero,” Jun. 2015. </span></li>
<li><span id="thakkur1999">[60]S. Thakkur and T. Huff, “Internet Streaming SIMD Extensions,” <i>Computer</i>, vol. 32, no. 12, pp. 26–34, 1999, doi: 10.1109/2.809248. </span></li>
<li><span id="kahan1997">[61]W. M. Kahan, “Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic.” [Online]. Available at: https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF</span></li>
<li><span id="berkley-hardfloat">[62]J. R. Hauser, “Berkeley HardFloat Github Repository.” [Online]. Available at: https://github.com/ucb-bar/berkeley-hardfloat</span></li>
<li><span id="gustafson2017">[63]J. Gustafson and I. Yonemoto, “Beating Floating Point at its Own Game: Posit Arithmetic,” <i>Supercomputing Frontiers and Innovations</i>, vol. 4, pp. 71–86, Jun. 2017, doi: 10.14529/jsfi170206. </span></li>
<li><span id="loyc2019">[64]Loyc, “Better floating point: posits in plain language.” [Online]. Available at: http://loyc.net/2019/unum-posits.html</span></li>
<li><span id="wikipedia-unum">[65]Wikipedia, “Wikipedia - Unum (Number Format).” [Online]. Available at: https://en.wikipedia.org/wiki/Unum_(number_format)</span></li>
<li><span id="gustafson22017">[66]J. Gustafson, “Posit arithmetic,” <i>Mathematica Notebook describing the posit number system</i>, 2017. </span></li>
<li><span id="guntoro2020">[67]A. Guntoro <i>et al.</i>, “Next Generation Arithmetic for Edge Computing,” in <i>2020 Design, Automation and Test in Europe Conference and Exhibition (DATE)</i>, 2020, pp. 1357–1365, doi: 10.23919/DATE48585.2020.9116196. </span></li>
<li><span id="mallasen2022">[68]D. Mallasén, R. Murillo, A. A. Del Barrio, G. Botella, L. Piñuel, and M. Prieto-Matias, “PERCIVAL: Open-Source Posit RISC-V Core With Quire Capability,” <i>IEEE Transactions on Emerging Topics in Computing</i>, vol. 10, no. 3, pp. 1241–1252, 2022, doi: 10.1109/TETC.2022.3187199. </span></li>
<li><span id="handbookoffloat2010">[69]J.-M. Muller <i>et al.</i>, <i>Handbook of Floating-Point Arithmetic</i>. 2010. </span></li>
<li><span id="hamming1970">[70]R. W. Hamming, “On the distribution of numbers,” <i>The Bell System Technical Journal</i>, vol. 49, no. 8, pp. 1609–1625, 1970, doi: 10.1002/j.1538-7305.1970.tb04281.x. </span></li>
<li><span id="boost-interval">[71]Boost, “Boost interval.” [Online]. Available at: https://github.com/boostorg/interval</span></li></ol>]]></content><author><name></name></author><category term="RISC-V" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">TLMBoy: Exploring the Game Boy’s Boot</title><link href="https://www.chciken.com/tlmboy/2022/05/02/gameboy-boot.html" rel="alternate" type="text/html" title="TLMBoy: Exploring the Game Boy’s Boot" /><published>2022-05-02T09:55:44+00:00</published><updated>2022-05-02T09:55:44+00:00</updated><id>https://www.chciken.com/tlmboy/2022/05/02/gameboy-boot</id><content type="html" xml:base="https://www.chciken.com/tlmboy/2022/05/02/gameboy-boot.html"><![CDATA[<style>
  #toc_container {
    background: #f9f9f9 none repeat scroll 0 0;
    border: 1px solid #aaa;
    display: table;
    margin-bottom: 1em;
    padding: 20px;
    width: auto;
  }

  .toc_title {
      font-weight: 700;
      text-align: center;
  }

  #toc_container li, #toc_container ul, #toc_container ul li{
      list-style: outside none none !important;
  }
</style>

<div id="toc_container">
  <p class="toc_title">Contents</p>
  <ul class="toc_list">
  <li><a href="#1-introduction">1. Introduction</a></li>
  <li><a href="#2-the-boot-code">2. The Boot Code</a>
    <ul>
      <li><a href="#21-bb0-init-regfile">2.1 BB0: Init Regfile</a></li>
      <li><a href="#22-bb1-init-the-vram">2.2 BB1: Init the VRAM</a></li>
      <li><a href="#23-bb2-init-the-sound">2.3 BB2: Init the sound</a></li>
      <li><a href="#24-bb3-init-the-color-palette">2.4 BB3: Init the color palette</a></li>
      <li><a href="#25-load-the-logo">2.5 BB4: Load the Logo</a></li>
      <li><a href="#26-decompress-and-copy">2.6 Decompress and Copy</a></li>
      <li><a href="#27-registered-trademark">2.7 Registered Trademark</a></li>
      <li><a href="#28-selecting-the-right-tiles">2.8 Selecting the Right Tiles</a></li>
      <li><a href="#29-display-init">2.9 Display Init</a></li>
      <li><a href="#210-showtime">2.10 Showtime!</a></li>
      <li><a href="#211-checking-the-logo">2.11 Checking the Logo</a></li>
    </ul>
  </li>
  <li><a href="#3-conclusion">3. Conclusion</a></li>
  <li><a href="#4-references">4. References</a></li>
  </ul>
</div>

<h2 id="1-introduction">1. Introduction</h2>
<p>This is another post in my TLMBoy series, in which I document the development of my Game Boy emulator of the same name.
In contrast to my other posts, the following sections do not deal with any “How do I implement this and that?” questions.
Instead, I dissect and explain the hidden 256-byte boot code that helps bring up the Game Boy!</p>

<p>When turning on most computer systems, only a few things are guaranteed to have a certain value. The Game Boy is no exception and only guarantees that the program counter register is initialized with 0.
All other things like other registers, the sound processor, and the pixel processing unit have to be initialized by the boot process.</p>

<p>In the case of the Game Boy, the boot code resides within a special 256-byte ROM that is mapped from 0x00 to 0xff.
Interestingly, the boot ROM unmaps itself from the memory map after finishing the boot.
This unmapping made it quite hard to reverse engineer the boot code.</p>

<p>The first successful reverse engineering attempt was made by a dude(tte) called “neviksti” in 2003. This was 14 years after the initial release of the Game Boy in 1989!
According to the gbdev wiki <a href="#6-references">[1]</a>, this person was actually mad enough to decap the Game Boy’s SoC and read out every single bit using a microscope.
Interestingly, neviksti’s website <a href="#6-references">[2]</a> is still up today and features some cool die shots like this one:</p>
<div style="text-align:center">
<img src="/assets/gameboy_boot/DMG_overview_commented.jpg" alt="drawing" width="70%" />
</div>
<p><br />
If you are interested in reading and interpreting the bits of a ROM, I can highly recommend <a href="https://github.com/travisgoodspeed/gbrom-tutorial">this</a> tutorial.</p>

<p>In the following sections, I’ll go through the boot code line by line and analyze it.
Furthermore, I’ll try to translate the assembly into some C-ish code.
<br />
Of course, I’m a little bit late to the party, and a lot of people wrote some nice write-ups before me. Take a look at the <a href="#6-references">Literature</a> to see what helped me write this post.
<br />
Nintendo themselves also helped me
by putting their boot CFG (control flow graph) into a patent <a href="#6-references">[3]</a> called
“System for preventing the use of an unauthorized external memory”:</p>
<div style="text-align:center">
<img src="/assets/gameboy_boot/nintendo_patent.png" alt="drawing" width="70%" />
</div>
<p><br /></p>

<h2 id="2-the-boot-code">2. The Boot Code</h2>
<p>Before analyzing the code, we do of course need some assembly code to work on!
My personal favorite is this <a href="#6-references">[4]</a> commented, human-readable boot ROM, which I will refer to in the following.</p>

<h3 id="21-bb0-init-regfile">2.1 BB0: Init Regfile</h3>
<p>The first three instructions are some plain register initializations.
The stack pointer <code class="language-plaintext highlighter-rouge">sp</code> is set to 0xfffe; register <code class="language-plaintext highlighter-rouge">a</code> is set to 0; and <code class="language-plaintext highlighter-rouge">hl</code> now points to the last byte of the VRAM (0x9fff).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BB0:
0x000  ld   sp, $fffe   // init stack
0x003  xor  a           // efficient way for: a = 0
0x004  ld   hl, $9fff   // set hl to VRAM
</code></pre></div></div>
<h3 id="22-bb1-init-the-vram">2.2 BB1: Init the VRAM</h3>
<p>To avoid displaying random garbage, the Game Boy has to zero-initialize its VRAM.
The following three-line loop takes care of it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BB1:
0x007  ld   [hl-], a   // load a into [hl], then decrement hl
0x008  bit  7, h       // stop condition
0x00a  jr   nz, @BB1   // jump to BB1, if not zero
</code></pre></div></div>
<p>This quite dense code relies on a little bit trick.
The VRAM ranges from 0x8000 to 0x9FFF, and all of these addresses have bit 7 of their high byte <code class="language-plaintext highlighter-rouge">h</code> set
(that is, bit 15 of the 16-bit address).
The first address below 0x8000 doesn’t:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0b10000000 00000000 = 0x8000
0b01111111 11111111 = 0x7FFF
</code></pre></div></div>
<p>The same functionality can be achieved with the following C code:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mh">0x9FFF</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mh">0x8000</span><span class="p">;</span> <span class="o">--</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">mem</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
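<p>A variant that stays closer to the assembly tests bit 7 of the high byte instead of comparing against 0x8000. This is only a sketch: <code class="language-plaintext highlighter-rouge">clear_vram</code> is a hypothetical name, and <code class="language-plaintext highlighter-rouge">mem</code> is assumed to be a flat 64 KiB array standing in for the address space.</p>

```c
#include <assert.h>
#include <stdint.h>

static uint8_t mem[0x10000];  /* assumption: flat model of the address space */

/* Mirrors BB1: clears 0x9fff down to 0x8000 and returns the final hl. */
uint16_t clear_vram(void) {
    uint16_t hl = 0x9fff;              /* ld hl, $9fff */
    const uint8_t a = 0;               /* xor a */
    do {
        mem[hl--] = a;                 /* ld [hl-], a */
    } while ((hl >> 8) & 0x80);        /* bit 7, h ; jr nz, BB1 */
    assert(hl == 0x7fff);              /* first address with bit 7 of h cleared */
    return hl;
}
```

<p>The loop stops as soon as <code class="language-plaintext highlighter-rouge">hl</code> has been decremented to 0x7fff, so exactly the 0x2000 bytes of VRAM are written.</p>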
<h3 id="23-bb2-init-the-sound">2.3 BB2: Init the sound</h3>
<p>The next lines set up the Game Boy’s sound processor:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x00c  ld  hl, rNR52  // load 0xFF26 into hl: register no 52
0x00f  ld  c, $11
0x011  ld  a, $80
0x013  ld  [hl-], a   // rNR52 = $80, all sound on
0x014  ld  [c], a     // rNR11 = $80, wave duty 50%
0x015  inc c
0x016  ld  a, $f3
0x018  ld  [c], a     // rNR12 = $f3, envelope settings
0x019  ld  [hl-], a   // rNR51 = $f3, sound output terminals
0x01a  ld  a, $77
0x01c  ld  [hl], a    // rNR50 = $77, SO2 and SO1 at full volume, Vin off
</code></pre></div></div>

<p>These lines set up the square wave channel for the iconic boot “bling bling” sound.
I won’t go into details here, as this setup is of minor relevance for the boot process.
The corresponding C-Code could look like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mem</span><span class="p">[</span><span class="mh">0xff26</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x80</span><span class="p">;</span> <span class="c1">// All sound on.</span>
<span class="n">mem</span><span class="p">[</span><span class="mh">0xff11</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x80</span><span class="p">;</span> <span class="c1">// Square wave: Wave duty 50%, don't use length register.</span>
<span class="n">mem</span><span class="p">[</span><span class="mh">0xff12</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xf3</span><span class="p">;</span> <span class="c1">// Square wave: Start at full volume (15), and then decrement every 3 envelope ticks until 0.</span>
<span class="n">mem</span><span class="p">[</span><span class="mh">0xff25</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xf3</span><span class="p">;</span> <span class="c1">// Sound output terminal.</span>
<span class="n">mem</span><span class="p">[</span><span class="mh">0xff24</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x77</span><span class="p">;</span> <span class="c1">// Both output terminals (SO2, SO1) at full volume, Vin off.</span>
</code></pre></div></div>
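<p>The register values are easier to read once they are split into their bit fields.
Here is a small sketch that decodes NR12 and NR50; the struct and function names are invented for this example, while the bit positions follow the usual register descriptions:</p>

```c
#include <stdint.h>

/* NR12 (0xFF12): volume envelope of the first square channel. */
typedef struct {
  uint8_t initial_volume; /* bits 7-4: 0..15 */
  uint8_t increasing;     /* bit 3: 1 = volume goes up, 0 = volume goes down */
  uint8_t pace;           /* bits 2-0: envelope ticks per volume step */
} Envelope;

/* NR50 (0xFF24): master volume and Vin routing. */
typedef struct {
  uint8_t so2_vin;    /* bit 7 */
  uint8_t so2_volume; /* bits 6-4: 0..7 */
  uint8_t so1_vin;    /* bit 3 */
  uint8_t so1_volume; /* bits 2-0: 0..7 */
} MasterVolume;

Envelope decode_nr12(uint8_t value) {
  Envelope e;
  e.initial_volume = value >> 4;
  e.increasing = (value >> 3) & 1;
  e.pace = value & 7;
  return e;
}

MasterVolume decode_nr50(uint8_t value) {
  MasterVolume m;
  m.so2_vin = value >> 7;
  m.so2_volume = (value >> 4) & 7;
  m.so1_vin = (value >> 3) & 1;
  m.so1_volume = value & 7;
  return m;
}
```

<p>For the boot values, <code class="language-plaintext highlighter-rouge">decode_nr12(0xf3)</code> yields a start volume of 15 that decreases every 3 ticks, and <code class="language-plaintext highlighter-rouge">decode_nr50(0x77)</code> yields volume 7 on both terminals with Vin disabled.</p>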

<h3 id="24-bb3-init-the-color-palette">2.4 BB3: Init the color palette</h3>
<p>As a next step, the background and window color palette register (BGP, at 0xff47) is set to 0b11111100,
and the pointers for loading the logo are prepared.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x01d  ld  a, $fc
0x01f  ldh [rBGP], a  // BGP = $fc, set up color palette
0x021  ld  de, $0104  // de = cartridge header logo
0x024  ld  hl, $8010  // hl = VRAM
</code></pre></div></div>
<p>The BGP setup can be translated as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>11 10 01 00 # value
|  |  |  |
11 11 11 00 # mapped to
|  |  |  |
b  b  b  w # b=black, w=white
</code></pre></div></div>
<p>It’s simply a remapping of color values for the background and window tiles.
So, for example, a pixel with a value of 01 is displayed as 11, which is deep black
(the reason for this mapping is explained in <a href="#27-registered-trademark">Subsection 2.7</a>).
The corresponding C-Code is just (ignoring the pointers):</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mem</span><span class="p">[</span><span class="mh">0xff47</span><span class="p">]</span>  <span class="o">=</span> <span class="mh">0xfc</span><span class="p">;</span> <span class="c1">// set up BG and window color palette</span>
</code></pre></div></div>
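<p>The lookup that the PPU performs with this register can be sketched in one line.
The helper name <code class="language-plaintext highlighter-rouge">bgp_shade</code> is an assumption for this example; the only thing taken from the hardware is that each 2-bit color value selects a 2-bit field of BGP:</p>

```c
#include <stdint.h>

/* Returns the shade (0 = white ... 3 = black) that the palette `bgp`
 * assigns to the 2-bit color value `color` of a pixel. */
uint8_t bgp_shade(uint8_t bgp, uint8_t color) {
  return (uint8_t)((bgp >> (2 * (color & 3))) & 3);
}
```

<p>With the boot value 0xfc, color 0 stays white while colors 1, 2 and 3 all become black, which matches the diagram above.</p>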

<h3 id="25-bb4-load-the-logo">2.5 BB4: Load the Logo</h3>
<p>The job of the next basic block is to load the Nintendo logo from the cartridge into the VRAM:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BB4:
0x027  ld   a, [de]    // for loop over cartridge logo data, de = 0x104
0x028  call $0095      // copy cartridge logo data to VRAM at $8010
0x02b  call $0096
0x02e  inc  de
0x02f  ld   a, e
0x030  cp   $34        // a == 0x34?
0x032  jr   nz, @BB4
</code></pre></div></div>
<p>However, due to size constraints, the Nintendo logo is stored in a compressed form and needs to be decompressed by a relatively simple algorithm.
That way, the 48 bytes of the compressed Nintendo logo are inflated to 384 bytes (= 24 tiles) worth of tile data: each logo byte covers 8 bytes of VRAM.
The corresponding C-Code looks like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u8</span> <span class="o">*</span><span class="n">vram</span> <span class="o">=</span> <span class="mh">0x8010</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="n">logo</span> <span class="o">=</span> <span class="mh">0x0104</span><span class="p">;</span> <span class="n">logo</span> <span class="o">&lt;</span> <span class="mh">0x0134</span><span class="p">;</span> <span class="o">++</span><span class="n">logo</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">u8</span> <span class="n">data</span> <span class="o">=</span> <span class="o">*</span><span class="n">logo</span><span class="p">;</span>
  <span class="n">DecompressAndCopy</span><span class="p">(</span><span class="n">data</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">,</span> <span class="n">vram</span><span class="p">);</span> <span class="c1">// high nibble first</span>
  <span class="n">vram</span> <span class="o">+=</span> <span class="mi">4</span><span class="p">;</span>
  <span class="n">DecompressAndCopy</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">vram</span><span class="p">);</span>      <span class="c1">// then the low nibble</span>
  <span class="n">vram</span> <span class="o">+=</span> <span class="mi">4</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// vram will be 0x8190 (0x8010 + 48 * 8)</span>
</code></pre></div></div>
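<p>The same loop can be tried out on a PC by writing into a buffer instead of real VRAM.
The following is only a sketch under that assumption; <code class="language-plaintext highlighter-rouge">double_bits</code> and <code class="language-plaintext highlighter-rouge">expand_logo</code> are hypothetical helper names, and the bit doubling anticipates the algorithm explained in the next section:</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Doubles each of the four low bits: 0b0000abcd becomes 0baabbccdd. */
static uint8_t double_bits(uint8_t nibble) {
  uint8_t out = 0;
  for (int i = 0; i < 4; ++i) {
    if (nibble & (1u << i)) {
      out |= (uint8_t)(3u << (2 * i));
    }
  }
  return out;
}

/* Expands the 48-byte header logo into `mem`, starting at `addr`.
 * Returns the address right after the last expanded tile row. */
uint16_t expand_logo(const uint8_t logo[48], uint8_t *mem, uint16_t addr) {
  for (size_t i = 0; i < 48; ++i) {
    /* The high nibble is handled first, then the low nibble. */
    uint8_t halves[2] = { (uint8_t)(logo[i] >> 4), (uint8_t)(logo[i] & 0x0F) };
    for (int h = 0; h < 2; ++h) {
      uint8_t row = double_bits(halves[h]);
      mem[addr] = row;      /* first pixel row  */
      mem[addr + 2] = row;  /* second pixel row */
      addr += 4;            /* the two bytes in between stay untouched */
    }
  }
  return addr;
}
```

<p>Starting at 0x8010, the function returns 0x8190, which is exactly where tile 25 begins.</p>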
<p>In the following section, we will take a closer look at the decompression algorithm.</p>

<h3 id="26-decompress-and-copy">2.6 Decompress And Copy</h3>
<p>The decompression algorithm of the Game Boy is not really complex, yet the assembly is quite dense:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// 'a' holds the next datum of the logo
DecompressAndCopy:
0x095   ld    c, a    // c = 76543210
0x096   ld    b, $04  // loop counter

decomp_loop:
0x098   push  bc
0x099   rl    c
0x09b   rla
0x09c   pop   bc
0x09d   rl    c
0x09f   rla
0x0a0   dec   b
0x0a1   jr    nz, @decomp_loop

0x0a3   ld    [hl+], a
0x0a4   inc   hl        // leave one byte blank
0x0a5   ld    [hl+], a
0x0a6   inc   hl        // leave one byte blank
0x0a7   ret
</code></pre></div></div>
<p>So, let’s start with an abstract description of what the algorithm actually does.
As an input, the algorithm receives one byte of data (the numbers represent bit positions):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; in = 76543210
</code></pre></div></div>
<p>The output is then a scaled version (2x in x and y direction) distributed over 4 bytes:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; out0 = 77665544
&gt; out1 = 77665544
&gt; out2 = 33221100
&gt; out3 = 33221100
</code></pre></div></div>
<p>I hope that this is as simple as I promised.
We now increase the difficulty and analyze the actual implementation.
The first call of <code class="language-plaintext highlighter-rouge">DecompressAndCopy</code> calculates the first two bytes of the output (out0, out1),
while the second call calculates the last two bytes (out2, out3).
Note that the second call uses 0x96 instead of 0x95 as an entry point, because the intermediate value is still residing in register <code class="language-plaintext highlighter-rouge">c</code>.<br />
To make the code more accessible, I did a systematic analysis of the <code class="language-plaintext highlighter-rouge">decomp_loop</code>.
In the following table, each column represents an iteration of the <code class="language-plaintext highlighter-rouge">decomp_loop</code>, whereby the numbers uniquely identify
the bits (C stands for the carry flag, x for its unknown initial value):</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">instr</th>
      <th style="text-align: left">b = 4</th>
      <th style="text-align: left">b = 3</th>
      <th style="text-align: left">b = 2</th>
      <th>b = 1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">0x99</td>
      <td style="text-align: left">c=6543210x, C=7</td>
      <td style="text-align: left">c=54321076, C=6</td>
      <td style="text-align: left">c=43210754, C=5</td>
      <td>c=32107532, C=4</td>
    </tr>
    <tr>
      <td style="text-align: left">0x9b</td>
      <td style="text-align: left">a=65432107, C=7</td>
      <td style="text-align: left">a=43210776, C=5</td>
      <td style="text-align: left">a=21077665, C=3</td>
      <td>a=07766554, C=1</td>
    </tr>
    <tr>
      <td style="text-align: left">0x9c</td>
      <td style="text-align: left">c=76543210</td>
      <td style="text-align: left">c=65432107</td>
      <td style="text-align: left">c=54321075</td>
      <td>c=43210753</td>
    </tr>
    <tr>
      <td style="text-align: left">0x9d</td>
      <td style="text-align: left">c=65432107, C=7</td>
      <td style="text-align: left">c=54321075, C=6</td>
      <td style="text-align: left">c=43210753, C=5</td>
      <td>c=32107531, C=4</td>
    </tr>
    <tr>
      <td style="text-align: left">0x9f</td>
      <td style="text-align: left">a=54321077, C=6</td>
      <td style="text-align: left">a=32107766, C=4</td>
      <td style="text-align: left">a=10776655, C=2</td>
      <td>a=77665544, C=0</td>
    </tr>
  </tbody>
</table>

<p>Note how the carry is used in a very clever way to exchange bits between the <code class="language-plaintext highlighter-rouge">c</code> and the <code class="language-plaintext highlighter-rouge">a</code> register.
Functionally similar C-Code may look like this:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">DecompressAndCopy</span><span class="p">(</span><span class="n">u8</span> <span class="n">data</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">addr</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">u8</span> <span class="n">mask0</span> <span class="o">=</span> <span class="mb">0b00000001</span><span class="p">;</span>
  <span class="n">u8</span> <span class="n">mask1</span> <span class="o">=</span> <span class="mb">0b00000011</span><span class="p">;</span>
  <span class="n">u8</span> <span class="n">res</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">res</span> <span class="o">|=</span> <span class="p">(</span><span class="n">data</span> <span class="o">&amp;</span> <span class="n">mask0</span><span class="p">)</span> <span class="o">?</span> <span class="n">mask1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">mask0</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">mask1</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="o">*</span><span class="n">addr</span> <span class="o">=</span> <span class="n">res</span><span class="p">;</span>
  <span class="o">*</span><span class="p">(</span><span class="n">addr</span><span class="o">+</span><span class="mi">2</span><span class="p">)</span> <span class="o">=</span> <span class="n">res</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
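<p>The C-Code above is functionally equivalent,
yet barely resembles the original assembly, as C offers no direct access to the carry flag.
Note that it expands the four low bits of <code class="language-plaintext highlighter-rouge">data</code>, so the high nibble has to be shifted down by four before it is passed in.</p>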
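<p>If you want to stay closer to the assembly, the carry flag can be modelled explicitly.
The sketch below replays <code class="language-plaintext highlighter-rouge">decomp_loop</code> instruction by instruction; the <code class="language-plaintext highlighter-rouge">Regs</code> struct and the function names are assumptions made for this example and not part of any real emulator:</p>

```c
#include <stdint.h>

/* Minimal model of the registers used by decomp_loop. */
typedef struct {
  uint8_t a;
  uint8_t c;
  uint8_t carry; /* carry flag, 0 or 1 */
} Regs;

/* Behaviour of "rl c" and "rla": rotate left through the carry flag. */
static uint8_t rotate_left_through_carry(uint8_t value, uint8_t *carry) {
  uint8_t out = (uint8_t)((value << 1) | *carry);
  *carry = value >> 7;
  return out;
}

/* Runs the four iterations of decomp_loop on the given registers. */
void decomp_loop(Regs *r) {
  for (int b = 4; b > 0; --b) {
    uint8_t saved_c = r->c;                             /* push bc */
    r->c = rotate_left_through_carry(r->c, &r->carry);  /* rl c    */
    r->a = rotate_left_through_carry(r->a, &r->carry);  /* rla     */
    r->c = saved_c;                                     /* pop bc  */
    r->c = rotate_left_through_carry(r->c, &r->carry);  /* rl c    */
    r->a = rotate_left_through_carry(r->a, &r->carry);  /* rla     */
  }
}
```

<p>Loading both <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">c</code> with the input byte and calling <code class="language-plaintext highlighter-rouge">decomp_loop</code> twice reproduces the two entry points: the first call leaves the doubled high nibble in <code class="language-plaintext highlighter-rouge">a</code>, the second call the doubled low nibble.</p>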

<h3 id="27-registered-trademark">2.7 Registered Trademark</h3>
<p>In contrast to the Nintendo logo, the registered trademark logo doesn’t need any decompression.
Furthermore, it’s fetched from the boot ROM, not from the cartridge!
Hence, it’s simply loaded into the memory as follows:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x034   ld   de, $00d8   // de = boot rom data after logo
0x037   ld  b, $08       // b = length of data
reg_trade:
0x039   ld  a, [de]
0x03a   inc de
0x03b   ld  [hl+], a     // hl points to VRAM
0x03c   inc hl
0x03d   dec b
0x03e   jr  nz, @-$07    // 8 iterations
</code></pre></div></div>
<p>C-Code:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u8</span> <span class="o">*</span><span class="n">vram</span> <span class="o">=</span> <span class="mh">0x8190</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="n">logo</span> <span class="o">=</span> <span class="mh">0xd8</span><span class="p">;</span> <span class="n">logo</span> <span class="o">&lt;</span> <span class="mh">0xe0</span><span class="p">;</span> <span class="o">++</span><span class="n">logo</span><span class="p">)</span> <span class="p">{</span>
  <span class="o">*</span><span class="n">vram</span> <span class="o">=</span> <span class="o">*</span><span class="n">logo</span><span class="p">;</span>
  <span class="n">vram</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that, similarly to the previous section, one byte is left blank after each written byte.
Usually, each pixel displayed comprises two bits spread over two bytes.
But due to our custom color mapping (only black and white), the second bit doesn’t really
carry any information and is thus left blank.
More information about how pixel data is represented will be provided in my soon-to-appear PPU post. <br />
If one rendered the tile data at this point, the following image would show up:</p>

<div style="text-align:center">
<img src="/assets/gameboy_boot/tilemap_low.jpg" alt="drawing" width="50%" style="border: 3px solid #ccc;" />
</div>
<p><br />
Most of the tile data is just empty space, but the 25 tiles used to depict the Nintendo logo and the registered trademark symbol are already
more than recognizable!</p>
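<p>The remark about two bits per pixel can be made concrete with a tiny decoder.
This is a sketch with an invented function name; it assumes the usual Game Boy tile layout in which the first byte of a pixel row holds the low bits and the second byte the high bits:</p>

```c
#include <stdint.h>

/* Color value (0..3) of pixel x (0 = leftmost) in one 8-pixel tile row.
 * `low` is the first byte of the row, `high` is the second byte. */
uint8_t tile_pixel(uint8_t low, uint8_t high, int x) {
  int bit = 7 - x;
  return (uint8_t)((((high >> bit) & 1) << 1) | ((low >> bit) & 1));
}
```

<p>Because the boot ROM leaves every second byte at 0, only the color values 0 and 1 can occur, and the palette from Subsection 2.4 turns the value 1 into black.</p>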

<h3 id="28-selecting-the-right-tiles">2.8 Selecting the Right Tiles</h3>
<p>Due to its memory limitations, the Game Boy doesn’t really have a pixel-wise buffer of the whole screen.
Instead, it uses a tile-based system in which a 32x32 map of one-byte tile indices refers to 8x8-pixel tiles.
A more in-depth explanation will be provided in my yet-to-be-written post about the PPU.
So for now this has to suffice ;) <br />
Anyway, the decompression algorithm we already saw just wrote some tiles into the tile data area.
The information about where to draw these tiles is provided by the following lines:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0x040  ld   a, $19      // select tile 25
0x042  ld   [$9910], a  // display tile 25 at (8,16)
0x045  ld   hl, $992f   // point to (9,15)
BB48:
0x048  ld   c, $0c      // c = 12

BB4a:
0x04a  dec  a
0x04b  jr   z, @BB55
0x04d  ld   [hl-], a
0x04e  dec  c
0x04f  jr   nz, @BB4a
0x051  ld   l, $0f      // point to tile (8,15)
0x053  jr   @BB48

BB55:
</code></pre></div></div>
<p>The code initializes the tile map entries (9,4-15) and (8,4-15) with the tile indices 24 down to 1 using a nested loop.
A corresponding C-Code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">25</span><span class="p">;</span>
<span class="n">u8</span> <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="mh">0x9910</span><span class="p">;</span>
<span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
<span class="n">mem</span> <span class="o">=</span> <span class="mh">0x992f</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">a</span><span class="o">--</span><span class="p">;</span>
    <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
    <span class="n">mem</span><span class="o">--</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="n">mem</span> <span class="o">=</span> <span class="mh">0x990f</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
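<p>The (row, column) coordinates in the comments follow directly from the map layout.
As a sketch, assuming the tile map at 0x9800 with 32 entries per row and a 64 KiB buffer instead of real memory (all names are made up for this example), the placement can be replayed like this:</p>

```c
#include <stdint.h>

enum { MAP_BASE = 0x9800, MAP_WIDTH = 32 };

/* Address of the tile map entry at (row, col). */
uint16_t map_addr(int row, int col) {
  return (uint16_t)(MAP_BASE + row * MAP_WIDTH + col);
}

/* Replays the boot ROM's tile selection on a 64 KiB buffer. */
void place_logo_tiles(uint8_t *mem) {
  uint8_t tile = 25;
  mem[map_addr(8, 16)] = tile;            /* registered trademark symbol */
  for (int row = 9; row >= 8; --row) {    /* lower logo row first */
    for (int col = 15; col >= 4; --col) { /* right to left, 12 entries */
      mem[map_addr(row, col)] = --tile;
    }
  }
}
```

<p>Row 9 ends up with the tiles 13 to 24 and row 8 with the tiles 1 to 12, both in columns 4 to 15.</p>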
<h3 id="29-display-init">2.9 Display Init</h3>
<p>At this point, the only thing yet to be configured is the PPU (Pixel Processing Unit).
So, we could draw anything in the tile buffer, but we would never see a pixel without a turned-on display.
The following lines take care of that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BB55:
0x055  ld   h, a        // h = 0
0x056  ld   a, $64
0x058  ld   d, a        // d = 100
0x059  ldh  [rSCY], a   // scroll_y = 100
0x05b  ld   a, $91      // 0x91 = 0b10010001
0x05d  ldh  [rLCDC], a  // [0xff40] = 0b10010001
</code></pre></div></div>
<p>Most of the configuration is done at instruction 0x5d.
This instruction writes data into a PPU configuration register resulting in the following setup:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 = turn on LCD screen
0 = window tile map 0x9800-0x9bff
0 = window display off
1 = bg and window tile data = 0x8000-0x8fff
0 = bg tile map 0x9800-0x9bff
0 = obj sprite size 8*8
0 = obj sprite display off
1 = bg and window display on
</code></pre></div></div>
<p>The Y scrolling is set up as well with a value of 100.
This is iteratively decremented to achieve the scroll-down effect of the Nintendo logo.
The C-Code is quite simple for this part:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u8</span><span class="o">*</span> <span class="n">rSCY</span> <span class="o">=</span> <span class="mh">0xff42</span><span class="p">;</span>
<span class="o">*</span><span class="n">rSCY</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span>
<span class="n">u8</span> <span class="o">*</span><span class="n">rLCDC</span> <span class="o">=</span> <span class="mh">0xff40</span><span class="p">;</span>
<span class="o">*</span><span class="n">rLCDC</span> <span class="o">=</span> <span class="mh">0x91</span><span class="p">;</span>
</code></pre></div></div>
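<p>When reverse engineering a game, it helps to decode LCDC values automatically instead of counting bits by hand.
The following sketch does that; the struct layout and the name <code class="language-plaintext highlighter-rouge">decode_lcdc</code> are assumptions for illustration, while the meaning of the bits follows the list above:</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the LCDC register (0xFF40). */
typedef struct {
  bool     lcd_on;      /* bit 7 */
  uint16_t window_map;  /* bit 6: 0x9800 or 0x9C00 */
  bool     window_on;   /* bit 5 */
  uint16_t tile_data;   /* bit 4: 0x8800 or 0x8000 */
  uint16_t bg_map;      /* bit 3: 0x9800 or 0x9C00 */
  uint8_t  obj_height;  /* bit 2: 8 or 16 pixels */
  bool     obj_on;      /* bit 1 */
  bool     bg_on;       /* bit 0 */
} Lcdc;

Lcdc decode_lcdc(uint8_t value) {
  Lcdc l;
  l.lcd_on     = (value & 0x80) != 0;
  l.window_map = (value & 0x40) ? 0x9C00 : 0x9800;
  l.window_on  = (value & 0x20) != 0;
  l.tile_data  = (value & 0x10) ? 0x8000 : 0x8800;
  l.bg_map     = (value & 0x08) ? 0x9C00 : 0x9800;
  l.obj_height = (value & 0x04) ? 16 : 8;
  l.obj_on     = (value & 0x02) != 0;
  l.bg_on      = (value & 0x01) != 0;
  return l;
}
```

<p>For the boot value 0x91, the tile data base is 0x8000, which fits the logo tiles that were written from 0x8010 onwards.</p>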

<h3 id="210-showtime">2.10 Showtime!</h3>
<p>Ok, now everything is set up and it’s time to scroll down the Nintendo logo:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// h = 0
0x05f  inc  b           // b = 1

BB60:
0x060  ld  e, $02       // e = 2; 2MC

BB62:
0x062  ld  c, $0c       // c = 12; 2MC

BB64:
0x064  ldh  a, [rLY]    // a = [0xff44] vline number; 2MC
0x066  cp   $90         // a == 144?; 1MC
0x068  jr   nz, @BB64   // 2MC/3MC

0x06a  dec  c           // 1MC
0x06b  jr   nz, @BB64   // 2MC/3MC

0x06d  dec   e          // 1MC
0x06e  jr    nz, @BB62  // 2MC/3MC

0x070  ld    c, $13
0x072  inc   h
0x073  ld    a, h
0x074  ld    e, $83
0x076  cp    $62
0x078  jr    z, @BB80

0x07a  ld    e, $c1
0x07c  cp    $64
0x07e  jr    nz, @BB86

BB80:
0x080  ld   a, e
0x081  ld   [c], a
0x082  inc  c
0x083  ld   a, $87
0x085  ld   [c], a

BB86:
0x086  ldh  a, [rSCY]
0x088  sub  b
0x089  ldh  [rSCY], a  // scroll_y -= 1
0x08b  dec  d
0x08c  jr   nz, @BB60

0x08e  dec  b
0x08f  jr   nz, @BBE0  // Jump to Nintendo Logo check, 0xe0

0x091  ld   d, $20
0x093  jr   @-$35      // BB60
</code></pre></div></div>

<p>However, before any configuration data of a running PPU is touched, the Game Boy needs to make sure that the PPU isn’t rendering at the moment.
This rather short idle period is indicated either by a v-blank interrupt
or by an LY register (residing at 0xff44) value greater than or equal to 144.
Apparently, the Game Boy engineers chose the latter option.
They implemented a busy-waiting method that constantly polls the LY register
and compares its value against 144 (see instructions 0x64-0x68).
<br />
The code isn’t obvious at first glance, so let’s take a closer look.</p>

<p>We’ll start at the inner loop beginning at <code class="language-plaintext highlighter-rouge">BB64</code>, which just waits for the LY register to return 144.
Once this happens, two nested loops, from now on called e-loop and c-loop due to their loop variables, with loop counts of 2 and 12 are started.
Note that in each iteration we’re still asking the LY register if it’s still at 144!
But how long does it keep that value? <br />
According to the Game Boy CPU Manual <a href="#6-references">[7]</a>, the LY register increases its value every 114 machine cycles (MC).
So, the Game Boy has 114 machine cycles worth of instructions to spend before the 144 turns into a 145.
These 114 machine cycles are more or less one iteration of the e-loop!
Here’s the calculation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1 c-loop iteration = 2+1+2+1+3 = 9MC
12 iterations whereby the last one is only 8 cycles: 11*9+8 = 107MC
Plus e-loop part: 107+6 = 113MC
</code></pre></div></div>
<p>Note that, depending on the result (branch taken or not taken),
the jump instructions take either 3 or 2 machine cycles, respectively.
After the first e-loop iteration, the Game Boy has to wait for a whole frame (~17 ms) until the LY
register reads 144 again. <br />
Therefore, the instructions from 0x60 to 0x6e can be summarized as: wait for two frames and finish with an idle PPU.
<br />
The next few instructions play the iconic “bling bling” sound and, most importantly, scroll down the Nintendo logo by one pixel!
This scroll effect is achieved by changing the value of the scroll-y register (SCY), which determines the vertical offset of the visible background in pixels.
Since this whole part is wrapped into a bigger loop (the d-loop), the Game Boy decreases the scroll-y register 100 times.
Taking the two-frame wait period into account, we arrive at roughly 3 seconds for the Nintendo logo scroll-down sequence.
This pretty much matches the real-world behaviour.
After the logo has reached its final position, it rests there for a short period of time. This is achieved by instructions
0x08e to 0x093. These instructions reduce the scroll decrement to 0 (dec b) and then run the whole d-loop another 32 times.<br />
In the end, the rendered result of my TLMBoy looks like this:</p>

<div style="text-align:center">
<video autoplay="" loop="" muted="" playsinline="" width="40%">
<source src="/assets/gameboy_boot/logo_scroll.webm" type="video/webm" />
</video>
</div>
<p><br /></p>
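<p>The estimate of roughly 3 seconds can be reproduced with a short calculation.
The sketch below takes the two-frames-per-step figure from the text at face value and assumes the commonly quoted DMG timing of 154 lines per frame, 114 machine cycles per line and 1,048,576 machine cycles per second; the function names are made up:</p>

```c
/* Timing constants of the original Game Boy (assumed, see lead-in). */
enum {
  MC_PER_LINE = 114,
  LINES_PER_FRAME = 154,   /* 144 visible lines + 10 v-blank lines */
  MC_PER_SECOND = 1048576  /* 4.194304 MHz / 4 */
};

/* Converts a number of frames into seconds. */
double frames_to_seconds(int frames) {
  return (double)frames * MC_PER_LINE * LINES_PER_FRAME / MC_PER_SECOND;
}

/* 100 scroll steps with a wait of two frames each. */
double scroll_seconds(void) {
  return frames_to_seconds(100 * 2);
}

/* 32 further iterations of the d-loop without scrolling. */
double rest_seconds(void) {
  return frames_to_seconds(32 * 2);
}
```

<p>Under these assumptions, the scroll-down takes about 3.3 seconds and the rest period adds a bit more than one second.</p>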

<p>As usual, here’s the C-code of the current sequence:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">h</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
  <span class="c1">// wait for 2 frames</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">e</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">e</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">e</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span> <span class="n">c</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">while</span> <span class="p">(</span><span class="n">vline</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">144</span><span class="p">)</span> <span class="p">{}</span>
    <span class="p">}</span>
  <span class="p">}</span>
  <span class="n">h</span><span class="o">++</span><span class="p">;</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">sound_f_low</span> <span class="o">=</span> <span class="mh">0xFF13</span><span class="p">;</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">sound_f_high</span> <span class="o">=</span> <span class="mh">0xFF14</span><span class="p">;</span>
  <span class="n">u8</span> <span class="n">e</span> <span class="o">=</span> <span class="mh">0x83</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">==</span> <span class="mi">98</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">goto</span> <span class="n">BB80</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="n">e</span> <span class="o">=</span> <span class="mh">0xc1</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">!=</span> <span class="mi">100</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">goto</span> <span class="n">BB86</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="n">BB80</span><span class="o">:</span>
  <span class="o">*</span><span class="n">sound_f_low</span> <span class="o">=</span> <span class="n">e</span><span class="p">;</span>     <span class="c1">// "e" is first 0x83 (a C6 note) and then 0xc1 (a C7 note).</span>
  <span class="o">*</span><span class="n">sound_f_high</span> <span class="o">=</span> <span class="mh">0x87</span><span class="p">;</span>

  <span class="n">BB86</span><span class="o">:</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff42</span><span class="p">]</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// scroll_y -= 1</span>
<span class="p">}</span>

<span class="c1">// let the logo rest a short time</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">32</span><span class="p">;</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">e</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">e</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">e</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span> <span class="n">c</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">while</span> <span class="p">(</span><span class="n">vline</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">144</span><span class="p">)</span> <span class="p">{}</span>
    <span class="p">}</span>
  <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="211-checking-the-logo">2.11 Checking the logo</h3>
<p>After the scroll sequence, the Game Boy verifies whether it was really a Nintendo logo that showed up on your screen.
If it’s not, the boot ROM simply locks up in an endless loop. <br />
As explained in <a href="#6-references">[8]</a>, this was Nintendo’s way of preventing unlicensed game developers from publishing games for the Game Boy.
After all, you cannot forbid someone to develop games for your hardware, but you can sue people for using your logo! <br />
This check is done byte by byte from instruction 0x0e0 to 0x0ef, followed by a checksum over the remaining header bytes (0x0f1 to 0x0fa).
The last instruction finally unloads the boot ROM by writing a 1 into address 0xFF50. <br /></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BBE0:
0x0e0  ld  hl, $0104  // hl = rom cartridge header logo
0x0e3  ld  de, $00a8  // de = boot rom logo

BBE6:
0x0e6  ld  a, [de]    // for loop over the cartridge header logo
0x0e7  inc de
0x0e8  cp  [hl]

BBE9:
0x0e9  jr  nz, @BBE9  // loop forever if fail

0x0eb  inc  hl
0x0ec  ld   a, l
0x0ed  cp   $34
0x0ef  jr   nz, @BBE6

0x0f1  ld   b, $19
0x0f3  ld   a, b

BBF4:
0x0f4  add  [hl]      // for loop through the rest of the header to calculate the checksum
0x0f5  inc  hl
0x0f6  dec  b
0x0f7  jr   nz, @BBF4

0x0f9  add  [hl]      //  Validate against the cartridge header checksum field

BBFA:
0x0fa  jr   nz, @BBFA // If header checksum is invalid then loop forever

0x0fc  ld   a, $01
0x0fe  ldh  [$ff00+$50], a
</code></pre></div></div>
<p>C-Code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">u8</span> <span class="o">*</span><span class="n">cartridge_logo</span> <span class="o">=</span> <span class="mh">0x104</span><span class="p">;</span>
<span class="n">u8</span> <span class="o">*</span><span class="n">boot_logo</span> <span class="o">=</span> <span class="mh">0xa8</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">48</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">cartridge_logo</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">!=</span> <span class="n">boot_logo</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{};</span>  <span class="c1">// Loop forever.</span>
  <span class="p">}</span>
<span class="p">}</span>
<span class="n">u8</span> <span class="o">*</span><span class="n">cartridge_header</span> <span class="o">=</span> <span class="mh">0x134</span><span class="p">;</span>
<span class="n">u8</span> <span class="n">sum</span> <span class="o">=</span> <span class="mh">0x19</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="mi">25</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// 26 bytes: 0x134-0x14d.</span>
  <span class="n">sum</span> <span class="o">+=</span> <span class="n">cartridge_header</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sum</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
  <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{};</span> <span class="c1">// Loop forever.</span>
<span class="p">}</span>

<span class="n">unload_boot_rom</span><span class="p">();</span>
</code></pre></div></div>
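<p>For ROM hacking, the practical consequence of this check is that any change to the header bytes 0x134-0x14c requires a new checksum byte at 0x14d, otherwise the boot ROM hangs at 0xfa.
The following is a minimal sketch of how that byte could be recomputed on a PC.
The function names and the use of a std::vector as ROM image are assumptions made for illustration and not part of the boot ROM.</p>

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the value that byte 0x14d must hold so that the boot ROM's check
// (0x19 + sum of the bytes 0x134..0x14d == 0 modulo 256) passes.
// Assumes that the ROM image is larger than 0x14d bytes.
std::uint8_t ComputeHeaderChecksum(const std::vector<std::uint8_t>& rom) {
  std::uint8_t sum = 0x19;
  for (std::size_t addr = 0x134; addr <= 0x14c; ++addr) {
    sum = static_cast<std::uint8_t>(sum + rom[addr]);
  }
  // The checksum byte is whatever brings the 8-bit sum back to zero.
  return static_cast<std::uint8_t>(0u - sum);
}

// Mirrors the check performed by the boot ROM at 0xf1-0xfa.
bool HeaderChecksumIsValid(const std::vector<std::uint8_t>& rom) {
  std::uint8_t sum = 0x19;
  for (std::size_t addr = 0x134; addr <= 0x14d; ++addr) {
    sum = static_cast<std::uint8_t>(sum + rom[addr]);
  }
  return sum == 0;
}
```

<p>With an all-zero header, ComputeHeaderChecksum returns 0xe7, since 0x19 + 0xe7 = 0x100 wraps around to zero in 8 bits.</p>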

<h2 id="3-the-whole-c-code">3. The Whole C-Code</h2>
<p>All code snippets in one code box:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// (0x95-0xa7): Decompress and copy the data to VRAM.</span>
<span class="kt">void</span> <span class="nf">DecompressAndCopy</span><span class="p">(</span><span class="n">u8</span> <span class="n">data</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">addr</span><span class="p">)</span> <span class="p">{</span>
  <span class="n">u8</span> <span class="n">mask0</span> <span class="o">=</span> <span class="mb">0b00000001</span><span class="p">;</span>
  <span class="n">u8</span> <span class="n">mask1</span> <span class="o">=</span> <span class="mb">0b00000011</span><span class="p">;</span>
  <span class="n">u8</span> <span class="n">res</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">4</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">res</span> <span class="o">|=</span> <span class="p">(</span><span class="n">data</span> <span class="o">&amp;</span> <span class="n">mask0</span><span class="p">)</span> <span class="o">?</span> <span class="n">mask1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">;</span>
    <span class="n">mask0</span> <span class="o">&lt;&lt;=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">mask1</span> <span class="o">&lt;&lt;=</span> <span class="mi">2</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="o">*</span><span class="n">addr</span> <span class="o">=</span> <span class="n">res</span><span class="p">;</span>
  <span class="o">*</span><span class="p">(</span><span class="n">addr</span><span class="o">+</span><span class="mi">2</span><span class="p">)</span> <span class="o">=</span> <span class="n">res</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="c1">// BB1 (0x07-0x0a): Setting up the VRAM.</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mh">0x9FFF</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mh">0x8000</span><span class="p">;</span> <span class="o">--</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">mem</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// BB2 (0x0c-0x1c): Setting up the sound.</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff26</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x80</span><span class="p">;</span> <span class="c1">// All sound on.</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff11</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x80</span><span class="p">;</span> <span class="c1">// Square wave: Wave duty 50%, don't use length register.</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff12</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xf3</span><span class="p">;</span> <span class="c1">// Square wave: Start at full volume (15), and then decrement every 3 envelope ticks until 0.</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff25</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xf3</span><span class="p">;</span> <span class="c1">// Sound output terminal.</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff24</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x77</span><span class="p">;</span> <span class="c1">// Master volume: SO2 and SO1 both at full volume (7), Vin input off.</span>

  <span class="c1">// BB3 (0x1d-0x24): Init the color palette.</span>
  <span class="n">mem</span><span class="p">[</span><span class="mh">0xff47</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0xfc</span><span class="p">;</span> <span class="c1">// Set up BG and window color palette.</span>

  <span class="c1">// BB4 (0x27-0x32): Load the logo.</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">vram</span> <span class="o">=</span> <span class="mh">0x8010</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="n">logo</span> <span class="o">=</span> <span class="mh">0x0104</span><span class="p">;</span> <span class="n">logo</span> <span class="o">&lt;</span> <span class="mh">0x0134</span><span class="p">;</span> <span class="o">++</span><span class="n">logo</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">u8</span> <span class="n">data</span> <span class="o">=</span> <span class="o">*</span><span class="n">logo</span><span class="p">;</span>
    <span class="n">DecompressAndCopy</span><span class="p">(</span><span class="n">data</span> <span class="o">&gt;&gt;</span> <span class="mi">4</span><span class="p">,</span> <span class="n">vram</span><span class="p">);</span> <span class="c1">// Upper nibble first.</span>
    <span class="n">vram</span> <span class="o">+=</span> <span class="mi">4</span><span class="p">;</span>
    <span class="n">DecompressAndCopy</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">vram</span><span class="p">);</span> <span class="c1">// Then the lower nibble.</span>
    <span class="n">vram</span> <span class="o">+=</span> <span class="mi">4</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// (0x34-0x3e): Load the registered trademark.</span>
  <span class="n">vram</span> <span class="o">=</span> <span class="mh">0x8190</span><span class="p">;</span> <span class="c1">// Tile 25, directly behind the 24 logo tiles.</span>
  <span class="k">for</span> <span class="p">(</span><span class="n">u8</span> <span class="o">*</span><span class="n">logo</span> <span class="o">=</span> <span class="mh">0xd8</span><span class="p">;</span> <span class="n">logo</span> <span class="o">&lt;</span> <span class="mh">0xe0</span><span class="p">;</span> <span class="o">++</span><span class="n">logo</span><span class="p">)</span> <span class="p">{</span>
    <span class="o">*</span><span class="n">vram</span> <span class="o">=</span> <span class="o">*</span><span class="n">logo</span><span class="p">;</span>
    <span class="n">vram</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// (0x40-0x53): Selecting the right tiles.</span>
  <span class="kt">int</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">25</span><span class="p">;</span>
  <span class="n">mem</span> <span class="o">=</span> <span class="mh">0x9910</span><span class="p">;</span>
  <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
  <span class="n">mem</span> <span class="o">=</span> <span class="mh">0x992f</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="mi">2</span><span class="p">;</span> <span class="o">++</span><span class="n">j</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">a</span><span class="o">--</span><span class="p">;</span>
      <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span>
      <span class="n">mem</span><span class="o">--</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">mem</span> <span class="o">=</span> <span class="mh">0x990f</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// (0x55-0x5d): Display init.</span>
  <span class="n">u8</span><span class="o">*</span> <span class="n">rSCY</span> <span class="o">=</span> <span class="mh">0xff42</span><span class="p">;</span>
  <span class="o">*</span><span class="n">rSCY</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">rLCDC</span> <span class="o">=</span> <span class="mh">0xff40</span><span class="p">;</span>
  <span class="o">*</span><span class="n">rLCDC</span> <span class="o">=</span> <span class="mh">0x91</span><span class="p">;</span>

  <span class="c1">// (0x5f-0x93): Showtime.</span>
  <span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">h</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Wait for 2 frames.</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">e</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">e</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">e</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span> <span class="n">c</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">vline</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">144</span><span class="p">)</span> <span class="p">{}</span>
      <span class="p">}</span>
    <span class="p">}</span>
    <span class="n">h</span><span class="o">++</span><span class="p">;</span>
    <span class="n">u8</span> <span class="o">*</span><span class="n">sound_f_low</span><span class="p">;</span>
    <span class="n">u8</span> <span class="o">*</span><span class="n">sound_f_high</span><span class="p">;</span>
    <span class="n">sound_f_low</span> <span class="o">=</span> <span class="mh">0xFF13</span><span class="p">;</span>
    <span class="n">sound_f_high</span> <span class="o">=</span> <span class="mh">0xFF14</span><span class="p">;</span>
    <span class="n">u8</span> <span class="n">e</span> <span class="o">=</span> <span class="mh">0x83</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">==</span> <span class="mi">98</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">goto</span> <span class="n">BB80</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">e</span> <span class="o">=</span> <span class="mh">0xc1</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">h</span> <span class="o">!=</span> <span class="mi">100</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">goto</span> <span class="n">BB86</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">BB80</span><span class="o">:</span>
    <span class="o">*</span><span class="n">sound_f_low</span> <span class="o">=</span> <span class="n">e</span><span class="p">;</span>
    <span class="o">*</span><span class="n">sound_f_high</span> <span class="o">=</span> <span class="mh">0x87</span><span class="p">;</span>
    <span class="n">BB86</span><span class="o">:</span>
    <span class="o">*</span><span class="n">rSCY</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// Let the logo rest a short time.</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">d</span> <span class="o">=</span> <span class="mi">32</span><span class="p">;</span> <span class="n">d</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">d</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">e</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="n">e</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">e</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span> <span class="n">c</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">;</span> <span class="o">--</span><span class="n">c</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">while</span> <span class="p">(</span><span class="n">vline</span><span class="p">()</span> <span class="o">!=</span> <span class="mi">144</span><span class="p">)</span> <span class="p">{}</span>
      <span class="p">}</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="c1">// (0xe0-0xfe) Checking the logo.</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">cartridge_logo</span> <span class="o">=</span> <span class="mh">0x104</span><span class="p">;</span>
  <span class="n">u8</span> <span class="o">*</span><span class="n">boot_logo</span> <span class="o">=</span> <span class="mh">0xa8</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="mi">48</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">cartridge_logo</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">!=</span> <span class="n">boot_logo</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="p">{</span>
      <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{};</span>  <span class="c1">// Loop forever.</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="n">u8</span> <span class="o">*</span><span class="n">cartridge_header</span> <span class="o">=</span> <span class="mh">0x134</span><span class="p">;</span>
  <span class="n">u8</span> <span class="n">sum</span> <span class="o">=</span> <span class="mh">0x19</span><span class="p">;</span>
  <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;=</span> <span class="mi">25</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>  <span class="c1">// 26 bytes: 0x134-0x14d.</span>
    <span class="n">sum</span> <span class="o">+=</span> <span class="n">cartridge_header</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
  <span class="p">}</span>

  <span class="k">if</span> <span class="p">(</span><span class="n">sum</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{};</span> <span class="c1">// Loop forever.</span>
  <span class="p">}</span>

  <span class="n">unload_boot_rom</span><span class="p">();</span>

  <span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
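<p>Since the listing above writes to raw hardware addresses, it only makes sense on the Game Boy itself.
To play with the decompression step on a PC, the bit doubling can be isolated as in the following sketch.
DoubleBits and ExpandNibble are hypothetical helper names; they model what DecompressAndCopy does for one nibble and return the four bytes instead of writing them to VRAM.</p>

```cpp
#include <array>
#include <cstdint>

// Doubles each of the four low bits of `nibble` horizontally:
// bit n of the input becomes bits 2n and 2n+1 of the output.
// The upper four bits of the input are ignored.
std::uint8_t DoubleBits(std::uint8_t nibble) {
  std::uint8_t result = 0;
  for (int bit = 0; bit < 4; ++bit) {
    if (nibble & (1u << bit)) {
      result |= static_cast<std::uint8_t>(0b11u << (2 * bit));
    }
  }
  return result;
}

// Models the two writes of DecompressAndCopy: the doubled row is stored
// twice, two bytes apart. The bytes in between are modeled as zero here,
// because the VRAM was cleared at the very beginning.
std::array<std::uint8_t, 4> ExpandNibble(std::uint8_t nibble) {
  const std::uint8_t row = DoubleBits(nibble);
  return {row, 0, row, 0};
}
```

<p>For example, the nibble 0b0101 becomes 0b00110011, so each logo pixel ends up two pixels wide.
Storing the same row twice takes care of the doubled height.</p>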
<h2 id="4-trivia">4. Trivia</h2>
<p>Despite being a fascinating and well-designed program,
the boot ROM actually leaves some room for circumventing the logo check.
Since the logo is loaded twice from the cartridge (one time for the VRAM, a second time for the check),
providing the right data at the right time lets you boot up the Game Boy without infringing any copyrights.
This is achieved by first providing a custom logo for the scroll-up part, and then providing a Nintendo logo for the logo check.
Of course, you need some custom logic in your cartridge to detect what kind of data is currently requested.
Nevertheless, some companies used this exploit to sell some unlicensed games (see <a href="#6-references">[9]</a>).</p>
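<p>The idea can be modeled in a few lines.
The sketch below is a toy model and not how any real cartridge is known to work: it assumes that the cartridge logic can observe reads of the logo area (0x0104-0x0133) and that a read of 0x0104 marks the start of a new pass.
The class name and the pass counter are hypothetical.</p>

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::uint16_t kLogoStart = 0x0104;
constexpr std::uint16_t kLogoEnd = 0x0134;  // Exclusive.
constexpr std::size_t kLogoSize = kLogoEnd - kLogoStart;  // 48 bytes.

using Logo = std::array<std::uint8_t, kLogoSize>;

// Toy model of a cartridge that swaps the logo between the two passes.
class LogoSwappingCartridge {
 public:
  LogoSwappingCartridge(const Logo& custom_logo, const Logo& official_logo)
      : custom_logo_(custom_logo), official_logo_(official_logo) {}

  // Handles a read in the logo area. `addr` must be in [kLogoStart, kLogoEnd).
  std::uint8_t ReadLogoByte(std::uint16_t addr) {
    // Assumption: a read of the first logo byte starts a new pass.
    if (addr == kLogoStart) {
      ++pass_;
    }
    const std::size_t offset = addr - kLogoStart;
    // Pass 1 feeds the VRAM copy, every later pass feeds the logo check.
    return (pass_ <= 1) ? custom_logo_[offset] : official_logo_[offset];
  }

 private:
  Logo custom_logo_;
  Logo official_logo_;
  int pass_ = 0;
};
```

<p>The first sweep over the logo area returns the custom logo (which ends up in VRAM), while every later sweep returns the official logo (which passes the check).</p>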

<h2 id="5-conclusion">5. Conclusion</h2>
<p>I hope you enjoyed this “little” post about the Game Boy’s boot process.
Even though the boot ROM is just a 256-byte program (a significant part of which is just logo data),
it somehow suffices for a blog post of more than 3000 words.
I guess this shows how much you can achieve with a little assembly if you know how to do your job well.
The decompress-and-copy routine is an especially good example of this.
I doubt that any compiler could attain the same code density.</p>

<p>If there’s any feedback, don’t hesitate to <a href="/about">contact me</a> :)</p>

<h2 id="6-references">6. References</h2>
<p><a href="https://gbdev.gg8.se/wiki/articles/Gameboy_Bootstrap_ROM#Contents_of_the_ROM">[1]</a> Gameboy Development Wiki <br />
<a href="http://www.neviksti.com/DMG/">[2]</a> neviksti’s website <br />
<a href="https://patents.google.com/patent/US5134391">[3]</a> Game Boy patent<br />
<a href="https://gist.github.com/knightsc/ab5ebda52045b87fa1772f5824717ddf">[4]</a> Commented boot ROM<br />
<a href="https://realboyemulator.wordpress.com/2013/01/03/a-look-at-the-game-boy-bootstrap-let-the-fun-begin/">[5]</a> Boot ROM tutorial 1 (detailed) <br />
<a href="https://knight.sc/reverse%20engineering/2018/11/19/game-boy-boot-sequence.html">[6]</a> Boot ROM tutorial 2 <br />
<a href="http://marc.rawer.de/Gameboy/Docs/GBCPUman.pdf">[7]</a> Game Boy CPU manual <br />
<a href="https://catskull.net/gameboy-boot-screen-logo.html">[8]</a> History of boot ROM and logo generator <br />
<a href="http://fuji.12bit.club/?post=87">[9]</a> Custom boot logos <br /></p>]]></content><author><name></name></author><category term="TLMBoy" /><summary type="html"><![CDATA[]]></summary></entry></feed>