Jekyll2021-11-18T23:16:16+00:00http://yizhang82.dev/feed.xmlyizhang82’s blogLanguages, Runtime, Data structures, Databases, and everything in between
Yi Zhangmail@yizhang82.meBloaty: A super handy linux binary analysis2021-02-10T00:00:00+00:002021-02-10T00:00:00+00:00http://yizhang82.dev/bloaty-for-binary-analysis<p><a href="https://github.com/google/bloaty">bloaty</a> is a great tool from Google for binary size analysis. We were recently wondering why our binaries had become so large in production, and bloaty turned out to be exactly the right tool for the job.</p>
<p>For example, if you run it against a release build of bloaty itself, just for fun:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bloaty -d sections ./bloaty
FILE SIZE VM SIZE
-------------- --------------
35.8% 16.2Mi 0.0% 0 .debug_info
25.3% 11.4Mi 0.0% 0 .debug_loc
11.6% 5.26Mi 0.0% 0 .debug_str
6.5% 2.93Mi 0.0% 0 .debug_ranges
6.3% 2.83Mi 42.5% 2.83Mi .rodata
5.7% 2.60Mi 0.0% 0 .debug_line
4.4% 2.00Mi 29.9% 2.00Mi .text
0.0% 0 15.1% 1.01Mi .bss
1.3% 585Ki 0.0% 0 .strtab
1.0% 441Ki 6.5% 441Ki .data
0.7% 316Ki 0.0% 0 .debug_abbrev
0.6% 279Ki 4.1% 279Ki .eh_frame
0.5% 235Ki 0.0% 0 .symtab
0.1% 50.3Ki 0.7% 50.3Ki .eh_frame_hdr
0.1% 46.9Ki 0.7% 46.8Ki .gcc_except_table
0.1% 38.3Ki 0.0% 0 .debug_aranges
0.0% 14.2Ki 0.1% 7.80Ki [24 Others]
0.0% 7.78Ki 0.1% 7.72Ki .dynstr
0.0% 6.20Ki 0.1% 6.14Ki .dynsym
0.0% 4.89Ki 0.1% 4.83Ki .rela.plt
0.0% 3.30Ki 0.0% 3.23Ki .plt
100.0% 45.2Mi 100.0% 6.66Mi TOTAL
</code></pre></div></div>
<p>You can easily tell that most of the size is actually debug information - 79.2% (35.8+25.3+11.6+6.5)! This is a pretty common pattern for C++ binaries. If size becomes an issue, the debug symbols can be offloaded into a separate symbol package and installed on demand for coredump analysis and debugging.</p>
<p>Another interesting analysis you can do is to look at how much each file is contributing to your different sections (text, string, etc). Again, using bloaty itself as an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bloaty -d sections,compileunits ./bloaty
...
4.4% 2.00Mi 29.9% 2.00Mi .text
33.7% 688Ki 33.7% 688Ki [117 Others]
9.4% 193Ki 9.4% 193Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/descriptor.cc
6.2% 125Ki 6.2% 125Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/descriptor.pb.cc
5.6% 115Ki 5.6% 115Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/AArch64/AArch64InstPrinter.c
4.6% 94.6Ki 4.6% 94.6Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/Sparc/SparcInstPrinter.c
4.6% 93.3Ki 4.6% 93.3Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/ARM/ARMDisassembler.c
4.1% 83.3Ki 4.1% 83.3Ki /home/yzha/local/github/bloaty/src/bloaty.cc
3.9% 79.3Ki 3.9% 79.3Ki /home/yzha/local/github/bloaty/third_party/demumble/third_party/libcxxabi/cxa_demangle.cpp
3.8% 78.7Ki 3.8% 78.7Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/PowerPC/PPCInstPrinter.c
3.0% 62.1Ki 3.0% 62.1Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/text_format.cc
2.8% 56.9Ki 2.8% 56.9Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/generated_message_reflection.cc
2.5% 50.1Ki 2.5% 50.1Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/extension_set.cc
2.3% 46.0Ki 2.3% 46.0Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/ARM/ARMInstPrinter.c
2.1% 42.2Ki 2.1% 42.2Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/map_field.cc
2.1% 42.1Ki 2.1% 42.1Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/wire_format.cc
2.0% 40.6Ki 2.0% 40.6Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/SystemZ/SystemZDisassembler.c
1.7% 34.0Ki 1.7% 34.0Ki /home/yzha/local/github/bloaty/src/dwarf.cc
1.5% 30.9Ki 1.5% 30.9Ki /home/yzha/local/github/bloaty/src/elf.cc
1.5% 30.2Ki 1.5% 30.2Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/repeated_field.cc
1.5% 30.1Ki 1.5% 30.1Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/AArch64/AArch64Disassembler.c
1.3% 27.0Ki 1.3% 27.0Ki /home/yzha/local/github/bloaty/third_party/re2/re2/re2.cc
...
</code></pre></div></div>
<p>It looks like protobuf is a big contributor. Now we can add a source filter to see exactly how much:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bloaty -d sections,compileunits --source-filter=protobuf ./bloaty
...
100.0% 24.1Mi 100.0% 1013Ki TOTAL
Filtering enabled (source_filter); omitted file = 21.1Mi, vm = 5.67Mi of entries
</code></pre></div></div>
<p>There is a lot of output here, but you can see protobuf contributes 24.1/45.2 = 53% of the size of bloaty itself. If you want, you can also dive into the different sections to see how much each individual file contributes to them.</p>Yi Zhangmail@yizhang82.mestd::atomic vs volatile, disassembled2021-01-30T00:00:00+00:002021-01-30T00:00:00+00:00http://yizhang82.dev/memory-model<p>This came up during a code review: the code was using <code class="language-plaintext highlighter-rouge">volatile</code> to ensure that access to a pointer variable is atomic and serialized, and we were debating whether that is sufficient, in particular:</p>
<ol>
<li>Is it safer to switch to <code class="language-plaintext highlighter-rouge">std::atomic<T></code>, and if so, why?</li>
<li>Is volatile sufficiently safe for a strong memory model CPU like x86?</li>
</ol>
<p>Most of us can probably agree that <code class="language-plaintext highlighter-rouge">std::atomic<T></code> would be safer, but we need to dig a bit deeper to see why it is safer, even on x86.</p>
<!--more-->
<h2 id="what-is-the-difference">What is the difference?</h2>
<p><code class="language-plaintext highlighter-rouge">std::atomic</code> provides atomic access to variables with a choice of memory orderings for store/load, as well as a bunch of multi-threading primitives. The default load and store provide sequentially consistent ordering guarantees.</p>
<p><code class="language-plaintext highlighter-rouge">volatile</code> only prevents compiler optimizations (it may do more depending on the compiler), so a read/write cannot be optimized away in case another thread might modify the variable. But it provides no guarantees at the hardware level, and no memory barrier is guaranteed. Some compilers (such as Visual C++) may insert barriers for you, but that isn’t portable - gcc, for example, gives you no barriers at all.</p>
<h2 id="is-stdatomic-still-required-if-you-have-volatile">Is std::atomic still required if you have volatile?</h2>
<p>To answer this question we need to understand the concept of a memory model. If all memory accesses were sequential in nature and happened exactly as written in code, we wouldn’t be having this discussion. However, in practice, reordering can happen at two levels:</p>
<ul>
<li>compiler - the compiler can reorder / delay accesses or cache variables in registers</li>
<li>hardware - the CPU can reorder reads/writes as long as the single-threaded result <em>appears</em> the same</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">volatile</code> only prevents the compiler-level reordering; the CPU might still reorder operations and/or buffer the reads/writes, so the end result is hardware dependent.</p>
<p>A memory model is how hardware models memory access - what kind of ordering and visibility guarantees it provides. CPUs typically have either a strong memory model (x86, etc.) or a weak memory model (ARM, etc.). <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">This blog</a> has one of the best descriptions of weak vs strong memory models. x86 falls into the strong memory model category, which means <em>usually</em> every load implies <strong>acquire</strong> semantics and every store implies <strong>release</strong> semantics, but there is no <code class="language-plaintext highlighter-rouge">#StoreLoad</code> ordering guarantee, as <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">observed in this example</a>. To better understand acquire/release semantics, you can refer to <a href="https://preshing.com/20120913/acquire-and-release-semantics/">this post</a>.</p>
<p>So if you want your code to be correct and portable, even on x86, the short answer is that it’s best not to take any chances: use <code class="language-plaintext highlighter-rouge">std::atomic</code>. It’s better to be correct than <em>fast and wrong</em>.</p>
<h2 id="stdatomic-under-the-hood-for-x86">std::atomic under the hood for x86</h2>
<p>But you might wonder - what does <code class="language-plaintext highlighter-rouge">std::atomic<T></code> do for x86 anyway? What is the magic?</p>
<p>It’d be easier to look into this by writing code using <code class="language-plaintext highlighter-rouge">std::atomic<T></code> and looking at the disassembly code.</p>
<p>Suppose we have following code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <atomic>
#include <stdio.h>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">x</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="n">x</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And let’s compile it with optimization and dump out the disassembly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>g++ atomic.cc --std=c++11 -O3
objdump --all -d ./a.out > a
</code></pre></div></div>
<p>And the output of main looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000401040 <main>:
401040: 48 83 ec 08 sub $0x8,%rsp
401044: b8 02 00 00 00 mov $0x2,%eax
401049: 87 05 d9 2f 00 00 xchg %eax,0x2fd9(%rip) # 404028 <x>
40104f: bf 10 20 40 00 mov $0x402010,%edi
401054: 31 c0 xor %eax,%eax
401056: c7 05 c8 2f 00 00 03 movl $0x3,0x2fc8(%rip) # 404028 <x>
40105d: 00 00 00
401060: 8b 35 c2 2f 00 00 mov 0x2fc2(%rip),%esi # 404028 <x>
401066: e8 c5 ff ff ff callq 401030 <printf@plt>
40106b: 8b 35 b7 2f 00 00 mov 0x2fb7(%rip),%esi # 404028 <x>
401071: bf 10 20 40 00 mov $0x402010,%edi
401076: 31 c0 xor %eax,%eax
401078: e8 b3 ff ff ff callq 401030 <printf@plt>
40107d: 31 c0 xor %eax,%eax
40107f: 48 83 c4 08 add $0x8,%rsp
401083: c3 retq
</code></pre></div></div>
<p>For the first <code class="language-plaintext highlighter-rouge">store(2, std::memory_order_seq_cst)</code> (the default) in x86, gcc made it a full barrier using xchg instruction which has a <a href="https://stackoverflow.com/questions/9027590/do-we-need-mfence-when-using-xchg">implicit lock prefix</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401049: 87 05 d9 2f 00 00 xchg %eax,0x2fd9(%rip) # 404028 <x>
</code></pre></div></div>
<p>Here the source is <code class="language-plaintext highlighter-rouge">%eax</code> = 2, the target of the move is address <code class="language-plaintext highlighter-rouge">rip</code> (=next instruction 0x40104f) + 0x2fd9 offset = 0x404028, which is the location of the global variable <code class="language-plaintext highlighter-rouge">x</code>.</p>
<p>If you are wondering about the behavior of <code class="language-plaintext highlighter-rouge">std::atomic<T>::operator =</code> - it is the equivalent of <code class="language-plaintext highlighter-rouge">store(std::memory_order_seq_cst)</code>.</p>
<blockquote>
<p>In some compilers you may get <code class="language-plaintext highlighter-rouge">mfence</code> which is <em>the</em> full barrier instruction in x86 CPU, so the end result is the same.</p>
</blockquote>
<p>Now on to the second <code class="language-plaintext highlighter-rouge">store(3, std::memory_order_release)</code>. Recall that under x86 every store has release semantics, so the code is just a normal mov:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401056: c7 05 c8 2f 00 00 03 movl $0x3,0x2fc8(%rip) # 404028 <x>
</code></pre></div></div>
<p>Now let’s look at reads.</p>
<p>For the first <code class="language-plaintext highlighter-rouge">load(std::memory_order_seq_cst)</code> (the default): given that under sequential consistency a write already publishes its result to all cores with a full memory barrier, there is nothing extra to do on the load side. It is just a regular read - reading the memory location into <code class="language-plaintext highlighter-rouge">esi</code>, which is the 2nd argument to printf per the <a href="https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/x86-64-psABI-1.0.pdf">linux SystemV x64 ABI</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401060: 8b 35 c2 2f 00 00 mov 0x2fc2(%rip),%esi # 404028 <x>
</code></pre></div></div>
<p>For the 2nd <code class="language-plaintext highlighter-rouge">load(std::memory_order_acquire)</code>: again, recall that in x86 every load implicitly has acquire semantics, so it is also just a regular read:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 40106b: 8b 35 b7 2f 00 00 mov 0x2fb7(%rip),%esi # 404028 <x>
</code></pre></div></div>
<h2 id="what-if-this-is-volatile">What if this is volatile?</h2>
<p>If we replace the atomic with a volatile:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <atomic>
#include <stdio.h>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="k">volatile</span> <span class="kt">int</span> <span class="nf">x</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The result code looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000401040 <main>:
401040: 48 83 ec 08 sub $0x8,%rsp
401044: bf 10 20 40 00 mov $0x402010,%edi
401049: 31 c0 xor %eax,%eax
40104b: c7 05 d3 2f 00 00 02 movl $0x2,0x2fd3(%rip) # 404028 <x>
401052: 00 00 00
401055: c7 05 c9 2f 00 00 03 movl $0x3,0x2fc9(%rip) # 404028 <x>
40105c: 00 00 00
40105f: 8b 35 c3 2f 00 00 mov 0x2fc3(%rip),%esi # 404028 <x>
401065: e8 c6 ff ff ff callq 401030 <printf@plt>
40106a: 31 c0 xor %eax,%eax
40106c: 48 83 c4 08 add $0x8,%rsp
401070: c3 retq
</code></pre></div></div>
<p>You can see the <code class="language-plaintext highlighter-rouge">xchg</code> becomes a simple <code class="language-plaintext highlighter-rouge">movl</code> as volatile doesn’t guarantee any ordering - it only prevents compiler optimization. What optimization, you might ask? Let’s see what happens when we remove the <code class="language-plaintext highlighter-rouge">volatile</code>.</p>
<h2 id="taking-out-the-volatile">Taking out the volatile</h2>
<p>Now let’s just take out the volatile keyword, and see what we would get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000401040 <main>:
401040: 48 83 ec 08 sub $0x8,%rsp
401044: be 03 00 00 00 mov $0x3,%esi
401049: bf 10 20 40 00 mov $0x402010,%edi
40104e: 31 c0 xor %eax,%eax
401050: c7 05 ce 2f 00 00 03 movl $0x3,0x2fce(%rip) # 404028 <x>
401057: 00 00 00
40105a: e8 d1 ff ff ff callq 401030 <printf@plt>
40105f: 31 c0 xor %eax,%eax
401061: 48 83 c4 08 add $0x8,%rsp
401065: c3 retq
</code></pre></div></div>
<p>You might have already noticed two significant differences:</p>
<ul>
<li>The assignment <code class="language-plaintext highlighter-rouge">x=2</code> is completely gone - the compiler knows there are no side effects to the <code class="language-plaintext highlighter-rouge">x=2</code> assignment, so it is free to optimize it away</li>
<li>The read is completely gone; instead the compiler assigns <code class="language-plaintext highlighter-rouge">%esi = 3</code> for printf from the get-go:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401044: be 03 00 00 00 mov $0x3,%esi
</code></pre></div></div>
<p>Again, the compiler is free to optimize away the load because nothing else is going to change <code class="language-plaintext highlighter-rouge">x</code> in between, so it can simply replace <code class="language-plaintext highlighter-rouge">x</code> with 3 in the printf call.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Multi-threading, memory models, and barriers are complicated topics, but hopefully this gives you a good starting point. Even a seemingly simple question like "what is the difference between <code class="language-plaintext highlighter-rouge">volatile</code> and <code class="language-plaintext highlighter-rouge">atomic</code>" can be quite confusing, and the fact that different compilers do different things for volatile makes it more confusing still (VC++, for example, offers a stronger guarantee for volatile, making it a full barrier). If you are still hungry for more, the <a href="https://www.kernel.org/doc/Documentation/memory-barriers.txt">Linux Kernel Memory Barrier Doc</a> has great details that every programmer who does lock-free multi-threaded programming, or wants to understand the details, should probably read. At the end of the day, a good understanding of compilers, assembly code, and computer/CPU architecture goes a long way for systems programmers.</p>Yi Zhangmail@yizhang82.mePaper Reading: In Search of an Understandable Consensus Algorithm (Extended Version)2021-01-16T00:00:00+00:002021-01-16T00:00:00+00:00http://yizhang82.dev/paper-raft<p><a href="https://raft.github.io/raft.pdf">This paper</a> is <em>the</em> paper to read about the <em>Raft consensus algorithm</em>, and a good way to build intuition for consensus algorithms in general.
The “consensus” about consensus algorithms is that they are hard to understand, build, and test, so not surprisingly an understandable consensus algorithm has a lot of value for system builders. I think Raft is designed for today’s mainstream single-leader, multi-follower, log-replicated state machine model, so it is a great starting point for building a practical distributed system. I’ve read about Raft before, but this is the first time I went through the paper in full. I must admit I find Paxos unintuitive and hard to follow, and I might give Paxos/Multi-Paxos a go some other time. Meanwhile, Raft is something I can get behind and feel comfortable with - and that is saying something.</p>
<!--more-->
<h2 id="overview">Overview</h2>
<p>Paxos is quite difficult to understand and requires complex changes to support practical systems. Raft is designed to be significantly easier to understand than Paxos, similar in spirit to Viewstamped Replication, but with some novel features:</p>
<ul>
<li>Strong leader with single direction of flow</li>
<li>Leader election with randomized timers</li>
<li>Membership changes with <em>joint consensus</em></li>
</ul>
<p>Consensus algorithms typically operate on a collection of state machines computing identical copies of the same state. They are usually implemented with a replicated log, where each state machine executes the commands from its log in order. State machines are deterministic in nature, so they all produce exactly the same state.</p>
<p>Paxos has become almost synonymous with consensus (at the time of writing). Paxos first defines a protocol capable of reaching agreement on a single decision, referred to as <em>Single-Decree Paxos</em>, and then combines multiple instances to facilitate a series of decisions. Paxos ensures both safety and liveness, but it has two main drawbacks:</p>
<ul>
<li>Exceptionally difficult to understand. From the paper:
<blockquote>
<p>In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year</p>
</blockquote>
</li>
<li>Not a good foundation for building practical implementations, mainly because multi-Paxos is not sufficiently specified; as a result, practical systems bear little resemblance to Paxos. One comment from a Chubby implementer is typical:
<blockquote>
<p>There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol.</p>
</blockquote>
</li>
</ul>
<p>For these reasons, the authors designed an alternative consensus algorithm - and that is Raft. Raft is designed for understandability:</p>
<ul>
<li>Decomposing the problem into pieces that can be understood and explained independently, such as leader election, log replication, safety, and membership changes.</li>
<li>Simplifying the problem space by placing constraints and reducing states, such as disallowing holes in logs.</li>
</ul>
<h2 id="raft-consensus-algorithm">Raft Consensus Algorithm</h2>
<p>Raft implements consensus by first electing a leader who is responsible for managing the replicated log. The consensus problem can therefore be broken down into 3 sub-problems:</p>
<ul>
<li>Leader election - leader must be elected</li>
<li>Log replication - leader replicates logs across cluster</li>
<li>Safety
<ul>
<li><strong>Election safety</strong>: at most one leader can be elected in a
given term</li>
<li><strong>Leader Append-Only</strong>: a leader never overwrites or deletes entries in its log; it only appends new entries</li>
<li><strong>Log Matching</strong>: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.</li>
<li><strong>Leader Completeness</strong>: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.</li>
<li><strong>State Machine Safety</strong>: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.</li>
</ul>
</li>
</ul>
<h3 id="the-basics">The basics</h3>
<p>Raft server is in one of 3 states:</p>
<ul>
<li><strong>Leader</strong> - accept client requests</li>
<li><strong>Follower</strong> - accept requests from leaders</li>
<li><strong>Candidate</strong> - used to elect new leader</li>
</ul>
<p>Raft divides time into terms marked with monotonically increasing integers. Each term begins with an election, where one or more candidates attempt to become leader.</p>
<p>Following diagram shows possible state transitions:</p>
<p><img src="imgs/paper-raft-3.png" alt="State transitions" /></p>
<p>If a candidate wins, it becomes leader for the entire term. Otherwise, in the case of a split vote, the term ends with no leader. There is at most one leader in a given term. The term serves as a <em>logical clock</em> in Raft - each server maintains a current term that is monotonically increasing and is exchanged whenever servers communicate; if a stale term is detected, the server updates to the larger value. Servers reject requests carrying a stale term.</p>
<p>Raft servers communicate with mainly two kinds of RPC:</p>
<ul>
<li>RequestVote RPC - initiated by candidates for leader election</li>
<li>AppendEntries RPC - initiated by leaders to replicate log entries to followers and to provide heartbeats</li>
</ul>
<p>There is also a third RPC for transferring snapshots. RPCs are issued in parallel and are retried if no response is received in time.</p>
<h3 id="leader-election">Leader election</h3>
<p>Servers start up as followers, and stay followers as long as they keep receiving valid RPCs from a leader or candidate. The leader sends periodic empty AppendEntries RPCs as heartbeats, so if a follower doesn’t receive a heartbeat within a period of time (called the <em>election timeout</em>), it starts a leader election.</p>
<p>To start an election, the follower increments its current term and transitions to the candidate state, then votes for itself and issues RequestVote RPCs in parallel to all the other servers.</p>
<p>A candidate wins the election if it receives votes from a majority of the servers. A server can hand out its vote to only one candidate in a given term, on a first-come-first-served basis. Once a candidate wins the election, it becomes leader and sends empty AppendEntries RPCs to all followers to establish authority and prevent new elections.</p>
<p>If a candidate receives an AppendEntries RPC from another server claiming to be leader, it accepts that server as leader only if the claimed leader’s term is >= its own term, and returns to the follower state. Otherwise it rejects the RPC.</p>
<p>If many followers time out and become candidates at the same time, a split vote is possible and no one wins the election. In this case a new term of election is started. However, without extra measures the voting can repeat indefinitely, or only complete by luck. This is why Raft uses randomized election timeouts (for example, 150-300ms): followers time out and become candidates at different times, so split votes are rare, and when one does happen, each candidate restarts its own election at a different time.</p>
<p>The randomized approach might seem a bit naive at first glance, but the authors have debated a few different approaches and concluded that randomized timeout is the easiest to understand and prove correct:</p>
<blockquote>
<p>From the paper:</p>
<p>Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.</p>
</blockquote>
<h3 id="log-replication">Log Replication</h3>
<p>Once a client sends a request to the leader, the leader sends AppendEntries RPCs to all followers in parallel to replicate the log entry. Once the entry is safely replicated, the leader applies it to its own state machine and returns the result of that execution to the client. The RPC is retried indefinitely until all followers have ACKed.</p>
<p>A Raft log entry consists of (term, index, operation). The leader decides when it is safe to apply a log entry to the state machines; such an entry becomes <em>committed</em>. <strong>Raft guarantees that committed entries are durable and will eventually be applied by all available state machines.</strong> A log entry is committed once it is replicated to a majority of the servers, and all preceding entries are then considered committed as well. Once a follower learns that a log entry is committed, it applies the entry to its own local state machine in log order.</p>
<p><img src="/imgs/paper-raft-1.png" alt="Logs" /></p>
<blockquote>
<p>This implies that read latency is higher in a Raft consensus system, as a follower only learns that an entry is committed later, usually via the next AppendEntries RPC (either a real user request or a heartbeat).</p>
</blockquote>
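<p>The commit rule - an entry is committed once a majority of servers stores it - can be sketched as below. <code class="language-plaintext highlighter-rouge">match_index</code> and the helper name are hypothetical, not from the paper:</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch: a leader can commit the highest index that a
// majority of the cluster stores. match_index holds, for every server
// (including the leader itself), the highest log index known replicated
// there.
uint64_t highest_majority_index(std::vector<uint64_t> match_index)
{
    // Sort descending; the entry at position n/2 is stored by a majority
    // (that position plus everything before it is more than half the servers).
    std::sort(match_index.begin(), match_index.end(), std::greater<uint64_t>());
    return match_index[match_index.size() / 2];
}
```

<p>Note that, per the safety rule discussed later, a real leader would only advance its commit index to such an entry if the entry is from its own current term.</p>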
<p>Raft maintains the <strong>Log Match Property</strong>:</p>
<ul>
<li>If two entries in different logs have the same index and term, then they store the same command.</li>
<li>If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.</li>
</ul>
<blockquote>
<p>This property makes Raft logs much easier to understand and reason about its correctness.</p>
</blockquote>
<p>After a leader crashes, follower logs may become inconsistent with the new leader’s log. The paper discusses a few scenarios that we won’t repeat here. Raft handles inconsistencies by forcing follower logs to duplicate the leader’s log - so conflicting entries in a follower’s log get overwritten. This is done by finding the latest log entry on which the follower agrees with the leader, deleting everything in the follower’s log after that point, and then sending all of the leader’s entries after it. To achieve this, the leader maintains a <em>nextIndex</em> for each follower and keeps sending AppendEntries RPCs, decrementing nextIndex on each rejection until the logs agree; at that point the follower’s log after nextIndex is deleted. This can be further optimized by having the AppendEntries RPC response return the conflicting term and the first index of that term, so that the leader can skip entire conflicting terms.</p>
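<p>A minimal sketch of the follower-side consistency check described above - all names are mine, and a real implementation tracks considerably more state:</p>

```cpp
#include <cstdint>
#include <vector>

struct log_entry { uint64_t term; int op; };

// Hypothetical sketch of the follower side of AppendEntries: the RPC is
// rejected unless the follower's log contains an entry at prev_log_index
// with prev_log_term; on success, any suffix after that point is deleted
// and the leader's entries are appended, forcing the follower's log to
// duplicate the leader's.
bool append_entries(std::vector<log_entry> &log,
                    uint64_t prev_log_index,   // 1-based; 0 = "before first entry"
                    uint64_t prev_log_term,
                    const std::vector<log_entry> &entries)
{
    if (prev_log_index > log.size())
        return false;   // hole in the log - leader will decrement nextIndex
    if (prev_log_index > 0 && log[prev_log_index - 1].term != prev_log_term)
        return false;   // conflicting entry - leader will decrement nextIndex
    log.resize(prev_log_index);                      // drop the suffix
    log.insert(log.end(), entries.begin(), entries.end());
    return true;
}
```

<p>When this returns false, the leader decrements nextIndex for that follower and retries with an earlier prev_log_index until the check passes.</p>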
<p>If a candidate / follower crashes, the leader simply retries indefinitely. Raft RPCs are idempotent, so if the server had already appended the log entry but failed to ACK, the retried request is just ignored.</p>
<h3 id="election-safety">Election Safety</h3>
<p>To prevent a stale follower overwriting committed entries, there must be further restrictions on leader election.</p>
<p>A candidate cannot win an election unless its log contains all committed entries. It sends the (term, index) of its latest log entry in the RequestVote RPC, and other servers reject the request if their own latest log entry has a larger (term, index) and is therefore more up to date.</p>
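<p>The up-to-date comparison can be sketched as a small helper (hypothetical signature, not the paper’s pseudocode): terms are compared first, and the index only breaks ties.</p>

```cpp
#include <cstdint>

// Hypothetical sketch of the check a server applies before granting its
// vote: the candidate's last entry must have a higher term, or the same
// term and at least as high an index, as the voter's own last entry.
bool candidate_log_up_to_date(uint64_t my_last_term, uint64_t my_last_index,
                              uint64_t cand_last_term, uint64_t cand_last_index)
{
    if (cand_last_term != my_last_term)
        return cand_last_term > my_last_term;   // term dominates
    return cand_last_index >= my_last_index;    // same term: longer log wins
}
```
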
<p>It’s also possible for a leader to replicate previous-term log entries to stale followers until they are stored on a majority; but counting such entries as committed before an entry from the current term commits runs the risk of having them overwritten if the leader crashes before then. So committing log entries from previous terms is deferred until an entry from the current term commits. Other consensus algorithms try to address this by “fixing” prior-term entries up to the latest term, but Raft keeps things simple by making log entries immutable and retaining their original term number.</p>
<p>Section 5.4.3 (Safety Argument) of the paper proves the correctness of this scheme. Feel free to refer to the paper for more details.</p>
<h2 id="cluster-membership-changes">Cluster Membership Changes</h2>
<p>When cluster membership changes (adding/removing servers, etc.), it is important to prevent having two leaders at the same time under the old and new configurations. This needs a two-phase approach - Raft first switches to a joint consensus configuration that combines old and new, and once the joint consensus configuration has committed, Raft transitions to the new configuration.</p>
<p>In joint consensus,</p>
<ul>
<li>Log entries are replicated to both configurations</li>
<li>Any server from either configuration may serve as leader</li>
<li>Agreement requires separate majorities from both the old and the new configuration - this means a log entry must be replicated to a majority in each configuration</li>
</ul>
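<p>The dual-majority rule above can be sketched as follows; the server-id vectors and helper names are hypothetical:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the joint-consensus agreement rule: an entry (or
// an election) only succeeds if it gathers a separate majority in BOTH the
// old and the new configuration. acks holds ids of servers that responded.
static bool majority_in(const std::vector<int> &config, const std::vector<int> &acks)
{
    std::size_t count = 0;
    for (int id : config)
        for (int a : acks)
            if (a == id) { ++count; break; }
    return count > config.size() / 2;
}

bool joint_agreement(const std::vector<int> &old_cfg,
                     const std::vector<int> &new_cfg,
                     const std::vector<int> &acks)
{
    return majority_in(old_cfg, acks) && majority_in(new_cfg, acks);
}
```

<p>Notice that a set of ACKs forming a majority of the union is not enough - each configuration must reach its own majority independently.</p>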
<p>Note the leader during joint consensus might not be part of the new cluster configuration. In that case it doesn’t count itself toward the majority but still replicates to both majorities, and steps down once the new configuration entry commits.</p>
<blockquote>
<p>I’m wondering if it’d be easier to force the new leader to be in the intersection of the old and new configurations.</p>
</blockquote>
<p>When new servers join the cluster, it might take a while for them to replicate all entries, during which new entries might not be able to commit. So they first join as non-voting members that receive replicated entries but don’t count toward majorities, until they are sufficiently caught up.</p>
<p>When servers are removed from the cluster, they stop receiving heartbeats, so they would start elections and disrupt cluster availability. To prevent this problem, servers disregard RequestVote RPCs received within a minimum election timeout of last hearing from a leader - in that case the leader is considered alive. Note this isn’t the server’s own election timeout (it would revert to candidate when that fires) but rather a minimum “safe” election timeout; every server’s election timeout is at least as large as this minimum.</p>
<h2 id="other-practical-considerations">Other practical considerations</h2>
<p>In any log-based system the log can’t grow unbounded, so Raft needs to be able to compact its logs. In theory each server can just snapshot its committed entries, but for a slow follower or a new server the leader has to send its snapshot over with an InstallSnapshot RPC. In practical terms this means discarding the follower’s state entirely, copying the leader’s entire state over either physically or logically, and deleting the logs it covers. This is not so different from compaction in incremental logging systems such as LSM trees.</p>
<p>When a client interacts with the cluster for the first time, it sends its request to a random server. If that server isn’t the leader, it rejects the request or forwards it to the correct leader. It is also possible for the leader to crash after committing a request but before ACKing it, in which case the client retries the write on a new leader that may already have the entry. The client therefore tags each request with a serial number so that the new leader can detect the duplicate and return immediately.</p>
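<p>A sketch of the serial-number deduplication, with entirely hypothetical names; a real state machine would execute the operation itself rather than receive a precomputed result:</p>

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: the state machine remembers, per client, the serial
// number of the last request it applied together with its response, so a
// retried request is answered from the cache without being applied twice.
struct session_table {
    std::unordered_map<uint64_t, std::pair<uint64_t, std::string>> last; // client -> (serial, response)

    // Returns the cached response on a duplicate; otherwise records and
    // returns the (pre-computed, for this sketch) result of applying the op.
    std::string apply(uint64_t client, uint64_t serial, const std::string &result)
    {
        auto it = last.find(client);
        if (it != last.end() && it->second.first == serial)
            return it->second.second;   // duplicate - don't re-apply
        last[client] = {serial, result};
        return result;
    }
};
```
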
<p>For read-only queries, the leader can only safely return data once it has committed its first blank AppendEntries entry of the term, so that it knows exactly which entries are committed - it is still possible for uncommitted log entries from an earlier term to become committed later (as part of catching up other followers), or for uncommitted changes to be discarded when another leader gets elected without them. It is also possible that the leader doesn’t know a new leader has been elected elsewhere, so before responding it needs to confirm its leadership by exchanging heartbeats with a majority of the cluster. Alternatively a lease-based approach can be used, but that requires bounded clock skew.</p>
<h2 id="my-closing-thoughts">My closing thoughts</h2>
<p>The Raft protocol has definitely delivered on its promise of being a practical and understandable consensus protocol - the proliferation of implementations in various languages has already proven that. And there are already many systems using Raft in production such as <a href="https://github.com/etcd-io/etcd">Kubernetes/etcd</a>, <a href="https://github.com/cockroachdb/cockroach">CockroachDB</a>, <a href="https://github.com/tikv/tikv">TiKV</a>, etc. There is even <a href="https://www.percona.com/live/18/sessions/how-to-make-mysql-work-with-raft">Raft support for MySQL from Alibaba</a>. It’d be interesting to see how Raft performs in real production systems and how well it scales in practice.</p>Yi Zhangmail@yizhang82.meThis paper is the paper to read about Raft consensus algorithm and a good way to build intuition for consensus algorithms in general. The “consensus” about consensus algorithms is that they are hard to understand / build / test, and not surprisingly having an understandable consensus algorithm has a lot of value for system builders. I think Raft is designed for today’s mainstream single leader multi-follower log-replicated state machine model so it is a great starting point for building a practical distributed system around it. I’ve read about raft before but this is the first time I went through the paper in full. I must admit I find Paxos not intuitive and hard to follow as well and I might give Paxos/Multi-Paxos a go some other time. Meanwhile Raft is something I can get behind and feel comfortable with. And that is saying something.Writing your own NES emulator Part 3 - the 6502 CPU2021-01-10T00:00:00+00:002021-01-10T00:00:00+00:00http://yizhang82.dev/nes-emu-cpu<p>It’s been a while since the <a href="/nes-emu-main-loop">last update</a> - I was mostly focusing on database technologies. 
The beginning of 2021 was a bit slow (that’s when many big companies start their annual / semi-annual review process), so I had a bit of time to write up this post about 6502 CPU emulation. All the code referenced in this post is in my simple NES emulator GitHub repo <a href="https://github.com/yizhang82/neschan">NesChan</a>. It’s fun to go back and look at my old code and the 6502 CPU wiki.</p>
<h2 id="the-6502-cpu">The 6502 CPU</h2>
<p>The NES uses the 8-bit <a href="https://en.wikipedia.org/wiki/MOS_Technology_6502">6502 CPU</a> with a 16-bit address bus, meaning it can address the memory range 0x0000~0xffff - not much, but more than enough for games back in the 80s with charming dots and sprites. It was used in a surprisingly large range of famous vintage computers/consoles like the Apple I, Apple II, Atari, Commodore 64, and of course the NES. The variant used by the NES is a stock 6502 without decimal mode support, running at 1.79MHz (the PAL version runs at 1.66MHz). It has 3 general purpose registers A/X/Y, and 3 special registers P (status) / SP (stack pointer) / PC (program counter, or instruction pointer), all of them 8-bit except PC, which is 16-bit. The NES dev wiki has a <a href="http://wiki.nesdev.com/w/index.php/CPU">great section on the 6502 CPU</a> with a lot more details, and we’ll be covering the most important aspects in the remainder of this article.</p>
<!--more-->
<p>To emulate the CPU, the main loop would look something like this:</p>
<ol>
<li>We start at a memory location by setting the current <em>program counter</em> (also known as the instruction pointer in other architectures) <strong>PC</strong> to that location</li>
<li>Check if we have reached a special end condition (end of program, <strong>BRK</strong> instruction, infinite loop, etc…); if so, terminate execution</li>
<li>Decode CPU instruction at current <strong>PC</strong></li>
<li>Set instruction pointer to next instruction</li>
<li>Fetch data as per memory access mode</li>
<li>Execute instruction with data fetched</li>
<li>Move to the next instruction by going back to 2</li>
</ol>
<p>The most interesting aspects are instruction decoding, memory access modes, and instruction execution. Let’s look at these one by one.</p>
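<p>The loop above can be sketched like this - a toy, not NesChan’s actual code, where only <code class="language-plaintext highlighter-rouge">BRK</code> is decoded so the loop can terminate:</p>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the emulator main loop: fetch the opcode at PC,
// advance PC, then decode/execute. Only BRK (0x00) is handled here as the
// end condition; a real CPU core dispatches on every opcode.
struct toy_cpu {
    uint16_t PC = 0;
    std::vector<uint8_t> mem = std::vector<uint8_t>(0x10000);  // 64K address space

    // Returns false once the end condition (BRK) is reached.
    bool step_instruction()
    {
        uint8_t op = mem[PC++];          // step 3+4: decode at PC, advance PC
        if (op == 0x00) return false;    // step 2: BRK - stop execution
        // steps 5+6: fetch operands per addressing mode, execute...
        return true;
    }

    int run(uint16_t start)
    {
        PC = start;                      // step 1: set PC to the start location
        int executed = 0;
        while (step_instruction())       // step 7: loop back
            ++executed;
        return executed;
    }
};
```
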
<h2 id="decoding-the-instructions">Decoding the instructions</h2>
<p>Assembly instructions are usually encoded with 3-character mnemonics, and they typically perform very low level hardware-related operations supported by the CPU, keeping the CPU simple and cheap. That’s why assembly instructions are considered <em>low level</em>. High-level language statements are usually compiled down to one or more CPU instructions, with the help of a compiler. This is a perfect example of layering.</p>
<p>Let’s just take a look at a few examples of what 6502 CPU can do:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">DEC</code>, <code class="language-plaintext highlighter-rouge">DEX</code>, <code class="language-plaintext highlighter-rouge">DEY</code> for decrementing memory, X register, Y register, respectively</li>
<li><code class="language-plaintext highlighter-rouge">JMP</code> for jumping to a particular address to keep executing code</li>
<li><code class="language-plaintext highlighter-rouge">LDA</code>, <code class="language-plaintext highlighter-rouge">LDX</code>, <code class="language-plaintext highlighter-rouge">LDY</code> for loading a value into the A/X/Y register respectively, with the source determined by the memory address mode</li>
<li><code class="language-plaintext highlighter-rouge">ADC</code>, <code class="language-plaintext highlighter-rouge">SBC</code> for addition / subtraction using the A register (accumulator) and specified memory location, so basically A += M and A-= M, taking carry flag into account as well</li>
</ul>
<p>If you are interested to know more, you can go to <a href="http://obelisk.me.uk/6502/reference.html">this page</a> for a list of common 6502 CPU instructions and what they do.</p>
<p>Before executing an instruction, you need to look up the bytes in memory and figure out which instruction they represent, what its arguments are, etc. This is called <em>decoding</em>. Fortunately, 6502 opcodes are always a single byte - only the arguments that follow differ by memory access mode. This makes decoding much easier - we just need a big table of all instructions and then call the right helper function based on the opcode byte!</p>
<blockquote>
<p>We’ll look at memory access modes later. For now you just need to know they indicate where the actual data comes from, while the instruction itself is the <em>operation</em>. Instructions typically support multiple memory access modes so that they can operate on data from different locations using different methods, whether a register, memory, etc.</p>
</blockquote>
<p>In order to build the table, it’s useful to visualize it by looking at the following table from <a href="http://wiki.nesdev.com/w/index.php/CPU_unofficial_opcodes">nesdev unofficial opcodes wiki</a>:</p>
<p><img src="/imgs/nes-emu-cpu-1.png" alt="img" /></p>
<p>But in order to see the patterns a bit better, let’s re-arrange it:</p>
<p><img src="/imgs/nes-emu-cpu-2.png" alt="img" /></p>
<p>You can see the ALU (green ones, that does math operations) and the RMW (blue ones, = Read Modify Write) instructions follow a very clear pattern, while the red (mostly control instructions) and gray (unofficial / undocumented instructions) are sort of all over the place.</p>
<p>To keep things simple (and make modification easier - I was still learning the instructions and didn’t want to redo everything when I misunderstood something), the current implementation goes with a switch/case approach built from macros. This could easily be updated to use a real table of helper function pointers. You might think an explicit jump table would be faster, but the reality is a bit more complicated: the compiler can easily turn the switch into a jump table that jumps directly into inlined versions of the helper functions, which ends up faster than a function-pointer table. Such inlining is more difficult (though not impossible) with function pointers. Either way, since I’m optimizing to run NES games rather than a benchmark, I didn’t care too much about performance.</p>
<p>For example, for ALU instructions we use this macro in <a href="https://github.com/yizhang82/neschan/blob/master/lib/src/nes_cpu.cpp">nes_cpu.cpp</a>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define IS_ALU_OP_CODE_(op, offset, mode) \
case nes_op_code::op##_base + offset : \
NES_TRACE4(get_op_str(#op, nes_addr_mode::nes_addr_mode_##mode)); \
op(nes_addr_mode::nes_addr_mode_##mode); \
break;
</span></code></pre></div></div>
<p>This defines a <code class="language-plaintext highlighter-rouge">case</code> statement for a variant of instruction <code class="language-plaintext highlighter-rouge">op</code>. For example, for ADC, offset 0x9 is ADC with immediate memory access mode. We’ll be calling to the <code class="language-plaintext highlighter-rouge">op</code> helper function for executing the code with the corresponding memory access mode. <code class="language-plaintext highlighter-rouge">NES_TRACE4</code> is for logging and we can ignore that for now.</p>
<p>And for each particular ALU instruction, we define 8 variants of all memory access patterns based on the table earlier:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define IS_ALU_OP_CODE(op) \
IS_ALU_OP_CODE_(op, 0x9, imm) \
IS_ALU_OP_CODE_(op, 0x5, zp) \
IS_ALU_OP_CODE_(op, 0x15, zp_ind_x) \
IS_ALU_OP_CODE_(op, 0xd, abs) \
IS_ALU_OP_CODE_(op, 0x1d, abs_x) \
IS_ALU_OP_CODE_(op, 0x19, abs_y) \
IS_ALU_OP_CODE_(op, 0x1, ind_x) \
IS_ALU_OP_CODE_(op, 0x11, ind_y)
</span></code></pre></div></div>
<p>For example, ADC + 0x9 is immediate mode, ADC + 0x5 is zero page mode, etc.</p>
<p>Then we can support a series of ALU instructions easily with these macros:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">ADC</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">AND</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">CMP</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">EOR</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">ORA</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">SBC</span><span class="p">)</span>
</code></pre></div></div>
<p>Take a simple instruction as an example, the code looks like follows:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Logical AND</span>
<span class="kt">void</span> <span class="n">nes_cpu</span><span class="o">::</span><span class="n">AND</span><span class="p">(</span><span class="n">nes_addr_mode</span> <span class="n">addr_mode</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">operand_t</span> <span class="n">op</span> <span class="o">=</span> <span class="n">decode_operand</span><span class="p">(</span><span class="n">addr_mode</span><span class="p">);</span>
<span class="kt">uint8_t</span> <span class="n">val</span> <span class="o">=</span> <span class="n">read_operand</span><span class="p">(</span><span class="n">op</span><span class="p">);</span>
<span class="n">A</span><span class="p">()</span> <span class="o">&=</span> <span class="n">val</span><span class="p">;</span>
<span class="c1">// flags </span>
<span class="n">calc_alu_flag</span><span class="p">(</span><span class="n">A</span><span class="p">());</span>
<span class="c1">// cycle count</span>
<span class="n">step_cpu</span><span class="p">(</span><span class="n">get_cpu_cycle</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">addr_mode</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">decode_operand</code> is responsible for decoding the following bytes based on the address mode, and return the access pattern in <code class="language-plaintext highlighter-rouge">operand_t</code></li>
<li>Next we read the operand using <code class="language-plaintext highlighter-rouge">op</code> into <code class="language-plaintext highlighter-rouge">val</code>. Decoding and reading are separate steps because some instructions read, write, or do both, so it is useful to split them into different helpers.</li>
<li>Once we have read the value, as per the AND instruction, we <code class="language-plaintext highlighter-rouge">AND</code> the accumulator A register with <code class="language-plaintext highlighter-rouge">val</code> and write it back. Note we have helpers that return registers (which really are just variables) by reference, so the code reads quite naturally:</li>
</ul>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">uint8_t</span> <span class="o">&</span><span class="n">A</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">A</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">X</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">X</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">Y</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">Y</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint16_t</span> <span class="o">&</span><span class="n">PC</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">PC</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">P</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">P</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">S</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">S</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>
<ul>
<li>Based on the result in A, we update the ALU zero/negative flags accordingly. Flags are typically checked at the beginning of an instruction and updated at the end, usually for math operations (carry flag) or control flow (jump if zero). For a full list of flags you can refer to <a href="http://wiki.nesdev.com/w/index.php/Status_flags">this list</a>.</li>
<li>Finally, we simulate the passing of CPU cycles (or rather, time). This is important for emulation accuracy, as many games rely on this for timing - especially to synchronize with PPU cycles! Now that’s what we call <em>real</em> programmers.</li>
</ul>
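<p>As an illustration of the flag-update step, here is a standalone sketch; NesChan’s <code class="language-plaintext highlighter-rouge">calc_alu_flag</code> is a member function operating on the CPU context, so the free-function signature here is hypothetical:</p>

```cpp
#include <cstdint>

// On the 6502, the zero flag lives in bit 1 of the status register P and
// the negative flag in bit 7 (mirroring bit 7 of the result).
const uint8_t FLAG_ZERO     = 0x02;
const uint8_t FLAG_NEGATIVE = 0x80;

// Hypothetical free-function version: takes the current P and the ALU
// result, returns P with the zero/negative flags recomputed.
uint8_t calc_alu_flag(uint8_t p, uint8_t result)
{
    p &= uint8_t(~(FLAG_ZERO | FLAG_NEGATIVE));  // clear both flags first
    if (result == 0)
        p |= FLAG_ZERO;
    if (result & 0x80)
        p |= FLAG_NEGATIVE;                      // bit 7 of the result
    return p;
}
```
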
<h2 id="memory-access-mode">Memory access mode</h2>
<p>This is one of the more complicated aspects of the 6502 CPU. Many instructions have multiple modes determining where their operands come from. This is the full list of all supported modes:</p>
<table>
<thead>
<tr>
<th>Abbr</th>
<th>Name</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Imp</td>
<td>Implicit</td>
<td>Instructions like RTS or CLC have no address operand, the destination of results are implied.</td>
</tr>
<tr>
<td>A</td>
<td>Accumulator</td>
<td>Many instructions can operate on the accumulator, e.g. LSR A. Some assemblers will treat no operand as an implicit A where applicable.</td>
</tr>
<tr>
<td>#v</td>
<td>Immediate</td>
<td>Uses the 8-bit operand itself as the value for the operation, rather than fetching a value from a memory address.</td>
</tr>
<tr>
<td>d</td>
<td>Zero page</td>
<td>Fetches the value from an 8-bit address on the zero page.</td>
</tr>
<tr>
<td>a</td>
<td>Absolute</td>
<td>Fetches the value from a 16-bit address anywhere in memory.</td>
</tr>
<tr>
<td>label</td>
<td>Relative</td>
<td>Branch instructions (e.g. BEQ, BCS) have a relative addressing mode that specifies an 8-bit signed offset relative to the current PC.</td>
</tr>
<tr>
<td>(a)</td>
<td>Indirect</td>
<td>The JMP instruction has a special indirect addressing mode that can jump to the address stored in a 16-bit pointer anywhere in memory.</td>
</tr>
</tbody>
</table>
<p>There are also more complicated memory access modes using the above:</p>
<table>
<thead>
<tr>
<th>Abbr</th>
<th>Name</th>
<th>Formula</th>
<th>Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>d,x</td>
<td>Zero page indexed</td>
<td>val = PEEK((arg + X) % 256)</td>
<td>4</td>
</tr>
<tr>
<td>d,y</td>
<td>Zero page indexed</td>
<td>val = PEEK((arg + Y) % 256)</td>
<td>4</td>
</tr>
<tr>
<td>a,x</td>
<td>Absolute indexed</td>
<td>val = PEEK(arg + X)</td>
<td>4+</td>
</tr>
<tr>
<td>a,y</td>
<td>Absolute indexed</td>
<td>val = PEEK(arg + Y)</td>
<td>4+</td>
</tr>
<tr>
<td>(d,x)</td>
<td>Indexed indirect</td>
<td>val = PEEK(PEEK((arg + X) % 256) + PEEK((arg + X + 1) % 256) * 256)</td>
<td>6</td>
</tr>
<tr>
<td>(d),y</td>
<td>Indirect indexed</td>
<td>val = PEEK(PEEK(arg) + PEEK((arg + 1) % 256) * 256 + Y)</td>
<td>5+</td>
</tr>
</tbody>
</table>
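<p>To make the zero-page wraparound in these formulas concrete, here is a small sketch of the two indirect modes; the flat <code class="language-plaintext highlighter-rouge">mem</code> array and helper names are mine, not NesChan’s:</p>

```cpp
#include <cstdint>

// Hypothetical flat 64K memory for illustration (zero-initialized).
uint8_t mem[0x10000];

uint8_t peek(uint16_t addr) { return mem[addr]; }

// 16-bit little-endian read where BOTH bytes come from the zero page -
// the address wraps within 0x00-0xFF, which is what "% 256" in the table
// expresses.
uint16_t peek16_zp(uint8_t addr)
{
    return peek(addr) | (uint16_t(peek(uint8_t(addr + 1))) << 8);
}

// (d,x) Indexed indirect: the pointer lives at (arg + X), wrapped in zero page.
uint16_t addr_ind_x(uint8_t arg, uint8_t x) { return peek16_zp(uint8_t(arg + x)); }

// (d),y Indirect indexed: the pointer lives at arg in zero page, then Y is
// added to the pointer value.
uint16_t addr_ind_y(uint8_t arg, uint8_t y) { return peek16_zp(arg) + y; }
```

<p>Note the asymmetry: in (d,x) the index is added <em>before</em> the pointer is dereferenced, while in (d),y it is added <em>after</em>.</p>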
<p>In the code I have a enum for all the supported modes:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Addressing modes of 6502</span>
<span class="c1">// http://obelisk.me.uk/6502/addressing.html</span>
<span class="c1">// http://wiki.nesdev.com/w/index.php/CPU_addressing_modes</span>
<span class="k">enum</span> <span class="n">nes_addr_mode</span>
<span class="p">{</span>
<span class="n">nes_addr_mode_imp</span><span class="p">,</span> <span class="c1">// implicit</span>
<span class="n">nes_addr_mode_acc</span><span class="p">,</span> <span class="c1">// val = A</span>
<span class="n">nes_addr_mode_imm</span><span class="p">,</span> <span class="c1">// val = arg_8</span>
<span class="n">nes_addr_mode_ind_jmp</span><span class="p">,</span> <span class="c1">// val = peek16(arg_16), with JMP bug</span>
<span class="n">nes_addr_mode_rel</span><span class="p">,</span> <span class="c1">// val = arg_8, as offset</span>
<span class="n">nes_addr_mode_abs</span><span class="p">,</span> <span class="c1">// val = PEEK(arg_16), LSB then MSB </span>
<span class="n">nes_addr_mode_abs_jmp</span><span class="p">,</span> <span class="c1">// val = arg_16, LSB then MSB, direct jump address </span>
<span class="n">nes_addr_mode_zp</span><span class="p">,</span> <span class="c1">// val = PEEK(arg_8)</span>
<span class="n">nes_addr_mode_zp_ind_x</span><span class="p">,</span> <span class="c1">// d, x val = PEEK((arg_8 + X) % $FF ), 4 cycles</span>
<span class="n">nes_addr_mode_zp_ind_y</span><span class="p">,</span> <span class="c1">// d, y val = PEEK((arg_8 + Y) % $FF), 4 cycles</span>
<span class="n">nes_addr_mode_abs_x</span><span class="p">,</span> <span class="c1">// a, x val = PEEK(arg_16 + X), 4+ cycles</span>
<span class="n">nes_addr_mode_abs_y</span><span class="p">,</span> <span class="c1">// a, y val = PEEK(arg_16 + Y), 4+ cycles</span>
<span class="n">nes_addr_mode_ind_x</span><span class="p">,</span> <span class="c1">// (d, x) val = PEEK(PEEK((arg + X) % $FF) + PEEK((arg + X + 1) % $FF) * $FF), 6 cycles</span>
<span class="n">nes_addr_mode_ind_y</span><span class="p">,</span> <span class="c1">// (d), y val = PEEK(PEEK(arg) + PEEK((arg + 1) % $FF)* $FF + Y), 5+ cycles</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Recall that in instruction implementation we call <code class="language-plaintext highlighter-rouge">decode_operand</code> and <code class="language-plaintext highlighter-rouge">read_operand</code> (there is also <code class="language-plaintext highlighter-rouge">write_operand</code>) to decode and then read the target (whether it is register, an address, etc). So all the magic for decoding memory address modes are in there.</p>
<p>For example, the following code in <code class="language-plaintext highlighter-rouge">decode_operand_addr</code> (used internally by <code class="language-plaintext highlighter-rouge">decode_operand</code>) supports indirect y mode:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">addr_mode</span> <span class="o">==</span> <span class="n">nes_addr_mode</span><span class="o">::</span><span class="n">nes_addr_mode_ind_y</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// Indirect Indexed</span>
<span class="c1">// implies a table of table address in zero page</span>
<span class="kt">uint8_t</span> <span class="n">arg_addr</span> <span class="o">=</span> <span class="n">decode_byte</span><span class="p">();</span>
<span class="kt">uint16_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="n">peek</span><span class="p">(</span><span class="n">arg_addr</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="kt">uint16_t</span><span class="p">(</span><span class="n">peek</span><span class="p">((</span><span class="n">arg_addr</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&</span> <span class="mh">0xff</span><span class="p">))</span> <span class="o"><<</span> <span class="mi">8</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">new_addr</span> <span class="o">=</span> <span class="n">addr</span> <span class="o">+</span> <span class="n">_context</span><span class="p">.</span><span class="n">Y</span><span class="p">;</span>
<span class="k">return</span> <span class="n">new_addr</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="show-me-the-ram">Show me the RAM</h2>
<p>Accessing “RAM” in an emulator should in theory be easy, right? Just reserve a “big chunk” of a whopping 64K of RAM and access that. Unfortunately it is a little bit more complicated than that:</p>
<ul>
<li>The system only has <strong>2KB</strong> of built-in RAM - RAM was expensive in those days</li>
<li>Some memory addresses are mapped to I/O (such as PPU) registers, so accessing those registers becomes a simple memory operation rather than, say, a dedicated instruction</li>
<li>When a NES cartridge is inserted, its onboard data (RAM, ROM) is mapped onto the 64K memory space as well</li>
</ul>
<p>So the actual memory layout looks like this:</p>
<table>
<thead>
<tr>
<th>Address range</th>
<th>Size</th>
<th>Device</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0000-$07FF</td>
<td>$0800</td>
<td>2KB internal RAM</td>
</tr>
<tr>
<td>$0800-$0FFF</td>
<td>$0800</td>
<td>Mirrors of $0000-$07FF</td>
</tr>
<tr>
<td>$1000-$17FF</td>
<td>$0800</td>
<td>Mirrors of $0000-$07FF</td>
</tr>
<tr>
<td>$1800-$1FFF</td>
<td>$0800</td>
<td>Mirrors of $0000-$07FF</td>
</tr>
<tr>
<td>$2000-$2007</td>
<td>$0008</td>
<td>NES PPU registers</td>
</tr>
<tr>
<td>$2008-$3FFF</td>
<td>$1FF8</td>
<td>Mirrors of $2000-2007 (repeats every 8 bytes)</td>
</tr>
<tr>
<td>$4000-$4017</td>
<td>$0018</td>
<td>NES APU and I/O registers</td>
</tr>
<tr>
<td>$4018-$401F</td>
<td>$0008</td>
<td>APU and I/O functionality that is normally disabled. See CPU Test Mode.</td>
</tr>
<tr>
<td>$4020-$FFFF</td>
<td>$BFE0</td>
<td>Cartridge space: PRG ROM, PRG RAM, and mapper registers (See Note)</td>
</tr>
</tbody>
</table>
<p>For more details you can refer to <a href="http://wiki.nesdev.com/w/index.php/CPU_memory_map">this page in NES wiki</a>.</p>
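<p>Concretely, the mirroring rows in the table collapse to a little bit-masking. Something along these lines (a simplified sketch of the idea - not neschan’s actual <code>redirect_addr</code>, which also has to deal with the cartridge ranges):</p>

```cpp
#include <cstdint>

// Fold a CPU address into its canonical (unmirrored) form, per the
// table above. Simplified sketch: APU/IO and cartridge space pass through.
uint16_t fold_mirror(uint16_t addr) {
    if (addr < 0x2000)
        return addr & 0x07ff;                              // $0800-$1FFF mirror the 2KB RAM
    if (addr < 0x4000)
        return static_cast<uint16_t>(0x2000 + (addr & 0x0007));  // $2008-$3FFF mirror $2000-$2007
    return addr;
}
```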
<p>Dealing with cartridges and mappers is another big topic with a whole lot of complexity, which we’ll cover a bit later. For now we’ll treat it as a black box.</p>
<p>All this means that whenever you write a byte you need to do a bit of indirection (just like most magic in computer science):</p>
<p><a href="https://github.com/yizhang82/neschan/blob/master/lib/src/nes_memory.cpp">nes_memory.cpp</a></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">nes_memory</span><span class="o">::</span><span class="n">set_byte</span><span class="p">(</span><span class="kt">uint16_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">val</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">redirect_addr</span><span class="p">(</span><span class="n">addr</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_io_reg</span><span class="p">(</span><span class="n">addr</span><span class="p">))</span>
<span class="p">{</span>
<span class="n">write_io_reg</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">_mapper</span> <span class="o">&&</span> <span class="p">(</span><span class="n">_mapper_info</span><span class="p">.</span><span class="n">flags</span> <span class="o">&</span> <span class="n">nes_mapper_flags_has_registers</span><span class="p">))</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">addr</span> <span class="o">>=</span> <span class="n">_mapper_info</span><span class="p">.</span><span class="n">reg_start</span> <span class="o">&&</span> <span class="n">addr</span> <span class="o"><=</span> <span class="n">_mapper_info</span><span class="p">.</span><span class="n">reg_end</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_mapper</span><span class="o">-></span><span class="n">write_reg</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">_ram</span><span class="p">[</span><span class="n">addr</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
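<p>The read path is symmetric, with one extra wrinkle: on real hardware some register <em>reads</em> have side effects - reading PPUSTATUS at <code>$2002</code> clears the vblank flag, for example - so a <code>get_byte</code> can’t blindly treat I/O registers as plain memory either. A self-contained sketch of the shape (hypothetical names, not neschan’s actual code):</p>

```cpp
#include <array>
#include <cstdint>

// Hypothetical read-path sketch; read_io_reg stands in for a real
// implementation that dispatches to the PPU/APU.
struct sketch_memory {
    std::array<uint8_t, 0x800> ram{};  // the 2KB of internal RAM
    uint8_t ppu_status = 0x80;         // pretend vblank is currently set

    uint8_t read_io_reg(uint16_t addr) {
        if (addr == 0x2002) {          // PPUSTATUS: reading clears vblank
            uint8_t val = ppu_status;
            ppu_status &= 0x7f;
            return val;
        }
        return 0;                      // other registers elided
    }

    uint8_t get_byte(uint16_t addr) {
        if (addr < 0x2000) return ram[addr & 0x7ff];             // RAM + mirrors
        if (addr < 0x4000) return read_io_reg(0x2000 + (addr & 7));
        return 0;                      // APU/cartridge space elided
    }
};
```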
<h2 id="testing">Testing</h2>
<p>I use <a href="https://github.com/onqtam/doctest">doctest</a>, a simple and convenient testing framework that is good enough for my needs. At the beginning I wrote manual tests - basically executing a bunch of instructions until <code class="language-plaintext highlighter-rouge">BRK</code> (which stops the system) and verifying the state of the CPU and RAM:</p>
<p><a href="https://github.com/yizhang82/neschan/blob/master/test/cpu_test.cpp">cpu_test.cpp</a></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TEST_CASE</span><span class="p">(</span><span class="s">"CPU tests"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">nes_system</span> <span class="n">system</span><span class="p">;</span>
<span class="n">SUBCASE</span><span class="p">(</span><span class="s">"simple"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">INIT_TRACE</span><span class="p">(</span><span class="s">"neschan.instrtest.simple.log"</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Running [CPU][simple]..."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">system</span><span class="p">.</span><span class="n">power_on</span><span class="p">();</span>
<span class="n">system</span><span class="p">.</span><span class="n">run_program</span><span class="p">(</span>
<span class="p">{</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="c1">// LDA #$10 -> A = #$10</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// STA $20 -> $20 = #$10</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="c1">// LDA #$1 -> A = #$1</span>
<span class="mh">0x65</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// ADC $20 -> A = #$11</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// STA $21 -> $21=#$11</span>
<span class="mh">0xe6</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// INC $21 -> $21=#$12</span>
<span class="mh">0xa4</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// LDY $21 -> Y=#$12</span>
<span class="mh">0xc8</span><span class="p">,</span> <span class="c1">// INY -> Y=#$13</span>
<span class="mh">0x00</span><span class="p">,</span> <span class="c1">// BRK </span>
<span class="p">},</span>
<span class="mh">0x1000</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">cpu</span> <span class="o">=</span> <span class="n">system</span><span class="p">.</span><span class="n">cpu</span><span class="p">();</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x20</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x10</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x21</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x12</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">A</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x11</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">Y</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x13</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>But this quickly gets tedious. Fortunately, there are a lot of existing test ROMs. I’ve been using <a href="https://github.com/christopherpow/nes-test-roms/tree/master/nes_instr_test">this one</a> - it is fairly comprehensive. This does mean I needed to implement rudimentary ROM loading first (which we won’t cover here), but once that was ready I could just load the ROM and follow the convention of the test ROM - in this case, checking <code class="language-plaintext highlighter-rouge">peek(0x6000) == 0</code>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define INSTR_V5_TEST_CASE(test) \
SUBCASE("instr_test-v5 " test) { \
INIT_TRACE("neschan.instrtest.instr_test-v5." test ".log"); \
cout << "Running [CPU][instr_test-v5-" << test << "]" << endl; \
system.power_on(); \
auto cpu = system.cpu(); \
cpu->stop_at_infinite_loop(); \
system.run_rom("./roms/instr_test-v5/rom_singles/" test ".nes", nes_rom_exec_mode_reset); \
CHECK(cpu->peek(0x6000) == 0); \
}
</span></code></pre></div></div>
<p>With that I can run a bunch of ROMs as regression tests, much better:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"01-basics"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"02-implied"</span><span class="p">)</span>
<span class="c1">// INSTR_V5_TEST_CASE("03-immediate")</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"04-zero_page"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"05-zp_xy"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"06-absolute"</span><span class="p">)</span>
<span class="c1">// INSTR_V5_TEST_CASE("07-abs_xy")</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"08-ind_x"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"09-ind_y"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"10-branches"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"11-stack"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"12-jmp_jsr"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"13-rts"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"14-rti"</span><span class="p">)</span>
<span class="c1">// INSTR_V5_TEST_CASE("15-brk")</span>
<span class="c1">// INSTR_V5_TEST_CASE("16-special")</span>
<span class="err">}</span>
</code></pre></div></div>
<blockquote>
<p>Some of the commented-out cases are most likely signs that there are still bugs in the CPU emulation.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>It took me a few days to implement the full CPU and get the majority of the CPU tests to pass. Things are more subtle than I expected, but that’s probably always the case when emulating a real-world CPU, which has its own quirks or even bugs - and the documentation has bugs too (which are hard to find unless you test against a real CPU or another emulator). There is quite a bit of subtle behavior I didn’t cover here (such as page crossing, etc.) that I needed to get exactly right. Not surprisingly, getting CPU emulation correct is absolutely critical for getting games working.</p>
<p>One thing that did surprise me is that the last bugs that prevented <em>Super Mario Bros</em> from working were <a href="https://github.com/yizhang82/neschan/commit/7a397de0e6b6afcd50cc77bd33079ad854722205">bugs in my CPU emulation</a>, including a documentation bug. If I remember correctly I had to debug it side by side with another emulator to find the exact problem. In retrospect I probably should have gotten all the CPU tests passing first - the fact that I had disabled a few (especially among the earlier ones from 1-14) was definitely a red flag. Unfortunately I was too excited to push ahead, and “mostly working” was deemed “good enough”, which turned out to be a big mistake. But then, that’s why we work on side projects - to have fun, and learn something doing it.</p>
<h2 id="the-series-so-far">The series so far…</h2>
<ul>
<li><a href="/nes-emu-overview">Part 1 - NES Emulator Overview</a></li>
<li><a href="/nes-emu-main-loop">Part 2 - Writing the main loop</a></li>
<li><a href="/nes-emu-cpu">Part 3 - Emulating the 6502 CPU</a></li>
</ul>
<h2 id="if-you-are-hungry-for-more-nes">If you are hungry for more NES…</h2>
<p>Head to the <a href="http://wiki.nesdev.com/w/index.php/Nesdev">NESDev Wiki</a> - I’ve learned pretty much everything about the NES there. There is also a great book on the NES called <a href="https://www.amazon.com/Am-Error-Nintendo-Computer-Entertainment/dp/0262028778">I am error</a>, which is surprisingly deeply technical for a book about the history of the NES.</p>
<h2 id="doctest">Doctest - my favorite lightweight, zero-friction unit test framework (2021-01-09)</h2>
<p>In my personal C++ projects I’ve always been using <a href="https://github.com/onqtam/doctest">doctest</a>. It’s simply awesome. It takes a few seconds to get bootstrapped and you are ready to run your tests.
And it should really be the first thing you do when you start a new project.</p>
<p>For example, I’ve been using it in <a href="https://github.com/yizhang82/neschan/">neschan</a> which is a NES emulator that I wrote for fun back in 2018, and one such example is a few unit test that validates the emulated 6502 CPU works correctly:</p>
<!--more-->
<p><a href="https://github.com/yizhang82/neschan/blob/master/test/cpu_test.cpp">cpu_test.cpp</a></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TEST_CASE</span><span class="p">(</span><span class="s">"CPU tests"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">nes_system</span> <span class="n">system</span><span class="p">;</span>
<span class="n">SUBCASE</span><span class="p">(</span><span class="s">"simple"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">INIT_TRACE</span><span class="p">(</span><span class="s">"neschan.instrtest.simple.log"</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Running [CPU][simple]..."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">system</span><span class="p">.</span><span class="n">power_on</span><span class="p">();</span>
<span class="n">system</span><span class="p">.</span><span class="n">run_program</span><span class="p">(</span>
<span class="p">{</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="c1">// LDA #$10 -> A = #$10</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// STA $20 -> $20 = #$10</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="c1">// LDA #$1 -> A = #$1</span>
<span class="mh">0x65</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// ADC $20 -> A = #$11</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// STA $21 -> $21=#$11</span>
<span class="mh">0xe6</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// INC $21 -> $21=#$12</span>
<span class="mh">0xa4</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// LDY $21 -> Y=#$12</span>
<span class="mh">0xc8</span><span class="p">,</span> <span class="c1">// INY -> Y=#$13</span>
<span class="mh">0x00</span><span class="p">,</span> <span class="c1">// BRK </span>
<span class="p">},</span>
<span class="mh">0x1000</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">cpu</span> <span class="o">=</span> <span class="n">system</span><span class="p">.</span><span class="n">cpu</span><span class="p">();</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x20</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x10</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x21</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x12</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">A</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x11</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">Y</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x13</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It’s pretty self-explanatory - use <code class="language-plaintext highlighter-rouge">TEST_CASE</code> to define a test case and <code class="language-plaintext highlighter-rouge">SUBCASE</code> for scenarios, and <code class="language-plaintext highlighter-rouge">CHECK</code> for actual validation/assertion. (Ignore <code class="language-plaintext highlighter-rouge">INIT_TRACE</code> - it’s not part of the doctest framework)</p>
<p>To use it in your own project - just download one file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://raw.githubusercontent.com/onqtam/doctest/master/doctest/doctest.h -o doctest.h
</code></pre></div></div>
<p>Then include it and add a #define:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
#include "doctest.h"
</span>
<span class="kt">int</span> <span class="nf">add</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">TEST_CASE</span><span class="p">(</span><span class="s">"testing 1+1=2"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The magic <code class="language-plaintext highlighter-rouge">DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN</code> tells doctest.h that this file needs a <code class="language-plaintext highlighter-rouge">main</code>. It has to appear before <code class="language-plaintext highlighter-rouge">#include "doctest.h"</code> (obviously), so that the following code in <code class="language-plaintext highlighter-rouge">doctest.h</code> can kick in:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
</span><span class="n">DOCTEST_MSVC_SUPPRESS_WARNING_WITH_PUSH</span><span class="p">(</span><span class="mi">4007</span><span class="p">)</span> <span class="c1">// 'function' : must be 'attribute' - see issue #182</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span> <span class="n">argv</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">doctest</span><span class="o">::</span><span class="n">Context</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">).</span><span class="n">run</span><span class="p">();</span> <span class="p">}</span>
<span class="n">DOCTEST_MSVC_SUPPRESS_WARNING_POP</span>
<span class="cp">#endif // DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
</span></code></pre></div></div>
<p>Note that you should only have this in a single file (perhaps a bit obvious). Other .cc/.cpp files just need to <code class="language-plaintext highlighter-rouge">#include "doctest.h"</code> without the <code class="language-plaintext highlighter-rouge">#define</code> - the linker wouldn’t be happy with more than one <code class="language-plaintext highlighter-rouge">main</code> function, after all.</p>
<p>Compile and run:</p>
<blockquote>
<p>NOTE: --std=c++11 is required to use doctest, otherwise g++ would shout at you for feeding it nonsense</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[~/tmp/test]: g++ test.cc --std=c++11 -o test
[~/tmp/test, 1s]: ./test
[doctest] doctest version is "2.3.1"
[doctest] run with "--help" for options
===============================================================================
[doctest] test cases: 1 | 1 passed | 0 failed | 0 skipped
[doctest] assertions: 1 | 1 passed | 0 failed |
[doctest] Status: SUCCESS!
</code></pre></div></div>
<p>It doesn’t get simpler than this. When I say zero friction I really mean it. OK, maybe not entirely zero, but close enough.</p>
<p>Note that the earlier <code class="language-plaintext highlighter-rouge">main</code> function calls out to <code class="language-plaintext highlighter-rouge">doctest::Context(argc, argv)</code>. This means that the final executable automatically comes with command line arguments you can use to control how the test executes, such as:</p>
<ol>
<li>Test case filters</li>
<li>Listing all test cases / test suites</li>
<li>Running tests N times</li>
<li>And much more</li>
</ol>
<h2 id="if-you-are-curious">If you are curious…</h2>
<p>If you are curious, <code class="language-plaintext highlighter-rouge">doctest.h</code> is a gigantic 6000-line header file that gets assembled from two files, with a bit of post-processing, whenever either of them changes:</p>
<p><a href="https://github.com/onqtam/doctest/blob/master/CMakeLists.txt">CMakeLists.txt</a></p>
<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># add a custom target that assembles the single header when any of the parts are touched</span>
<span class="nb">add_custom_command</span><span class="p">(</span>
OUTPUT <span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/doctest/doctest.h
DEPENDS
<span class="si">${</span><span class="nv">doctest_parts_folder</span><span class="si">}</span>/doctest_fwd.h
<span class="si">${</span><span class="nv">doctest_parts_folder</span><span class="si">}</span>/doctest.cpp
COMMAND <span class="si">${</span><span class="nv">CMAKE_COMMAND</span><span class="si">}</span> -P <span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/scripts/cmake/assemble_single_header.cmake
COMMENT <span class="s2">"assembling the single header"</span><span class="p">)</span>
<span class="nb">add_custom_target</span><span class="p">(</span>assemble_single_header ALL DEPENDS <span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/doctest/doctest.h<span class="p">)</span>
</code></pre></div></div>
<p><a href="https://github.com/onqtam/doctest/blob/master/scripts/cmake/assemble_single_header.cmake">assemble_single_header.cmake</a></p>
<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">set</span><span class="p">(</span>doctest_include_folder <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_LIST_DIR</span><span class="si">}</span><span class="s2">/../../doctest/"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>READ <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/parts/doctest_fwd.h fwd<span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>READ <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/parts/doctest.cpp impl<span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>WRITE <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"// ====================================================================== lgtm [cpp/missing-header-guard]</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"// == DO NOT MODIFY THIS FILE BY HAND - IT IS AUTO GENERATED BY CMAKE! ==</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"// ======================================================================</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"</span><span class="si">${</span><span class="nv">fwd</span><span class="si">}</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"#ifndef DOCTEST_SINGLE_HEADER</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"#define DOCTEST_SINGLE_HEADER</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"#endif // DOCTEST_SINGLE_HEADER</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"</span><span class="se">\n</span><span class="si">${</span><span class="nv">impl</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div></div>
<p>This makes bootstrapping the whole unit test setup essentially painless. You can just include a copy in your repo/folder and you are done - no need to fiddle with a package manager / submodule. I wish more frameworks were distributed like this. Of course, assembling the entire boost library into a single header might be a bit extreme, but for simple frameworks where reducing the friction of adoption is important, this can be a rather useful technique.</p>
<h2 id="fedora-on-x1">Putting Fedora 33 Workstation on X1 Carbon 7th gen (2021-01-02)</h2>
<p>I’ve had a <a href="https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-x/X1-Carbon-Gen-7/p/22TP2TXX17G">Lenovo X1 Carbon 7th gen</a> for a while and tried putting Ubuntu 20.04 on it, but had quite a bit of trouble. Mostly the problem was that this model has 4 speakers (two front and two bottom), which Linux had quite a bit of trouble with: the sound was tinny, volume up / down didn’t work, and the microphone jack popped. There are other minor issues too, like the fingerprint sensor not working, though I don’t care about that much. There is <a href="https://forums.lenovo.com/t5/Ubuntu/Guide-X1-Carbon-7th-Generation-Ubuntu-compatability/td-p/4489823?page=1">a long thread</a> discussing the problems with Ubuntu. I spent quite a while browsing forums and found some workarounds, but none were satisfactory. So I gave up and <a href="/set-up-wsl2">went WSL2</a>.</p>
<!--more-->
<p>WSL2 is basically a VM, so it mostly works quite well and is indistinguishable from native Linux for the most part. However, it isn’t quite smooth sailing either. It is still quite a bit slower - for example, starting vim takes a second or so, while on native Linux it is pretty much instant. It is also very memory hungry - it will aggressively take over memory for I/O cache, which would usually not be a problem if it were the only game in town, but it slows down Windows as a result. I have a desktop machine with 32G and WSL2 will happily push it over 80% during a memory-intensive task such as compilation. Capping the memory consumption helps, though.</p>
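<p>For reference, the capping is done via a <code>.wslconfig</code> file in your Windows user profile folder (<code>%UserProfile%\.wslconfig</code>) - something along these lines, where the numbers are just an example to tune for your machine:</p>

```ini
# %UserProfile%\.wslconfig - limits apply to the whole WSL2 VM
[wsl2]
# cap the VM (and thus its page cache) at 8GB
memory=8GB
# optionally limit CPU count as well
processors=4
```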
<p>After a while I heard that <a href="https://www.forbes.com/sites/jasonevangelho/2020/05/08/lenovo-has-2-awesome-surprises-for-linux-thinkpad-customers-in-2020/?sh=404aaf72399d">Lenovo has been working with Fedora for ThinkPads</a>, and with Fedora 33 out I wanted to give it a spin, but didn’t get a chance to try it until this week. I’m happy to report that putting Fedora Workstation 33 x64 on it works pretty much perfectly:</p>
<ul>
<li>Wifi works out of the box</li>
<li>Suspend/Resume works fine - Lenovo seems to suggest keeping the sleep state in BIOS set to Windows, as Linux supports it these days</li>
<li>Audio works fine - all 4 speakers seem to work and the microphone works well too. Volume buttons work as well</li>
<li>Camera works - a must these days for meetings</li>
<li>Trackpad works - not quite as smooth as on Windows, but acceptable. Scrolling was a bit too fast for my liking and there doesn’t seem to be a great way to tweak it in Gnome, but I can live with it</li>
<li>Fingerprint Sensor works - I didn’t even realize I needed it, but it even works for <code class="language-plaintext highlighter-rouge">sudo</code>, which is a pleasant surprise:</li>
</ul>
<p><img src="/imgs/lenovo-fedora-1.png" alt="Fingerprint Sensor for sudo" /></p>
<p>However, it did come with a catch. If I log in with the fingerprint, it’ll still ask me to unlock the keyring with a password, which is surely broken. The fingerprint daemon also seems to occasionally stop working and hangs at shutdown (until a timeout), but using the fingerprint for sudo is nice enough that I can live with it.</p>
<p>One thing that annoyed me is that the “task bar” won’t show up until I hover the mouse at the top-left. Using <a href="https://extensions.gnome.org/extension/307/dash-to-dock/">Dash to Dock</a> fixed that.</p>
<p>Putting the software I need on it is also relatively straightforward. I have <a href="https://github.com/yizhang82/dotfiles">dotfiles</a> that install vim/tmux/zsh for me and <a href="https://github.com/yizhang82/utils/blob/master/sys/linux/install.sh">install.sh</a> installs all the utilities - I did have to adapt it to use dnf, and some libraries need different names, but that’s pretty much it. After installing VS Code and Chrome I’m good to go. I did run into a problem where the second Chrome window is super slow, which seems to be a problem with Wayland. Applying <a href="https://unix.stackexchange.com/questions/612325/opening-two-chrome-windows-on-fedora-32-is-very-slow">a fix from a Stack Exchange post</a> fixed it for me.</p>
<p><img src="/imgs/lenovo-fedora-2.png" alt="neofetch" /></p>
<p>Overall I’m quite happy with Fedora 33 on the X1 Carbon 7th Gen. Linux has certainly come a long way and it’s great to see hardware manufacturers collaborating with Linux to make the experience just work, so there is still hope. Unfortunately the Linux desktop is as fragmented as ever, so maybe the year of the Linux desktop isn’t quite here yet. Who knows - maybe we’ll all end up with Chromebooks and SSH to our dev boxes in the cloud.</p>Yi Zhangmail@yizhang82.meI’ve had a Lenovo X1 Carbon 7th gen for a while and tried putting Ubuntu 20.04 on it, but had quite a bit of trouble. The main problem is that this model has 4 speakers (two front and two bottom), so Linux had quite a bit of trouble with it. The sound was tinny, and volume up / down didn’t work either. The microphone jack also pops. There are other minor issues like the fingerprint sensor not working, though I don’t care about that much. There is a long thread discussing problems with Ubuntu. I spent quite a while browsing forums and found some workarounds, but none were satisfactory. So I gave up and went WSL2.Paper Reading - Hekaton: SQL Server’s Memory-Optimized OLTP Engine2020-12-29T00:00:00+00:002020-12-29T00:00:00+00:00http://yizhang82.dev/paper-hekaton<p><a href="http://web.eecs.umich.edu/~mozafari/fall2015/eecs584/papers/hekaton.pdf">This is a great paper covering Microsoft’s <em>Hekaton</em> storage engine</a>. There is a lot of meaty stuff here - lock-free Bw-tree indexing, MVCC, checkpointing, query compilation. I’m especially interested in its query compilation given my background in the .NET runtime, and I’ve also spent a non-trivial amount of time optimizing query performance in my current job. Bw-tree is a very interesting direction for B+ trees as well, and we’ll look at a few papers that cover Bw-tree in more depth in future posts.</p>
<!--more-->
<h2 id="overview">Overview</h2>
<p>Hekaton is an alternative SQL Server storage engine optimized for main memory (not a separate DBMS). Users can enable it by declaring tables to be “optimized”. Hekaton has the following design principles:</p>
<ul>
<li>Durability is ensured by logging and checkpointing, but index operations are not logged - indexes are rebuilt entirely from the latest checkpoint and logs. This avoids complex buffer pool flush management.</li>
<li>Internal data structures (allocators, indexes, transaction map, etc.) are entirely lock-free / latch-free, as is any other performance-critical path. Hekaton also uses a new optimistic multi-version concurrency control scheme for transaction isolation, to avoid locking.</li>
<li>Requests are compiled down to native code. Decisions are made as far in advance as possible to reduce runtime overhead.</li>
</ul>
<blockquote>
<p>Sidebar: Request compilation is especially interesting here. This is an advanced technique commonly seen in language runtimes as JIT compilation. However, in most cases it isn’t worth the complexity for a DBMS unless memory access (instead of I/O) becomes the bottleneck - which is exactly the case here, where most SQL statements access hot data that is already in memory (in the buffer pool cache).</p>
</blockquote>
<p>Note that Hekaton doesn’t support table partitioning. Some in-memory databases such as HyPer / H-Store / VoltDB / Dora partition the database by core. However, this has the downside that a query which can’t be “partition aligned” (not using the partitioning index) needs to be sent to all partitions, which can be much more expensive. To support a wider variety of workloads, Hekaton decided not to support table partitioning. Keep in mind this is partitioning tables inside the same database instance, and not related to distributed databases where data is partitioned across database instances on different nodes.</p>
<p>In a high-level, Hekaton has 3 components:</p>
<ul>
<li>Hekaton storage engine - manages user data and index</li>
<li>Hekaton compiler - takes the AST of a stored procedure and metadata as input, and compiles them to native code</li>
<li>Hekaton runtime system - integration with SQL server and providing helpers needed by compiled code</li>
</ul>
<p>Hekaton also heavily leverages existing SQL server services - you can refer to the paper for more details.</p>
<h2 id="storage-and-indexing">Storage and Indexing</h2>
<p>Hekaton supports hash indexes via a lock-free hash table, and range indexes via the Bw-tree (a novel lock-free variant of the B-tree).</p>
<p>The following diagram is a good example:</p>
<p><img src="/imgs/paper-hekaton-1.png" alt="Hekaton_index" /></p>
<ul>
<li>Both the hash index and the Bw-tree index store pointers to the actual data</li>
<li>The hash index is divided into hash buckets - so bucket J points to the start of all the names beginning with J. All data within the same bucket are linked together</li>
<li>Different versions of the same key are also linked together to provide MVCC support. The begin/end times describe the transaction timestamp range in which the value is valid, and the ranges are strictly non-overlapping. Every read has a read time, and only matching records are returned</li>
</ul>
<p>During an update, the record being updated has its end time set to the transaction id (Txn75 in the diagram) to indicate it is being updated, and the new record has its begin time set to the transaction id as well, indicating it is a new record that hasn’t committed yet (its end time being infinity). Once the transaction commits, it updates these to the commit time. Old versions are garbage collected when they are no longer visible to any transaction, done cooperatively by all worker threads.</p>
<h2 id="programming-and-compilation">Programming and Compilation</h2>
<p>Typically, a DBMS uses an “interpreter”-style execution model to execute SQL statements. The Hekaton compiler reuses the SQL Server T-SQL compiler stack (metadata, parser, name resolution, type derivation, and query optimizer). The output is C code, which is compiled with Microsoft VC++ into a DLL that gets loaded and executed at runtime.</p>
<p>As part of creating a new table, schema functions such as the hashing function, record comparison, and record serialization are compiled as well and made available for index operations such as searches / inserts. As a result those operations are quite efficient.</p>
<blockquote>
<p>Ideally, these functions would be compiled together with the SQL statements as well so that they can be properly inlined if needed, though the usual caveats of inlining apply.</p>
</blockquote>
<p>A SQL statement is compiled into MAT (Mixed Abstract Tree) which is a rich abstract syntax tree representing metadata, imperative logic, and query plans. It is then converted into PIT (Pure Imperative Tree) that is more easily converted into C or other intermediate representations. The following picture shows the high-level flow:</p>
<p><img src="/imgs/paper-hekaton-2.png" alt="Hekaton_Compilation" /></p>
<p>A query plan consists of a tree of operators, as in most query execution engines. Each operator implements a common interface so that operators can be composed. In the example, the code calls the <code class="language-plaintext highlighter-rouge">Scan</code> operator, which calls the <code class="language-plaintext highlighter-rouge">Filter</code> operator to filter the rows. The operators are connected by gotos instead of calls - this greatly reduces the overhead of procedure calls and parameter passing, though it makes the generated code harder to debug.</p>
<blockquote>
<p>Connecting operators with gotos effectively “inlines” the code by hard-coding the transfers of control - cruder than true compiler inlining, but simpler to implement. It is also reasonable to expect the C compiler to inline the code itself, since the call graph is well defined.</p>
</blockquote>
<p>Not all code is compiled - some functionality, such as sorting and math, is available as helpers, where the implementation is complex and the overhead of a function call is relatively low.</p>
<p>The compiled stored procedures look just like any T-SQL stored procedures and support parameter passing. There are limitations to what these T-SQL procedures and their SQL statements can do due to implementation restrictions. To get around those limitations, Hekaton supports Query Interop, which enables the conventional disk-based query engine to query memory-optimized tables.</p>
<h2 id="transactions">Transactions</h2>
<p>Hekaton supports optimistic MVCC to provide snapshot, repeatable read, and serializable transaction isolation. For serializable transactions it ensures:</p>
<ol>
<li>Read stability - a version read by the transaction is still the visible version at the end of the transaction</li>
<li>Phantom avoidance - repeating a scan wouldn’t return additional new versions</li>
</ol>
<p>It is worth noting that repeatable read only needs read stability.</p>
<p>To validate its reads, a transaction checks that the versions it read are still visible as of the transaction’s end time. Each transaction maintains a read-set (a list of pointers to each version it has read) and a scan-set.</p>
<p>If transaction T1 sees data changes made by T2, T1 takes a commit dependency on T2. Until T2 commits, T1’s result set is held back by a read barrier and is sent to the client as soon as the dependency is cleared.</p>
<blockquote>
<p>Technically this is still blocking, since the client won’t receive the results yet; in theory, though, the thread can be freed to process other transactions. Until the dependency clears, the transaction isn’t actually committed.</p>
</blockquote>
<p>Once a transaction’s updates have been logged, it is irreversibly committed, and during the commit post-processing phase it updates the timestamps in all versions it touched to the end / commit timestamp of the transaction. The list of inserted / deleted versions is maintained in a write-set.</p>
<p>During a rollback, all versions created by the transaction are invalidated. Deleted versions are restored by clearing their end timestamps (back to infinity). Any dependent transactions are notified.</p>
<h2 id="checkpoint-and-recovery">Checkpoint and recovery</h2>
<p>Hekaton ensures transaction durability so that it can recover after a failure, using logs and checkpoints. The design minimizes transaction-processing overhead and pushes work to recovery time where possible. It supports parallel processing during recovery. Indexes are reconstructed during recovery.</p>
<p>Logs are essentially redo logs for committed transactions. No undo log is recorded.</p>
<p>Checkpoints are continuous, incremental, and append-only - they are essentially deltas of changes recorded in sequential files, consisting of multiple data files and delta files. The reason they are continuous is that periodic checkpoints are disruptive to performance. A data file contains inserted records covering a specific timestamp range; it is loaded at recovery time and the indexes reconstructed. A delta file is a list of the deleted versions in its data file, and maps 1:1 to a data file. At recovery time it filters out deleted records in the data file so they are never loaded into memory. The files are loaded in parallel during recovery. In this sense, checkpoints are basically compressed logs. Checkpoint data files are also merged to drop deleted versions.</p>
<blockquote>
<p>The continuous nature of these checkpoints differs from traditional checkpoints. With traditional checkpoints, no filtering is required because data deleted within the timestamp range is dropped as part of the checkpointing process. With continuous checkpoints, however, you need to record both inserts and deletes. The result is essentially a segmented, self-contained log (so simply rotating redo logs won’t work) optimized for bulk loading.</p>
</blockquote>
<h2 id="garbage-collection">Garbage Collection</h2>
<p>Hekaton GC removes versions that are no longer visible to any active transaction. It is non-blocking, parallelizable, and scalable. Most interestingly, it is cooperative - worker threads doing transaction processing discard garbage versions when they encounter them, making GC naturally scalable. There are also dedicated background GC threads, which collect cold regions of the index that might not be scanned at all.</p>
<p>Hekaton GC locates garbage versions by looking for end timestamps smaller than the oldest active transaction timestamp, which is determined periodically by a GC thread scanning the global transaction map.</p>
<p>The background collection thread breaks up the work and distributes it to a set of work queues. Once a Hekaton worker thread is done with transaction processing, it picks up a small chunk of garbage collection work from its CPU-local queue. This naturally parallelizes the work across CPU cores and is also self-throttling, since it is done in worker threads.</p>
<h2 id="whats-next">What’s next</h2>
<p>There are a few more related papers that we can explore. Bw-tree is probably the most interesting and worth looking into.</p>Yi Zhangmail@yizhang82.meThis is a great paper covering Microsoft’s Hekaton storage engine. There is a lot of meaty stuff here - lock-free Bw-tree indexing, MVCC, checkpointing, query compilation. I’m especially interested in its query compilation given my background in the .NET runtime, and I’ve also spent a non-trivial amount of time optimizing query performance in my current job. Bw-tree is a very interesting direction for B+ trees as well, and we’ll look at a few papers that cover Bw-tree in more depth in future posts.InnoDB Internals - Consistent Reads2020-06-18T00:00:00+00:002020-06-18T00:00:00+00:00http://yizhang82.dev/innodb-internals-consistent-reads<h2 id="overview">Overview</h2>
<p>I’ve been doing some research in this area trying to understand how this works in databases (for my upcoming project), so I thought I’d share some of my learnings here.</p>
<p>InnoDB internally uses ReadView to establish snapshots for consistent reads - basically giving you a point-in-time view of the database at the time the snapshot was created.</p>
<p>In InnoDB, all changes are immediately made to the latest version of the database regardless of whether they have been committed, so without MVCC everybody would see the latest version of every row - a disaster for consistency. Not to mention you also need to be able to roll back changes. To achieve this, InnoDB maintains an undo log tracking a linked list of changes made by other transactions, so reading in the past with a snapshot means starting from the latest record in the BufferPool and walking backwards to find the first visible change. Rollback is similar.</p>
<blockquote>
<p>This also means the undo log can’t be purged while the snapshot is still active, so the undo log gets longer and longer, which slows down reads more and more. This is the infamous long-running transaction issue.</p>
</blockquote>
<p>The fundamental issue is that you need to be able to determine the visibility of changes. This is done with two things:</p>
<ol>
<li>InnoDB tracks the trx_id_t of each row, both in the row itself and in the undo log</li>
<li>InnoDB internally use a data structure called <code class="language-plaintext highlighter-rouge">ReadView</code> to determine if a transaction is visible in the snapshot.</li>
</ol>
<p>So the algorithm becomes as simple as walking the list backwards to find the first visible record.</p>
<!--more-->
<p>For example - assuming current transaction is <code class="language-plaintext highlighter-rouge">6941</code>, and the latest record is made by transaction <code class="language-plaintext highlighter-rouge">6999</code>, and the undo log looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6940 -> 6943 -> 6945 -> 6999
</code></pre></div></div>
<p>This chain means the row has been modified by <code class="language-plaintext highlighter-rouge">6940</code>, <code class="language-plaintext highlighter-rouge">6943</code>, <code class="language-plaintext highlighter-rouge">6945</code>, and <code class="language-plaintext highlighter-rouge">6999</code>, in that order.</p>
<p>In order to determine visibility, <code class="language-plaintext highlighter-rouge">ReadView</code> tracks an upper bound, a lower bound, and a list of active transactions.</p>
<p>Assuming the system has on-going transactions with the following trx_id_t values: <code class="language-plaintext highlighter-rouge">(6943, 6945)</code>, and <code class="language-plaintext highlighter-rouge">trx_sys->max_trx_id=6959</code>:</p>
<p>ReadView is going to establish the following view for snapshot:</p>
<table>
<thead>
<tr>
<th>Lowest</th>
<th>On-going</th>
<th>Future</th>
</tr>
</thead>
<tbody>
<tr>
<td>< 6943</td>
<td>6943, 6945</td>
<td>>= 6959 (max_trx_id)</td>
</tr>
</tbody>
</table>
<p>This implies:</p>
<ul>
<li>Any transactions < <code class="language-plaintext highlighter-rouge">6943</code> are definitely visible, because they were no longer active when the snapshot was established - they had already committed.</li>
<li>Any transactions >= <code class="language-plaintext highlighter-rouge">6959</code> (inclusive) are future changes that will not be seen by this snapshot.</li>
<li>Any transactions falling within this range have two possibilities:
<ul>
<li>At the time the snapshot is created, the on-going transactions are <code class="language-plaintext highlighter-rouge">6943</code> and <code class="language-plaintext highlighter-rouge">6945</code>. These transactions are old transactions and any updates by them are not visible, since they haven’t committed yet</li>
<li>Otherwise, they have already been committed and should be visible</li>
</ul>
</li>
</ul>
<p>BTW, in case you are wondering: the reason <code class="language-plaintext highlighter-rouge">6959</code> itself is excluded from the snapshot is that max_trx_id is reserved for the next transaction, just as the comment in the InnoDB code suggests:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">volatile</span> <span class="n">trx_id_t</span> <span class="n">max_trx_id</span><span class="p">;</span> <span class="cm">/*!< The smallest number not yet
assigned as a transaction id or
transaction number. This is declared
volatile because it can be accessed
without holding any mutex during
AC-NL-RO view creation. */</span>
</code></pre></div></div>
<p>So, looking back at the link list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6940 -> 6943 -> 6945 -> 6999
</code></pre></div></div>
<p>We can determine:</p>
<ul>
<li>6999 is invisible because it is >= 6959, so it belongs to the future (committed or not, it doesn’t matter)</li>
<li>6945 and 6943 were part of on-going transactions at the time of the snapshot, which means they were not yet committed when the snapshot was created (even though they did commit later, before we read), so they are also invisible</li>
<li>6940 is visible because it is less than 6943, so it had already committed in the past and is by definition visible.</li>
</ul>
<p>So we should return the record with trx_id_t = <code class="language-plaintext highlighter-rouge">6940</code>.</p>
<p>Let’s look into this process with a bit more detail.</p>
<h2 id="creating-the-readview">Creating the ReadView</h2>
<p>Whenever you try to read any row in InnoDB with a consistent read (as opposed to a locking read, which is another topic worth discussing in a separate article), a ReadView is assigned to the active transaction:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="err">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">prebuilt</span><span class="o">-></span><span class="n">select_lock_type</span> <span class="o">==</span> <span class="n">LOCK_NONE</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* This is a consistent read */</span>
<span class="cm">/* Assign a read view for the query */</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">srv_read_only_mode</span><span class="p">)</span> <span class="p">{</span>
<span class="n">trx_assign_read_view</span><span class="p">(</span><span class="n">trx</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The assignment is rather straightforward - it either opens a view from the free list or reuses the existing view if there is one already:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Assigns a read view for a consistent read query. All the consistent reads
within the same transaction will get the same read view, which is created
when this function is first called for a new started transaction.
@return consistent read view */</span>
<span class="n">ReadView</span> <span class="o">*</span><span class="nf">trx_assign_read_view</span><span class="p">(</span><span class="n">trx_t</span> <span class="o">*</span><span class="n">trx</span><span class="p">)</span> <span class="cm">/*!< in/out: active transaction */</span>
<span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">state</span> <span class="o">==</span> <span class="n">TRX_STATE_ACTIVE</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">srv_read_only_mode</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">MVCC</span><span class="o">::</span><span class="n">is_view_active</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">))</span> <span class="p">{</span>
<span class="n">trx_sys</span><span class="o">-></span><span class="n">mvcc</span><span class="o">-></span><span class="n">view_open</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">,</span> <span class="n">trx</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Assuming this is the first time within the transaction, <code class="language-plaintext highlighter-rouge">mvcc::view_open</code> calls into <code class="language-plaintext highlighter-rouge">ReadView::prepare</code> to set up the boundaries as discussed earlier:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ReadView</span><span class="o">::</span><span class="n">prepare</span><span class="p">(</span><span class="n">trx_id_t</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">mutex_own</span><span class="p">(</span><span class="o">&</span><span class="n">trx_sys</span><span class="o">-></span><span class="n">mutex</span><span class="p">));</span>
<span class="n">m_creator_trx_id</span> <span class="o">=</span> <span class="n">id</span><span class="p">;</span>
<span class="n">m_low_limit_no</span> <span class="o">=</span> <span class="n">m_low_limit_id</span> <span class="o">=</span> <span class="n">m_up_limit_id</span> <span class="o">=</span> <span class="n">trx_sys</span><span class="o">-></span><span class="n">max_trx_id</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">trx_sys</span><span class="o">-></span><span class="n">rw_trx_ids</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">copy_trx_ids</span><span class="p">(</span><span class="n">trx_sys</span><span class="o">-></span><span class="n">rw_trx_ids</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">m_ids</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>During <code class="language-plaintext highlighter-rouge">copy_trx_ids</code>, <code class="language-plaintext highlighter-rouge">m_up_limit_id</code> is assigned the smallest active transaction id:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">m_up_limit_id</span> <span class="o">=</span> <span class="n">m_ids</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
</code></pre></div></div>
<p>It is perhaps a bit counter-intuitive as they are sort of reversed:</p>
<ul>
<li>m_up_limit_id is the lower bound of visible trx_id_t (of transactions)</li>
<li>m_low_limit_id is the upper bound (exclusive) of visible trx_id_t (of transactions)</li>
</ul>
<p>And m_ids is the list of trx_id_t values of the on-going transactions (which are invisible).</p>
<p>With this knowledge, we are now ready to read the rows for real.</p>
<h2 id="reading-the-rows">Reading the rows</h2>
<p>Assuming this transaction is trying to read some rows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">from</span> <span class="n">t1</span> <span class="k">where</span> <span class="n">pk</span><span class="o">=</span><span class="mi">6</span><span class="p">;</span>
</code></pre></div></div>
<p>When reading rows, eventually we’ll get here:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">srv_force_recovery</span> <span class="o"><</span> <span class="mi">5</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">lock_clust_rec_cons_read_sees</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">offsets</span><span class="p">,</span>
<span class="n">trx_get_read_view</span><span class="p">(</span><span class="n">trx</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">rec_t</span> <span class="o">*</span><span class="n">old_vers</span><span class="p">;</span>
<span class="cm">/* The following call returns 'offsets' associated with 'old_vers' */</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">row_sel_build_prev_vers_for_mysql</span><span class="p">(</span>
<span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">,</span> <span class="n">clust_index</span><span class="p">,</span> <span class="n">prebuilt</span><span class="p">,</span> <span class="n">rec</span><span class="p">,</span> <span class="o">&</span><span class="n">offsets</span><span class="p">,</span> <span class="o">&</span><span class="n">heap</span><span class="p">,</span>
<span class="o">&</span><span class="n">old_vers</span><span class="p">,</span> <span class="n">need_vrow</span> <span class="o">?</span> <span class="o">&</span><span class="n">vrow</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">mtr</span><span class="p">,</span>
<span class="n">prebuilt</span><span class="o">-></span><span class="n">get_lob_undo</span><span class="p">());</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">lock_clust_rec_cons_read_sees</code> mostly just checks whether the record is visible:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">trx_id_t</span> <span class="n">trx_id</span> <span class="o">=</span> <span class="n">row_get_rec_trx_id</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">offsets</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="n">view</span><span class="o">-></span><span class="n">changes_visible</span><span class="p">(</span><span class="n">trx_id</span><span class="p">,</span> <span class="n">index</span><span class="o">-></span><span class="n">table</span><span class="o">-></span><span class="n">name</span><span class="p">));</span>
</code></pre></div></div>
<p>We check whether the record in question can be seen by reading the trx_id_t field of the record and checking whether it is visible in the view.</p>
<p>As already discussed, <code class="language-plaintext highlighter-rouge">changes_visible</code> uses <code class="language-plaintext highlighter-rouge">(m_up_limit_id, m_low_limit_id)</code> as a fast path:</p>
<ul>
<li>If id < <code class="language-plaintext highlighter-rouge">m_up_limit_id</code>, it happens in the past and definitely visible</li>
<li>If id >= <code class="language-plaintext highlighter-rouge">m_low_limit_id</code>, it happens in the future and definitely not visible</li>
</ul>
<p>Then it does a binary search over the list of transactions to see if the id was in the list of active transactions at the time the <code class="language-plaintext highlighter-rouge">ReadView</code> was established. If it is in the list, it is definitely not visible.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/** Check whether the changes by id are visible.
@param[in] id transaction id to check against the view
@param[in] name table name
@return whether the view sees the modifications of id. */</span>
<span class="kt">bool</span> <span class="nf">changes_visible</span><span class="p">(</span><span class="n">trx_id_t</span> <span class="n">id</span><span class="p">,</span> <span class="k">const</span> <span class="n">table_name_t</span> <span class="o">&</span><span class="n">name</span><span class="p">)</span> <span class="k">const</span>
<span class="n">MY_ATTRIBUTE</span><span class="p">((</span><span class="n">warn_unused_result</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">id</span> <span class="o">></span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">id</span> <span class="o"><</span> <span class="n">m_up_limit_id</span> <span class="o">||</span> <span class="n">id</span> <span class="o">==</span> <span class="n">m_creator_trx_id</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">check_trx_id_sanity</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">id</span> <span class="o">>=</span> <span class="n">m_low_limit_id</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">false</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">m_ids</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">const</span> <span class="n">ids_t</span><span class="o">::</span><span class="n">value_type</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">m_ids</span><span class="p">.</span><span class="n">data</span><span class="p">();</span>
<span class="k">return</span> <span class="p">(</span><span class="o">!</span><span class="n">std</span><span class="o">::</span><span class="n">binary_search</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">m_ids</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="n">id</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
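<p>To make these rules concrete, here is a minimal Python model of the same visibility logic (an illustrative sketch with made-up transaction ids - not InnoDB code):</p>

```python
import bisect

def changes_visible(trx_id, creator_trx_id, up_limit_id, low_limit_id, active_ids):
    """Sketch of ReadView::changes_visible. active_ids is the sorted list of
    transaction ids that were active when the view was created; up_limit_id
    is the smallest of them, low_limit_id the next id yet to be assigned."""
    # Fast path: committed before the view was created, or our own change.
    if trx_id < up_limit_id or trx_id == creator_trx_id:
        return True
    # Fast path: started after the view was created.
    if trx_id >= low_limit_id:
        return False
    if not active_ids:
        return True
    # Visible only if it was NOT active at view creation (std::binary_search).
    i = bisect.bisect_left(active_ids, trx_id)
    return not (i < len(active_ids) and active_ids[i] == trx_id)

# A view created by trx 11 while trx 10 and 12 were still active.
view = dict(creator_trx_id=11, up_limit_id=10, low_limit_id=15, active_ids=[10, 12])
print(changes_visible(5, **view))   # committed in the past: True
print(changes_visible(12, **view))  # still active at view creation: False
print(changes_visible(13, **view))  # committed before the view: True
print(changes_visible(20, **view))  # from the future: False
```

<p>Note how a transaction id between two active ids (13 here) is still visible: it committed before the view was created even though a lower id was still running.</p>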
<p>Once we establish that the current record isn’t visible to the current <code class="language-plaintext highlighter-rouge">ReadView</code>, we’d go down the rabbit hole of checking the undo log:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">srv_force_recovery</span> <span class="o"><</span> <span class="mi">5</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">lock_clust_rec_cons_read_sees</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">offsets</span><span class="p">,</span>
<span class="n">trx_get_read_view</span><span class="p">(</span><span class="n">trx</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">rec_t</span> <span class="o">*</span><span class="n">old_vers</span><span class="p">;</span>
<span class="cm">/* The following call returns 'offsets' associated with 'old_vers' */</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">row_sel_build_prev_vers_for_mysql</span><span class="p">(</span>
<span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">,</span> <span class="n">clust_index</span><span class="p">,</span> <span class="n">prebuilt</span><span class="p">,</span> <span class="n">rec</span><span class="p">,</span> <span class="o">&</span><span class="n">offsets</span><span class="p">,</span> <span class="o">&</span><span class="n">heap</span><span class="p">,</span>
<span class="o">&</span><span class="n">old_vers</span><span class="p">,</span> <span class="n">need_vrow</span> <span class="o">?</span> <span class="o">&</span><span class="n">vrow</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">mtr</span><span class="p">,</span>
<span class="n">prebuilt</span><span class="o">-></span><span class="n">get_lob_undo</span><span class="p">());</span>
</code></pre></div></div>
<p>It simply calls <code class="language-plaintext highlighter-rouge">row_vers_build_for_consistent_read</code>, which loops to scan the undo log backwards from the record:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dberr_t</span> <span class="nf">row_vers_build_for_consistent_read</span><span class="p">(</span>
<span class="k">const</span> <span class="n">rec_t</span> <span class="o">*</span><span class="n">rec</span><span class="p">,</span> <span class="n">mtr_t</span> <span class="o">*</span><span class="n">mtr</span><span class="p">,</span> <span class="n">dict_index_t</span> <span class="o">*</span><span class="n">index</span><span class="p">,</span> <span class="n">ulint</span> <span class="o">**</span><span class="n">offsets</span><span class="p">,</span>
<span class="n">ReadView</span> <span class="o">*</span><span class="n">view</span><span class="p">,</span> <span class="n">mem_heap_t</span> <span class="o">**</span><span class="n">offset_heap</span><span class="p">,</span> <span class="n">mem_heap_t</span> <span class="o">*</span><span class="n">in_heap</span><span class="p">,</span>
<span class="n">rec_t</span> <span class="o">**</span><span class="n">old_vers</span><span class="p">,</span> <span class="k">const</span> <span class="n">dtuple_t</span> <span class="o">**</span><span class="n">vrow</span><span class="p">,</span> <span class="n">lob</span><span class="o">::</span><span class="n">undo_vers_t</span> <span class="o">*</span><span class="n">lob_undo</span><span class="p">)</span> <span class="p">{</span>
<span class="n">trx_id</span> <span class="o">=</span> <span class="n">row_get_rec_trx_id</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">);</span>
<span class="n">version</span> <span class="o">=</span> <span class="n">rec</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
<span class="cm">/* If purge can't see the record then we can't rely on
the UNDO log record. */</span>
<span class="n">trx_undo_prev_version_build</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">mtr</span><span class="p">,</span> <span class="n">version</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">,</span> <span class="n">heap</span><span class="p">,</span>
<span class="o">&</span><span class="n">prev_version</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">vrow</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">lob_undo</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prev_version</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* It was a freshly inserted version */</span>
<span class="o">*</span><span class="n">old_vers</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">*</span><span class="n">offsets</span> <span class="o">=</span> <span class="n">rec_get_offsets</span><span class="p">(</span><span class="n">prev_version</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">,</span> <span class="n">ULINT_UNDEFINED</span><span class="p">,</span>
<span class="n">offset_heap</span><span class="p">);</span>
<span class="n">trx_id</span> <span class="o">=</span> <span class="n">row_get_rec_trx_id</span><span class="p">(</span><span class="n">prev_version</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">view</span><span class="o">-></span><span class="n">changes_visible</span><span class="p">(</span><span class="n">trx_id</span><span class="p">,</span> <span class="n">index</span><span class="o">-></span><span class="n">table</span><span class="o">-></span><span class="n">name</span><span class="p">))</span> <span class="p">{</span>
<span class="cm">/* The view already sees this version: we can copy
it to in_heap and return */</span>
<span class="n">buf</span> <span class="o">=</span>
<span class="k">static_cast</span><span class="o"><</span><span class="n">byte</span> <span class="o">*></span><span class="p">(</span><span class="n">mem_heap_alloc</span><span class="p">(</span><span class="n">in_heap</span><span class="p">,</span> <span class="n">rec_offs_size</span><span class="p">(</span><span class="o">*</span><span class="n">offsets</span><span class="p">)));</span>
<span class="o">*</span><span class="n">old_vers</span> <span class="o">=</span> <span class="n">rec_copy</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">prev_version</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">version</span> <span class="o">=</span> <span class="n">prev_version</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The code is simplified to make it more readable:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">trx_undo_prev_version_build</code> reads the previous undo log record into prev_version
<ul>
<li>If we reached the end, just exit the loop. By definition this must be a row INSERTed after our snapshot was taken; otherwise there would be at least one visible record in the undo log chain containing the original value.</li>
</ul>
</li>
<li>Retrieve the <code class="language-plaintext highlighter-rouge">trx_id</code> of <code class="language-plaintext highlighter-rouge">prev_version</code></li>
<li>See if the trx_id is visible in the view
<ul>
<li>If yes, copy it and assign to <code class="language-plaintext highlighter-rouge">old_vers</code></li>
<li>Otherwise keep looping</li>
</ul>
</li>
</ul>
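<p>The loop above can be modeled in a few lines of Python (a toy sketch where each version simply links to its predecessor; the real code reconstructs each previous version from undo log records):</p>

```python
def build_for_consistent_read(latest, is_visible):
    """Toy model of row_vers_build_for_consistent_read.

    Precondition (as in InnoDB): 'latest' itself is already known to be
    invisible. Walk backwards until a visible version is found, or the
    chain ends (the row was inserted after our snapshot)."""
    version = latest
    while True:
        prev = version["prev"]          # trx_undo_prev_version_build
        if prev is None:
            return None                 # freshly inserted version
        if is_visible(prev["trx_id"]):
            return prev                 # first version our snapshot can see
        version = prev

# Chain: trx 9 inserted 'a', then trx 12 updated it to 'b'.
v1 = {"trx_id": 9, "data": "a", "prev": None}
v2 = {"trx_id": 12, "data": "b", "prev": v1}

# A snapshot that sees only transactions with id below 10 gets 'a'.
print(build_for_consistent_read(v2, lambda t: t < 10)["data"])
# A snapshot that sees nothing gets None - the row didn't exist yet.
print(build_for_consistent_read(v2, lambda t: False))
```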
<h2 id="whats-next">What’s next</h2>
<p>I’m planning to write more about MySQL / RocksDB / MyRocks / InnoDB and have a bunch of notes in my backlog. I was thinking about making this into a series, but I’ve realized I’ll never have time to write a cohesive series about any of these topics given their scope. So I’ll just write about whatever I’m researching and get it out, and forget about the whole series thing. Hopefully this way I’ll actually get more done.</p>
Trying and setting up WSL 22020-05-29T00:00:00+00:002020-05-29T00:00:00+00:00http://yizhang82.dev/wsl2-setup<p>The year of Linux desktop has finally come. It’s Windows + WSL 2. Seriously.</p>
<p>I use an MBP 16 for my daily work and SSH into Linux machines for development/testing. While it’s a fantastic machine (and the trackpad is second to none), I just hate how Apple tries to lock down the system so much that even getting gdb to work is a nightmare, and running any simple script makes it phone home for validation.</p>
<p>So I tried installing Linux on my machines. I do have a personal laptop, an X1 Carbon Gen 7, but it doesn’t work well with Linux: mostly, Linux just doesn’t like the 4-channel Dolby surround speakers - they sound like something from a tin can and the volume is much lower, while on Windows the sound is actually pretty nice (for a laptop, of course). I have spent countless hours on it and I’ve seen many people struggling with the same issues. There are also occasional hiccups with suspend/resume, but I can live with that. I also have a powerful gaming PC on which I mostly play games. WSL sounds like a perfect solution for those machines: I can use Windows for its compatibility / games, while also using it for development / tinkering on Linux. Yes, you could dual boot or install a Linux VM, but the integration between WSL 2 and Windows seems pretty nice to me, so I decided to try it out - and now all my Windows machines have WSL 2 installed.</p>
<p>Setting it up is not too bad - you do need to follow the <a href="https://docs.microsoft.com/en-us/windows/wsl/install-win10">official instructions</a> to install it, which I’m not going to repeat here. The installation experience was fairly smooth, though it requires multiple steps.</p>
<p>However, getting it to work properly requires a bit of extra work. Once you set it up, it’s pretty much all I ever needed. Here is what it looks like when I’m done:</p>
<p><img src="/imgs/wsl2-terminal.png" alt="WSL_terminal" /></p>
<!--more-->
<h2 id="install-wsl-remote-extension-on-vscode">Install WSL Remote extension on VSCode</h2>
<p>When you launch VS Code, it’ll automatically prompt you to install the WSL Remote extension. Once the installation is done, just open code from WSL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>code &lt;folder&gt;
</code></pre></div></div>
<p>Once you do that, it’ll install VSCode Server automatically and launch VS code pointing to that folder. And you can browse through the code as usual.</p>
<p>And the best part is, once you install corresponding remote version of the extension (for example, C++ Extension), IntelliSense works! The installation of remote extension is a bit tricky - you need to find your extension again, and click the little green button “Install in WSL Ubuntu”.</p>
<p>You can refer to the <a href="https://code.visualstudio.com/docs/remote/wsl">official doc</a> for more details.</p>
<h2 id="moving-it-to-another-disk">Moving it to another disk</h2>
<p>By default, WSL forces you to install it on the <code class="language-plaintext highlighter-rouge">C:</code> drive, which makes no sense whatsoever in 2020. I suppose this is a Windows Store thing. Fortunately, there is a <a href="https://github.com/pxlrbt/move-wsl">move-wsl</a> tool available in github. There is a powershell script and a simple batch file. I’m going to use the batch file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Move ubuntu distro to D:\vm</span>
move-wsl.bat ubuntu D:<span class="se">\v</span>m
</code></pre></div></div>
<p>It’ll move Ubuntu distro to <code class="language-plaintext highlighter-rouge">D:\vm</code>, and that’s basically a huge <code class="language-plaintext highlighter-rouge">ext4.vhdx</code> file.</p>
<p>Once you launch WSL again, you may find the default user has become root. Don’t worry, just put the following into <code class="language-plaintext highlighter-rouge">/etc/wsl.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[user]
default=YOUR_USERNAME
</code></pre></div></div>
<p>And go back to a windows prompt to terminate the running WSL ubuntu instance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl -t ubuntu
</code></pre></div></div>
<p>The next time when you launch WSL you’ll be going back as your normal self.</p>
<h2 id="limiting-memory-growth">Limiting memory growth</h2>
<p>By default WSL 2 is set up to consume up to 80% of system memory, which is way too high. On my 16GB laptop I’m setting this to 6GB (8GB is still too high with a few Chrome tabs and VSCode open side by side). As far as I can tell this is due to caching - Linux will happily use all the memory it can for caches, but when Windows needs that memory there is no way for Linux to know, unless you force Linux to reclaim unused memory more aggressively (see this <a href="https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/">article</a> for more details). I’m hoping for a better long-term solution where the two OSes can talk to each other in some way to negotiate memory usage. Until that happens, you’ll need to write the following to <code class="language-plaintext highlighter-rouge">%USERPROFILE%\.wslconfig</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[wsl2]
memory=6GB
swap=0
localhostForwarding=true
</code></pre></div></div>
<p>If you are using this on a workstation with 32GB+ memory, you might not need this. Though it is still likely that it’ll happily consume everything when you do some heavy processing like compiling source code with 24 cores.</p>
<h2 id="terminal">Terminal</h2>
<p><a href="https://docs.microsoft.com/en-us/windows/terminal/">Windows Terminal</a> is a modern terminal that supports different shells like the Ubuntu shell, cmd, PowerShell, etc. I find it works well with zsh/tmux and supports color themes and good font rendering, so that’s the one I’m using right now.</p>
<p>I’ve set it up with <a href="https://design.ubuntu.com/font/">Ubuntu Mono font</a> and <a href="https://github.com/mbadolato/iTerm2-Color-Schemes/blob/master/windowsterminal/Afterglow.json">Afterglow</a> theme so it looks fairly close to a Terminal under linux.</p>
<h2 id="setting-up-git-credentials">Setting up git credentials</h2>
<p>Because there is no desktop support there, you can’t use libsecret which uses dbus. If you set it up, you’ll eventually run into this error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>** (process:7902): CRITICAL **: could not connect to Secret Service: Cannot autolaunch D-Bus without X11 $DISPLAY
</code></pre></div></div>
<p>Fortunately, given this is windows and WSL supports Windows Interop, you can just use <code class="language-plaintext highlighter-rouge">git-credential-manager.exe</code> which works surprisingly well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config --global credential.helper "/mnt/c/Program\ Files/Git/mingw64/libexec/git-core/git-credential-manager.exe"
</code></pre></div></div>
<h2 id="docker">Docker</h2>
<p>You can install docker as usual, but whenever you try to launch any container you’ll get this error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
</code></pre></div></div>
<p>This is because there is no systemd installed, so the docker daemon isn’t launched automatically. You can still start it in the good old System V style:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo service docker start
</code></pre></div></div>
<h2 id="copying-text-to-clipboard-in-tmux">Copying text to clipboard in tmux</h2>
<p>Under regular Linux you could just use <code class="language-plaintext highlighter-rouge">xsel</code> / <code class="language-plaintext highlighter-rouge">xclip</code>, which isn’t an option here as there is no X server installed. Again, because there is Windows interop, you can just use <code class="language-plaintext highlighter-rouge">clip.exe</code>!</p>
<p>You can set it up in tmux so that it integrates with your clipboard.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bind-key -Tcopy-mode-vi 'y' send -X copy-pipe "clip.exe"
</code></pre></div></div>
<p>I have a script that auto-detects Linux/Mac/WSL and picks the correct copy tool, available <a href="https://github.com/yizhang82/dotfiles/blob/master/utils/copy">in github</a> and based on https://github.com/Parth/dotfiles/blob/master/utils/copy.</p>
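<p>The detection itself boils down to checking <code class="language-plaintext highlighter-rouge">/proc/version</code> for a Microsoft signature. A hypothetical Python version of the same idea (my actual script is shell; the function and return values here are made up for illustration):</p>

```python
def pick_copy_tool(sys_platform, proc_version):
    """Pick a clipboard tool the way an auto-detecting copy script might:
    WSL looks like Linux, but /proc/version mentions Microsoft."""
    if sys_platform == "darwin":
        return "pbcopy"        # macOS
    if "microsoft" in proc_version.lower():
        return "clip.exe"      # WSL: use Windows interop
    return "xsel"              # plain Linux with X

print(pick_copy_tool("linux", "Linux version 4.19.128-microsoft-standard"))
print(pick_copy_tool("linux", "Linux version 5.4.0-42-generic"))
print(pick_copy_tool("darwin", ""))
```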
<h2 id="my-overall-impression">My overall impression</h2>
<p>WSL 2 is really a game changer. WSL 1 was a good start, but since it was built by implementing Linux syscalls on top of Windows (interop, basically), compatibility was a big issue, and it’s hard to be productive when you can hardly trust your environment. With WSL 2 you can run Windows and Linux literally side by side and have them talk to each other through WSL interop, so you really get the best of both worlds: the compatibility of Windows (Linux on laptops is still quite a hassle, especially on newer hardware) and the fantastic open-source dev environment of Linux. There is some trade-off, but it’s worth it.</p>
SWIG and Python3 unicode2019-08-15T00:00:00+00:002019-08-15T00:00:00+00:00http://yizhang82.dev/python3-utf8-swig<p>Anyone familiar with Python probably knows its history of Unicode support. If you add Python3, Unicode, and SWIG together, imagine what might go wrong?</p>
<h2 id="python3-unicode-swig-and-me">Python3, Unicode, SWIG, and me</h2>
<p>I was debugging a test failure written in Python just now and it is failing with this error:</p>
<blockquote>
<p>Many of the end-to-end tests here are written in Python because it is convenient - no one wants to write C++ code to drive MySQL and our infra service through a series of steps.</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UnicodeEncodeError: 'latin-1' codec can't encode character '\udcfa' in position 293: ordinal not in range(256)
</code></pre></div></div>
<p>The code looks like this:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sql</span> <span class="o">=</span> <span class="n">get_sql_from_some_magic_place</span><span class="p">()</span>
<span class="n">decoded_sql</span> <span class="o">=</span> <span class="n">cUnescape</span><span class="p">(</span><span class="n">sql</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">))</span>
<span class="n">decoded_sql_str</span> <span class="o">=</span> <span class="n">decoded_sql</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
<span class="n">execute</span><span class="p">(</span><span class="n">decoded_sql_str</span><span class="p">)</span>
</code></pre></div></div>
<p>The code seems straight-forward enough. The offending string looks like this: <code class="language-plaintext highlighter-rouge">b"SELECT from blah WHERE col='\\372'</code>.</p>
<p>This string was originally escaped by <code class="language-plaintext highlighter-rouge">folly::cEscape</code>, which does something rather simple - it converts the string to a C representation where backslashes are doubled and any non-printable characters are escaped as octal. This is convenient, as the escaped strings are plain ASCII and can safely be passed around without worrying about encoding.</p>
<blockquote>
<p>folly is Facebook’s open source standard C++ library collection. See https://github.com/facebook/folly for more information.</p>
</blockquote>
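<p>As a rough Python illustration of this escaping scheme (an approximation for intuition only - folly’s real implementation handles a few more characters):</p>

```python
def c_escape(data: bytes) -> str:
    """Rough approximation of folly::cEscape: keep printable ASCII,
    double backslashes, octal-escape everything else."""
    out = []
    for b in data:
        if b == 0x5C:               # backslash gets doubled
            out.append("\\\\")
        elif 0x20 <= b < 0x7F:      # printable ASCII passes through
            out.append(chr(b))
        else:                       # everything else as \NNN octal
            out.append("\\%03o" % b)
    return "".join(out)

print(c_escape(b"col='\xfa'"))  # the 0xfa byte becomes \372
```

<p>The output is always plain ASCII, which is exactly why these strings are safe to pass around.</p>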
<p>It is convenient, until you need to call from Python, for which you’ll need to use SWIG:</p>
<blockquote>
<p>If you don’t know SWIG - just think of it as a tool that generates Python wrappers for C++ code so that it can be called from Python code; in this case, folly::cUnescape. Go to http://www.swig.org/ to learn more. Many languages have an equivalent tool/feature built in: P/Invoke in C#, cgo in Go, JNI in Java, etc.</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>std::string cUnescape(const std::string& a) {
std::string b;
folly::cUnescape(a, b);
return b;
}
</code></pre></div></div>
<p>I was scratching my head trying to understand what was happening, as there is no way the strings should be converted to ‘\udcfa’, until I realized <code class="language-plaintext highlighter-rouge">cUnescape</code> might be at fault.</p>
<p>It turns out, SWIG expects UTF-8 strings and returns UTF-8 strings back. The escaped text “\372” is plain ASCII, so it converts to UTF-8 without any trouble, but once unescaped it becomes the single byte 0xfa, which is not valid UTF-8 on its own and gets decoded with a surrogate escape:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sa">b</span><span class="s">"</span><span class="se">\372</span><span class="s">"</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"surrogateescape"</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
</code></pre></div></div>
<p>And you get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UnicodeEncodeError: 'latin-1' codec can't encode character '\udcfa' in position 0: ordinal not in range(256)
</code></pre></div></div>
<h2 id="the-fix">The fix</h2>
<p>To fix the problem, you can encode the buffer again with surrogateescape:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="sa">b</span><span class="s">"</span><span class="se">\372</span><span class="s">"</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"surrogateescape"</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"surrogateescape"</span><span class="p">).</span><span class="n">decode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
<span class="s">'ú'</span>
</code></pre></div></div>
<p>Seems rather backwards, isn’t it? Why not just stop messing with the strings?</p>
<p>That’s exactly what was discussed in SWIG doc here: http://www.swig.org/Doc4.0/Python.html#Python_nn77. There is a magic macro you can use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%module char_to_bytes
%begin %{
#define SWIG_PYTHON_STRICT_BYTE_CHAR
%}
std::string cUnescape(const std::string& a) {
std::string b;
folly::cUnescape(a, b);
return b;
}
</code></pre></div></div>
<p>And the original code can be changed to:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sql</span> <span class="o">=</span> <span class="n">get_sql_from_some_magic_place</span><span class="p">()</span>
<span class="n">decoded_sql</span> <span class="o">=</span> <span class="n">cUnescape</span><span class="p">(</span><span class="n">sql</span><span class="p">).</span><span class="n">decode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
<span class="n">execute</span><span class="p">(</span><span class="n">decoded_sql</span><span class="p">)</span>
</code></pre></div></div>
<p>Much simpler too.</p>
<p>I’m just happy that I mostly write C++ instead of Python…</p>