Jekyll2021-11-18T23:16:16+00:00http://yizhang82.dev/feed.xmlyizhang82’s blogLanguages, Runtime, Data structures, Databases, and everything in between
Yi Zhangmail@yizhang82.meBloaty: A super handy linux binary analysis2021-02-10T00:00:00+00:002021-02-10T00:00:00+00:00http://yizhang82.dev/bloaty-for-binary-analysis<p><a href="https://github.com/google/bloaty">bloaty</a> is a great tool from Google for binary size analysis. We were recently wondering why our binaries had become so large in production, and bloaty turned out to be exactly the right tool for the job.</p>
<p>For example, if you run it against a release build of bloaty itself, just for fun:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bloaty -d sections ./bloaty
FILE SIZE VM SIZE
-------------- --------------
35.8% 16.2Mi 0.0% 0 .debug_info
25.3% 11.4Mi 0.0% 0 .debug_loc
11.6% 5.26Mi 0.0% 0 .debug_str
6.5% 2.93Mi 0.0% 0 .debug_ranges
6.3% 2.83Mi 42.5% 2.83Mi .rodata
5.7% 2.60Mi 0.0% 0 .debug_line
4.4% 2.00Mi 29.9% 2.00Mi .text
0.0% 0 15.1% 1.01Mi .bss
1.3% 585Ki 0.0% 0 .strtab
1.0% 441Ki 6.5% 441Ki .data
0.7% 316Ki 0.0% 0 .debug_abbrev
0.6% 279Ki 4.1% 279Ki .eh_frame
0.5% 235Ki 0.0% 0 .symtab
0.1% 50.3Ki 0.7% 50.3Ki .eh_frame_hdr
0.1% 46.9Ki 0.7% 46.8Ki .gcc_except_table
0.1% 38.3Ki 0.0% 0 .debug_aranges
0.0% 14.2Ki 0.1% 7.80Ki [24 Others]
0.0% 7.78Ki 0.1% 7.72Ki .dynstr
0.0% 6.20Ki 0.1% 6.14Ki .dynsym
0.0% 4.89Ki 0.1% 4.83Ki .rela.plt
0.0% 3.30Ki 0.0% 3.23Ki .plt
100.0% 45.2Mi 100.0% 6.66Mi TOTAL
</code></pre></div></div>
<p>You can easily tell that most of the size is actually debug information - 79.2% (35.8+25.3+11.6+6.5)! This is a pretty common pattern for C++ binaries. If size becomes an issue, the debug symbols can be offloaded into a separate symbol package and installed on demand for coredump analysis and debugging.</p>
<p>Another interesting analysis you can do is to look at how much each file is contributing to your different sections (text, string, etc). Again, using bloaty itself as an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bloaty -d sections,compileunits ./bloaty
...
4.4% 2.00Mi 29.9% 2.00Mi .text
33.7% 688Ki 33.7% 688Ki [117 Others]
9.4% 193Ki 9.4% 193Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/descriptor.cc
6.2% 125Ki 6.2% 125Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/descriptor.pb.cc
5.6% 115Ki 5.6% 115Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/AArch64/AArch64InstPrinter.c
4.6% 94.6Ki 4.6% 94.6Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/Sparc/SparcInstPrinter.c
4.6% 93.3Ki 4.6% 93.3Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/ARM/ARMDisassembler.c
4.1% 83.3Ki 4.1% 83.3Ki /home/yzha/local/github/bloaty/src/bloaty.cc
3.9% 79.3Ki 3.9% 79.3Ki /home/yzha/local/github/bloaty/third_party/demumble/third_party/libcxxabi/cxa_demangle.cpp
3.8% 78.7Ki 3.8% 78.7Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/PowerPC/PPCInstPrinter.c
3.0% 62.1Ki 3.0% 62.1Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/text_format.cc
2.8% 56.9Ki 2.8% 56.9Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/generated_message_reflection.cc
2.5% 50.1Ki 2.5% 50.1Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/extension_set.cc
2.3% 46.0Ki 2.3% 46.0Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/ARM/ARMInstPrinter.c
2.1% 42.2Ki 2.1% 42.2Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/map_field.cc
2.1% 42.1Ki 2.1% 42.1Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/wire_format.cc
2.0% 40.6Ki 2.0% 40.6Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/SystemZ/SystemZDisassembler.c
1.7% 34.0Ki 1.7% 34.0Ki /home/yzha/local/github/bloaty/src/dwarf.cc
1.5% 30.9Ki 1.5% 30.9Ki /home/yzha/local/github/bloaty/src/elf.cc
1.5% 30.2Ki 1.5% 30.2Ki /home/yzha/local/github/bloaty/third_party/protobuf/src/google/protobuf/repeated_field.cc
1.5% 30.1Ki 1.5% 30.1Ki /home/yzha/local/github/bloaty/third_party/capstone/arch/AArch64/AArch64Disassembler.c
1.3% 27.0Ki 1.3% 27.0Ki /home/yzha/local/github/bloaty/third_party/re2/re2/re2.cc
...
</code></pre></div></div>
<p>It looks like protobuf is a big contributor. Now we can add a source filter to see exactly how much:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./bloaty -d sections,compileunits --source-filter=protobuf ./bloaty
...
100.0% 24.1Mi 100.0% 1013Ki TOTAL
Filtering enabled (source_filter); omitted file = 21.1Mi, vm = 5.67Mi of entries
</code></pre></div></div>
<p>There is a lot of output here, but you can see protobuf contributes 24.1/45.2 = 53% of the size of bloaty itself. If you want, you can also dive into the different sections to see how much each individual file contributes to them.</p>Yi Zhangmail@yizhang82.mestd::atomic vs volatile, disassembled2021-01-30T00:00:00+00:002021-01-30T00:00:00+00:00http://yizhang82.dev/memory-model<p>This came up during a code review: the code was using <code class="language-plaintext highlighter-rouge">volatile</code> to ensure that access to a pointer variable is atomic and serialized, and we were debating whether that is sufficient, in particular:</p>
<ol>
<li>Is it safer to switch to <code class="language-plaintext highlighter-rouge">std::atomic<T></code>, and if so, why?</li>
<li>Is volatile sufficiently safe for a strong memory model CPU like x86?</li>
</ol>
<p>Most of us can probably agree that <code class="language-plaintext highlighter-rouge">std::atomic<T></code> would be safer, but we need to dig a bit deeper to see why it is safer, even on x86.</p>
<!--more-->
<h2 id="what-is-the-difference">What is the difference?</h2>
<p><code class="language-plaintext highlighter-rouge">std::atomic</code> provides atomic access to variables with a choice of memory orderings for store/load, as well as a bunch of multi-threading primitives. The default load and store provide sequentially consistent ordering guarantees.</p>
<p><code class="language-plaintext highlighter-rouge">volatile</code> only prevents compiler optimizations (it may do more depending on the compiler), so a read/write cannot be optimized away in case another thread might modify the variable. But it provides no guarantees at the hardware level, and no memory barrier is guaranteed. Some compilers (such as Visual C++) may insert barriers for you, but that isn’t portable - gcc, for example, gives you no barriers at all.</p>
<h2 id="is-stdatomic-still-required-if-you-have-volatile">Is std::atomic still required if you have volatile?</h2>
<p>To answer this question we need to understand the concept of a memory model. If all memory accesses were sequential in nature and happened exactly as written in code, we wouldn’t be having this discussion. However, in practice, reordering can happen at two levels:</p>
<ul>
<li>compiler - the compiler can reorder / delay accesses or cache variables in registers</li>
<li>hardware - the CPU can reorder reads/writes as long as the single-threaded result <em>appears</em> the same</li>
</ul>
<p><code class="language-plaintext highlighter-rouge">volatile</code> only prevents the compiler-level reordering; the CPU might still reorder operations and/or buffer the reads/writes, so the end result is hardware dependent.</p>
<p>A memory model is how hardware models memory access - what kind of ordering and visibility guarantees it provides. CPUs typically have either a strong memory model (x86, etc.) or a weak memory model (ARM, etc.). <a href="https://preshing.com/20120930/weak-vs-strong-memory-models/">This blog</a> has one of the best descriptions of weak vs strong memory models. x86 falls into the strong memory model category, which means <em>usually</em> every load implies <strong>acquire</strong> semantics and every store implies <strong>release</strong> semantics, but there is no <code class="language-plaintext highlighter-rouge">#StoreLoad</code> ordering guarantee, as <a href="https://preshing.com/20120515/memory-reordering-caught-in-the-act/">observed in this example</a>. To better understand acquire/release semantics, you can refer to <a href="https://preshing.com/20120913/acquire-and-release-semantics/">this post</a>.</p>
<p>So if you want your code to be correct and portable, even on x86, the short answer is that it’s best not to take any chances: use <code class="language-plaintext highlighter-rouge">std::atomic</code>. It’s better to be correct than <em>fast and wrong</em>.</p>
<h2 id="stdatomic-under-the-hood-for-x86">std::atomic under the hood for x86</h2>
<p>But you might wonder - what does <code class="language-plaintext highlighter-rouge">std::atomic<T></code> do for x86 anyway? What is the magic?</p>
<p>It’d be easier to look into this by writing code using <code class="language-plaintext highlighter-rouge">std::atomic<T></code> and looking at the disassembly code.</p>
<p>Suppose we have following code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <atomic>
#include <stdio.h>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">atomic</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">x</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="n">x</span><span class="p">.</span><span class="n">store</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">load</span><span class="p">();</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And let’s compile it with optimization and dump out the disassembly:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>g++ atomic.cc --std=c++11 -O3
objdump --all -d ./a.out > a
</code></pre></div></div>
<p>And the output of main looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000401040 <main>:
401040: 48 83 ec 08 sub $0x8,%rsp
401044: b8 02 00 00 00 mov $0x2,%eax
401049: 87 05 d9 2f 00 00 xchg %eax,0x2fd9(%rip) # 404028 <x>
40104f: bf 10 20 40 00 mov $0x402010,%edi
401054: 31 c0 xor %eax,%eax
401056: c7 05 c8 2f 00 00 03 movl $0x3,0x2fc8(%rip) # 404028 <x>
40105d: 00 00 00
401060: 8b 35 c2 2f 00 00 mov 0x2fc2(%rip),%esi # 404028 <x>
401066: e8 c5 ff ff ff callq 401030 <printf@plt>
40106b: 8b 35 b7 2f 00 00 mov 0x2fb7(%rip),%esi # 404028 <x>
401071: bf 10 20 40 00 mov $0x402010,%edi
401076: 31 c0 xor %eax,%eax
401078: e8 b3 ff ff ff callq 401030 <printf@plt>
40107d: 31 c0 xor %eax,%eax
40107f: 48 83 c4 08 add $0x8,%rsp
401083: c3 retq
</code></pre></div></div>
<p>For the first <code class="language-plaintext highlighter-rouge">store(2, std::memory_order_seq_cst)</code> (the default) in x86, gcc made it a full barrier using xchg instruction which has a <a href="https://stackoverflow.com/questions/9027590/do-we-need-mfence-when-using-xchg">implicit lock prefix</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401049: 87 05 d9 2f 00 00 xchg %eax,0x2fd9(%rip) # 404028 <x>
</code></pre></div></div>
<p>Here the source is <code class="language-plaintext highlighter-rouge">%eax</code> = 2, the target of the move is address <code class="language-plaintext highlighter-rouge">rip</code> (=next instruction 0x40104f) + 0x2fd9 offset = 0x404028, which is the location of the global variable <code class="language-plaintext highlighter-rouge">x</code>.</p>
<p>If you are wondering about the behavior of <code class="language-plaintext highlighter-rouge">std::atomic<T>::operator =</code> - it is the equivalent of <code class="language-plaintext highlighter-rouge">store(std::memory_order_seq_cst)</code>.</p>
<blockquote>
<p>In some compilers you may get <code class="language-plaintext highlighter-rouge">mfence</code> which is <em>the</em> full barrier instruction in x86 CPU, so the end result is the same.</p>
</blockquote>
<p>Now on to the second <code class="language-plaintext highlighter-rouge">store(3, std::memory_order_release)</code>. Recall that under x86 every store has release semantics, so the code is just a normal mov:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401056: c7 05 c8 2f 00 00 03 movl $0x3,0x2fc8(%rip) # 404028 <x>
</code></pre></div></div>
<p>Now let’s look at reads.</p>
<p>For the first <code class="language-plaintext highlighter-rouge">load(std::memory_order_seq_cst)</code> (the default): given that under sequential consistency a write already publishes its result to all cores with a full memory barrier, there is nothing extra to do on the load side. It is just a regular read - reading the memory location into <code class="language-plaintext highlighter-rouge">esi</code>, which is the 2nd argument to printf per the <a href="https://raw.githubusercontent.com/wiki/hjl-tools/x86-psABI/x86-64-psABI-1.0.pdf">linux SystemV x64 ABI</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401060: 8b 35 c2 2f 00 00 mov 0x2fc2(%rip),%esi # 404028 <x>
</code></pre></div></div>
<p>For the 2nd <code class="language-plaintext highlighter-rouge">load(std::memory_order_acquire)</code>: again, recall that in x86 every load implicitly has acquire semantics, so it is also just a regular read:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 40106b: 8b 35 b7 2f 00 00 mov 0x2fb7(%rip),%esi # 404028 <x>
</code></pre></div></div>
<h2 id="what-if-this-is-volatile">What if this is volatile?</h2>
<p>If we replace the atomic with a volatile:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <atomic>
#include <stdio.h>
</span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="k">volatile</span> <span class="kt">int</span> <span class="nf">x</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"%d"</span><span class="p">,</span> <span class="n">y</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The result code looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000401040 <main>:
401040: 48 83 ec 08 sub $0x8,%rsp
401044: bf 10 20 40 00 mov $0x402010,%edi
401049: 31 c0 xor %eax,%eax
40104b: c7 05 d3 2f 00 00 02 movl $0x2,0x2fd3(%rip) # 404028 <x>
401052: 00 00 00
401055: c7 05 c9 2f 00 00 03 movl $0x3,0x2fc9(%rip) # 404028 <x>
40105c: 00 00 00
40105f: 8b 35 c3 2f 00 00 mov 0x2fc3(%rip),%esi # 404028 <x>
401065: e8 c6 ff ff ff callq 401030 <printf@plt>
40106a: 31 c0 xor %eax,%eax
40106c: 48 83 c4 08 add $0x8,%rsp
401070: c3 retq
</code></pre></div></div>
<p>You can see the <code class="language-plaintext highlighter-rouge">xchg</code> becomes a simple <code class="language-plaintext highlighter-rouge">movl</code> as volatile doesn’t guarantee any ordering - it only prevents compiler optimization. What optimization, you might ask? Let’s see what happens when we remove the <code class="language-plaintext highlighter-rouge">volatile</code>.</p>
<h2 id="taking-out-the-volatile">Taking out the volatile</h2>
<p>Now let’s just take out the volatile keyword, and see what we would get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0000000000401040 <main>:
401040: 48 83 ec 08 sub $0x8,%rsp
401044: be 03 00 00 00 mov $0x3,%esi
401049: bf 10 20 40 00 mov $0x402010,%edi
40104e: 31 c0 xor %eax,%eax
401050: c7 05 ce 2f 00 00 03 movl $0x3,0x2fce(%rip) # 404028 <x>
401057: 00 00 00
40105a: e8 d1 ff ff ff callq 401030 <printf@plt>
40105f: 31 c0 xor %eax,%eax
401061: 48 83 c4 08 add $0x8,%rsp
401065: c3 retq
</code></pre></div></div>
<p>You might have already noticed two significant differences:</p>
<ul>
<li>The assignment <code class="language-plaintext highlighter-rouge">x=2</code> is completely gone - the compiler knows there are no side effects to the <code class="language-plaintext highlighter-rouge">x=2</code> assignment, so it is free to optimize it away</li>
<li>The read is completely gone; instead the compiler assigns <code class="language-plaintext highlighter-rouge">%esi = 3</code> for printf from the get-go:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 401044: be 03 00 00 00 mov $0x3,%esi
</code></pre></div></div>
<p>Again, the compiler is free to optimize away the load because nothing else is going to change <code class="language-plaintext highlighter-rouge">x</code> in between, so it can simply replace <code class="language-plaintext highlighter-rouge">x</code> with 3 in the printf call.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Multi-threading, memory models, and barriers are complicated topics, but hopefully this gives you a good starting point. Even a seemingly simple question like "what is the difference between <code class="language-plaintext highlighter-rouge">volatile</code> and <code class="language-plaintext highlighter-rouge">atomic</code>" can be quite confusing, and the fact that different compilers do different things for volatile makes it more confusing still (VC++, for example, offers a stronger guarantee for volatile, making it a full barrier). If you are still hungry for more, the <a href="https://www.kernel.org/doc/Documentation/memory-barriers.txt">Linux Kernel Memory Barrier Doc</a> has great details that every programmer who does lock-free multi-threaded programming, or wants to understand the details, should probably read. At the end of the day, a good understanding of compilers, assembly code, and computer/CPU architecture goes a long way for systems programmers.</p>Yi Zhangmail@yizhang82.mePaper Reading: In Search of an Understandable Consensus Algorithm (Extended Version)2021-01-16T00:00:00+00:002021-01-16T00:00:00+00:00http://yizhang82.dev/paper-raft<p><a href="https://raft.github.io/raft.pdf">This paper</a> is <em>the</em> paper to read about the <em>Raft consensus algorithm</em>, and a good way to build intuition for consensus algorithms in general.
The “consensus” about consensus algorithms is that they are hard to understand, build, and test, so not surprisingly an understandable consensus algorithm has a lot of value for system builders. I think Raft is designed for today’s mainstream single-leader, multi-follower, log-replicated state machine model, so it is a great starting point for building a practical distributed system. I’ve read about Raft before, but this is the first time I went through the paper in full. I must admit I find Paxos unintuitive and hard to follow, and I might give Paxos/Multi-Paxos a go some other time. Meanwhile, Raft is something I can get behind and feel comfortable with - and that is saying something.</p>
<!--more-->
<h2 id="overview">Overview</h2>
<p>Paxos is quite difficult to understand and requires complex changes to support practical systems. Raft is designed to be significantly easier to understand than Paxos, similar in spirit to Viewstamped Replication, but with some novel features:</p>
<ul>
<li>Strong leader with single direction of flow</li>
<li>Leader election with randomized timers</li>
<li>Membership changes with <em>joint consensus</em></li>
</ul>
<p>Consensus algorithms typically operate on a collection of state machines computing identical copies of the same state. They are usually implemented with a replicated log, where each state machine executes the commands from its log in order. State machines are deterministic in nature, so they all produce exactly the same state.</p>
<p>Paxos has become almost synonymous with consensus (at the time of writing). Paxos first defines a protocol capable of reaching agreement on a single decision, referred to as <em>Single-Decree Paxos</em>, and then combines multiple instances to facilitate a series of decisions. Paxos ensures both safety and liveness, but it has two main drawbacks:</p>
<ul>
<li>Exceptionally difficult to understand. From the paper:
<blockquote>
<p>In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year</p>
</blockquote>
</li>
<li>Not a good foundation for building practical implementations, mainly because multi-Paxos is not sufficiently specified; as a result, practical systems bear little resemblance to Paxos. One comment from a Chubby implementer is typical:
<blockquote>
<p>There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. . . . the final system will be based on an unproven protocol.</p>
</blockquote>
</li>
</ul>
<p>For these reasons, the authors designed an alternative consensus algorithm - and that is Raft. Raft is designed for understandability:</p>
<ul>
<li>Decomposing the problem into pieces that can be understood and explained independently, such as leader election, log replication, safety, and membership changes.</li>
<li>Simplifying the problem space by placing constraints and reducing states, such as disallowing holes in logs.</li>
</ul>
<h2 id="raft-consensus-algorithm">Raft Consensus Algorithm</h2>
<p>Raft implements consensus by first electing a leader who is responsible for managing the replicated log. The consensus problem can therefore be broken down into 3 sub-problems:</p>
<ul>
<li>Leader election - leader must be elected</li>
<li>Log replication - leader replicates logs across cluster</li>
<li>Safety
<ul>
<li><strong>Election safety</strong>: at most one leader can be elected in a
given term</li>
<li><strong>Leader Append-Only</strong>: a leader never overwrites or deletes entries in its log; it only appends new entries</li>
<li><strong>Log Matching</strong>: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.</li>
<li><strong>Leader Completeness</strong>: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms.</li>
<li><strong>State Machine Safety</strong>: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.</li>
</ul>
</li>
</ul>
<h3 id="the-basics">The basics</h3>
<p>Raft server is in one of 3 states:</p>
<ul>
<li><strong>Leader</strong> - accept client requests</li>
<li><strong>Follower</strong> - accept requests from leaders</li>
<li><strong>Candidate</strong> - used to elect new leader</li>
</ul>
<p>Raft divides time into terms marked with monotonically increasing integers. Each term begins with an election, where one or more candidates attempt to become leader.</p>
<p>Following diagram shows possible state transitions:</p>
<p><img src="imgs/paper-raft-3.png" alt="State transitions" /></p>
<p>If a candidate wins, it becomes leader for the entire term. Otherwise, in the case of a split vote, the term ends with no leader. There is at most one leader in a given term. The term serves as a <em>logical clock</em> in Raft - each server maintains a current term that is monotonically increasing and is exchanged whenever servers communicate; if a stale term is detected, the server updates to the larger value. Servers reject requests carrying a stale term.</p>
<p>Raft servers communicate with mainly two kinds of RPC:</p>
<ul>
<li>RequestVote RPC - initiated by candidates for leader election</li>
<li>AppendEntries RPC - initiated by leaders to replicate log entries to followers and to provide heartbeats</li>
</ul>
<p>There is also a third RPC for transferring snapshots. RPCs are issued in parallel and are retried if no response is received in time.</p>
<h3 id="leader-election">Leader election</h3>
<p>Servers start up as followers, and stay followers as long as they keep receiving valid RPCs from a leader or candidate. The leader sends periodic empty AppendEntries RPCs as heartbeats, so if a follower doesn’t receive a heartbeat within a period of time (called the <em>election timeout</em>), it starts a leader election.</p>
<p>To start an election, the follower increments its current term and transitions to the candidate state, then votes for itself and issues RequestVote RPCs in parallel to all the other servers.</p>
<p>A candidate wins the election if it receives votes from a majority of the servers. A server can hand out its vote to only one candidate in a given term, on a first-come-first-served basis. Once a candidate wins the election, it becomes leader and sends empty AppendEntries RPCs to all followers to establish authority and prevent new elections.</p>
<p>If a candidate receives an AppendEntries RPC from another server claiming to be leader, it accepts that server as leader only if the claimed leader’s term is >= its own term, and returns to the follower state. Otherwise it rejects the RPC.</p>
<p>If many followers time out and become candidates at the same time, a split vote is possible and no one wins the election. In this case a new term of election is started. However, without extra measures the voting can repeat indefinitely, or only complete by luck. This is why Raft uses randomized election timeouts (for example, 150-300ms): followers time out and become candidates at different times, so split votes are rare, and when one does happen, each candidate restarts its own election at a different time.</p>
<p>The randomized approach might seem a bit naive at first glance, but the authors have debated a few different approaches and concluded that randomized timeout is the easiest to understand and prove correct:</p>
<blockquote>
<p>From the paper:</p>
<p>Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.</p>
</blockquote>
<h3 id="log-replication">Log Replication</h3>
<p>Once a client sends a request to the leader, the leader sends AppendEntries RPCs to all followers in parallel to replicate the log entry. Once the entry is safely replicated, the leader applies it to its own state machine and returns the result of that execution to the client. The RPC is retried indefinitely until all followers have ACKed.</p>
<p>A Raft log entry consists of (term, index, operation). The leader decides when it is safe to apply a log entry to the state machines; such an entry becomes <em>committed</em>. <strong>Raft guarantees that committed entries are durable and will eventually be applied by all available state machines.</strong> A log entry is committed once it is replicated to a majority of the servers, and all preceding entries are then considered committed as well. Once a follower learns that a log entry is committed, it applies the entry to its own local state machine in log order.</p>
<p><img src="/imgs/paper-raft-1.png" alt="Logs" /></p>
<blockquote>
<p>This implies that read latency is higher in a Raft consensus system, as a follower only learns that an entry is committed later, usually via the next AppendEntries RPC (either a real user request or a heartbeat).</p>
</blockquote>
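<p>The commit rule - an entry is committed once a majority of servers stores it - can be sketched as below. <code class="language-plaintext highlighter-rouge">match_index</code> and the helper name are hypothetical, not from the paper:</p>

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch: a leader can commit the highest index that a
// majority of the cluster stores. match_index holds, for every server
// (including the leader itself), the highest log index known replicated
// there.
uint64_t highest_majority_index(std::vector<uint64_t> match_index)
{
    // Sort descending; the entry at position n/2 is stored by a majority
    // (that position plus everything before it is more than half the servers).
    std::sort(match_index.begin(), match_index.end(), std::greater<uint64_t>());
    return match_index[match_index.size() / 2];
}
```

<p>Note that, per the safety rule discussed later, a real leader would only advance its commit index to such an entry if the entry is from its own current term.</p>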
<p>Raft maintains the <strong>Log Match Property</strong>:</p>
<ul>
<li>If two entries in different logs have the same index and term, then they store the same command.</li>
<li>If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.</li>
</ul>
<blockquote>
<p>This property makes Raft logs much easier to understand and reason about its correctness.</p>
</blockquote>
<p>After a leader crashes, follower logs may become inconsistent with the new leader’s log. The paper discusses a few scenarios that we won’t repeat here. Raft handles inconsistencies by forcing follower logs to duplicate the leader’s log - so conflicting entries in a follower’s log get overwritten. This is done by finding the latest log entry on which the follower agrees with the leader, deleting everything in the follower’s log after that point, and then sending all of the leader’s entries after it. To achieve this, the leader maintains a <em>nextIndex</em> for each follower and keeps sending AppendEntries RPCs, decrementing nextIndex on each rejection until the logs agree; at that point the follower’s log after nextIndex is deleted. This can be further optimized by having the AppendEntries RPC response return the conflicting term and the first index of that term, so that the leader can skip entire conflicting terms.</p>
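<p>A minimal sketch of the follower-side consistency check described above - all names are mine, and a real implementation tracks considerably more state:</p>

```cpp
#include <cstdint>
#include <vector>

struct log_entry { uint64_t term; int op; };

// Hypothetical sketch of the follower side of AppendEntries: the RPC is
// rejected unless the follower's log contains an entry at prev_log_index
// with prev_log_term; on success, any suffix after that point is deleted
// and the leader's entries are appended, forcing the follower's log to
// duplicate the leader's.
bool append_entries(std::vector<log_entry> &log,
                    uint64_t prev_log_index,   // 1-based; 0 = "before first entry"
                    uint64_t prev_log_term,
                    const std::vector<log_entry> &entries)
{
    if (prev_log_index > log.size())
        return false;   // hole in the log - leader will decrement nextIndex
    if (prev_log_index > 0 && log[prev_log_index - 1].term != prev_log_term)
        return false;   // conflicting entry - leader will decrement nextIndex
    log.resize(prev_log_index);                      // drop the suffix
    log.insert(log.end(), entries.begin(), entries.end());
    return true;
}
```

<p>When this returns false, the leader decrements nextIndex for that follower and retries with an earlier prev_log_index until the check passes.</p>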
<p>If a candidate / follower crashes, the leader simply retries indefinitely. Raft RPCs are idempotent, so if the server had already appended the log entry but failed to ACK, the retried request is just ignored.</p>
<h3 id="election-safety">Election Safety</h3>
<p>To prevent a stale follower overwriting committed entries, there must be further restrictions on leader election.</p>
<p>A candidate cannot win an election unless its log contains all committed entries. It sends the (term, index) of its latest log entry in the RequestVote RPC, and other servers reject the request if their own latest log entry has a larger (term, index) and is therefore more up to date.</p>
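<p>The up-to-date comparison can be sketched as a small helper (hypothetical signature, not the paper’s pseudocode): terms are compared first, and the index only breaks ties.</p>

```cpp
#include <cstdint>

// Hypothetical sketch of the check a server applies before granting its
// vote: the candidate's last entry must have a higher term, or the same
// term and at least as high an index, as the voter's own last entry.
bool candidate_log_up_to_date(uint64_t my_last_term, uint64_t my_last_index,
                              uint64_t cand_last_term, uint64_t cand_last_index)
{
    if (cand_last_term != my_last_term)
        return cand_last_term > my_last_term;   // term dominates
    return cand_last_index >= my_last_index;    // same term: longer log wins
}
```
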
<p>It’s also possible for a leader to replicate previous-term log entries to stale followers until they are stored on a majority; but counting such entries as committed before an entry from the current term commits runs the risk of having them overwritten if the leader crashes before then. So committing log entries from previous terms is deferred until an entry from the current term commits. Other consensus algorithms try to address this by “fixing” prior-term entries up to the latest term, but Raft keeps things simple by making log entries immutable and retaining their original term number.</p>
<p>Section 5.4.3 (Safety Argument) of the paper proves the correctness of this scheme. Feel free to refer to the paper for more details.</p>
<h2 id="cluster-membership-changes">Cluster Membership Changes</h2>
<p>When cluster membership changes (adding/removing servers, etc.), it is important to prevent having two leaders at the same time under the old and new configurations. This needs a two-phase approach - Raft first switches to a joint consensus configuration that combines old and new, and once the joint consensus configuration has committed, Raft transitions to the new configuration.</p>
<p>In joint consensus,</p>
<ul>
<li>Log entries are replicated to both configurations</li>
<li>Any server from either configuration may serve as leader</li>
<li>Agreement requires separate majorities from both the old and the new configuration - this means a log entry must be replicated to a majority in each configuration</li>
</ul>
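<p>The dual-majority rule above can be sketched as follows; the server-id vectors and helper names are hypothetical:</p>

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the joint-consensus agreement rule: an entry (or
// an election) only succeeds if it gathers a separate majority in BOTH the
// old and the new configuration. acks holds ids of servers that responded.
static bool majority_in(const std::vector<int> &config, const std::vector<int> &acks)
{
    std::size_t count = 0;
    for (int id : config)
        for (int a : acks)
            if (a == id) { ++count; break; }
    return count > config.size() / 2;
}

bool joint_agreement(const std::vector<int> &old_cfg,
                     const std::vector<int> &new_cfg,
                     const std::vector<int> &acks)
{
    return majority_in(old_cfg, acks) && majority_in(new_cfg, acks);
}
```

<p>Notice that a set of ACKs forming a majority of the union is not enough - each configuration must reach its own majority independently.</p>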
<p>Note the leader during joint consensus might not be part of the new cluster configuration. In that case it doesn’t count itself toward the majority but still replicates to both majorities, and steps down once the new configuration entry commits.</p>
<blockquote>
<p>I’m wondering if it’d be easier to force the new leader to be in the intersection of the old and new configurations.</p>
</blockquote>
<p>When new servers join the cluster, it might take a while for them to replicate all entries, during which new entries might not be able to commit. So they first join as non-voting members that receive replicated entries but don’t count toward majorities, until they are sufficiently caught up.</p>
<p>When servers are removed from the cluster, they stop receiving heartbeats, so they would start elections and disrupt cluster availability. To prevent this problem, servers disregard RequestVote RPCs received within a minimum election timeout of last hearing from a leader - in that case the leader is considered alive. Note this isn’t the server’s own election timeout (it would revert to candidate when that fires) but rather a minimum “safe” election timeout; every server’s election timeout is at least as large as this minimum.</p>
<h2 id="other-practical-considerations">Other practical considerations</h2>
<p>In any log-based system the log can’t grow unbounded, so Raft needs to be able to compact its logs. In theory each server can just snapshot its committed entries, but for a slow follower or a new server the leader has to send its snapshot over with an InstallSnapshot RPC. In practical terms this means discarding the follower’s state entirely, copying the leader’s entire state over either physically or logically, and deleting the logs it covers. This is not so different from compaction in incremental logging systems such as LSM trees.</p>
<p>When a client interacts with the cluster for the first time, it sends its request to a random server. If that server isn’t the leader, it rejects the request or forwards it to the correct leader. It is also possible for the leader to crash after committing a request but before ACKing it, in which case the client retries the write on a new leader that may already have the entry. The client therefore tags each request with a serial number so that the new leader can detect the duplicate and return immediately.</p>
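<p>A sketch of the serial-number deduplication, with entirely hypothetical names; a real state machine would execute the operation itself rather than receive a precomputed result:</p>

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch: the state machine remembers, per client, the serial
// number of the last request it applied together with its response, so a
// retried request is answered from the cache without being applied twice.
struct session_table {
    std::unordered_map<uint64_t, std::pair<uint64_t, std::string>> last; // client -> (serial, response)

    // Returns the cached response on a duplicate; otherwise records and
    // returns the (pre-computed, for this sketch) result of applying the op.
    std::string apply(uint64_t client, uint64_t serial, const std::string &result)
    {
        auto it = last.find(client);
        if (it != last.end() && it->second.first == serial)
            return it->second.second;   // duplicate - don't re-apply
        last[client] = {serial, result};
        return result;
    }
};
```
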
<p>For read-only queries, the leader can only safely return data once it has committed its first blank AppendEntries entry of the term, so that it knows exactly which entries are committed - it is still possible for uncommitted log entries from an earlier term to become committed later (as part of catching up other followers), or for uncommitted changes to be discarded when another leader gets elected without them. It is also possible that the leader doesn’t know a new leader has been elected elsewhere, so before responding it needs to confirm its leadership by exchanging heartbeats with a majority of the cluster. Alternatively a lease-based approach can be used, but that requires bounded clock skew.</p>
<h2 id="my-closing-thoughts">My closing thoughts</h2>
<p>The Raft protocol has definitely delivered on its promise of being a practical and understandable consensus protocol - the proliferation of implementations in various languages has already proven that. And there are already many systems using Raft in production such as <a href="https://github.com/etcd-io/etcd">Kubernetes/etcd</a>, <a href="https://github.com/cockroachdb/cockroach">CockroachDB</a>, <a href="https://github.com/tikv/tikv">TiKV</a>, etc. There is even <a href="https://www.percona.com/live/18/sessions/how-to-make-mysql-work-with-raft">Raft support for MySQL from Alibaba</a>. It’d be interesting to see how Raft performs in real production systems and how well it scales in practice.</p>Yi Zhangmail@yizhang82.meThis paper is the paper to read about Raft consensus algorithm and a good way to build intuition for consensus algorithms in general. The “consensus” about consensus algorithms is that they are hard to understand / build / test, and not surprisingly having an understandable consensus algorithm has a lot of value for system builders. I think Raft is designed for today’s mainstream single leader multi-follower log-replicated state machine model so it is a great starting point for building a practical distributed system around it. I’ve read about raft before but this is the first time I went through the paper in full. I must admit I find Paxos not intuitive and hard to follow as well and I might give Paxos/Multi-Paxos a go some other time. Meanwhile Raft is something I can get behind and feel comfortable with. And that is saying something.Writing your own NES emulator Part 3 - the 6502 CPU2021-01-10T00:00:00+00:002021-01-10T00:00:00+00:00http://yizhang82.dev/nes-emu-cpu<p>It’s been a while since the <a href="/nes-emu-main-loop">last update</a> - I was mostly focusing on database technologies. 
The beginning of 2021 was a bit slow (that’s when many big companies start their annual / semi-annual review process), so I had a bit of time to write up this post about 6502 CPU emulation. All the code referenced in this post is in my simple NES emulator GitHub repo <a href="https://github.com/yizhang82/neschan">NesChan</a>. It’s fun to go back and look at my old code and the 6502 CPU wiki.</p>
<h2 id="the-6502-cpu">The 6502 CPU</h2>
<p>The NES uses the 8-bit <a href="https://en.wikipedia.org/wiki/MOS_Technology_6502">6502 CPU</a> with a 16-bit address bus, meaning it can address the memory range 0x0000~0xffff - not much, but more than enough for games back in the 80s with charming dots and sprites. It was used in a surprisingly large range of famous vintage computers/consoles like the Apple I, Apple II, Atari, Commodore 64, and of course the NES. The variant used by the NES is a stock 6502 without decimal mode support, running at 1.79MHz (the PAL version runs at 1.66MHz). It has 3 general purpose registers A/X/Y, and 3 special registers P (status) / SP (stack pointer) / PC (program counter, or instruction pointer), all of them 8-bit except PC, which is 16-bit. The NES dev wiki has a <a href="http://wiki.nesdev.com/w/index.php/CPU">great section on the 6502 CPU</a> with a lot more details, and we’ll be covering the most important aspects in the remainder of this article.</p>
<!--more-->
<p>To emulate the CPU, the main loop would look something like this:</p>
<ol>
<li>We start at a memory location by setting the current <em>program counter</em> (also known as the instruction pointer in other architectures) <strong>PC</strong> to that location</li>
<li>Check if we have reached a special end condition (end of program, <strong>BRK</strong> instruction, infinite loop, etc…); if so, terminate execution</li>
<li>Decode CPU instruction at current <strong>PC</strong></li>
<li>Set instruction pointer to next instruction</li>
<li>Fetch data as per memory access mode</li>
<li>Execute instruction with data fetched</li>
<li>Move to the next instruction by going back to 2</li>
</ol>
<p>The most interesting aspects are instruction decoding, memory access modes, and instruction execution. Let’s look at these one by one.</p>
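<p>The loop above can be sketched like this - a toy, not NesChan’s actual code, where only <code class="language-plaintext highlighter-rouge">BRK</code> is decoded so the loop can terminate:</p>

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of the emulator main loop: fetch the opcode at PC,
// advance PC, then decode/execute. Only BRK (0x00) is handled here as the
// end condition; a real CPU core dispatches on every opcode.
struct toy_cpu {
    uint16_t PC = 0;
    std::vector<uint8_t> mem = std::vector<uint8_t>(0x10000);  // 64K address space

    // Returns false once the end condition (BRK) is reached.
    bool step_instruction()
    {
        uint8_t op = mem[PC++];          // step 3+4: decode at PC, advance PC
        if (op == 0x00) return false;    // step 2: BRK - stop execution
        // steps 5+6: fetch operands per addressing mode, execute...
        return true;
    }

    int run(uint16_t start)
    {
        PC = start;                      // step 1: set PC to the start location
        int executed = 0;
        while (step_instruction())       // step 7: loop back
            ++executed;
        return executed;
    }
};
```
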
<h2 id="decoding-the-instructions">Decoding the instructions</h2>
<p>Assembly instructions are usually encoded with 3-character mnemonics, and they typically perform very low level hardware-related operations supported by the CPU, keeping the CPU simple and cheap. That’s why assembly instructions are considered <em>low level</em>. High-level language statements are usually compiled down to one or more CPU instructions, with the help of a compiler. This is a perfect example of layering.</p>
<p>Let’s just take a look at a few examples of what 6502 CPU can do:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">DEC</code>, <code class="language-plaintext highlighter-rouge">DEX</code>, <code class="language-plaintext highlighter-rouge">DEY</code> for decrementing memory, X register, Y register, respectively</li>
<li><code class="language-plaintext highlighter-rouge">JMP</code> for jumping to a particular address to keep executing code</li>
<li><code class="language-plaintext highlighter-rouge">LDA</code>, <code class="language-plaintext highlighter-rouge">LDX</code>, <code class="language-plaintext highlighter-rouge">LDY</code> for loading a value into the A/X/Y register respectively, with the source determined by the memory address mode</li>
<li><code class="language-plaintext highlighter-rouge">ADC</code>, <code class="language-plaintext highlighter-rouge">SBC</code> for addition / subtraction using the A register (accumulator) and specified memory location, so basically A += M and A-= M, taking carry flag into account as well</li>
</ul>
<p>If you are interested to know more, you can go to <a href="http://obelisk.me.uk/6502/reference.html">this page</a> for a list of common 6502 CPU instructions and what they do.</p>
<p>Before executing an instruction, you need to look up the bytes in memory and figure out which instruction they represent, what its arguments are, etc. This is called <em>decoding</em>. Fortunately, 6502 opcodes are always a single byte - only the arguments that follow differ by memory access mode. This makes decoding much easier - we just need a big table of all instructions and then call the right helper function based on the opcode byte!</p>
<blockquote>
<p>We’ll look at memory access modes later. For now you just need to know they indicate where the actual data comes from, while the instruction itself is the <em>operation</em>. Instructions typically support multiple memory access modes so that they can operate on data from different locations using different methods, whether a register, memory, etc.</p>
</blockquote>
<p>In order to build the table, it’s useful to visualize it by looking at the following table from <a href="http://wiki.nesdev.com/w/index.php/CPU_unofficial_opcodes">nesdev unofficial opcodes wiki</a>:</p>
<p><img src="/imgs/nes-emu-cpu-1.png" alt="img" /></p>
<p>But in order to see the patterns a bit better, let’s re-arrange it:</p>
<p><img src="/imgs/nes-emu-cpu-2.png" alt="img" /></p>
<p>You can see the ALU (green ones, that does math operations) and the RMW (blue ones, = Read Modify Write) instructions follow a very clear pattern, while the red (mostly control instructions) and gray (unofficial / undocumented instructions) are sort of all over the place.</p>
<p>To keep things simple (and make modification easier - I was still learning the instructions and didn’t want to redo everything when I misunderstood something), the current implementation goes with a switch/case approach built from macros. This could easily be updated to use a real table of helper function pointers. You might think an explicit jump table would be faster, but the reality is a bit more complicated: the compiler can easily turn the switch into a jump table that jumps directly into inlined versions of the helper functions, which ends up faster than a function-pointer table. Such inlining is more difficult (though not impossible) with function pointers. Either way, since I’m optimizing to run NES games rather than a benchmark, I didn’t care too much about performance.</p>
<p>For example, for ALU instructions we use this macro in <a href="https://github.com/yizhang82/neschan/blob/master/lib/src/nes_cpu.cpp">nes_cpu.cpp</a>:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define IS_ALU_OP_CODE_(op, offset, mode) \
case nes_op_code::op##_base + offset : \
NES_TRACE4(get_op_str(#op, nes_addr_mode::nes_addr_mode_##mode)); \
op(nes_addr_mode::nes_addr_mode_##mode); \
break;
</span></code></pre></div></div>
<p>This defines a <code class="language-plaintext highlighter-rouge">case</code> statement for a variant of instruction <code class="language-plaintext highlighter-rouge">op</code>. For example, for ADC, offset 0x9 is ADC with immediate memory access mode. We’ll be calling to the <code class="language-plaintext highlighter-rouge">op</code> helper function for executing the code with the corresponding memory access mode. <code class="language-plaintext highlighter-rouge">NES_TRACE4</code> is for logging and we can ignore that for now.</p>
<p>And for each particular ALU instruction, we define 8 variants of all memory access patterns based on the table earlier:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define IS_ALU_OP_CODE(op) \
IS_ALU_OP_CODE_(op, 0x9, imm) \
IS_ALU_OP_CODE_(op, 0x5, zp) \
IS_ALU_OP_CODE_(op, 0x15, zp_ind_x) \
IS_ALU_OP_CODE_(op, 0xd, abs) \
IS_ALU_OP_CODE_(op, 0x1d, abs_x) \
IS_ALU_OP_CODE_(op, 0x19, abs_y) \
IS_ALU_OP_CODE_(op, 0x1, ind_x) \
IS_ALU_OP_CODE_(op, 0x11, ind_y)
</span></code></pre></div></div>
<p>For example, ADC + 0x9 is immediate mode, ADC + 0x5 is zero page mode, etc.</p>
<p>Then we can support a series of ALU instructions easily with these macros:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">ADC</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">AND</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">CMP</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">EOR</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">ORA</span><span class="p">)</span>
<span class="n">IS_ALU_OP_CODE</span><span class="p">(</span><span class="n">SBC</span><span class="p">)</span>
</code></pre></div></div>
<p>Take a simple instruction as an example, the code looks like follows:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Logical AND</span>
<span class="kt">void</span> <span class="n">nes_cpu</span><span class="o">::</span><span class="n">AND</span><span class="p">(</span><span class="n">nes_addr_mode</span> <span class="n">addr_mode</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">operand_t</span> <span class="n">op</span> <span class="o">=</span> <span class="n">decode_operand</span><span class="p">(</span><span class="n">addr_mode</span><span class="p">);</span>
<span class="kt">uint8_t</span> <span class="n">val</span> <span class="o">=</span> <span class="n">read_operand</span><span class="p">(</span><span class="n">op</span><span class="p">);</span>
<span class="n">A</span><span class="p">()</span> <span class="o">&=</span> <span class="n">val</span><span class="p">;</span>
<span class="c1">// flags </span>
<span class="n">calc_alu_flag</span><span class="p">(</span><span class="n">A</span><span class="p">());</span>
<span class="c1">// cycle count</span>
<span class="n">step_cpu</span><span class="p">(</span><span class="n">get_cpu_cycle</span><span class="p">(</span><span class="n">op</span><span class="p">,</span> <span class="n">addr_mode</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<ul>
<li><code class="language-plaintext highlighter-rouge">decode_operand</code> is responsible for decoding the following bytes based on the address mode, and return the access pattern in <code class="language-plaintext highlighter-rouge">operand_t</code></li>
<li>Next we read the operand using <code class="language-plaintext highlighter-rouge">op</code> into <code class="language-plaintext highlighter-rouge">val</code>. Decoding and reading are separate steps because some instructions read, write, or do both, so it is useful to split them into different helpers.</li>
<li>Once we have read the value, as per the AND instruction, we <code class="language-plaintext highlighter-rouge">AND</code> the accumulator A register with <code class="language-plaintext highlighter-rouge">val</code> and write it back. Note we have helpers that return registers (which really are just variables) by reference, so the code reads quite naturally:</li>
</ul>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="kt">uint8_t</span> <span class="o">&</span><span class="n">A</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">A</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">X</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">X</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">Y</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">Y</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint16_t</span> <span class="o">&</span><span class="n">PC</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">PC</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">P</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">P</span><span class="p">;</span> <span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">&</span><span class="n">S</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="n">_context</span><span class="p">.</span><span class="n">S</span><span class="p">;</span> <span class="p">}</span>
</code></pre></div></div>
<ul>
<li>Based on the result in A, we update the ALU zero/negative flags accordingly. Flags are typically checked at the beginning of an instruction and updated at the end, usually for math operations (carry flag) or control flow (jump if zero). For a full list of flags you can refer to <a href="http://wiki.nesdev.com/w/index.php/Status_flags">this list</a>.</li>
<li>Finally, we simulate the passing of CPU cycles (or rather, time). This is important for emulation accuracy, as many games rely on this for timing - especially to synchronize with PPU cycles! Now that’s what we call <em>real</em> programmers.</li>
</ul>
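<p>As an illustration of the flag-update step, here is a standalone sketch; NesChan’s <code class="language-plaintext highlighter-rouge">calc_alu_flag</code> is a member function operating on the CPU context, so the free-function signature here is hypothetical:</p>

```cpp
#include <cstdint>

// On the 6502, the zero flag lives in bit 1 of the status register P and
// the negative flag in bit 7 (mirroring bit 7 of the result).
const uint8_t FLAG_ZERO     = 0x02;
const uint8_t FLAG_NEGATIVE = 0x80;

// Hypothetical free-function version: takes the current P and the ALU
// result, returns P with the zero/negative flags recomputed.
uint8_t calc_alu_flag(uint8_t p, uint8_t result)
{
    p &= uint8_t(~(FLAG_ZERO | FLAG_NEGATIVE));  // clear both flags first
    if (result == 0)
        p |= FLAG_ZERO;
    if (result & 0x80)
        p |= FLAG_NEGATIVE;                      // bit 7 of the result
    return p;
}
```
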
<h2 id="memory-access-mode">Memory access mode</h2>
<p>This is one of the more complicated aspects of the 6502 CPU. Many instructions have multiple modes determining where their operands come from. This is the full list of all supported modes:</p>
<table>
<thead>
<tr>
<th>Abbr</th>
<th>Name</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Imp</td>
<td>Implicit</td>
<td>Instructions like RTS or CLC have no address operand, the destination of results are implied.</td>
</tr>
<tr>
<td>A</td>
<td>Accumulator</td>
<td>Many instructions can operate on the accumulator, e.g. LSR A. Some assemblers will treat no operand as an implicit A where applicable.</td>
</tr>
<tr>
<td>#v</td>
<td>Immediate</td>
<td>Uses the 8-bit operand itself as the value for the operation, rather than fetching a value from a memory address.</td>
</tr>
<tr>
<td>d</td>
<td>Zero page</td>
<td>Fetches the value from an 8-bit address on the zero page.</td>
</tr>
<tr>
<td>a</td>
<td>Absolute</td>
<td>Fetches the value from a 16-bit address anywhere in memory.</td>
</tr>
<tr>
<td>label</td>
<td>Relative</td>
<td>Branch instructions (e.g. BEQ, BCS) have a relative addressing mode that specifies an 8-bit signed offset relative to the current PC.</td>
</tr>
<tr>
<td>(a)</td>
<td>Indirect</td>
<td>The JMP instruction has a special indirect addressing mode that can jump to the address stored in a 16-bit pointer anywhere in memory.</td>
</tr>
</tbody>
</table>
<p>There are also more complicated memory access modes using the above:</p>
<table>
<thead>
<tr>
<th>Abbr</th>
<th>Name</th>
<th>Formula</th>
<th>Cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>d,x</td>
<td>Zero page indexed</td>
<td>val = PEEK((arg + X) % 256)</td>
<td>4</td>
</tr>
<tr>
<td>d,y</td>
<td>Zero page indexed</td>
<td>val = PEEK((arg + Y) % 256)</td>
<td>4</td>
</tr>
<tr>
<td>a,x</td>
<td>Absolute indexed</td>
<td>val = PEEK(arg + X)</td>
<td>4+</td>
</tr>
<tr>
<td>a,y</td>
<td>Absolute indexed</td>
<td>val = PEEK(arg + Y)</td>
<td>4+</td>
</tr>
<tr>
<td>(d,x)</td>
<td>Indexed indirect</td>
<td>val = PEEK(PEEK((arg + X) % 256) + PEEK((arg + X + 1) % 256) * 256)</td>
<td>6</td>
</tr>
<tr>
<td>(d),y</td>
<td>Indirect indexed</td>
<td>val = PEEK(PEEK(arg) + PEEK((arg + 1) % 256) * 256 + Y)</td>
<td>5+</td>
</tr>
</tbody>
</table>
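<p>To make the zero-page wraparound in these formulas concrete, here is a small sketch of the two indirect modes; the flat <code class="language-plaintext highlighter-rouge">mem</code> array and helper names are mine, not NesChan’s:</p>

```cpp
#include <cstdint>

// Hypothetical flat 64K memory for illustration (zero-initialized).
uint8_t mem[0x10000];

uint8_t peek(uint16_t addr) { return mem[addr]; }

// 16-bit little-endian read where BOTH bytes come from the zero page -
// the address wraps within 0x00-0xFF, which is what "% 256" in the table
// expresses.
uint16_t peek16_zp(uint8_t addr)
{
    return peek(addr) | (uint16_t(peek(uint8_t(addr + 1))) << 8);
}

// (d,x) Indexed indirect: the pointer lives at (arg + X), wrapped in zero page.
uint16_t addr_ind_x(uint8_t arg, uint8_t x) { return peek16_zp(uint8_t(arg + x)); }

// (d),y Indirect indexed: the pointer lives at arg in zero page, then Y is
// added to the pointer value.
uint16_t addr_ind_y(uint8_t arg, uint8_t y) { return peek16_zp(arg) + y; }
```

<p>Note the asymmetry: in (d,x) the index is added <em>before</em> the pointer is dereferenced, while in (d),y it is added <em>after</em>.</p>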
<p>In the code I have a enum for all the supported modes:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Addressing modes of 6502</span>
<span class="c1">// http://obelisk.me.uk/6502/addressing.html</span>
<span class="c1">// http://wiki.nesdev.com/w/index.php/CPU_addressing_modes</span>
<span class="k">enum</span> <span class="n">nes_addr_mode</span>
<span class="p">{</span>
<span class="n">nes_addr_mode_imp</span><span class="p">,</span> <span class="c1">// implicit</span>
<span class="n">nes_addr_mode_acc</span><span class="p">,</span> <span class="c1">// val = A</span>
<span class="n">nes_addr_mode_imm</span><span class="p">,</span> <span class="c1">// val = arg_8</span>
<span class="n">nes_addr_mode_ind_jmp</span><span class="p">,</span> <span class="c1">// val = peek16(arg_16), with JMP bug</span>
<span class="n">nes_addr_mode_rel</span><span class="p">,</span> <span class="c1">// val = arg_8, as offset</span>
<span class="n">nes_addr_mode_abs</span><span class="p">,</span> <span class="c1">// val = PEEK(arg_16), LSB then MSB </span>
<span class="n">nes_addr_mode_abs_jmp</span><span class="p">,</span> <span class="c1">// val = arg_16, LSB then MSB, direct jump address </span>
<span class="n">nes_addr_mode_zp</span><span class="p">,</span> <span class="c1">// val = PEEK(arg_8)</span>
<span class="n">nes_addr_mode_zp_ind_x</span><span class="p">,</span> <span class="c1">// d, x val = PEEK((arg_8 + X) % $FF ), 4 cycles</span>
<span class="n">nes_addr_mode_zp_ind_y</span><span class="p">,</span> <span class="c1">// d, y val = PEEK((arg_8 + Y) % $FF), 4 cycles</span>
<span class="n">nes_addr_mode_abs_x</span><span class="p">,</span> <span class="c1">// a, x val = PEEK(arg_16 + X), 4+ cycles</span>
<span class="n">nes_addr_mode_abs_y</span><span class="p">,</span> <span class="c1">// a, y val = PEEK(arg_16 + Y), 4+ cycles</span>
<span class="n">nes_addr_mode_ind_x</span><span class="p">,</span> <span class="c1">// (d, x) val = PEEK(PEEK((arg + X) % $FF) + PEEK((arg + X + 1) % $FF) * $FF), 6 cycles</span>
<span class="n">nes_addr_mode_ind_y</span><span class="p">,</span> <span class="c1">// (d), y val = PEEK(PEEK(arg) + PEEK((arg + 1) % $FF)* $FF + Y), 5+ cycles</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Recall that in instruction implementation we call <code class="language-plaintext highlighter-rouge">decode_operand</code> and <code class="language-plaintext highlighter-rouge">read_operand</code> (there is also <code class="language-plaintext highlighter-rouge">write_operand</code>) to decode and then read the target (whether it is register, an address, etc). So all the magic for decoding memory address modes are in there.</p>
<p>For example, the following code in <code class="language-plaintext highlighter-rouge">decode_operand_addr</code> (used internally by <code class="language-plaintext highlighter-rouge">decode_operand</code>) supports indirect y mode:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">addr_mode</span> <span class="o">==</span> <span class="n">nes_addr_mode</span><span class="o">::</span><span class="n">nes_addr_mode_ind_y</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// Indirect Indexed</span>
<span class="c1">// implies a table of table address in zero page</span>
<span class="kt">uint8_t</span> <span class="n">arg_addr</span> <span class="o">=</span> <span class="n">decode_byte</span><span class="p">();</span>
<span class="kt">uint16_t</span> <span class="n">addr</span> <span class="o">=</span> <span class="n">peek</span><span class="p">(</span><span class="n">arg_addr</span><span class="p">)</span> <span class="o">+</span> <span class="p">(</span><span class="kt">uint16_t</span><span class="p">(</span><span class="n">peek</span><span class="p">((</span><span class="n">arg_addr</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&</span> <span class="mh">0xff</span><span class="p">))</span> <span class="o"><<</span> <span class="mi">8</span><span class="p">);</span>
<span class="kt">uint16_t</span> <span class="n">new_addr</span> <span class="o">=</span> <span class="n">addr</span> <span class="o">+</span> <span class="n">_context</span><span class="p">.</span><span class="n">Y</span><span class="p">;</span>
<span class="k">return</span> <span class="n">new_addr</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="show-me-the-ram">Show me the RAM</h2>
<p>Accessing “RAM” in an emulator should in theory be easy, right? Just reserve a “big chunk” of a whopping 64K of RAM and access that. Unfortunately it is a little bit more complicated than that:</p>
<ul>
<li>The system only has <strong>2KB</strong> of built-in RAM - RAM was expensive in those days</li>
<li>Some memory addresses are mapped to I/O (such as PPU) registers, so accessing those registers becomes a simple memory operation rather than, say, a dedicated instruction</li>
<li>When a NES cartridge is inserted, its onboard data (RAM, ROM) is mapped onto the 64K memory space as well</li>
</ul>
<p>So the actual memory layout looks like this:</p>
<table>
<thead>
<tr>
<th>Address range</th>
<th>Size</th>
<th>Device</th>
</tr>
</thead>
<tbody>
<tr>
<td>$0000-$07FF</td>
<td>$0800</td>
<td>2KB internal RAM</td>
</tr>
<tr>
<td>$0800-$0FFF</td>
<td>$0800</td>
<td>Mirrors of $0000-$07FF</td>
</tr>
<tr>
<td>$1000-$17FF</td>
<td>$0800</td>
<td>Mirrors of $0000-$07FF</td>
</tr>
<tr>
<td>$1800-$1FFF</td>
<td>$0800</td>
<td>Mirrors of $0000-$07FF</td>
</tr>
<tr>
<td>$2000-$2007</td>
<td>$0008</td>
<td>NES PPU registers</td>
</tr>
<tr>
<td>$2008-$3FFF</td>
<td>$1FF8</td>
<td>Mirrors of $2000-2007 (repeats every 8 bytes)</td>
</tr>
<tr>
<td>$4000-$4017</td>
<td>$0018</td>
<td>NES APU and I/O registers</td>
</tr>
<tr>
<td>$4018-$401F</td>
<td>$0008</td>
<td>APU and I/O functionality that is normally disabled. See CPU Test Mode.</td>
</tr>
<tr>
<td>$4020-$FFFF</td>
<td>$BFE0</td>
<td>Cartridge space: PRG ROM, PRG RAM, and mapper registers (See Note)</td>
</tr>
</tbody>
</table>
<p>For more details you can refer to <a href="http://wiki.nesdev.com/w/index.php/CPU_memory_map">this page in NES wiki</a>.</p>
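<p>Concretely, the mirroring rows in the table collapse to a little bit-masking. Something along these lines (a simplified sketch of the idea - not neschan’s actual <code>redirect_addr</code>, which also has to deal with the cartridge ranges):</p>

```cpp
#include <cstdint>

// Fold a CPU address into its canonical (unmirrored) form, per the
// table above. Simplified sketch: APU/IO and cartridge space pass through.
uint16_t fold_mirror(uint16_t addr) {
    if (addr < 0x2000)
        return addr & 0x07ff;                              // $0800-$1FFF mirror the 2KB RAM
    if (addr < 0x4000)
        return static_cast<uint16_t>(0x2000 + (addr & 0x0007));  // $2008-$3FFF mirror $2000-$2007
    return addr;
}
```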
<p>Dealing with cartridges and mappers is another big topic with a whole lot of complexity, which we’ll cover a bit later. For now we’ll treat it as a black box.</p>
<p>All this means that whenever you write a byte you need to do a bit of indirection (just like most magic in computer science):</p>
<p><a href="https://github.com/yizhang82/neschan/blob/master/lib/src/nes_memory.cpp">nes_memory.cpp</a></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">nes_memory</span><span class="o">::</span><span class="n">set_byte</span><span class="p">(</span><span class="kt">uint16_t</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">uint8_t</span> <span class="n">val</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">redirect_addr</span><span class="p">(</span><span class="n">addr</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_io_reg</span><span class="p">(</span><span class="n">addr</span><span class="p">))</span>
<span class="p">{</span>
<span class="n">write_io_reg</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">_mapper</span> <span class="o">&&</span> <span class="p">(</span><span class="n">_mapper_info</span><span class="p">.</span><span class="n">flags</span> <span class="o">&</span> <span class="n">nes_mapper_flags_has_registers</span><span class="p">))</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">addr</span> <span class="o">>=</span> <span class="n">_mapper_info</span><span class="p">.</span><span class="n">reg_start</span> <span class="o">&&</span> <span class="n">addr</span> <span class="o"><=</span> <span class="n">_mapper_info</span><span class="p">.</span><span class="n">reg_end</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">_mapper</span><span class="o">-></span><span class="n">write_reg</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">val</span><span class="p">);</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">_ram</span><span class="p">[</span><span class="n">addr</span><span class="p">]</span> <span class="o">=</span> <span class="n">val</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
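<p>The read path is symmetric, with one extra wrinkle: on real hardware some register <em>reads</em> have side effects - reading PPUSTATUS at <code>$2002</code> clears the vblank flag, for example - so a <code>get_byte</code> can’t blindly treat I/O registers as plain memory either. A self-contained sketch of the shape (hypothetical names, not neschan’s actual code):</p>

```cpp
#include <array>
#include <cstdint>

// Hypothetical read-path sketch; read_io_reg stands in for a real
// implementation that dispatches to the PPU/APU.
struct sketch_memory {
    std::array<uint8_t, 0x800> ram{};  // the 2KB of internal RAM
    uint8_t ppu_status = 0x80;         // pretend vblank is currently set

    uint8_t read_io_reg(uint16_t addr) {
        if (addr == 0x2002) {          // PPUSTATUS: reading clears vblank
            uint8_t val = ppu_status;
            ppu_status &= 0x7f;
            return val;
        }
        return 0;                      // other registers elided
    }

    uint8_t get_byte(uint16_t addr) {
        if (addr < 0x2000) return ram[addr & 0x7ff];             // RAM + mirrors
        if (addr < 0x4000) return read_io_reg(0x2000 + (addr & 7));
        return 0;                      // APU/cartridge space elided
    }
};
```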
<h2 id="testing">Testing</h2>
<p>I use <a href="https://github.com/onqtam/doctest">doctest</a>, a simple and convenient testing framework that is good enough for my needs. At the beginning I wrote manual tests - basically executing a bunch of instructions until <code class="language-plaintext highlighter-rouge">BRK</code> (which stops the system) and verifying the state of the CPU and RAM:</p>
<p><a href="https://github.com/yizhang82/neschan/blob/master/test/cpu_test.cpp">cpu_test.cpp</a></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TEST_CASE</span><span class="p">(</span><span class="s">"CPU tests"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">nes_system</span> <span class="n">system</span><span class="p">;</span>
<span class="n">SUBCASE</span><span class="p">(</span><span class="s">"simple"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">INIT_TRACE</span><span class="p">(</span><span class="s">"neschan.instrtest.simple.log"</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Running [CPU][simple]..."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">system</span><span class="p">.</span><span class="n">power_on</span><span class="p">();</span>
<span class="n">system</span><span class="p">.</span><span class="n">run_program</span><span class="p">(</span>
<span class="p">{</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="c1">// LDA #$10 -> A = #$10</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// STA $20 -> $20 = #$10</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="c1">// LDA #$1 -> A = #$1</span>
<span class="mh">0x65</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// ADC $20 -> A = #$11</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// STA $21 -> $21=#$11</span>
<span class="mh">0xe6</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// INC $21 -> $21=#$12</span>
<span class="mh">0xa4</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// LDY $21 -> Y=#$12</span>
<span class="mh">0xc8</span><span class="p">,</span> <span class="c1">// INY -> Y=#$13</span>
<span class="mh">0x00</span><span class="p">,</span> <span class="c1">// BRK </span>
<span class="p">},</span>
<span class="mh">0x1000</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">cpu</span> <span class="o">=</span> <span class="n">system</span><span class="p">.</span><span class="n">cpu</span><span class="p">();</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x20</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x10</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x21</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x12</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">A</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x11</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">Y</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x13</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>But this quickly gets tedious. Fortunately, there are a lot of existing test ROMs. I’ve been using <a href="https://github.com/christopherpow/nes-test-roms/tree/master/nes_instr_test">this one</a> - it is fairly comprehensive. This does mean I needed to implement rudimentary ROM loading first (which we won’t cover here), but once that was ready I could just load the ROM and follow the convention of the test ROM - in this case, checking <code class="language-plaintext highlighter-rouge">peek(0x6000) == 0</code>.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define INSTR_V5_TEST_CASE(test) \
SUBCASE("instr_test-v5 " test) { \
INIT_TRACE("neschan.instrtest.instr_test-v5." test ".log"); \
cout << "Running [CPU][instr_test-v5-" << test << "]" << endl; \
system.power_on(); \
auto cpu = system.cpu(); \
cpu->stop_at_infinite_loop(); \
system.run_rom("./roms/instr_test-v5/rom_singles/" test ".nes", nes_rom_exec_mode_reset); \
CHECK(cpu->peek(0x6000) == 0); \
}
</span></code></pre></div></div>
<p>With that I can run a bunch of ROMs as regression tests, much better:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"01-basics"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"02-implied"</span><span class="p">)</span>
<span class="c1">// INSTR_V5_TEST_CASE("03-immediate")</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"04-zero_page"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"05-zp_xy"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"06-absolute"</span><span class="p">)</span>
<span class="c1">// INSTR_V5_TEST_CASE("07-abs_xy")</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"08-ind_x"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"09-ind_y"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"10-branches"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"11-stack"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"12-jmp_jsr"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"13-rts"</span><span class="p">)</span>
<span class="n">INSTR_V5_TEST_CASE</span><span class="p">(</span><span class="s">"14-rti"</span><span class="p">)</span>
<span class="c1">// INSTR_V5_TEST_CASE("15-brk")</span>
<span class="c1">// INSTR_V5_TEST_CASE("16-special")</span>
<span class="err">}</span>
</code></pre></div></div>
<blockquote>
<p>Some of the commented-out cases are most likely signs that there are still bugs in the CPU emulation.</p>
</blockquote>
<h2 id="conclusion">Conclusion</h2>
<p>It took me a few days to implement the full CPU and get the majority of the CPU tests to pass. Things are more subtle than I expected, but that’s probably always the case when emulating a real-world CPU, which has its own quirks or even bugs - and the documentation has bugs too (which are hard to find unless you test against a real CPU or another emulator). There is quite a bit of subtle behavior I didn’t cover here (such as page crossing, etc.) that I needed to get exactly right. Not surprisingly, getting CPU emulation correct is absolutely critical for getting games working.</p>
<p>One thing that did surprise me is that the last bugs that prevented <em>Super Mario Bros</em> from working were <a href="https://github.com/yizhang82/neschan/commit/7a397de0e6b6afcd50cc77bd33079ad854722205">bugs in my CPU emulation</a>, including a documentation bug. If I remember correctly I had to debug it side by side with another emulator to find the exact problem. In retrospect I probably should have gotten all the CPU tests passing first - the fact that I had disabled a few (especially among the earlier ones from 1-14) was definitely a red flag. Unfortunately I was too excited to push ahead, and “mostly working” was deemed “good enough”, which turned out to be a big mistake. But then, that’s why we work on side projects - to have fun, and learn something doing it.</p>
<h2 id="the-series-so-far">The series so far…</h2>
<ul>
<li><a href="/nes-emu-overview">Part 1 - NES Emulator Overview</a></li>
<li><a href="/nes-emu-main-loop">Part 2 - Writing the main loop</a></li>
<li><a href="/nes-emu-cpu">Part 3 - Emulating the 6502 CPU</a></li>
</ul>
<h2 id="if-you-are-hungry-for-more-nes">If you are hungry for more NES…</h2>
<p>Head to the <a href="http://wiki.nesdev.com/w/index.php/Nesdev">NESDev Wiki</a> - I’ve learned pretty much everything about the NES there. There is also a great book on the NES called <a href="https://www.amazon.com/Am-Error-Nintendo-Computer-Entertainment/dp/0262028778">I am error</a>, which is surprisingly deeply technical for a book about the history of the NES.</p>
<h2 id="doctest">Doctest - my favorite lightweight, zero-friction unit test framework (2021-01-09)</h2>
<p>In my personal C++ projects I’ve always been using <a href="https://github.com/onqtam/doctest">doctest</a>. It’s simply awesome. It takes a few seconds to get bootstrapped and you are ready to run your tests.
And it should really be the first thing you do when you start a new project.</p>
<p>For example, I’ve been using it in <a href="https://github.com/yizhang82/neschan/">neschan</a> which is a NES emulator that I wrote for fun back in 2018, and one such example is a few unit test that validates the emulated 6502 CPU works correctly:</p>
<!--more-->
<p><a href="https://github.com/yizhang82/neschan/blob/master/test/cpu_test.cpp">cpu_test.cpp</a></p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TEST_CASE</span><span class="p">(</span><span class="s">"CPU tests"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">nes_system</span> <span class="n">system</span><span class="p">;</span>
<span class="n">SUBCASE</span><span class="p">(</span><span class="s">"simple"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">INIT_TRACE</span><span class="p">(</span><span class="s">"neschan.instrtest.simple.log"</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Running [CPU][simple]..."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">system</span><span class="p">.</span><span class="n">power_on</span><span class="p">();</span>
<span class="n">system</span><span class="p">.</span><span class="n">run_program</span><span class="p">(</span>
<span class="p">{</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x10</span><span class="p">,</span> <span class="c1">// LDA #$10 -> A = #$10</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// STA $20 -> $20 = #$10</span>
<span class="mh">0xa9</span><span class="p">,</span> <span class="mh">0x01</span><span class="p">,</span> <span class="c1">// LDA #$1 -> A = #$1</span>
<span class="mh">0x65</span><span class="p">,</span> <span class="mh">0x20</span><span class="p">,</span> <span class="c1">// ADC $20 -> A = #$11</span>
<span class="mh">0x85</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// STA $21 -> $21=#$11</span>
<span class="mh">0xe6</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// INC $21 -> $21=#$12</span>
<span class="mh">0xa4</span><span class="p">,</span> <span class="mh">0x21</span><span class="p">,</span> <span class="c1">// LDY $21 -> Y=#$12</span>
<span class="mh">0xc8</span><span class="p">,</span> <span class="c1">// INY -> Y=#$13</span>
<span class="mh">0x00</span><span class="p">,</span> <span class="c1">// BRK </span>
<span class="p">},</span>
<span class="mh">0x1000</span><span class="p">);</span>
<span class="k">auto</span> <span class="n">cpu</span> <span class="o">=</span> <span class="n">system</span><span class="p">.</span><span class="n">cpu</span><span class="p">();</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x20</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x10</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">peek</span><span class="p">(</span><span class="mh">0x21</span><span class="p">)</span> <span class="o">==</span> <span class="mh">0x12</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">A</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x11</span><span class="p">);</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">cpu</span><span class="o">-></span><span class="n">Y</span><span class="p">()</span> <span class="o">==</span> <span class="mh">0x13</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It’s pretty self-explanatory - use <code class="language-plaintext highlighter-rouge">TEST_CASE</code> to define a test case and <code class="language-plaintext highlighter-rouge">SUBCASE</code> for scenarios, and <code class="language-plaintext highlighter-rouge">CHECK</code> for actual validation/assertion. (Ignore <code class="language-plaintext highlighter-rouge">INIT_TRACE</code> - it’s not part of the doctest framework)</p>
<p>To use it in your own project - just download one file:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl https://raw.githubusercontent.com/onqtam/doctest/master/doctest/doctest.h -o doctest.h
</code></pre></div></div>
<p>Then include it and add a #define:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#define DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
#include "doctest.h"
</span>
<span class="kt">int</span> <span class="nf">add</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">TEST_CASE</span><span class="p">(</span><span class="s">"testing 1+1=2"</span><span class="p">)</span> <span class="p">{</span>
<span class="n">CHECK</span><span class="p">(</span><span class="n">add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The magic <code class="language-plaintext highlighter-rouge">DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN</code> tells doctest.h that this file needs a <code class="language-plaintext highlighter-rouge">main</code>. It has to appear before <code class="language-plaintext highlighter-rouge">#include "doctest.h"</code> (obviously), so that the following code in <code class="language-plaintext highlighter-rouge">doctest.h</code> can kick in:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#ifdef DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
</span><span class="n">DOCTEST_MSVC_SUPPRESS_WARNING_WITH_PUSH</span><span class="p">(</span><span class="mi">4007</span><span class="p">)</span> <span class="c1">// 'function' : must be 'attribute' - see issue #182</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span><span class="o">**</span> <span class="n">argv</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">doctest</span><span class="o">::</span><span class="n">Context</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">).</span><span class="n">run</span><span class="p">();</span> <span class="p">}</span>
<span class="n">DOCTEST_MSVC_SUPPRESS_WARNING_POP</span>
<span class="cp">#endif // DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
</span></code></pre></div></div>
<p>Note that you should only have this in a single file (perhaps a bit obvious). Other .cc/.cpp files just need to <code class="language-plaintext highlighter-rouge">#include "doctest.h"</code> without the <code class="language-plaintext highlighter-rouge">#define</code> - the linker wouldn’t be happy with more than one <code class="language-plaintext highlighter-rouge">main</code> function, after all.</p>
<p>Compile and run:</p>
<blockquote>
<p>NOTE: --std=c++11 is required to use doctest, otherwise g++ would shout at you for feeding it nonsense</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[~/tmp/test]: g++ test.cc --std=c++11 -o test
[~/tmp/test, 1s]: ./test
[doctest] doctest version is "2.3.1"
[doctest] run with "--help" for options
===============================================================================
[doctest] test cases: 1 | 1 passed | 0 failed | 0 skipped
[doctest] assertions: 1 | 1 passed | 0 failed |
[doctest] Status: SUCCESS!
</code></pre></div></div>
<p>It doesn’t get simpler than this. When I say zero friction I really mean it. OK, maybe not entirely zero, but close enough.</p>
<p>Note that the earlier <code class="language-plaintext highlighter-rouge">main</code> function calls out to <code class="language-plaintext highlighter-rouge">doctest::Context(argc, argv)</code>. This means that the final executable automatically comes with command line arguments you can use to control how the test executes, such as:</p>
<ol>
<li>Test case filters</li>
<li>Listing all test cases / test suites</li>
<li>Running tests N times</li>
<li>And much more</li>
</ol>
<h2 id="if-you-are-curious">If you are curious…</h2>
<p>If you are curious, <code class="language-plaintext highlighter-rouge">doctest.h</code> is a gigantic 6000-line header file that gets assembled from two files, with a bit of post-processing, whenever either of them changes:</p>
<p><a href="https://github.com/onqtam/doctest/blob/master/CMakeLists.txt">CMakeLists.txt</a></p>
<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1"># add a custom target that assembles the single header when any of the parts are touched</span>
<span class="nb">add_custom_command</span><span class="p">(</span>
OUTPUT <span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/doctest/doctest.h
DEPENDS
<span class="si">${</span><span class="nv">doctest_parts_folder</span><span class="si">}</span>/doctest_fwd.h
<span class="si">${</span><span class="nv">doctest_parts_folder</span><span class="si">}</span>/doctest.cpp
COMMAND <span class="si">${</span><span class="nv">CMAKE_COMMAND</span><span class="si">}</span> -P <span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/scripts/cmake/assemble_single_header.cmake
COMMENT <span class="s2">"assembling the single header"</span><span class="p">)</span>
<span class="nb">add_custom_target</span><span class="p">(</span>assemble_single_header ALL DEPENDS <span class="si">${</span><span class="nv">CMAKE_CURRENT_SOURCE_DIR</span><span class="si">}</span>/doctest/doctest.h<span class="p">)</span>
</code></pre></div></div>
<p><a href="https://github.com/onqtam/doctest/blob/master/scripts/cmake/assemble_single_header.cmake">assemble_single_header.cmake</a></p>
<div class="language-cmake highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">set</span><span class="p">(</span>doctest_include_folder <span class="s2">"</span><span class="si">${</span><span class="nv">CMAKE_CURRENT_LIST_DIR</span><span class="si">}</span><span class="s2">/../../doctest/"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>READ <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/parts/doctest_fwd.h fwd<span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>READ <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/parts/doctest.cpp impl<span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>WRITE <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"// ====================================================================== lgtm [cpp/missing-header-guard]</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"// == DO NOT MODIFY THIS FILE BY HAND - IT IS AUTO GENERATED BY CMAKE! ==</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"// ======================================================================</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"</span><span class="si">${</span><span class="nv">fwd</span><span class="si">}</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"#ifndef DOCTEST_SINGLE_HEADER</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"#define DOCTEST_SINGLE_HEADER</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"#endif // DOCTEST_SINGLE_HEADER</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">file</span><span class="p">(</span>APPEND <span class="si">${</span><span class="nv">doctest_include_folder</span><span class="si">}</span>/doctest.h <span class="s2">"</span><span class="se">\n</span><span class="si">${</span><span class="nv">impl</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div></div>
<p>This makes bootstrapping the whole unit test setup essentially painless. You can just include a copy in your repo/folder and you are done - no need to fiddle with a package manager / submodule. I wish more frameworks were distributed like this. Of course, assembling the entire boost library into a single header might be a bit extreme, but for simple frameworks where reducing the friction of adoption is important, this can be a rather useful technique.</p>
<h2 id="fedora-on-x1">Putting Fedora 33 Workstation on X1 Carbon 7th gen (2021-01-02)</h2>
<p>I’ve had a <a href="https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-x/X1-Carbon-Gen-7/p/22TP2TXX17G">Lenovo X1 Carbon 7th gen</a> for a while and tried putting Ubuntu 20.04 on it, but had quite a bit of trouble. Mostly the problem was that this model has 4 speakers (two front and two bottom), which Linux had quite a bit of trouble with: the sound was tinny, volume up / down didn’t work, and the microphone jack popped. There are other minor issues too, like the fingerprint sensor not working, though I don’t care about that much. There is <a href="https://forums.lenovo.com/t5/Ubuntu/Guide-X1-Carbon-7th-Generation-Ubuntu-compatability/td-p/4489823?page=1">a long thread</a> discussing the problems with Ubuntu. I spent quite a while browsing forums and found some workarounds, but none were satisfactory. So I gave up and <a href="/set-up-wsl2">went WSL2</a>.</p>
<!--more-->
<p>WSL2 is basically a VM, so it mostly works quite well and is indistinguishable from native Linux for the most part. However, it isn’t quite smooth sailing either. It is still quite a bit slower - for example, starting vim takes a second or so, while on native Linux it is pretty much instant. It is also very memory hungry - it will aggressively take over memory for I/O cache, which would usually not be a problem if it were the only game in town, but it slows down Windows as a result. I have a desktop machine with 32G and WSL2 will happily push it over 80% during a memory-intensive task such as compilation. Capping the memory consumption helps, though.</p>
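<p>For reference, the capping is done via a <code>.wslconfig</code> file in your Windows user profile folder (<code>%UserProfile%\.wslconfig</code>) - something along these lines, where the numbers are just an example to tune for your machine:</p>

```ini
# %UserProfile%\.wslconfig - limits apply to the whole WSL2 VM
[wsl2]
# cap the VM (and thus its page cache) at 8GB
memory=8GB
# optionally limit CPU count as well
processors=4
```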
<p>After a while I heard that <a href="https://www.forbes.com/sites/jasonevangelho/2020/05/08/lenovo-has-2-awesome-surprises-for-linux-thinkpad-customers-in-2020/?sh=404aaf72399d">Lenovo has been working with Fedora for ThinkPads</a>, and with Fedora 33 out I wanted to give it a spin, but didn’t get a chance to try it until this week. I’m happy to report that putting Fedora Workstation 33 x64 on it works pretty much perfectly:</p>
<ul>
<li>Wifi works out of the box</li>
<li>Suspend/Resume works fine - Lenovo seems to suggest keeping the sleep state in BIOS set to Windows, as Linux supports it these days</li>
<li>Audio works fine - all 4 speakers seem to work and the microphone works well too. Volume buttons work as well</li>
<li>Camera works - a must these days for meetings</li>
<li>Trackpad works - not quite as smooth as on Windows, but acceptable. Scrolling was a bit too fast for my liking and there doesn’t seem to be a great way to tweak it in Gnome, but I can live with it</li>
<li>Fingerprint Sensor works - I didn’t even realize I needed it, but it even works for <code class="language-plaintext highlighter-rouge">sudo</code>, which is a pleasant surprise:</li>
</ul>
<p><img src="/imgs/lenovo-fedora-1.png" alt="Fingerprint Sensor for sudo" /></p>
<p>However, it did come with a catch. If I log in with the fingerprint, it’ll still ask me to unlock the keyring with a password, which is surely broken. The fingerprint daemon also seems to occasionally stop working and hangs at shutdown (until a timeout), but using the fingerprint for sudo is nice enough that I can live with it.</p>
<p>One thing that annoyed me is that the “task bar” won’t show up until I hover the mouse at the top-left. Using <a href="https://extensions.gnome.org/extension/307/dash-to-dock/">Dash to Dock</a> fixed that.</p>
<p>Putting the software I need on it is also relatively straightforward. I have <a href="https://github.com/yizhang82/dotfiles">dotfiles</a> that install vim/tmux/zsh for me and <a href="https://github.com/yizhang82/utils/blob/master/sys/linux/install.sh">install.sh</a> installs all the utilities - I did have to adapt it to use dnf, and some libraries need different names, but that’s pretty much it. After installing VS Code and Chrome I’m good to go. I did run into a problem where the second Chrome window is super slow, which seems to be a problem with Wayland. Applying <a href="https://unix.stackexchange.com/questions/612325/opening-two-chrome-windows-on-fedora-32-is-very-slow">a fix from a Stack Exchange post</a> fixed it for me.</p>
<p><img src="/imgs/lenovo-fedora-2.png" alt="neofetch" /></p>
<p>Overall I’m quite happy with Fedora 33 on the X1 Carbon 7th Gen. Linux has certainly come a long way and it’s great to see hardware manufacturers collaborating with Linux to make the experience just work, so there is still hope. Unfortunately the Linux desktop is as fragmented as ever, so maybe the year of the Linux desktop isn’t quite here yet. Who knows - maybe we’ll all end up with Chromebooks and SSH to our dev boxes in the cloud.</p>Yi Zhangmail@yizhang82.meI’ve had a Lenovo X1 Carbon 7th gen for a while and tried putting Ubuntu 20.04 on it, but had quite a bit of trouble. The main problem is that this model has 4 speakers (two front and two bottom), so Linux had quite a bit of trouble with it. The sound was tinny, and volume up / down didn’t work either. The microphone jack also pops. There are other minor issues like the fingerprint sensor not working, though I don’t care about that much. There is a long thread discussing problems with Ubuntu. I spent quite a while browsing forums and found some workarounds, but none were satisfactory. So I gave up and went WSL2.Paper Reading - Hekaton: SQL Server’s Memory-Optimized OLTP Engine2020-12-29T00:00:00+00:002020-12-29T00:00:00+00:00http://yizhang82.dev/paper-hekaton<p><a href="http://web.eecs.umich.edu/~mozafari/fall2015/eecs584/papers/hekaton.pdf">This is a great paper covering Microsoft’s <em>Hekaton</em> storage engine</a>. There is a lot of meaty stuff here - lock-free Bw-tree indexing, MVCC, checkpointing, query compilation. I’m especially interested in its query compilation given my background in the .NET runtime, and I’ve also spent a non-trivial amount of time optimizing query performance in my current job. Bw-tree is a very interesting direction for B+ trees as well, and we’ll look at a few papers that cover Bw-tree in more depth in future posts.</p>
<!--more-->
<h2 id="overview">Overview</h2>
<p>Hekaton is an alternative SQL Server storage engine optimized for main memory (not a separate DBMS). Users can enable it by declaring tables to be “optimized”. Hekaton has the following design principles:</p>
<ul>
<li>Durability is ensured by logging and checkpointing, but index operations are not logged - indexes are rebuilt entirely from the latest checkpoint and logs. This avoids complex buffer pool flush management.</li>
<li>Internal data structures (allocators, indexes, transaction map, etc.) are entirely lock-free / latch-free, as is any other performance-critical path. Hekaton also uses a new optimistic multi-version concurrency control scheme for transaction isolation, to avoid locking.</li>
<li>Requests are compiled down to native code. Decisions are made as far in advance as possible to reduce runtime overhead.</li>
</ul>
<blockquote>
<p>Sidebar: Request compilation is especially interesting here. This is an advanced technique commonly seen in language runtimes as JIT compilation. However, in most cases it isn’t worth the complexity for a DBMS unless memory access (instead of I/O) becomes the bottleneck - which is exactly the case here, where most SQL statements access hot data that is already in memory (in the buffer pool cache).</p>
</blockquote>
<p>Note that Hekaton doesn’t support table partitioning. Some in-memory databases such as HyPer / H-Store / VoltDB / Dora partition the database by core. However, this has the downside that a query which can’t be “partition aligned” (not using the partitioning index) needs to be sent to all partitions, which can be much more expensive. To support a wider variety of workloads, Hekaton decided not to support table partitioning. Keep in mind this is partitioning tables inside the same database instance, and not related to distributed databases where data is partitioned across database instances on different nodes.</p>
<p>In a high-level, Hekaton has 3 components:</p>
<ul>
<li>Hekaton storage engine - manages user data and index</li>
<li>Hekaton compiler - takes the AST of a stored procedure and metadata as input, and compiles them to native code</li>
<li>Hekaton runtime system - integration with SQL server and providing helpers needed by compiled code</li>
</ul>
<p>Hekaton also heavily leverages existing SQL server services - you can refer to the paper for more details.</p>
<h2 id="storage-and-indexing">Storage and Indexing</h2>
<p>Hekaton supports hash indexes via a lock-free hash table, and range indexes via the Bw-tree (a novel lock-free variant of the B-tree).</p>
<p>The following diagram is a good example:</p>
<p><img src="/imgs/paper-hekaton-1.png" alt="Hekaton_index" /></p>
<ul>
<li>Both the hash index and the Bw-tree index store pointers to the actual data</li>
<li>The hash index is divided into hash buckets - so bucket J points to the start of all the names beginning with J. All data within the same bucket are linked together</li>
<li>Different versions of the same key are also linked together to provide MVCC support. The begin/end times describe the transaction timestamp range in which the value is valid, and the ranges are strictly non-overlapping. Every read has a read time, and only matching records are returned</li>
</ul>
<p>During an update, the record being updated has its end time set to the transaction id (Txn75 in the diagram) to indicate it is being updated, and the new record has its begin time set to the transaction id as well, indicating it is a new record that hasn’t committed yet (its end time being infinity). Once the transaction commits, it updates these to the commit time. Old versions are garbage collected when they are no longer visible to any transaction, done cooperatively by all worker threads.</p>
<h2 id="programming-and-compilation">Programming and Compilation</h2>
<p>Typically, a DBMS uses an “interpreter”-style execution model to execute SQL statements. The Hekaton compiler reuses the SQL Server T-SQL compiler stack (metadata, parser, name resolution, type derivation, and query optimizer). The output is C code, which is compiled with Microsoft VC++ into a DLL that gets loaded and executed at runtime.</p>
<p>As part of creating a new table, schema functions such as the hashing function, record comparison, and record serialization are compiled as well and made available for index operations such as searches / inserts. As a result those operations are quite efficient.</p>
<blockquote>
<p>Ideally, these functions would be compiled together with the SQL statements as well so that they can be properly inlined if needed, though the usual caveats of inlining apply.</p>
</blockquote>
<p>A SQL statement is compiled into MAT (Mixed Abstract Tree) which is a rich abstract syntax tree representing metadata, imperative logic, and query plans. It is then converted into PIT (Pure Imperative Tree) that is more easily converted into C or other intermediate representations. The following picture shows the high-level flow:</p>
<p><img src="/imgs/paper-hekaton-2.png" alt="Hekaton_Compilation" /></p>
<p>A query plan consists of a tree of operators, as in most query execution engines. Each operator implements a common interface so that operators can be composed. In the example, the code calls the <code class="language-plaintext highlighter-rouge">Scan</code> operator, which calls the <code class="language-plaintext highlighter-rouge">Filter</code> operator to filter the rows. The operators are connected by gotos instead of calls - this greatly reduces the overhead of procedure calls and parameter passing, though it makes the generated code harder to debug.</p>
<blockquote>
<p>Connecting operators with gotos effectively “inlines” the code by hard-coding the transfers of control - cruder than true compiler inlining, but simpler to implement. It is also reasonable to expect the C compiler to inline the code itself, since the call graph is well defined.</p>
</blockquote>
<p>Not all code is compiled - some functionality, such as sorting and math, is available as helpers, where the implementation is complex and the overhead of a function call is relatively low.</p>
<p>The compiled stored procedures look just like any T-SQL stored procedures and support parameter passing. There are limitations to what these T-SQL procedures and their SQL statements can do due to implementation restrictions. To get around those limitations, Hekaton supports Query Interop, which enables the conventional disk-based query engine to query memory-optimized tables.</p>
<h2 id="transactions">Transactions</h2>
<p>Hekaton supports optimistic MVCC to provide snapshot, repeatable read, and serializable transaction isolation. For serializable transactions it ensures:</p>
<ol>
<li>Read stability - a version read by the transaction is still the visible version at the end of the transaction</li>
<li>Phantom avoidance - repeating a scan wouldn’t return additional new versions</li>
</ol>
<p>It is worth noting that repeatable read only needs read stability.</p>
<p>To validate its reads, a transaction checks that the versions it read are still visible as of the transaction’s end time. Each transaction maintains a read-set (a list of pointers to each version it has read) and a scan-set.</p>
<p>If transaction T1 sees data changes made by T2, T1 takes a commit dependency on T2. Until T2 commits, T1’s result set is held back by a read barrier and is sent to the client as soon as the dependency is cleared.</p>
<blockquote>
<p>Technically this is still blocking, since the client won’t receive the results yet; in theory, though, the thread can be freed to process other transactions. Until the dependency clears, the transaction isn’t actually committed.</p>
</blockquote>
<p>Once a transaction’s updates have been logged, it is irreversibly committed, and during the commit post-processing phase it updates the timestamps in all versions it touched to the end / commit timestamp of the transaction. The list of inserted / deleted versions is maintained in a write-set.</p>
<p>During a rollback, all versions created by the transaction are invalidated. Deleted versions are restored by clearing their end timestamps (back to infinity). Any dependent transactions are notified.</p>
<h2 id="checkpoint-and-recovery">Checkpoint and recovery</h2>
<p>Hekaton ensures transaction durability so that it can recover after a failure, using logs and checkpoints. The design minimizes transaction-processing overhead and pushes work to recovery time where possible. It supports parallel processing during recovery. Indexes are reconstructed during recovery.</p>
<p>Logs are essentially redo logs for committed transactions. No undo log is recorded.</p>
<p>Checkpoints are continuous, incremental, and append-only - they are essentially deltas of changes recorded in sequential files, consisting of multiple data files and delta files. The reason they are continuous is that periodic checkpoints are disruptive to performance. A data file contains inserted records covering a specific timestamp range; it is loaded at recovery time and the indexes reconstructed. A delta file is a list of the deleted versions in its data file, and maps 1:1 to a data file. At recovery time it filters out deleted records in the data file so they are never loaded into memory. The files are loaded in parallel during recovery. In this sense, checkpoints are basically compressed logs. Checkpoint data files are also merged to drop deleted versions.</p>
<blockquote>
<p>The continuous nature of these checkpoints differs from traditional checkpoints. With traditional checkpoints, no filtering is required because data deleted within the timestamp range is dropped as part of the checkpointing process. With continuous checkpoints, however, you need to record both inserts and deletes. The result is essentially a segmented, self-contained log (so simply rotating redo logs won’t work) optimized for bulk loading.</p>
</blockquote>
<h2 id="garbage-collection">Garbage Collection</h2>
<p>Hekaton GC removes versions that are no longer visible to any active transaction. It is non-blocking, parallelizable, and scalable. Most interestingly, it is cooperative - worker threads doing transaction processing discard garbage versions when they encounter them, making GC naturally scalable. There are also dedicated background GC threads, which collect cold regions of the index that might not be scanned at all.</p>
<p>Hekaton GC locates garbage versions by looking for end timestamps smaller than the oldest active transaction timestamp, which is determined periodically by a GC thread scanning the global transaction map.</p>
<p>The background collection thread breaks up the work and distributes it to a set of work queues. Once a Hekaton worker thread is done with transaction processing, it picks up a small chunk of garbage collection work from its CPU-local queue. This naturally parallelizes the work across CPU cores and is also self-throttling, since it is done in worker threads.</p>
<h2 id="whats-next">What’s next</h2>
<p>There are a few more related papers that we can explore. Bw-tree is probably the most interesting and worth looking into.</p>Yi Zhangmail@yizhang82.meThis is a great paper covering Microsoft’s Hekaton storage engine. There is a lot of meaty stuff here - lock-free Bw-tree indexing, MVCC, checkpointing, query compilation. I’m especially interested in its query compilation given my background in the .NET runtime, and I’ve also spent a non-trivial amount of time optimizing query performance in my current job. Bw-tree is a very interesting direction for B+ trees as well, and we’ll look at a few papers that cover Bw-tree in more depth in future posts.InnoDB Internals - Consistent Reads2020-06-18T00:00:00+00:002020-06-18T00:00:00+00:00http://yizhang82.dev/innodb-internals-consistent-reads<h2 id="overview">Overview</h2>
<p>I’ve been doing some research in this area trying to understand how this works in databases (for my upcoming project), so I thought I’d share some of my learnings here.</p>
<p>InnoDB internally uses ReadView to establish snapshots for consistent reads - basically giving you a point-in-time view of the database at the time the snapshot was created.</p>
<p>In InnoDB, all changes are immediately made to the latest version of the database regardless of whether they have been committed, so without MVCC everybody would see the latest version of every row - a disaster for consistency. Not to mention you also need to be able to roll back changes. To achieve this, InnoDB maintains an undo log tracking a linked list of changes made by other transactions, so reading in the past with a snapshot means starting from the latest record in the BufferPool and walking backwards to find the first visible change. Rollback is similar.</p>
<blockquote>
<p>This also means the undo log can’t be purged while the snapshot is still active, so the undo log gets longer and longer, which slows down reads more and more. This is the infamous long-running transaction issue.</p>
</blockquote>
<p>The fundamental issue is that you need to be able to determine the visibility of changes. This is done with two things:</p>
<ol>
<li>InnoDB tracks the trx_id_t of each row, both in the row itself and in the undo log</li>
<li>InnoDB internally use a data structure called <code class="language-plaintext highlighter-rouge">ReadView</code> to determine if a transaction is visible in the snapshot.</li>
</ol>
<p>So the algorithm becomes as simple as walking the list backwards to find the first visible record.</p>
<!--more-->
<p>For example - assuming current transaction is <code class="language-plaintext highlighter-rouge">6941</code>, and the latest record is made by transaction <code class="language-plaintext highlighter-rouge">6999</code>, and the undo log looks as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6940 -> 6943 -> 6945 -> 6999
</code></pre></div></div>
<p>This chain means the row has been modified by <code class="language-plaintext highlighter-rouge">6940</code>, <code class="language-plaintext highlighter-rouge">6943</code>, <code class="language-plaintext highlighter-rouge">6945</code>, and <code class="language-plaintext highlighter-rouge">6999</code>, in that order.</p>
<p>In order to determine visibility, <code class="language-plaintext highlighter-rouge">ReadView</code> tracks an upper bound, a lower bound, and a list of active transactions.</p>
<p>Assuming the system has on-going transactions with the following trx_id_t values: <code class="language-plaintext highlighter-rouge">(6943, 6945)</code>, and <code class="language-plaintext highlighter-rouge">trx_sys->max_trx_id=6959</code>:</p>
<p>ReadView is going to establish the following view for snapshot:</p>
<table>
<thead>
<tr>
<th>Lowest</th>
<th>On-going</th>
<th>Future</th>
</tr>
</thead>
<tbody>
<tr>
<td>< 6943</td>
<td>6943, 6945</td>
<td>>= 6959 (max_trx_id)</td>
</tr>
</tbody>
</table>
<p>This implies:</p>
<ul>
<li>Any transactions < <code class="language-plaintext highlighter-rouge">6943</code> are definitely visible, because they were no longer active when the snapshot was established - they had already committed.</li>
<li>Any transactions >= <code class="language-plaintext highlighter-rouge">6959</code> (inclusive) are future changes that will not be seen by this snapshot.</li>
<li>Any transactions falling within this range have two possibilities:
<ul>
<li>At the time the snapshot is created, the on-going transactions are <code class="language-plaintext highlighter-rouge">6943</code> and <code class="language-plaintext highlighter-rouge">6945</code>. These transactions are old transactions and any updates by them are not visible, since they haven’t committed yet</li>
<li>Otherwise, they have already been committed and should be visible</li>
</ul>
</li>
</ul>
<p>BTW, in case you are wondering: the reason <code class="language-plaintext highlighter-rouge">6959</code> itself is excluded from the snapshot is that max_trx_id is reserved for the next transaction, just as the comment in the InnoDB code suggests:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">volatile</span> <span class="n">trx_id_t</span> <span class="n">max_trx_id</span><span class="p">;</span> <span class="cm">/*!< The smallest number not yet
assigned as a transaction id or
transaction number. This is declared
volatile because it can be accessed
without holding any mutex during
AC-NL-RO view creation. */</span>
</code></pre></div></div>
<p>So, looking back at the link list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>6940 -> 6943 -> 6945 -> 6999
</code></pre></div></div>
<p>We can determine:</p>
<ul>
<li>6999 is invisible because it is >= 6959, so it belongs to the future (committed or not, it doesn’t matter)</li>
<li>6945 and 6943 were part of on-going transactions at the time of the snapshot, which means they were not yet committed when the snapshot was created (even though they did commit later, before we read), so they are also invisible</li>
<li>6940 is visible because it is less than 6943, so it had already committed in the past and is by definition visible.</li>
</ul>
<p>So we should return the record with trx_id_t = <code class="language-plaintext highlighter-rouge">6940</code>.</p>
<p>Let’s look into this process with a bit more detail.</p>
<h2 id="creating-the-readview">Creating the ReadView</h2>
<p>Whenever you try to read any row in InnoDB with a consistent read (as opposed to a locking read, which is another topic worth discussing in a separate article), a ReadView is assigned to the active transaction:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="err">}</span> <span class="k">else</span> <span class="nf">if</span> <span class="p">(</span><span class="n">prebuilt</span><span class="o">-></span><span class="n">select_lock_type</span> <span class="o">==</span> <span class="n">LOCK_NONE</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* This is a consistent read */</span>
<span class="cm">/* Assign a read view for the query */</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">srv_read_only_mode</span><span class="p">)</span> <span class="p">{</span>
<span class="n">trx_assign_read_view</span><span class="p">(</span><span class="n">trx</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The assignment is rather straightforward - it either opens a view from the free list or reuses the existing view if there is one already:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Assigns a read view for a consistent read query. All the consistent reads
within the same transaction will get the same read view, which is created
when this function is first called for a new started transaction.
@return consistent read view */</span>
<span class="n">ReadView</span> <span class="o">*</span><span class="nf">trx_assign_read_view</span><span class="p">(</span><span class="n">trx_t</span> <span class="o">*</span><span class="n">trx</span><span class="p">)</span> <span class="cm">/*!< in/out: active transaction */</span>
<span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">state</span> <span class="o">==</span> <span class="n">TRX_STATE_ACTIVE</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">srv_read_only_mode</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">NULL</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">MVCC</span><span class="o">::</span><span class="n">is_view_active</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">))</span> <span class="p">{</span>
<span class="n">trx_sys</span><span class="o">-></span><span class="n">mvcc</span><span class="o">-></span><span class="n">view_open</span><span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">,</span> <span class="n">trx</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="p">(</span><span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Assuming this is the first time within the transaction, <code class="language-plaintext highlighter-rouge">mvcc::view_open</code> calls into <code class="language-plaintext highlighter-rouge">ReadView::prepare</code> to set up the boundaries as discussed earlier:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="n">ReadView</span><span class="o">::</span><span class="n">prepare</span><span class="p">(</span><span class="n">trx_id_t</span> <span class="n">id</span><span class="p">)</span> <span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">mutex_own</span><span class="p">(</span><span class="o">&</span><span class="n">trx_sys</span><span class="o">-></span><span class="n">mutex</span><span class="p">));</span>
<span class="n">m_creator_trx_id</span> <span class="o">=</span> <span class="n">id</span><span class="p">;</span>
<span class="n">m_low_limit_no</span> <span class="o">=</span> <span class="n">m_low_limit_id</span> <span class="o">=</span> <span class="n">m_up_limit_id</span> <span class="o">=</span> <span class="n">trx_sys</span><span class="o">-></span><span class="n">max_trx_id</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">trx_sys</span><span class="o">-></span><span class="n">rw_trx_ids</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="n">copy_trx_ids</span><span class="p">(</span><span class="n">trx_sys</span><span class="o">-></span><span class="n">rw_trx_ids</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">m_ids</span><span class="p">.</span><span class="n">clear</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>During <code class="language-plaintext highlighter-rouge">copy_trx_ids</code>, <code class="language-plaintext highlighter-rouge">m_up_limit_id</code> is assigned the smallest active transaction id:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">m_up_limit_id</span> <span class="o">=</span> <span class="n">m_ids</span><span class="p">.</span><span class="n">front</span><span class="p">();</span>
</code></pre></div></div>
<p>It is perhaps a bit counter-intuitive as they are sort of reversed:</p>
<ul>
<li>m_up_limit_id is the lower bound of visible trx_id_t (of transactions)</li>
<li>m_low_limit_id is the upper bound (exclusive) of visible trx_id_t (of transactions)</li>
</ul>
<p>And m_ids is the list of trx_id_t values of the on-going transactions (which are invisible).</p>
<p>With this knowledge, we are now ready to read the rows for real.</p>
<h2 id="reading-the-rows">Reading the rows</h2>
<p>Assuming this transaction is trying to read some rows:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">from</span> <span class="n">t1</span> <span class="k">where</span> <span class="n">pk</span><span class="o">=</span><span class="mi">6</span><span class="p">;</span>
</code></pre></div></div>
<p>When reading rows, eventually we’ll get here:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">srv_force_recovery</span> <span class="o"><</span> <span class="mi">5</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">lock_clust_rec_cons_read_sees</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">offsets</span><span class="p">,</span>
<span class="n">trx_get_read_view</span><span class="p">(</span><span class="n">trx</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">rec_t</span> <span class="o">*</span><span class="n">old_vers</span><span class="p">;</span>
<span class="cm">/* The following call returns 'offsets' associated with 'old_vers' */</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">row_sel_build_prev_vers_for_mysql</span><span class="p">(</span>
<span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">,</span> <span class="n">clust_index</span><span class="p">,</span> <span class="n">prebuilt</span><span class="p">,</span> <span class="n">rec</span><span class="p">,</span> <span class="o">&</span><span class="n">offsets</span><span class="p">,</span> <span class="o">&</span><span class="n">heap</span><span class="p">,</span>
<span class="o">&</span><span class="n">old_vers</span><span class="p">,</span> <span class="n">need_vrow</span> <span class="o">?</span> <span class="o">&</span><span class="n">vrow</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">mtr</span><span class="p">,</span>
<span class="n">prebuilt</span><span class="o">-></span><span class="n">get_lob_undo</span><span class="p">());</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">lock_clust_rec_cons_read_sees</code> mostly just checks whether the record is visible:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">trx_id_t</span> <span class="n">trx_id</span> <span class="o">=</span> <span class="n">row_get_rec_trx_id</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">offsets</span><span class="p">);</span>
<span class="k">return</span> <span class="p">(</span><span class="n">view</span><span class="o">-></span><span class="n">changes_visible</span><span class="p">(</span><span class="n">trx_id</span><span class="p">,</span> <span class="n">index</span><span class="o">-></span><span class="n">table</span><span class="o">-></span><span class="n">name</span><span class="p">));</span>
</code></pre></div></div>
<p>We check whether the record in question can be seen by reading the trx_id_t field of the record and checking whether it is visible in the view.</p>
<p>As already discussed, <code class="language-plaintext highlighter-rouge">changes_visible</code> uses <code class="language-plaintext highlighter-rouge">(m_up_limit_id, m_low_limit_id)</code> as a fast path:</p>
<ul>
<li>If id < <code class="language-plaintext highlighter-rouge">m_up_limit_id</code>, it happens in the past and definitely visible</li>
<li>If id >= <code class="language-plaintext highlighter-rouge">m_low_limit_id</code>, it happens in the future and definitely not visible</li>
</ul>
<p>Then it does a binary search over the list of transactions to see if the id was in the list of active transactions at the time the <code class="language-plaintext highlighter-rouge">ReadView</code> was established. If it is in the list, it is definitely not visible.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="cm">/** Check whether the changes by id are visible.
@param[in] id transaction id to check against the view
@param[in] name table name
@return whether the view sees the modifications of id. */</span>
<span class="kt">bool</span> <span class="nf">changes_visible</span><span class="p">(</span><span class="n">trx_id_t</span> <span class="n">id</span><span class="p">,</span> <span class="k">const</span> <span class="n">table_name_t</span> <span class="o">&</span><span class="n">name</span><span class="p">)</span> <span class="k">const</span>
<span class="n">MY_ATTRIBUTE</span><span class="p">((</span><span class="n">warn_unused_result</span><span class="p">))</span> <span class="p">{</span>
<span class="n">ut_ad</span><span class="p">(</span><span class="n">id</span> <span class="o">></span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">id</span> <span class="o"><</span> <span class="n">m_up_limit_id</span> <span class="o">||</span> <span class="n">id</span> <span class="o">==</span> <span class="n">m_creator_trx_id</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">check_trx_id_sanity</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">id</span> <span class="o">>=</span> <span class="n">m_low_limit_id</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">false</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">m_ids</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span>
<span class="k">return</span> <span class="p">(</span><span class="nb">true</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">const</span> <span class="n">ids_t</span><span class="o">::</span><span class="n">value_type</span> <span class="o">*</span><span class="n">p</span> <span class="o">=</span> <span class="n">m_ids</span><span class="p">.</span><span class="n">data</span><span class="p">();</span>
<span class="k">return</span> <span class="p">(</span><span class="o">!</span><span class="n">std</span><span class="o">::</span><span class="n">binary_search</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span> <span class="o">+</span> <span class="n">m_ids</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="n">id</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
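<p>To make these rules concrete, here is a minimal Python model of the same visibility logic (an illustrative sketch with made-up transaction ids - not InnoDB code):</p>

```python
import bisect

def changes_visible(trx_id, creator_trx_id, up_limit_id, low_limit_id, active_ids):
    """Sketch of ReadView::changes_visible. active_ids is the sorted list of
    transaction ids that were active when the view was created; up_limit_id
    is the smallest of them, low_limit_id the next id yet to be assigned."""
    # Fast path: committed before the view was created, or our own change.
    if trx_id < up_limit_id or trx_id == creator_trx_id:
        return True
    # Fast path: started after the view was created.
    if trx_id >= low_limit_id:
        return False
    if not active_ids:
        return True
    # Visible only if it was NOT active at view creation (std::binary_search).
    i = bisect.bisect_left(active_ids, trx_id)
    return not (i < len(active_ids) and active_ids[i] == trx_id)

# A view created by trx 11 while trx 10 and 12 were still active.
view = dict(creator_trx_id=11, up_limit_id=10, low_limit_id=15, active_ids=[10, 12])
print(changes_visible(5, **view))   # committed in the past: True
print(changes_visible(12, **view))  # still active at view creation: False
print(changes_visible(13, **view))  # committed before the view: True
print(changes_visible(20, **view))  # from the future: False
```

<p>Note how a transaction id between two active ids (13 here) is still visible: it committed before the view was created even though a lower id was still running.</p>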
<p>Once we establish that the current record isn’t visible to the current <code class="language-plaintext highlighter-rouge">ReadView</code>, we’d go down the rabbit hole of checking the undo log:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">if</span> <span class="p">(</span><span class="n">srv_force_recovery</span> <span class="o"><</span> <span class="mi">5</span> <span class="o">&&</span>
<span class="o">!</span><span class="n">lock_clust_rec_cons_read_sees</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">offsets</span><span class="p">,</span>
<span class="n">trx_get_read_view</span><span class="p">(</span><span class="n">trx</span><span class="p">)))</span> <span class="p">{</span>
<span class="n">rec_t</span> <span class="o">*</span><span class="n">old_vers</span><span class="p">;</span>
<span class="cm">/* The following call returns 'offsets' associated with 'old_vers' */</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">row_sel_build_prev_vers_for_mysql</span><span class="p">(</span>
<span class="n">trx</span><span class="o">-></span><span class="n">read_view</span><span class="p">,</span> <span class="n">clust_index</span><span class="p">,</span> <span class="n">prebuilt</span><span class="p">,</span> <span class="n">rec</span><span class="p">,</span> <span class="o">&</span><span class="n">offsets</span><span class="p">,</span> <span class="o">&</span><span class="n">heap</span><span class="p">,</span>
<span class="o">&</span><span class="n">old_vers</span><span class="p">,</span> <span class="n">need_vrow</span> <span class="o">?</span> <span class="o">&</span><span class="n">vrow</span> <span class="o">:</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">&</span><span class="n">mtr</span><span class="p">,</span>
<span class="n">prebuilt</span><span class="o">-></span><span class="n">get_lob_undo</span><span class="p">());</span>
</code></pre></div></div>
<p>It simply calls <code class="language-plaintext highlighter-rouge">row_vers_build_for_consistent_read</code>, which loops to scan the undo log backwards from the record:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dberr_t</span> <span class="nf">row_vers_build_for_consistent_read</span><span class="p">(</span>
<span class="k">const</span> <span class="n">rec_t</span> <span class="o">*</span><span class="n">rec</span><span class="p">,</span> <span class="n">mtr_t</span> <span class="o">*</span><span class="n">mtr</span><span class="p">,</span> <span class="n">dict_index_t</span> <span class="o">*</span><span class="n">index</span><span class="p">,</span> <span class="n">ulint</span> <span class="o">**</span><span class="n">offsets</span><span class="p">,</span>
<span class="n">ReadView</span> <span class="o">*</span><span class="n">view</span><span class="p">,</span> <span class="n">mem_heap_t</span> <span class="o">**</span><span class="n">offset_heap</span><span class="p">,</span> <span class="n">mem_heap_t</span> <span class="o">*</span><span class="n">in_heap</span><span class="p">,</span>
<span class="n">rec_t</span> <span class="o">**</span><span class="n">old_vers</span><span class="p">,</span> <span class="k">const</span> <span class="n">dtuple_t</span> <span class="o">**</span><span class="n">vrow</span><span class="p">,</span> <span class="n">lob</span><span class="o">::</span><span class="n">undo_vers_t</span> <span class="o">*</span><span class="n">lob_undo</span><span class="p">)</span> <span class="p">{</span>
<span class="n">trx_id</span> <span class="o">=</span> <span class="n">row_get_rec_trx_id</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">);</span>
<span class="n">version</span> <span class="o">=</span> <span class="n">rec</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
<span class="cm">/* If purge can't see the record then we can't rely on
the UNDO log record. */</span>
<span class="n">trx_undo_prev_version_build</span><span class="p">(</span><span class="n">rec</span><span class="p">,</span> <span class="n">mtr</span><span class="p">,</span> <span class="n">version</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">,</span> <span class="n">heap</span><span class="p">,</span>
<span class="o">&</span><span class="n">prev_version</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="n">vrow</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">lob_undo</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">prev_version</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="cm">/* It was a freshly inserted version */</span>
<span class="o">*</span><span class="n">old_vers</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">*</span><span class="n">offsets</span> <span class="o">=</span> <span class="n">rec_get_offsets</span><span class="p">(</span><span class="n">prev_version</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">,</span> <span class="n">ULINT_UNDEFINED</span><span class="p">,</span>
<span class="n">offset_heap</span><span class="p">);</span>
<span class="n">trx_id</span> <span class="o">=</span> <span class="n">row_get_rec_trx_id</span><span class="p">(</span><span class="n">prev_version</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">view</span><span class="o">-></span><span class="n">changes_visible</span><span class="p">(</span><span class="n">trx_id</span><span class="p">,</span> <span class="n">index</span><span class="o">-></span><span class="n">table</span><span class="o">-></span><span class="n">name</span><span class="p">))</span> <span class="p">{</span>
<span class="cm">/* The view already sees this version: we can copy
it to in_heap and return */</span>
<span class="n">buf</span> <span class="o">=</span>
<span class="k">static_cast</span><span class="o"><</span><span class="n">byte</span> <span class="o">*></span><span class="p">(</span><span class="n">mem_heap_alloc</span><span class="p">(</span><span class="n">in_heap</span><span class="p">,</span> <span class="n">rec_offs_size</span><span class="p">(</span><span class="o">*</span><span class="n">offsets</span><span class="p">)));</span>
<span class="o">*</span><span class="n">old_vers</span> <span class="o">=</span> <span class="n">rec_copy</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="n">prev_version</span><span class="p">,</span> <span class="o">*</span><span class="n">offsets</span><span class="p">);</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">version</span> <span class="o">=</span> <span class="n">prev_version</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The code is simplified to make it more readable:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">trx_undo_prev_version_build</code> reads the previous undo log record into prev_version
<ul>
<li>If we reached the end, just exit the loop. By definition this must be a row INSERTed after our snapshot was taken; otherwise there would be at least one visible record in the undo log chain containing the original value.</li>
</ul>
</li>
<li>Retrieve the <code class="language-plaintext highlighter-rouge">trx_id</code> of <code class="language-plaintext highlighter-rouge">prev_version</code></li>
<li>See if the trx_id is visible in the view
<ul>
<li>If yes, copy it and assign to <code class="language-plaintext highlighter-rouge">old_vers</code></li>
<li>Otherwise keep looping</li>
</ul>
</li>
</ul>
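<p>The loop above can be modeled in a few lines of Python (a toy sketch where each version simply links to its predecessor; the real code reconstructs each previous version from undo log records):</p>

```python
def build_for_consistent_read(latest, is_visible):
    """Toy model of row_vers_build_for_consistent_read.

    Precondition (as in InnoDB): 'latest' itself is already known to be
    invisible. Walk backwards until a visible version is found, or the
    chain ends (the row was inserted after our snapshot)."""
    version = latest
    while True:
        prev = version["prev"]          # trx_undo_prev_version_build
        if prev is None:
            return None                 # freshly inserted version
        if is_visible(prev["trx_id"]):
            return prev                 # first version our snapshot can see
        version = prev

# Chain: trx 9 inserted 'a', then trx 12 updated it to 'b'.
v1 = {"trx_id": 9, "data": "a", "prev": None}
v2 = {"trx_id": 12, "data": "b", "prev": v1}

# A snapshot that sees only transactions with id below 10 gets 'a'.
print(build_for_consistent_read(v2, lambda t: t < 10)["data"])
# A snapshot that sees nothing gets None - the row didn't exist yet.
print(build_for_consistent_read(v2, lambda t: False))
```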
<h2 id="whats-next">What’s next</h2>
<p>I’m planning to write more about MySQL / RocksDB / MyRocks / InnoDB and have a bunch of notes in my backlog. I was thinking about making this into a series, but I’ve realized I’ll never have time to write a cohesive series about any of these topics given their scope. So I’ll just write about whatever I’m researching and get it out, and forget about the whole series thing. Hopefully this way I’ll actually get more done.</p>
Trying and setting up WSL 22020-05-29T00:00:00+00:002020-05-29T00:00:00+00:00http://yizhang82.dev/wsl2-setup<p>The year of Linux desktop has finally come. It’s Windows + WSL 2. Seriously.</p>
<p>I use an MBP 16 for my daily work and SSH into Linux machines for development/testing. While it’s a fantastic machine (and the trackpad is second to none), I just hate how Apple tries to lock down the system so much that even getting gdb to work is a nightmare, and running any simple script makes it phone home for validation.</p>
<p>So I tried installing Linux on my machines. I do have a personal laptop, an X1 Carbon Gen 7, but it doesn’t work well with Linux: mostly, Linux just doesn’t like the 4-channel Dolby surround speakers - they sound like something from a tin can and the volume is much lower, while on Windows the sound is actually pretty nice (for a laptop, of course). I have spent countless hours on it and I’ve seen many people struggling with the same issues. There are also occasional hiccups with suspend/resume, but I can live with that. I also have a powerful gaming PC on which I mostly play games. WSL sounds like a perfect solution for those machines: I can use Windows for its compatibility / games, while also using it for development / tinkering on Linux. Yes, you could dual boot or install a Linux VM, but the integration between WSL 2 and Windows seems pretty nice to me, so I decided to try it out - and now all my Windows machines have WSL 2 installed.</p>
<p>Setting it up is not too bad - you do need to follow the <a href="https://docs.microsoft.com/en-us/windows/wsl/install-win10">official instructions</a> to install it, which I’m not going to repeat here. The installation experience was fairly smooth, though it requires multiple steps.</p>
<p>However, getting it to work properly requires a bit of extra work. Once you set it up, it’s pretty much all I ever needed. Here is what it looks like when I’m done:</p>
<p><img src="/imgs/wsl2-terminal.png" alt="WSL_terminal" /></p>
<!--more-->
<h2 id="install-wsl-remote-extension-on-vscode">Install WSL Remote extension on VSCode</h2>
<p>When you launch VS Code, it’ll automatically prompt you to install the WSL Remote extension. Once the installation is done, just open code from WSL:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>code &lt;folder&gt;
</code></pre></div></div>
<p>Once you do that, it’ll install VSCode Server automatically and launch VS code pointing to that folder. And you can browse through the code as usual.</p>
<p>And the best part is, once you install corresponding remote version of the extension (for example, C++ Extension), IntelliSense works! The installation of remote extension is a bit tricky - you need to find your extension again, and click the little green button “Install in WSL Ubuntu”.</p>
<p>You can refer to the <a href="https://code.visualstudio.com/docs/remote/wsl">official doc</a> for more details.</p>
<h2 id="moving-it-to-another-disk">Moving it to another disk</h2>
<p>By default, WSL forces you to install it on the <code class="language-plaintext highlighter-rouge">C:</code> drive, which makes no sense whatsoever in 2020. I suppose this is a Windows Store thing. Fortunately, there is a <a href="https://github.com/pxlrbt/move-wsl">move-wsl</a> tool available in github. There is a powershell script and a simple batch file. I’m going to use the batch file:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Move ubuntu distro to D:\vm</span>
move-wsl.bat ubuntu D:<span class="se">\v</span>m
</code></pre></div></div>
<p>It’ll move Ubuntu distro to <code class="language-plaintext highlighter-rouge">D:\vm</code>, and that’s basically a huge <code class="language-plaintext highlighter-rouge">ext4.vhdx</code> file.</p>
<p>Once you launch WSL again, you may find the default user has become root. Don’t worry, just put the following into <code class="language-plaintext highlighter-rouge">/etc/wsl.conf</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[user]
default=YOUR_USERNAME
</code></pre></div></div>
<p>And go back to a windows prompt to terminate the running WSL ubuntu instance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wsl -t ubuntu
</code></pre></div></div>
<p>The next time when you launch WSL you’ll be going back as your normal self.</p>
<h2 id="limiting-memory-growth">Limiting memory growth</h2>
<p>By default WSL 2 is set up to consume up to 80% of system memory, which is way too high. On my 16GB laptop I’m setting this to 6GB (8GB is still too high with a few Chrome tabs and VSCode open side by side). As far as I can tell this is due to caching - Linux will happily use all the memory it can for caches, but when Windows needs that memory there is no way for Linux to know, unless you force Linux to reclaim unused memory more aggressively (see this <a href="https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/">article</a> for more details). I’m hoping for a better long-term solution where the two OSes can talk to each other in some way to negotiate memory usage. Until that happens, you’ll need to write the following to <code class="language-plaintext highlighter-rouge">%USERPROFILE%\.wslconfig</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[wsl2]
memory=6GB
swap=0
localhostForwarding=true
</code></pre></div></div>
<p>If you are using this on a workstation with 32GB+ memory, you might not need this. Though it is still likely that it’ll happily consume everything when you do some heavy processing like compiling source code with 24 cores.</p>
<h2 id="terminal">Terminal</h2>
<p><a href="https://docs.microsoft.com/en-us/windows/terminal/">Windows Terminal</a> is a modern terminal that supports different shells like the Ubuntu shell, cmd, PowerShell, etc. I find it works well with zsh/tmux and supports color themes and good font rendering, so that’s the one I’m using right now.</p>
<p>I’ve set it up with <a href="https://design.ubuntu.com/font/">Ubuntu Mono font</a> and <a href="https://github.com/mbadolato/iTerm2-Color-Schemes/blob/master/windowsterminal/Afterglow.json">Afterglow</a> theme so it looks fairly close to a Terminal under linux.</p>
<h2 id="setting-up-git-credentials">Setting up git credentials</h2>
<p>Because there is no desktop support there, you can’t use libsecret which uses dbus. If you set it up, you’ll eventually run into this error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>** (process:7902): CRITICAL **: could not connect to Secret Service: Cannot autolaunch D-Bus without X11 $DISPLAY
</code></pre></div></div>
<p>Fortunately, given this is windows and WSL supports Windows Interop, you can just use <code class="language-plaintext highlighter-rouge">git-credential-manager.exe</code> which works surprisingly well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git config --global credential.helper "/mnt/c/Program\ Files/Git/mingw64/libexec/git-core/git-credential-manager.exe"
</code></pre></div></div>
<h2 id="docker">Docker</h2>
<p>You can install docker as usual, but whenever you try to launch any container you’ll get this error:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
</code></pre></div></div>
<p>This is because there is no systemd installed, so the docker daemon isn’t launched automatically. You can still start it in the good old System V style:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo service docker start
</code></pre></div></div>
<h2 id="copying-text-to-clipboard-in-tmux">Copying text to clipboard in tmux</h2>
<p>Under regular Linux you could just use <code class="language-plaintext highlighter-rouge">xsel</code> / <code class="language-plaintext highlighter-rouge">xclip</code>, which isn’t an option here as there is no X server installed. Again, because there is Windows interop, you can just use <code class="language-plaintext highlighter-rouge">clip.exe</code>!</p>
<p>You can set it up in tmux so that it integrates with your clipboard.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bind-key -Tcopy-mode-vi 'y' send -X copy-pipe "clip.exe"
</code></pre></div></div>
<p>I have a script that auto-detects Linux/Mac/WSL and picks the correct copy tool, available <a href="https://github.com/yizhang82/dotfiles/blob/master/utils/copy">in github</a> and based on https://github.com/Parth/dotfiles/blob/master/utils/copy.</p>
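<p>The detection itself boils down to checking <code class="language-plaintext highlighter-rouge">/proc/version</code> for a Microsoft signature. A hypothetical Python version of the same idea (my actual script is shell; the function and return values here are made up for illustration):</p>

```python
def pick_copy_tool(sys_platform, proc_version):
    """Pick a clipboard tool the way an auto-detecting copy script might:
    WSL looks like Linux, but /proc/version mentions Microsoft."""
    if sys_platform == "darwin":
        return "pbcopy"        # macOS
    if "microsoft" in proc_version.lower():
        return "clip.exe"      # WSL: use Windows interop
    return "xsel"              # plain Linux with X

print(pick_copy_tool("linux", "Linux version 4.19.128-microsoft-standard"))
print(pick_copy_tool("linux", "Linux version 5.4.0-42-generic"))
print(pick_copy_tool("darwin", ""))
```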
<h2 id="my-overall-impression">My overall impression</h2>
<p>WSL 2 is really a game changer. WSL 1 was a good start, but since it was built by implementing Linux syscalls on top of Windows (interop, basically), compatibility was a big issue, and it’s hard to be productive when you can hardly trust your environment. With WSL 2 you can run Windows and Linux literally side by side and have them talk to each other through WSL interop, so you really get the best of both worlds: the compatibility of Windows (Linux on laptops is still quite a hassle, especially on newer hardware) and the fantastic open-source dev environment of Linux. There is some trade-off, but it’s worth it.</p>
SWIG and Python3 unicode2019-08-15T00:00:00+00:002019-08-15T00:00:00+00:00http://yizhang82.dev/python3-utf8-swig<p>Anyone familiar with Python probably knows its history of Unicode support. If you add Python3, Unicode, and SWIG together, imagine what might go wrong?</p>
<h2 id="python3-unicode-swig-and-me">Python3, Unicode, SWIG, and me</h2>
<p>I was debugging a test failure written in Python just now and it is failing with this error:</p>
<blockquote>
<p>Many of the end-to-end tests here are written in Python because it is convenient - no one wants to write C++ code to drive MySQL and our infra service through a series of steps.</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UnicodeEncodeError: 'latin-1' codec can't encode character '\udcfa' in position 293: ordinal not in range(256)
</code></pre></div></div>
<p>The code looks like this:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sql</span> <span class="o">=</span> <span class="n">get_sql_from_some_magic_place</span><span class="p">()</span>
<span class="n">decoded_sql</span> <span class="o">=</span> <span class="n">cUnescape</span><span class="p">(</span><span class="n">sql</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">))</span>
<span class="n">decoded_sql_str</span> <span class="o">=</span> <span class="n">decoded_sql</span><span class="p">.</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
<span class="n">execute</span><span class="p">(</span><span class="n">decoded_sql_str</span><span class="p">)</span>
</code></pre></div></div>
<p>The code seems straight-forward enough. The offending string looks like this: <code class="language-plaintext highlighter-rouge">b"SELECT from blah WHERE col='\\372'</code>.</p>
<p>This string was originally escaped by <code class="language-plaintext highlighter-rouge">folly::cEscape</code>, which does something rather simple - it converts the string to a C representation where backslashes are doubled and any non-printable characters are escaped as octal. This is convenient, as the escaped strings are plain ASCII and can safely be passed around without worrying about encoding.</p>
<blockquote>
<p>folly is Facebook’s open source standard C++ library collection. See https://github.com/facebook/folly for more information.</p>
</blockquote>
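<p>As a rough Python illustration of this escaping scheme (an approximation for intuition only - folly’s real implementation handles a few more characters):</p>

```python
def c_escape(data: bytes) -> str:
    """Rough approximation of folly::cEscape: keep printable ASCII,
    double backslashes, octal-escape everything else."""
    out = []
    for b in data:
        if b == 0x5C:               # backslash gets doubled
            out.append("\\\\")
        elif 0x20 <= b < 0x7F:      # printable ASCII passes through
            out.append(chr(b))
        else:                       # everything else as \NNN octal
            out.append("\\%03o" % b)
    return "".join(out)

print(c_escape(b"col='\xfa'"))  # the 0xfa byte becomes \372
```

<p>The output is always plain ASCII, which is exactly why these strings are safe to pass around.</p>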
<p>It is convenient, until you need to call from Python, for which you’ll need to use SWIG:</p>
<blockquote>
<p>If you don’t know SWIG - just think of it as a tool that generates Python wrappers for C++ code so that it can be called from Python code; in this case, folly::cUnescape. Go to http://www.swig.org/ to learn more. Many languages have an equivalent tool/feature built in: P/Invoke in C#, cgo in Go, JNI in Java, etc.</p>
</blockquote>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>std::string cUnescape(const std::string& a) {
std::string b;
folly::cUnescape(a, b);
return b;
}
</code></pre></div></div>
<p>I was scratching my head trying to understand what was happening, as there is no way the strings should be converted to ‘\udcfa’, until I realized <code class="language-plaintext highlighter-rouge">cUnescape</code> might be at fault.</p>
<p>It turns out, SWIG expects UTF-8 strings and returns UTF-8 strings back. The escaped text “\372” is plain ASCII, so it converts to UTF-8 without any trouble, but once unescaped it becomes the single byte 0xfa, which is not valid UTF-8 on its own and gets decoded with a surrogate escape:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sa">b</span><span class="s">"</span><span class="se">\372</span><span class="s">"</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"surrogateescape"</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
</code></pre></div></div>
<p>And you get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>UnicodeEncodeError: 'latin-1' codec can't encode character '\udcfa' in position 0: ordinal not in range(256)
</code></pre></div></div>
<h2 id="the-fix">The fix</h2>
<p>To fix the problem, you can encode the buffer again with surrogateescape:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="sa">b</span><span class="s">"</span><span class="se">\372</span><span class="s">"</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"surrogateescape"</span><span class="p">).</span><span class="n">encode</span><span class="p">(</span><span class="s">"utf-8"</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">"surrogateescape"</span><span class="p">).</span><span class="n">decode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
<span class="s">'ú'</span>
</code></pre></div></div>
<p>Seems rather backwards, isn’t it? Why not just stop messing with the strings?</p>
<p>That’s exactly what was discussed in SWIG doc here: http://www.swig.org/Doc4.0/Python.html#Python_nn77. There is a magic macro you can use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>%module char_to_bytes
%begin %{
#define SWIG_PYTHON_STRICT_BYTE_CHAR
%}
std::string cUnescape(const std::string& a) {
std::string b;
folly::cUnescape(a, b);
return b;
}
</code></pre></div></div>
<p>And the original code can be changed to:</p>
<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sql</span> <span class="o">=</span> <span class="n">get_sql_from_some_magic_place</span><span class="p">()</span>
<span class="n">decoded_sql</span> <span class="o">=</span> <span class="n">cUnescape</span><span class="p">(</span><span class="n">sql</span><span class="p">).</span><span class="n">decode</span><span class="p">(</span><span class="s">"latin-1"</span><span class="p">)</span>
<span class="n">execute</span><span class="p">(</span><span class="n">decoded_sql</span><span class="p">)</span>
</code></pre></div></div>
<p>Much simpler too.</p>
<p>I’m just happy that I mostly write C++ instead of Python…</p>