Eliminating branches in Rust for fun... but not much profit
Last week, I nerd-sniped myself after reading a blog post about performance of parser combinators in Rust, and the associated Reddit discussion.
The benchmark in the post was parsing a bunch of HTTP requests. The focus was on improving the performance of a function that determines whether a character forms part of a token. In the post, it was changed from
fn is_token(c: u8) -> bool {
    c < 128 && c > 31 && b"()<>@,;:\\\"/[]?={} \t".iter()
        .position(|&i| i == c).is_none()
}
which iterates over a list of characters, to
fn is_token(c: u8) -> bool {
    // roughly follows the order of ascii chars: "\"(),/:;<=>?@[\\]{} \t"
    c < 128 && c > 32 && c != b'\t' && c != b'"' && c != b'(' && c != b')' &&
        c != b',' && c != b'/' && !(c > 57 && c < 65) && !(c > 90 && c < 94) &&
        c != b'{' && c != b'}'
}
which unrolls the loop into a big boolean conjunction.
On my machine, the benchmark with the original version takes about 106 microseconds to parse a text file containing about 50 HTTP requests. The newer version takes 73 microseconds, which is about 30% faster.
After reading this, I started thinking something like “I know about CPUs, and branches are bad because of mispredictions or something”, and “LOOK &&
AT &&
ALL &&
THOSE &&
BRANCHES!”, and imagining fame and fortune for making it even faster.
This led me down a path of trying to replace that function with a straight-through series of bitwise operations. First off, let’s unpack what’s actually being checked in this function:
- c < 128: that c represents an ASCII character
- c > 32: that c is not a control character or space
- all the != comparisons: that c isn’t any of those characters
- !(c > 57 && c < 65): that c is none of :;<=>?@
- !(c > 90 && c < 94): that c is none of [\]
Eliminate (almost) ALL THE BRANCHES
My basic idea was to replace all the equality checks with XOR, and all the boolean ands with bitwise ands. Boolean ands (&&) and ors (||) have short-circuiting semantics: their right hand sides are only evaluated if necessary. This means that && has to be implemented as a conditional jump to avoid evaluating the right hand side if the left hand side is false.
To check I was making any sense, I used the fantastic Godbolt interactive compiler, which has Rust support. This confirmed that at least some of the boolean ands were compiling to conditional jumps, which was enough for me to go down the bitwise operation path.
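To give a flavour of where this was heading, here’s a simplified sketch of the translation. This isn’t my exact code; the ne helper and its zero-or-one trick are just one way to spell the idea:

fn is_token_branchless(c: u8) -> bool {
    // 1 if c differs from x, 0 if equal: c ^ x is zero only on equality,
    // and (d | -d) has its top bit set exactly when d is nonzero.
    fn ne(c: u8, x: u8) -> u32 {
        let d = (c ^ x) as u32;
        (d | d.wrapping_neg()) >> 31
    }
    let bits = ((c < 128) as u32)
        & ((c > 32) as u32)
        & (!((c > 57) & (c < 65)) as u32)
        & (!((c > 90) & (c < 94)) as u32)
        & ne(c, b'\t') & ne(c, b'"') & ne(c, b'(') & ne(c, b')')
        & ne(c, b',') & ne(c, b'/') & ne(c, b'{') & ne(c, b'}');
    // the only comparison left is this final test of the combined bits
    bits != 0
}

Every check becomes a value that’s 0 or 1, and everything is combined with plain &, so there’s no short circuiting and nothing for the compiler to turn into a jump until the very end.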
After spending a bunch of time on translating it to be purely bitwise operations, I ended up with this performance result:
test bench_http ... bench: 230,030 ns/iter (+/- 6,303)
About three times slower. This was… disappointing. Especially because along the way, a broken benchmark setup had me convinced I sped things up by 60%!
Why so slow?
I talked about my results with Dan Luu, since he’s written about this kind of thing before. He mentioned that on Haswell, the branch misprediction penalty is about 14 cycles, while optimally organised bitwise operations can execute at about 4 per cycle.
The original version had about 30 instructions, of which about 6 were branches. Assuming one instruction per cycle, a really bad branch predictor miss rate like 50% would put it somewhere around 75 cycles. The bitwise version had about 130 instructions, of which only one was a branch. Assuming the branch predictor is always right, that still puts me at 40+ cycles if 3 instructions execute in parallel.
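Spelled out, the back-of-envelope arithmetic is:

original: ~30 instructions at 1/cycle + (6 branches × 50% missed × 14 cycles) ≈ 30 + 42 ≈ 72 cycles
bitwise:  ~130 instructions at 3/cycle ≈ 43 cycles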
In summary, unless the branch predictor was abysmally bad at this code, or I was amazingly excellent at organizing the bitwise operations, there was no strong reason to expect my bitwise version to be faster.
This got me to run perf stat on the benchmark to get a look at the branch prediction hit rate. This turned out to be above 99.5%, which is far from ‘abysmal’.
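For the record, perf can report this directly; the event names are standard, though the benchmark binary path here is just a placeholder:

perf stat -e branches,branch-misses ./path/to/benchmark-binary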
Another approach: a lookup table
The discussion also suggested implementing this as a lookup table. This would reduce is_token to a single load, which can be inlined wherever it’s used. Rust represents boolean values as bytes, so the table will be 256 bytes, which should mean we’ll rarely have to go too far down the cache hierarchy.
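Here’s a sketch of that approach, assuming a recent enough Rust to build the table in a const initializer; a build script or one-time runtime initialization would do the same job on older compilers:

const fn token_char(c: u8) -> bool {
    // the same predicate as before, spelled out so the compiler can
    // evaluate it at compile time
    c > 32 && c < 128
        && !matches!(c, b'(' | b')' | b'<' | b'>' | b'@' | b',' | b';' | b':'
            | b'\\' | b'"' | b'/' | b'[' | b']' | b'?' | b'=' | b'{' | b'}')
}

// 256 bytes: one bool per possible input byte
static TOKEN_TABLE: [bool; 256] = {
    let mut table = [false; 256];
    let mut c = 0;
    while c < 256 {
        table[c] = token_char(c as u8);
        c += 1;
    }
    table
};

#[inline]
fn is_token(c: u8) -> bool {
    // the whole check is now a single indexed load
    TOKEN_TABLE[c as usize]
}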
Indeed, implementing that gave a modest 6-7% performance improvement over the version that sent me off on this investigation:
test bench_http ... bench: 67,890 ns/iter (+/- 1,913)
So at least there was a somewhat satisfactory result at the end!
The moral of the story: validate assumptions, and know more numbers!
There were a couple of big takeaways for me. The first is that I could have saved myself a whole bunch of time if I’d investigated the branch prediction performance before embarking on this adventure. Instead of assuming the branching was slowing things down, I would have known that it probably wasn’t.
The other was that knowing some performance numbers offhand could have helped a lot here. I have a rough idea of L1 / L2 latency, and main memory latency. But I had no idea how many bitwise operations to expect per cycle, or how bad a branch misprediction was. I’ve now added these to my collection, but knowing a few more rough figures like that would certainly be good!
Thanks to Dan Luu, Anton Dubrau, and David Turner for useful and interesting discussions, and to Julia Evans for reading drafts of this post.