Some early Linux IPC latency data
I’ve added benchmarks for UNIX domain sockets and TCP sockets over the loopback interface. UNIX domain sockets were super easy to implement thanks to the handy `socketpair` function; they weren’t really any different from pipes. The difference is that since sockets are full duplex, you only need to create one pair. If the processes were unrelated, or if I wanted to be able to accept multiple connections, it would be much more like TCP sockets, i.e. a pain!
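Not the benchmark code itself, but a minimal sketch of what that `socketpair` path looks like (the one-byte ping is just for illustration, and error handling is mostly omitted):

```c
/* Minimal sketch of the socketpair() setup described above; not the
 * benchmark code itself. Error handling is trimmed to perror/exit. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int sv[2]; /* sv[0] for the parent, sv[1] for the child */

    /* One full-duplex pair is enough; with pipes you'd need two. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) {
        perror("socketpair");
        exit(EXIT_FAILURE);
    }

    if (fork() == 0) {          /* child: echo one byte back */
        char b;
        read(sv[1], &b, 1);
        write(sv[1], &b, 1);
        _exit(0);
    }

    char b = 'x';               /* parent: ping and wait for the echo */
    write(sv[0], &b, 1);
    read(sv[0], &b, 1);
    return 0;
}
```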
I say TCP sockets are a pain because, in doing this, I ‘found out’ that, despite having written a non-zero number of server applications, I’d never done socket programming before. This wasn’t exactly a surprise, but it was definitely interesting to realise how little I knew about how to go about it. Luckily, man pages! (And Advanced Programming in the UNIX Environment.)
Here’s the quick tl;dr for TCP over IPv4 (a rough sketch of both sides follows the list):

- to listen for incoming connections:
  - create a socket with `socket(AF_INET, SOCK_STREAM, 0 /* default protocol */)`.[^1]
  - bind it to a port with `bind(sockfd, addr, addrlen)`, where `addr` is a struct that specifies the address to bind to. For `AF_INET`, this means the IP and port. In my case, I used `INADDR_LOOPBACK` and `0` to listen on some available port on `127.0.0.1`.[^2]
  - start listening on the socket with `listen(sockfd, 1 /* backlog */)`. I used a backlog of 1 because I only expect a single incoming connection.
  - finally, call `accept(sockfd, NULL /* addr */, NULL /* addrlen */)` to block until a connection comes in, which returns a new file descriptor to talk to the connecting process. I pass in `NULL` for the `addr` because I don’t care who’s talking to me!
- to connect to another process that’s listening:
  - create a socket with `socket(AF_INET, SOCK_STREAM, 0 /* default protocol */)`.
  - connect to the remote process with `connect(sockfd, addr, addrlen)`. The `addr` specifies the address to connect to; again, for `AF_INET` this means the IP and port.
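Putting those calls together, here’s a rough sketch of both sides over loopback in a single program. It’s an illustration rather than the benchmark code: it uses a fixed port instead of binding port 0, and skips error handling.

```c
/* Rough sketch of the listen/connect steps above, with both sides in one
 * process via fork(). Not the benchmark code; error handling is minimal. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT 9000  /* fixed port for the sketch; the real code binds port 0 */

int main(void) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* 127.0.0.1, network order */
    addr.sin_port = htons(PORT);                   /* also network byte order */

    int listenfd = socket(AF_INET, SOCK_STREAM, 0 /* default protocol: TCP */);
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 1 /* backlog */);

    if (fork() == 0) {
        /* child: connect to the listener and send one byte */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        char b = 'x';
        write(fd, &b, 1);
        _exit(0);
    }

    /* parent: block until the child connects, then read its byte */
    int connfd = accept(listenfd, NULL /* addr */, NULL /* addrlen */);
    char b;
    read(connfd, &b, 1);
    return 0;
}
```

When binding port 0 to pick a free port, the listener would also need to discover which port it got (e.g. via `getsockname`) before the other side could connect to it.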
This brings me up to having programs to test latency for four IPC mechanisms:

- pipes
- eventfd
- UNIX domain sockets
- TCP sockets over the loopback interface
Here is some early latency data from my machine, with emphasis on the tail latencies:
| Percentile | 50 | 75 | 90 | 99 | 99.9 | 99.99 | 99.999 |
|---|---|---|---|---|---|---|---|
| pipes | 4255 | 4960 | 5208 | 5352 | 7814 | 16214 | 31290 |
| eventfd | 4353 | 4443 | 4760 | 5053 | 9445 | 14573 | 68528 |
| af_unix | 1439 | 1621 | 1655 | 1898 | 2681 | 11512 | 54714 |
| af_inet_loopback | 7287 | 7412 | 7857 | 8573 | 17412 | 20515 | 37019 |
Units are nanoseconds. Time is measured using `clock_gettime` with `CLOCK_MONOTONIC`. The quantiles are for a million measurements; in all cases, the binary was run with the flags `--warmup-iters=10000 --iters=1 --repeat=1000000` (see below).
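For reference, the measurement boils down to something like the following; this is a sketch of the approach, not necessarily the exact code in the repo:

```c
/* Sketch of the timing approach: read CLOCK_MONOTONIC before and after the
 * round trip and take the difference in nanoseconds. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    uint64_t start = now_ns();
    /* ... write to the other process and block on its reply ... */
    uint64_t elapsed = now_ns() - start;
    printf("%llu ns\n", (unsigned long long)elapsed);
    return 0;
}
```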
For me, the biggest surprise was how much faster UNIX domain sockets were than anything else, and in particular how much faster they are than eventfd. Or that they are faster at all. The `read` call in each case blocks until a corresponding `write`. I would have thought eventfd had the minimal amount of extra work beyond that, since all it does is read and modify a `uint64_t`. In fairness, each of the other programs is writing a single byte at present, but I doubt that accounts for such a drastic difference.
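To show what I mean, the eventfd mechanism really is just a counter: `read` blocks until a `write` makes the `uint64_t` non-zero. Here’s a minimal sketch (not the benchmark code, which presumably needs one eventfd per direction to measure a round trip):

```c
/* Sketch of the eventfd mechanism: the counter is a uint64_t, and read()
 * blocks until a write() makes it non-zero. Not the benchmark code. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void) {
    int efd = eventfd(0 /* initial value */, 0 /* flags */);

    if (fork() == 0) {                   /* child: signal the parent */
        uint64_t one = 1;
        write(efd, &one, sizeof(one));   /* adds 1 to the counter */
        _exit(0);
    }

    uint64_t val;
    read(efd, &val, sizeof(val));        /* blocks until the counter != 0 */
    return 0;
}
```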
Another fun thing is to see the difference in latency when pinning the two processes to specific CPUs. My machine has a dual-core processor, where each core has 2 hardware threads. Here’s a quick look at latencies for pipes with different CPU affinities:
| Percentile | 50 | 75 | 90 | 99 | 99.9 | 99.99 | 99.999 |
|---|---|---|---|---|---|---|---|
| default | 4255 | 4960 | 5208 | 5352 | 7814 | 16214 | 31290 |
| same CPU | 2386 | 2402 | 2564 | 3134 | 12255 | 15126 | 28225 |
| same core | 4232 | 4270 | 4395 | 4788 | 14408 | 17101 | 39052 |
| different core | 5043 | 5101 | 5170 | 5772 | 11894 | 38726 | 398796 |
I was expecting a difference between running on different cores and not, since crossing cores requires a trip through the L3 cache. I have no real idea what size of difference I was expecting, but a microsecond could make sense if multiple locations needed to be accessed. This stuff is beyond my ken, so I’m just guessing.
What I was not expecting was a dramatic difference between ‘same CPU’ and ‘same core’. The CPUs are hardware threads on a single core, and I can’t think of any reason there would be such a difference. I do want to check that it’s not due to scheduling weirdness, so I’ll probably boot up in single-user mode at some point to give it another go.
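For what it’s worth, the pinning itself is just a `sched_setaffinity` call; here’s a minimal sketch (again, an illustration rather than the benchmark code):

```c
/* Minimal sketch of pinning the calling process to one CPU with
 * sched_setaffinity(2). Not the benchmark code. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0 /* this process */, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    pin_to_cpu(0);  /* e.g. parent on CPU 0; the child would pin itself similarly */
    /* ... run the benchmark loop ... */
    return 0;
}
```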
If you want to run these on your own system, clone the repo and run `make`. There will be four binaries produced, one for each of the mechanisms. They all take the same command-line flags:
```
  -c, --child-cpu=CPUID      CPU to run the child on; default is to let the
                             scheduler do as it will
  -i, -n, --iters=COUNT      number of iterations to measure; default: 100000
  -p, --parent-cpu=CPUID     CPU to run the parent on; default is to let the
                             scheduler do as it will
  -r, --repeat=COUNT         number of times to repeat measurement; default: 1
  -w, --warmup-iters=COUNT   number of iterations before measurement; default:
                             1000
  -?, --help                 Give this help list
      --usage                Give a short usage message
```
[^1]: The default protocol for `SOCK_STREAM` for the `AF_INET` socket family is TCP.

[^2]: A fun little thing to be aware of is that the `addr` must contain the IP address in network byte order. This necessitates converting the IP address and port using `htonl` and `htons`, respectively, to convert them from **h**ost **to** **n**etwork byte order (the `l` stands for `long`, which in this case means a `uint32_t` because `long`s used to be shorter; the `s` stands for `short`, which has stayed short at 16 bits).