Today I wrote some code! None to show, just experiments. I’m investigating mmap and its interactions with mprotect and ftruncate. I got some good responses to yesterday’s email thinking about approaches to a memory mapped message builder for Cap’n Proto. In particular a message from Paul Pelzl expanded on my memory protection / SIGSEGV-based idea and suggested playing tricks with MAP_FIXED to repeatedly remap the file as it grows.

I started doing some investigation to see what happens when you mix mmap, mprotect, and ftruncate. Here’s some stuff I found out:

  • it is possible to create a non-zero length mapping with mmap on a zero-length file, e.g. one freshly created with touch
  • on Linux, an attempt to access any part of that mapping raises SIGBUS, because none of it is backed by the file yet
  • a SIGBUS handler can call ftruncate to extend the backing file; the access will then succeed!
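
For illustration, here is a minimal sketch of the kind of experiment described above. It is not the actual experiment code, and it assumes the Linux behaviour noted in the list. The names map_growable and on_sigbus and the g_* globals are all hypothetical. It tracks a single mapping only, and does nothing clever if ftruncate fails or if the SIGBUS has some other cause.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical globals: this sketch handles exactly one mapping. */
static char  *g_base;     /* start of the mapping */
static size_t g_map_len;  /* bytes of address space mapped */
static int    g_fd = -1;  /* backing file descriptor */
static size_t g_page;     /* page size, cached before any fault can occur */

/* On SIGBUS inside our mapping, extend the file to cover the faulting page. */
static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)g_base;
    if (g_fd < 0 || addr < base || addr - base >= g_map_len)
        _exit(1);  /* fault is outside our mapping: give up */
    size_t offset = (size_t)(addr - base);
    off_t needed = (off_t)((offset / g_page + 1) * g_page);
    if (ftruncate(g_fd, needed) != 0)
        _exit(1);
    /* Returning from the handler retries the faulting access. */
}

/* Create (or empty) the file at path, map map_len bytes of it, and install
   the SIGBUS handler. Returns the mapping's base address, or NULL on error. */
char *map_growable(const char *path, size_t map_len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    g_page = (size_t)page;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGBUS, &sa, NULL) != 0)
        return NULL;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    g_base = p;
    g_map_len = map_len;
    g_fd = fd;
    return g_base;
}
```

With this in place, calling map_growable("/tmp/example.bin", 1 << 20) and then storing to byte 5000 of the returned pointer should leave the file rounded up to a whole number of pages covering that offset.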

This suggests an even simpler approach than what emerged on the mailing list:

  1. mmap a huge amount of address space backed by the target file, which starts off empty
  2. allocate an initial segment of 4GB—the maximum in Cap’n Proto’s encoding
  3. have a SIGBUS handler call ftruncate to extend the file whenever an attempt is made to reach beyond the end

An initial page can be set aside for a segment table to allow compatibility with the existing message readers. When the file is closed, the segment size can be written into that table as whatever length was actually set aside on disk. There would likely be an opportunity to shrink the segment down so it only occupies as much space as was actually used; in this case the file could be truncated to that size.
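
As a rough sketch of that close-time step: Cap'n Proto's stream framing for a single-segment message is a little-endian 32-bit "segment count minus one" followed by a little-endian 32-bit segment size in 8-byte words. The code below writes that table and trims the file. It assumes, purely for illustration, that the segment data starts immediately after the 8-byte table rather than after a whole reserved page; finish_message and encode_single_segment_table are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

enum { TABLE_BYTES = 8, WORD_BYTES = 8 };

/* Encode the framing for a message with exactly one segment:
   bytes 0..3 hold (segment count - 1) = 0, little-endian;
   bytes 4..7 hold the segment size in 8-byte words, little-endian. */
static void encode_single_segment_table(uint8_t out[TABLE_BYTES], uint32_t words)
{
    for (int i = 0; i < 4; i++) {
        out[i] = 0;
        out[4 + i] = (uint8_t)(words >> (8 * i));
    }
}

/* Hypothetical close-time step: write the table at offset 0, then cut the
   file down to the table plus the words actually used.
   Assumes the segment data begins at byte TABLE_BYTES. Returns 0 on success. */
int finish_message(int fd, uint32_t used_words)
{
    assert(fd >= 0);
    uint8_t table[TABLE_BYTES];
    encode_single_segment_table(table, used_words);
    if (pwrite(fd, table, sizeof table, 0) != (ssize_t)sizeof table) {
        perror("pwrite");
        return -1;
    }
    off_t final_len = (off_t)TABLE_BYTES + (off_t)used_words * WORD_BYTES;
    if (ftruncate(fd, final_len) != 0) {
        perror("ftruncate");
        return -1;
    }
    return 0;
}
```

If a full page really is reserved ahead of the segment, the offsets here would need adjusting so that a stream reader still finds the segment directly after the table.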

This approach has a few nice features:

  • it’s compatible with the existing flat array and stream based message readers
  • it does not require sparse file support
  • only one system call needs to be made in the signal handler

I still need to do a bit more research. In particular, I’m wondering:

  • will this work on non-Linux systems? The Linux mmap manpage mentions the SIGBUS behaviour, as does the NetBSD one, but the FreeBSD and OS X manpages do not; OpenBSD’s manpage says it will give a SIGSEGV instead, and helpfully points out that POSIX says this situation should be a SIGBUS.
  • how do I keep track of which regions belong to which files? The signal handler has the address that caused the fault, but I’ll have to have some way to look it up.
  • how should this file lookup structure interact with threads? Is it a global table for which we incur some synchronisation cost? Is it a thread-local table, and we require that a message only be written to by the creating thread?
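
To make the lookup question concrete, here is one possible shape for the simplest option: a fixed-size global table scanned linearly. This is only a sketch with hypothetical names (region_register, region_lookup_fd, MAX_REGIONS). It deliberately sidesteps the threading question by assuming all registrations happen before any fault can occur, so the handler only ever reads the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 64  /* arbitrary limit chosen for the sketch */

struct region {
    uintptr_t base;  /* first address of the mapping */
    size_t    len;   /* length of the mapping in bytes */
    int       fd;    /* backing file descriptor */
};

/* Global table with no locking: assumes registration finishes before
   any signal handler might call region_lookup_fd. */
static struct region g_regions[MAX_REGIONS];
static size_t g_region_count;

/* Record a mapping. Returns 0 on success, -1 if the table is full. */
int region_register(void *base, size_t len, int fd)
{
    assert(len > 0);
    if (g_region_count == MAX_REGIONS)
        return -1;
    g_regions[g_region_count].base = (uintptr_t)base;
    g_regions[g_region_count].len = len;
    g_regions[g_region_count].fd = fd;
    g_region_count++;
    return 0;
}

/* Find the file backing addr. Returns its fd, or -1 if no region contains
   it. Uses no locks and no allocation, only a linear scan. */
int region_lookup_fd(const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    for (size_t i = 0; i < g_region_count; i++) {
        /* Unsigned subtraction wraps when a < base, so one comparison
           checks both ends of the range. */
        if (a - g_regions[i].base < g_regions[i].len)
            return g_regions[i].fd;
    }
    return -1;
}
```

A thread-local variant would look the same with the two globals declared thread-local, at the cost of requiring that only the creating thread writes to a message.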

I’m hoping to answer at least the first question tomorrow, mostly for curiosity’s sake. I’m happy to be Linux-only for now; I’m sure that if this doesn’t work on other platforms, one of the other approaches will. After testing this on someone’s Mac, I can start on a proof of concept message builder, which will hopefully inform the design more.