Serious hash table kernel bug (CVE-2019-12817)

A while ago when I was running my Talos II in hashed page table (HPT) mode for KVM-PR purposes, I started noticing some weird kernel behaviour with recent versions of Firefox. In dmesg were weird messages like this:

[337262.237052] ida_free called for id=170 which is not allocated.
[337262.237089] WARNING: CPU: 6 PID: 12276 at lib/idr.c:519 ida_free+0x114/0x1e0

Initially these were annoying but seemed innocuous. But later on I started getting some weird lockups and I wasn't sure what was going on, so as part of an attempt to figure it out I switched the machine back into radix MMU mode (i.e., I removed disable_radix from the kernel arguments) and the problems disappeared. I reported the phenomenon and the backtraces to the good folks at OzLabs, and Michael Ellerman said he'd look into it.

Turns out the problem was a little deeper than just some weird kernel warnings. From Michael's detailed report, way back around 4.17 code was introduced to support mappings above a 512TB address in the event of a segment lookaside buffer miss. (The SLB has been a tricky devil before.) This situation might seem exotic at first glance, but such addresses are eminently possible in a 64-bit address space. The new code enabled a subtle bug: if a process allocates memory in that range with mmap(), and then forks a new process, the child erroneously keeps the parent's "context ID," a handle-like structure, for that memory mapping, and both the child and parent can stomp on the same range of memory. This is obviously bad, but it gets worse when the child process exits, because all of the context IDs it had (including the one it incorrectly inherited) are now freed. That sets up a use-after-free error where a subsequent unrelated process might get that context ID and access the original parent's space, or vice versa. The kernel messages I was seeing were the kernel detecting the situation where both the child and parent exited before a subsequent process got the bogus context ID (and complaining about the double free). The lockups may well have been the cases where the kernel didn't detect it.
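To make the sequence above a little more concrete, here is a toy model in Python. It is emphatically not kernel code: IdAllocator, buggy_fork and exit_process are made-up names, and the idea that a process holds one context ID per address range, with only the first one replaced on fork, is a simplifying assumption for illustration. It only shows how an inherited ID turns into a double free once both processes exit.

```python
class IdAllocator:
    """Toy stand-in for an ID allocator (hypothetical, not the kernel's IDA)."""

    def __init__(self):
        self.allocated = set()
        self.warnings = []

    def alloc(self):
        # Hand out the lowest ID that is not currently in use.
        new_id = 1
        while new_id in self.allocated:
            new_id += 1
        self.allocated.add(new_id)
        return new_id

    def free(self, ident):
        if ident not in self.allocated:
            # Analogue of the "not allocated" warning: a double free.
            self.warnings.append(f"free called for id={ident} which is not allocated")
            return
        self.allocated.remove(ident)


def buggy_fork(ida, parent_ids):
    """Copy the parent's IDs but only replace the first one (the modelled bug)."""
    child_ids = list(parent_ids)
    child_ids[0] = ida.alloc()
    return child_ids


def exit_process(ida, ids):
    """Free every context ID the exiting process believes it owns."""
    for ident in ids:
        ida.free(ident)


ida = IdAllocator()
# Slot 0: the ordinary address range. Slot 1: a mapping above 512TB (assumed layout).
parent = [ida.alloc(), ida.alloc()]
child = buggy_fork(ida, parent)

exit_process(ida, child)   # frees the ID the child wrongly inherited
exit_process(ida, parent)  # the second free of that same ID is caught
print(ida.warnings)
```

If an unrelated process called alloc() between the two exits, it would be handed the ID the parent still thinks it owns, which is the use-after-free case where nothing complains.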

There are currently no known circulating exploits for this flaw but it's pretty clear this could be the basis of a nasty attack if a malicious sort were able to trigger such a mapping and then use it to victimize a subsequent process. All 64-bit POWER and PowerPC systems that use a 64K page size are vulnerable in HPT mode (but the Adelie Linux people can smirk a little here, because they use 4K pages, which are not vulnerable). Prior to POWER9, all systems are HPT, so this means everything is affected from the G5 on up including the PA6T and POWER4/5/6/7/8. Additionally, if you use KVM-PR on your POWER9, emulate a POWER8 guest in KVM-HV, or use KVM-PR within a KVM-HV guest, you are also vulnerable because your machine/guest must be using HPT mode. 32-bit PowerPC systems are unaffected.
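As a rough way to turn the conditions above into a check, here is a small Python sketch. The function name may_be_affected is made up, the machine strings "ppc64" and "ppc64le" are what I'd expect platform.machine() to report on these systems (an assumption), and the check deliberately ignores the kernel version, so it cannot tell you whether your kernel is new enough to have the bug or already patched.

```python
import os
import platform

PAGE_64K = 64 * 1024


def may_be_affected(machine, page_size, mmu_mode):
    """Return True if the three conditions from the post all hold.

    machine:   architecture string, e.g. from platform.machine()
    page_size: page size in bytes, e.g. from os.sysconf("SC_PAGE_SIZE")
    mmu_mode:  "hash" or "radix", determined separately (e.g. from dmesg)
    """
    is_ppc64 = machine in ("ppc64", "ppc64le")  # assumed machine names
    return is_ppc64 and page_size == PAGE_64K and mmu_mode == "hash"


# Report the local values; the MMU mode still has to be checked by hand.
print(platform.machine(), os.sysconf("SC_PAGE_SIZE"))
```

For example, may_be_affected("ppc64le", 65536, "hash") is True, while the same machine with 4K pages or in radix mode comes back False.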

Most POWER9 systems are probably using the radix MMU by default. If yours isn't, you should switch to it, at least temporarily (I haven't switched back to HPT on my own Talos yet, fortunately). dmesg will tell you which mode you're in within the first few lines:

[0.000000] dt-cpu-ftrs: setup for ISA 3000
[0.000000] dt-cpu-ftrs: not enabling: system-call-vectored (disabled or unsupported by kernel)
[0.000000] dt-cpu-ftrs: final cpu/mmu features = 0x0000f86f8f5fb1a7 0x3c006041
[0.000000] radix-mmu: Page sizes from device-tree:

If you see hash tables referenced instead, then you are in HPT mode, and you should remove disable_radix from your kernel arguments and restart.
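If you'd rather script the check than eyeball it, something like the following Python sketch could work. The function names are made up, the "radix-mmu:" prefix is taken from the dmesg output above, and the "hash-mmu:" prefix for HPT mode is my assumption about what the kernel prints there. It also only recognizes the bare disable_radix form of the kernel argument.

```python
def mmu_mode_from_dmesg(text):
    """Guess the MMU mode from dmesg text: "radix", "hash", or "unknown"."""
    for line in text.splitlines():
        if "radix-mmu:" in line:
            return "radix"
        if "hash-mmu:" in line:  # assumed prefix for HPT mode
            return "hash"
    return "unknown"


def radix_disabled_on_cmdline(cmdline):
    """True if the bare disable_radix argument appears on the kernel command line."""
    return "disable_radix" in cmdline.split()


sample = (
    "[0.000000] dt-cpu-ftrs: setup for ISA 3000\n"
    "[0.000000] radix-mmu: Page sizes from device-tree:\n"
)
print(mmu_mode_from_dmesg(sample))
```

On a real system you would feed it the output of dmesg (which may need root) and the contents of /proc/cmdline.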

If you are on a pre-POWER9 system, however, there is currently no effective mitigation, but the good news is that the patch has landed in the kernel tree. It has made it to the RC for 5.1.15, so it should enter Fedora quickly (however, my F30 system is still showing 5.1.12 as current as of 7:30pm Pacific). Update: Red Hat is tracking this as bug 1720616. Debian has an advisory page and is tracking the flaw. Ubuntu is tracking this issue as USN-4031-1 and already has updates.