Juicing QEMU for fun, ??? and profit!


The number of packages and applications natively available for OpenPOWER continues to grow in just about every distro's package manager, and even if a prebuilt package doesn't exist, even more will build from source. But emulation is still going to be a fact of life for Windows-only/x86/x86_64-only (maybe even aarch64-only) binaries we can't rebuild, and KVM only helps us with other Power ISA systems (in fact, it looks like KVM-PR broke and can't boot Mac OS X again, so I guess I'll be diving back into the source), so we need to wring as much speed out of QEMU's emulation engine as possible.

We are fortunate with QEMU in that there is ppc64le support in TCG, the Tiny Code Generator which implements a basic JIT, and the Power ISA TCG backend even emits those tasty newer POWER9 instructions to take better advantage of the processor. Without TCG, QEMU would be dreadfully slow when emulating a foreign architecture. However, unless IBM or some other OpenPOWER hardware developer implements instructions (a la Apple M1) in a future chip that specifically improve emulation of other CPUs (like, I dunno, x86_64), there's very little that can be done to improve the code the Power TCG backend generates, and CPU emulation spends most of its time in that TCG-generated code.

However, the software MMU that QEMU's CPU emulation uses has pre-compiled portions, and all the devices and components QEMU emulates (like the system bus, video, mass storage, USB, etc.) are also pre-compiled. This gives us an opportunity: with a little extra elbow grease, you can make a link-time-optimized and profile-guided-optimized (LTO-PGO) build of QEMU specific to the particular workload which can run the CPU anywhere from 3-8% faster and video and other devices up to 15% faster depending on the set of devices. While number crunching isn't substantially faster, and the modest CPU improvements don't improve user-mode emulation a great deal, full system emulation's general responsiveness improves and makes using more applications more feasible.

This process is not automated. For Firefox, we make LTO-PGO builds using the internal machinery and our patches for compatibility with gcc, which is currently our preferred compiler on OpenPOWER systems. The Firefox build system generates a profiling build first, then automatically collects profiling data with it off a model workload and builds the optimized browser from that profile. QEMU doesn't have that infrastructure right now, but you can do it manually: you configure and compile a profiling build, run your workload with it to create a profile, and then configure and compile an optimized build with the profile thus generated.

I'll give instructions here for both QEMU 5.0 and 5.2, since 5.0 seems to be a bit more performant than 5.2 and has fewer build prerequisites, but 5.2 is more straightforward and we'll do it first. In these examples, I'm optimizing ppc-softmmu so that I can run Mac OS 9, which has never worked properly with KVM-PR; substitute with your desired target, such as x86_64-softmmu. Only do one target at a time, and you will want to do individual builds for each system image — even if you normally use the same executable binary for multiple OSes — because different code paths may be exercised with different workloads and/or configurations.

Let's start with making a profiling build. To do this, we'll add -fprofile-generate to the compiler flags (as well as -flto for LTO). For consistency we'll pass the same set of options to the C compiler, the C++ compiler and the linker (each will ignore options they don't need). In the QEMU source tree,

  • mkdir build
  • cd build
  • ../configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-ldflags="-flto -fprofile-generate" --target-list=ppc-softmmu
  • make -j24 (or as appropriate: this is a dual-8 Talos II)

Wait for QEMU to build. When it finishes, back up your drive image because you may not be able to shut it down normally and it would suck to damage it inadvertently. With a backup copy saved, run the new QEMU as you ordinarily would on your target workload. For example, my classic script is (assuming you're still in the build directory)

./qemu-system-ppc -M mac99,accel=tcg,via=pmu -m 1536 -boot c \
-drive id=root,file=classic.img,format=qcow2,l2-cache-size=4M \
-usb -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=mynet0 -rtc base=localtime

You should use as close to your normal configuration as possible so that the device drivers you run are factored into the profile.
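For the backup step before that first run, a minimal sketch might look like the following. The helper name backup_image is made up for illustration, and it assumes a single standalone image file (such as classic.img above) that no running QEMU currently has open.

```shell
# backup_image IMAGE
# Hypothetical helper: copy IMAGE to IMAGE.bak and verify the copy
# byte-for-byte. Assumes the image is not in use by a running guest.
backup_image() {
    img=$1
    [ -f "$img" ] || { echo "no such image: $img" >&2; return 1; }
    # -p preserves timestamps and permissions; cmp -s is silent and
    # returns non-zero if the two files differ.
    cp -p "$img" "$img.bak" && cmp -s "$img" "$img.bak"
}
```

Something like `backup_image classic.img && ./qemu-system-ppc ...` would then only start the instrumented QEMU once the copy has been verified.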

The first thing you'll notice is that QEMU is now really, really, really slow. Crust-of-the-earth-cooling slow. This is because it's storing all that profile data every time any block of compiled code is executed. As a result you will probably not be able to type or interact with the guest in any meaningful fashion, so let the system boot, grab a cup of a fortifying beverage and wait for it to get as far as it can. For Mac OS 9, it took several minutes to get to the desktop; for OS X 10.4, it took about a quarter of an hour (with a lot of timeouts in a verbose boot) to even start the login window. At some point you will not be able to usefully proceed any further with the guest, but fortunately you backed up your drive image already, so you can simply close the window.
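Before rebuilding, it may be worth checking that the run actually left profile data behind. gcc's instrumentation normally writes its .gcda files when the process exits, so a QEMU that was killed outright rather than closed may leave none. This is only a sketch; count_profiles is a hypothetical helper name.

```shell
# count_profiles [DIR]
# Hypothetical helper: print how many .gcda profile files exist
# under DIR (default: the current directory).
count_profiles() {
    # tr strips the padding some wc implementations add.
    find "${1:-.}" -type f -name '*.gcda' | wc -l | tr -d ' '
}
```

From the build directory, `[ "$(count_profiles .)" -gt 0 ] || echo "no profile data found"` gives a quick sanity check before moving on to the optimized build.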

Go back to the build directory. This time we will tell gcc to build with the generated profile (-fprofile-use), though we will allow it to account for certain changes (-fprofile-correction) and allow compilation to occur even if a profile doesn't exist for a particular target (-Wno-missing-profile) so that it can get through configure cleanly:

  • make clean (this doesn't remove the profile .gcda files)
  • ../configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-ldflags="-flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --target-list=ppc-softmmu
  • make -j24
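Since the two configure invocations differ only in their profiling flags, one way to avoid copy-and-paste mistakes is to keep the flag sets in a small helper. This is a sketch, not part of QEMU: pgo_flags and pgo_configure are hypothetical names, and unlike the commands above it passes the full flag string (including -O3 and -mcpu=power9) to the linker as well, which I'm assuming is harmless since gcc accepts those options at link time.

```shell
# pgo_flags PHASE
# Hypothetical helper: print the compiler flags for a PGO phase,
# either "generate" (profiling build) or "use" (optimized build).
pgo_flags() {
    base="-O3 -mcpu=power9 -flto"
    case "$1" in
        generate) echo "$base -fprofile-generate" ;;
        use)      echo "$base -fprofile-use -fprofile-correction -Wno-missing-profile" ;;
        *)        echo "usage: pgo_flags generate|use" >&2; return 1 ;;
    esac
}

# pgo_configure PHASE TARGET
# Hypothetical wrapper: run QEMU's configure with the same flags for
# the C compiler, C++ compiler and linker. Assumes it is run from a
# build directory one level below the source tree.
pgo_configure() {
    flags=$(pgo_flags "$1") || return 1
    ../configure --extra-cflags="$flags" --extra-cxxflags="$flags" \
        --extra-ldflags="$flags" --target-list="$2"
}
```

With this, the two phases become `pgo_configure generate ppc-softmmu` and, after the profiling run and make clean, `pgo_configure use ppc-softmmu`.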

Enjoy the new hotness. You should be able to see measurable improvements in the CPU emulation, but more importantly, boot times and responsiveness of the full system emulation should also be improved.

For 5.0.0, the process is a bit more complicated, but it's a bit quicker, so I found it worth it (and it's what I currently use for Mac OS 9). In the QEMU source tree, configure the build:

  • ./configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-ldflags="-flto -fprofile-generate" --target-list=ppc-softmmu
  • make -j24

Run your workload as before to generate the profile. However, you need to preserve the profile before the rebuild because make clean will clobber it.

  • tar cvf instrumented.tar `find . -name '*.gcda' -print`
  • make clean
  • tar xf instrumented.tar
  • ./configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-ldflags="-flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --target-list=ppc-softmmu
  • make -j24
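The save-and-restore of the profile around make clean can also be written so that it survives odd filenames and very long file lists. The following is a sketch under assumptions: save_profiles and restore_profiles are hypothetical names, and it relies on a tar that understands --null and -T (GNU tar does).

```shell
# save_profiles ARCHIVE
# Hypothetical helper: archive every .gcda file below the current
# directory. The NUL-separated list avoids problems with spaces in
# paths and with command-line length limits.
save_profiles() {
    find . -type f -name '*.gcda' -print0 | tar --null -T - -cf "$1"
}

# restore_profiles ARCHIVE
# Hypothetical helper: unpack the saved profiles back into place.
restore_profiles() {
    tar -xf "$1"
}
```

In the steps above, that would be `save_profiles instrumented.tar`, then make clean, then `restore_profiles instrumented.tar` before configuring the optimized build.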

Life's golden, and just a little bit zippier. It's not always possible to PGO all the things, but here's one where it makes a noticeable difference.

Firefox 86 on POWER


Firefox 86 is out, not only with multiple picture-in-picture (now have all the Weird Al videos open simultaneously!) and total cookie protection (not to be confused with other things called TCP) but also with some noticeable performance improvements, and it finally gets rid of Backspace backing you up, a key I have never pressed to go back a page. Or, maybe those performance improvements are due to further improvements to our LTO-PGO recipe, which uses Fedora's work to get rid of the sidecar shell script. Now with this single patch, plus their change to nsTerminator.cpp to allow optimization to be unbounded by time, you can build a fully link-time and profile-guided optimized version for OpenPOWER and gcc with much less work. Firefox 86 also incorporates our low-level Power-specific fix to xpconnect.

Our .mozconfigs are mostly the same except for purging a couple iffy options. Here's Optimized:

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9"
ac_add_options --enable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full
ac_add_options MOZ_PGO=1

# uncomment if you have it
#export GN=/home/censored/bin/gn

And here's Debug:

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9"
ac_add_options --enable-debug
ac_add_options --enable-linker=bfd

# uncomment if you have it
#export GN=/home/censored/bin/gn

Blackbird supply chain likely to improve


(Thanks to a reader tip.) Although yours truly is a Talos man (and was ever since it was going to have a POWER8 in it), the Blackbird is certainly far more attractive in terms of price. Backorders due to COVID-19's effect on the global supply chain have plagued it for months, but Raptor management on IRC indicates that logjam may be breaking; the first sign just a few days ago is that the 18-core monster POWER9 v2s (DD2.3) were back in stock. Obviously 18-cores don't (routinely) go in Blackbirds, but their presence suggests the supply chain issues are resolving and that a minimum order from IBM was met.

Raptor is well aware that the Blackbirds, more so than the T2 and T2 Lite, are its leading workstation product, and said there was "lots of demand" too ("about the only positive in the whole pandemic-induced mess"). However, Raptor's Timothy Pearson in the same IRC chat also commented that "we're playing it safe and focusing more on the next generation products than taking risks with POWER9 ... I can categorically state that if COVID19 had never happened, we'd have already offered other chips and we'd have at least one other product on the market designed around P9 by now." The latter sounds like a reference to Condor, Raptor's cancelled LaGrange system, but as long as POWER10 still has openness concerns, what "other chips"?

Gentoo on little-endian


A nice write-up by Martin Kukač on getting Gentoo to be happy on little-endian: even though many Linux distributions support LE, and some now only do, if you install Gentoo from the Minimal Installation CD and try to use the ppc64le stage 3 tarball there's an endian mismatch and it doesn't work (dies during the install steps with /bin/bash in incompatible format). The issue appears to be that the Minimal Installation CD itself is big-endian; there is currently no analogous little-endian image. Martin's brainwave was to complete the installation from an already running little-endian system (he used RiscySlack but Void should also work). Following his steps, the OS builds in little-endian mode from within the second OS, and you can then boot into it. Good to have the choice and a nice how-to.
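If you want to confirm this sort of mismatch for yourself, the byte order of an ELF binary is recorded in the EI_DATA byte of its header (offset 5: 1 for little-endian, 2 for big-endian). The following is only an illustrative sketch and not something from Martin's write-up; elf_endianness is a hypothetical helper name.

```shell
# elf_endianness FILE
# Hypothetical helper: print "little" or "big" according to the
# EI_DATA byte of FILE's ELF header, or "not-elf" (and return 1)
# if FILE does not start with the ELF magic number.
elf_endianness() {
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo "not-elf"; return 1; }
    # Skip 5 bytes, read 1: that is e_ident[EI_DATA].
    case "$(od -An -tu1 -j5 -N1 "$1" | tr -d ' \n')" in
        1) echo little ;;
        2) echo big ;;
        *) echo unknown ;;
    esac
}
```

Running it against /bin/bash on the installer environment and against the same path inside the unpacked stage 3 tarball should show whether the two agree.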

A better theory on why there won't be an open POWER10 workstation for a while


In our previous analysis we suspected that Raptor's indigestion over POWER10 was IBM failing to release some component of the firmware, meaning it wasn't a truly open platform after all. Raptor, under whatever NDA prohibited them, couldn't say, but there was enough to do some educated reading between the lines regarding the problem.

So hats off to Hugo Landau, who did his own research on the subject. As you will recall, for POWER8 IBM introduced the Centaur memory buffers which serve essentially as off-chip memory controllers and a fourth level of cache, and scale-up Cumulus POWER9s (not the Nimbus POWER9s in Raptor workstations) can use them too. This enables a lot of logic to be moved off-die and can turn what is a critical high-speed and potentially error-prone parallel interface into a serial one. IBM expanded this into the vendor-neutral Open Memory Interface, or OMI, which halves the latency of Centaur (to 5ns) and runs up to 25Gbps per lane. With OMI, RAM technology can advance separately from the CPU, and the processor can be completely agnostic about what it's attached to (as opposed to Cumulus, which only "speaks" Centaur, and our Nimbus systems which use commodity directly-attached DDR4 RAM through an on-chip controller).

We reported previously that at the 2019 OpenPOWER summit Microchip Technology was announced as the first vendor of OMI DDIMMs, and although Micron, Samsung and SMART Modular were listed as planning to release their own, so far the only vendor of OMI controllers appears to be Microchip. We haven't heard anything about a Nimbus-alike POWER10 yet with direct-attached memory, so we have to assume that at least the first wave of POWER10 processors will only use OMI. Hugo's discovery was an obscure GitHub repo that appears to contain the firmware for the Microchip OMI controller, and no source code. Read Hugo's article for the additional dirty details.

The concept of RAM that requires firmware binary blobs is frankly very disconcerting: I shouldn't have to explain to any regular reader of this blog that if you own the RAM, you own the store, and you could potentially own the RAM this way (even/especially with a vendor lock: see SolarWinds). I won't say how I have knowledge of this, but various other cues indicate to me Hugo has found the exact reason POWER10 can't be considered open under any reasonable definition.

POWER9 systems can't last forever, of course. If there were going to be a truly open POWER10 system, we'd either have to reverse-engineer the Microchip controller firmware or develop a separate open memory controller of "our" own. Likewise, I'm pretty sure Raptor doesn't want to be in the DDIMM business, so if a separate Raptor-specific controller were required it may be simpler to just have RAM on the board as a build-to-spec option. Either way, while I understand IBM's decision with OMI to cater to their bandwidth-hungry institutional customers, the implementation they've chosen may put those very same high-value customers at risk. We should be glad Raptor didn't make the same choice, and fortunately POWER9 systems will still be able to hold their own for a while.

Followup on Firefox 85 for POWER: new low-level fix


Shortly after posting my usual update on Firefox on POWER, I started to notice odd occasional tab crashes in Fx85 that weren't happening in Firefox 84. Dan Horák independently E-mailed me to report the same thing. After some digging, it turned out that our fix way back when for Firefox 70 was incomplete: although it renovated the glue that allows scripts to call native functions and fixed a lot of problems, it had an undiagnosed edge case where if we had a whole lot of float arguments we would spill parameters to the wrong place in the stack frame. Guess what type of function was now getting newly called?

This fix is now in the tree as bug 1690152; read that bug for the dirty details. You will need to apply it to Firefox 85 and rebuild, though I plan to ask to land this on beta 86 once it sticks and it will definitely be in Firefox 87. It should also be applied to ESR 78, though that older version doesn't exhibit the crashes with the frequency Fx85 does. This bug also only trips in optimized builds.