Showing posts from March, 2021

More 64K page problems and some solutions

Meanwhile, as the question of a 4K-page Fedora 34 remains undecided, if you are using a more recent video card with your Linux OpenPOWER system, Trung Lê reports that kernel 5.11.x still crashes with 64K pages on his AMD R9 Nano. (Older cards, like the AMD WX7100 workstation GPU in this Talos II that Raptor sells as a BTO option, may also be affected; see comments for more.) This is relevant since Fedora 33 is moving to 5.11. If you're a Fedora 33 user and wish to continue with the 5.10.x series until a fix for amdgpu emerges, kernels are available from his Github project.

In the meantime, speaking personally as one of those people who still use FireWire/IEEE-1394, 64K pages are at least part of the reason FireWire cards don't seem to work in my F33 Talos II (I tried a Rosewill one first without success, and more recently an Iocrest card using a more typical Texas Instruments controller). Although a patch for 64K page support was initially submitted, it was rejected, and the followup patch was never tested. I'll be getting around to trying this myself and hopefully getting it into the kernel, but until then, report back if using the patch works for you (I still use some FireWire devices, particularly for video capture and legacy interchange with the Power Macs that lurk around here).
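If you're not sure which page size your own kernel is running with, here is a minimal sketch for checking. The helper name page_kind is just something I made up for illustration; getconf PAGESIZE is the standard way to ask.

```shell
#!/bin/sh
# page_kind BYTES: classify a page size as printed by `getconf PAGESIZE`.
# (page_kind is a hypothetical helper name, not part of any tool.)
page_kind() {
    case "$1" in
        4096)  echo "4K" ;;
        65536) echo "64K" ;;
        *)     echo "other ($1 bytes)" ;;
    esac
}

# Report the page size of the running kernel.
page_kind "$(getconf PAGESIZE)"
```

On a stock Fedora ppc64le kernel this should report 64K, which is the configuration the amdgpu and FireWire problems above apply to.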

Tonight's game on OpenPOWER: The Original Strife Veteran's Edition

I'm a big fan of Strife, famously the last game to use id Software's Doom 3-D engine, and a nice hybrid of light RPG and heavy action. The engine might have been old and the plot was more shooting than Shakespeare, but hey: the voice in your ear is named Blackbird. You can't beat that!

I actually do own a retail copy of Strife from back in the day for MS-DOS; I bought it new and played it on the 486 I still keep around for such things. It plays just fine in Chocolate Doom on this POWER9, but I later heard about Strife: Veteran Edition, which fixed minor bugs, added better music and improved graphics, and even threw in some extra enhancements and achievements, but still kept the plot, voice acting and character art. It had a Linux version, but clearly one for x86. But that's not a problem when you have the source code.

The source code builds largely uneventfully as long as you have the prerequisites (I built on Fedora 33 and did not test on big-endian). In particular, it will want cmake, SDL2, libogg, libtheora, libvorbis, zlib, libpng and OpenGL. However, it tries to link against libSDL2_main, which is no longer necessary; after you've run cmake and make, it will fail with No rule to make target 'SDL2_MAIN_LIBRARY-NOTFOUND'. To get around this, edit (in the build directory where you ran cmake) ./CMakeFiles/strife-ve.dir/link.txt and remove SDL2_MAIN_LIBRARY-NOTFOUND from the single long line link command, then edit ./CMakeFiles/strife-ve.dir/build.make and just delete the line strife-ve: SDL2_MAIN_LIBRARY-NOTFOUND. Run make again and it will link.
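If you'd rather not do those two edits by hand, they can be sketched with sed. This assumes GNU sed (for -i) and that the generated files look the way they did in my build; fix_sdl2main is a made-up name, and you should run it from the build directory after the first failed make.

```shell
#!/bin/sh
# fix_sdl2main [DIR]: strip the stale SDL2_MAIN_LIBRARY-NOTFOUND reference
# from CMake's generated files. DIR defaults to the path given above.
# (fix_sdl2main is a hypothetical helper name; assumes GNU sed.)
fix_sdl2main() {
    dir="${1:-./CMakeFiles/strife-ve.dir}"
    # link.txt: drop the token (and the space before it) from the link command
    sed -i 's/ *SDL2_MAIN_LIBRARY-NOTFOUND//g' "$dir/link.txt"
    # build.make: delete the dependency line on the nonexistent target
    sed -i '/^strife-ve: SDL2_MAIN_LIBRARY-NOTFOUND/d' "$dir/build.make"
}

# Only touch the files if they are actually there.
if [ -f ./CMakeFiles/strife-ve.dir/link.txt ] &&
   [ -f ./CMakeFiles/strife-ve.dir/build.make ]; then
    fix_sdl2main
fi
```

Rerunning cmake regenerates both files, so the fix has to be reapplied if you reconfigure.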

Since it built, I decided to spend $10 and try to extract the game assets from the GOG pack. GOG gives this to you as a behemoth 400 megabyte "shell script" which is really a wrapper for a ZIP archive with a MojoSetup installer. Irritatingly the installer is all just binaries, but you can feed it to unzip and it will break it apart. If we list the contents of the file, it will conveniently ignore the header and go right for the ZIP archive, and the money is in data/noarch/game. Thus, do unzip ./ data/noarch/game/'*' and the assets will be extracted to data/noarch/game.

If you want, you can just move the files in game/ (maintain the tree under it, don't flatten it) in with the POWER9 binary of strife-ve, but if you don't want all the x86 binary crap you don't need, a quick find . -name '*.so.*' -print | xargs rm and rm strife-ve before you copy it over should remove the bulk of it. Conveniently, the GOG assets also include the original DOS version (in DOS/) and all the relevant WADs so you can also run it in Chocolate Doom or our OpenPOWER-JIT DOSBox, and another copy of the source code just in case you lose it. Anyway, with everything moved, if you run ./strife-ve it should then just work.

Don't keep the Front waiting.


The RISC-V community is buoyed by Wave Computing, fresh from Chapter 11 bankruptcy, reemerging with the name of its subsidiary MIPS Technologies to develop ... RISC-V chips.

This actually says less about RISC-V than it says about the new MIPS Technologies. You'll recall that MIPS Technologies, formerly Wave Computing, suspended the MIPS Open Initiative, apparently to position it for sale before they went under, and nobody bit. Understandably so: the once great architecture that powered the SGI MIPS workstations in my office (I own an Indy, an Indigo2 and a Fuel) has now been relegated to the "too cheap for ARM" embedded market, which coincidentally is exactly where RISC-V has gotten most of its design wins so far.

And MIPS Technologies isn't giving anything up. There is no mention of the licensing program for MIPS-the-architecture ending, which means it remains in operation, and it remains closed. Indeed, Tallwood Venture Capital probably demanded it, as a hedge in case their efforts with RISC-V aren't sufficiently profitable. MIPS Technologies will be entering a crowded field with other established players, notably SiFive, and not a lot of extant IP to suggest they will substantially leapfrog those existing designs in performance or power usage (if at all). In that sense, this announcement is best seen as a cynical way to capture public interest rather than an important engineering leap, and the RISC-V community should not in any way conclude they have gained a valuable partner. If anything, they've failed to avoid a new, shadier member of the ecosystem who actually took steps to make their previous products less open. That's not a good look for an architecture that has made openness its defining characteristic.

Juicing QEMU for fun, ??? and profit!

The number of packages and applications natively available for OpenPOWER continues to grow in just about every distro's package manager, and even where a prebuilt package doesn't exist, even more will build from source. But emulation is still going to be a fact of life for Windows-only/x86/x86_64-only (maybe even aarch64-only) binaries we can't rebuild, and KVM only helps us with other Power ISA systems (in fact, it looks like KVM-PR broke and can't boot Mac OS X again, so I guess I'll be diving back into the source), so we need to wring as much speed out of QEMU's emulation engine as possible.

We are fortunate with QEMU in that there is ppc64le support in TCG, the Tiny Code Generator which implements a basic JIT, and the Power ISA TCG backend even emits those tasty newer POWER9 instructions to take better advantage of the processor. Without TCG, QEMU would be dreadfully slow when emulating a foreign architecture. However, unless IBM or some other OpenPOWER hardware developer implements instructions (a la Apple M1) in a future chip that specifically improve emulation of other CPUs (like, I dunno, x86_64), there's very little that can be done to improve the code the Power TCG backend generates, and CPU emulation spends most of its time in TCG-generated code.

However, the software MMU that QEMU's CPU emulation uses has pre-compiled portions, and all the devices and components QEMU emulates (like the system bus, video, mass storage, USB, etc.) are also pre-compiled. This gives us an opportunity: with a little extra elbow grease, you can make a link-time-optimized and profile-guided-optimized (LTO-PGO) build of QEMU specific to the particular workload which can run the CPU anywhere from 3-8% faster and video and other devices up to 15% faster depending on the set of devices. While number crunching isn't substantially faster, and the modest CPU improvements don't improve user-mode emulation a great deal, full system emulation's general responsiveness improves and makes using more applications more feasible.

This process is not automated. For Firefox, we make LTO-PGO builds using the internal machinery and our patches for gcc compatibility, which is currently our preferred compiler on OpenPOWER systems. The Firefox build system generates a profiling build first, then automatically collects profiling data with it off a model workload and builds the optimized browser from that profile. QEMU doesn't have that infrastructure right now, but you can do it manually: you configure and compile a profiling build, run your workload with it to create a profile, and then configure and compile an optimized build with the profile thus generated.

I'll give instructions here for both QEMU 5.0 and 5.2, since 5.0 seems to be a bit more performant than 5.2 and has fewer build prerequisites, but 5.2 is more straightforward and we'll do it first. In these examples, I'm optimizing ppc-softmmu so that I can run Mac OS 9, which has never worked properly with KVM-PR; substitute with your desired target, such as x86_64-softmmu. Only do one target at a time, and you will want to do individual builds for each system image — even if you normally use the same executable binary for multiple OSes — because different code paths may be exercised with different workloads and/or configurations.

Let's start with making a profiling build. To do this, we'll add -fprofile-generate to the compiler flags (as well as -flto for LTO). For consistency we'll pass the same set of options to the C compiler, the C++ compiler and the linker (each will ignore options they don't need). In the QEMU source tree,

  • mkdir build
  • cd build
  • ../configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-ldflags="-flto -fprofile-generate" --target-list=ppc-softmmu
  • make -j24 (or as appropriate: this is a dual-8 Talos II)

Wait for QEMU to build. When it finishes, back up your drive image because you may not be able to shut it down normally and it would suck to damage it inadvertently. With a backup copy saved, run the new QEMU as you ordinarily would on your target workload. For example, my classic script is (assuming you're still in the build directory)

./qemu-system-ppc -M mac99,accel=tcg,via=pmu -m 1536 -boot c \
-drive id=root,file=classic.img,format=qcow2,l2-cache-size=4M \
-usb -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=mynet0 -rtc base=localtime

You should use as close to your normal configuration as possible so that the device drivers you run are factored into the profile.
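The backup step mentioned above is worth doing carefully, so here is a minimal sketch. The helper name backup_image and the .bak suffix are my own choices for illustration; classic.img is the image from my script.

```shell
#!/bin/sh
# backup_image FILE: copy FILE to FILE.bak, refusing to clobber an
# existing backup. (backup_image and the .bak suffix are hypothetical.)
backup_image() {
    src="$1"
    dst="$1.bak"
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing $dst" >&2
        return 1
    fi
    # -p preserves mode and timestamps on the copy
    cp -p "$src" "$dst"
}

# Back up the image used in the script above, if it is in this directory.
if [ -f classic.img ]; then
    backup_image classic.img
fi
```

Refusing to overwrite matters here: if you rerun the profiling step after an unclean shutdown, you don't want a possibly damaged image replacing your good copy.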

The first thing you'll notice is that QEMU is now really, really, really slow. Crust-of-the-earth-cooling slow. This is because it's storing all that profile data every time any block of compiled code is executed. As a result you will probably not be able to type or interact with the guest in any meaningful fashion, so let the system boot, grab a cup of a fortifying beverage and wait for it to get as far as it can. For Mac OS 9, it took several minutes to get to the desktop; for OS X 10.4, it took about a quarter of an hour (with a lot of timeouts in a verbose boot) to even start the login window. At some point you will not be able to usefully proceed any further with the guest, but fortunately you backed up your drive image already, so you can simply close the window.
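Before moving on, it's worth a quick sanity check that the instrumented run actually left profile data behind, since an optimized build without it gains nothing from PGO. A minimal sketch, assuming you are in the directory you built in; count_gcda is a made-up helper name.

```shell
#!/bin/sh
# count_gcda [DIR]: count the gcc profile data files (*.gcda) under DIR.
# (count_gcda is a hypothetical helper name.)
count_gcda() {
    find "${1:-.}" -type f -name '*.gcda' | wc -l | tr -d ' '
}

n=$(count_gcda .)
if [ "$n" -eq 0 ]; then
    echo "no .gcda files found; the profiling run produced no data" >&2
else
    echo "$n profile data files found"
fi
```

If the count is zero, rerun the workload with the instrumented binary before going any further.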

Go back to the build directory. This time we will tell gcc to build with the generated profile (-fprofile-use), though we will allow it to account for certain changes (-fprofile-correction) and allow compilation to occur even if a profile doesn't exist for a particular target (-Wno-missing-profile) so that it can get through configure cleanly:

  • make clean (this doesn't remove the profile .gcda files)
  • ../configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-ldflags="-flto -fprofile-use -fprofile-correction -Wno-missing-profile" --target-list=ppc-softmmu
  • make -j24

Enjoy the new hotness. You should be able to see measurable improvements in the CPU emulation, but more importantly, boot times and responsiveness of the full system emulation should also be improved.
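Since the same flags get repeated across three options in two phases, it's easy to fumble one. Here is a sketch of a helper that generates the configure line for either phase and prints it for inspection rather than running it. The names pgo_flags and pgo_configure are hypothetical, and passing the same target list in both phases is my assumption.

```shell
#!/bin/sh
# pgo_flags PHASE: print the LTO/PGO flags for "generate" or "use".
# (pgo_flags and pgo_configure are hypothetical helper names.)
pgo_flags() {
    case "$1" in
        generate) echo "-flto -fprofile-generate" ;;
        use)      echo "-flto -fprofile-use -fprofile-correction -Wno-missing-profile" ;;
        *)        echo "usage: pgo_flags generate|use" >&2; return 1 ;;
    esac
}

# pgo_configure PHASE TARGET: print (not run) the configure command,
# using the same -O3 -mcpu=power9 compiler options as the steps above.
pgo_configure() {
    flags=$(pgo_flags "$1") || return 1
    printf '%s\n' "../configure --extra-cflags=\"-O3 -mcpu=power9 $flags\" --extra-cxxflags=\"-O3 -mcpu=power9 $flags\" --extra-ldflags=\"$flags\" --target-list=$2"
}

pgo_configure generate ppc-softmmu
pgo_configure use ppc-softmmu
```

Once the printed line looks right, you can paste it, or run it from the build directory with eval "$(pgo_configure use ppc-softmmu)".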

For 5.0.0, the process is a bit more complicated, but it's a bit quicker, so I found it worth it (and it's what I currently use for Mac OS 9). In the QEMU source tree, configure the build:

  • ./configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-generate" \
    --extra-ldflags="-flto -fprofile-generate" --target-list=ppc-softmmu
  • make -j24

Run your profile as before. However, you need to preserve the profile before the rebuild because make clean will clobber it.

  • tar cvf instrumented.tar `find . -name '*.gcda' -print`
  • make clean
  • tar xf instrumented.tar
  • ./configure --extra-cflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-cxxflags="-O3 -mcpu=power9 -flto -fprofile-use -fprofile-correction -Wno-missing-profile" \
    --extra-ldflags="-flto -fprofile-use -fprofile-correction -Wno-missing-profile" --target-list=ppc-softmmu
  • make -j24
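The preserve-and-restore step in the list above uses backticks, which will trip over any path containing spaces. A slightly more defensive sketch follows, assuming GNU find and GNU tar (for -print0 and --null -T -); save_profile and restore_profile are made-up names, and the archive path should be absolute.

```shell
#!/bin/sh
# save_profile TREE ARCHIVE: archive every *.gcda under TREE into ARCHIVE,
# storing paths relative to TREE. Assumes GNU find and GNU tar.
# (save_profile and restore_profile are hypothetical helper names.)
save_profile() {
    ( cd "$1" && find . -name '*.gcda' -print0 | tar --null -T - -cf "$2" )
}

# restore_profile TREE ARCHIVE: unpack ARCHIVE back into TREE,
# for use after `make clean` has removed the profile data.
restore_profile() {
    ( cd "$1" && tar -xf "$2" )
}
```

Usage would be along the lines of save_profile "$PWD" /tmp/instrumented.tar, then make clean, then restore_profile "$PWD" /tmp/instrumented.tar before reconfiguring.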

Life's golden, and just a little bit zippier. It's not always possible to PGO all the things, but here's one where it makes a noticeable difference.