Making your Talos II into a Power Mac: dcbz considered harmful (part 2)


In the first part of this article we talked about getting your Talos II prepped to emulate a Power Mac using KVMPPC, the kernel virtualization facility in Linux. Having followed the instructions in that article, you've got your kernel in hash table mode, you've got the KVM-PR kernel module loaded (and patched it if necessary), you installed (or built) QEMU, and you have a blank QEMU disk image ready to go.

For this part, we will assume you have chosen 10.3 Panther, 10.4 Tiger or 10.5 Leopard to install. I will discuss Leopard relatively little other than how to get you started in it; most of the rest applies to Leopard that applies to Tiger. I'll briefly discuss booting OS 9 with TCG at the end.

Before starting, since we will use tun/tap networking, make sure the interface is up before booting. On Fedora, I do something like this:

sudo ip tuntap add dev tap0 mode tap user [your username]
sudo ip link set tap0 up promisc on

and, if you use libvirt,

sudo brctl addif virbr0 tap0

For filesharing you could set up either Samba or Netatalk. I use Netatalk, since I'm more accustomed to AppleTalk and it enables my T2 to serve files over AFP to the other classic Macs here, and it also will work fine with Mac OS 9 if you want to use that at some point.

Let's begin by constructing the command line to boot your emulated Mac from disc and install the OS. Each OS does better currently with certain combinations of emulated CPU and hardware features. In addition, we also need to make sure that the emulator stays within a single core for better performance (you will get random system stalls if it moves over to another core and throughput will be generally impaired), so we need to set affinities appropriately.

We'll go with 10.4 for our example; substitute for your OS of choice where relevant. Start out with

taskset -a -c 0-3 qemu-system-ppc -M

This binds all of QEMU's threads to a single core (recall that the T2 Sforza cores are SMT-4, and each appear as logical CPUs, so everything must run on a single core this way). While QEMU spawns more than four threads, encompassing two cores (i.e., 0-7) has no noticeable performance benefit and can sometimes unsettle Mac OS X by making timing loops unpredictable.

For the -M option, we will specify mac99 and kvm. The OSes differ on what they prefer for the VIA. 10.3 and 10.4 need to run the emulated mac99 with an emulated CUDA chip onboard, or the OS is unable to detect the real-time clock. 10.5, however, requires the later PMU attached to the VIA. So that gets us to

taskset -a -c 0-3 qemu-system-ppc -M mac99,accel=kvm,via=cuda (10.3, 10.4)
taskset -a -c 0-3 qemu-system-ppc -M mac99,accel=kvm,via=pmu (10.5)

All three of these OSes work fine emulating a 7400-series G4. We will use the "Nitro" 7410 (-cpu nitro), which is a bit faster than the G3 (-cpu G3). 10.3 may have some problems with assigning more than 1.5GB of RAM (-m 1536), but 10.4 and 10.5 work fine with 2GB (-m 2048). Don't use more than 2GB of RAM; it will cause various problems. A verbose boot is helpful in case you accidentally did something wrong (-prom-env boot-args=-v). We'll specify our disk image and some tuning parameters (-drive file=[filename].img,format=qcow2,l2-cache-size=4M), and say boot from the CD or DVD (-boot d -cdrom "/dev/cdrom"). Lastly, we'll enable the emulated RTL8139 NIC and USB tablet (-netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no -device rtl8139,netdev=mynet0 -usb -device usb-tablet) and use a sane screen resolution (-g 1024x768x32). For my 10.4 booter, the full command line looks like this (using the filenames I use on this system):

taskset -a -c 0-3 qemu-system-ppc -M mac99,accel=kvm,via=cuda -cpu nitro -m 2048 -prom-env boot-args=-v -boot d -cdrom /dev/cdrom -drive file=tigerhd.img,format=qcow2,l2-cache-size=4M -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no -device rtl8139,netdev=mynet0 -usb -device usb-tablet -g 1024x768x32

I strongly suggest saving this as a shell script so that you can make any necessary variations. Insert your OS CD or DVD and run the script. It should go into the installer. If it didn't, make sure your filenames are correct, that you have OpenBIOS installed (it comes with QEMU) in a location the emulator can see, and that the KVM kernel modules (both kvm and kvm_pr) are loaded by checking lsmod.

Once the installer has booted you can of course directly proceed to installation in KVM, but I actually recommend shutting down the emulated Mac at this point and bringing everything back up in TCG to get the OS installed. To do that, just use the same command line, but change accel=kvm to accel=tcg. As I mentioned in the first part, heavy I/O loads tend to be less performant on KVMPPC, and installing and upgrading an OS is a pretty heavy I/O load, so running it in TCG will complete the task more quickly and more reliably.

If you want to run Software Update to bring your emulated Mac up to date, it's probably best to also do this in TCG. You could also separately download one of the combo installers (such as the one for 10.4.11) and push that to the emulated Mac on your Samba or Netatalk AFP share.

When the OS is installed, remove the CD-ROM from your command line unless you want to keep it, and change the -boot argument to -boot c to boot from the emulated drive image.

Ta-daa!

For best results with video updates, make sure that the display settings inside System Preferences match your physical display. I'm in 32-bit colour, so I made sure that System Preferences was using Millions instead of Thousands of colours. Because of variabilities in timing, you may notice the OS X clock is close but may seem to run somewhat unsynchronized from your host's clock because of how the delay loop might have been calibrated at bootup. This is mostly just a nuisance.

The next step is optional, but hacks KVMPPC to improve performance of the emulated Mac. Right now we're actually fooling the operating system; we're not really a G4. In fact, the closest Power Mac relative to the T2's POWER9 is the G5, i.e., the PowerPC 970, which is essentially a POWER4 with some modifications for workstation duty and a bolted-on AltiVec unit. Even though we told the OS we're a G4, this doesn't change the attributes of the CPU, in particular for this case the specific instructions it does and does not support and how certain others are handled.

With "big POWER" IBM removed some of the PowerPC instructions that were infrequently used or scaled badly, such as dcba and mcrxr. You don't need to know what these do; just know they were used in some software, but as of the G5 ceased to exist in hardware. Additionally, the G5 and later big POWER designs (including the POWER9) also have a 128-byte cache line instead of the 32-byte cache line of the G3 and G4, which is relevant to the dcbz instruction as it zeroes an entire cache line and potentially spills it to memory. OS X has adaptations for dealing with these cases (an illegal instruction handler in the first case that simulates the instructions in software, and modified system routines in the second), but that only happens if OS X knows the machine is a G5. In this case, it doesn't, so these adaptations are never installed.

KVM-PR gets around the dcbz problem on later POWER designs, including the POWER9, by scanning every new code page in a 32-bit guest for the dcbz instruction and replacing it with an illegal one it can detect. (Remember, it's still a legal instruction; it just behaves differently.) When executed it faults and falls back to KVM-PR, which simulates a 32-byte dcbz instruction in software, and returns control to the guest. It's not a surprise that this process is quite slow, especially if it gets called in a loop. Unfortunately Apple does just exactly that for clearing memory and the instruction is a major portion of the OS' built-in implementation of bzero, which is also called by memset. This is a hot routine and needs to run fast. The G5 version knows about the cache line difference and accounts for it; the G4 and G3 versions don't, and we're using the G4 version.

Apple, however, also helped us out here a little bit by allowing us to guess where the routine is. This and other major components live in a section of memory called the "commpage," which is always located in the top eight pages of the 32-bit addressing space in every process. It is provided by the kernel as an optimization for fast access to important data and common routines. The bzero routine is virtually unchanged from 10.3 to 10.4, and both start with a very unique instruction (cmplwi cr7,r4,32). If we see this instruction in the commpage, we can be confident we have found bzero. And now that we've found it, we can modify it.

Recall I mentioned that KVM-PR must scan each new executable code page for the instruction and change it. We can alter KVM-PR to detect that unique leader instruction if it's mapping in the commpage, and then monkeypatch in a new routine that doesn't use dcbz and thus won't require slow simulation. To make it more reliable, we know where the location should be, so we'll only patch it if it's actually there. As a bonus we'll also map dcba to nop anywhere in an executable section so that it doesn't need a trip to a special handler either. That is what this patch does.

To build KVMPPC with this patch uses the same steps as we discussed for building and installing the kernel modules in part 1. This patch also applies with -p1.

Does it make a difference? You bet it does. On my system with Geekbench 32-bit on Mac OS X 10.4.11, it improved the overall benchmark by nearly 200 points over the unpatched version, almost all of it in (no surprise) the memory score.

This consequence of masquerading as a different CPU also carries over into which software you can run. Even though this is a G4, you actually have to run the G5 version of TenFourFox, which doesn't have any of the other illegal instructions that aren't patched (just be patient -- it will take TenFourFox almost a full minute to come up). If your software offers a G5 version, you should run that if you can. The discontinuity leads to amusing discrepancies like this one.

Interestingly, TCG on POWER9 actually had errors during SunSpider that the JIT in TenFourFox under KVMPPC doesn't, and even with the warmup was up to twice as slow as KVM at SunSpider. Go TenFourFox!

You'll find that performance is still fairly pedestrian even with KVMPPC. While the OS typically benchmarks my T2 as a "2.04GHz G4" (TCG usually gets computed as somewhere between "900 MHz" and "1.0GHz"), the actual throughput you get varies greatly on workload. Raw CPU performance is a bit better than my Quad G5 scores running single core in Reduced mode, though the Quad running full tilt easily surpasses it (the emulation overhead is only reduced, not eliminated). The numbers get a lot different in applications depending on how their workload is structured. For example, TenFourFox's G5 JIT in KVMPPC gets about 6800ms in SunSpider compared to around 3800ms on a "real" 1GHz iMac G4. Improving these numbers to get parity, and especially getting QEMU to support SMP, will need to be an area of active future development.

Lastly, I mentioned about the best way to run OS 9 on a Talos. Although limited to TCG, it's still pretty snappy, a testament to Mac OS 9's comparatively low system requirements. Mac OS 9 works better with the PMU than the CUDA (or you get problems with the mouse not responding to double clicks reliably) and is limited to 1.5GB of RAM. It also doesn't support the QEMU USB tablet, but it does support the RTL8139 with this driver. To get the driver installed, I actually just made an ISO image out of it, dropped it in the Extensions folder and rebooted it. My command line looks like this:

qemu-system-ppc -M mac99,accel=tcg,via=pmu -m 1536 -boot c -drive file=classic.img,format=qcow2,l2-cache-size=4M -usb -netdev tap,id=mynet0,ifname=tap0,script=no,downscript=no -device rtl8139,netdev=mynet0 -rtc base=localtime

Mac OS 9 uses a different real-time clock base, so this has an additional -rtc option. You can use any CPU you want since it's emulated; I just use the default G4 7400 here instead of specifying one.

Post questions or things you've discovered in the comments.

Comments