SheepSforza: the SheepShaver Power Mac emulator for OpenPOWER


I've always maintained (possibly by personal experience) that one of the natural future markets for OpenPOWER is one of the architecture's past markets: Power Mac users, particularly those who were irked by the Intel transition. I certainly was, even granting it made commercial sense at the time, and I ended up using a Power Mac G5 Quad for 13 years as a daily driver before I got the Raptor Talos II I use now.

I first touched a Mac in 1987 (it was my buddy's father's Mac Plus, and we spent hours messing around in HyperCard and System 6), though the first Mac I personally owned was a second-hand Macintosh IIsi. I upgraded from there to (briefly) a used Power Mac 7200, then traded it in for a used Power Mac 7300, then pimped that out, then a Power Mac G4 MDD, the first computer I personally bought new, and then the G5. With the exception of the G5 all of these systems natively ran the classic Mac OS, so I had a large investment in classic Mac software. Even my G5 only ever ran Mac OS X Tiger so that I still could run Classic applications (on top of the fact I preferred Tiger's interface to Leopard).

This is relevant because of the current state of Mac emulation: in general, the classic Mac OS is better supported than Mac OS X. SheepShaver started on BeOS and the PowerPC-based BeBox as a commercial product and pun on the Amiga 68K Mac emulator ShapeShifter; it only runs the classic Mac OS, and only then up to 9.0.4 (later versions require an MMU, which SheepShaver doesn't implement). It achieved surprisingly good speed on modest hardware by heavily patching the operating system (more later) and running most programs as native code directly on the BeBox's twin 603 CPUs, not unlike KVM-PR, though without using any special processor features (instead, this was achieved by patching out supervisor portions of the emulated Mac ROM and running all components, including the nanokernel, in the problem state; today we would call this paravirtualization). SheepShaver works on Mac OS X, too, allowing Power Macs with Leopard to run classic apps at near-native speeds, though not as well integrated as the Classic Environment, of course. The ability of SheepShaver to run Mac apps directly on the processor accounts for some of its unusual design decisions, which persist even on non-PowerPC architectures that run applications either through its JIT compiler (on x86 and x86_64) or with an interpreter (everything else, including aarch64 and Apple silicon). SheepShaver led to Basilisk II, which is a 68K Macintosh emulator, before itself becoming open source. To this day both emulators share substantial amounts of code.

Besides SheepShaver and Basilisk II, other emulators include vMac's descendant Mini vMac (68K), MESS/MAME (68K), PearPC (PowerPC, an exception in that it can only run Mac OS X and various other free OSes), and of course QEMU. QEMU originally could only boot Mac OS X, but later added support for Mac OS 9, and can accelerate running at least OS X with KVM-PR (when it's not broken), though this has some edge glitches, and KVM-PR doesn't work on early versions of OS X or with OS 9 at all. In addition, QEMU has decent emulation fidelity and a well-supported "mini JIT" called TCG when KVM or other virtualizers can't or don't work, but it requires drivers on the guest OS side for device support, is slower to start up, and because of its full system emulation has non-trivial overhead for certain operations.

PearPC, SheepShaver and QEMU are your only options right now for emulating a Power Mac. Back in the day while using QEMU to run my old Mac applications, I explored the two other choices. Neither is officially maintained anymore, and even any unofficial updates to SheepShaver appear to be intermittent and fragmented. More to the point, at that time I couldn't get SheepShaver to run at all, even when I managed to make it compile (more about this in a moment). PearPC does build and appear to run on OpenPOWER systems, but its emulation speed is hideous; the JIT is limited to x86, and without it the project estimates it runs about 500 times slower than actual performance, which certainly matches my experience with it. I did some tinkering with adjusting refresh rates and other attempts to reduce the overhead, but it still ran abysmally, and I ended up abandoning it.

SheepShaver's long history was attractive, though, and having networking and native filesystem support would be a huge plus, so after some alterations to the source code to make it build on POWER9, I decided to look at why it wouldn't start. Its design is quite unusual because it originated as a "native" emulator without a JIT, running application code bare-metal on a 32-bit PowerPC CPU, and some of the architectural decisions made as a result have persisted. The classic Mac OS is notable for storing the state of certain important globals in very low memory, starting even from an effective address of 0, requiring any implementation of SheepShaver to ensure that virtual addresses that low can be mapped, even in emulation. However, the known trick of sudo sysctl vm.mmap_min_addr=0 and (if you're on SELinux, which, this being Fedora, I am) sudo setsebool -P mmap_low_allowed 1 to allow it to map the lowest page of memory didn't get it started, and trying to use one of SheepShaver's alternate memory mapping schemes failed during configuration, so it was going to be "real" addressing or bust.

The problem turned out to be that SheepShaver is still pervasively 32-bit, even in a 64-bit configuration, the other major legacy hangover of its internal design. When memory is allocated on many 64-bit operating systems (and pretty much all ppc64 systems), whether through malloc() or directly with mmap(2), it gets a full 64-bit address, which immediately crashes the emulator because it only deals with the least significant word. The workaround on x86_64 was to give a specific address to mmap(2) instead of letting it pick anywhere, so I stumbled onto an address that seemed to work for ppc64le and was able to narrow that down based on where SheepShaver's executable code maps. (For Fedora this value was 0x18000000. It should work for other Linuces, but may need adjustment on *BSDs.)
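To make the addressing problem concrete, here is a minimal sketch of the general technique: hand mmap(2) a low hint address and reject any mapping that would not be reachable through a 32-bit pointer. This is not SheepShaver's actual allocation code; the helper names fits_in_32_bits and map_low are made up for illustration, and only the 0x18000000 hint comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Hypothetical helper: true if [addr, addr + size) lies entirely below 4 GiB,
// so every byte can be named by a 32-bit address.
static bool fits_in_32_bits(std::uintptr_t addr, std::size_t size) {
    const std::uint64_t limit = 1ULL << 32;
    return addr < limit && size <= limit - addr;
}

// Hypothetical helper: ask the kernel for anonymous memory near `hint`.
// MAP_FIXED is deliberately not used, so an existing mapping is never
// clobbered; the kernel may pick another address, which is then checked.
// Returns nullptr if the mapping failed or landed above 4 GiB.
static void* map_low(std::size_t size, std::uintptr_t hint) {
    assert(size > 0);
    void* p = mmap(reinterpret_cast<void*>(hint), size,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    if (!fits_in_32_bits(reinterpret_cast<std::uintptr_t>(p), size)) {
        munmap(p, size);  // unusable for a 32-bit-minded emulator
        return nullptr;
    }
    return p;
}
```

A caller would try something like map_low(ram_size, 0x18000000) and treat a null result as "pick a different hint", which is roughly the trial-and-error process described above.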

Naturally big-endian code cannot run natively on a little-endian processor. (Parenthetically, it should be possible to get SheepShaver running native on a big-endian POWER9, though you would need a SIGILL handler for PowerPC instructions no longer supported by 64-bit PPC like mcrxr, and something would need to be done about differing cache line sizes or dcbz is going to really ruin your day. I leave this exercise to the reader: I'm fairly confident this would work because G5 systems under Mac OS X run SheepShaver just dandy, but note that's because the operating system handlers are already doing this work for you, and the PowerPC 970 — but not the POWER4 — has a bit in a special HID register where dcbz can be made to act like a G4. Without these provisions G5 Power Macs wouldn't be compatible with Classic or any pre-G5 32-bit application.) SheepShaver does implement an emulated PowerPC CPU using a bespoke library called Kheperix, and Kheperix does have a JIT, but the JIT backend only seems to function for i386 and x86_64. Everything else runs in the interpreter.

Running under the interpreter, incredibly, isn't terrible like it was for PearPC. Kheperix's interpreter is pretty efficient, but the main reason is that huge amounts of the operating system, particularly I/O and video, are patched out and shortcut into the emulator itself, very unlike QEMU or PearPC. But this is also very flexible: QuickDraw acceleration is even supported, plus, as mentioned, networking and mounting a local directory as a Mac volume (no AppleShare needed), or even a disc in the optical drive, and some rudimentary clipboard synchronization. Recall that the normal state of a pre-OS X Power Mac is to be running 68K code, so SheepShaver achieves most of this magic by executing 68K A-line traps directly and jumping in and out of the Power Mac ROM 68K emulator. Functionally this isn't a problem because SheepShaver only operates as a uniprocessor system anyway, so the so-called Blue Task in the nanokernel is all there is. In fact, Kheperix doesn't emulate any PowerPC supervisor-level instructions except mfmsr (which returns a constant Machine State Register value of 0x0000f072, i.e., big endian, address translation, problem state), and it doesn't even implement SPRGs or most other supervisor-level SPRs. This is less of a functional problem than you might think because the PowerPC Mac OS generally runs drivers in user mode, though SheepShaver doesn't run much of this code anyway. Overall, while the interpreter isn't bad for an interpreter, there are notable gaps and some CPU-bound tasks like StuffIt Expander take way longer than they ought to (like close to five minutes to unpack a 37MB .sit).
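For anyone who wants to check that MSR constant by hand, the sketch below decodes it using the architected 32-bit PowerPC MSR bit positions. The describe_msr function and the MSR_* constant names are illustrative only and are not taken from the Kheperix source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// A subset of the architected 32-bit PowerPC MSR bits (names are my own).
constexpr std::uint32_t MSR_EE = 0x8000;  // external interrupts enabled
constexpr std::uint32_t MSR_PR = 0x4000;  // problem (user) state
constexpr std::uint32_t MSR_FP = 0x2000;  // floating point available
constexpr std::uint32_t MSR_ME = 0x1000;  // machine check enabled
constexpr std::uint32_t MSR_IP = 0x0040;  // interrupt prefix
constexpr std::uint32_t MSR_IR = 0x0020;  // instruction address translation
constexpr std::uint32_t MSR_DR = 0x0010;  // data address translation
constexpr std::uint32_t MSR_RI = 0x0002;  // recoverable interrupt
constexpr std::uint32_t MSR_LE = 0x0001;  // little-endian mode

// Return the names of the set bits, most significant first.
std::string describe_msr(std::uint32_t msr) {
    static const struct { std::uint32_t mask; const char* name; } bits[] = {
        {MSR_EE, "EE"}, {MSR_PR, "PR"}, {MSR_FP, "FP"}, {MSR_ME, "ME"},
        {MSR_IP, "IP"}, {MSR_IR, "IR"}, {MSR_DR, "DR"}, {MSR_RI, "RI"},
        {MSR_LE, "LE"},
    };
    std::string out;
    for (const auto& b : bits) {
        if (msr & b.mask) {
            if (!out.empty()) out += ' ';
            out += b.name;
        }
    }
    return out;
}
```

Feeding it 0x0000f072 yields PR (problem state) plus IR and DR (address translation) with LE clear (big endian), which matches the summary above.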

I did explore writing a ppc64le JIT with the initial port. The JIT backends were originally auto-generated with a tool that pulls binary data from an ELF object and builds a header file with the basic operations and processor machine code, which is then patched on the fly. This is a clever idea and should work for any arbitrary architecture, but it's early-2000s code and has trouble with relocation modes in binaries generated by later versions of gcc (and presumably clang), and it doesn't understand ppc64 at all. I did some initial work on this but it ended up being a rather bigger undertaking than I wanted to undertake right now even though it needs to be undertaken only once.

Instead, to juice the interpreter a bit more I looked for complex operations that could be turned into inline assembly language, since on Power ISA these operations can be trivially done "in hardware." The best example is floating point: things like fctiw/fctiwz, which convert an FPR to integer, or unusual operations like fres and frsqrte, all become one or at most a handful of instructions rather than doing the arithmetic and implementation details manually in C++. Even for operations like fused multiply-add which should be lowerable, a single fmadd or fnmadd more often than not is still faster than what the compiler generates. (C++ is still required for updating the register images, FPSCR/XER and condition codes, of course. However, since things like __builtin_fpclassify suck on Power ISA, at least in gcc, even some of that I rewrote in assembly too.) Population count (popcntw) and some other complex integer operations were accelerated the same way. AltiVec should be accelerable with VMX/VSX, but not many classic Mac applications use it, so I didn't bother this time around.
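As a rough illustration of the idea (not the actual SheepSforza code), the sketch below pairs a portable C++ version of fctiwz's convert-and-saturate behaviour with a one-instruction inline assembly path that is only compiled on 64-bit Power hosts. The function names are invented for this example, FPSCR updates are left out entirely, and the Power path should be read as an assumption about how such a shortcut could look.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Portable reference: truncate toward zero, saturate out-of-range values,
// and return 0x80000000 for NaN, as fctiwz does for its 32-bit result.
static std::int32_t fctiwz_soft(double x) {
    const std::int32_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int32_t hi = std::numeric_limits<std::int32_t>::max();
    if (std::isnan(x)) return lo;
    if (x >= 2147483648.0) return hi;
    if (x <= -2147483648.0) return lo;
    return static_cast<std::int32_t>(x);  // in range, so truncation is safe
}

// On a Power host the same conversion is a single instruction; elsewhere
// fall back to the portable version.
static std::int32_t fctiwz(double x) {
#if defined(__powerpc64__)
    double raw;  // the integer result lands in the low 32 bits of the FPR
    __asm__("fctiwz %0,%1" : "=d"(raw) : "d"(x));
    std::uint64_t bits;
    std::memcpy(&bits, &raw, sizeof bits);
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(bits));
#else
    return fctiwz_soft(x);
#endif
}
```

The interpreter still has to update the emulated FPSCR and condition codes around a call like this, which is the part that stays in C++.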

Another big improvement to apparent performance was to enable hardware cursor support. As configured, SheepShaver relies on the Mac OS to poll the mouse and draw the mouse pointer, and since it's triggering double-emulated 68K interrupts to poll ADB, it really kinda chugs without a JIT even with 60Hz video updates. SheepShaver has support for a hardware cursor which is kept in sync with the Mac OS cursor and drawn natively. This works beautifully at least with SDL 2, so I've made it the default and exposed it in the GTK settings dialogue. I also turned off swapping Command and Option by default (and exposed that in the settings dialogue) so I could keep my Mac muscle memory, and adjusted the SDL backend to capture all keystrokes when SheepShaver is foreground so that key combinations don't get eaten by your window manager. Combined with triple-pumping the 1Hz interrupt to better synchronize with the host real-time clock, responsiveness seems pretty darn good now; in fact, in my experience it's even better than QEMU running Mac OS 9.2.2.

I also fixed the sound system, which didn't work at all initially, even though SheepShaver supports it (using, you guessed it, 68K interrupts to get buffers from the mixer). The issue was AUDIO_S16MSB doesn't seem to work for SDL audio, at least not with my crummy little USB audio device. I changed this to AUDIO_S16SYS, which shouldn't regress big-endian, but then means I need to byte-flip the shorts I get from the Mac side. Latency is substantially, though not totally, reduced by writing inline assembly which generates a byte-swap vector in VMX (lvsl has other uses!) and repeatedly vperms the buffer 16 bytes at a time in a tight loop. Since the usual size of audio data chunks is a full 16K, I also made a heavily unrolled version which does 128 bytes every iteration. Alert sounds in Mac OS no longer have gaps, and most audio plays with only minimal pauses because of fetches from the Mac audio mixer, which will get better when the JIT is written.
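The byte-flipping itself is simple; the sketch below is a plain scalar C++ version with a modest unroll, standing in for the VMX vperm loop described above rather than reproducing it. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Swap the two bytes of one 16-bit sample.
static inline std::uint16_t swap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// Byte-swap a buffer of big-endian 16-bit samples in place so that a
// little-endian host can play them as native-endian audio.
static void swap_samples(std::uint16_t* buf, std::size_t count) {
    std::size_t i = 0;
    // Unrolled by four: the same "more work per iteration" idea as the
    // vectorized loop, just without vector registers.
    for (; i + 4 <= count; i += 4) {
        buf[i]     = swap16(buf[i]);
        buf[i + 1] = swap16(buf[i + 1]);
        buf[i + 2] = swap16(buf[i + 2]);
        buf[i + 3] = swap16(buf[i + 3]);
    }
    for (; i < count; ++i) buf[i] = swap16(buf[i]);  // leftover samples
}
```

A vector version does the same permutation on eight samples (16 bytes) per instruction, which is where the latency win comes from.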

Last but not least, the build system now supports link-time optimization by default (pass --disable-lto to configure if you don't want this), and will detect POWER8 and POWER9 CPUs from /proc/cpuinfo and add the appropriate -mcpu flag (pass --disable-cpudetect to configure if you don't want this).

This is now good enough for me to run my personal "big four" productivity applications: Adobe Photoshop 6.0 (it can't run 7.0 since SheepShaver is limited to 9.0.4), Adobe FrameMaker, QuickTime VR Authoring Studio and QuarkXPress. Microsoft Office 98 and 2001 won't run on SheepShaver, but Word 5.1 does (in general 68K applications seem to be more compatible).

Best of all, they interact directly with the T2's filesystem — no need for a local installation of netatalk to exchange files.

Here's QTVR inside the Apple company store circa 1997 or so.

It's also good enough to run Doom, and even play it, though the music trips up a bit over itself. If you turn the music off, it runs rather better.

Marathon and Marathon 2 (for a good time, go over to your cat while she's sleeping and whisper "MARATHON!" in her ear) both play fine.

Unfortunately there are random intermittent things I think are gaps in emulation, and other more consistent things that definitely are. For example, you can hang the Mac immediately by running the Startup Disk CDEV: it makes a weird disk status call with csCode 51 and then locks up. I wrote a little kludge to try calling SysError with dsForcedQuit whenever that call appears, and it does indeed pop up a system error box, but it's blank and nothing else works. Even ExitToShell, while it did exit the app, also hangs things. I'm not sure if I did something wrong with my deep magic jumping back into the 68K fire or whether Startup Disk has the system in an unrecoverable state at that point; neither possibility is unlikely, nor are they mutually exclusive. I left the code in to see if I could do something more with it later.

In fact, because exception handling is basically non-existent in Kheperix, whole classes of applications won't work or crash worse. Much as a system error can take down the entire Mac, an illegal instruction will cause SheepShaver to abort because Kheperix has no facility for dealing with it. For that matter, illegal memory accesses are simply ignored if you enable that option because there is no way to invoke the exception handler in the Mac ROM. You can't use guest-level debuggers as a result because any instructions like trap, twi, etc. are unimplemented: you could decode the instruction, but if the trap condition is satisfied, you can't do anything with it. Some sort of deep plumbing would be required to essentially do what the nanokernel is unable to, and without it you can forget about running CodeWarrior in debug mode, or using MacsBug at all (though maybe you can get away with RealBASIC or Future BASIC). QEMU is a full system emulator, so at least in theory it can do all of these things, and it already can run some applications SheepShaver will never be able to.
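To show what "decode the instruction" amounts to, here is a minimal sketch that decodes a twi instruction word and evaluates its trap condition using the architected TO field bits. It is not Kheperix code, the names are invented, and it stops exactly where the real problem starts: it can tell you a trap should fire, but has nowhere to deliver it.

```cpp
#include <cassert>
#include <cstdint>

// TO-field condition bits shared by tw and twi.
enum : std::uint32_t { TO_LT = 16, TO_GT = 8, TO_EQ = 4, TO_LTU = 2, TO_GTU = 1 };

// True if comparing a with b satisfies any condition selected in `to`.
static bool trap_condition(std::uint32_t to, std::uint32_t a, std::uint32_t b) {
    const std::int32_t sa = static_cast<std::int32_t>(a);
    const std::int32_t sb = static_cast<std::int32_t>(b);
    return ((to & TO_LT)  && sa < sb) ||
           ((to & TO_GT)  && sa > sb) ||
           ((to & TO_EQ)  && a == b)  ||
           ((to & TO_LTU) && a < b)   ||
           ((to & TO_GTU) && a > b);
}

// Decode a twi word (primary opcode 3: TO, RA, 16-bit signed immediate)
// and report whether it would trap given the 32 emulated GPRs.
static bool twi_would_trap(std::uint32_t insn, const std::uint32_t gpr[32]) {
    assert((insn >> 26) == 3);
    const std::uint32_t to = (insn >> 21) & 31;
    const std::uint32_t ra = (insn >> 16) & 31;
    const std::int32_t simm = static_cast<std::int16_t>(insn & 0xffff);
    return trap_condition(to, gpr[ra], static_cast<std::uint32_t>(simm));
}
```

For example, the word 0x0FE00000 (twi 31,0,0) selects every condition and therefore always traps, which is why unconditional traps are a common breakpoint encoding.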

Still, it's amped up and functional enough that it's now my go-to Power Mac emulator for when I need one, so now you can go to it too. There are enough OpenPOWER-specific changes here that I've christened this modified version of SheepShaver "SheepSforza" after our favourite Nimbus processor series. The supported configuration is Linux, little-endian POWER9, SDL2, running Mac OS 9.0.4 with a Power Macintosh 7300 Old World ROM. (Which, conveniently, I now own several of including my original machine.) It should work fine with a New World ROM but I find I have better luck with Old World systems, and there is no practical difference in feature support. Configurations with 60Hz video and up to 1GB of RAM work just fine and even a lowly 4-core Blackbird will handle both with ease. Do not attempt to enable the PowerPC or 68K JITs: at best nothing will happen and at worst nothing will happen except a core dump. If you are transferring a disk image, ROM and settings from another installation of SheepShaver, you may wish to re-examine your settings (in particular, I strongly advise enabling the hardware mouse cursor).

To build, grab the source from GitHub and ensure you have development headers for SDL 2 (SDL 1.2 may work, but I don't recommend it) and GTK, then

cd SheepShaver/src/Unix
./configure
make -j24 # or as you like

and ./SheepShaver to start the emulator. (Don't forget to set low memory with sudo sysctl vm.mmap_min_addr=0 and, if you use SELinux, sudo setsebool -P mmap_low_allowed 1 or it will error out on startup.) A nice wiki explaining further use and configuration on Linux is available, if you are new to SheepShaver, except you should not turn the JIT on yet and slirp is the only currently supported networking option.

What's next to do, as I have time and inclination? Obviously I want to get the JIT running, and obviously I'd like to have a more reliable fashion of injecting code into the main execution context so that we can at least work around some of these bugs even if we can't fix them. SheepShaver does this sort of already but in side-execution contexts, so we can't take over the machine that way. I also want to figure out a way to suspend the emulator like QEMU can so that I can put it in "hibernation" when it doesn't need to be taking up CPU cycles. Since it isn't using a network share and SheepShaver syncs the RTC directly, it should behave just fine in this mode. There are also likely some opportunities for more explicit vectorization in the video update loop.

The other thing to fix is Basilisk II, which should work, but doesn't. There are probably a couple of similarly fundamental problems to be solved there too, but once they are, many of the improvements in SheepShaver should work there as well (and because it's emulating a 68K, it's likely to run at a ripping pace on modern OpenPOWER systems even without a JIT).

Back to the Firefox JIT!

Comments

  1. I am full of admiration for self-denial. In the 90s I belonged to the Alpha AXP club, unfortunately I do not have the same comfort as PPC users ;-)

    Replies
    1. I certainly commiserate. I rather liked Alpha. There's still a 164LX here (running Tru64 as intended by all right-thinking human beings).

    2. I started out with a Microway workstation with an Alpha 21164PC processor and was working with Fedora at the time. Now I have four more machines in operation: PWS 500au, XP1000, ES40 and ES45. Unfortunately, I didn't have access to Tru64 and they all work under Gentoo ;-)

