Showing posts from February, 2022

Intel gets worse, but Power11 might get better

Just in case we needed any more reassurance we made the right move with OpenPOWER: Phoronix is reporting that Intel is about to get even more restrictive with firmware. For as much flak as Intel (deservedly) takes over the Intel Management Engine and other closed highly-privileged blobs, the actual Firmware Support Package has so far been open source and royalty-free (it's what's layered on top that's the problem). There isn't a smoking gun or significant direct context in this Twitter thread, but the issue seems to be around the upcoming "Scalable FSP" architecture. Previously, open source firmware had control of initialization and could call into the closed blobs (or not) as necessary, but FSP 3.0 seems to invert this, giving a new closed blob control to call into the open source firmware (or not). This lets Intel cut projects like Coreboot on x86_64 out of the picture, and can only be seen as a way to directly subvert their operation. A lot of this stuff is under NDA currently, but as systems incorporating FSP 3.0 start appearing we should begin to get a clearer understanding.

By the way, don't expect AMD to act any better. Remember that they're the company bringing you Pluton: quoted from the article, "Pluton will also prevent people from running software that has been modified without the permission of developers." It wouldn't be surprising to see AMD's Platform Security Processor pick up additional lock-in capabilities to reinforce this and other vendor controls.

Meanwhile here in the computing underground, we have our own problems with Power10, but there may be some light on the horizon for Power11. It was always a mystery after POWER8 and POWER9's completely open firmware why IBM would take a sudden wrong turn with Power10, but this unsubstantiated post from the same thread (if it's not wishful thinking) suggests COVID staffing issues rather than philosophical concerns were to blame for IBM using off-the-shelf vendored IP blocks requiring the existing blobs in its firmware.

I don't know who that is, or what internal events at IBM they're privy to, so it should be taken with a grain of salt. (If they read this blog, feel free to follow up in the comments or with me in E-mail.) Still, it makes more sense than IBM suddenly slamming the door on OpenPOWER after the tremendous goodwill built up with POWER8 and especially POWER9. It does also suggest, however, that the situation with Power10 is more or less baked in. The roadmap for POWER9, currently the OpenPOWER architecture with the widest install base, basically blew up and the long-promised POWER9 AIO "Axon" or "Axone" never arrived. I'm predicting that Power10 will have a smaller install base than POWER9 because it's still IBM-exclusive, no other vendors so far have announced machines, and Raptor (the only "low-end" vendor of OpenPOWER workstations) has said they won't ship a Power10 system with blobs. If there wasn't enough money on the table to release Axon for IBM's biggest OpenPOWER ecosystem, there won't be for a newly-freed "Power10+."

But there's plenty of time for Power11, possibly landing in the 2024-5 timeframe, just in time for POWER9's technological ebb. And if simple humanpower really was the reason IBM took shortcuts, hopefully their staffing and design teams will be in a much better place by then (wars, pestilence, locusts and inflation notwithstanding). It would come just in time because what makes OpenPOWER a compelling alternative to x86_64 and Apple ARM (and what so far has eluded RISC-V) is performance. I'd like to see Power11 continue to keep us in the game — but without compromises this time.

AlmaLinux 8.5 now stable on ppc64le

AlmaLinux, one of the "new" classic CentOSes after CentOS was reworked into the a-bit-more-fizzy CentOS Stream, has now updated their 8.5 beta release for ppc64le officially to stable status. This is probably your best bet if you want a no-cost RHEL-like experience on your OpenPOWER hardware with the stability reputation you used old-school CentOS for. Release notes and Live ISO images are available. Currently AlmaLinux 8 has a support commitment until at least 2029.

Vikings' OpenPOWER store is open

Rejoice, folks on the other side of the Atlantic: now you can buy the hardware you want from a source closer to home. Vikings' OpenPOWER store is now showing items in stock, including Raptor Talos II and T2 Lite full systems, T2 Lite boards, DD2.2 and DD2.3 POWER9 CPUs up to 22 and 18 cores respectively, and heatsinks and HSFs (and the hex driver needed to install them). The T2 and T2 Lite full systems in particular are different from what Raptor sells on this side of the pond: Raptor T2 systems currently come in Supermicro SC747 chassis with redundant 1620W PSUs but the Vikings flavour comes in Phanteks Enthoo Pro 2 towers with a 650W PSU. Vikings T2 Lites start at €4158/US$4713 with VAT and full T2s at €5707/US$6467 with VAT, and both include 16GB of RAM and a single 4-core DD2.2 POWER9 with 3U HSF. You can of course add a GPU, 2nd CPU, RAM, SSD, bigger PSU, etc. to order, and if you're in Aachen, Germany, you can even drop by and pick it up. No word on Blackbird sales yet but we're sure that's on the way as the supply chain improves and Raptor is able to manufacture more, and in the meantime the T2 Lite remains a solid alternative. Note that while their OpenPOWER store is separate from their RYF products, the T2 and T2 Lite are still absolutely FSF RYF. Systems are available for order now. Let's support the companies that support us.

Chimera Linux test ISOs available for ppc64le

Chimera Linux, an upcoming Linux distro with a FreeBSD userland so you don't have to choose, now has downloadable test images for ppc64le and x86_64. The little-endian Power images, which we care about here obviously, are available both as straight-up console (which can be redirected to the onboard serial port) and GNOME-or-console flavours, and require a POWER8 or higher. The GNOME spins use Wayland by default but also allow X11 with a bootloader configuration. There's still a lot in flux, but it's impressive the OS is this far along, and certainly offers something more substantial to the Power community than the usual distro dance. For more info, see the FAQ, or download ISO images from the download page.

Brief status update on the POWER9 JavaScript JIT

% obj/dist/bin/js --baseline-eager --ion-offthread-compile=off --regexp-warmup-threshold=0 -e 'var i,j=0;for(i=0;i<100;i++){j+=j+i;}print(j)'
% obj/dist/bin/js --ion-eager --ion-offthread-compile=off --regexp-warmup-threshold=0 -e 'var i,j=0;for(i=0;i<100;i++){j+=j+i;}print(j)'

Told you it was a productive holiday weekend. Onward to conquering the test suite.

Skiboot support lands in Coreboot

Coreboot, the lightweight open-source extensible firmware project, can now load the intermediate boot stage Skiboot as a payload. This should now make POWER8 and POWER9 (through QEMU) a functioning "board." The next step, per consultants 3mdeb on Twitter (and sponsor Insurgo), is to rebase and push actual hardware support for the Talos II (and T2 Lite). If all goes as planned, this means a potential (and potentially faster) replacement for Hostboot, which is the lower-level portion that launches Skiboot and eventually Petitboot. Note that this development is distinct from Arctic Tern, which aims to build "a better BMC" with its own firmware at a lower level than Hostboot or Coreboot. If POWER9 Coreboot can pull off a faster start time, combined with Arctic Tern it would be an even bigger jump for usability, so we look forward to seeing actual firmware that people can try. 3mdeb is looking for beta testers.

Firefox 97 on POWER

Firefox 97 is out. Because printing to PostScript was useful, Mozilla has removed it (though at least right now you can still print to PDF, and you can still print to PostScript printers, just not to a file). More helpfully, there are various CSS, SVG and DOM improvements.

It was previously reported that WebRTC was broken on OpenPOWER under Fx96, which looks like bug 1738445 (a replay of bug 1465274, which I upstreamed four years ago). I don't use WebRTC myself due to the way my internal network is configured, so don't consider this in any way an attestation of functionality, but I was able to build Firefox 97 unmodified with the .mozconfigs and LTO-PGO patch from Firefox 95. Although I have an old build of gn, I tested whether that was the necessary piece by pulling that line out of the configuration, and it still built — at least under gcc. I did do a test clang build (removing the compiler export lines and removing forcing linkage with bfd), and that did fail to build (with or without gn), and --disable-webrtc along with this change to NSS did allow it to compile. However, when started up with ./mach run, the clang build could browse but complained the history and places databases were corrupt and about:support was messed up. Using the same profile with a gcc build was fine. IDK.

I do need to upstream some changes to that section of NSS. Besides fixing the bad #if (I agree with the analysis it only works on gcc by happy accident), it wasn't really written for anything past POWER4 or so. In fact, we could probably just assume that any ppc64le has a 128-byte cache line and shortcut the logic entirely (AFAIK this would just be POWER8, POWER9 and Power10), and just keep the code only on big-endian to still support the PowerPC 970. Tell me the weird edge case or unusual Power CPU this would break in the comments before I get a round tuit.

I've had a bad week of work, but I'm looking forward to making more progress on the test suite and continuing the OpenPOWER JIT development over the long holiday weekend here in the USA. Remember: I don't have a life so you don't have to.

SheepSforza: the SheepShaver Power Mac emulator for OpenPOWER

I've always maintained (possibly by personal experience) that one of the natural future markets for OpenPOWER is one of the architecture's past markets: Power Mac users, particularly those who were irked by the Intel transition. I certainly was, even granting it made commercial sense at the time, and I ended up using a Power Mac G5 Quad for 13 years as a daily driver before I got the Raptor Talos II I use now.

I first touched a Mac in 1987 (it was my buddy's father's Mac Plus, and we spent hours messing around in HyperCard and System 6), though the first Mac I personally owned was a second-hand Macintosh IIsi. I upgraded from there to (briefly) a used Power Mac 7200, then traded it in for a used Power Mac 7300, then pimped that out, then a Power Mac G4 MDD, the first computer I personally bought new, and then the G5. With the exception of the G5 all of these systems natively ran the classic Mac OS, so I had a large investment in classic Mac software. Even my G5 only ever ran Mac OS X Tiger so that I still could run Classic applications (on top of the fact I preferred Tiger's interface to Leopard).

This is relevant because of the current state of Mac emulation: in general, the classic Mac OS is better supported than Mac OS X. SheepShaver started on BeOS and the PowerPC-based BeBox as a commercial product and pun on the Amiga 68K Mac emulator ShapeShifter; it only runs the classic Mac OS, and only then up to 9.0.4 (later versions require an MMU, which SheepShaver doesn't implement). It achieved surprisingly good speed on modest hardware by heavily patching the operating system (more later) and running most programs as native code directly on the BeBox's twin 603 CPUs, not unlike KVM-PR, though without using any special processor features (instead, this was achieved by patching out supervisor portions of the emulated Mac ROM and running all components, including the nanokernel, in the problem state — today we would call this paravirtualization). SheepShaver works on Mac OS X, too, allowing Power Macs with Leopard to run classic apps at near native speeds, though not as well integrated as the Classic Environment, of course. The ability of SheepShaver to run Mac apps directly on the processor accounts for some of its unusual design decisions that persist even on non-PowerPC architectures either running applications through its JIT compiler (on x86 and x86_64) or with an interpreter (everything else, including aarch64 and Apple silicon). SheepShaver led to Basilisk II, which is a 68K Macintosh emulator, before itself becoming open source. To this day both emulators share substantial amounts of code.

Besides SheepShaver and Basilisk II, emulators include vMac's descendant Mini vMac (68K), MESS/MAME (68K), PearPC (PowerPC, an exception in that it can only run Mac OS X and various other free OSes), and of course QEMU. QEMU originally could only boot Mac OS X, but later added support for Mac OS 9, and can accelerate running at least OS X with KVM-PR (when it's not broken), though this has some edge glitches, and KVM-PR doesn't work on early versions of OS X or with OS 9 at all. In addition, QEMU has decent emulation fidelity and a well-supported "mini JIT" called TCG when KVM or other virtualizers can't or don't work, but it requires drivers on the guest OS side for device support, is slower to start up, and because of its full system emulation has non-trivial overhead for certain operations.

PearPC, SheepShaver and QEMU are your only options right now for emulating a Power Mac. Back in the day while using QEMU to run my old Mac applications, I explored the two other choices. Neither is officially maintained anymore, and even any unofficial updates to SheepShaver appear to be intermittent and fragmented. More to the point, at that time I couldn't get SheepShaver to run at all, even when I managed to make it compile (more about this in a moment). PearPC does build and appear to run on OpenPOWER systems, but its emulation speed is hideous; the JIT is limited to x86, and without it the project estimates it runs about 500 times slower than actual performance, which certainly matches my experience with it. I did some tinkering with adjusting refresh rates and other attempts to reduce the overhead, but it still ran abysmally, and I ended up abandoning it.

SheepShaver's long history was attractive, though, and having networking and native filesystem support would be a huge plus, so after some alterations to the source code to make it build on POWER9, I decided to look at why it wouldn't start. Its design is quite unusual because it originated as a "native" emulator without a JIT, running application code bare-metal on a 32-bit PowerPC CPU, and some of the architectural decisions made as a result have persisted. The classic Mac OS is notable for storing the state of certain important globals in very low memory, starting even from an effective address of 0, requiring any implementation of SheepShaver to ensure that virtual addresses that low can be mapped — even in emulation. However, the known trick of sudo sysctl vm.mmap_min_addr=0 and (if you're on SELinux, which being Fedora I am) sudo setsebool -P mmap_low_allowed 1 to allow it to map the lowest page of memory didn't get it started, and trying to use one of SheepShaver's alternate memory mapping schemes failed during configuration, so it was going to be "real" addressing or bust.

The problem turned out to be that SheepShaver is still pervasively 32-bit, even in a 64-bit configuration, the other major legacy hangover of its internal design. When memory is mmap(2)ed, whether directly or via malloc(), on many 64-bit operating systems (and pretty much all ppc64 systems), it gets a full 64-bit address, which immediately crashes the emulator because it only deals with the least significant word. The workaround on x86_64 was to give a specific address to mmap(2) instead of letting it pick anywhere, so I stumbled onto an address that seemed to work for ppc64le and was able to narrow that down based on where SheepShaver's executable code maps. (For Fedora this value was 0x18000000. It should work for other Linuces, but may need adjustment on *BSDs.)
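To make the failure mode concrete, here is a minimal Python sketch of what happens when code keeps only the least significant word of an address. The helper names and the sample high address are hypothetical illustrations, not anything from SheepShaver; only 0x18000000 comes from the text above.

```python
MASK32 = 0xFFFFFFFF

def truncate_to_word(addr: int) -> int:
    """Keep only the least significant 32 bits, as 32-bit-only code would."""
    return addr & MASK32

def survives_truncation(addr: int) -> bool:
    """True if the address is unchanged after being squeezed into 32 bits."""
    return truncate_to_word(addr) == addr

# Hypothetical address of the kind a 64-bit kernel might pick on its own.
high = 0x00007FFF9A400000
# The mmap hint mentioned in the text for Fedora on ppc64le.
low = 0x18000000

for addr in (high, low):
    verdict = "ok" if survives_truncation(addr) else "points somewhere else"
    print(f"{addr:#018x} -> {truncate_to_word(addr):#010x} ({verdict})")
```

An address that already fits in 32 bits round-trips unchanged, which is why steering mmap(2) toward a low hint sidesteps the crash without touching the emulator's 32-bit assumptions.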

Naturally big-endian code cannot run natively on a little-endian processor. (Parenthetically, it should be possible to get SheepShaver running native on a big-endian POWER9, though you would need a SIGILL handler for PowerPC instructions no longer supported by 64-bit PPC like mcrxr, and something would need to be done about differing cache line sizes or dcbz is going to really ruin your day. I leave this exercise to the reader: I'm fairly confident this would work because G5 systems under Mac OS X run SheepShaver just dandy, but note that's because the operating system handlers are already doing this work for you, and the PowerPC 970 — but not the POWER4 — has a bit in a special HID register where dcbz can be made to act like a G4. Without these provisions G5 Power Macs wouldn't be compatible with Classic or any pre-G5 32-bit application.) SheepShaver does implement an emulated PowerPC CPU using a bespoke library called Kheperix, and Kheperix does have a JIT, but the JIT backend only seems to function for i386 and x86_64. Everything else runs in the interpreter.

Running under the interpreter, incredibly, isn't terrible like it was for PearPC. Kheperix's interpreter is pretty efficient but the main reason is that huge amounts of the operating system are patched out and shortcut into the emulator itself, very unlike QEMU or PearPC, particularly I/O and video. But this is also very flexible: QuickDraw acceleration is even supported, plus, as mentioned, networking and mounting a local directory as a Mac volume (no AppleShare needed), or even a disc in the optical drive, and some rudimentary clipboard synchronization. Recall that the normal state of a pre-OS X Power Mac is to be running 68K code, so SheepShaver achieves most of this magic by executing 68K A-line traps directly and jumping in and out of the Power Mac ROM 68K emulator. Functionally this isn't a problem because SheepShaver only operates as a uniprocessor system anyway, so the so-called Blue Task in the nanokernel is all there is. In fact, Kheperix doesn't emulate any PowerPC supervisor-level instructions except mfmsr (which returns a constant Machine State Register value of 0x0000f072, i.e., big endian, address translation, problem state), and it doesn't even implement SPRGs or most other supervisor-level SPRs. This is less of a functional problem than you might think because the PowerPC Mac OS generally runs drivers in user mode, though SheepShaver doesn't run much of this code anyway. Overall, while the interpreter isn't bad for an interpreter, there are notable gaps and some CPU-bound tasks like StuffIt Expander take way longer than they ought to (like close to five minutes to unpack a 37MB .sit).
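For anyone who wants to check that reading of the constant MSR value, here is a small Python sketch that decodes it. The bit masks are the standard 32-bit PowerPC MSR layout as I understand it; the table and the decode_msr helper are my own illustration, not code from Kheperix.

```python
# Bit masks for the low half of the 32-bit PowerPC Machine State Register.
MSR_BITS = {
    0x8000: "EE (external interrupts enabled)",
    0x4000: "PR (problem state, i.e. user mode)",
    0x2000: "FP (floating point available)",
    0x1000: "ME (machine check enabled)",
    0x0800: "FE0 (FP exception mode 0)",
    0x0400: "SE (single-step trace)",
    0x0200: "BE (branch trace)",
    0x0100: "FE1 (FP exception mode 1)",
    0x0040: "IP (exception prefix)",
    0x0020: "IR (instruction address translation)",
    0x0010: "DR (data address translation)",
    0x0002: "RI (recoverable interrupt)",
    0x0001: "LE (little endian)",
}

def decode_msr(value: int) -> list:
    """Return the descriptions of every MSR bit set in value."""
    return [name for mask, name in MSR_BITS.items() if value & mask]

for flag in decode_msr(0x0000F072):
    print(flag)
```

PR, IR and DR come out set and LE comes out clear, which matches the "big endian, address translation, problem state" summary.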

I did explore writing a ppc64le JIT with the initial port. The JIT backends were originally auto-generated with a tool that actually pulled binary data from an ELF object and built a header file with the basic operations and processor machine code, which the JIT then patched on the fly. This is a clever idea and should work for any arbitrary architecture, but it's early-2000s code and has trouble with relocation modes in binaries generated by later versions of gcc (and presumably clang), and it doesn't understand ppc64 at all. I did some initial work on this but it ended up being a rather bigger undertaking than I wanted to undertake right now even though it needs to be undertaken only once.

Instead, to juice the interpreter a bit more I looked for complex operations that could be turned into inline assembly language, since we're running on Power ISA and these operations can be trivially done "in hardware." The best example is floating point: things like fctiw/fctiwz, which convert an FPR to integer, or unusual operations like fres and frsqrte, all become one or at most a handful of instructions rather than doing the arithmetic and implementation details manually in C++. Even for operations like fused multiply-add which should be lowerable, a single fmadd or fnmadd more often than not is still faster than what the compiler generates. (C++ is still required for updating the register images, FPSCR/XER and condition codes, of course. However, since things like __builtin_fpclassify suck on Power ISA, at least in gcc, even some of that I rewrote in assembly too.) Population count (popcntw) and some other complex integer operations were accelerated the same way. AltiVec should be accelerable with VMX/VSX, but not many classic Mac applications use it, so I didn't bother this time around.
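As a rough idea of the "implementation details" a software version has to handle for something like fctiwz, here is a Python model of just its result value: round toward zero, saturate at the 32-bit limits, and return 0x80000000 for NaN. This is my own sketch of the architected behaviour, not the interpreter's code, and it deliberately ignores the FPSCR updates that the real instruction also performs.

```python
import math

INT32_MAX = 2**31 - 1
INT32_MIN = -2**31

def fctiwz(x: float) -> int:
    """Model the integer result of fctiwz as a signed 32-bit value."""
    if math.isnan(x):
        return INT32_MIN                      # NaN converts to 0x80000000
    if math.isinf(x):
        return INT32_MAX if x > 0 else INT32_MIN
    truncated = math.trunc(x)                 # round toward zero
    return max(INT32_MIN, min(INT32_MAX, truncated))  # saturate

for value in (1.9, -1.9, 1e10, float("nan")):
    print(value, "->", fctiwz(value))
```

On the host all of this collapses into one instruction, which is the point of moving it into inline assembly.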

Another big improvement to apparent performance was to enable hardware cursor support. As configured SheepShaver will rely on the Mac OS to poll the mouse and draw the mouse pointer, and since it's triggering double-emulated 68K interrupts to poll ADB, it really kinda chugs without a JIT even with 60Hz video updates. SheepShaver has support for a hardware cursor which is kept in sync with MacOS's cursor and drawn natively. This works beautifully at least with SDL 2, so I've made it the default and exposed it in the GTK settings dialogue. I also turned off swapping Command and Option by default (and exposed that in the settings dialogue) so I could keep my Mac muscle memory, and adjusted the SDL backend to capture all keystrokes when SheepShaver is foreground so that key combinations don't get eaten by your window manager. Combined with triple-pumping the 1Hz interrupt to better synchronize with the host real-time clock, responsiveness seems pretty darn good now — in fact, in my experience even better than QEMU running Mac OS 9.2.2.

I also fixed the sound system, which didn't work at all initially, even though SheepShaver supports it (using, you guessed it, 68K interrupts to get buffers from the mixer). The issue was AUDIO_S16MSB doesn't seem to work for SDL audio, at least not with my crummy little USB audio device. I changed this to AUDIO_S16SYS, which shouldn't regress big-endian, but that then means I need to byte-flip the shorts I get from the Mac side. Latency is substantially, though not totally, reduced by writing inline assembly which generates a byte-swap vector in VMX (lvsl has other uses!) and repeatedly vperms the buffer 16 bytes at a time in a tight loop. Since the usual size of audio data chunks is a full 16K, I also made a heavily unrolled version which does 128 bytes every iteration. Alert sounds in Mac OS no longer have gaps, and most audio plays with only minimal pauses because of fetches from the Mac audio mixer, which will get better when the JIT is written.
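The permute idea is easier to see in slow motion. This Python sketch applies a fixed 16-entry byte-swap pattern to each 16-byte chunk of a buffer of big-endian 16-bit samples. It is only a model of the technique: the names are made up, it uses a single source vector where vperm takes two, and it ignores the element-numbering wrinkles of VMX on little-endian.

```python
# Pattern that swaps the two bytes of each 16-bit sample in a 16-byte
# chunk: [1, 0, 3, 2, ..., 15, 14].
SWAP16 = [i ^ 1 for i in range(16)]

def permute(chunk: bytes, pattern) -> bytes:
    """Select bytes from chunk in the order given by pattern."""
    return bytes(chunk[i] for i in pattern)

def swap_samples(buf: bytes) -> bytes:
    """Byte-swap every 16-bit sample, one 16-byte chunk at a time."""
    if len(buf) % 16:
        raise ValueError("buffer length must be a multiple of 16 bytes")
    out = bytearray()
    for offset in range(0, len(buf), 16):
        out += permute(buf[offset:offset + 16], SWAP16)
    return bytes(out)

big_endian = bytes(range(32))      # stand-in for samples from the Mac mixer
native = swap_samples(big_endian)
print(big_endian[:8].hex(), "->", native[:8].hex())
```

Because the pattern never changes, it only has to be built once, and the loop body is then a load, a permute and a store per chunk.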

Last but not least, the build system now supports link-time optimization by default (pass --disable-lto to configure if you don't want this), and will detect POWER8 and POWER9 CPUs from /proc/cpuinfo and add the appropriate -mcpu flag (pass --disable-cpudetect to configure if you don't want this).
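The detection step amounts to something like the following Python sketch. This is not the configure script's actual logic: the function names are hypothetical, and it assumes the ppc64le /proc/cpuinfo format where each processor has a line such as "cpu : POWER9, altivec supported".

```python
import re

def mcpu_flag(cpuinfo_text: str):
    """Return a -mcpu flag for the first POWER8/POWER9 'cpu' line, else None."""
    for line in cpuinfo_text.splitlines():
        match = re.match(r"cpu\s*:\s*POWER(8|9)", line)
        if match:
            return f"-mcpu=power{match.group(1)}"
    return None

def detect(path: str = "/proc/cpuinfo"):
    """Read cpuinfo from path; return None if it can't be read."""
    try:
        with open(path) as handle:
            return mcpu_flag(handle.read())
    except OSError:
        return None

# Sample text in the assumed format, so the sketch runs on any host.
sample = "processor\t: 0\ncpu\t\t: POWER9, altivec supported\nclock\t\t: 3800.000000MHz\n"
print(mcpu_flag(sample))
```

Returning None for anything unrecognized leaves the compiler's default target alone, which is also what you get by disabling detection.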

This is now good enough for me to run my personal "big four" productivity applications: Adobe Photoshop 6.0 (it can't run 7.0 since SheepShaver is limited to 9.0.4), Adobe FrameMaker, QuickTime VR Authoring Studio and QuarkXPress. Microsoft Office 98 and 2001 won't run on SheepShaver, but Word 5.1 does (in general 68K applications seem to be more compatible).

Best of all, they interact directly with the T2's filesystem — no need for a local installation of netatalk to exchange files.

Here's QTVR inside the Apple company store circa 1997 or so.

It's also good enough to run Doom, and even play it, though the music trips up a bit over itself. If you turn the music off, it runs rather better.

Marathon and Marathon 2 (for a good time, go over to your cat while she's sleeping and whisper "MARATHON!" in her ear) both play fine.

Unfortunately there are random intermittent things I think are gaps in emulation, and other more consistent things that definitely are. For example, you can hang the Mac immediately by running the Startup Disk CDEV: it makes a weird disk status call with csCode 51 and then locks up. I wrote a little kludge to try calling SysError with dsForcedQuit whenever that call appears, and it does indeed pop up a system error box, but it's blank and nothing else works. Even ExitToShell, while it did exit the app, also hangs things. I'm not sure if I did something wrong with my deep magic jumping back into the 68K fire or whether Startup Disk has the system in an unrecoverable state at that point, and neither possibility is unlikely or mutually exclusive. I left the code in to see if I could do something more with it later.

In fact, because exception handling is basically non-existent in Kheperix, whole classes of applications won't work or crash worse. Much as a system error can take down the entire Mac, an illegal instruction will cause SheepShaver to abort because Kheperix has no facility for dealing with it. For that matter, illegal memory accesses are simply ignored if you enable that option because there is no way to invoke the exception handler in the Mac ROM. You can't use guest-level debuggers as a result because any instructions like trap, twi, etc. are unimplemented: you could decode the instruction, but if the trap condition is satisfied, you can't do anything with it. Some sort of deep plumbing would be required to essentially do what the nanokernel is unable to, and without it you can forget about running CodeWarrior in debug mode, or using MacsBug at all (though maybe you can get away with RealBASIC or Future BASIC). QEMU is a full system emulator, so at least in theory it can do all of these things, and it already can run some applications SheepShaver will never be able to.
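To show how small the "decode" half of that problem is, here is a Python sketch that pulls apart a 32-bit twi instruction and evaluates its trap condition against a register value. The function names are mine and this is not Kheperix code; it follows the standard encoding (primary opcode 3, then TO, rA and a signed 16-bit immediate). The hard part, delivering the resulting trap to the guest, is exactly what's missing.

```python
def decode_twi(insn: int):
    """Split an instruction word into (TO, rA, SIMM), or None if not twi."""
    if (insn >> 26) & 0x3F != 3:          # primary opcode 3 is twi
        return None
    to = (insn >> 21) & 0x1F
    ra = (insn >> 16) & 0x1F
    simm = insn & 0xFFFF
    if simm & 0x8000:                     # sign-extend the 16-bit immediate
        simm -= 0x10000
    return to, ra, simm

def trap_taken(to: int, reg_value: int, simm: int) -> bool:
    """Evaluate the five TO condition bits for a 32-bit register value."""
    a = reg_value & 0xFFFFFFFF
    signed_a = a - 0x100000000 if a & 0x80000000 else a
    unsigned_imm = simm & 0xFFFFFFFF
    return bool(
        (to & 0x10 and signed_a < simm) or        # less than, signed
        (to & 0x08 and signed_a > simm) or        # greater than, signed
        (to & 0x04 and signed_a == simm) or       # equal
        (to & 0x02 and a < unsigned_imm) or       # less than, unsigned
        (to & 0x01 and a > unsigned_imm)          # greater than, unsigned
    )

fields = decode_twi(0x0FE00000)           # twi 31,0,0
print(fields, trap_taken(fields[0], 0, fields[2]))
```

With TO set to 31 at least one comparison always holds, so that encoding traps whatever the register contains.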

Still, it's amped up and functional enough that it's now my go-to Power Mac emulator for when I need one, so now you can go to it too. There are enough OpenPOWER-specific changes here that I've christened this modified version of SheepShaver "SheepSforza" after our favourite Nimbus processor series. The supported configuration is Linux, little-endian POWER9, SDL2, running Mac OS 9.0.4 with a Power Macintosh 7300 Old World ROM. (Which, conveniently, I now own several of including my original machine.) It should work fine with a New World ROM but I find I have better luck with Old World systems, and there is no practical difference in feature support. Configurations with 60Hz video and up to 1GB of RAM work just fine and even a lowly 4-core Blackbird will handle both with ease. Do not attempt to enable the PowerPC or 68K JITs: at best nothing will happen and at worst nothing will happen except a core dump. If you are transferring a disk image, ROM and settings from another installation of SheepShaver, you may wish to re-examine your settings (in particular, I strongly advise enabling the hardware mouse cursor).

To build, grab the source from GitHub and ensure you have development headers for SDL 2 (SDL 1.2 may work, but I don't recommend it) and GTK, then

cd SheepShaver/src/Unix
make -j24 # or as you like

and ./SheepShaver to start the emulator. (Don't forget to set low memory with sudo sysctl vm.mmap_min_addr=0 and, if you use SELinux, sudo setsebool -P mmap_low_allowed 1 or it will error out on startup.) A nice wiki explaining further use and configuration on Linux is available, if you are new to SheepShaver, except you should not turn the JIT on yet and slirp is the only currently supported networking option.

What's next to do, as I have time and inclination? Obviously I want to get the JIT running, and obviously I'd like to have a more reliable fashion of injecting code into the main execution context so that we can at least work around some of these bugs even if we can't fix them. SheepShaver does this sort of already but in side-execution contexts, so we can't take over the machine that way. I also want to figure out a way to suspend the emulator like QEMU can so that I can put it in "hibernation" when it doesn't need to be taking up CPU cycles. Since it isn't using a network share and SheepShaver syncs the RTC directly, it should behave just fine in this mode. There are also likely some opportunities for more explicit vectorization in the video update loop.

The other thing to fix is Basilisk II, which should work, but doesn't. There are probably a couple of similarly fundamental problems to be solved there too, but once they are, many of the improvements in SheepShaver should work there as well (and because it's emulating a 68K, it's likely to run at a ripping pace on modern OpenPOWER systems even without a JIT).

Back to the Firefox JIT!