Showing posts from 2021

W(h)ither POWER8

With the recent announcement that Ubuntu's ppc64le ("ppc64el") flavour is moving to require POWER9, it's worth asking not only how much life is left in POWER8, but also in POWER9, now that Power10 (such as it is) is available.

POWER8 was the first OpenPOWER processor and the one planned for the original Raptor Talos (that never got released to the public), but also appeared in several third-party systems, largely by Tyan. It offered fully open firmware and while it exclusively required Centaur memory buffer chips, these could be on riser cards, interposers or even on the logic board to allow attaching regular ECC DIMMs. It introduced ISA 2.07, which among other features expanded on the vector-scalar extension instructions first introduced in POWER7 (called VSX-2 in 2.07).

POWER8 systems are certainly more widely distributed than previous generations, which since about POWER5 were almost exclusively IBM, and POWER8 was also the first Power ISA CPU with a fully-functioning little-endian mode (the POWER7 implementation had gaps), which caused it to rapidly become the baseline for most distributions supporting Power. But POWER9 is even more widely distributed, not least because of "low end" systems like this Talos II and the Blackbird. It uses 25% less power but is 50% faster than a chip that was already two to three times faster than POWER7, and it has even more advantages in terms of instruction set: ISA 3.0 expands VSX further (VSX-3) and also adds a number of other useful instructions. The current incarnation of our Firefox JIT, for example, leverages new POWER9-specific instructions for remainders, accessing the program counter and 64-bit byte swapping. All this, and it's still a fully open architecture with fully open firmware.

On the other hand, Power10 is presently a step backwards. Putting its otiose binary blobs aside for the moment, there are only a few Power10 SKUs in its current infancy, none of them are workstations, and none of them don't say IBM. No Power10 hardware takes direct attach RAM, not even like the POWER8 did. No ODM has a channel for obtaining the actual CPUs. If there's a Rainier reference design to work from, no one seems to be talking about it. It's almost back to the bad old days when IBM wouldn't sell me a POWER7 and nobody else made one (my long-running POWER6 was a reseller purchase).

If Ubuntu's move is the first of many to decommission POWER8 support, that's still over six years as a first-tier citizen (almost five as second fiddle to POWER9), and no one else so far has talked about a similar move. (Even if RHEL 9 goes POWER9+ only, RHEL 8 would presumably support your POWER8 until 2029.) It's sad to see it happen but POWER9, besides being easier to get, is an improvement in virtually every way and in ways Power10 right now is not. Besides the fact IBM's still selling POWER9 machines, the chip's time on top and its wider distribution are good signs for the first Power CPU in years to be in purpose-built desktops and more third-party servers. Nearly five years atop the heap buys you a lot of market penetration, especially with a questionable successor. While all good things must come to an end, POWER8's death is hardly imminent, and POWER9's is nowhere yet in sight.

91ESR with Baseline Compiler/Baseline wasm for POWER9

It's heeeeeee-re. I've completed the pull-up of the POWER9 Firefox JavaScript JIT to the current ESR, Firefox 91. As a bonus I also completed the second-stage Baseline Compiler (Baseline Interpreter being the first-stage compiler) at the same time for a reason I'll explain in a minute.

The build process is the same as Firefox 91, using the 91ESR tree, but requires adding --enable-jit to your .mozconfig and applying this patch and set of files. Please note that POWER9 remains the only supported architecture (Power10 grudgingly, but it should work), and only on little-endian. If you compile big-endian, the JIT should statically disable itself, even with --enable-jit. If you compile with -mcpu=power9, which is recommended, the JIT is statically enabled with --enable-jit and becomes slightly faster because there are fewer runtime checks. If you don't explicitly specify POWER9, or do something like -mcpu=power8, but still specify --enable-jit, then runtime detection should be enabled (which right now disables the JIT). I have not tested this on POWER8 because I don't have a POWER8, so I can't fix it myself. If this doesn't work or builds a defective Firefox or JavaScript shell, please submit a correction and I'll incorporate it.
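If you want a quick sanity check before committing to a long build, something like the sketch below will do. It assumes /proc/cpuinfo on ppc64le Linux carries a cpu line naming the processor (for example, POWER9); is_jit_cpu is a made-up helper name for illustration and is not part of the Firefox tree, nor is it how the JIT itself does its detection.

```shell
# Hypothetical pre-build check, not part of the Firefox tree.
# Reads /proc/cpuinfo-style text on stdin and succeeds if the first
# "cpu" line names a POWER9 or Power10 (assumed format: "cpu : POWER9, ...").
is_jit_cpu() {
    grep -m 1 '^cpu[[:space:]]*:' | grep -Eq 'POWER(9|10)'
}
```

On a real system you might run is_jit_cpu < /proc/cpuinfo and only add --enable-jit with -mcpu=power9 if it succeeds.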

What's working? What now works is the Baseline Interpreter and the Baseline Compiler, and Baseline compilation for WebAssembly. asm.js using Cranelift isn't supported yet, because this requires the third-stage Ion optimizing compiler, and WebAssembly transpiled to asm.js will simply compile in Baseline. This is not the fastest the browser can run, but it is certainly noticeably faster, and most of the pure JavaScript benchmarks I tested showed it is already several times more efficient than the C++ interpreter. I did not encounter any obvious crashes in things like Gmail, Google Docs and my workplace Office 365 instance (and I was a lot more productive!) but the reason for releasing this is to see if you find any. If you can reliably crash the browser in a way that doesn't crash with the JIT off, file an issue with exact steps to reproduce. If I can't reproduce it, I can't fix it. Steps to trigger an assertion in a debug build would be even more helpful.

What's not working yet? The third-stage optimizing compiler doesn't work and isn't enabled (our patches turn it off by default in the browser, and you should always specify --no-ion to the JS shell unless you're doing development), and as stated, this also means no specific Cranelift support for things like the asm.js-based DOSBox and MAME emulators on Internet Archive. These will run in the slower Baseline Compiler directly. There are also some failures in Wasm compared to x86_64 and ARM that didn't turn up in the test suite (it passes everything) which I'm unable to narrow down right now. For example, WAD Commander has graphical glitches even though the game plays fine, and Google Earth stalls out with a runtime error. The reason I finished the Baseline Compiler support was in the hope I'd smoke out some other bugs, and I did in fact find more to fix, but it didn't fix these. On the other hand, these handcoded Wasm demos seem to work, as does this Wasm RISC-V emulator, this somewhat funky karts game and this Wasm Gameboy emulator.

It is entirely possible that some of this is simply due to other pre-existing bugs on our platform that this support just unmasks — after all, we were never able to run code like this before — and there are naturally changes in later Firefoxen that aren't in the ESR. I won't be able to assess that until it's pulled up further, of course, but for the time being you can use the JIT in 91ESR if you prefer/need the speed while further development stabilizes. Until then, please don't file issues on Wasm stuff that doesn't work unless you know why it doesn't work.

Next steps? The plan is to pull the 91ESR JIT up to Firefox 97 or 98 alpha and start on Ion development on that new base hopefully finishing in time to do one last pull-up to Firefox 102, i.e., the next ESR, and submit the finished JIT to Mozilla then. Longer term, we'd welcome support for additional configurations and the key is SupportsFloatingPoint() in js/src/jit/ppc64/Assembler-ppc64.h, which I have abused as a runtime gate. You should be able to tell from the comments in that file how to force the JIT to run on an unsupported configuration. I have implemented HasPPCISA3() which returns true on POWER9 (and Power10) so that appropriate codegen paths are run based on the CPU present. Most of the codegen will work on little-endian POWER8 except for a few places that will hit a forced crash. If you get this working and implement HasPPCISA27() or some such, then I will accept those changes assuming they are not massive. I will also accept big-endian patches, but you will have a much bigger job, and unless you're prepared to do little-endian emulation for Wasm or asm.js (like the limited little-endian support in TenFourFox's IonPower-NVLE for typed arrays) and maintain those changes certain things will never work on big.

Meanwhile, your contributions are still solicited especially on the new work to be done and we'll be getting that new tree up so you can participate. However, patches and PRs that will not be accepted are anything that regresses the core support for LE POWER9, spacing or style changes (we will be doing cleanup on the entire set before submitting to Mozilla, so please don't waste our time on this right now), or sets covering multiple issues (one catastrophe at a time, please). The faster we get this done, the faster we get it in the tree, and the better supported we'll be going forward.

Starting with Firefox 96, there will be the usual updates on building mozilla-release, but I'll also do a verification build on 91ESR and make any needed updates to patches, and upload updates to Github. Please post your constructive and reproducible issues in the comments or on Github for triage.

Firefox 95 on POWER

Firefox 95 is released, screenshot at right. The big new feature, besides speculative AOT JIT which doesn't apply to us yet, is RLBox, which compiles certain third-party libraries into safe WebAssembly, and then compiles them back into C, so they can be compiled a third time into pre-sanitized native code. This has obvious security benefits and the performance impact shouldn't be especially large, but it adds yet another build-time prerequisite: the WASI SDK. This kind of really sucks because now you have to have a third toolchain (it builds one whether you like it or not) besides clang and our preferred compiler, gcc. Pending internal package support, some distros have chosen simply to disable this for the immediate future, even including Fedora.

Besides the inconvenience, the other main issue with this is that while it's clearly safer native code, it's also slower native code by some non-zero factor, however small in a well-optimized PGO-LTO build. As such I've chosen to test to make sure it works but my "official" PGO-LTO build configs will have it turned off for the time being with --without-wasm-sandboxed-libraries. While it was very easy to build the smaller WASI libc, it doesn't have C++ headers, so the build goes bang if you try to use it as a WASI sysroot. Fortunately you can download a pre-built copy of the SDK and just pull out the system-independent wasi-sysroot and feed that to the Firefox build system with --with-wasi-sysroot=/where/it/at/wasi-sysroot. Then, to get the built-ins for linking, pull out libclang_rt.builtins-wasm32.a and copy it to /usr/lib64/clang/13.0.0/lib/wasi (or wherever your clang libraries reside), and ensure you have wasm-ld. You may have to install lld to get wasm-ld; I had to, but Fedora has a package for it already. Now that you've looted the archive, you can just trash the rest of it and use your system version of clang assuming it's version 8+. This works and makes a functional version of Firefox but I can totally understand why this is unacceptable if you want to build from the raw source code.
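The looting step can be sketched as a small shell function. This is an illustration under assumptions, not a tested recipe: stage_wasi_bits is a hypothetical helper, it finds the two pieces by name rather than assuming how the SDK archive is laid out internally, and the destination directories are whatever you pass in.

```shell
# Hypothetical helper: pull the two needed pieces out of an unpacked WASI SDK.
#   $1 = directory where the pre-built SDK was unpacked
#   $2 = directory that should end up containing wasi-sysroot
#   $3 = clang's wasi library directory
stage_wasi_bits() {
    sdk_dir=$1
    sysroot_dest=$2
    clang_wasi_dir=$3
    # Locate both pieces by name instead of assuming the archive layout.
    sysroot=$(find "$sdk_dir" -type d -name wasi-sysroot | head -n 1)
    builtins=$(find "$sdk_dir" -type f -name libclang_rt.builtins-wasm32.a | head -n 1)
    [ -n "$sysroot" ] && [ -n "$builtins" ] || return 1
    mkdir -p "$sysroot_dest" "$clang_wasi_dir" || return 1
    cp -R "$sysroot" "$sysroot_dest/" || return 1
    cp "$builtins" "$clang_wasi_dir/" || return 1
    # Print the line to paste into .mozconfig.
    printf 'ac_add_options --with-wasi-sysroot=%s/wasi-sysroot\n' "$sysroot_dest"
}
```

For example, stage_wasi_bits ~/unpacked-sdk /where/it/at /usr/lib64/clang/13.0.0/lib/wasi would match the paths used above; writing under /usr/lib64 will most likely need root.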

After all that, though, I'm running Firefox 95 without RLBox simply because we need to wring all the performance out of the executable that we can. As such, here are the current .mozconfigs.


# Debug build
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24" # or as you like
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9 -fpermissive"
ac_add_options --enable-debug
ac_add_options --enable-linker=bfd
ac_add_options --without-wasm-sandboxed-libraries

export GN=/home/censored/bin/gn # if you haz


# Optimized (PGO-LTO) build
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9 -fpermissive"
ac_add_options --enable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full
ac_add_options --without-wasm-sandboxed-libraries
ac_add_options MOZ_PGO=1

export GN=/home/censored/bin/gn

The PGO-LTO build patch is also updated.

Oh, by the way, in JIT news, I've mounted a debug browser and OpenPOWER Firefox can now run Doom. (Firefox 91 ESR shown.)

Still some glitches to work out, some of which I suspect aren't anything to do with Wasm support, but you couldn't do this on Firefox on OpenPOWER before. Does this count as a "Tonight's Game on OpenPOWER" entry? (*Firefox on Windows 95 screenshot from Beta Archive.)

Fedora 35 mini-review on the Blackbird and Talos II

Happy American Thanksgiving. While America watches football and eats deep-fried gobbler, we went to the Popeye's drivethru for chicken and I finished updating Fedora on my daily driver, now at version 35 (see our prior review of Fedora 34). As I always point out: while Fedora is a very common distro on OpenPOWER systems, even if you don't necessarily run Fedora yourself the fact that it does run is important, because it tends to be well ahead of most distros and many problems are identified and fixed in it before moving to other less advanced ones. I test it on my 4-core BMC graphics Blackbird and my dual-8 AMD WX7100 GPU Talos II.

F34 was a messy, unpleasant upgrade. I did the update first on my 4-core stock Blackbird, which I try to keep to stock Fedora as much as possible, though I note for the record both the Bird and the T2 are configured to come up in a text boot instead of gdm and I start GNOME manually from there. I strongly recommend this to act as a recovery mechanism in case your graphics card gets whacked by something or other. On Fedora this is easily done by ensuring the symlink /etc/systemd/system/ points to /lib/systemd/system/. Once you've logged into the console, jump to GNOME with startx (set XDG_SESSION_TYPE to x11 if this isn't already done), or XDG_SESSION_TYPE=wayland dbus-run-session gnome-session if you want to explore the Wayland Wasteland. Since this is a minimal boot I can also do the upgrade at the same text prompt for speed and ensure as little interference as possible. As usual, the process is, from a root prompt:

dnf upgrade --refresh # upgrade prior system and DNF
dnf install dnf-plugin-system-upgrade # install upgrade plugin if not already done
dnf system-upgrade download --refresh --releasever=35 # download F35 packages
dnf system-upgrade reboot # reboot into upgrader

This went much more smoothly than F34, which had some weird conflicts; it was able to get the necessary packages right away and booted into the installer with no issue. Back at the text prompt, we started with Wayland, as I always do to see if it's still going to suck, and I'm still not disappointed. Performance was even worse than F34: it got glitchy just trying to take a grab with gnome-screenshot from the command line (see this Reddit thread), and BMC video (through the on-board HDMI connector) is still stuck at 1024x768. I took this on my Pixel 3 after I got tired of mucking around with it.

As before, don't even bother with Wayland on a Blackbird if you don't have a GPU. Xorg worked fine but was still slow like F34 was. I'll get to that in a moment.
Otherwise, in Xorg, the system, Firefox and LibreOffice mostly worked as before modulo the performance problems, which was a relief.

The T2 tends to be a different story because I have this system heavily customized. Additionally, kernel 5.14 has a known problem with AMD Vega cards (add amdgpu.aspm=0 to your kernel command line as a workaround), and 5.15 may have an issue with amdgpu in power saving mode, so watch out for both of these problems depending on your GPU. (At least one user reported having to blacklist the AST BMC, though that wasn't necessary for me.)

The first problem was more elemental, however: after I downloaded the packages and ran the installation, it still came up offering an impossibly old kernel - the same thing I had to work around with updating to F34!

When I selected it, it started Fedora 35, but with this old 5.11-series kernel from Fedora 34. I did a manual grub2-mkconfig -o /boot/grub2/grub.cfg and restarted, and the Petitboot menu (built off the grub configuration) looked sane again. The text boot came up without incident.
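A quick way to notice this situation after the fact is to compare the kernel you actually booted with the newest one installed. The sketch below is only an illustration: newest_kernel is a hypothetical helper, it relies on GNU sort's -V version ordering, and the rpm query format in the usage note is an assumption about how your kernel packages are named.

```shell
# Hypothetical helper: print the highest version string from a list on
# stdin, one version per line, using GNU sort's version ordering.
newest_kernel() {
    sort -V | tail -n 1
}
```

Something like rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | newest_kernel should then print the newest installed kernel; if that differs from uname -r after an upgrade, the boot menu is probably still stale and grub2-mkconfig is worth a try.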

Next, the desktop environment. Usually GNOME upgrades break a large number of my cherished extensions. Surprisingly, only Dash-to-Dock broke this time, which I rebuilt from a fork using these instructions. Note, however, that I do have disable-extension-version-validation set to true in dconf-editor which helps avoid a lot of churn.

However, the same GNOME regressions turned up in F35 that were in F34: CTM still makes a mess out of my custom colour profiles (again something like xrandr --output DisplayPort-0 --set CTM 0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1 will fix it, but this changes based on how your monitors are connected, and every time you [re]start GNOME you'll have to do it), colour calibration still crashes with my Pantone huey, and graphics were still awfully slow. This performance problem is once again libgraphene not being properly built to enable SIMD; the fix was made by the maintainer but the Fedora-distributed library doesn't seem to incorporate it properly. I rebuilt it on F35 and put a copy on Github. It will replace the file of the same name in /lib64 (remember to make a backup and don't do this while GNOME is running).
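For the curious, that long string looks like a 3x3 identity matrix. The assumption here (taken from how the amdgpu driver describes its CTM property, so check your own driver) is that each matrix entry is a 64-bit fixed-point value passed as two 32-bit words, low word first, so 1.0 comes out as 0,1 and 0.0 as 0,0. Under that assumption, this sketch regenerates the string instead of retyping it; ctm_identity is a hypothetical helper name.

```shell
# Hypothetical helper: print the 18-value CTM string for a 3x3 identity
# matrix. Assumes each entry is written as "low word,high word", so that
# 1.0 is "0,1" and 0.0 is "0,0".
ctm_identity() {
    out=""
    for row in 0 1 2; do
        for col in 0 1 2; do
            if [ "$row" -eq "$col" ]; then
                out="${out}0,1,"   # diagonal entry: 1.0
            else
                out="${out}0,0,"   # off-diagonal entry: 0.0
            fi
        done
    done
    # Drop the trailing comma.
    printf '%s\n' "${out%,}"
}
```

With that defined, xrandr --output DisplayPort-0 --set CTM "$(ctm_identity)" is equivalent to the command above; the output name still depends on how your monitors are connected.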

I'll not comment much further about Wayland except to say that it continues to meet my low expectations on the T2, but as it still doesn't support what my work habits require, I still don't use it. But you can, at least if you have a working discrete graphics card and you've updated libgraphene. For me, Xorg forever, I guess.

My conclusion is damning with faint praise: at least it wasn't any worse. And with these tweaks it works fine. If you're on F34 you have no reason not to upgrade, and if you're on F33 you won't have much longer until you have to (and you might as well just jump right to F35 at that point). But it's still carrying an odd number of regressions (even though, or perhaps because, the workarounds for F35 are the same as F34) and the installation on the T2 was bumpier than the Blackbird for reasons that remain unclear to me. If you run KDE or Xfce or anything other than GNOME, you shouldn't have any problems, but if you still use GNOME as your desktop environment you should be prepared to do more preparatory work to get it off the ground. I have higher hopes for F36 because we may finally get that float128 update that still wrecks a small but notable selection of packages like MAME, but I also hope that some of these regressions get dealt with as well because that would make these updates a bit more liveable. Any system upgrade of any OS will make you wonder what's going to break this time, but the most recent Fedora updates have come off as more fraught with peril than they ought to be.

If you like big-endian and Void and cannot lie ...

... then you other brothers can't stand by: Void PPC, probably one of the most finely tuned distributions for Power ISA systems (and one of the few still supporting Power Macs), needs big endian maintainers due to the work needed to maintain those four flavours, i.e., 32-bit PowerPC and 64-bit BE Power multiplied by musl and glibc. I totally get the idea of not maintaining what you don't personally use, which is one of the reasons I cut loose TenFourFox and Classilla earlier. It's a shame but it's awfully hard to justify dedicating resources to a free product that isn't personally beneficial. The new BE Void PPC maintainer would be responsible for doing the builds as well as fixing issues, but it should be possible to coordinate hosting the packages on an official mirror. I imagine it's negotiable to do only glibc or only 64-bit or some such depending on the hardware or interest you have.

If no one steps up, the big-endian musl repos go first by the end of this year, and the glibc repos will be discontinued in January 2023. Little-endian 64-bit is unaffected as is the experimental little-endian 32-bit flavour. Interested community members will want to take a look at the Void PPC Github.

51,552 JavaScript tests can't be wrong

Yeah, so about that OpenPOWER Minimum Viable Product JavaScript JIT for Firefox. This happened (all timings from an unoptimized debug build on my dual-8 Talos II with -j24):

% ./mach jstests --args "--no-ion --no-baseline --blinterp-eager --regexp-warmup-threshold=0" -F -j24

[43359|    0|    0|  614] 100% ======================================>| 529.7s
% ./mach jstests --args "--no-ion --no-baseline" -F -j24
[43359|    0|    0|  614] 100% ======================================>| 499.0s
% js/src/jit-test/ --args "--no-ion --no-baseline --blinterp-eager --regexp-warmup-threshold=0" -f -j24 obj/dist/bin/js
[8193|   0|   0|   0] 100% ==========================================>| 132.3s
% js/src/jit-test/ --args "--no-ion --no-baseline" -f -j24 obj/dist/bin/js
[8193|   0|   0|   0] 100% ==========================================>| 133.3s

That's a wrap, folks: the MVP, defined as Baseline Interpreter with irregexp and Wasm support for little-endian POWER9, is now officially V. This is the first and lowest of the JIT tiers, but is already a significant improvement; the JavaScript conformance suite executed using the same interpreter with --no-ion --no-baseline --no-blinterp --no-native-regexp took 762.4 seconds (1.53x as long) and one test timed out completely. An optimized build would be even faster.
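The headline numbers can be checked from the summary lines themselves. This is a rough sketch: it assumes the first column in each bracketed summary is the count of passing tests (which is consistent with the title), and sum_passes and slowdown are hypothetical helper names. The 1.53x figure matches comparing 762.4 seconds against the 499.0-second run.

```shell
# Sum the first column of each "[a| b| c| d]" summary line on stdin.
# Assumption: the first column is the number of passing tests.
sum_passes() {
    awk -F'[][|]' '/^\[/ { total += $2 } END { print total + 0 }'
}

# Print how many times longer the slow run took than the fast run,
# rounded to two decimal places.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}
```

Feeding it one jstests summary line and one jit-test summary line should give 51552, and slowdown 762.4 499.0 gives the ratio quoted above.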

Currently the code generator makes heavy use of POWER9-specific instructions, as well as VSX to make efficient use of the FPU. There are secondary goals of little-endian POWER8 and big-endian support (including pre-OpenPOWER so your G5 can play too), but these weren't necessary for the MVP, and we'd need someone actually willing to maintain those since I don't run Linux on my G5 or my POWER6 and I don't run any of my OpenPOWER systems big. While we welcome patches for them, they won't hold up primary support for POWER9 little-endian, which is currently the only "tier 1" platform. I note parenthetically this should also work on LE Power10 but as a matter of policy I'm not going to allow any special support for the architecture until IBM gets off their corporate rear end and actually releases the firmware source code. No free work for a chip that isn't!

You should be able to build a JIT-enabled Firefox 86 off of what's in the Github tree now, but my current goal is to pull it up to 91ESR so that it can be issued as patches against a stable branch of Firefox. These patches will be part of my ongoing future status updates for Firefox on OpenPOWER (yes, you'll need to build it yourself, though I'm pondering setting up a Fedora copr at some point). The next phase will be getting Baseline Compiler passing everything, which should be largely done already because of the existing Baseline Interpreter and Wasm support, and then the final Ion JIT stage, which still needs a lot of work. We'll most likely set up a separate tree for it so you can help (ahem). No promises right now but I'd like to see the completed JIT reach the Firefox source tree in time for the next ESR, which is Firefox 102. That's more than you can say for Chrome/Chromium, which so far has refused to accept OpenPOWER-specific work at all.


It's been a while since we did this, and even longer since we showed an actual Talos system, but here's Martin Kukač's Blackbird's new sexy case to contain its 8-core CPU, 32GB RAM and GeForce 210 GPU. The polished metal and open bottom, plus the vertical row of ports and power, make for a nice transitional look from the old Power Mac G5.

If you've got a well-coiffed OpenPOWER workstation to show off, post in the comments. Plus, somebody has to have an actual T2 or T2 Lite they're proud of, or I'm going to have to come up with a new hash tag.

Big and little POWER shouldn't just be endian

While the majority of OpenPOWER installations by this point are probably running little-endian, every single POWER chip runs big — big power usage, that is. While POWER9 is still performance-competitive with x86_64 and this situation continues to improve as more software gets better optimized, and there have been huge gains since POWER4/the PowerPC 970 in particular, POWER chips still run relatively hot and relatively hungry. Anandtech tried to normalize this for POWER8 systems by estimating transactions per watt; power measurements can be very imprecise and depend on more than just the system architecture, but even with that consideration the tested Tyan POWER8 in particular was outclassed by nearly a factor of three by a Xeon E5-2699. Possibly in response POWER9 is more aggressive with power savings than POWER8 and makes a lot of microarchitectural improvements, using 25% less juice for 50% more zip (so roughly a doubling of performance per watt), and Power10 supposedly improves on POWER9's performance per watt even more by at least 2.6 times according to IBM's figures.
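The "roughly a doubling" figure is just the performance ratio divided by the power ratio. As a tiny sketch (perf_per_watt_gain is a hypothetical helper, and the inputs are the marketing-level figures quoted above, not measurements):

```shell
# Relative performance-per-watt gain: performance ratio / power ratio.
#   $1 = new performance relative to old (1.5 for "50% faster")
#   $2 = new power draw relative to old (0.75 for "25% less power")
perf_per_watt_gain() {
    awk -v perf="$1" -v power="$2" 'BEGIN { printf "%.2f\n", perf / power }'
}
```

So perf_per_watt_gain 1.5 0.75 works out to 2.00 for POWER9 over POWER8.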

But IBM's playbook for improving perf per watt hasn't really changed. Either you're boosting performance by juicing the microarch, jimmying IPC with more instructions and more cores, or both, or you're trying to diminish power usage with heavier clock speed throttling or turning off cores. While shooting the die budget at lower-wattage pack-in accelerators is a clever hybrid approach, their application-specific nature also means they're rather less useful in typical situations than their marketing would allege (look at how little currently uses the gzip accelerator in every POWER9, for example). You can do a lot with strategies like these — AMD certainly does — but sooner or later you'll hit a wall somewhere, either against the particular limitations of the design you're working with or against the intrinsic physical limitations of making a hippo do gymnastics while eating fewer calories.

Apple Silicon has a lot of concerning issues with it from a free computing perspective, but its performance is impressive, and its performance per watt is jaw-dropping. A lot of this is the secret sauce in their microarch which ironically came from P.A. Semi, originally a Power ISA licensee, and some may be due to details of the on-board GPU. But a good portion is also due to the big core-little core approach largely pioneered with the ARM big.LITTLE Cortex A7 and used to great effect in the M1 series. After all, if you want to get the best of both worlds, make some of the cores use less power and give those cores tasks that require less oomph (efficiency or E-cores), reserving the heavy tasks for the big ones (power or P-cores). Intel thinks so too: Lakefield and Alder Lake both attempt the same sort of heterogeneous CPU topology for x86_64, and it would be inconceivable to believe AMD isn't looking to make the same jump for their next iteration.

The chief issue with going that route is making sure that the cores are getting work commensurate with their capabilities. This is easy for Apple since they control the whole banana: macOS Quality of Service is all about doing just that (you'd think they would do something based on nice levels as well, but I guess all the sweet talk about being desktop Un*x went out the window somewhere around Mavericks). Linux added initial support for big.LITTLE with kernel 3.10 but it took years for other improvements to the Linux scheduler to make it meaningful. Intel made things worse for themselves in Lakefield and Alder Lake by using lower power Atom-based E-cores that didn't support AVX-512 (and the Tremont E-cores in Lakefield didn't even support AVX2, meaning such tasks couldn't be run by them at all). Rather than hinting Windows 11 or the internal hardware not to send AVX-512 code to the Gracemont E-cores, Alder Lake just doesn't support AVX-512, full stop, on any core. Kernel 5.13 supports Alder Lake, but kernel 5.15 has dawned and there is no specific Intel Thread Director support so far, though there is scheduler support for AArch64 E-cores that can't run 32-bit code. And Alder Lake is turning out to be very power-hungry, which calls some of the design into question, in addition to various compatibility issues when software unwittingly puts tasks on the E-cores that don't work as expected.

Still, the time is coming where Power ISA should start thinking about a big-little CPU, maybe even for Power11. We already have big cores (if IBM will ever get their heads out of their rear ends and release the firmware source), but we also have an already extant little OpenPOWER core: Microwatt. While Microwatt doesn't support everything that POWER9 or Power10's large cores do, it's still intended to be a fully compliant OpenPOWER core, and since the Linux kernel is already starting to cater to heterogeneous designs, a set of POWER8-compliant Microwatt E-cores could still execute on the same die along with a set of Power11 full fat P-cores. Add logic on-chip to move threads to the P-cores if they hit an instruction the E-cores don't support and you're already most of the way there with relatively minor changes to the Linux kernel.

What IBM — or any future OpenPOWER chip builder, though so far no one else is in the performance category — needs to avoid is what seems to be dooming Alder Lake: they've managed to hit the bad luck jackpot with a chip that not only uses more power but has more compatibility problems. Software updates will fix this issue somewhat but a little more forethought might have staved it off, and the apparent greater wattage draw should have been noticed long before it left the lab. But IBM has already shown wattage improvements over the last two generations and if the P- and E-core functionalities are made appropriately comparable, a big-little Power11 — with open firmware please! — could be a very compelling next upgrade for the next generation of Power-based workstations and servers. Apple has clearly demonstrated that highly efficient and powerful computing experiences are possible when hardware and software align. There's no reason OpenPOWER and Linux or *BSD can't do the same on open platforms.

Firefox 94 on POWER

Firefox 94 is released. I have little interest in the colourizer, but I do like about:unloads and EGL support on Linux for great WebGL justice even on X11 (I don't use the Wayland Wasteland), at least if you have an AMD/ATI card like the WX7100 Raptor sells as a BTO option. There are also various performance improvements and a fun feature where you can use a different Mozilla VPN server for each separate multi-account container, the latter probably being Firefox's most useful capability right now. The LTO-PGO patch is unchanged from Firefox 93 and the .mozconfigs are unchanged from Firefox 90.

Fedora 35

Fedora 35 is out, which we pay particular attention to at Floodgap Orbiting HQ since both our daily driver Talos II and HTPC Blackbird run Fedora. Even if you don't run it, it's a cutting-edge distro, so OpenPOWER-specific issues show up and (hopefully) get fixed here early, making it a good preview for other distros. I wasn't too happy with F34 so I'm hoping the only direction it can go is up this time around.

Fedora 35 upgrades to GNOME 41 with Wayland-specific performance improvements, a new default GL renderer for GTK4 and new options for power and window management. WirePlumber is also added to complement PipeWire in F34 for additional video and audio session policy control, along with Python 3.10, Perl 5.34, and PHP 8.0. It ships with kernel 5.14, rpm 4.17, glibc 2.34 and gcc 11.

However, there is still no motion on the 128-bit long double transition for OpenPOWER, and the F34 tracking bug has not been reopened. This most notoriously affects MAME but also a small and growing number of other packages, and I have no idea what's holding this up for so long — like, literally, years.

Now that F35 is out, F33 will end-of-life on November 30. We'll do our usual deep dive review in a few days after everything has updated.

New Blackbird firmware

New firmware for the Blackbird is available from the Raptor wiki. This version fixes the Petitboot crashes that plagued users of the LSI SAS module, essentially replacing the "2.01 beta" that Raptor put out to fix the problem. If you were affected, you may wish to update in order to pick up the officially blessed fix.

Also, Raptor is hinting that more updates regarding the Blackbird's availability will come in November. I suspect this may have something to do with the shortage of SATA controllers which is also delaying some Talos II and T2 Lite orders; you can order a T2 without a SATA card, but the Blackbird has SATA on-board. Hopefully the logjam will break up soon.

First flight of Kestrel, the FPGA OpenPOWER-based BMC, and introducing the Arctic Tern dev board

Our alert readers yield the most interesting tips (thanks D!), including a video quietly uploaded to the Raptor wiki currently linked nowhere else showing a running Kestrel system connected to a Talos II. And it looks stupendous.

Kestrel, you will recall from our previous coverage, is a "soft BMC" that replaces the functionality of the onboard ASPEED BMC standard in all current shipping POWER9 hardware. Like the ASPEED BMC, Kestrel provides remote access and management, system IPL (via FSI), firmware (via LPC and SPI), and a 2D framebuffer (though Kestrel is planned to use HDMI, not VGA). However, while the ASPEED BMC runs its own full Linux distribution (OpenBMC), Kestrel runs Zephyr, a small open-source real time operating system, and can be built from the FPGA up with open tooling. Best of all, it's OpenPOWER just like the main system (instead of ARM), using a Microwatt core in little-endian mode as its CPU.

Here's the expanded block diagram of what it includes and provides:

What really impressed me about Kestrel was the potential for much faster BMC boot times — Raptor was promising within seconds (versus the minutes with the stock BMC firmware, though third-party projects like BangBMC aim to improve on it). If even that was all Kestrel could accomplish, it would be worth it.

Well, the video has made me a believer. A few short seconds after power was applied, and in less time than it took the announcer to describe what was happening, the Kestrel-enhanced T2 was ready to boot. I'll take two, Tim.

As before, Kestrel is incarnated on a Lattice ECP5 Versa development board, which the demo unit in the video has mounted to a little tray in the base of the T2's E-ATX case. The ECP5's PCIe edge is not connected. Instead, power is being drawn off an unknown source, and the Flexible Support Interface signals are coming from the on-board debug connector which is not mentioned in the T2 manual. Here's a picture of the one in my running T2 at J3200 (next to the boot and BMC flash):

The "FSI Adaptor v1.0" daughterboard plugged into J3200 is new and doesn't appear on the Raptor Engineering Kestrel page (for that matter, the page still says it's "not yet tested" on the Talos II). The TPM headers at J10105 are connected for LPC, and while it's hard for me to see at the photographed angle, the COM2 port at J7701 also seems connected as well as another set of lines that most likely service I2C. These signals all route to a hat sitting on the ECP5 which is also new, though its label is just out of focus. (The Ableconn card visible in the background looks like NVMe and doesn't appear to be part of Kestrel.)

For the demonstration the ASPEED BMC was completely disabled (but how wasn't said — perhaps the FSI connector is rigged to inhibit it, or maybe this method). The demo showed rapid power on into the Zephyr OS and IPL into Hostboot quickly afterwards. Once the On-Chip Controllers on the POWER9 become active, a separate thread in Zephyr continuously polls the CPU temperature sensors to set appropriate fan speeds, while maintaining the rest of the core functionality. Here's the Kestrel monitoring the system during IPL:

This demo didn't show remote access or management (though we have a screenshot) and it didn't show the framebuffer functionality. But the video does announce a dedicated soft-BMC development board called Arctic Tern which will be "plug and play for all Raptor Computing products" and available in Q1 2022. Likely this will be the hardware Kestrel will be based on, and while it's not clear if it will still be ECP5-based, presumably Arctic Tern will come from the factory preconfigured as Kestrels and you can reprogram them as you please for your own projects.

OpenBMC got us started, but its slow startup and heavier build requirements retarded further functional progress, and it's just not well-suited to workstations. I'm blown away by how far Kestrel has come, I hope to see future Raptor hardware with these as a competitive advantage, and I'll be first in line to get one. Watch for a review here in the near future.

A water-cooled update

Earlier we reported on Vikings' planned watercooling system for OpenPOWER. Vikings now reports that the second revision, with an improved lower-pressure mount, should be available for purchase from their store in two to four weeks. Unlike the IBM HSFs this is a low-pressure mounting mechanism, which made it both less expensive and easier to engineer, and also means a custom cooler for the higher pressures won't be necessary. (Vikings notes a short screw is used "so that it shouldn't be possible to tighten it too much.") There is no MSRP yet and no preorders currently, but it will be sold either as a full kit (fluid also available, or use your choice of appropriate fluids depending on the tubing) compatible with all existing Raptor systems, or as just the cooler/mount for those with an existing external radiator. For you crazy people trying to cram an 18-core into a Blackbird this might be your ticket, but I'm interested myself to get rid of the fan bank in my POWER9 HTPC spooling up and down; after all, the best advantage of liquid cooling is the peace and quiet. More to come when kits are available.

Ubuntu 21.10 and 20.04.3

Ubuntu 21.10 "Impish Indri" is also out, upgrading to kernel 5.13, GNOME 40 (but presumably past the teething pains in Fedora 34 which required later patching) and gcc 11. This is the last interim release before the next Ubuntu LTS, scheduled for April 2022; the current LTS is updated to 20.04.3. As usual, new installs on OpenPOWER require installing Ubuntu Server first, and then converting to Desktop.

Tonight's game on OpenPOWER: Space Cadet Pinball

I've always loved pinball even though in league play I was always pretty much bang-up average. My first experience was with a Williams Pin-Bot at the local roller rink (I can't rollerskate either) and I was hooked. In Floodgap Orbiting HQ we have a Williams Star Trek: The Next Generation which I'm doing a long-playing LED upgrade on and a Stern Sopranos.

Computer pinball, however, has been a mixed bag, largely because of the simulation fidelity necessary for good play. Nowadays you have Pinball Arcade on mobile devices and Visual Pinball on Windows, but for years the physics never really exceeded what you got in Bill Budge's 1982 Pinball Construction Set and table features were even more limited. The mid 1990s introduced probably the first generation of computer pinball games that actually played vaguely like real pinball and some real pinball tables were even ported (I played a credible if low-res version of Bally's Eight Ball Deluxe on my Mac).

Of these, one of the best known was Maxis' Full Tilt Pinball, one of whose tables was reincarnated as 3D Pinball for Windows - Space Cadet, included first with Windows Plus! for Windows 95 and then with every version of Windows afterwards (including NT 4 and Windows 2000) through Windows XP inclusive. This version was a port of the original Space Cadet table written in cross-platform C and had a slightly different ruleset. I enjoyed this version on my father's AT&T Pentium 75; later I got Full Tilt Pinball for Mac, which was a dual-version disc with Windows.

Apparently I'm not the only one that liked it because the 3D Pinball version was eventually decompiled and rewritten. This redux not only plays authentically with the assets from the Windows Plus! version, but can use the higher-res versions with Full Tilt, though the ruleset is still from the Plus! game. It uses SDL and can scale to larger screen sizes and faster frame rates.

Compilation on Fedora 34 on this Talos II was straightforward. With development headers installed for SDL2 and SDL_mixer, grab the tree (do this from tip, not version 1.1), mkdir build, cd build, cmake .. and make. Copy the resources from the game — for Full Tilt this is pretty much CADET.DAT and the SOUND folder, but for the Plus! version copy everything in the same folder as PINBALL.EXE — into the build directory (if you're using the Full Tilt version as I did, you may need to loop-mount the disc to get the Windows XA session to show up) and start with ./SpaceCadetPinball.

For best results, under Options make sure Music is checked (you'll need something that plays MIDI files), under Options, Table Resolution make sure Use Maximum Resolution is checked (if you use the Full Tilt assets, you get 1024x768, and you can enlarge the window for sizes even larger), and under Options, Graphics make sure Uncapped UPS is checked so you get all the frames.

Good luck, Cadet.

OpenBSD 7.0

OpenBSD 7.0 is available, compatible with Raptor workstations in big-endian mode as well as "expected to be" with IBM PowerNV hardware generally. New powerpc64-specific improvements include MSI-X support, a fix for page faults under recursive locking, a bump in the maximum data size to 32GB, and support for the dynamic tracer. This is on top of better GPU support, additional driver and device support, updates to OpenSMTPD, LibreSSL and OpenSSH, and lots of new port packages. You can boot OpenBSD directly from Petitboot and install over the network; download mirrors are worldwide.

Firefox 93 on POWER

Firefox 93 is out, though because of inopportune scheduling at my workplace I haven't had much time to do much of anything other than $DAYJOB for the past week or so. (Cue Bill Lumbergh.) Chief amongst its features is AVIF image support (from the AV1 codec), additional PDF forms support, blocking HTTP downloads from HTTPS sites, new DOM/CSS/HTML support (including datetime-local), and most controversially Firefox Suggest, which I personally disabled since it gets in the way. I appreciate Mozilla trying to diversify its income streams, but I'd rather we could just donate directly to the browser's development rather than generally to Mozilla.

At any rate, a slight tweak was required to the LTO-PGO patch but otherwise the browser runs and functions normally using the same .mozconfigs from Firefox 90. Once I get through the next couple weeks hopefully I'll have more free time for JIT work, but you can still help.

DAWR YOLO even with DD2.3

Way back in Linux 5.2, a "YOLO" mode was added for the DAWR, the register required for debugging with hardware watchpoints. This register functions properly on POWER8 but has an erratum on pre-DD2.3 POWER9 steppings (what Raptor sells as "v1") where the CPU will checkstop, invariably bringing the operating system to a screeching halt, if a watchpoint is set on cache-inhibited memory like device I/O. This is rare but catastrophic enough that the option to enable DAWR anyway is hidden behind a debugfs switch.

Now that I'm stressing out gdb a lot more working on the Firefox JIT, it turns out that even if you do upgrade your CPUs to DD2.3 (what Raptor sells as "v2", as I did for my dual-8 Talos II system), you don't automatically get access to the DAWR, at least on Fedora 34. Although you'll no longer be YOLOing it on such a system, you still need to echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous as root and restart your debugger to pick up hardware watchpoint support.
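If you want to check the switch from a script before blaming gdb, here is a minimal Python sketch. The debugfs path is the one given above; the helper name dawr_enabled is made up for illustration, and I'm assuming the file reads back as Y or N the way debugfs boolean files usually do. It returns None when the file can't be read at all (not root, debugfs not mounted, or not a POWER9).

```python
from pathlib import Path

# Path from the post; reading it normally requires root and a mounted debugfs.
DAWR_SWITCH = Path("/sys/kernel/debug/powerpc/dawr_enable_dangerous")


def dawr_enabled(switch=DAWR_SWITCH):
    """Return True or False for the switch state, or None if it can't be read.

    Assumption: the file reads back as "Y" or "N" (the usual debugfs
    boolean convention). Anything not starting with "Y" counts as off.
    """
    try:
        return switch.read_text().strip().upper().startswith("Y")
    except OSError:
        return None


if __name__ == "__main__":
    state = dawr_enabled()
    if state is None:
        print("Can't read the DAWR switch (not root, or not a POWER9?)")
    elif state:
        print("DAWR enabled; restart your debugger to pick it up")
    else:
        print("DAWR disabled; echo Y into the switch as root first")
```

This only reads the switch. Writing to it is still the one-line echo as root.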

Incidentally, I'm about two-thirds of the way through the wasm test cases. The MVP is little-endian POWER9 Baseline Interpreter and Wasm support, so we're getting closer and closer. You can help.

Whonix on OpenPOWER

Developer Jeremy Rand wrote in to report his functioning port of Whonix 16 to OpenPOWER. (I should point out that all links in this article are "clearnet.") Whonix is a second operating system based on Kicksecure (a Debian derivative formerly known as "Hardened Debian") that runs within VMs on your existing OS (compare with Tails). All connections within it are forced through Tor, using different paths for different applications; additionally, it uses kloak for keystroke anonymization and secure network time synchronization instead of NTP, has higher quality RNGs, and enables AppArmor and hardened kernel profiles to protect against other types of attacks.

The current release of Whonix is based on Debian bullseye and runs "native" on OpenPOWER KVM-HV using libvirt. Note that ppc64le isn't a top-tier architecture yet, so there are roadbumps: due to a bug in kernel versions prior to 5.14, currently you have to use Debian experimental for the VM, and there may be other glitches temporarily until support is mainstreamed. But if you bought an OpenPOWER workstation for its auditability and transparency, I doubt something like that's going to trip you up much. Detailed installation instructions, including Onion links if you prefer, are on the Raptor wiki.

Better x86 emulation with Live CDs

Yes, build a better emulator and the world will beat a path to your door to run their old brown x86 binaries. Right now that emulator is QEMU. Even if you run Hangover for Windows binaries, it's still QEMU underneath (and Hangover only works with 4K page kernels currently, leaving us stock Fedora ppc64le users out), and if you want to run Linux x86 or x86_64 binaries on your OpenPOWER box, it's going to be QEMU in user mode for sure.

However, one of the downers of this approach is that you also need system libraries. Hangover embeds Wine to solve this problem (and builds it natively for ppc64le to boot), but QEMU user mode needs the actual shared libraries themselves for the target architecture. This often involves laboriously copying them from foreign architecture packages and can be a slow process of trying and failing to acquire them all, and you get to do it all over again when you upgrade. Instead, just use a live CD/DVD as your library source: you can keep everything in one place (often using less space), and upgrading becomes merely a matter of downloading a new live image.

My real-world use for this is running the old brown Palm OS Emulator, which I've been playing with for retrocomputing purposes. Although the emulator source code is available, it's heavily 32-bit and I've had to make some really scary hacks to the files; I'm not sure I'll ever get it compiling on 64-bit Linux. But there is a pre-built 32-bit i386 binary. I've got a Palm m515 ROM, a death wish and too little to do after work. Let's boot this sucker up. Note that in these examples I'm "still" using QEMU 5.2.0. 6.1.0 had various problems and crashed at one point which I haven't investigated in detail. You might consider building QEMU 5.2.0 in a separate standalone directory (plus-minus juicing it) for this purpose.

We'll use the Debian live CD in this article, though any suitable live distro should do. Since POSE is i386, we'll need that particular architecture image. Download it and mount the ISO (which appears as d-live 11.0.0 gn i386 as of this writing).

The actual filesystem during normal operation is a squashfs image in the live directory. You can mount this with mount, but I use squashfuse for convenience. Similarly, while you could mount the ISO itself every time you need to do this, I just copy the squashfs image out and save a couple hundred megabytes. Then, from where you put it, make sure you have an ~/mnt folder (mkdir ~/mnt), and then:

% squashfuse debian-11-i386.squashfs ~/mnt

Let's test it on Captain Solo. After all, we've just mounted a squashfs image with a whole mess of alien binaries, so:

% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt ~/mnt/bin/uname -m

And now we can return Luke Skywalker to the Emperor:

% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt pose

Here it is, running a Palm image using an m515 ROM I copied over from my Mac.

However, uname and pose are both single binaries each in a single place. Let's pick a more complex example with resources, assets and other loadable components like a game. I happen to be a fan of the old Monolith anime-style shooter Shogo: Mobile Armor Division, which originated on Windows (GOG still sells it) but was also ported to the classic Mac OS and Linux by Hyperion. (The soundtrack CD is wonderful.) I own a boxed physical copy not only of the Windows release but also the Mac version, which is quite hard to find, and the retail Linux version is reportedly even rarer. While there have been promising recent developments with open-source versions of the LithTech engine, Shogo was the first LithTech game and apparently used a very old version which doesn't yet function. There is, however, a widely available Linux demo.

The demo which you download from there appears to just be a large i386 binary. But if you run it using the method above, you'll only get a weird error trying to run another binary from a temporary mount point. That's because it's actually an ISO image with an i386 ELF mounter in the header, so rename it to shogo.iso and mount it yourself. On my system GNOME puts it in /run/media/spectre/ISOIMAGE.

To set options before bringing up the main game, Shogo uses a custom launcher (on all platforms), but you can't just run it directly because Debian doesn't have all the libraries the launcher wants:

% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt /run/media/spectre/ISOIMAGE/shogolauncher
/run/media/spectre/ISOIMAGE/shogolauncher: error while loading shared libraries: cannot open shared object file: No such file or directory

You could try to scare up a copy of that impossibly old version of GTK, but in the Loki_Compat directory of the Shogo ISO is the desired shared object already. (Not Loki Entertainment: this Loki, a former Monolith employee.) You can't give qemu-i386 multiple -L options, but you can give environment variables to its ELF loader, so we'll just specify a custom LD_LIBRARY_PATH. For the next couple steps it will be necessary for us to actually be in the Shogo mounted image so it can find all of its data files, thusly:

% cd /run/media/spectre/ISOIMAGE
% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt -E LD_LIBRARY_PATH="/run/media/spectre/ISOIMAGE/Loki_Compat" ./shogolauncher

We've bypassed the shell script that actually handles the entire startup process, so when you select your options, instead of starting the game the launcher prints the command line it would have executed. This is convenient! To start out with, I picked a windowed 640x480 resolution using the software renderer and disabled sound (it doesn't work anyway, probably due to the age of the libraries it was developed with), got the command line and ran that through QEMU. Boom:

And, as long as you crank the detail level down to low from the main menu, it's playable!

A lot doesn't work: it doesn't save games because you're running it out of an ISO (copy it elsewhere if you want to); there is no sound, probably, as stated, due to the age of the libraries (the game itself dates to 1998 and the Linux port to 2001); and don't even think about trying to launch it using OpenGL (it bombs out with errors). There are also occasional graphics glitches and clipping problems, one of which makes it impossible to complete the level, though I don't know how much of this was their bug versus QEMU's bug.

Performance isn't revolutionary, either for POSE or for Shogo. However, keep in mind that all the system libraries are also running under emulation (only syscalls are native), and with Shogo in particular we've hobbled it even further by making the game render everything entirely in software. With that in mind, the fact the framerate is decent enough to actually play it is really rather remarkable. Moreover, I can certainly test things in POSE without much fuss and it's a lot more convenient than firing up a Mac OS 9 instance to run POSE there.

Best of all, when you're done running alien inferior binaries, just umount ~/mnt and it all goes away. When Debian 12 appears, just replace the squashfs image. Easy as pie! A much more straightforward way to run these sorts of programs when you need to.

A footnote: in an earlier article we discussed HQEMU. This was a heavily modified fork of QEMU that uses LLVM to recompile code on the fly for substantially faster speeds at the occasional cost of stability. Unfortunately it has not received further updates in several years and even after I hacked it to build again on Fedora 34, even with the pre-built LLVM 6 with which it is known to work, it simply hangs. Like I said, for now it's stock QEMU or bust.

Firefox 92 on POWER

Firefox 92 is out. Alongside some solid DOM and CSS improvements, the most interesting bug fix I noticed was a patch for open alerts slowing down other tabs in the same process. In the absence of a JIT we rely heavily on Firefox's multiprocess capabilities to make the most of our multicore beasts, and this apparently benefits (among others, but in particular) the Google sites we unfortunately have to use in these less-free times. I should note for the record that on this dual-8 Talos II (64 hardware threads) I have dom.ipc.processCount modestly increased to 12 from the default of 8 to take a little more advantage of the system when idle, which also takes down fewer tabs in the rare cases when a content process bombs out. The delay in posting this was waiting for the firefox-appmenu patches, but I decided to just build it now and add those in later. The .mozconfigs and LTO-PGO patches are unchanged from Firefox 90/91.

Meanwhile, in OpenPOWER JIT progress, I'm about halfway through getting the Wasm tests to pass, though I'm currently hung up on a memory corruption bug while testing Wasm garbage collection. It's our bug; it doesn't happen with the C++ interpreter, but unfortunately like most GC bugs it requires hitting it "just right" to find the faulty code. When it all passes, we'll pull everything up to 91ESR for the MVP, and you can try building it. If you want this to happen faster, please pitch in and help.

It's not just OMI that's the trouble with POWER10

Now that POWER10 is out, the gloves (or at least the NDA) are off. Raptor Computing had been careful not to explicitly say what about POWER10 they didn't like and considered non-free, though we note that they pointed to our (and, credit where credit's due, Hugo Landau's) article on OMI's closed firmware multiple times. After all, when even your RAM has firmware, even your RAM can get pwned.

Well, it looks like they're no longer so constrained. In a nerdily juicy Twitter thread, Raptor points out that there's something else iffy with POWER10: unlike the issue with OMI firmware, which is not intrinsically part of the processor (the missing piece is the on-DIMM memory controller), this additional concern is the firmware for the on-chip "PPE I/O processor." It's 16 kilowords of binary blob. The source code isn't available.

It's not clear what this component does exactly, either. The commit messages, such as they are, make reference to a Synopsys part, so my guess is it manages the PCIe bus. Although PPE would imply a Power Processing Element (a la Cell or Xenon), the firmware code does not obviously look like Power ISA instructions at first glance.

In any case, Raptor's concern is justified: on POWER9, you can audit everything, but on POWER10, you have to trust the firmware blobs for RAM and I/O. That's an unacceptable step down in transparency for OpenPOWER, and one we hope IBM rectifies pronto. Please release the source.

First POWER10 machine announced

IBM turns up the volume to 10 (and their server numbers to four digits) with the Power E1080 server, the launch system for POWER10. POWER10 is a 7nm chip fabbed by Samsung with up to 15 SMT-8 cores (a 16th core is disabled for yield) for up to 120 threads per chip. IBM bills POWER10 as having 2.5 times more performance per core than Intel Xeon Platinum (based on an HPE Superdome system running Xeon Platinum 8380H parts), 2.5 times the AES crypto performance per core of POWER9 (no doubt due to quadruple the crypto engines present), five times "AI inferencing per socket" (whatever that means) over Power E980 via the POWER10's matrix math and AI accelerators, and 33% less power usage than the E980 for the same workload. AIX, Linux and IBM i are all supported.

IBM targets its launch hardware at its big institutional customers, and true to form the E1080 can scale up to four nodes, each with four processors, for a capacity of 240 cores (that's 1,920 hardware threads for those of you keeping score at home). The datasheet lists 10-, 12- and 15-core parts as available, with asymmetric 48/32K L1 and 2MB of L2 cache per core. Chips are divided into two hemispheres (the 15-core version has 7- and 8-core hemispheres), each sharing a pool of 8MB L3 cache per core, so the largest 15-core part has 120MB of L3 cache split into shared 64MB and 56MB pools respectively. This is somewhat different from POWER9, which divvies up L3 per two-core slice (but recall that the lowest binned 4- and 8-core parts, like the ones in most Raptor systems, fuse off the other cores in a slice such that each active core gets the L3 all to itself). Compared with Telum's virtual L3 approach, POWER10's cache strategy seems like an interim step to what we suspect POWER11 might have.
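For those keeping score at home with a Python prompt instead, here is the arithmetic behind those figures as a small sketch. The constants are just the numbers quoted above; the variable names are mine.

```python
# Figures quoted in the post for the largest Power E1080 configuration.
NODES = 4
SOCKETS_PER_NODE = 4
CORES_PER_CHIP = 15       # a 16th core is disabled for yield
THREADS_PER_CORE = 8      # SMT-8
L3_MB_PER_CORE = 8
HEMISPHERES = (8, 7)      # core split on the 15-core part

sockets = NODES * SOCKETS_PER_NODE
total_cores = sockets * CORES_PER_CHIP
total_threads = total_cores * THREADS_PER_CORE
threads_per_chip = CORES_PER_CHIP * THREADS_PER_CORE

# Each hemisphere shares one L3 pool sized at 8MB per core in it.
l3_pools_mb = [cores * L3_MB_PER_CORE for cores in HEMISPHERES]
l3_per_chip_mb = sum(l3_pools_mb)

print(f"{sockets} sockets, {total_cores} cores, {total_threads} threads")
print(f"{threads_per_chip} threads per chip")
print(f"L3 pools per chip: {l3_pools_mb} MB, {l3_per_chip_mb} MB total")
```

That works out to 16 sockets, 240 cores and 1,920 threads, with 64MB and 56MB L3 pools adding up to 120MB per chip.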

I/O doesn't disappoint, as you would expect. Each node has 8 PCIe Gen5 slots on board and can add up to four expansion drawers, each adding an additional twelve slots. You do the math for a full four-node behemoth.
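If you'd rather not do the math by hand, here's a back-of-the-envelope sketch. It naively assumes that attaching a drawer doesn't cost you any of the on-board slots, which may not hold on real hardware (drawers have to be cabled to something), so treat the result as an upper bound and check the datasheet. The function name and constants are only for illustration.

```python
ONBOARD_SLOTS_PER_NODE = 8    # PCIe Gen5 slots in each node
MAX_DRAWERS_PER_NODE = 4      # expansion drawers per node
SLOTS_PER_DRAWER = 12


def pcie_slots(nodes, drawers_per_node=MAX_DRAWERS_PER_NODE):
    """Naive slot count: on-board slots plus drawer slots, per node.

    Assumption: drawers don't consume any on-board slots.
    """
    if not 0 <= drawers_per_node <= MAX_DRAWERS_PER_NODE:
        raise ValueError("a node takes between 0 and 4 drawers")
    per_node = ONBOARD_SLOTS_PER_NODE + drawers_per_node * SLOTS_PER_DRAWER
    return nodes * per_node


print(pcie_slots(nodes=4))
```

Under that assumption a fully loaded node comes to 56 slots, and the four-node behemoth to 224.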

However, memory and especially OMI is what we've been watching most closely with POWER10 because OMI DIMMs have closed-source firmware. Unlike the DDIMMs announced at the 2019 OpenPOWER Summit, the E1080 datasheet specifies buffered DDR4 CDIMMs. This appears to be simply a different form factor; the datasheet intro blurb indicates they are also OMI-based. Each 4-processor node can hold 16TB of RAM for 64TB in the largest 16-socket configuration. IBM lists no directly-attached RAM option currently.

IBM is taking orders now and shipments are expected to begin before the end of September. Now that POWER10 is actually a physical product, let's hope there's news on the horizon about a truly open Open Memory Interface in the meantime. Just keep in mind that if you have to ask how much this machine costs you clearly can't afford it, and IBM doesn't do retail sales anyway.

Cache splash in Telum means seventh heaven for POWER11?

AnandTech has a great analysis of IBM's new z/Architecture mainframe processor Telum, the successor to z15 (so you could consider it the "z16" if you like) scheduled for 2022. The most noteworthy part of that article is Telum's unusual approach to cache.

Most conventional CPUs (keeping in mind mainframes are hardly conventional, at least in terms of system design), including OpenPOWER chips, have multiple levels of cache; so did z15. L1 cache (divided into instruction and data) is private to the core and closest to it, usually measured in double-digit kilobytes on contemporary designs. It then fans out into L2, which is also usually private to an individual core and in the triple-digit kilobyte range, and then some level of L3 (plus even L4) cache which is often shared by an entire processor and measured in megabytes. Cache size and how cache entries may be placed (i.e., associativity) are a tradeoff between the latency of searching a larger cache, die space considerations and power usage, versus the performance advantages of fewer cache misses and reduced use of slower peripheral memory.

While every design has some amount of L1, there certainly have been processors that dispensed with other tiers of cache. Most of Hewlett-Packard's late lamented PA-RISC architecture had no L2 cache at all, with the L1 cache being unusually large in some units (the 1997 PA-8200 had 4MB of total L1, 2MB each for data and instructions). Closer to home, the PowerPC 970 "G5" (derived from the POWER4) carried no L3; the 2005 dual-core 970MP, used in the Power Mac G5 Quad, IBM POWER 185 and YDL PowerStation, instead had 1MB of L2 per core which was on the large side for that era. Conversely, the Intel Itanium 2 could have up to 64MB of L4 cache; Haswell CPUs with GT3e Iris Pro Graphics can use the integrated GPU's eDRAM as a L3 victim cache for the same purpose as an L4, though this feature was removed in Skylake. However, the Sforza POWER9 in Raptor workstations is more typical of modern chips with three levels of cache: the dual-8 02CY649 in this machine I'm typing on has 32/32KB L1, 512KB L2 and 10MB L3 for each of the eight CPU cores. In contrast, AMD Zen 3 uses a shared 32MB L3 between up to eight cores, with fewer cores splitting the pot in more upmarket parts.

With money and power consumption being less or little object in mainframes, however, large multi-level caches rule the day directly. The IBM z15 processor "drawer" (there are five drawers in a typical system) divides itself into four Compute Processors, each CP containing 12 cores with 128/128K L1 (compare to Apple M1 with 192/192K) and split 4MB/4MB L2 per core paired with 256MB of shared L3, overseen by a single System Controller which provides a whopping 960MB of shared L4. This gives it the kind of throughput and redundancy expected by IBM's large institutional customers who depend on transaction processing reliability. The SC services the four CPs almost like an old-school northbridge, but to L4 cache instead of main RAM.

Telum could have doubled down on this the way z15 literally doubled down on z14 (twice the L3, nearly half again as much L4), but instead it dispenses with L3 and L4 altogether. L1 jumps to 256/256K, and in shades of PA-RISC L2 balloons to 32MB per core, with eight cores per chip. Let's zoom in on the die.

The 7nm 530mm² die shows the L2 cache in the centre of the eight cores, which is already a tipoff as to how IBM's arranged it: cores can reach into other cores' cache. If a cache line gets evicted from a core's L2 and there is space for it in another core's L2, then the cache line goes to that core's L2 and is marked as L3. This process is not free and does incur more latency than a traditional L3 when an L3 line stored elsewhere must be retrieved, but the ample L2 makes this condition less frequent, and in the less common case where a core needs a line that some other core has already evicted into its L2 as L3, it can just adopt it. Overall, this strategy means better utilization of cache that adapts better to more diverse workloads because the large total L2 space can be flexibly redirected as "virtual L3" to cores with greater bandwidth demands.
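To make the eviction dance a little more concrete, here is a toy Python model of the idea as described above. It is emphatically not IBM's implementation: the class and method names are invented, each L2 is a simple LRU list of whole lines, a guest line that gets evicted again just falls back to memory, and there is no coherence protocol or latency modelling at all.

```python
from collections import OrderedDict


class ToyChip:
    """Per-core L2s that lend spare room to each other as 'virtual L3'."""

    def __init__(self, cores, lines_per_l2):
        self.capacity = lines_per_l2
        # One LRU-ordered dict per core: address -> "L2" (own) or "L3" (guest).
        self.l2 = [OrderedDict() for _ in range(cores)]

    def _evict_from(self, core):
        victim, tag = self.l2[core].popitem(last=False)  # least recently used
        if tag == "L3":
            return  # an evicted guest line simply goes back to memory
        for other, cache in enumerate(self.l2):
            if other != core and victim not in cache and len(cache) < self.capacity:
                cache[victim] = "L3"  # park it in a neighbour, marked as L3
                return
        # No neighbour had room, so the line leaves the chip.

    def _install(self, core, addr):
        if len(self.l2[core]) >= self.capacity:
            self._evict_from(core)
        self.l2[core][addr] = "L2"

    def access(self, core, addr):
        """Return where the line was found for this core's access."""
        own = self.l2[core]
        if addr in own:
            # A guest line we happen to need is adopted as our own L2 line.
            hit = "L2" if own[addr] == "L2" else "adopted"
            own[addr] = "L2"
            own.move_to_end(addr)
            return hit
        for other, cache in enumerate(self.l2):
            if other != core and cache.get(addr) == "L3":
                del cache[addr]  # pull the line back from the neighbour
                self._install(core, addr)
                return "virtual L3"
        self._install(core, addr)
        return "memory"


demo = ToyChip(cores=2, lines_per_l2=2)
for core, addr in [(0, "a"), (0, "b"), (0, "c"), (0, "a"), (1, "b")]:
    print(core, addr, demo.access(core, addr))
```

In the demo, core 0 overflows its two-line L2, so line a gets parked in idle core 1 and is later found again as a virtual L3 hit rather than going back to memory. Fetching it pushes b over to core 1, which then adopts it when core 1 asks for b itself.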

It doesn't stop there, though, because Telum has another trick for "virtual L4." Recall that the z15 uses five drawers in a typical system; each drawer has an SC that maintains the L4 cache. Telum is two chips to a package, with four packages to a unit (the equivalent of a z15 "drawer") and four units to a system. If you can reach into other cores' L2 to use them as L3, then it's a simple conceptual leap to reach into other chips (even in different units) and use their L2 as L4. Again, latency jumps over a more traditional L4 approach, but this means theoretically a typical Telum system has a total of 8GB that could be redirected as L4 (7936MB, if you don't count the L2 on the requesting core's own chip). With 256 cores in this system, there's bound to be room somewhere faster than main memory.
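Here is where those totals come from, sketched in Python using only the figures above (the variable names are mine). The 7936MB number falls out when you exclude the 256MB of L2 on the requesting core's own chip, since that chip's cache is already serving as the core's own L2 plus its virtual L3.

```python
CHIPS_PER_PACKAGE = 2
PACKAGES_PER_UNIT = 4
UNITS_PER_SYSTEM = 4
CORES_PER_CHIP = 8
L2_MB_PER_CORE = 32

chips = CHIPS_PER_PACKAGE * PACKAGES_PER_UNIT * UNITS_PER_SYSTEM
cores = chips * CORES_PER_CHIP
total_l2_mb = cores * L2_MB_PER_CORE

# From one core's point of view:
own_chip_l2_mb = CORES_PER_CHIP * L2_MB_PER_CORE
virtual_l3_mb = own_chip_l2_mb - L2_MB_PER_CORE   # the other cores on its chip
virtual_l4_mb = total_l2_mb - own_chip_l2_mb      # every other chip in the system

print(f"{chips} chips, {cores} cores, {total_l2_mb} MB of L2 in total")
print(f"virtual L3: {virtual_l3_mb} MB, virtual L4: {virtual_l4_mb} MB")
```

So a single core sees 32MB of its own L2, up to 224MB of on-chip virtual L3 and up to 7936MB of off-chip virtual L4, out of 8192MB (8GB) across the system.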

What makes this interesting for OpenPOWER is that z/Architecture and POWER naturally tend to cross-pollinate. (History favours POWER, too. POWER chips already took over IBM i first with the RS64-based A35 and finally with the eCLipz project; IBM AS/400 a/k/a i5/OS a/k/a i hardware used to be its own bespoke AS/400 architecture.) z/Architecture is decidedly not Power ISA but some microarchitectural features are sometimes shared, such as POWER6 and z10, which emerged from a common development process and as a result had similar fabrication technologies, execution units, floating-point units, busses and pipelines.

POWER10 is almost certainly already taped out if IBM is going to be anywhere close to a Q4 2021 release, so whatever influence Telum had on its creation has already happened. But Telum at the microarchitecture level sure looks more like POWER than z15 did: there is no more CP/SC division but rather general purpose cores in a NUMA topology more like POWER9, more typical PCIe controllers (in this case PCIe 5.0) for I/O and more reliance on specialized pack-in accelerators (Telum's headline feature is an AI accelerator for SIMD, matrix math and fast activation function computation; no doubt some of its design started with POWER10's own accelerator). Frankly, that reads like a recipe for POWER11. While a dual-CPU POWER11 workstation might not have much need for L4, the "virtual L3" strategy could really pay off for the variety of workloads workstations and non-mainframe servers have to do, and on a four or eight-socket server, the availability of virtual L4 starts outweighing any disadvantage in latency.

The commonalities should not be overstated, as Telum is also "only" SMT-2 (versus SMT-4 or SMT-8 for POWER9 and POWER10) and the deep 5GHz-plus pipeline the reduced SMT count facilitates doesn't match up with the shorter pipeline and lower clockspeeds on current POWER generations. But that's just part of the chips being customized for their respective markets, and if IBM can pull this trick off for z/Architecture it's a short jump to making the technology work on POWER. Assuming we don't have OMI to worry about by then, that could really be something to look forward to in future processor generations, and a genuinely unique advance for the architecture.

Kernel 5.14

Version 5.14 of the Linux kernel has landed. There's not much in PowerPC land this time around except for a few bug fixes, one of which repairs an issue that can hit certain hashtable-based CPUs (though I don't believe the POWER9 in HPTE mode is known to be affected). However, there are some privacy-related features, including memfd_secret(), which creates a tract of memory even a compromised kernel can't look into; a new ioctl for ext4 filesystems to prevent information leaks; and of course core-based scheduling, which allows restrictions on what processes may share cores as extra insurance against Spectre-type attacks (at the cost of less effective utilization, so this is largely of more interest to hosting providers than for what you run on your own box). Other new features of note include burstable "Completely Fair Scheduling" to allow a task group to roll over unused CPU quota under certain conditions, a cgroup "kill button" feature and some initial infrastructure for supporting signed BPF programs. Expect this version to appear in Fedora and other "leading edge" distributions soon.
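
If you want to know whether your distribution has caught up yet, a small shell sketch like this one can compare the running kernel against 5.14. The function name is made up for illustration, and it assumes a sort that supports -V (as GNU coreutils does).

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeeds if version HAVE is >= version WANT.
# Assumes "sort -V" (version sort) is available, as in GNU coreutils.
version_at_least() {
    have=$1
    want=$2
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Drop any distribution suffix after the first hyphen before comparing.
running=$(uname -r | cut -d- -f1)
if version_at_least "$running" 5.14; then
    echo "kernel $running is 5.14 or later"
else
    echo "kernel $running is older than 5.14"
fi
```

A new enough version number only tells you the code is present; whether a given feature is compiled in or enabled is up to your distribution's kernel configuration.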

OpenPOWER Firefox JIT update

As of this afternoon, the Baseline Interpreter-only form of the OpenPOWER JIT (64-bit little-endian) now passes all of the JIT tests except for the Wasm ones, which are being actively worked on. Remember, this is just the first of the three phases and we need all three for the full benefit, but it already yields a noticeable boost in my internal tests over the C++ interpreter. The MVP is Baseline Interpreter and Wasm, so once it passes the Wasm tests as well, it's time to pull it current with 91ESR. You can help.

Debian 11

Debian 11 bullseye is officially released, the latest stable version and the "other white meat" of the two big distros I suspect are commonly used on OpenPOWER workstations (Fedora being the other, and Ubuntu third). Little-endian 64-bit Power ISA (ppc64el) has been a supported architecture for Debian since 8 jessie. The updates are conservative but important, which is what you're looking for if you run Debian stable: kernel 5.10, GNOME 3.38, KDE Plasma 5.20, LXDE 11, LXQt 0.16, MATE 1.24, and Xfce 4.16, plus gcc 10.2 and LLVM 9 (with Clang 11). ISOs are already available on the mirrors. If you've updated, post your impressions in the comments.

Firefox 91 on POWER fur the fowk

Firefox 91 is out. Yes, it further improves cookie isolation and cleanup, has faster paint scheduling (noticeably, in some cases), and new JavaScript and DOM support. But for my money, the biggest news is the Scots support: aye, laddie, noo ye kin stravaig the wab lik Robert Burns did. We've waited tae lang fur this.

Anyway, Firefox 91 builds oot o the kist oa, er, Firefox 91 builds out of the box on OpenPOWER using the same .mozconfigs for Firefox 90; I made a wee change to the PGO-LTO patch since I messed up the diff the last time and didn't notice. The crypto issues in Fx90 are fixed in this release.

Meanwhile, the OpenPOWER JIT is now passing all but a handful of the basic tests in Baseline Interpreter mode, and some amount of Wasm, though this isn't nearly as far along. Ye kin hulp.

Tonight's game on OpenPOWER: System Shock Enhanced Edition

Yeah, I know we're doing a lot of FPSes in this series. It's what I tend to play, so deal. Tonight we'll be playing System Shock, the classic hacker-shooter (seems appropriate), courtesy of Shockolate, which adds higher resolutions, better controls, mouselook and OpenGL support. Our drug dealers at GoG, who don't pay us a cent for this kind of shameless plug and really ought to, make the game files easily available as System Shock Enhanced Edition. However, you can also use the DOS or Windows 95 CD-ROM; I tested with both. (I'll talk about the Macintosh release in a moment.)

Shockolate requires CMake and SDL2, and FluidSynth is strongly advised. Don't let Shockolate build with its bundled versions: edit CMakeLists.txt and change every library set to "BUNDLED" to "ON" instead (don't forget the quote marks), so that the build uses your system's copies. Once set, building should work out of the box (tested on Fedora 34):

mkdir build
cd build
cmake ..
make -j24 # or as you like
cd ..
ln -s build/systemshock systemshock

(The last command is to make running the binary a little more convenient.)
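
The CMakeLists.txt edit mentioned before the build commands can also be scripted rather than done by hand. This is only a sketch: the function name is made up, and it assumes the only quoted "BUNDLED" strings in the file are the library settings you want to change, so look over the result before building.

```shell
#!/bin/sh
# use_system_libs FILE: rewrite every quoted "BUNDLED" value in FILE to "ON",
# keeping the untouched original alongside it as FILE.orig.
use_system_libs() {
    file=$1
    cp "$file" "$file.orig" || return 1
    # Matching the quote marks leaves any unquoted mention of BUNDLED alone.
    sed 's/"BUNDLED"/"ON"/g' "$file.orig" > "$file"
}
```

From the Shockolate source directory, run use_system_libs CMakeLists.txt before creating the build directory, then compare against CMakeLists.txt.orig with diff to confirm only the intended lines changed. Starting from an empty build directory avoids picking up any stale settings from an earlier cmake run.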

Now we need to provide the resources. For FluidSynth, you'll need a soundfont (I used the default that comes with Fedora's package). If you have the DOS/Windows CD-ROM, insert it now. We will assume it is mounted at /run/media/censored/EA.

mkdir res
cd res
ln -s /usr/share/soundfonts/default.sf2 default.sf2
cp -R /run/media/censored/EA/hd/data .
cp -R /run/media/censored/EA/hd/sound .
chmod -R +w . # if copying from CD makes things read only
cd data
rm -f intro.res
rm -f objprop.dat
cp /run/media/censored/EA/cdrom/data/* .
cd ../..

Then start the game with ./systemshock. The resolutions and choice of renderer (software or OpenGL) are set from the in-gameplay menu (press ESC). Shockolate also implements WASD motion (as well as the classic arrow keys) and F to toggle mouselook. Note that OpenGL is somewhat darker than software mode. It's not clear if this is actually a bug.

Playing System Shock Enhanced Edition in Shockolate is just a more convenient way to get the DOS assets since Shockolate just uses those and not any of the patches (more about this in a second); gameplay and features are the same. Also, GoG only distributes it as a Windows installer and the file structure is a bit different. Use innoextract to break the installer EXE apart into a separate directory and delete everything but sshock.kpf, which is a cloaked ZIP archive containing the game assets. In your Shockolate source directory (note that this also creates res/, so if you did the steps above delete it first),

mkdir ssee
cd ssee
unzip /path/to/sshock.kpf
cd ..
mkdir res
mv ssee/res/pc/hd/data res
cp ssee/res/pc/cdrom/data/* res/data/
mv ssee/res/pc/hd/sound res
rm -rf ssee # if you want
ln -s /usr/share/soundfonts/default.sf2 res/default.sf2

Then start the game with ./systemshock.
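
If the game complains at startup, it may be worth confirming that res/ ended up with everything the steps above put there. Here is a minimal sketch with a made-up function name; it only checks for the pieces these instructions create (data, sound and the soundfont link) and is not a statement of exactly what Shockolate itself requires.

```shell
#!/bin/sh
# check_res DIR: report whether DIR (default: res) contains the data and sound
# directories and a usable default.sf2, as set up by the steps in this post.
check_res() {
    res=${1:-res}
    status=0
    for dir in data sound; do
        if [ ! -d "$res/$dir" ]; then
            echo "missing directory: $res/$dir"
            status=1
        fi
    done
    # -e follows symlinks, so a dangling soundfont link counts as missing.
    if [ ! -e "$res/default.sf2" ]; then
        echo "missing or dangling soundfont: $res/default.sf2"
        status=1
    fi
    return $status
}
```

Something like check_res res && ./systemshock from the Shockolate source directory would then only launch the game if all three are present.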

Oddly, although Shockolate was based on the (IMHO) superior Power Mac release, it doesn't seem to properly support its higher-resolution assets (SSEE does and includes a converted set, but the source for that, unlike Strife, isn't currently available). I actually own this version also. One rather unique reason to own it is that the cutscenes and audio files are all playable in QuickTime, so if you don't feel like slogging through the entire game you can just listen to the audio logs or go straight to the ending using a Mac emulator. However, you need to do a little song and dance to mount the HFS volume on Linux (as root):

losetup /dev/loop0 /dev/sr0 # or where your drive is
partx -av /dev/loop0

This will respond with something like

partition: none, disk: /dev/loop0, lower: 0, upper: 0
/dev/loop0: partition table type 'mac' detected
range recount: max partno=2, lower=0, upper=0
/dev/loop0: partition #1 added
/dev/loop0: partition #2 added

and you should see it mount in your desktop environment (note that many applications won't understand the resource fork). Do losetup -D before ejecting the physical disc; note that -D detaches every loop device, so use losetup -d /dev/loop0 instead if you have others in use. As a parenthetical note, since SSEE is presumably derived from the GPL-released Mac source code, you would think it, too, would be GPL. But I'm uncertain of the exact history there.

Salt the fries.