Showing posts from September, 2021

DAWR YOLO even with DD2.3

Way back in Linux 5.2 was a "YOLO" mode for the DAWR register required for debugging with hardware watchpoints. This register functions properly on POWER8 but has an erratum on pre-DD2.3 POWER9 steppings (what Raptor sells as "v1") where the CPU will checkstop — invariably bringing the operating system to a screeching halt — if a watchpoint is set on cache-inhibited memory like device I/O. This is rare but catastrophic enough that the option to enable DAWR anyway is hidden behind a debugfs switch.

Now that I'm stressing out gdb a lot more working on the Firefox JIT, it turns out that even if you do upgrade your CPUs to DD2.3 (as I did for my dual-8 Talos II system, or what Raptor sells as "v2"), you don't automatically get access to the DAWR even on a fixed POWER9 (Fedora 34). Although you'll no longer be YOLOing it on such a system, still remember to echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous as root and restart your debugger to pick up hardware watchpoint support.

Incidentally, I'm about two-thirds of the way through the wasm test cases. The MVP is little-endian POWER9 Baseline Interpreter and Wasm support, so we're getting closer and closer. You can help.

Whonix on OpenPOWER

Developer Jeremy Rand wrote in to report his functioning port of Whonix 16 to OpenPOWER. (I should point out that all links in this article are "clearnet.") Whonix is a second operating system based on Kicksecure (a Debian derivative formerly known as "Hardened Debian") that runs within VMs on your existing OS (compare with Tails). All connections within it are forced through Tor, using different paths for different applications; additionally, it uses kloak for keystroke anonymization and secure network time synchronization instead of NTP, has higher quality RNGs, and enables AppArmor and hardened kernel profiles to prevent against other types of attacks.

The current release of Whonix is based on Debian bullseye and runs "native" on OpenPOWER KVM-HV using libvirt. Note that ppc64le isn't a top-tier architecture yet, so there are roadbumps: due to a bug in kernel versions prior to 5.14, currently you have to use Debian experimental for the VM, and there may be other glitches temporarily until support is mainstreamed. But if you bought an OpenPOWER workstation for its auditability and transparency, I doubt something like that's going to trip you up much. Detailed installation instructions, including Onion links if you prefer, are on the Raptor wiki.

Better x86 emulation with Live CDs

Yes, build a better emulator and the world will beat a path to your door to run their old brown x86 binaries. Right now that emulator is QEMU. Even if you run Hangover for Windows binaries, it's still QEMU underneath (and Hangover only works with 4K page kernels currently, leaving us stock Fedora ppc64le users out), and if you want to run Linux x86 or x86_64 binaries on your OpenPOWER box, it's going to be QEMU in user mode for sure.

However, one of the downers of this approach is that you also need system libraries. Hangover embeds Wine to solve this problem (and builds them natively for ppc64le to boot), but QEMU user mode needs the actual shared libraries themselves for the target architecture. This often involves labouriously copying them from foreign architecture packages and can be a slow process of trying and failing to acquire them all, and you get to do it all over again when you upgrade. Instead, just use a live CD/DVD as your library source: you can keep everything in one place (often using less space), and upgrading becomes merely a matter of downloading a new live image.

My real-world use for this is running the old brown Palm OS Emulator, which I've been playing with for retrocomputing purposes. Although the emulator source code is available, it's heavily 32-bit and I've had to make some really scary hacks to the files; I'm not sure I'll ever get it compiling on 64-bit Linux. But there is a pre-built 32-bit i386 binary. I've got a Palm m515 ROM, a death wish and too little to do after work. Let's boot this sucker up. Note that in these examples I'm "still" using QEMU 5.2.0. 6.1.0 had various problems and crashed at one point which I haven't investigated in detail. You might consider building QEMU 5.2.0 in a separate standalone directory (plus-minus juicing it) for this purpose.

We'll use the Debian live CD in this article, though any suitable live distro should do. Since POSE is i386, we'll need that particular architecture image. Download it and mount the ISO (which appears as d-live 11.0.0 gn i386 as of this writing).

The actual filesystem during normal operation is a squashfs image in the live directory. You can mount this with mount, but I use squashfuse for convenience. Similarly, while you could mount the ISO itself every time you need to do this, I just copy the squashfs image out and save a couple hundred megabytes. Then, from where you put it, make sure you have an ~/mnt folder (mkdir ~/mnt), and then: squashfuse debian-11-i386.squashfs ~/mnt

Let's test it on Captain Solo. After all, we've just mounted a squashfs image with a whole mess of alien binaries, so:

% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt ~/mnt/bin/uname -m

And now we can return Luke Skywalker to the Emperor: ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt pose

Here it is, running a Palm image using an m515 ROM I copied over from my Mac.

However, uname and pose are both single binaries each in a single place. Let's pick a more complex example with resources, assets and other loadable components like a game. I happen to be a fan of the old Monolith anime-style shooter Shogo: Mobile Armor Division, which originated on Windows (GOG still sells it) but was also ported to the classic Mac OS and Linux by Hyperion. (The soundtrack CD is wonderful.) I own a boxed physical copy not only of the Windows release but also the Mac version, which is quite hard to find, and the retail Linux version is reportedly even rarer. While there have been promising recent developments with open-source versions of the LithTech engine, Shogo was the first LithTech game and apparently used a very old version which doesn't yet function. There is, however, a widely available Linux demo.

The demo which you download from there appears to just be a large i386 binary. But if you run it using the method above, you'll only get a weird error trying to run another binary from a temporary mount point. That's because it's actually an ISO image with an i386 ELF mounter in the header, so rename it to shogo.iso and mount it yourself. On my system GNOME puts it in /run/user/spectre/ISOIMAGE.

To set options before bringing up the main game, Shogo uses a custom launcher (on all platforms), but you can't just run it directly because Debian doesn't have all the libraries the launcher wants:

% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt /run/media/spectre/ISOIMAGE/shogolauncher
/run/media/spectre/ISOIMAGE/shogolauncher: error while loading shared libraries: cannot open shared object file: No such file or directory

You could try to scare up a copy of that impossibly old version of GTK, but in the Loki_Compat directory of the Shogo ISO is the desired shared object already. (Not Loki Entertainment: this Loki, a former Monolith employee.) You can't give qemu-i386 multiple -L options, but you can give environment variables to its ELF loader, so we'll just specify a custom LD_LIBRARY_PATH. For the next couple steps it will be necessary for us to actually be in the Shogo mounted image so it can find all of its data files, thusly:

% cd /run/media/spectre/ISOIMAGE
% ~/src/qemu-5.2.0/build/qemu-i386 -L ~/mnt -E LD_LIBRARY_PATH="/run/media/spectre/ISOIMAGE/Loki_Compat" ./shogolauncher

We've bypassed the shell script that actually handles the entire startup process, so when you select your options, instead of starting the game it will dump a command line to execute to the screen. This is convenient! To start out with, I picked a windowed 640x480 resolution using the software renderer and disabled sound (it doesn't work anyway, probably due to the age of the libraries it was developed with), got the command line and ran that through QEMU. Boom:
And, as long as you crank the detail level down to low from the main menu, it's playable!
A lot doesn't work: it doesn't save games because you're running it out of an ISO (copy it elsewhere if you want to); there is no sound, probably, as stated, due to the age of the libraries (the game itself dates to 1998 and the Linux port to 2001); and don't even think about trying to launch it using OpenGL (it bombs out with errors). There are also occasional graphics glitches and clipping problems, one of which makes it impossible to complete the level, though I don't know how much of this was their bug versus QEMU's bug.

Performance isn't revolutionary, either for POSE or for Shogo. However, keep in mind that all the system libraries are also running under emulation (only syscalls are native), and with Shogo in particular we've hobbled it even further by making the game render everything entirely in software. With that in mind, the fact the framerate is decent enough to actually play it is really rather remarkable. Moreover, I can certainly test things in POSE without much fuss and it's a lot more convenient than firing up a Mac OS 9 instance to run POSE there.

Best of all, when you're done running alien inferior binaries, just umount ~/mnt and it all goes away. When Debian 12 appears, just replace the squashfs image. Easy as pie! A much more straightforward way to run these sorts of programs when you need to.

A footnote: in an earlier article we discussed HQEMU. This was a heavily modified fork of QEMU that uses LLVM to recompile code on the fly for substantially faster speeds at the occasional cost of stability. Unfortunately it has not received further updates in several years and even after I hacked it to build again on Fedora 34, even with the pre-built LLVM 6 with which it is known to work, it simply hangs. Like I said, for now it's stock QEMU or bust.

Firefox 92 on POWER

Firefox 92 is out. Alongside some solid DOM and CSS improvements, the most interesting bug fix I noticed was a patch for open alerts slowing down other tabs in the same process. In the absence of a JIT we rely heavily on Firefox's multiprocessor capabilities to make the most of our multicore beasts, and this apparently benefits (among others, but in particular) the Google sites we unfortunately have to use in these less-free times. I should note for the record that on this dual-8 Talos II (64 hardware threads) I have dom.ipc.processCount modestly increased to 12 from the default of 8 to take a little more advantage of the system when idle, which also takes down fewer tabs in the rare cases when a content process bombs out. The delay in posting this was waiting for the firefox-appmenu patches, but I decided to just build it now and add those in later. The .mozconfigs and LTO-PGO patches are unchanged from Firefox 90/91.

Meanwhile, in OpenPOWER JIT progress, I'm about halfway through getting the Wasm tests to pass, though I'm currently hung up on a memory corruption bug while testing Wasm garbage collection. It's our bug; it doesn't happen with the C++ interpreter, but unfortunately like most GC bugs it requires hitting it "just right" to find the faulty code. When it all passes, we'll pull everything up to 91ESR for the MVP, and you can try building it. If you want this to happen faster, please pitch in and help.

It's not just OMI that's the trouble with POWER10

Now that POWER10 is out, the gloves (or at least the NDA) are off. Raptor Computing had been careful not to explicitly say what about POWER10 they didn't like and considered non-free, though we note that they pointed to our (and, credit where credit's due, Hugo Landau's) article on OMI's closed firmware multiple times. After all, when even your RAM has firmware, even your RAM can get pwned.

Well, it looks like they're no longer so constrained. In a nerdily juicy Twitter thread, Raptor points out that there's something else iffy with POWER10: unlike the issue with OMI firmware, which is not intrinsically part of the processor (the missing piece is the on-DIMM memory controller), this additional concern is the firmware for the on-chip "PPE I/O processor." It's 16 kilowords of binary blob. The source code isn't available.

It's not clear what this component does exactly, either. The commit messages, such as they are, make reference to a Synopsys part, so my guess is it manages the PCIe bus. Although PPE would imply a Power Processing Element (a la Cell or Xenon), the firmware code does not obviously look like Power ISA instructions at first glance.

In any case, Raptor's concern is justified: on POWER9, you can audit everything, but on POWER10, you have to trust the firmware blobs for RAM and I/O. That's an unacceptable step down in transparency for OpenPOWER, and one we hope IBM rectifies pronto. Please release the source.

First POWER10 machine announced

IBM turns up the volume to 10 (and their server numbers to four digits) with the Power E1080 server, the launch system for POWER10. POWER10 is a 7nm chip fabbed by Samsung with up to 15 SMT-8 cores (a 16th core is disabled for yield) for up to 120 threads per chip. IBM bills POWER10 as having 2.5 times more performance per core than Intel Xeon Platinum (based on an HPE Superdome system running Xeon Platinum 8380H parts), 2.5 times the AES crypto performance per core of POWER9 (no doubt due to quadruple the crypto engines present), five times "AI inferencing per socket" (whatever that means) over Power E980 via the POWER10's matrix math and AI accelerators, and 33% less power usage than the E980 for the same workload. AIX, Linux and IBM i are all supported.

IBM targets its launch hardware at its big institutional customers, and true to form the E1080 can scale up to four nodes, each with four processors, for a capacity of 240 cores (that's 1,920 hardware threads for those of you keeping score at home). The datasheet lists 10, 12 and 15 core parts as available, with asymmetric 48/32K L1 and 2MB of L2 cache per core. Chips are divided into two hemispheres (the 15-core version has 7 and 8 core hemispheres) sharing a pool of 8MB L3 cache per core per side, so the largest 15 core part has 120MB of L3 cache split into shared 64MB and 56MB pools respectively. This is somewhat different from POWER9 which divvys up L3 per two-core slice (but recall that the lowest binned 4- and 8-core parts, like the ones in most Raptor systems, fuse off the other cores in a slice such that each active core gets the L3 all to itself). Compared with Telum's virtual L3 approach, POWER10's cache strategy seems like an interim step to what we suspect POWER11 might have.

I/O doesn't disappoint, as you would expect. Each node has 8 PCIe Gen5 slots on board and can add up to four expansion drawers, each adding an additional twelve slots. You do the math for a full four-node behemoth.

However, memory and especially OMI is what we've been watching most closely with POWER10 because OMI DIMMs have closed-source firmware. Unlike the DDIMMs announced at the 2019 OpenPOWER Summit, the E1080 datasheet specifies buffered DDR4 CDIMMs. This appears to be simply a different form factor; the datasheet intro blurb indicates they are also OMI-based. Each 4-processor node can hold 16TB of RAM for 64TB in the largest 16-socket configuration. IBM lists no directly-attached RAM option currently.

IBM is taking orders now and shipments are expected to begin before the end of September. Now that POWER10 is actually a physical product, let's hope there's news on the horizon about a truly open Open Memory Interface in the meantime. Just keep in mind that if you have to ask how much this machine costs you clearly can't afford it, and IBM doesn't do retail sales anyway.

Cache splash in Telum means seventh heaven for POWER11?

AnandTech has a great analysis of IBM's new z/Architecture mainframe processor Telum, the successor to z15 (so you could consider it the "z16" if you like) scheduled for 2022. The most noteworthy part of that article is Telum's unusual approach to cache.

Most conventional CPUs (keeping in mind mainframes are hardly conventional, at least in terms of system design), including OpenPOWER chips, have multiple levels of cache; so did z15. L1 cache (divided into instruction and data) is private to the core and closest to it, usually measured in double-digit kilobytes on contemporary designs. It then fans out into L2, which is also usually private to an individual core and in triple-digit kilobyte range, and then some level of L3 (plus even L4) cache which is often shared by an entire processor and measured in megabytes. Cache size and how cache entries may be placed (i.e., associativity) is a tradeoff between the latency of searching a larger cache, die space considerations and power usage, versus the performance advantages of fewer cache misses and reduced use of slower peripheral memory.

While every design has some amount of L1, there certainly have been processors that dispensed with other tiers of cache. Most of Hewlett-Packard's late lamented PA-RISC architecture had no L2 cache at all, with the L1 cache being unusually large in some units (the 1997 PA-8200 had 4MB of total L1, 2MB each for data and instructions). Closer to home, the PowerPC 970 "G5" (derived from the POWER4) carried no L3; the 2005 dual-core 970MP, used in the Power Mac G5 Quad, IBM POWER 185 and YDL PowerStation, instead had 1MB of L2 per core which was on the large side for that era. Conversely, the Intel Itanium 2 could have up to 64MB of L4 cache; Haswell CPUs with GT3e Iris Pro Graphics can use the integrated GPU's eDRAM as a L3 victim cache for the same purpose as an L4, though this feature was removed in Skylake. However, the Sforza POWER9 in Raptor workstations is more typical of modern chips with three levels of cache: the dual-8 02CY649 in this machine I'm typing on has 32/32KB L1, 512KB L2 and 10MB L3 for each of the eight CPU cores. In contrast, AMD Zen 3 uses a shared 32MB L3 between up to eight cores, with fewer cores splitting the pot in more upmarket parts.

With money and power consumption being less or little object in mainframes, however, large multi-level caches rule the day directly. The IBM z15 processor "drawer" (there are five drawers in a typical system) divides itself into four Compute Processors, each CP containing 12 cores with 128/128K L1 (compare to Apple M1 with 192/192K) and split 4MB/4MB L2 per core paired with 256MB of shared L3, overseen by a single System Controller which provides a whopping 960MB of shared L4. This gives it the kind of throughput and redundancy expected by IBM's large institutional customers who depend on transaction processing reliability. The SC services the four CPs almost like an old-school northbridge, but to L4 cache instead of main RAM.

Telum could have doubled down on this the way z15 literally doubled down on z14 (twice the L3, nearly half again as much L4), but instead it dispenses with L3 and L4 altogether. L1 jumps to 256/256K, and in shades of PA-RISC L2 balloons to 32MB per core, with eight cores per chip. Let's zoom in on the die.
The 7nm 530mm2 die shows the L2 cache in the centre of the eight cores, which is already a tipoff as to how IBM's arranged it: cores can reach into other cores' cache. If a cache line gets evicted from a core's L2 and the core can find space for it within another core, then the cache line goes to that core's L2, and is marked as L3. This process is not free and does incur more latency than a traditional L3 when an L3 line stored elsewhere must be retrieved, but the ample L2 makes this condition less frequent, and in the less common case where a core requires data and some other core already evicted it to that core as L3, it can just adopt it. Overall, this strategy means better utilization of cache that adapts better to more diverse workloads because the large total L2 space can be flexibly redirected as "virtual L3" to cores with greater bandwidth demands.

It doesn't stop there, though, because Telum has another trick for "virtual L4." Recall that the z15 uses five drawers in a typical system; each drawer has an SC that maintains the L4 cache. Telum is two chips to a package, with four packages to a unit (the equivalent of a z15 "drawer") and four units to a system. If you can reach into other cores' L2 to use them as L3, then it's a simple conceptual leap to reach into other chips (even in different units) and use their L2 as L4. Again, latency jumps over a more traditional L4 approach, but this means theoretically a typical Telum system has a total of 8GB that could be redirected as L4 (7936MB, if you don't count an individual core's L2). With 256 cores in this system, there's bound to be room somewhere faster than main memory.

What makes this interesting for OpenPOWER is that z/Architecture and POWER naturally tend to cross-pollinate. (History favours POWER, too. POWER chips already took over IBM i first with the RS64-based A35 and finally with the eCLipz project; IBM AS/400 a/k/a i5/OS a/k/a i hardware used to be its own bespoke AS/400 architecture.) z/Architecture is decidedly not Power ISA but some microarchitectural features are sometimes shared, such as POWER6 and z10, which emerged from a common development process and as a result had similar fabrication technologies, execution units, floating-point units, busses and pipelines.

POWER10 is almost certainly already taped out if IBM is going to be anywhere close to a Q4 2021 release, so whatever influence Telum had on its creation has already happened. But Telum at the microarchitecture level sure looks more like POWER than z15 did: there is no more CP/SC division but rather general purpose cores in a NUMA topology more like POWER9, more typical PCIe controllers (in this case PCIe 5.0) for I/O and more reliance on specialized pack-in accelerators (Telum's headline feature is an AI accelerator for SIMD, matrix math and fast activation function computation; no doubt some of its design started with POWER10's own accelerator). Frankly, that reads like a recipe for POWER11. While a dual-CPU POWER11 workstation might not have much need for L4, the "virtual L3" strategy could really pay off for the variety of workloads workstations and non-mainframe servers have to do, and on a four or eight-socket server, the availability of virtual L4 starts outweighing any disadvantage in latency.

The commonalities should not be overstated, as Telum is also "only" SMT-2 (versus SMT-4 or SMT-8 for POWER9 and POWER10) and the deep 5GHz-plus pipeline the reduced SMT count facilitates doesn't match up with the shorter pipeline and lower clockspeeds on current POWER generations. But that's just part of the chips being customized for their respective markets, and if IBM can pull this trick off for z/Architecture it's a short jump to making the technology work on POWER. Assuming we don't have OMI to worry about by then, that could really be something to look forward to in future processor generations, and a genuinely unique advance for the architecture.