Latest Posts

Vikings' upcoming OpenPOWER retail channel


Many Talospace readers are familiar with Vikings, who offer libre hosting as well as hardware sales for libre-friendly devices as well as systems and peripherals certified by the FSF Respects Your Freedom program (for which Raptor systems qualify). However, the Vikings storefront now shows a new tab for OpenPOWER hardware, hopefully a public demonstration of a new retail channel coming soon for those ready to pull the trigger on an OpenPOWER workstation or server of their own. This is of particular value to our readers outside North America, since it gets around a lot of the inconveniences of shipping and payment with United States businesses; Vikings is based in Germany, and accepts payments in euros, US dollars, British pounds, Australian dollars and New Zealand dollars. We have also heard that Vikings is working on a water-cooling system for POWER systems with an aim to reach the market in two months or less, a great option for people trying to run the 18- and 22-core parts in desktop environments (current BTO cooling options are air-cooled only).

It is not yet known whether Vikings will sell full systems, parts and/or processors, whether the lineup will include OpenPOWER systems other than Raptor workstations and servers, or when general availability is expected. Still, the more retail options there are, the greater the volume of sales and the greater the economies of scale that will result. In the end, that can only be a good thing for growing our niche but very important market.

Will it build?


While I will always be big-endian at heart, ppc64le does get around a lot of the unfortunately pervasive endian assumptions in a cold blackboxed x86_64 world, and even things like MMX, SSE and SSSE3 can be automatically translated in many cases. It is therefore a happy result that even many software packages completely unaware of ppc64le will still build and function out of the box, assuming they don't do silly things like emit JITted x86_64 assembly code and try to run it, etc.

I ran across this project the other day which has over 1,000 build scripts for ppc64le (as shell scripts and/or Docker files) that you can either use directly, or as a hint whether your intended build will even work. Cursorily paging through a few I see IBM E-mail addresses, so it's no surprise much of it is tested on Red Hat (though largely RHEL 7.x), but there are Ubuntu scripts there as well and I imagine they'd accept other distros. Keep in mind that this is generic ppc64le, so it should work on POWER8 and up but won't include any special optimizations (for example, I always build optimized Firefox at -O3 -mcpu=power9), and the concentration favours server-side packages over workstation and client software. I also see relatively few platform-specific corrections, which could be either good (they weren't needed) or bad (they weren't tested). Still, it's nice to see more resources to aid porting and platform compatibility, and that can only in turn get more packages thinking about making ppc64le (and hopefully ppc64) a first-class citizen too.

Linux 5.8 on POWER


The 5.8 kernel milestone has arrived, with improvements to reduce thrashing (though with the amount of memory even a Blackbird can hold, there's no excuse not to load these suckers up), an API for receiving notifications of kernel events, support for hardware-assisted inline encryption at the block layer for storage devices and a nice convenience feature where you can put sysctl.something.or.other=999 right on the kernel command line.
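To make the command-line sysctl convenience a little more concrete, here is a minimal Python sketch of how a token of the form sysctl.a.b.c=value lines up with a path under /proc/sys. The function name and the sample command line are hypothetical, and this only models the naming convention; the kernel itself applies the settings at boot, and the sketch makes no attempt to handle quoted values.

```python
def sysctl_params(cmdline):
    """Map sysctl.* tokens on a kernel command line to /proc/sys paths."""
    params = {}
    for token in cmdline.split():
        # Only tokens shaped like sysctl.<name>=<value> are of interest.
        if not token.startswith("sysctl.") or "=" not in token:
            continue
        name, value = token[len("sysctl."):].split("=", 1)
        params["/proc/sys/" + name.replace(".", "/")] = value
    return params


# Hypothetical command line, purely for illustration.
example = "root=/dev/nvme0n1p2 quiet sysctl.vm.swappiness=10"
print(sysctl_params(example))
```

On a running system, comparing the result against the actual files under /proc/sys is a quick way to check that a setting took effect.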

On the Power ISA side, this kernel adds the first support for POWER10 and ISA 3.1, although our Raptor contacts have indicated some displeasure with IBM's management decisions and we suspect this is a way of saying firmware binary blobs might be required to enable maximal performance (though we don't know, and it's unclear how much is under NDA). Another nice feature is an ioctl to send gzip compression requests directly to POWER9's on-chip compression hardware via /dev/crypto/nx-gzip. This is part of the general family of Nest Accelerators (NXes) accessible through the Virtual Accelerator Switchboard. More about that in a later article, but in the meantime while we wait for compressors to add this support, here's an accelerated power-gzip userspace library that directly replaces zlib.

Finally, in addition to various improvements for the 40x and 8xx series, the most interesting commit was around prefixed instructions. These represent the first 64-bit instructions in the Power ISA (here's a code sample to show you the encoding) and allow much bigger 34-bit displacements for load-store operations than the 16-bit ones in current 32-bit instructions. I'm not too wild about the fact this makes Power ISA technically variable-length, but these D-form instructions are easy to identify and they are always 64 bits in size, and they should make certain types of code generation a lot simpler on chips that support them.
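As a rough illustration of why these are easy to identify, here is a small Python sketch that builds the two words of a prefixed load-immediate (pli, which is paddi with no base register) and checks whether a word is a prefix. The field positions reflect my reading of the ISA 3.1 layout, where the prefix word carries primary opcode 1 plus the upper 18 bits of the immediate and the suffix looks like an ordinary addi carrying the lower 16; the function names are made up for this example, so treat it as a sketch rather than a reference encoder.

```python
PREFIX_OPCODE = 1          # primary opcode shared by every prefix word
MLS_PREFIX = 0x06000000    # opcode 1 plus the "modified load/store" type field
ADDI_OPCODE = 14 << 26     # the suffix word is laid out like a plain addi


def encode_pli(rt, value):
    """Return (prefix, suffix) words for 'pli rt, value' (paddi, RA=0, R=0)."""
    if not -(1 << 33) <= value < (1 << 33):
        raise ValueError("immediate does not fit in 34 signed bits")
    imm = value & ((1 << 34) - 1)                       # two's complement, 34 bits
    prefix = MLS_PREFIX | (imm >> 16)                   # upper 18 bits
    suffix = ADDI_OPCODE | (rt << 21) | (imm & 0xFFFF)  # lower 16 bits
    return prefix, suffix


def is_prefix(word):
    """A word starts a 64-bit instruction if its primary opcode is 1."""
    return (word >> 26) == PREFIX_OPCODE


prefix, suffix = encode_pli(3, 0x12345678)
print(f"{prefix:08x} {suffix:08x}", is_prefix(prefix), is_prefix(suffix))
```

A decoder only has to look at the top six bits of a word to know whether another word follows, which keeps the "variable-length" part fairly painless.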

Condor cancelled


Raptor has confirmed that, unfortunately but not unexpectedly, the LaGrange-based Condor that was announced at the OpenPOWER summit last year has been cancelled due to economic concerns. Certainly any new high-end product would be tough to launch in the present COVID-19 economy, and because its size (ATX) and capabilities (single CPU, OpenCAPI, four slots) would have slotted it between the Talos II and the Talos II Lite in our view, there just isn't a lot of slack not served by those two existing products to soak up. It's probably just as well because I think getting ready for POWER10 would mean more to many users (it certainly would to me), but that itself requires a lot of R&D capacity and Raptor's a small company. Rather than a niche POWER9 design, here's hoping the resources that would have gone to Condor will go to a really kick-a$$ new Rainier-based system instead.

Firefox 79 on POWER


Firefox 79 is out. There are many new web and developer-facing features introduced in this version, of which only a couple are of note to us in 64-bit PowerPC land specifically. The first is a migration of WebExtensions storage to a new Rust-based implementation; there is a bit of a pause while extension storage migrates, so don't panic if the browser seems to stall out for a few long seconds on first run. The second is a further rollout of WebRender to more Windows configurations, so this seemed like a good time to check again how well it's working on this side of the fence. With the Raptor BTO WX7100 installed in this Talos II, I've forced it on with gfx.webrender.enabled and layers.acceleration.force-enabled both set to true (restart the browser after) and worked with it all afternoon with no issues noted, so this time I'm just going to leave it on and see how it goes. Any GCN-based AMD video card from Southern Islands on up (the WX7100 is Polaris) should work. about:support will show you if WebRender and hardware acceleration are enabled, though currently no Linux configuration has it enabled by default.

Unfortunately, it turns out relatively few of us build the browser ourselves from source like I do, and it seems some distros are enabling features (most likely higher-level optimizations) that trigger broken builds on ppc64le (Ubuntu was mentioned by at least one user). It would be nice to narrow down the offending feature(s) they enabled, both to get local fixes into the distro package configurations and then to look at why they don't work (or make the default not to enable them on our platform, solving the problem in both places). I suspect LTO and PGO, which have a long history of being troublesome, are to blame, as well as various defects in gold (use GNU bfd as the linker instead). Meanwhile, the build I'm typing this blog post into locally is still happily running on the same .mozconfigs from Firefox 67.

The littlest POWER9 booter


In our previous article we talked about emulating an OpenPOWER system from Skiboot up through the Petitboot boot menu using extracts from pre-built PNOR firmware images (and/or QEMU) instead of having to build your own. Well, what if you want to build your own?

You can certainly download and build Skiboot and Skiroot/Petitboot from scratch, or naturally any of the firmware stages in PNOR flash since we're a fully open platform, and there is an entire (huge) build system to automate this process. It's big and intimidating to the uninitiated, and it also works just dandy. But for this simpler example, let's start with something a little smaller which can serve as an educational tool as well.

Recall that Skiboot is the lowest level emulated by QEMU presently, although in reality it is an intermediate phase started by an earlier boot stage, i.e., Hostboot (the pretty graphical boot you see in current versions of the Raptor firmware). Among other tasks, Skiboot's most important one is to offer the services provided by OPAL, the OpenPOWER Abstraction Layer, which the operating system will need to talk to the hardware. These services range from shutting down the machine to writing to the console, starting interrupts, handling PCI devices and probably not doing your dishes. After OPAL is initialized, Skiboot then starts the bootloader for Petitboot, which unpacks Petitboot's Linux kernel and an initrd (i.e., a zImage containing Skiroot), and that image is what ultimately brings up Petitboot.

However, when you get right down to it, it's still just an ELF binary, so we can replace it as long as we understand how Skiboot calls and starts it.

Up to this point the CPU is in big-endian mode no matter what the terminal operating system is (as an old Power Mac user, this warms my grizzled cybernetic heart) and uses real physical memory addresses. When Skiboot finishes, it loads the single ELF binary stored in the PNOR flash partition BOOTKERNEL and runs it from its given entry point. This binary can be big-endian or little-endian. Skiboot also provides the binary the location of the flattened device tree (the FDT) in register r3, and two special addresses: the base address for OPAL in r8 (in physical memory, mind you), and the actual address to call for OPAL services in r9. This is more or less what kexec() does for a regular kernel, except those registers are guaranteed to be provided by Skiboot no matter the implementation.

OPAL calls assume the machine will stay that way (big-endian, real addresses, and also no external interrupts), so some leg work is required unless you just keep the system that way in the first place. In this simplest case, we'll do exactly that: the Skiboot source code even includes such a minimal boot image which simply says "Hello World!" to the console and shuts down the machine. Here, we see the code save the OPAL registers to non-volatile ones (so that calling OPAL won't clobber them) and use those to make the two OPAL calls themselves, setting the OPAL call number in r0, providing the OPAL base in r2 and any relevant arguments in the standard r3 through r10 registers, and then calling the OPAL entry point.

Let's see it in (brief) action. I will assume you already have QEMU set up to emulate an OpenPOWER machine as in the prior article (in particular, you should have either pnor.PAYLOAD or skiboot.lid available to provide Skiboot). To save you having to do so yourself, I added a little linker-assembler glue, some extra code to support both endian modes (more in a moment) and a trivial build system, and put it up on GitHub. If you're on an OpenPOWER system, as all right-thinking readers should be, then make should be sufficient to compile both the big and little endian versions, the latter of which I will come back to. If you are not, you will need a cross-building toolchain and should edit the Makefile to point to it.

Using what we learned last time, once you've run make, copy be_payload.elf into the same directory as skiboot.lid (QEMU's emulation doesn't work quite right with Raptor's PNOR Skiboot for this purpose), and let's kick it off:

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./be_payload.elf

Ta-daa!

Now, what about the little-endian case? This is trickier, because the system starts big-endian and expects big-endian instructions, and simply twiddling the endian bit in the Machine State Register isn't enough (if you do so via typical means like mtmsrd, it is ignored). In fact, only three instructions are allowed to change endianness, namely rfid, its hypervisor analogue hrfid and rfscv, which are all returns from privileged code (interrupt handlers and vectored system calls respectively). Vectored system calls, in fact, weren't even supported in the Linux kernel until 5.9. For our purpose here rfid will suffice.

Let's look at the version of hello_kernel.S I marked up. You will notice that in little endian mode, we are assembling several handwritten opcodes immediately in the macro GO_LITTLE_ENDIAN. These are big-endian instructions (since the assembler is emitting little-endian code, we can't specify the instructions directly) that point the link register at the address just after this little stanza, copy over the MSR and toggle the endian bit, load the link register and the new MSR into the save-restore registers (srr0 and srr1) and then act as if we returned from an interrupt handler (rfid). rfid sets the new MSR and jumps to the address in srr0, which we have already rigged to be the following instruction. We now continue in little-endian mode.
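If you're wondering where handwritten opcodes like these come from, the following Python sketch shows the idea: take the normal big-endian encoding of an instruction and byte-swap it, so that a little-endian assembler's .long directive lays the bytes down in the order a big-endian CPU expects. The helper names are made up for illustration; the rfid encoding and the position of the MSR's little-endian bit (the lowest bit) are the only hardware facts it relies on.

```python
import struct

RFID = 0x4C000024   # big-endian encoding of rfid
MSR_LE = 1          # little-endian mode is the lowest bit of the MSR


def as_le_long(insn):
    """Value to hand a little-endian assembler's .long so that the bytes
    in memory spell out the big-endian instruction insn."""
    return struct.unpack("<I", struct.pack(">I", insn))[0]


def toggle_endian(msr):
    """Flip the endian bit, as the stanza does before loading the new MSR."""
    return msr ^ MSR_LE


print(f".long 0x{as_le_long(RFID):08x}")
```

The same swap applies to every instruction in the stanza, which is why the marked-up source looks like a column of odd-looking constants rather than mnemonics.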

Now, how do we do OPAL calls? I abstracted the code here a bit for both situations with an OPAL_CALL macro. Big-endian just sets the registers and jumps to the OPAL entry point, since we're in real mode and no external interrupts are presently enabled, exactly the same as the test code in Skiboot. For little-endian, however, I added a little subroutine at the end called le_opal_call which is nearly the same idea as GO_LITTLE_ENDIAN, but in reverse. We save the MSR and the LR in non-volatile registers, turn off the little endian bit in the MSR, compute the new return address for the trampoline after the oncoming rfid and load that into LR, set up srr0 and srr1 (but point to the OPAL entry point instead) and "return from the interrupt."

The OPAL call is thus executed big-endian in real mode. However, when we return following the rfid, we're still big-endian, so we immediately GO_LITTLE_ENDIAN again, restore the old MSR and LR (the LE bit is politely ignored) and return via the link register to the calling routine.

The last trick here is that the length of the string Hello World! will be stored according to the endianness we set for the assembler. If we don't account for this, a length assembled little-endian will read as a nonsense value in big-endian mode and the OPAL routine that prints a string to the console will spew garbage. When assembling in little-endian mode we thus specify the necessary bytes explicitly.
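To see just how wrong the length gets, here is a short Python sketch that stores the string length both ways and then reads the little-endian bytes back as a big-endian reader would. The 64-bit field width is an assumption on my part for the sake of the example; the effect is the same in kind for other widths.

```python
import struct

message = b"Hello World!"

# Assumption: the length is a 64-bit field that big-endian firmware reads.
le_bytes = struct.pack("<Q", len(message))  # what a little-endian assembler stores
be_bytes = struct.pack(">Q", len(message))  # what the big-endian reader expects

# Reading the little-endian bytes as big-endian yields a huge bogus length.
misread = struct.unpack(">Q", le_bytes)[0]

print(len(message), hex(misread))
print(be_bytes.hex())
```

Writing the bytes out by hand in the little-endian build amounts to emitting be_bytes directly instead of letting the assembler choose the order.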

After all that,

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./le_payload.elf

A couple parting comments.

First, while you might think this would be sufficient to make something bootable from both Skiboot and Petitboot, it isn't; if you try to boot this as a kernel from Petitboot it will simply hang. We'll explore this further in a later article. Second, I have intentionally not described how you would actually flash this to PNOR on a real machine lest someone screw something up and blame me for it. In broad strokes, however, you would take either of the ELF binaries and turn it into a PNOR flash partition with fpart (not to be confused with other partition and file management utilities of the same name). Having done so, you would transfer this to the BMC and use pflash to replace the contents of BOOTKERNEL (after, hopefully, backing up the previous contents with pflash -r). At this point you may now start your machine so it can, um, shut down.

Finally, this entire exercise brings up an interesting question (to me, anyway): is there a performance ramification to running in little-endian vs big-endian, given the additional necessary overhead of flipping endianness every time OPAL is called? The answer is probably yes, but it's likely negligible in practice unless you're on the bare metal as we are here. Let's compare how little-endian Linux does this in opal-calls.S with big-endian OpenBSD's locore.S; in both listings, scroll down to opal_call and note the differences. Even though big-endian doesn't have to do quite as much song and dance setting up a trampoline and switching endianness, it still has to twiddle the MSR (in this case to turn off external interrupts and return to real mode), and a similar amount of instruction synchronization must still occur (using isync; rfid and hrfid do this as a natural consequence). From a practical perspective, unless you have some pathological case that makes lots of OPAL calls back to back, the few extra instructions required are probably below the noise threshold when considering everything else that affects performance in modern operating systems.