The littlest POWER9 booter

In our previous article we talked about emulating an OpenPOWER system from Skiboot up through the Petitboot boot menu using extracts from pre-built PNOR firmware images (and/or QEMU) instead of having to build your own. Well, what if you want to build your own?

You can certainly download and build Skiboot and Skiroot/Petitboot from scratch, or naturally any of the firmware stages in PNOR flash since we're a fully open platform, and there is an entire (huge) build system to automate this process. It's big and intimidating to the uninitiated, and it also works just dandy. But for this simpler example, let's start with something a little smaller which can serve as an educational tool as well.

Recall that Skiboot is the lowest level emulated by QEMU presently, although in reality it is an intermediate phase started by an earlier boot stage, i.e., Hostboot (the pretty graphical boot you see in current versions of the Raptor firmware). Among other tasks Skiboot's most important one is to offer the services provided by OPAL, the OpenPOWER Abstraction Layer, which the operating system will need to talk to the hardware. These services range from shutting down the machine to writing to the console, starting interrupts, handling PCI devices and probably not doing your dishes. After OPAL is initialized Skiboot then starts the bootloader for Petitboot, which unpacks Petitboot's Linux kernel and an initrd (i.e., being a zImage containing Skiroot), and that image is what ultimately brings up Petitboot.

However, when you get right down to it it's still just an ELF binary, so we can replace it as long as we understand how Skiboot calls and starts it.

Up to this point the CPU is in big-endian mode no matter what the terminal operating system is (as an old Power Mac user, this warms my grizzled cybernetic heart) and uses real physical memory addresses. When Skiboot finishes, it loads the single ELF binary stored in the PNOR flash partition BOOTKERNEL and runs it from its given entry point. This binary can be big-endian or little-endian. Skiboot also provides the binary the location of the flattened device tree (the FDT) in register r3, and two special addresses: the base address for OPAL in r8 (in physical memory, mind you), and the actual address to call for OPAL services in r9. This is more or less what kexec() does for a regular kernel, except those registers are guaranteed to be provided by Skiboot no matter the implementation.

OPAL calls assume the machine will stay that way (big-endian, real addresses, and also no external interrupts), so some leg work is required unless you just keep the system that way in the first place. In this simplest case, we'll do exactly that: the Skiboot source code even includes such a minimal boot image which simply says "Hello World!" to the console and shuts down the machine. Here, we see the code save the OPAL registers to non-volatile ones (so that calling OPAL won't clobber them) and use those to make the two OPAL calls themselves, setting the OPAL call number in r0, providing the OPAL base in r2 and any relevant arguments in the standard r3 through r10 registers, and then calling the OPAL entry point.

Let's see it in (brief) action. I will assume you already have QEMU set up to emulate an OpenPOWER machine as in the prior article (in particular, you should have either pnor.PAYLOAD or skiboot.lid available to provide Skiboot). To save you having to do so yourself, I added a little linker-assembler glue, some extra code to support both endian modes (more in a moment) and a trivial build system, and put it up on Github. If you're on an OpenPOWER system, as all right-thinking readers should be, then make should be sufficient to compile both the big and little endian versions, the latter of which I will come back to. If you are not, you will need a cross-building toolchain and should edit the Makefile to point to it.

Using what we learned last time, once you've run make, copy be_payload.elf into the same directory as skiboot.lid (QEMU's emulation doesn't work quite right with Raptor's PNOR Skiboot for this purpose), and let's kick it off:

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./be_payload.elf


Now, what about the little-endian case? This is trickier, because the system starts big-endian and expects big-endian instructions, and simply twiddling the endian bit in the Machine State Register isn't enough (if you do so via typical means like mtmsrd, it is ignored). In fact, only three instructions are allowed to change endianness, namely rfid, its hypervisor analogue hrfid and rfscv, which are all returns from privileged code (interrupt handlers and vectored system calls respectively). Vectored system calls, in fact, weren't even supported in the Linux kernel until 5.9. For our purpose here rfid will suffice.

Let's look at the version of hello_kernel.S I marked up. You will notice that in little endian mode, we are assembling several handwritten opcodes immediately in the macro GO_LITTLE_ENDIAN. These are big-endian instructions (since we're little-endian we can't specify the instructions directly) that set the link register after this little stanza, copy over the MSR and toggle the endian bit, load the link register and the new MSR into the save-restore registers and then act as if we returned from an interrupt handler (rfid). rfid sets the new MSR and jumps to the link register which we have already rigged to be the following instruction. We now continue in little-endian mode.

Now, how do we do OPAL calls? I abstracted the code here a bit for both situations with a OPAL_CALL macro. Big-endian just sets the registers and jumps to the OPAL entry point, since we're in real mode and no external interrupts are presently enabled, exactly the same as the test code in Skiroot. For little-endian, however, I added a little subroutine at the end called le_opal_call which is nearly the same idea as GO_LITTLE_ENDIAN, but in reverse. We save the MSR and the LR in non-volatile registers, turn off the little endian bit in the MSR, compute the new return address for the trampoline after the oncoming rfid and load that into LR, set up srr0 and srr1 — but point to the OPAL entry point instead — and "return from the interrupt."

The OPAL call is thus executed big-endian in real mode. However, when we return following the rfid, we're still big-endian, so we immediately GO_LITTLE_ENDIAN again, restore the old MSR and LR (the LE bit is politely ignored) and return via the link register to the calling routine.

The last trick here is that the length of the string Hello World! will be stored according to the endianness we set for the assembler. If we don't account for this, we'll get a nonsense value in big-endian mode and the OPAL routine that prints a string to the console will spew garbage. When assembling in little-endian mode we thus manually specify the necessary bytes explicitly.

After all that,

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./le_payload.elf

A couple parting comments.

First, while you might think this would be sufficient to make something bootable from both Skiboot and Petitboot, it isn't; if you try to boot this as a kernel from Petitboot it will simply hang. We'll explore this further in a later article. Second, I have intentionally not described how you would actually flash this to PNOR on a real machine lest someone screw something up and blame me for it. In broad strokes, however, you would take either of the ELF binaries and turn it into a PNOR flash partition with fpart (not to be confused with other partition and file management utilities of the same name). Having done so, you would transfer this to the BMC and use pflash to replace the contents of PAYLOAD (after, hopefully, backing up the previous contents with pflash -r). At this point you may now start your machine so it can, um, shut down.

Finally, this entire exercise brings up an interesting question (to me, anyway): is there a performance ramification to running in little-endian vs big-endian, given the additional necessary overhead of flipping endianness every time OPAL is called? The answer is probably, but it's likely negligible in practice unless you're on the bare metal as we are here. Let's compare how little-endian Linux does this in opal-calls.S with big-endian OpenBSD's locore.S; in both listings, scroll down to opal_call and note the differences. Even though we don't have to do quite as much song and dance setting up a trampoline and switching endianness, we still have to twiddle the MSR (in this case to turn off external interrupts and return to real mode), and a similar amount of instruction synchronization must still occur (using isync; rfid and hrfid do this as a natural consequence). From a practical perspective, unless you have some pathological case that makes lots of OPAL calls back to back, the few extra instructions required are probably below the noise threshold when considering everything else that affects performance in modern operating systems.