Showing posts from 2020

OpenBSD officially available for ppc64

OpenBSD 6.8 is now available and with it the first official release of the big-endian ppc64 port (which they call powerpc64). The port is specifically advertised for PowerNV machines (i.e., bare metal) with POWER9, which naturally includes the Raptor family but should support IBM PowerNV systems as well. POWER8 support is described as "included but untested."

The installation directions are still not fully complete, though Petitboot should be able to start the installer from pretty much any standard medium, and the installer experience should be stock from there. What's more, it looks like a good selection of pre-built packages is available, though some large applications like Firefox are still missing (WebKit is apparently available). The missing packages seem to be similar to what is missing for their 32-bit powerpc flavour, so this is not unexpected.

With OpenBSD's release and FreeBSD's well-regarded history, this leaves only NetBSD — ironically the BSD with the most emphasis on portability, and my personal preference — as the last major cross-platform BSD yet to arrive on OpenPOWER. Given OpenBSD and NetBSD's genetic history, however, this release makes future NetBSD support for OpenPOWER much more likely.

IBM splits

IBM today announced that the company will split into two, moving the Managed Infrastructure Services portion of IBM Global Technology Services into a new cloud-focused corporation tentatively called "NewCo" by the end of 2021. NewCo would also have a greater focus on AI, presumably through a distributed computing model rather than traditional hardware sales. The Technology Support Services piece of GTS that addresses data centre, hardware and software support would remain part of "old" IBM, along with Red Hat and presumably the R&D folks responsible for working on Power ISA like the great people at OzLabs.

It is interesting that this move was predicted as early as February, and a split in itself only means that a combined business strategy no longer makes sense for these units. But chairwoman Ginni Rometty missed the boat on cloud early, and despite the hype in IBM's investor release over the new company, "NewCo" is really the "old" services-oriented IBM with a fresh coat of paint that was a frequent source of layoffs and cost-cutting manoeuvres over the years. There are probably reasons for this, not least of which is their hidebound services-first mentality that wouldn't sell yours truly a brand new POWER7 in 2010 even when I had a $15,000 personal budget for the hardware because I didn't (and don't: the used POWER6 I bought instead is self-maintained) need their services piece. As a result I apparently wasn't worth the sale to them, which tells you something right there: today's growth is not in the large institutional customers that used to be IBM's bread and butter but rather in the little folks looking for smaller solutions in bigger numbers, and Rometty's IBM failed to capitalize on this opportunity. In my mind, today's split is a late recognition of her tactical error.

Presumably the new company would preferentially use "OldCo" hardware and recommend "OldCo" solutions for their service-driven hybrid buildouts. But "OldCo" makes most of its money from mainframes, and even with robust virtualization options mainframes as a sector aren't growing. Although IBM is taking pains to talk about "one IBM" in their press release, that halcyon ideal exists only as long as either company isn't being dragged down by the other, and going separate directions suggests such a state of affairs won't last long.

What does this mean to us in OpenPOWER land? Well, we were only ever a small part of the equation, and even with a split this won't increase our influence on "OldCo" much. Though IBM still makes good money from Power ISA and there's still a compelling roadmap, us small individual users will need to continue making our voices heard through the OpenPOWER Foundation and others, and even if IBM chooses not to emphasize individual user applications (and in fairness they won't, because we're not where the money is), they still should realize the public relations and engineering benefits of maintaining an open platform and not get in the way of downstream vendors like Raptor attempting to capitalize on the "low-end" (relatively speaking) market. If spinning off MIS gets IBM a better focus on hardware and being a good steward of their engineering resources, then I'm all for it.

Where did the 64K page size come from?

Lots of people were excited by the news over Hangover's port to ppc64le, and while there's a long way to go, the fact it exists is a definite step forward to improving the workstation experience on OpenPOWER. Except, of course, that many folks (including your humble author) can't run it: Hangover currently requires a kernel with a 4K memory page size, which is the page size of the majority of extant systems (certainly x86_64, which only offers a 4K page size). ppc64 and ppc64le can certainly run on a 4K page size and some distributions do, yet the two probably most common distributions OpenPOWER users run — Debian and Fedora — default to a 64K page size.

And there's lots of things that glitch and have glitched when userspace makes assumptions about this. Besides Hangover, Firefox used to barf on 64K pages (on aarch64 too), and had an issue where binaries built on one page size wouldn't work on systems with a different one. (This also bit numpy.) Golang and the runtime used to throw fatal errors. The famous nouveau driver for Nvidia GPUs assumes a 4K page size, and the compute-only binary driver that does exist (at least for POWER8) cheats by making 64K pages out of 16 copies of the "actual" 4K page. btrfs uses a filesystem page size that mirrors that of the host's page size on which it was created. That means if you make a btrfs filesystem on a 4K page size system, it won't be readable on a 64K page system and vice versa (this is being fixed, but hasn't been yet).
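Most of these bugs boil down to code that hardcodes 4096 instead of asking the kernel. As a minimal sketch of the better habit (the helper names here are my own for illustration, not from any of the projects above), Python can report the page size the running kernel actually uses, which is the same number `getconf PAGESIZE` prints:

```python
import mmap

FOUR_K = 4096  # the value a lot of code silently assumes


def page_size() -> int:
    """Base page size the running kernel gives this process, in bytes."""
    return mmap.PAGESIZE


def round_up_to_page(nbytes: int, page: int) -> int:
    """Round nbytes up to a whole number of pages (page must be a power of two)."""
    return (nbytes + page - 1) & ~(page - 1)


if __name__ == "__main__":
    page = page_size()
    print(f"kernel page size: {page // 1024}K")
    if page != FOUR_K:
        print("anything hardcoding 4096 for alignment is wrong on this system")
    # How much address space a 5000-byte buffer really occupies in whole pages.
    print("5000 bytes occupies", round_up_to_page(5000, page), "bytes of pages")
```

On a stock Fedora or Debian ppc64le install this should report 64K, and on x86_64 it should report 4K; the rounding helper gives different answers on each, which is exactly the kind of difference that bites binaries built on one and run on the other.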

With all these problems, why have a 64K page size at all, let alone default to it? There must be some reason to use it because ppc64(le) isn't even unique in this regard; many of those bugs related to aarch64 which also has a 64K page option. As you might guess, it's all about performance. When a virtual memory page has to be attached to a process or mapped into its addressing space, a page fault is triggered and has to be handled by the operating system. Sometimes this is minor (it's already in memory and just has to be added to the process), sometimes this is major (the page is on disk, or swapped out), but either way a page fault has a cost. 64-bit systems naturally came about because of the need for larger memory addressing spaces, which benefits big applications like databases and high-performance computing generally, and these were the tasks that early 64-bit systems were largely used for. As memory increases, subdividing it into proportionally larger pieces thus becomes more performance-efficient: when the application faults less, the application spends more time in its own code and less in the operating system's.

A second performance improvement afforded by larger pages is higher efficiency from the translation lookaside buffer, or TLB. The TLB is essentially a mapping cache that allows a CPU to quickly get the physical memory page for a given virtual memory address. When the virtual memory address cannot be found in the TLB, then the processor has to go through the entire page table and find the address (and fill it into the TLB for later), assuming it exists. This can be a relatively expensive process if there are many entries to go through, and even worse if the page tables are nested in a virtualized setup. A larger page size not only allows more memory to be handled with a smaller page table, making table walks quicker, but also yields more hits for a TLB of the same size. It is fair to point out there are arguments over MMU performance between processor architectures which would magnify the need for this: performance, after all, was the reason why POWER9 moved to a radix-based MMU instead of the less-cache-friendly hashed page table scheme of earlier Power generations, and x86_64 has a radix tree per process while Power ISA's page table is global. (As an aside, some systems optionally or even exclusively have software-managed TLBs where the operating system manages the TLB for the CPU and walks the page tables itself. Power ISA isn't one of them, but these architectures in particular would obviously benefit from a smaller page table.)
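To put rough numbers on that, here is a back-of-the-envelope sketch. The 1024-entry TLB is a made-up figure purely for illustration and is not the size of any real POWER9 or x86_64 structure; the point is only how the ratios change with page size.

```python
KIB = 1024
GIB = 1024 ** 3


def pages_needed(mem_bytes: int, page_bytes: int) -> int:
    """Number of pages (and so page table entries) needed to map mem_bytes."""
    return -(-mem_bytes // page_bytes)  # ceiling division


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory a fully populated TLB can translate without a miss."""
    return entries * page_bytes


if __name__ == "__main__":
    entries = 1024   # hypothetical TLB size, for illustration only
    ram = 16 * GIB   # same RAM as the Blackbird discussed in this post
    for page in (4 * KIB, 64 * KIB):
        print(f"{page // KIB}K pages: "
              f"{pages_needed(ram, page)} entries to map 16GB, "
              f"TLB reach {tlb_reach(entries, page) // (KIB * KIB)}MB")
```

With these assumptions, moving from 4K to 64K pages cuts the number of entries needed to map the same RAM by a factor of 16 and multiplies the memory a same-sized TLB can cover by the same factor.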

64K page sizes, compatibility issues notwithstanding, naturally have a downside. The most important objection relates to memory fragmentation: many memory allocators have page alignment constraints for convenience, which could waste up to the remaining 60K if the memory actually in use fits entirely within a 4K page instead. On bigger systems with large amounts of memory running tasks that allocate large memory blocks, this excess might be relatively low, but it could add up on a workstation-class system with smaller RAM running a mix of client applications making smaller allocations. In a somewhat infamous rebuttal, Linus Torvalds commented, "These absolute -idiots- talk about how they win 5% on some (important, for them) benchmark by doing large pages, but then ignore the fact that on other real-world loads they lose by sevaral HUNDRED percent because of the memory fragmentation costs [sic]." Putting Linus' opinion into more anodyne terms, if the architecture bears a relatively modest page fault penalty, then the performance improvements of a larger page size may not be worth the memory it can waste. This is probably why AIX, presently specific to ppc64, offers both 4K and 64K pages (and even larger 16MB and 16GB pages) and determines what to offer to a process.
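The 60K figure falls straight out of the rounding. As a sketch, assuming a naive allocator that hands out whole pages for each request (real allocators are smarter, so treat this as the worst case rather than typical behaviour), and using a request mix I invented for the example:

```python
def wasted_bytes(request: int, page: int) -> int:
    """Bytes lost when one page-aligned allocation is rounded up to whole pages."""
    if request <= 0:
        raise ValueError("request must be positive")
    pages = -(-request // page)  # ceiling division
    return pages * page - request


def total_waste(requests, page: int) -> int:
    """Sum of the rounding waste over a list of allocation sizes."""
    return sum(wasted_bytes(r, page) for r in requests)


if __name__ == "__main__":
    # Hypothetical mix of small client-application allocations, in bytes.
    requests = [4096, 8192, 70000]
    for page in (4096, 65536):
        print(f"{page // 1024}K pages waste {total_waste(requests, page)} bytes")
```

A single 4096-byte request wastes nothing on 4K pages and 61440 bytes (60K) on 64K pages, and the gap only widens as small allocations pile up.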

The 4K vs. 64K gulf is not unlike the endian debate. I like big endian and I cannot lie, but going little endian gave Power ISA a larger working software library by aligning with what those packages already assumed; going 4K is a similar situation. But while the performance difference between endiannesses has arguably never been significant, there really are performance reasons for a 64K page size and those reasons get more important as RAM and application size both increase. On my 16GB 4-core Blackbird, the same memory size as my 2005 Power Mac Quad G5, a 4K page size makes a lot more sense than a 64K one because I'm not running anything massive. In that sense the only reason I'm still running Fedora on it is to serve as an early warning indicator. But on my 64GB dual-8 Talos II, where I do run larger applications, build kernels and Firefoxen and run VMs, the performance implications of the larger page size under Fedora may well become relevant for those workloads.

For servers and HPCers big pages can have big benefits, but for those of us using these machines as workstations I think we need to consider whether the performance improvement outweighs the inconvenience. And while Fedora has generally served me well, lacking a 4K page option on ppc64le certainly hurts the value proposition for Fedora Workstation on OpenPOWER since there are likely to be other useful applications that make these assumptions. More to the point, I don't see Red Hat-IBM doubling their maintenance burden to issue a 4K page version and maintaining a downstream distro is typically an incredibly thankless task. While I've picked on Fedora a bit here, you can throw Debian and others into that mix as well for some of the same reasons. Until other operating systems adopt a hybrid approach like AIX's, the quibble over page size is probably the next major schism we'll have to deal with because in my humble opinion OpenPOWER should not be limited to the server room where big pages are king.

It's good to have a Hangover

One of the wishlist items for us OpenPOWER workstation users is better emulation for when we have to run an application that only comes as a Windows binary. QEMU is not too quick, largely because of its overhead, and although Bochs is faster in some respects it's worse in others and doesn't have a JIT. While things like HQEMU are fast, they also have their own unique problems, and many things that work in QEMU don't work in HQEMU. Unfortunately, because Wine Is Not an Emulator, it cannot be used by itself to run x86 Windows binaries directly on a non-x86 host like ours.

People then ask the question, what if we somehow put QEMU and Wine together like Slaughterhouse-Five and see if they breed? Somebody did that, at least for aarch64, and that is Hangover. And now it runs on ppc64le with material support for testing provided by Raptor.

Hangover is unabashedly imperfect and many things still don't work, and there are probably things that work on aarch64 that don't work on ppc64le as the support is specifically advertised as "incomplete." (Big-endian need not apply, by the way: there are no thunks here for converting endianness. Sorry.) There is also the maintainability problem that the changes to Wine to support ppc64le (done by Raptor themselves, as we understand) haven't been upstreamed and that will contribute to the rebasing burden.

With all that in mind, how's it work? Well ... I have no idea, because the other problem is right now it's limited to kernels using a 4K page size and not every ppc64le-compatible distribution uses them. Void Linux, for example, does support 4K pages on ppc64le, but Fedora only officially supports a 64K page size, and I'm typing this on Fedora 32. It may be possible to hack Hangover to add this support but the maintainer ominously warns that "loading PE binaries, which have 4K aligned sections, into a 64K page comes with [lots] of problems, so currently the best approach is to avoid that." I'm rebuilding my Blackbird system but I like using it as a Fedora tester before running upgrades on this daily driver Talos II, which has saved me some substantial inconvenience in the past. That said, a live CD that boots Void and then runs Hangover might be fun to work on.
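One way to picture the maintainer's warning: memory protection is applied per host page, so sections that were laid out on 4K boundaries with different permissions can end up sharing a single 64K page. The sketch below is only my illustration of that one problem, using an invented section layout; it is not how Hangover's loader works and the section addresses are hypothetical.

```python
from collections import defaultdict

# Hypothetical PE-style layout: (name, start address, size, protection).
# Every section starts on a 4K boundary, as PE sections typically do.
SECTIONS = [
    (".text", 0x1000, 0x3000, "r-x"),
    (".rdata", 0x4000, 0x1000, "r--"),
    (".data", 0x5000, 0x2000, "rw-"),
]


def conflicting_pages(sections, page: int):
    """Return {page index: protections} for host pages that would need
    more than one protection because several sections land in them."""
    by_page = defaultdict(set)
    for _name, start, size, prot in sections:
        first = start // page
        last = (start + size - 1) // page
        for idx in range(first, last + 1):
            by_page[idx].add(prot)
    return {idx: prots for idx, prots in by_page.items() if len(prots) > 1}


if __name__ == "__main__":
    for page in (4096, 65536):
        print(f"{page // 1024}K host pages:", conflicting_pages(SECTIONS, page))
```

With 4K host pages every section gets its own pages and nothing conflicts; with 64K host pages all three sections fall into page 0, which would have to be readable, writable and executable at once to satisfy them.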

If you've built Hangover on your machine and given it a spin, advise how well it works for you and how it compares to QEMU in the comments.

IBM makes available the POWER10 Functional Simulator

Simulators are an important way of testing compatibility with future architectures, and IBM has now released a functional simulator for POWER10. Now, we continue to watch POWER10 closely here at Talospace because of as-yet unaddressed concerns over just how "open" it is compared to POWER8 and POWER9, and we have not heard of any workstation-class hardware announced around it yet (from Raptor or anyone else). But we're always interested in the next generation of OpenPOWER, and the documentation states it provides "enough POWER10 processor complex functionality to allow the entire software stack to execute, including loading, booting and running a little endian Linux environment." Pretty cool, except you can't actually run it on OpenPOWER yet: there is no source code, and there are no binaries for ppc64le, although the page indicates it is supported; the only downloads as we go to press are for x86_64. IBM did eventually release ppc64le packages for Debian for the POWER9 functional simulator, so we expect the same to happen here eventually, even though it would have been a nice gesture to have it available immediately since we would be the very people most interested in trying it out. It includes a full instruction set model with SMP support, vector unit and the works, but as always you are warned "it may not model all aspects of the IBM Power Systems POWER10 hardware and thus may not exactly reflect the behavior of the POWER10 hardware."

FreeBSD swings both ways

They say there's an xkcd for everything, but me, I say it's Friends GIFs. Anyway, hat tip to developer Piotr Kubaj who reports that, if you don't like big endian and cannot lie, FreeBSD's got you covered with a new little endian ppc64le port to complement the existing (and by now practically mature) big endian ppc64 flavour.

Raptor themselves actually give material support to the project by providing a remote instance for development, powering a build server that continuously runs poudriere bulk -a to test ports. Plus, looking in the source tree, the commits to add little-endian support are all tagged as "Sponsored by: Tag1 Consulting, Inc." This company apparently has OpenPOWER alumni from the Oregon State University Open Source Lab (.pdf). It's nice to see the cross-pollination at work!

Although there are no .iso images yet, they should start appearing with the -CURRENT snapshots next week. Note that official ports support doesn't exist yet either, so you'll need to compile packages on your own for the moment, and there are other minor to moderate deficiencies relative to the big-endian port which are still being rectified. Still, choice is a good thing, especially since per Piotr there are no plans to decommission the big-endian port and both will coexist. How's that for playing on both teams?

Firefox 81 on POWER

Firefox 81 is released. In addition to new themes of dubious colour coordination, media controls now move to keyboards and supported headsets, the built-in JavaScript PDF viewer now supports forms (if we ever get a JIT going this will work a lot better), and there are relatively few developer-relevant changes.

This release heralds the first official change in our standard POWER9 .mozconfig since Fx67. Link-time optimization continues to work well (and in 81 the LTO-enhanced build I'm using now benches about 6% faster than standard -O3 -mcpu=power9), so I'm now making it a standard part of my regular builds with a minor tweak we have to make due to bug 1644409. Build time still about doubles on this dual-8 Talos II and it peaks out at almost 84% of its 64GB RAM during LTO, but the result is worth it.

Unfortunately PGO (profile-guided optimization) still doesn't work right, probably due to bug 1601903. The build system does appear to generate a profile properly, i.e., a controlled browser instance pops up, runs some JavaScript code, does some browser operations and so forth, and I see gcc created .gcda files with all the proper count information, but then the build system can't seem to find them to actually tune the executable. This needs a little more hacking which I might work on as I have free time™. I'd also like to eliminate ac_add_options --disable-release as I suspect it is no longer necessary but I need to do some more thorough testing first.

In any event, reliable LTO at least with the current Fedora 32 toolchain is still continuous progress. I've heard concerns that some distributions are not making functional builds of Firefox for ppc64le (let alone ppc64, which has its own problems), though Fedora is not one of them. Still, if you have issues with your distribution's build and you are not able to build it for yourself, if there is interest I may put up a repo or a download spot for the binaries I use since I consider them reliable. Without further ado, here are the current .mozconfigs that I attest as functional.

Optimized Configuration

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9"
ac_add_options --disable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full

#export GN=/uncomment/and/set/path/if/you/haz

Debug Configuration

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9"
ac_add_options --enable-debug
ac_add_options --disable-release
ac_add_options --enable-linker=bfd

#export GN=/uncomment/and/set/path/if/you/haz

The first production RISC-V workstation?

No, not the RiscPC, a RISC-V PC. And, not counting the various one-offs, it appears to be the very first production RISC-V workstation available. SiFive is announcing the RISC-V PC at the Linley Group Fall Virtual Processor Conference, based on the Freedom U740 ("FU740") to be introduced at the same time next month.

Precious few details are available, such as loadout, options, availability and most of all cost, but when has that stopped us from idly speculating before, eh? It is virtually certain the machine will be composed largely of off-the-shelf components other than the CPU, which is the real mystery of interest. The FU740 appears to be an evolution of the FU540, which is a 64-bit 1.5GHz+ part with four U54 "big" cores combined with one S1-series "little" core and 2MB of L2 cache on a 28nm process. Plainly, neither of these cores is even remotely in the ballpark with OpenPOWER: SiFive quotes CoreMark/MHz scores of 3.01 for both the U54 and S54, whereas the POWER9 easily achieves over 160. While the FU740 will almost certainly be faster due to its probable basis on the U74, it is difficult to imagine that the performance gulf will be narrowed significantly (the U74 edges up to around 5). You should not buy one and expect it to compare favourably with x86 or a Raptor system.

On the other hand, there's a good chance this will be another truly open system based on the fact that the Freedom E300 and U500 series are open source under the Apache license. While some parts of SiFive are proprietary, this line is not, and we presume that the U700 series will be likewise. RISC-V still lacks firm specs for vector and bit manipulation instructions, and this certainly hurts them for desktop and mobile applications, but this is a known deficiency and is being worked on. Assuming no shenanigans with the firmware, there's encouraging potential even in this early form.

I'm unambiguously on Team Power because of my long history with the architecture, but this blog is certainly interested in all kinds of free vendor-unencumbered computing, and this machine may well represent another such system. And it's newsworthy as the first RISC-V system that's at least workstation form factor even if its likely performance doesn't currently make it a credible daily driver. But maybe that's not the point: the point is to get developers on the architecture in a way that's bigger than an evaluation board (cf. Linus Torvalds and ARM), meaning it doesn't have to be their only daily driver; it just has to "be there" so people think about it. More on cost and specs and "how open is it" when we actually see it in October.

Moar OpenPOWER cores plz

More news from virtual OpenPOWER Summit 2020: I mentioned it would be interesting to see what other cores would pop up on the OpenPOWER Github and indeed following on from the PowerPC A2I comes another A2 variant, the PowerPC A2O.

Announced today by IBM and released under the standard OpenPOWER license, the A2O is an evolved 64-bit PowerPC A2 compliant to ISA 2.07, comparable to POWER8 (the A2I was 2.06) under the embedded-focused Book III-E, and can run either big or little endian. At 45nm it was intended for 3GHz+ speeds; at 7nm it is expected to achieve 4.2GHz speeds at 0.85W, or 3GHz at 0.25W. Unlike the strictly in-order and slightly more power-thrifty A2I the A2O is out-of-order and prioritizes single-threaded performance, but it's only SMT-2 versus the A2I which is SMT-4. Even this is theoretical, however, because the documentation notes that only single-thread generation has been attempted so far. Each core has an AXU similar to the A2I that appears to offer FPU operations in the Verilog code, plus a branch unit, FXUs for simple and complex integer operations respectively, and a load/store unit. There also appears to be a basic MMU, though the core allows running without one relying entirely on ERATs, but unfortunately I couldn't find a vector unit (the A2I as released didn't come with one either).

IBM casts the A2O as being more appropriate for artificial intelligence, autonomous driving and security, whereas the A2I was meant for streaming, network processing and data analysis. I'm not sure I believe either of those claims, but despite apparently being just an evolutionary improvement over the A2I I think the A2O is more promising especially for smaller-scale systems. By being 2.07-compliant it's already almost a mainline POWER8 and the interest that has bubbled up around A2I should find even more to like in A2O. Adding a radix MMU implementation and vector operations wouldn't be trivial, and even this single-thread implementation has high FPGA utilization, but I think this would be a better basis than A2I for that hypothetical OpenPOWER developer board everybody seems to want or even a mythical modern PowerPC laptop. Like A2I, A2O still doesn't replace Microwatt, which is much better documented, better supported, can actually boot a Linux kernel, and if for no other purpose than pedagogy is a far more purposeful model for OpenPOWER systems. That said, A2O's very presence is yet another choice and yet another great reason to be on board with OpenPOWER.

IBM open-sources PowerAI as OpenCE

News from today's COVID-19 socially distanced virtual OpenPOWER Summit: IBM announced the open-sourcing of their PowerAI package today as OpenCE, the Open Cognitive Environment for deep learning and machine learning applications. The code should build on any Linux-based OpenPOWER system, including Raptor-family workstations and servers, and the Github repository contains everything needed to build Tensorflow, Pytorch, XGBoost and related projects and dependencies. If building binaries from scratch leaves you cold waiting for the goodies, Oregon State University simultaneously announced plans to offer pre-built ppc64le binaries for each upcoming tagged release both with and without CUDA support. Unfortunately, not everything is open: you'll still need to register and download a separate blob from Nvidia if you intend to use CUDA, even though it can be reportedly downloaded at no cost afterwards, and if you do you'll naturally be limited to Nvidia GPUs (which you can't use for 3D acceleration on OpenPOWER currently due to the lack of a working open-source driver). Still, here's a high-power option for your machines coming from someone who knows how to optimize for the platform, and Raptor's PowerAI-specific SKU is a turnkey package configured expressly for that purpose (and it's even in stock). Perhaps OpenCE is something they could preinstall for even greater value now that it's available.

Microwatt floats

When we last visited Microwatt, the little synthesizable OpenPOWER core that could, we looked at how you could hack instructions in. Or, you can sit back and wait for the PRs from IBM, including now a simple FPU. While this pull request describes its performance in modest terms, impressively it operates exactly the same (and even authentically "fails" the same tests in the same fashion) as the FPU in the POWER9. There is still no (full) supervisor mode, and no vector unit, but Microwatt is now advanced enough to boot a Linux kernel. The possibility of a single-board Microwatt-based system (and fully reprogrammable, too) gets closer every day.

Firefox 80 on POWER

Firefox 80 is available, and we're glad it's here considering Mozilla's recent layoffs. I've observed in this blog before that Firefox is particularly critical to free computing, not just because of Google's general hostility to non-mainstream platforms but also the general problem of Google moving the Web more towards Google.

I had no issues building Firefox 79 because I was still on rustc 1.44, but rustc 1.45 asserted while compiling Firefox, as reported by Dan Horák. This was fixed with an llvm update, and with Fedora 32 up to date as of Sunday and using the most current toolchain available, Firefox 80 built out of the box with the usual .mozconfigs.

Since there was a toolchain update, I figured I would try out link-time optimization again since a few releases had elapsed since my last failed attempt (export MOZ_LTO=1 in your .mozconfig). This added about 15 minutes of build-time on the dual-8 Talos II to an optimized build, and part of it was spent with the fans screaming since it seemed to ignore my -j24 to make and just took over all 64 threads. However, it not only builds successfully, I'm typing this post in it, so it's clearly working. A cursory benchmark with Speedometer 2.0 indicated LTO yielded about a 4% improvement over the standard optimized build, which is not dramatic but is certainly noticeable. If this continues to stick, I might try profile-guided optimization for the next release. The toolchain on this F32 system is rustc 1.45.2, LLVM 10.0.1-2, gcc 10.2.1 and GNU ld.bfd 2.34-4; your mileage may vary with other versions.

There's not a lot new in this release, but WebRender is still working great with the Raptor BTO WX7100, and a new feature available in Fx80 (since Wayland is a disaster area without a GPU) is Video Acceleration API (VA-API) support for X11. The setup is a little involved. First, make sure WebRender and GPU acceleration are up and working with these prefs (set or create):

gfx.webrender.enabled true
layers.acceleration.force-enabled true

Restart Firefox and check in about:support that the video card shows up and that the compositor is WebRender, and that the browser works as you expect.

VA-API support requires EGL to be enabled in Firefox. Shut down Firefox again and bring it up with the environment variable MOZ_X11_EGL set to 1 (e.g., for us tcsh dweebs, setenv MOZ_X11_EGL 1 ; firefox &, or for the rest of you plebs using bash and descendants, MOZ_X11_EGL=1 firefox &). Now set (or create):

media.ffmpeg.vaapi-drm-display.enabled true
media.ffmpeg.vaapi.enabled true
media.ffvpx.enabled false

The idea is that VA-API will direct video decoding through ffmpeg and theoretically obtain better performance; this is the case for H.264, and the third setting makes it true for WebM as well. This sounds really great, but there's kind of a problem:

Reversing the last three settings fixed this (the rest of the acceleration seems to work fine). It's not clear whose bug this is (ffmpeg, or something about VA-API on OpenPOWER, or both, though VA-API seems to work just fine with VLC), but either way this isn't quite ready for primetime yet on our platform. No worries since the normal decoder seemed more than adequate even on my no-GPU 4-core "stripper" Blackbird. There are known "endian" issues with ffmpeg, presumably because it isn't fully patched yet for little-endian PowerPC, and I suspect once these are fixed then this should "just work."

In the meantime, the LTO improvement with the updated toolchain is welcome, and WebRender continues to be a win. So let's keep evolving Firefox on our platform and supporting Mozilla in the process, because it's supported us and other less common platforms when the big 1000kg gorilla didn't, and we really ought to return that kindness.

POWER10 sounds really great, but ...

IBM took the wraps off POWER10 officially today, a (Samsung-manufactured) 7nm monster in 18 layers with up to 15 SMT-8 cores (120 threads) with 2MB of L2 per core, up to 120MB of L3, 1 TB/s memory access, OpenCAPI and PCIe 5. New on-board is an embedded matrix math accelerator for specialized AI performance, multipetabyte memory clusters and transparent memory encryption with four times as many AES engines as POWER9. Overall, IBM is touting that the processor is three times more energy efficient than POWER9 while being up to twice as fast at scalar and four times as fast at vector operations. General availability is announced for Q3 or Q4 of 2021.

First of all: damn. This sounds sweet. The dual-8 POWER9 Talos II under the desk with "just" 64 threads and PCIe 4 is already giving me sorrowful Eeyore eyes even though there's no guarantee what, if any, lower-end systems suitable as workstations will be available when the processor is. But what we do know is that right now Raptor has said there won't be POWER10 systems, and as it stands presently nobody else is making workstation-class OpenPOWER machines. Raptor, probably for reasons of NDAs, is playing this close to the vest, so what follows is merely my variably informed personal conjecture and may be completely inaccurate.

One of the truly incredible things about OpenPOWER — or at least POWER8 and POWER9 — is how far down you can see what the hardware is doing. In previous articles, we looked at emulating OpenPOWER at the bare metal level, and then even writing your own firmware bootkernel. But the bootloader and high-level firmware are really only the beginning: the build image created by op-build not only contains the Petitboot bootloader, but its Skiroot filesystem, Skiboot (containing OPAL, the OpenPOWER Abstraction Layer, which handles PCIe, interrupt and operating system services), Hostboot (which initializes and trains RAM, buffers and the bus), and the Self-Boot Engine which initializes the CPUs. Even the fused-in first instructions the POWER9 executes from its OTPROM to run the Self-Boot Engine are open source, and other than the OTPROM itself (it is a One-Time Programmable ROM, after all), everything is inspectable and changeable. And before the POWER9 executes those very first instructions, the Baseboard Management Controller that powers the system on has its own open firmware too. You know what your computer is doing, and you don't have to trust anyone's firmware build if you don't want to because you can always build and flash the system yourself.

Contrast this against the gyrations that x86 "open" systems have to struggle with. Do not interpret this as a slam against vendors like System76 or Purism because they're doing the best they can to deliver the most frequently used architecture in workstations and servers, in as unlocked a fashion as possible from processor manufacturers who are going in exactly the opposite direction. And there have been great improvements in untangling the tendrils of the Intel Management Engine from the processor, primarily through Coreboot's steady evolution. But even with these improvements where significant portions of the Intel ME are disabled, secret sauce is still needed to bring up the CPU and you have to trust that the sauce is only and specifically doing what it says it is, in addition to the other partitions of the ME which activated or not are still not fully understood. The situation is even worse for AMD Ryzen processors with the Platform Security Processor, which (at least the 3000 and 4000 variants) aren't presently supported by Coreboot at all, though System76 is apparently working on a port.

Don't just take my word for it: as of this writing no recent x86 system appears on the FSF Respects Your Freedom list, but the Talos II and T2 Lite both do (and I imagine the Blackbird is soon to follow). The Vikings D8 is indisputably libre, and has an FSF RYF certification, but is an AMD Opteron 4200, which is about eight or nine years old. As it stands I believe this is the most powerful x86 system still available on the FSF RYF list now that the D16 is out of production (Opteron 6200).

I think there's a reasonable argument to be had about how "open" something needs to be to be considered "libre" and at what point you could be considered to have meaningful control of your machine, but there's no denying there are aspects of modern x86 machines which you are prohibited by policy from getting into, and that means putting more faith in the processor vendor than they may truly deserve. (Don't get me started on GPUs, either. Or, for that matter, ARM.) Again, Raptor won't say, but their public disenchantment with POWER10 suggests that some aspects of the processor firmware stack are not open. This is a situation which is no better than x86, and I'm hoping this is merely an oversight on IBM's part and not a future policy.

To be effective, OpenPOWER needs to be more open than just the ISA being royalty-free, even though that's huge. To be sure, I think there has to be room for processor manufacturers to distinguish themselves in the market or you run the risk of a race to the bottom where people simply rip off designs (this is, I think, a real concern for RISC-V). I think sharing reference designs is necessary to get systems bootstrapped but I can't deny there's money in high performance applications, and high performance microarchitecture demands a return on investment to justify development costs. Similarly, to the extent that any pack-in hardware (like POWER9's Nest Accelerators) isn't part of the open ISA and is a separately managed device that simply shares a die, to me it seems logical to also make it part of how a processor manufacturer can stand out to potential customers.

But the firmware absolutely needs to be as clean and available as the ISA. If the ISA is open and the instructions the CPU is running are part of that open standard, then any firmware components, which (ought to) entirely consist of those instructions, must be open too. If the CPU has pack-in hardware on the die that isn't part of the open ISA, then you should be able to bring up the chip without it. The standard that was set for current OpenPOWER should be the same standard for POWER10 or it doesn't really deserve the OpenPOWER name, and I'm worried that Raptor's insinuations imply IBM's standard isn't the same. Similarly, arguing that the currently incomplete situation with x86 is functionally equivalent to OpenPOWER (or, for that matter, RISC-V) may be well-intentioned but is disingenuous. The FSF may be ideologues on binary blobs, but that doesn't make their position wrong, and the entire OpenPOWER ecosystem from IBM on down should recognize how much goodwill and prominence the openness of POWER8 and POWER9 has generated for the community.

I hope I'm wrong, but I'm concerned I'm not. Let's make sure we get POWER10 right or we won't be practicing what we preach, and that's going to kill us in the crib.

Vikings' upcoming OpenPOWER retail channel

Many Talospace readers are familiar with Vikings, who offer libre hosting as well as hardware sales for libre-friendly devices and systems and peripherals certified by the FSF Respects Your Freedom program (for which Raptor systems qualify). However, the Vikings storefront now shows a new tab for OpenPOWER hardware, hopefully a public demonstration of a new retail channel coming soon for those ready to pull the trigger on an OpenPOWER workstation or server of their own. This is particularly of value to our readers outside North America, since this gets around a lot of the inconveniences of shipping and payment with United States businesses; Vikings is based in Germany, and accepts payments in euros, US dollars, British pounds, Australian dollars and New Zealand dollars. We have also already heard that Vikings is working on a water-cooling system for POWER systems with an aim to reach the market in two months or less, a great option for people trying to run the 18 and 22-core parts in desktop environments (current BTO cooling options are air-cooled only).

It is not yet known whether Vikings will sell full systems, parts and/or processors, whether the lineup will include OpenPOWER systems other than Raptor workstations and servers, or when general availability is expected. Still, the more retail options there are, the greater the volume of sales and the greater the economies of scale that will result. In the end, that can only be a good thing for growing our niche but very important market.

Will it build?

While I will always be big-endian at heart, ppc64le does get around a lot of the unfortunately pervasive endian assumptions in a cold blackboxed x86_64 world, and even things like MMX, SSE and SSSE3 can be automatically translated in many cases. It is therefore a happy result that even many software packages completely unaware of ppc64le will still build and function out of the box, assuming they don't do silly things like emit JITted x86_64 assembly code and try to run it, etc.

I ran across this project the other day which has over 1,000 build scripts for ppc64le (as shell scripts and/or Docker files) that you can either use directly, or as a hint whether your intended build will even work. Cursorily paging through a few I see IBM E-mail addresses, so no surprise much of it is tested on Red Hat (though largely RHEL 7.x), but there are also Ubuntu scripts there as well and I imagine they'd accept other distros. Keep in mind that this is generic ppc64le, so it would work on POWER8 and up but without any special optimizations (for example, I always build optimized Firefox at -O3 -mcpu=power9), and the concentration favours server-side packages more than workstation and client software. I also see relatively few platform-specific corrections, which could be either good (they weren't needed) or bad (they weren't tested). Still, it's nice to see more resources to aid porting and platform compatibility, and that can only in turn get more packages thinking about making ppc64le (and hopefully ppc64) a first-class citizen too.

Linux 5.8 on POWER

The 5.8 kernel milestone has arrived, with improvements to reduce thrashing (though with the amount of memory even a Blackbird can hold, there's no excuse not to load these suckers up), an API for receiving notifications of kernel events, support for hardware-assisted inline encryption at the block layer for storage devices and a nice convenience feature where you can put sysctl.something.or.other=999 right on the kernel command line.
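
To make the command-line convenience concrete, here is a small Python sketch that picks the sysctl.* settings out of a kernel command line string, the way you might when auditing what a boot entry will set. It is only a user-side illustration: the function name and the sample command line are made up, and the kernel's own parser (which also has to deal with quoting) is certainly not written this way.

```python
def sysctl_boot_params(cmdline: str) -> dict:
    """Collect sysctl.<name>=<value> settings from a kernel command line string."""
    settings = {}
    for token in cmdline.split():
        # Only tokens of the form sysctl.some.name=value are of interest.
        if token.startswith("sysctl.") and "=" in token:
            key, value = token[len("sysctl."):].split("=", 1)
            settings[key] = value
    return settings


# Hypothetical command line, purely for illustration.
example = "root=/dev/nvme0n1p2 ro sysctl.vm.swappiness=10 sysctl.kernel.sysrq=1 quiet"
print(sysctl_boot_params(example))
```

On a running system you could feed it the contents of /proc/cmdline instead of the sample string.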

On the Power ISA side, this kernel adds the first support for POWER10 and ISA 3.1, although our Raptor contacts have indicated some displeasure with IBM's management decisions and we suspect this is a way of saying firmware binary blobs might be required to enable maximal performance (though we don't know, and it's unclear how much is under NDA). Another nice feature is an ioctl to send gzip compression requests directly to POWER9's on-chip compression hardware via /dev/crypto/nx-gzip. This is part of the general family of Nest Accelerators (NXes) accessible through the Virtual Accelerator Switchboard. More about that in a later article, but in the meantime while we wait for compressors to add this support, here's an accelerated power-gzip userspace library that directly replaces zlib.

Finally, in addition to various improvements for the 40x and 8xx series, the most interesting commit was around prefixed instructions. These represent the first 64-bit instructions in the Power ISA (here's a code sample to show you the encoding) and allow much bigger 34-bit displacements for load-store operations than the 16-bit ones in current 32-bit instructions. I'm not too wild about the fact this makes Power ISA technically variable-length, but these D-form instructions are easy to identify and they are always 64 bits in size, and they should make certain types of code generation a lot simpler on chips that support it.
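
As a rough illustration of why prefixed instructions are easy to identify, here is a Python sketch that walks a list of 32-bit instruction words and pairs each prefix with its suffix. It assumes my reading of ISA 3.1: a prefix word has primary opcode 1 in its top six bits, and for the D-form loads and stores the displacement is split into an 18-bit field at the bottom of the prefix and a 16-bit field at the bottom of the suffix (34 bits in total, sign-extended). The function names are mine, and this is not a real disassembler.

```python
PREFIX_OPCODE = 1  # primary opcode (top six bits) assumed to mark a prefix word


def is_prefix(word: int) -> bool:
    """True if this 32-bit word is the first half of a 64-bit prefixed instruction."""
    return (word >> 26) & 0x3F == PREFIX_OPCODE


def split_instructions(words):
    """Group a list of 32-bit words into 1-tuples (plain) and 2-tuples (prefixed)."""
    out = []
    i = 0
    while i < len(words):
        if is_prefix(words[i]) and i + 1 < len(words):
            out.append((words[i], words[i + 1]))
            i += 2
        else:
            out.append((words[i],))
            i += 1
    return out


def displacement(prefix: int, suffix: int) -> int:
    """Rebuild the signed 34-bit displacement from its two halves."""
    value = ((prefix & 0x3FFFF) << 16) | (suffix & 0xFFFF)
    if value & (1 << 33):  # sign bit of the 34-bit field
        value -= 1 << 34
    return value


# 0x60000000 is the ordinary nop; the prefixed pair is a constructed example.
stream = [0x60000000, 0x06001234, 0x38605678, 0x60000000]
for insn in split_instructions(stream):
    print(len(insn) * 4, "bytes:", " ".join(f"{w:08x}" for w in insn))
```

The point of the design is visible in split_instructions: one look at the top six bits tells you whether to consume four bytes or eight, so decoding never has to backtrack.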

Condor cancelled

Raptor has confirmed that, unfortunately but not unexpectedly, the LaGrange-based Condor that was announced at the OpenPOWER summit last year has been cancelled due to economic concerns. Certainly any new high-end product would be tough to launch in the present COVID-19 economy, and because its size (ATX) and capabilities (single CPU, OpenCAPI, four slots) would have slotted it between the Talos II and the Talos II Lite in our view, there just isn't a lot of slack not served by those two existing products to soak up. It's probably just as well because I think getting ready for POWER10 would mean more to many users (it certainly would to me), but that itself requires a lot of R&D capacity and Raptor's a small company. Rather than a niche POWER9 design, here's hoping the resources that would have gone to Condor will go to a really kick-a$$ new Rainier-based system instead.

Firefox 79 on POWER

Firefox 79 is out. There are many new web and developer-facing features introduced in this version, of which only a couple are of note to us in 64-bit PowerPC land specifically. The first is a migration of WebExtensions storage to a new Rust-based implementation; there was a bit of a pause while extension storage migrated, so don't panic if the browser seems to stall out for a few long seconds on first run. The second is a further rollout of WebRender to more Windows configurations, so this seemed like a good time to me to check again how well it's working on this side of the fence. With the Raptor BTO WX7100 installed in this Talos II, I've forced it on with gfx.webrender.enabled and layers.acceleration.force-enabled both set to true (restart the browser after) and worked with it all afternoon with no issues noted, so this time I'm just going to leave it on and see how it goes. Any GCN-based AMD video card from Northern Islands on up (the WX7100 is Polaris) should work. about:support will show you if WebRender and hardware acceleration are enabled, though currently no Linux configuration has it enabled by default.

Unfortunately, it turns out relatively few of us are like me where we build the browser ourselves from source, and it seems some distros are enabling features — most likely higher-level optimizations — that trigger broken builds on ppc64le (Ubuntu was mentioned by at least one user). It would be nice to whittle down the offending feature(s) they enabled, both to get local fixes to the distro package configurations and then look at why they don't work (or make the default not to enable them on our platform, solving the problem in both places). I suspect LTO and PGO are to blame, which have a long history of being troublesome, as well as various defects in gold (use GNU bfd as the linker instead). Meanwhile, the build I'm typing this blog post into locally is still happily running on the same .mozconfigs from Firefox 67.

The littlest POWER9 booter

In our previous article we talked about emulating an OpenPOWER system from Skiboot up through the Petitboot boot menu using extracts from pre-built PNOR firmware images (and/or QEMU) instead of having to build your own. Well, what if you want to build your own?

You can certainly download and build Skiboot and Skiroot/Petitboot from scratch, or naturally any of the firmware stages in PNOR flash since we're a fully open platform, and there is an entire (huge) build system to automate this process. It's big and intimidating to the uninitiated, and it also works just dandy. But for this simpler example, let's start with something a little smaller which can serve as an educational tool as well.

Recall that Skiboot is the lowest level emulated by QEMU presently, although in reality it is an intermediate phase started by an earlier boot stage, i.e., Hostboot (the pretty graphical boot you see in current versions of the Raptor firmware). Among other tasks Skiboot's most important one is to offer the services provided by OPAL, the OpenPOWER Abstraction Layer, which the operating system will need to talk to the hardware. These services range from shutting down the machine to writing to the console, starting interrupts, handling PCI devices and probably not doing your dishes. After OPAL is initialized Skiboot then starts the bootloader for Petitboot, which unpacks Petitboot's Linux kernel and an initrd (i.e., being a zImage containing Skiroot), and that image is what ultimately brings up Petitboot.

However, when you get right down to it it's still just an ELF binary, so we can replace it as long as we understand how Skiboot calls and starts it.

Up to this point the CPU is in big-endian mode no matter what the terminal operating system is (as an old Power Mac user, this warms my grizzled cybernetic heart) and uses real physical memory addresses. When Skiboot finishes, it loads the single ELF binary stored in the PNOR flash partition BOOTKERNEL and runs it from its given entry point. This binary can be big-endian or little-endian. Skiboot also provides the binary the location of the flattened device tree (the FDT) in register r3, and two special addresses: the base address for OPAL in r8 (in physical memory, mind you), and the actual address to call for OPAL services in r9. This is more or less what kexec() does for a regular kernel, except those registers are guaranteed to be provided by Skiboot no matter the implementation.

OPAL calls assume the machine will stay that way (big-endian, real addresses, and also no external interrupts), so some leg work is required unless you just keep the system that way in the first place. In this simplest case, we'll do exactly that: the Skiboot source code even includes such a minimal boot image which simply says "Hello World!" to the console and shuts down the machine. Here, we see the code save the OPAL registers to non-volatile ones (so that calling OPAL won't clobber them) and use those to make the two OPAL calls themselves, setting the OPAL call number in r0, providing the OPAL base in r2 and any relevant arguments in the standard r3 through r10 registers, and then calling the OPAL entry point.

Let's see it in (brief) action. I will assume you already have QEMU set up to emulate an OpenPOWER machine as in the prior article (in particular, you should have either pnor.PAYLOAD or skiboot.lid available to provide Skiboot). To save you having to do so yourself, I added a little linker-assembler glue, some extra code to support both endian modes (more in a moment) and a trivial build system, and put it up on Github. If you're on an OpenPOWER system, as all right-thinking readers should be, then make should be sufficient to compile both the big and little endian versions, the latter of which I will come back to. If you are not, you will need a cross-building toolchain and should edit the Makefile to point to it.

Using what we learned last time, once you've run make, copy be_payload.elf into the same directory as skiboot.lid (QEMU's emulation doesn't work quite right with Raptor's PNOR Skiboot for this purpose), and let's kick it off:

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./be_payload.elf


Now, what about the little-endian case? This is trickier, because the system starts big-endian and expects big-endian instructions, and simply twiddling the endian bit in the Machine State Register isn't enough (if you do so via typical means like mtmsrd, it is ignored). In fact, only three instructions are allowed to change endianness, namely rfid, its hypervisor analogue hrfid and rfscv, which are all returns from privileged code (interrupt handlers and vectored system calls respectively). Vectored system calls, in fact, weren't even supported in the Linux kernel until 5.9. For our purpose here rfid will suffice.

Let's look at the version of hello_kernel.S I marked up. You will notice that in little endian mode, we are assembling several handwritten opcodes immediately in the macro GO_LITTLE_ENDIAN. These are big-endian instructions (since we're little-endian we can't specify the instructions directly) that set the link register after this little stanza, copy over the MSR and toggle the endian bit, load the link register and the new MSR into the save-restore registers and then act as if we returned from an interrupt handler (rfid). rfid sets the new MSR and jumps to the link register which we have already rigged to be the following instruction. We now continue in little-endian mode.
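
If the handwritten opcodes look odd, this Python sketch shows the arithmetic behind them: given the instruction word the big-endian CPU must see, it computes the value you would hand to a little-endian assembler's .long so that the bytes land in memory in big-endian order. The rfid encoding used (0x4c000024) is the one I know from the Linux kernel headers; the function name is made up for illustration and the marked-up source does not necessarily compute its constants this way.

```python
import struct

RFID = 0x4C000024  # rfid as a big-endian CPU fetches it


def long_for_le_assembler(opcode: int) -> int:
    """Byte-swap a 32-bit opcode so a little-endian .long emits big-endian bytes."""
    return struct.unpack("<I", struct.pack(">I", opcode))[0]


swapped = long_for_le_assembler(RFID)
print(f".long 0x{swapped:08x}")
# What a little-endian assembler would actually place in memory for that .long:
print(struct.pack("<I", swapped).hex())
```

The second line printed is the byte sequence in memory, which should match the big-endian encoding of the original opcode.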

Now, how do we do OPAL calls? I abstracted the code here a bit for both situations with an OPAL_CALL macro. Big-endian just sets the registers and jumps to the OPAL entry point, since we're in real mode and no external interrupts are presently enabled, exactly the same as the test code in Skiboot. For little-endian, however, I added a little subroutine at the end called le_opal_call which is nearly the same idea as GO_LITTLE_ENDIAN, but in reverse. We save the MSR and the LR in non-volatile registers, turn off the little endian bit in the MSR, compute the new return address for the trampoline after the oncoming rfid and load that into LR, set up srr0 and srr1 (but point to the OPAL entry point instead) and "return from the interrupt."

The OPAL call is thus executed big-endian in real mode. However, when we return following the rfid, we're still big-endian, so we immediately GO_LITTLE_ENDIAN again, restore the old MSR and LR (the LE bit is politely ignored) and return via the link register to the calling routine.

The last trick here is that the length of the string Hello World! will be stored according to the endianness we set for the assembler, but OPAL always reads it as big-endian. If we don't account for this, the little-endian build will hand OPAL a nonsense value and the OPAL routine that prints a string to the console will spew garbage. When assembling in little-endian mode we thus specify the necessary bytes explicitly.
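
To see how badly the length goes wrong, here is a short Python sketch. It assumes the length is a 64-bit quantity (which is how I understand the OPAL console write call to take it) and that the message carries a trailing newline; both are assumptions for illustration rather than a quote from the source.

```python
import struct

message = b"Hello World!\n"  # assumed message, trailing newline included

# The same length as a big-endian and a little-endian assembler would store it.
stored_be = struct.pack(">Q", len(message))
stored_le = struct.pack("<Q", len(message))

# OPAL interprets the bytes as big-endian regardless of how they were stored.
seen_from_be = struct.unpack(">Q", stored_be)[0]
seen_from_le = struct.unpack(">Q", stored_le)[0]

print("big-endian build:   ", stored_be.hex(), "->", seen_from_be)
print("little-endian build:", stored_le.hex(), "->", seen_from_le)
```

The little-endian bytes read back as a length with the real value shifted into the top byte, which is why the bytes have to be written out by hand in that build.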

After all that,

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./le_payload.elf

A couple parting comments.

First, while you might think this would be sufficient to make something bootable from both Skiboot and Petitboot, it isn't; if you try to boot this as a kernel from Petitboot it will simply hang. We'll explore this further in a later article. Second, I have intentionally not described how you would actually flash this to PNOR on a real machine lest someone screw something up and blame me for it. In broad strokes, however, you would take either of the ELF binaries and turn it into a PNOR flash partition with fpart (not to be confused with other partition and file management utilities of the same name). Having done so, you would transfer this to the BMC and use pflash to replace the contents of BOOTKERNEL (after, hopefully, backing up the previous contents with pflash -r). At this point you may now start your machine so it can, um, shut down.

Finally, this entire exercise brings up an interesting question (to me, anyway): is there a performance ramification to running in little-endian vs big-endian, given the additional necessary overhead of flipping endianness every time OPAL is called? The answer is probably, but it's likely negligible in practice unless you're on the bare metal as we are here. Let's compare how little-endian Linux does this in opal-calls.S with big-endian OpenBSD's locore.S; in both listings, scroll down to opal_call and note the differences. Even though we don't have to do quite as much song and dance setting up a trampoline and switching endianness, we still have to twiddle the MSR (in this case to turn off external interrupts and return to real mode), and a similar amount of instruction synchronization must still occur (using isync; rfid and hrfid do this as a natural consequence). From a practical perspective, unless you have some pathological case that makes lots of OPAL calls back to back, the few extra instructions required are probably below the noise threshold when considering everything else that affects performance in modern operating systems.
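
For a feel of how small the difference is, here is a toy Python model of the MSR value each kind of caller has to prepare before entering OPAL. The bit positions are the ones I believe the Linux headers use for SF, EE, IR, DR and LE; the function name and the sample MSR values are invented for illustration, and this is not the code in either opal-calls.S or locore.S.

```python
MSR_SF = 1 << 63  # 64-bit mode
MSR_EE = 1 << 15  # external interrupts enabled
MSR_IR = 1 << 5   # instruction address translation
MSR_DR = 1 << 4   # data address translation
MSR_LE = 1 << 0   # little-endian mode


def msr_for_opal(msr: int) -> int:
    """MSR OPAL expects: big-endian, real mode, external interrupts off."""
    return msr & ~(MSR_EE | MSR_IR | MSR_DR | MSR_LE)


# Hypothetical running-kernel MSR values for each endianness.
callers = {
    "little-endian kernel": MSR_SF | MSR_EE | MSR_IR | MSR_DR | MSR_LE,
    "big-endian kernel": MSR_SF | MSR_EE | MSR_IR | MSR_DR,
}
for name, msr in callers.items():
    cleared = bin(msr ^ msr_for_opal(msr)).count("1")
    print(f"{name}: {cleared} MSR bits cleared before the call")
```

Both callers end up with the same target MSR; the little-endian one merely has one more bit to clear and one more to restore afterwards.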

When will OpenPOWER OpenBSD be now? Now.

We were delighted by the tease that OpenBSD is moving to OpenPOWER (although it is officially big-endian powerpc64, it requires OPAL, so a POWER8 is minimally required). Well, now you can try it out: a powerpc64 snapshot is now available with most of the standard binary distribution sets. The installation documentation is pretty much copy-pasta (I doubt very much that 64-bit PowerPC is supported on AMD Opteron, and I would be impressed to learn that the Pinebook Pro is OpenPOWER), but you should be able to boot from the miniroot (flash it to a USB drive using dd bs=1m) and manually set up and copy the file sets over. Curiously, the X11 distribution sets do not appear to be built yet, so you may be restricted to a text boot and/or the serial console. When I get my spare Talos II back up and running I intend to give this a full shakedown, since this would be a great basis for finally having NetBSD on OpenPOWER (my personal BSD of choice). That NetBSD doesn't run here right now is a great shame for an OS that is supposed to run everywhere but doesn't on one of the most open platforms anywhere.

Firefox 78 on POWER

Firefox 78 is released and is running on this Talos II. This version in particular features an updated RegExp engine but is most notable (notorious) for disabling TLS 1.0/1.1 by default (only 1.2/1.3). Unfortunately, because of craziness at $DAYJOB and the lack of a build waterfall or some sort of continuous integration for ppc64le, a build failure slipped through into release but fortunately only in the (optional) tests. The fix is trivial, another compilation bug in the profiler that periodically plagues unsupported platforms, and I have pushed it upstream in bug 1649653. You can either apply the patch from that bug to your tree or add ac_add_options --disable-tests to your .mozconfig. Speaking of, as usual, the .mozconfigs we use for debug and optimized builds have been stable since Firefox 67.

UPDATE: The patch has landed on release, beta and ESR 78, so you should be able to build straight from source.

The newest OpenPOWER chip: A2

Besides Microwatt, another open core implementation is now available, the PowerPC A2. The chip name may not be familiar, but its most famous application should be: the Blue Gene/Q supercomputer, based on 45nm 18-core chips (16 active, one unused for yield purposes and one for interrupts, I/O and other on-chip services) at 1.6GHz with a TDP of 55W. In 2012, Blue Gene/Qs took top positions on all three major supercomputer benchmark ratings.

The A2I VHDL on offer does in fact appear to be for the Blue Gene/Q variant. This is important, because A2 doesn't have an FPU or vector unit out of the box; it leaves these to be connected through the auxiliary execution unit (AXU). The A2I BG/Q version, however, does have an IEEE 754-compliant FPU connected to the AXU, and this appears to be provided in the VHDL. There is also apparently an MMU, but while the FPU offers SIMD instructions for up to 4 double-precision floats simultaneously it is not AltiVec, so no VMX/VSX. In addition, despite being SMT-4, it is only dual issue (one instruction to the ALU, one to the AXU), and execution is strictly in-order.

A2I isn't going to replace Microwatt. Microwatt is smaller and simpler, intended for small FPGAs and embedded projects, and is actively evolving by leaps and bounds to the point it can now boot Linux. More to the point, it is intended to be fully OpenPOWER compliant. A2I, however, despite being a fully realized core, is only ISA 2.06 compliant, lacks the radix MMU, lacks AltiVec, and at least right now lacks active developers. But it's small enough that with some work and a process shrink this could be the start of a mobile OpenPOWER system: at 7nm IBM claims it got up to 3.9GHz (their blurb at right claims even higher, to 4.2GHz). And it is indeed under the OpenPOWER license.

The really interesting question is what else might show up under @openpower-cores.

Don't just slack. Power Slack.

A port of Slackware to OpenPOWER ("Riscy Slack") is taking shape, something that delights me personally since Slackware was my first taste of Linux on an old 486 we christened calvin circa 1998. Never a distro for the novice, which is admittedly part of its charm, there's no handholding even on supported platforms and even less so here, so use at your own risk. The current build is based on a snapshot from Slackware64 current, though about a month old as of this writing, and you will need to download and extract the tarballs manually (no slackpkg support yet) with some tweaks (this is the described installation process right now). There is no specific support for POWER8, but X and KDE are apparently working, with some Qt issues still yet to be ironed out. Installation fragments are on a dedicated server and you can watch the progress on the porter's blog.

It's Talos all the way down

Still can't bear the sticker shock of your very own Talos II, or even an itty bitty Blackbird? Why not do what we all do for the machines we can't own and emulate one instead? (And then decide you like it a lot, and save your pennies?)

QEMU 5.0.0 offers a machine model for the bare-metal PowerNV profile, to which the Raptor systems and other OpenPOWER POWER8 and POWER9 designs intended for Linux (i.e., not PowerVM machines) belong. Using the Talos II firmware image (mostly: one snag to be mentioned), you can boot the machine in QEMU and from there bring up an operating system in emulation. In this article we'll prove it works by bringing up Void Linux for Power (hi, Daniel!) in a variety of configurations. A set-up like this might be enough to test that your software or open-source package builds and runs on OpenPOWER, even if you don't own one yet. In a future article we'll talk about how you can boot your own code on the metal so you can port your favourite OS or build a unikernel.

(For the purposes of this article I'll assume an audience that isn't as familiar with OpenPOWER terminology as our usual readership. Kindly humour me.)

The emulation is imperfect, whether you're emulating it on a real Raptor family system or on an icky PC. While QEMU can emulate an AST2500 (i.e., the ARM-based Baseboard Management Controller, which acts as the service processor and provides the video framebuffer), and QEMU can also emulate a PowerNV system, it doesn't do both at the same time. That means the very lowest levels are actually being simulated here -- you can't watch Raptor's pretty Hostboot display, for example, and only the barest functions of the BMC are simulated enough to allow bring-up, not including the framebuffer. In fact, the hardware profiles we will use here do not in general match a real Raptor system either: we're just virtually plugging in PCI devices that give us necessary functionality, though of course none of the peripheral devices in a Raptor system is Raptor-proprietary. Finally, even though I have tagged this entry with KVM, KVM currently doesn't work right with the QEMU PowerNV machine model even though I'm pretty sure it should be technically possible. Sadly, I tried in vain to do so, could never get KVM-HV to be happy, and ended up kernel panicking the machine with KVM-PR. See if you can triumph where I have failed. In the meantime, naturally you can do everything here on a T2 or Blackbird as well because that's how I did it writing this article, but there is no special acceleration for those systems right now.

The first order of business is the first order of business with any emulator: get the ROMs. Fortunately, no one is going to bust you for pirating a set of these because we're an open platform, remember?

The two pieces required are Skiboot and Petitboot, both of which live in the system's PNOR flash. Skiboot contains OPAL, the OpenPOWER Abstraction Layer. It comes in after the BMC has turned on main power and started the Power CPUs' self-boot engines, which then IPL ("initial program load") Hostboot for the second-stage power-on sequence. When Hostboot completes, it chains into Skiboot, which initializes the PCIe host bus controllers (PHBs) and provides all the basic hardware calls needed by a guest kernel to support the platform. You can think of it as something like an overgrown BIOS. This is the lowest firmware level of an OpenPOWER system that QEMU currently supports emulating.

Skiboot lives only to service a kernel, so it immediately starts one. This initial payload is the bootloader for Petitboot, which is also stored in firmware. Petitboot has a small Linux root (Skiroot) and acts as a boot menu, finding bootable volumes on attached devices or over the network. Having found one (or you select one), it chains into it to start the main OS, and from then on Skiboot will provide platform services via OPAL for this final guest until the system is shut down or restarted. Because it's in firmware, Petitboot is always available, which can come in really handy when you're trying to do system recovery.

The first, best and most dedicated way is to build Skiboot and Petitboot yourself. They are open-source and the process is relatively well documented and automated, and you should know how to do this if you own an OpenPOWER machine anyhow. If you aren't doing this on a real OpenPOWER machine you'll need a cross-compiler, but most Linux distros offer such a package nowadays. Do keep in mind that if it looks like you're building a tiny Linux distro, well, that's because that's exactly what you're doing. The advantage here is you can fool around with the firmware at your leisure, but it requires a bit of an investment in disk space and time.

The second way assumes you have a more casual interest and would prefer to go with something prefab. If you (or, you know, your "friend") have a Raptor-family system, it's possible to extract the necessary components right from the BMC prompt. Log into the BMC over SSH (or via direct serial connection) and type pflash -i. You'll see a list of all the partitions stored in the PNOR flash. The ones we want are PAYLOAD (which contains Skiboot) and BOOTKERNEL (which contains Skiroot and Petitboot). The exact addresses may vary from system to system and firmware to firmware.

root@bmc:~# pflash -P PAYLOAD -r /tmp/pnor.PAYLOAD --skip=4096
Reading to "/tmp/pnor.PAYLOAD" from 0x021a1000..0x022a1000 !
[==================================================] 100%
root@bmc:~# pflash -P BOOTKERNEL -r /tmp/pnor.BOOTKERNEL --skip=4096
Reading to "/tmp/pnor.BOOTKERNEL" from 0x022a1000..0x03821000 !
[==================================================] 100%

We skip the first 4K page to strip off the wrapper at the front of each partition. pnor.PAYLOAD is actually xz-compressed and needs to be decompressed prior to use, so:

root@bmc:~# cd /tmp
root@bmc:/tmp# xz -d < pnor.PAYLOAD > skiboot.lid

Finally, scp both skiboot.lid and pnor.BOOTKERNEL to your desired system from the BMC.
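If you want a sanity check before copying, the expected file sizes can be worked out from the address ranges pflash printed. This is only a sketch of the arithmetic: part_kib is a name made up here (it is not part of pflash), it assumes the printed range covers the whole partition with the skipped bytes taken off the front, and the addresses are the ones from the transcript above, which may differ on your system.

```shell
# part_kib START END [SKIP]
# Hypothetical helper: size in KiB of a partition dump, given the start and
# end addresses pflash printed and the number of bytes skipped at the front.
part_kib() {
    start=$1
    end=$2
    skip=${3:-0}
    # $start/$end are expanded textually so hex constants like 0x... work.
    echo $(( ($end - $start - $skip) / 1024 ))
}

part_kib 0x021a1000 0x022a1000 4096   # PAYLOAD
part_kib 0x022a1000 0x03821000 4096   # BOOTKERNEL
```

For the ranges shown this prints 1020 and 22012, which you can compare against the byte sizes from ls -l divided by 1024.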

Admittedly we just talked at length about the two ways most of you won't get the firmware, so let's talk about the third method and the way most of you will, i.e., you'll just download it. Currently there is an irregularity about Raptor's present Skiboot build for this purpose: it only boots if you are emulating a single POWER8. That's not a typo. If you use it to boot an emulated POWER9, the guest will simply panic, and the guest will go into a bootloop if you are emulating multiple POWER8 CPUs (necessary if you need a larger number of PCIe devices). This is undoubtedly a QEMU deficiency which will be corrected in future releases. In the meantime, if you just care about playing around using a single POWER8 on a terminal, then Raptor's builds (either from BMC flash or downloaded) will suffice. However, if you intend to emulate a POWER9 or SMP POWER8 system, download QEMU's own pre-built skiboot.lid and use that instead.

For Petitboot, we will extract that directly from Raptor's PNOR images. Assuming you didn't get it using the process above, download the current Talos II PNOR image and decompress it. In the shell_upgrade directory you will see the bzip2-compressed PNOR image. Uncompress that, leaving you with a filename like talos-ii-v2.00.pnor. Download my pnorex extractor tool (it's in Perl, because I'm one of those people) and run it on the PNOR image:

% pnorex talos-ii-v2.00.pnor
Version 1 PNOR archive with 33 entries.
Extracting PAYLOAD at offset 8601.
This is a xz format image.
Wrote 1020K successfully.
Extracting BOOTKERNEL at offset 8857.
This is an ELF executable image.
Wrote 22012K successfully.
Extracted 2 partitions successfully.

If you will be using Raptor's Skiboot, then uncompress pnor.PAYLOAD to skiboot.lid as above: xz -d < pnor.PAYLOAD > skiboot.lid
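pnorex reports one image as xz and the other as ELF. If you pulled the files off the BMC instead and want to confirm which is which before feeding them to QEMU, the leading magic bytes will tell you. Here is a minimal sketch using only od and tr; image_kind is a name invented for this example, and it only recognizes the two formats mentioned above.

```shell
# image_kind FILE
# Hypothetical helper: classify a file by its leading magic bytes.
#   xz stream: fd 37 7a 58 5a 00
#   ELF:       7f 45 4c 46
image_kind() {
    # Dump the first six bytes as hex and strip all whitespace.
    magic=$(od -An -tx1 -N6 "$1" | tr -d '[:space:]')
    case "$magic" in
        fd377a585a00) echo xz ;;
        7f454c46*)    echo elf ;;
        *)            echo unknown ;;
    esac
}
```

Running image_kind pnor.PAYLOAD should say xz if the partition still needs decompressing, and image_kind pnor.BOOTKERNEL should say elf.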

Now, with skiboot.lid (for this first example, either Raptor's or QEMU's) and pnor.BOOTKERNEL in the same folder, grab an ISO you want to boot. I used the prefab one Daniel offers on the Void Linux for Power site since I know it boots fine on OpenPOWER hardware. For our first example, let's simply boot Void from a CD image on a POWER8 using the serial port. Our QEMU command line:

qemu-system-ppc64 -M powernv8 -m 4G -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-device ich9-ahci,id=ahci0 \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0

This configures a single-processor POWER8 system with 4GB of RAM, no graphics, and an Intel AHCI host controller with a single CD-ROM drive attached. The serial output should go to your terminal. It goes a little like this:

Here we are with Skiboot chaining into Petitboot. You can ignore the errors; there will be a lot of them since the platform is still incomplete. It will take a little bit of time to decompress the kernel (much slower than it would be on a regular system). You will notice a single device attached to the three available PCIe host bridges on the single POWER8 CPU, i.e., the host controller itself. Don't you just love that the vendor code for Intel is 8086?

This is Petitboot. When the bootable choices appear, cursor up to the starred option and press E before it autoboots, because we need to tell Void its console is the on-board serial port (otherwise it uses a VGA console: not sure whose bug that is).

Add console=hvc0 at the end, cursor down to OK and hit RETURN/ENTER a couple times to boot.

A successful login on your emulated baby POWER8. Ta-daa! To rudely pull the plug on the QEMU session, press Ctrl-A, and then X (QEMU: Terminated).
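Since every command line in this article needs the same three files, it can save a slow false start to check they are all present first. A small sketch, assuming the filenames used above; missing_files is a hypothetical helper, not anything QEMU provides.

```shell
# missing_files FILE...
# Hypothetical helper: print each named file that does not exist
# (or is not a regular file), one per line. Prints nothing if all are there.
missing_files() {
    for f in "$@"; do
        [ -f "$f" ] || echo "$f"
    done
}

# Example: refuse to continue if anything is missing.
check=$(missing_files ./skiboot.lid ./pnor.BOOTKERNEL \
        ./void-live-ppc64le-musl-20200411.iso)
if [ -n "$check" ]; then
    echo "missing: $check" >&2
fi
```

Swap the echo for an exit 1 if you put this at the top of a launcher script.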

Let's now load out the POWER8. We would like to add a video card, an Ethernet card and a USB controller to our existing system, but POWER8 Turismo chips only offer enough PHBs for three PCI endpoints. How do we solve this problem? Easy: we'll add another processor!
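The arithmetic behind that is just a ceiling division. As a sketch, assuming one PCI endpoint per PHB as described here, with sockets_needed being a made-up name:

```shell
# sockets_needed DEVICES PHBS_PER_CHIP
# Hypothetical helper: how many emulated chips are needed so that every
# PCI device gets its own PHB (rounds up).
sockets_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# AHCI + VGA + NIC + USB controller = 4 devices, 3 PHBs per POWER8 chip
sockets_needed 4 3
```

This prints 2. The same function works for any other per-chip PHB count.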

At this point you will require the QEMU Skiboot and should use that where skiboot.lid appears in the remainder of this article. I use tun/tap networking in this example, which assumes you already have tap0 configured and up; change the -netdev setting if you want to use a different means of bridging the NIC. This example keeps the AHCI host controller and still displays debug output on the terminal, but uses the QEMU emulated VGA as a console instead and adds a good old Realtek 8139 NIC with a USB mouse and keyboard attached to a QEMU XHCI USB 3.0 controller.

qemu-system-ppc64 -M powernv8 -cpu power8 -m 4G -smp 2 \
-serial mon:stdio \
-device VGA \
-device ich9-ahci,id=ahci0,bus=pcie.0 \
-netdev tap,id=nic0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=nic0,bus=pcie.1 \
-device qemu-xhci,id=usb0,bus=pcie.2 \
-device usb-mouse \
-device usb-kbd \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0
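The command above assumes tap0 already exists. If it doesn't, here is a minimal sketch of creating one with iproute2 on Linux. It deliberately prints the commands rather than running them so you can look them over first; tap_setup_cmds is a hypothetical name, and bridging the tap to a physical NIC (or routing it) is a separate step not shown here.

```shell
# tap_setup_cmds [TAP] [USER]
# Hypothetical helper: print the iproute2 commands that create a tap
# interface owned by USER and bring it up. Nothing is executed.
tap_setup_cmds() {
    tap=${1:-tap0}
    user=${2:-$(id -un)}
    echo "ip tuntap add dev $tap mode tap user $user"
    echo "ip link set dev $tap up"
}

tap_setup_cmds tap0
```

Once you're happy with the output, run the two printed commands as root.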

Let's spin this sucker like Superman's cape in a dryer:

I keep the serial output because the extra CPU adds around an extra minute on this T2 to get to Petitboot. Here, you will notice we now have six PHBs available, three per CPU, so now we have enough virtual PCI slots for the peripherals we require.

Petitboot shows up on both the 2D framebuffer and the serial terminal, and both work. You'll also see it probing the bridged Ethernet tap to see if it can boot that way, proving our Ethernet device is up and working. Whichever you use is where boot messages will go, so we'll use the framebuffer as console and start Void by cursoring up and selecting the starred option (thus also proving our USB devices work).

Having booted Void, we can now demonstrate the PCI cards in the system, the attached peripherals and the number of CPUs. For the record, the DD2.3 POWER9 I'm typing this on shows its Spectre v2 status as "mitigated" with hardware acceleration.

Starting the Installer, which won't install anything because we haven't configured any storage to install to in our QEMU options. I'll leave that as an exercise to the reader.

If we switch to an emulated POWER9 system, Sforza CPUs support six PCI endpoints, so we get six PHBs. This means a single CPU is more than enough for our basic configuration without adding additional startup time. The QEMU command line to do so merely returns to single processor and changes the machine to powernv9 and the CPU to power9, i.e.,

qemu-system-ppc64 -M powernv9 -cpu power9 -m 4G \
-serial mon:stdio \
-device VGA \
-device ich9-ahci,id=ahci0,bus=pcie.0 \
-netdev tap,id=nic0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=nic0,bus=pcie.1 \
-device qemu-xhci,id=usb0,bus=pcie.2 \
-device usb-mouse \
-device usb-kbd \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0

and it runs in the same way, but faster, because the emulation overhead is less. So let's totally do something stupid as our last parlour trick and run a POWER9 configuration with as many sockets as QEMU will let us hold (which right now is four). Note that these are all single-threaded cores, so this is still much less powerful than even a 4-core basic Blackbird.

qemu-system-ppc64 -M powernv9 -cpu power9 -m 4G -smp 4 \
-serial mon:stdio \
-device VGA \
-device ich9-ahci,id=ahci0,bus=pcie.0 \
-netdev tap,id=nic0,ifname=tap0,script=no,downscript=no \
-device rtl8139,netdev=nic0,bus=pcie.1 \
-device qemu-xhci,id=usb0,bus=pcie.2 \
-device usb-mouse \
-device usb-kbd \
-bios ./skiboot.lid \
-kernel ./pnor.BOOTKERNEL \
-drive id=cd0,media=cdrom,file=void-live-ppc64le-musl-20200411.iso,if=none \
-device ide-cd,bus=ahci0.0,drive=cd0

With four emulated CPUs, startup took over seven minutes from start to Petitboot on this dual-8 Talos II, so have patience if you're on a lesser workstation, but it does work:

You can see the watchdog complaining about the length of time OPAL calls are taking now (call 128 resets the XIVE VM interrupt controller on POWER9 chips). But we do have our four cores, and it's not impossibly slow on a beefy enough system (like another POWER9).

Incidentally, while the Power ISA emulation in QEMU allows SMT, it's very basic and not enough to get through the boot-up sequence, or at least not before the heat death of the universe. If you like listening to your cooling fans, see what happens when you try to emulate the biggest baddest dual-22 Talos II by adding -accel tcg,thread=multi -smp 176,threads=4,cores=22,sockets=2 to your QEMU command line. It's not pretty. That's why you should buy an OpenPOWER machine of your own instead of emulating one.
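For the morbidly curious, the 176 in that option is simply sockets times cores times threads. A sketch of building the -smp value for other topologies, with smp_arg being a name invented here:

```shell
# smp_arg SOCKETS CORES THREADS
# Hypothetical helper: build the value for QEMU's -smp option, where the
# leading total is sockets * cores-per-socket * threads-per-core.
smp_arg() {
    total=$(( $1 * $2 * $3 ))
    echo "$total,threads=$3,cores=$2,sockets=$1"
}

smp_arg 2 22 4
```

This prints 176,threads=4,cores=22,sockets=2, the same string as above. Whether the result boots in any reasonable time is, as noted, another matter.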


Today's featured entry in the increasingly inaccurately named #ShowUsYourTalos series is Karl S.'s Blackbird system, a 4-core unit with 32GB of RAM and a Sapphire RX5500 XT GPU in a rather arresting NZXT H400 case with red accents. A complete bill of materials and prices are proffered for your review. Mind the caution sticker, we wouldn't want to crack the glass.

If you have an OpenPOWER system you'd like to show off, post in the comments. Other than my personal T2, we haven't had any other Talos systems yet, but POWER8s, other POWER9s and of course Blackbirds are always welcome.

Firefox 77 on POWER

Firefox 77 is released. I really couldn't care less about Pocket recommendations, and I don't know who was clamouring for that exactly because everybody be tripping recommendations, but better accessibility options are always welcome and the debugging and developer tools improvements sound really nice. This post is being typed in it.

There are no OpenPOWER-specific changes in Fx77, though a few compilation issues were fixed expeditiously through Dan Horák's testing just in time for the Fx78 beta. Daniel Kolesa reported an issue with system NSS 3.52 and WebRTC, but I have not heard if this is still a problem (at least on the v2 ABI), and I always build using in-tree NSS myself which seems to be fine. This morning Daniel Pocock sent me a basic query of 64-bit Power ISA bugs yet to be fixed in Firefox; I suspect some are dupes (I closed one just this morning which I know I fixed myself already), and many are endian-specific, but we should try whittling down that list (and, as usual, LTO and PGO still need to be investigated further). I'm still using the same .mozconfigs from Firefox 67.

In a minor moment of self-promotion, I'm also shamelessly reminding readers that Fx77 comes out in parallel with TenFourFox Feature Parity Release 23, relevant to Talospace readers because I made some fixes to its Content Security Policy support to properly support the web-based OpenBMC with System Package 2.00. Although the serial console-LAN redirector has some stuttery keystrokes, I think this is a timing problem rather than a feature deficiency, and everything else generally works. Connecting over ssh or serial port is naturally always an option, but I have to agree the web OpenBMC is a lot nicer and some tasks are certainly easier that way. If you're a long-term PowerPC dweeb like me and you want to use your beloved Power Mac to manage your brand-spanking-new Talos II or Blackbird, now you can.