Showing posts from 2020

Void PPC goes little-endian ... on 32-bit

It is frequently forgotten that just as 64-bit Power ISA comes in both big-endian (powerpc64, ppc64) and little-endian (ppc64le) variants, you can also have 32-bit little-endian (ppcle) as well as the classic 32-bit big-endian PowerPC most often encountered in Power Mac hardware. But such 32-bit little-endian systems have historically been quite rare (I struggle even to think of any), and distributions supporting them even more so.

So leave it to those wacky VoidPPC developers to not only spin a ppcle variant, but to even make it useful. It should be pointed out that such a distribution can run on little-endian OpenPOWER because there isn't the massive gulf between 32-bit and 64-bit PowerPC that there is between x86 and x86_64. Accordingly, Daniel Kolesa demonstrated on Twitter a Blackbird running Void Linux as a 32-bit little-endian system. Why do this on a 64-bit-capable platform? Because it lets you run certain 32-bit-only emulation systems like Box86, which by design doesn't run on 64-bit platforms (and Box86 wants 4K memory pages, which Void provides). And, true to form, the Blackbird was running the 32-bit Unreal Tournament through Box86 at "fairly playable speed" — a game which was only ever made available for Linux as a 32-bit x86 binary (the PowerPC binaries were only for MacOS and Amiga). Such a feat with apparently acceptable performance is even more impressive given that Box86 doesn't have JIT support for ppcle either.
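
Whether a given userland fits Box86's constraints can be sanity-checked from a shell; this sketch just reads the two properties in question (both commands are standard on Linux):

```shell
# Box86 wants a 32-bit little-endian userland with 4K memory pages.
getconf PAGESIZE   # Void's ppcle spin reports 4096; stock ppc64le distros often use 65536
uname -m           # the architecture the kernel reports for this userland
```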

The ppcle spin (both glibc and musl variants) is already listed on the stats page and the port should be available shortly. Because of the unusualness of the architecture, cross-compilation depending on LLVM (and things that use it like Rust) may not yet work, but a significant slice of ports appear to already be built.

Linux 5.10

Linus Torvalds has tagged Linux 5.10, which will be the next long term support release. Despite a relatively small merge window, it includes nearly 14,000 commits.

Big new features in 5.10 include an ext4 performance improvement that reduces the amount of journal metadata written for crash recovery (unfortunately this feature currently has to be enabled when the volume is mkfsed), secure ring sharing for io_uring, an API for manager processes to supply memory hints to other processes (I'm sure that won't be abused by anyone), support for static calls with in-place code patching as a better post-Spectre function-pointer replacement than retpolines, widened timestamps on xfs to handle Y2038 (this wasn't already a thing? existing filesystems must be explicitly converted), lots of BPF improvements and many new drivers.

On the Power ISA specific side, 5.10 adds further support for POWER10 (shallow stop states and new watchpoint features), a fix for the three of you using a POWER9 in HPT mode with 4K pages and more than 16TB of RAM (or RAM attached to a second node), a filter for RTAS firmware calls to protect kernel memory, and better topology-aware scheduling on POWER9 and POWER10.

The saddest thing announced in this kernel, however, is the end of support for the original PowerPC 601. This is understandable as the 601 was always intended as a stepping-stone CPU and had important differences from classic PowerPC as later implemented in the 603, which had to be accounted for in system software. Nevertheless, if you've still got a Power Mac 6100 in your closet running Linux, your last call in LTS kernels is 5.8 (or 5.9, if you need the features).

There is no CentOS 8, there is only Stream

When IBM bought Red Hat in 2018 (Red Hat, of course, maintains Red Hat Enterprise Linux as a paid product, Fedora as its free community upstream and, since 2014, CentOS as the de facto free version of RHEL), I had high hopes, as a Fedora user since it first booted on the Talos II, that OpenPOWER would finally be a first-class citizen on par with x86_64 under IBM's hopefully gentle goading. No more of this being in the alternative-architectures penalty box for Fedora Workstation, for example. The idea was that IBM would see Red Hat's free open products as a logical extension of OpenPOWER and exploit the obvious synergy by making a free distribution available as a preferred choice on an open platform (and then the customers who want greater support and enterprise features could pony up). Win-win, right?

Well, first-class OpenPOWER Fedora still hasn't happened, and while RHEL remains perfectly happy to take your POWER8/POWER9 money, CentOS — or at least CentOS the way you've understood it, i.e., RHEL without the price or support contracts — is dead. There is no CentOS, there is only Stream (after 2021, that is, though CentOS 7 will finish its lifecycle as usual).

Let's be a little less handwringy, though, as Red Hat could have done a better job explaining what this means. As I so presciently determined back in 2019, Stream was clearly positioned as the "public beta" for RHEL and, at that time, for CentOS. It still is; this "merely" means there will be no stable channel. I also presciently determined in that 2019 article that there would be a relatively small slice of people who want "just enough" innovation compared to us on the bloody (Fedora) or bleeding (Rawhide) edge, but still prefer somewhat more current updates than those on regular RHEL or CentOS. Unexpectedly, Red Hat appears to have solved this problem by just eliminating classic CentOS. But people willingly pay good money for RHEL — or, you know, use CentOS — because stuff doesn't break much. Stream eliminates that "doesn't break" guarantee, as much as a free distribution could make such a guarantee in the first place, though it's definitely much less churn than Fedora and for many users will still fill the bill. Unlike Fedora, on CentOS Stream 8 you won't be forced to dogfood Stream 9 any earlier than an RHEL 8 user would be, though you may be forced to deal with 8.x.

There are certainly other RHEL downstreams because there must be (it must be open: that's how CentOS started in the first place). Amusingly, some of these are other proprietary vendor Linuces (Oracle Linux, Hewlett-Packard Enterprise ClearOS, etc.). But, and now getting to the OpenPOWER specific portion of this article, none of the free (as in beer) options run on POWER8 or POWER9. Springdale Linux, one of the few free rebuilds, is strictly x86, and Oracle Linux is free to download but only supported on aarch64 and x86_64 (sorry, SPARC). ClearOS is free but stuck on RHEL 7 and doesn't run on OpenPOWER either, and while CloudLinux claims to run on anything RHEL does, it costs money, so you might as well run RHEL unless you need its specific value-added features. [UPDATE: CloudLinux is now announcing a free community version in Q1 2021. No word on OpenPOWER support, but we're hopeful. Thanks Dimitris Z for reporting it.]

That leaves the recently announced Rocky Linux, led by Gregory Kurtzer, founder of the CentOS project. Rocky Linux aims to basically be what CentOS was originally: a downstream build of RHEL, without the branding or the fees. But all it is right now is an idea and no downloads are available, nor any indication that OpenPOWER will be supported, at least not as of this writing. When they come to a decision on that you'll hear it here first.

On the whole this announcement is probably of little concern for people using their OpenPOWER machines as workstations, because most of those that run a Red Hat derivative are probably running Fedora (yours truly included). A few will be running CentOS Stream, but that isn't going anywhere. Where this hurts is those individuals who wanted the superstability of CentOS without the supercost of RHEL, and that probably applies to a substantial number of people running OpenPOWER servers in high-availability environments. Many of these people will still be reasonably well served by Stream, but a few are so risk-averse that even Stream's small amount of turnover won't do for them either. That's no skin off IBM's nose because they'd rather have them as customers paying support fees, but it's not a good look for free computing, and it's not a good look for Red Hat specifically.

Which brings me to another prediction I made: "our worry is that the IBM monolith will affect Red Hat far more than the other way around." Is it my curse to always be right?

Where's Axon?

Last year, before a rampaging virus ate the globe (we're told the locusts are coming as this article goes to press), IBM announced more details on what would be the last iteration of POWER9 at the OpenPOWER Summit, the "Advanced I/O" flavour variously codenamed POWER9 AIO, POWER9 Prime, Axon and (in a few places) Axone. As POWER10 neared availability, the final POWER9 generation was due to come out in 2020, but now into the beginning of December there's no such chip. So what happened?

Axon, as it happens, was on the roadmap under various names for quite a while. IBM has always prioritized bandwidth as a market differentiator against commodity x86 hardware, and memory bandwidth and I/O are two things IBM POWER chips have in spades. To maintain this competitive advantage, in 2018 IBM announced POWER9 AIO with OpenCAPI 4.0 (up from 3.0 in the Nimbus-class POWER9 CPUs in this Raptor Talos II workstation), NVLink 3.0 (up from 2.0), plus CAPI 2.0, 48 PCIe 4.0 lanes and up to 350GB/s of memory bandwidth (compared to "just" 150GB/s in Nimbus, and 210GB/s in Cumulus with Centaur memory buffers). Back then it was slated for 2019; POWER10 was due in "2020+".

That date obviously slipped, so IBM came back in 2019 and announced AIO remained on the roadmap but this time for 2020 with a memory bandwidth of 650 GB/s using the new Open Memory Interface; instead of putting the Centaur buffers on the board, OMI now allows RAM vendors to put them right on the DIMMs. Again with an eye to their biggest competitive advantage, IBM promoted it as a "Bandwidth Beast" and gave it the name "AXON": "the ‘AX’ representing [symmetric multiprocessing and up to 24 SMT-4 cores], ‘O’ representing OpenCAPI and the ‘N’ representing NVLink. Think neuron-to-neuron axon connections in the brain," according to Jeff Stuecheli, POWER hardware architect. The variety of available interconnects was clearly positioned to succeed both the current scale-out Nimbus and scale-up Cumulus POWER9s and squeeze one more processor generation out of the hardware.

IBM may sometimes be overly bureaucratic but they don't generally leave money on the table, and if demand existed for such a product, odds are they'd deliver. However, while the memory bandwidth is considerably greater, that benefit would only be realized in an OMI system, which would put stress on both IBM and vendors like Raptor and Tyan to produce one. So blame COVID-19: no one really wants to be in the business of producing a stopgap design in the middle of a pandemic when their currently shipping systems are already imperiled by supply chain issues. Likewise, while OpenCAPI 4.0 and NVLink 3.0 are a nice bump, they're not enough on their own to justify that sort of investment when the identical process node suggests no improvement in raw compute (and, as probable confirmation, both Cumulus and Axon are described with the identical phrase "enhanced microarchitecture").

So what happened to Axon? 2020 happened. Ordinarily one could phone in such an upgrade but by now there's obviously not enough money in another go-around and the systems that would take best advantage of it don't and won't exist. For those of us on OpenPOWER workstations, our upgrade path for at least the next year is more cores with differently binned chips that don't require anything more than a board that can support their power and cooling requirements. POWER10 would require more design investment than POWER9 Axon, but not a lot more by comparison, and that investment is justified by the more and better reasons to buy a POWER10 server (assuming the openness issues are resolved) than there would be to buy this; furthermore, IBM is already integrating Axon's improvements into POWER10's communication fabric and marketing it as PowerAXON, making POWER9's iteration of it superfluous. These are powerful systems that vendors can actually sell, even in a down economy. At 7nm and with ISA enhancements, PCIe 5 and OMI out of the box, POWER10 should be faster and beefier all around, and it will certainly be more than Axon would have been.

Is this the RISC-V PC?

We earlier reported that SiFive was apparently planning the first production RISC-V workstation for an October release, based on their forthcoming U740. No such system by that name has emerged, and I think it's reasonable to assume there won't be a RISC-V PC as such. Instead, SiFive is apparently offering the HiFive Unmatched, a mini-ITX motherboard successor to the U540-based HiFive Unleashed. This appeared about a month ago and they've announced nothing else since, so I'm concluding this is it.

Although SiFive's page still doesn't say, other sites indicate the 64-bit main U740 CPU (four U74 cores and one "little" S7 core) is clocked at 1.4GHz. There is 2MB of L2 cache and no L3, plus 32MB of flash and 8GB of DDR4 RAM onboard; it doesn't look like there are any slots for more. A single 16x PCIe 3 slot is available for cards, though it also has Gigabit Ethernet, four USB 3.2 ports, an M.2 E-keyed slot (PCIe x1) for wireless connectivity and a MicroUSB console port (presumably some sort of serial console). For your storage options it offers on-board microSD and another M.2 M-keyed slot (PCIe x4) with NVMe support. That's good, since it lacks a BMC-like console, so you'd almost certainly need to install a video card in that single PCIe slot to use it as an effective workstation. Additional devices and creature comforts would have to be attached by USB.

The price on Crowd Supply is a surprisingly reasonable $665 for the bare board; you add your own case, PSU and peripherals. That's a little over half the cost of a Blackbird board and even includes the CPU and RAM (as long as you don't need to upgrade them), plus a 32GB MicroSD card and 3-metre Cat5e cable, though I'd rather forgo these towards the cost of a GPU. Before you go thinking this is a Blackbird-beater, however, remember even the wimpiest Blackbird with the lowliest 4-core POWER9 (like the one I have in the home theatre) would clean the floor with this system and the Blackbird has substantially more options built-in (two PCIe slots, DIMM slots, SATA, USB, audio, HDMI, three network ports and a "classic" DE-9 serial console). While Raptor is still backordered on the Blackbird, the T2 Lites they do have in stock would serve you even better in the CPU and PCIe department, even if the size is a little inconvenient and they don't have all the creature comforts onboard. And as for comparing it with the big T2, well, let's just all have a good laugh and get on with our day. There's a reason they cost more.

The other, more relevant question is how "open" this machine actually is. RISC-V is indisputably an open ISA and always has been, but it's everything else on the board that's the question mark. While SiFive offers the Freedom U SDK to build your own custom Linux distribution, that's not the same as being able to control it from the firmware up like the POWER8 and POWER9. SiFive notably doesn't make any claims about the lowlevel firmware and I don't think it's cynical in this case to assume that they don't provide source or a means to build it; your control of the system therefore starts at U-Boot. Otherwise, why not be up front about it, since everything else in the RISC-V ecosystem is all about openness?

If this is really the RISC-V PC they promised it sounds like a decent system for the money, certainly on par or above the high-end RPis people already try to make workstations out of, and I always encourage anything that weakens the x86 monoculture. For that matter, I'm actually toying with getting one myself for comparison purposes, since right now it looks like the most convenient way of experimenting with RISC-V other than an evaluation board. But also know that you're not getting a system on par with OpenPOWER performance or owner control, at least not in this iteration, and a lot of engineering work and a bit of policy change will have to both occur for that to happen. As of this writing, 113 boards have been backed on CrowdSupply with 16 days to go, and boards are expected to ship January 15, 2021. If you're picking one up or know more, post in the comments.

Firefox 83 on POWER

LTO-PGO is still working great in Firefox 83, which expands in-browser PDF support, adds additional features to Picture-in-Picture (still one of my favourite tools in Firefox) and makes miscellaneous developer-facing changes. The exact same process, configs and patches used for a fully link-time and profile-guided optimized build of Firefox 82 work unchanged for Firefox 83.

Dan Horák has filed three bugs (1679271, 1679272 and 1679273) for build failures related to the internal profiler, which is still not supported on ppc64, ppc64le or s390x (or, for that matter, 32-bit PowerPC). These targeted fixes should land well before release, but perhaps we should be thinking about how to get the profiler working on OpenPOWER rather than playing emergency games of whack-a-mole whenever the build blows up.

Guix port to OpenPOWER

Since we already have FSF RYF certification for the Talos II and T2 Lite, why not run an FSF package manager on it too? Tobias Platen has announced an official branch for a port of Guix to ppc64le, based on the existing 32-bit PowerPC port of Guix that can apparently already run on big-endian ppc64, harking back to his initial work in 2019. The port is a bigger effort than it might naïvely appear, though: an updated gcc and glibc are required (and possibly a different bootstrap gcc as well), plus potential 64-bit fixes to Guile and a reworking and update of the heavily x86-centric GNU Mes, the combination Scheme interpreter and C compiler, to bootstrap the full GNU Guix System. If you're interested in finding out more, watch Tobias' video.

Vikings' first order with Raptor is go

Quick news bite: Vikings has their first order in with Raptor and will go live with a selection of OpenPOWER systems in their on-line store as soon as they arrive. We don't know yet how many units they'll stock, but the fact they'll have some on the other side of the Atlantic is good news for current and potential European (and, hey, probably other) customers for whom shipping and exchange rates to and from the United States could be prohibitive. Given that Raptor still lists the Blackbird as backordered, it's most likely the initial selection will be T2 and T2 Lites, but as someone very happy with the T2 he's typing on, a Talos II is still a heck of a computer. We'll be sure to advise more about what's available and options for RMAs and support as information is revealed.

Fedora 33 mini-review on the Blackbird and Talos II

To avoid watching American election returns, it's time to report back on our traditional mini-review for the newest release of Fedora, F33. If you run it yourself hopefully this will help your upgrade go more smoothly, and even if you don't you should still care about it because bugs in packages and platforms usually pop up in Fedora's cutting edge first (after all, F28 was one of the first out-of-the-box distributions to even run on POWER9). Now that F33 has hit the release channel, F31 will become EOL in less than a month.

We test on both the Blackbird and the Talos II; T2 Lite owners should have a similar experience to the T2. However, one important configuration change for this review compared to my previous go-around with F32 is that I'm no longer running gdm on the Blackbird either (I've never run it on the T2). This was largely an accident of an F32 reinstall I did, where I installed the server variant and converted it to Fedora Workstation, same as I had originally done for my T2 back with F28. In this setup the system comes up in a text boot, you log in that way, and then manually startx or dbus-run-session gnome-session (with XDG_SESSION_TYPE=wayland or as appropriate) to launch GNOME. Besides speeding up startup a bit, you avoid the pitfalls of a graphical start and can much more easily recover without having to do an emergency boot into the installer. This and future reviews of Fedora will be done in this configuration, which simply eliminates a whole class of issues I used to have on the Blackbird in particular.
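
For reference, the manual-start commands described above can be wrapped in a couple of tiny shell helpers for your login profile (the helper names are mine, not Fedora's; the underlying commands are the stock ones):

```shell
# Start GNOME from a text-console login, no display manager involved.
gnome_x11() { startx; }
gnome_wayland() { XDG_SESSION_TYPE=wayland dbus-run-session gnome-session; }
```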

As before, the general upgrade steps are

sudo dnf upgrade --refresh # upgrade DNF
sudo dnf install dnf-plugin-system-upgrade # install upgrade plugin if not already done
sudo dnf system-upgrade download --refresh --releasever=33 # download F33 packages
sudo dnf system-upgrade reboot # reboot into upgrader

When the system reboots, manually select the kernel directly from Petitboot to get a more verbose boot rather than just waiting for it to automatically start. This let me watch the install in text mode for a change. If you don't do this, your system may go to a black screen; pick another VTY with CTRL-ALT-F2 or something, log in as root and periodically issue dnf system-upgrade log --number=-1 to watch the hot hot action.

The Blackbird is my "early warning" system to catch bad updates before I tank this daily driver T2. However, perhaps because it has a vanilla install of F32 on it, it updated without any problems whatsoever, and all applications that I usually use on it (Firefox, LibreOffice, etc.) ran without issues under Xorg in 1920x1080 with the usual manually specified xorg.conf. I didn't notice much performance improvement or change, but nothing seemed to regress particularly either.

I also usually do a token test of Wayland on the Blackbird as well, which, because I run it "stripped" with no GPU and with only the BMC as a framebuffer, is invariably an unusable catastrophe. But, to my surprise, not this time:

I'm a Never Wayland and I acknowledge my biases, but there has been clear improvement in its usability without a GPU, which right now is essential to run these systems "fully libre" (or at least as cheaply as possible). I suspect the LLVM update is responsible and sufficiently juices llvmpipe, but regardless of the reason the system was much more responsive and all default applications seemed to work.

What didn't, though, was the screen resolution, which remained stuck at 1024x768 because support for the Blackbird's HDMI transceiver is still not in the shipping kernel. I grabbed edid-generator and tried making an EDID out of the known-working Xorg modeline, but it was ignored at bootup; dmesg said it didn't even load:

platform VGA-1: Direct firmware load for edid/bmc.bin failed with error -2

(Yes, the port name is VGA-1, despite being connected over HDMI.)

I also tried video=VGA-1:1920x1080@60e on the kernel command line and while the text boot obligingly came up in 1920x1080, when I started the GNOME session it just hung and never jumped to the graphic display, so back to Xorg. But credit where credit is due that it's getting better, whether Wayland or LLVM is responsible for the improvement.
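
For anyone repeating the experiment, the firmware-override mechanism works roughly like this (a sketch with assumed paths; bmc.bin stands in for the EDID built with edid-generator, and a scratch directory substitutes for the real /usr/lib/firmware so the sketch is safe to run as-is):

```shell
# Stage an EDID blob where the kernel's firmware loader can find it.
FWDIR="${FWDIR:-$(mktemp -d)}"                       # /usr/lib/firmware on a real system
mkdir -p "$FWDIR/edid"
[ -f bmc.bin ] || head -c 128 /dev/zero > bmc.bin    # stand-in 128-byte blob for this sketch
cp bmc.bin "$FWDIR/edid/bmc.bin"
# Then point the kernel at it on the command line:
#   drm.edid_firmware=VGA-1:edid/bmc.bin
echo "staged $FWDIR/edid/bmc.bin"
```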

The T2 is more complex because I have a lot more packages installed and a somewhat customized GNOME theme. Although I lost a few packages in F32, no packages were broken or needed to be backed out for F33, and the 5.0GB installation proceeded uneventfully. With fast reboots off, it also restarted properly.

Restarting into the new installation for the first time is usually where the problems start, though, and that's what happened again this go-around. The first problem was a crapload of repeated SELinux warnings, which turned out to be another permissions clash; it was eventually fixed with a restorecon -RFv /var/lib/rpm after a lot of totally family-friendly cursing. The second problem is the usual GNOME extension breakage as F33 moves from 3.36 to 3.38; Dash-To-Dock again refused to update and had to be manually reinstalled from the command line, as did User Themes. However, my hacked version of Argos survived without any changes. The drag-to-reorder feature new in 3.38 mostly works as advertised, though I'm used to apps moving to close the gap and that didn't seem to happen, but I do like the changes to Screenshot.

On the T2's BTO WX7100 GPU, GNOME 3.38 under Xorg was nice and snappy as always. I didn't notice really any performance improvement, but it seemed no worse either. Wayland did improve in this iteration and the games it used not to launch now seem to start properly under Xwayland, but it seemed a little less sprightly this time. Likewise, I'm highly reliant on appmodmap for my muscle memory which won't work with any current Wayland compositor, and while GNOME's new ability under Wayland to run multiple displays at different refresh rates is a nice new feature, I don't need it for my two displays. So back to Xorg. (If maintaining Xorg is such a paper cut with hydrochloric acid on it, then why don't we use Wayland for the low-level display stuff and just run everything in Xwayland on top of it? Why must we throw the baby out with the bathwater? I like all the hacks X lets me do.)

Anyway, F33 was a largely uneventful release, and I consider that a positive: while the normal little polish issues are still there, it didn't seem to require pulling more teeth than usual and overall has been working well for the last couple of days. What I really want, though, is for 128-bit long doubles to finally arrive, and I'd really like to see a push for this in F34. Me personally, I'm tired of having to hack MAME all the time just to play the same games my G5 can with MacMAME, but there are more practical and less first-world-problemy reasons for needing this feature as well. And it would really be a boon to the platform if we weren't still stuck in the Alternative Architectures penalty box every time too.

Updates: Fedora 33, FreeBSD 12.2, Ubuntu 20.10

Hot on the heels of Ubuntu 20.10 and 20.04.1 LTS (download the server flavour, and convert it to desktop if you like) comes Fedora 33. Ubuntu 20.10 upgrades to kernel 5.8, GNOME 3.38, QEMU 5 and OpenStack Victoria with an installer fix for OpenPOWER; Fedora 33 remains on 5.8 (5.9 likely to follow) but also includes GNOME 3.38, glibc 2.32 and LLVM 11, and additionally defaults to btrfs on Workstation (watch out if you change to a 4K page size: Fedora uses 64K pages, and filesystems generated on one are not currently compatible with the other). As previously mentioned, Fedora is important to me personally because it's what I run on my own T2 and Blackbird, so once the packages and late-breaking changes settle down I will do a mini-review (as I did for F32), but the change I've been waiting for (128-bit long doubles) is still not in F33 as they wait on glibc changes (maybe glibc 2.33).

And if you like your OpenPOWER systems but don't like Linux, FreeBSD 12.2 is out as well, with multiple security, bugfix and functionality upgrades for a wide variety of PowerPC and OpenPOWER-based systems. Big-endian is well-tested and little-endian is coming along (and snapshots should finally be in -CURRENT by the time you read this).

Firefox 82 on POWER goes PGO

You'll have noticed this post is rather tardy, since Firefox 82 has been out for the better part of a week, but I wanted to really drill down on a couple variables in our Firefox build configuration for OpenPOWER and also see if it was time to blow away a few persistent assumptions.

But let's not bury the lede here: after several days of screaming, ranting and scaring the cat with various failures, this blog post is finally being typed in a fully profile-guided and link-time optimized Firefox 82 tuned for POWER9 little-endian. Although it multiplies compile time by nearly a factor of 3 and the build process intermittently can consume a terrifying amount of memory, the PGO-LTO build is roughly 25% faster than the LTO-only build, which was already 4% faster than the "baseline" -O3 -mcpu=power9 build. That's worth an 84-minute coffee break! (-j24 on a dual-8 Talos II [64 threads], 64GB RAM.)

The problem with PGO and gcc (at least gcc 10, anyway) is that all the .gcda files end up in the same directory as the built objects in an instrumented build. The build system, which is now heavily clang-centric (despite the docs, gcc is clearly Tier 2, since this and other things don't work), does not know how to handle or transfer the resulting profile data and bombs after running the test load. We don't build with clang because in previous attempts it never managed to fully build the browser on ppc64le and I'm sceptical of its code quality on this platform anyway, but since I wanted to verify against a presumably working configuration I did try a clang build first to see if anything had changed. It breaks fairly early now, interestingly while compiling a Rust component:

4:33.00 error: /home/censored/src/mozilla-release/obj-powerpc64le-unknown-linux-gnu/release/deps/ undefined symbol: __muloti4
4:33.00 --> /home/censored/src/mozilla-release/third_party/rust/phf_macros/src/
4:33.00 227 | #[::proc_macro_hack::proc_macro_hack]
4:33.00    |      ^^^^^^^^^^^^^^^
4:33.00 error: aborting due to previous error
4:33.00 error: could not compile `phf_macros`.

So there's that. I'm not very proficient in Rust so I didn't do much more diagnosis at this point. Back to the hippo gcc.

What's needed is to hack the build system to copy the .gcda files generated during profiling out of instrumented/ into the regular build tree for the actual (second) build phase, which is essentially the solution proposed in bug 1601903 except without any explanation as to how you actually do it. The PGO driver is fortunately in a standalone Python script, so I decided to simply hijack that. At the end is code to coalesce the .profraw files from a successful instrumented clang build, which shouldn't be running anyway if the compiler is gcc, so I threw in a couple lines to terminate instead after it runs this shell script:

#!/bin/csh -f

set where=/tmp/mozgcda.tar

# all on one line yo
cd /home/censored/src/mozilla-release/obj-powerpc64le-unknown-linux-gnu/instrumented || exit
tar cvf $where `find . -name '*.gcda' -print`
cd ..
tar xvf $where
rm -f $where

This repopulates the .gcda files in the right place before we rebuild with the profile data, but because of this subterfuge, gcc thinks the generated profile is not consistent with the source and spams an incredible number of complaint messages ... which made it difficult to spot the internal compiler error that the profile-guided rebuild triggered. This required another rebuild with some tweaks to turn off those and some other irrelevant warnings (I'll probably upstream at least one of these changes) so I could find the ICE in the scrollback. Fortunately, it was in a test binary, so I just commented it out and the build finally stuck. And so far, it's working impressively well. This may well be the fastest the browser can get while still lacking a JIT.

After all that, it's almost an anticlimax to mention that --disable-release is no longer needed in the build configs. You can put it in the Debug configuration if you want, but I now use --enable-release in optimized builds and it seems to work fine.

If you want to try compiling a PGO-LTO build yourself, here is a gist with the changes I made (they are all trivial). Save the shell script above as gccpgostub.csh in ~/src/mozilla-release and/or adjust paths as necessary, and make sure it is chmodded +x. Yes, there is no doubt a more elegant way to do this in Python itself but I hate Python and I was just trying to get it to work. Note that PGO builds can be exceptionally toolchain-dependent (and ICEs more so); while TestUtf8 was what triggered the ICE on my system (Fedora 32, gcc 10.2.1), it is entirely possible it will halt somewhere else in yours, and the PGO command line options may not work the same in earlier versions of the compiler.

Without further ado, the current .mozconfigs, starting with Optimized. Add ac_add_options MOZ_PGO=1 to enable PGO once you have patched your tree and deposited the script.

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9"
ac_add_options --enable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full

# this is implied by enable-release but left in to be explicit

Debug Configuration

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9"
ac_add_options --enable-debug
ac_add_options --enable-linker=bfd


OpenBSD officially available for ppc64

OpenBSD 6.8 is now available and with it the first official release of the big-endian ppc64 port (which they call powerpc64). The port is specifically advertised for PowerNV machines (i.e., bare metal) with POWER9, which naturally includes the Raptor family but should support IBM's own PowerNV systems as well. POWER8 support is described as "included but untested."

The installation directions are still not fully complete, though Petitboot should be able to start the installer from pretty much any standard medium, and the installer experience should be stock from there. What's more, it looks like a good selection of pre-built packages is available, though some large applications such as Firefox are still missing (WebKit is apparently available). The missing packages seem to be similar to what is missing for their 32-bit powerpc flavour, so this is not unexpected.

With OpenBSD's release and FreeBSD's well-regarded history, this leaves only NetBSD — ironically the BSD with the most emphasis on portability, and my personal preference — as the last major cross-platform BSD yet to arrive on OpenPOWER. Given OpenBSD and NetBSD's genetic history, however, this release makes future NetBSD support for OpenPOWER much more likely.

IBM splits

IBM today announced that the company will split into two, moving the Managed Infrastructure Services portion of IBM Global Technology Services into a new cloud-focused corporation tentatively called "NewCo" by the end of 2021. NewCo would also have a greater focus on AI, presumably through a distributed computing model rather than traditional hardware sales. The Technology Support Services piece of GTS that addresses data centre, hardware and software support would remain part of "old" IBM, along with Red Hat and presumably the R&D folks responsible for working on Power ISA like the great people at OzLabs.

It is interesting that this move was predicted as early as February, and a split in itself only means that a combined business strategy no longer makes sense for these units. But chairwoman Ginni Rometty missed the boat on cloud early, and despite the hype in IBM's investor release over the new company, "NewCo" is really the "old" services-oriented IBM with a fresh coat of paint that was a frequent source of layoffs and cost-cutting manoeuvres over the years. There are probably reasons for this, not least of which is their hidebound services-first mentality that wouldn't sell yours truly a brand new POWER7 in 2010 even when I had a $15,000 personal budget for the hardware because I didn't (and don't: the used POWER6 I bought instead is self-maintained) need their services piece. As a result I apparently wasn't worth the sale to them, which tells you something right there: today's growth is not in the large institutional customers that used to be IBM's bread and butter but rather in the little folks looking for smaller solutions in bigger numbers, and Rometty's IBM failed to capitalize on this opportunity. In my mind, today's split is a late recognition of her tactical error.

Presumably the new company would preferentially use "OldCo" hardware and recommend "OldCo" solutions for their service-driven hybrid buildouts. But "OldCo" makes most of its money from mainframes, and even with robust virtualization options mainframes as a sector aren't growing. Although IBM is taking pains to talk about "one IBM" in their press release, that halcyon ideal exists only as long as either company isn't being dragged down by the other, and going separate directions suggests such a state of affairs won't last long.

What does this mean to us in OpenPOWER land? Well, we were only ever a small part of the equation, and even with a split this won't increase our influence on "OldCo" much. Though IBM still makes good money from Power ISA and there's still a compelling roadmap, us small individual users will need to continue making our voices heard through the OpenPOWER Foundation and others, and even if IBM chooses not to emphasize individual user applications (and in fairness they won't, because we're not where the money is), they still should realize the public relations and engineering benefits of maintaining an open platform and not get in the way of downstream vendors like Raptor attempting to capitalize on the "low-end" (relatively speaking) market. If spinning off MIS gets IBM a better focus on hardware and being a good steward of their engineering resources, then I'm all for it.

Where did the 64K page size come from?

Lots of people were excited by the news over Hangover's port to ppc64le, and while there's a long way to go, the fact it exists is a definite step forward to improving the workstation experience on OpenPOWER. Except, of course, that many folks (including your humble author) can't run it: Hangover currently requires a kernel with a 4K memory page size, which is the page size of the majority of extant systems (certainly x86_64, whose base page size is 4K). ppc64 and ppc64le can certainly run on a 4K page size and some distributions do, yet the two probably most common distributions OpenPOWER users run — Debian and Fedora — default to a 64K page size.
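Checking which camp your own kernel falls into is a one-liner; getconf is POSIX and present on all the distributions mentioned:

```shell
# Print the kernel's base page size in bytes: 4096 on a 4K kernel,
# 65536 on a stock Fedora or Debian ppc64le install.
getconf PAGESIZE
```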

And there's lots of things that glitch and have glitched when userspace makes assumptions about this. Besides Hangover, Firefox used to barf on 64K pages (on aarch64 too), and had an issue where binaries built on one page size wouldn't work on systems with a different one. (This also bit numpy.) Golang's runtime used to throw fatal errors. The famous nouveau driver for Nvidia GPUs assumes a 4K page size, and the compute-only binary driver that does exist (at least for POWER8) cheats by making 64K pages out of 16 copies of the "actual" 4K page. btrfs filesystems take on the page size of the host that created them: a filesystem made on a 4K page size system won't be readable on a 64K page system and vice versa (this is being fixed, but hasn't been yet).

With all these problems, why have a 64K page size at all, let alone default to it? There must be some reason to use it because ppc64(le) isn't even unique in this regard; many of those bugs related to aarch64 which also has a 64K page option. As you might guess, it's all about performance. When a virtual memory page has to be attached to a process or mapped into its addressing space, a page fault is triggered and has to be handled by the operating system. Sometimes this is minor (it's already in memory and just has to be added to the process), sometimes this is major (the page is on disk, or swapped out), but either way a page fault has a cost. 64-bit systems naturally came about because of the need for larger memory addressing spaces, which benefits big applications like databases and high-performance computing generally, and these were the tasks that early 64-bit systems were largely used for. As memory increases, subdividing it into proportionally larger pieces thus becomes more performance-efficient: when the application faults less, the application spends more time in its own code and less in the operating system's.
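The arithmetic behind that efficiency argument is easy to see; a quick sketch (the 16 GiB figure is just an example):

```shell
# Pages needed to map 16 GiB: 64K pages mean 16x fewer entries to
# fault in and track, hence smaller page tables and fewer faults.
ram=$((16 * 1024 * 1024 * 1024))
echo "4K pages:  $((ram / 4096)) entries"    # 4194304
echo "64K pages: $((ram / 65536)) entries"   # 262144
```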

A second performance improvement afforded by larger pages is higher efficiency from the translation lookaside buffer, or TLB. The TLB is essentially a mapping cache that allows a CPU to quickly get the physical memory page for a given virtual memory address. When the virtual memory address cannot be found in the TLB, then the processor has to go through the entire page table and find the address (filling it into the TLB for later), assuming it exists. This can be a relatively expensive process if there are many entries to go through, and even worse if the page tables are nested in a virtualized setup. A larger page size not only allows more memory to be handled with a smaller page table, making table walks quicker, but also yields more hits for a TLB of the same size. It is fair to point out there are arguments over MMU performance between processor architectures which would magnify the need for this: performance, after all, was the reason why POWER9 moved to a radix-based MMU instead of the less-cache-friendly hashed page table scheme of earlier Power generations, and x86_64 has a radix tree per process while Power ISA's page table is global. (As an aside, some systems optionally or even exclusively have software-managed TLBs where the operating system manages the TLB for the CPU and walks the page tables itself. Power ISA isn't one of them, but these architectures in particular would obviously benefit from a smaller page table.)
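The same arithmetic applies to TLB reach — the span of memory the TLB can translate without a table walk; a sketch with a hypothetical 1024-entry TLB:

```shell
# A hypothetical 1024-entry TLB translates 16x more memory with 64K
# pages before the processor has to walk the page table at all.
entries=1024
echo "reach with 4K pages:  $((entries * 4096 / 1048576)) MiB"    # 4
echo "reach with 64K pages: $((entries * 65536 / 1048576)) MiB"   # 64
```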

64K page sizes, compatibility issues notwithstanding, naturally have a downside. The most important objection relates to memory fragmentation: many memory allocators have page alignment constraints for convenience, which could waste up to the remaining 60K if the memory actually in use fits entirely within a 4K page instead. On bigger systems with large amounts of memory running tasks that allocate large memory blocks, this excess might be relatively low, but they could add up on a workstation-class system with smaller RAM running a mix of client applications making smaller allocations. In a somewhat infamous rebuttal, Linus Torvalds commented, "These absolute -idiots- talk about how they win 5% on some (important, for them) benchmark by doing large pages, but then ignore the fact that on other real-world loads they lose by sevaral HUNDRED percent because of the memory fragmentation costs [sic]." Putting Linus' opinion into more anodyne terms, if the architecture bears a relatively modest page fault penalty, then the performance improvements of a larger page size may not be worth the memory it can waste. This is probably why AIX, presently specific to ppc64, offers both 4K and 64K pages (and even larger 16MB and 16GB pages) and determines what to offer to a process.
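The fragmentation objection is just as easy to make concrete: a page-aligned allocator that hands out whole pages can waste nearly an entire page on a tiny allocation:

```shell
# Worst-case internal fragmentation for a 1-byte allocation that gets
# rounded up to a whole page: 4095 bytes wasted vs. 65535 bytes.
for page in 4096 65536; do
  echo "page size $page: up to $((page - 1)) bytes wasted per allocation"
done
```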

The 4K vs. 64K gulf is not unlike the endian debate. I like big endian and I cannot lie, but going little endian gave Power ISA a larger working software library by aligning with what those packages already assumed; going 4K is a similar situation. But while the performance difference between endiannesses has arguably never been significant, there really are performance reasons for a 64K page size and those reasons get more important as RAM and application size both increase. On my 16GB 4-core Blackbird, the same memory size as my 2005 Power Mac Quad G5, a 4K page size makes a lot more sense than a 64K one because I'm not running anything massive. In that sense the only reason I'm still running Fedora on it is to serve as an early warning indicator. But on my 64GB dual-8 Talos II, where I do run larger applications, build kernels and Firefoxen and run VMs, the performance implications of the larger page size under Fedora may well become relevant for those workloads.

For servers and HPCers big pages can have big benefits, but for those of us using these machines as workstations I think we need to consider whether the performance improvement outweighs the inconvenience. And while Fedora has generally served me well, lacking a 4K page option on ppc64le certainly hurts the value proposition for Fedora Workstation on OpenPOWER since there are likely to be other useful applications that make these assumptions. More to the point, I don't see Red Hat-IBM doubling their maintenance burden to issue a 4K page version and maintaining a downstream distro is typically an incredibly thankless task. While I've picked on Fedora a bit here, you can throw Debian and others into that mix as well for some of the same reasons. Until other operating systems adopt a hybrid approach like AIX's, the quibble over page size is probably the next major schism we'll have to deal with because in my humble opinion OpenPOWER should not be limited to the server room where big pages are king.

It's good to have a Hangover

One of the wishlists for us OpenPOWER workstation users is better emulation for when we have to run an application that only comes as a Windows binary. QEMU is not too quick, largely because of its overhead, and although Bochs is faster in some respects it's worse in others and doesn't have a JIT. While things like HQEMU are fast, they also have their own unique problems, and many things that work in QEMU don't work in HQEMU. Unfortunately, because Wine Is Not an Emulator, it cannot be used to run Windows binaries directly.

People then ask the question, what if we somehow put QEMU and Wine together like Slaughterhouse-Five and see if they breed? Somebody did that, at least for aarch64, and that is Hangover. And now it runs on ppc64le with material support for testing provided by Raptor.

Hangover is unabashedly imperfect and many things still don't work, and there are probably things that work on aarch64 that don't work on ppc64le as the support is specifically advertised as "incomplete." (Big-endian need not apply, by the way: there are no thunks here for converting endianness. Sorry.) There is also the maintainability problem that the changes to Wine to support ppc64le (done by Raptor themselves, as we understand) haven't been upstreamed and that will contribute to the rebasing burden.

With all that in mind, how's it work? Well ... I have no idea, because the other problem is right now it's limited to kernels using a 4K page size and not every ppc64le-compatible distribution uses them. Void Linux, for example, does support 4K pages on ppc64le, but Fedora only officially supports a 64K page size, and I'm typing this on Fedora 32. It may be possible to hack Hangover to add this support but the maintainer ominously warns that "loading PE binaries, which have 4K aligned sections, into a 64K page comes with [lots] of problems, so currently the best approach is to avoid that." I'm rebuilding my Blackbird system but I like using it as a Fedora tester before running upgrades on this daily driver Talos II, which has saved me some substantial inconvenience in the past. That said, a live CD that boots Void and then runs Hangover might be fun to work on.

If you've built Hangover on your machine and given it a spin, advise how well it works for you and how it compares to QEMU in the comments.

IBM makes available the POWER10 Functional Simulator

Simulators are an important way of testing compatibility with future architectures, and IBM has now released a functional simulator for POWER10. Now, we continue to watch POWER10 closely here at Talospace because of as-yet unaddressed concerns over just how "open" it is compared to POWER8 and POWER9, and we have not heard of any workstation-class hardware announced around it yet (from Raptor or anyone else). But we're always interested in the next generation of OpenPOWER, and the documentation states it provides "enough POWER10 processor complex functionality to allow the entire software stack to execute, including loading, booting and running a little endian Linux environment." Pretty cool, except you can't actually run it on OpenPOWER yet: there is no source code, and no binaries for ppc64le, although the page indicates it is supported; the only downloads as we go to press are for x86_64. IBM did eventually release ppc64le packages for Debian for the POWER9 functional simulator, so we expect the same here to happen eventually, even though it would have been a nice gesture to have it available immediately since we would be the very people most interested in trying it out. It includes a full instruction set model with SMP support, vector unit and the works, but as always you are warned "it may not model all aspects of the IBM Power Systems POWER10 hardware and thus may not exactly reflect the behavior of the POWER10 hardware."

FreeBSD swings both ways

They say there's an xkcd for everything, but me, I say it's Friends GIFs. Anyway, hat tip to developer Piotr Kubaj who reports that, if you don't like big endian and cannot lie, FreeBSD's got you covered with a new little endian ppc64le port to complement the existing (and by now practically mature) big endian ppc64 flavour.

Raptor themselves actually give material support to the project by providing a remote instance for development, powering a build server that continuously runs poudriere bulk -a to test ports. Plus, looking in the source tree, the commits to add little-endian support are all tagged as "Sponsored by: Tag1 Consulting, Inc." This company apparently has OpenPOWER alumni from the Oregon State University Open Source Lab (.pdf). It's nice to see the cross-pollination at work!

Although there are no .iso images yet, they should start appearing with the -CURRENT snapshots next week. Note that official ports support doesn't exist yet either, so you'll need to compile packages on your own for the moment, and there are other minor to moderate deficiencies relative to the big-endian port which are still being rectified. Still, choice is a good thing, especially since per Piotr there are no plans to decommission the big-endian port and both will coexist. How's that for playing on both teams?

Firefox 81 on POWER

Firefox 81 is released. In addition to new themes of dubious colour coordination, media controls now move to keyboards and supported headsets, the built-in JavaScript PDF viewer now supports forms (if we ever get a JIT going this will work a lot better), and there are relatively few developer-relevant changes.

This release heralds the first official change in our standard POWER9 .mozconfig since Fx67. Link-time optimization continues to work well (and in 81 the LTO-enhanced build I'm using now benches about 6% faster than standard -O3 -mcpu=power9), so I'm now making it a standard part of my regular builds with a minor tweak we have to make due to bug 1644409. Build time still about doubles on this dual-8 Talos II and it peaks out at almost 84% of its 64GB RAM during LTO, but the result is worth it.

Unfortunately PGO (profile-guided optimization) still doesn't work right, probably due to bug 1601903. The build system does appear to generate a profile properly, i.e., a controlled browser instance pops up, runs some JavaScript code, does some browser operations and so forth, and I see gcc created .gcda files with all the proper count information, but then the build system can't seem to find them to actually tune the executable. This needs a little more hacking which I might work on as I have free time™. I'd also like to eliminate ac_add_options --disable-release as I suspect it is no longer necessary but I need to do some more thorough testing first.

In any event, LTO remains reliable, at least with the current Fedora 32 toolchain. I've heard concerns that some distributions are not making functional builds of Firefox for ppc64le (let alone ppc64, which has its own problems), though Fedora is not one of them. Still, if you have issues with your distribution's build and you are not able to build it for yourself, if there is interest I may put up a repo or a download spot for the binaries I use since I consider them reliable. Without further ado, here are the current .mozconfigs that I attest as functional.

Optimized Configuration

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9"
ac_add_options --disable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full

#export GN=/uncomment/and/set/path/if/you/haz
Debug Configuration
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9"
ac_add_options --enable-debug
ac_add_options --disable-release
ac_add_options --enable-linker=bfd

#export GN=/uncomment/and/set/path/if/you/haz

The first production RISC-V workstation?

No, not the RiscPC, a RISC-V PC. And, not counting the various one-offs, it appears to be the very first production RISC-V workstation available. SiFive is announcing the RISC-V PC at the Linley Group Fall Virtual Processor Conference, based on the Freedom U740 ("FU740") to be introduced at the same time next month.

Precious few details are available, such as loadout, options, availability and most of all cost, but when has that stopped us from idly speculating before, eh? It is virtually certain the machine will be composed largely of off-the-shelf components other than the CPU, which is the real mystery of interest. The FU740 appears to be an evolution of the FU540, which is a 64-bit 1.5GHz+ part with four U54 "little" cores combined with one S1-series "big" core and 2MB of L2 cache on a 28nm process. Plainly, neither of these cores is even remotely in the ballpark with OpenPOWER: SiFive quotes CoreMark/MHz scores of 3.01 for both the U54 and S54, whereas the POWER9 easily achieves over 160. While the FU740 will almost certainly be faster due to its probable basis on the U74, it is difficult to imagine that the performance gulf will be narrowed significantly (the U74 edges up to around 5). You should not buy one and expect it to compare favourably with x86 or a Raptor system.

On the other hand, there's a good chance this will be another truly open system based on the fact that the Freedom E300 and U500 series are open source under the Apache license. While some parts of SiFive are proprietary, this line is not, and we presume that the U700 series will be likewise. RISC-V still lacks firm specs for vector and bit manipulation instructions, and this certainly hurts them for desktop and mobile applications, but this is a known deficiency and is being worked on. Assuming no shenanigans with the firmware, there's encouraging potential even in this early form.

I'm unambiguously on Team Power because of my long history with the architecture, but this blog is certainly interested in all kinds of free vendor-unencumbered computing, and this machine may well represent another such system. And it's newsworthy as the first RISC-V system that's at least workstation form factor even if its likely performance doesn't currently make it a credible daily driver. But maybe that's not the point: the point is to get developers on the architecture in a way that's bigger than an evaluation board (cf. Linus Torvalds and ARM), meaning it doesn't have to be their only daily driver; it just has to "be there" so people think about it. More on cost and specs and "how open is it" when we actually see it in October.

Moar OpenPOWER cores plz

More news from virtual OpenPOWER Summit 2020: I mentioned it would be interesting to see what other cores would pop up on the OpenPOWER Github and indeed following on from the PowerPC A2I comes another A2 variant, the PowerPC A2O.

Announced today by IBM and released under the standard OpenPOWER license, the A2O is an evolved 64-bit PowerPC A2 compliant with ISA 2.07, comparable to POWER8 (the A2I was 2.06) under the embedded-focused Book III-E, and can run in either big- or little-endian mode. At 45nm it was intended for 3GHz+ speeds; at 7nm it is expected to achieve 4.2GHz speeds at 0.85W, or 3GHz at 0.25W. Unlike the strictly in-order and slightly more power-thrifty A2I, the A2O is out-of-order and prioritizes single-threaded performance, but it's only SMT-2 versus the A2I's SMT-4. Even this is theoretical, however, because the documentation notes that only single-thread generation has been attempted so far. Each core has an AXU similar to the A2I's that appears to offer FPU operations in the Verilog code, plus a branch unit, FXUs for single and complex integer operations respectively, and a load/store unit. There also appears to be a basic MMU, though the core can run without one, relying entirely on ERATs; unfortunately I couldn't find a vector unit (the A2I as released didn't come with one either).

IBM casts the A2O as being more appropriate for artificial intelligence, autonomous driving and security, whereas the A2I was meant for streaming, network processing and data analysis. I'm not sure I believe either of those claims, but despite apparently being just an evolutionary improvement over the A2I I think the A2O is more promising especially for smaller-scale systems. By being 2.07-compliant it's already almost a mainline POWER8 and the interest that has bubbled up around A2I should find even more to like in A2O. Adding a radix MMU implementation and vector operations wouldn't be trivial, and even this single-thread implementation has high FPGA utilization, but I think this would be a better basis than A2I for that hypothetical OpenPOWER developer board everybody seems to want or even a mythical modern PowerPC laptop. Like A2I, A2O still doesn't replace Microwatt, which is much better documented, better supported, can actually boot a Linux kernel, and if for no other purpose than pedagogy is a far more purposeful model for OpenPOWER systems. That said, A2O's very presence is yet another choice and yet another great reason to be on board with OpenPOWER.

IBM open-sources PowerAI as OpenCE

News from today's COVID-19 socially distanced virtual OpenPOWER Summit: IBM announced the open-sourcing of their PowerAI package today as OpenCE, the Open Cognitive Environment for deep learning and machine learning applications. The code should build on any Linux-based OpenPOWER system, including Raptor-family workstations and servers, and the Github repository contains everything needed to build Tensorflow, Pytorch, XGBoost and related projects and dependencies. If building binaries from scratch leaves you cold waiting for the goodies, Oregon State University simultaneously announced plans to offer pre-built ppc64le binaries for each upcoming tagged release both with and without CUDA support. Unfortunately, not everything is open: you'll still need to register and download a separate blob from Nvidia if you intend to use CUDA, even though it can be reportedly downloaded at no cost afterwards, and if you do you'll naturally be limited to Nvidia GPUs (which you can't use for 3D acceleration on OpenPOWER currently due to the lack of a working open-source driver). Still, here's a high-power option for your machines coming from someone who knows how to optimize for the platform, and Raptor's PowerAI-specific SKU is a turnkey package configured expressly for that purpose (and it's even in stock). Perhaps OpenCE is something they could preinstall for even greater value now that it's available.

Microwatt floats

When we last visited Microwatt, the little synthesizeable OpenPOWER core that could, we looked at how you could hack instructions in. Or, you can sit back and wait for the PRs from IBM, including now a simple FPU. While this pull request describes its performance in modest terms, impressively it operates exactly the same (and even authentically "fails" the same tests in the same fashion) as the FPU in the POWER9. There is still no (full) supervisor mode, and no vector unit, but Microwatt is now advanced enough to boot a Linux kernel. The possibility of a single-board Microwatt-based system (and fully reprogrammable, too) gets closer every day.

Firefox 80 on POWER

Firefox 80 is available, and we're glad it's here considering Mozilla's recent layoffs. I've observed in this blog before that Firefox is particularly critical to free computing, not just because of Google's general hostility to non-mainstream platforms but also the general problem of Google moving the Web more towards Google.

I had no issues building Firefox 79 because I was still on rustc 1.44, but rustc 1.45 asserted while compiling Firefox, as reported by Dan Horák. This was fixed with an llvm update, and with Fedora 32 up to date as of Sunday and using the most current toolchain available, Firefox 80 built out of the box with the usual .mozconfigs.

Since there was a toolchain update, I figured I would try out link-time optimization again since a few releases had elapsed since my last failed attempt (export MOZ_LTO=1 in your .mozconfig). This added about 15 minutes of build-time on the dual-8 Talos II to an optimized build, and part of it was spent with the fans screaming since it seemed to ignore my -j24 to make and just took over all 64 threads. However, it not only builds successfully, I'm typing this post in it, so it's clearly working. A cursory benchmark with Speedometer 2.0 indicated LTO yielded about a 4% improvement over the standard optimized build, which is not dramatic but is certainly noticeable. If this continues to stick, I might try profile-guided optimization for the next release. The toolchain on this F32 system is rustc 1.45.2, LLVM 10.0.1-2, gcc 10.2.1 and GNU ld.bfd 2.34-4; your mileage may vary with other versions.

There's not a lot new in this release, but WebRender is still working great with the Raptor BTO WX7100, and a new feature available in Fx80 (since Wayland is a disaster area without a GPU) is Video Acceleration API (VA-API) support for X11. The setup is a little involved. First, make sure WebRender and GPU acceleration are up and working with these prefs (set or create):

gfx.webrender.enabled true
layers.acceleration.force-enabled true
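If you'd rather not flip these in about:config by hand, the same prefs can be forced from a user.js in your profile directory; the profile path below is a placeholder — substitute the one shown in about:profiles:

```shell
# Append the two prefs to user.js; the profile directory name is an
# assumption, substitute yours from about:profiles.
profile="$HOME/.mozilla/firefox/yourprofile.default"
mkdir -p "$profile"
cat >> "$profile/user.js" <<'EOF'
user_pref("gfx.webrender.enabled", true);
user_pref("layers.acceleration.force-enabled", true);
EOF
```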

Restart Firefox and check in about:support that the video card shows up and that the compositor is WebRender, and that the browser works as you expect.

VA-API support requires EGL to be enabled in Firefox. Shut down Firefox again and bring it up with the environment variable MOZ_X11_EGL set to 1 (e.g., for us tcsh dweebs, setenv MOZ_X11_EGL 1 ; firefox &, or for the rest of you plebs using bash and descendants, MOZ_X11_EGL=1 firefox &). Now set (or create):

media.ffmpeg.vaapi-drm-display.enabled true
media.ffmpeg.vaapi.enabled true
media.ffvpx.enabled false

The idea is that VA-API will direct video decoding through ffmpeg and theoretically obtain better performance; this is the case for H.264, and the third setting makes it true for WebM as well. This sounds really great, but there's kind of a problem: with those settings enabled, video playback was visibly corrupted.

Reversing the last three settings fixed this (the rest of the acceleration seems to work fine). It's not clear whose bug this is (ffmpeg, or something about VA-API on OpenPOWER, or both, though VA-API seems to work just fine with VLC), but either way this isn't quite ready for primetime yet on our platform. No worries since the normal decoder seemed more than adequate even on my no-GPU 4-core "stripper" Blackbird. There are known "endian" issues with ffmpeg, presumably because it isn't fully patched yet for little-endian PowerPC, and I suspect once these are fixed then this should "just work."

In the meantime, the LTO improvement with the updated toolchain is welcome, and WebRender continues to be a win. So let's keep evolving Firefox on our platform and supporting Mozilla in the process, because it's supported us and other less common platforms when the big 1000kg gorilla didn't, and we really ought to return that kindness.

POWER10 sounds really great, but ...

IBM took the wraps off POWER10 officially today, a (Samsung-manufactured) 7nm monster in 18 layers with up to 15 SMT-8 cores (120 threads) with 2MB of L2 per core, up to 120MB of L3, 1 TB/s memory access, OpenCAPI and PCIe 5. New on-board is an embedded matrix math accelerator for specialized AI performance, multipetabyte memory clusters and transparent memory encryption with four times the number of AES engines than POWER9. Overall, IBM is touting that the processor is three times more energy efficient than POWER9 while being up to twice as fast at scalar and four times as fast at vector operations. General availability is announced for Q3 or Q4 of 2021.

First of all: damn. This sounds sweet. The dual-8 POWER9 Talos II under the desk with "just" 64 threads and PCIe 4 is already giving me sorrowful Eeyore eyes, even though there's no guarantee what, if any, lower-end systems suitable as workstations will be available when the processor is. But what we do know right now is that Raptor has said there won't be POWER10 systems, and as it stands presently nobody else is making workstation-class OpenPOWER machines. Raptor, probably for reasons of NDAs, is playing this close to the vest, so what follows is merely my variably informed personal conjecture and may be completely inaccurate.

One of the truly incredible things about OpenPOWER — or at least POWER8 and POWER9 — is how far down you can see what the hardware is doing. In previous articles, we looked at emulating OpenPOWER at the bare metal level, and then even writing your own firmware bootkernel. But the bootloader and high-level firmware are really only the beginning: the build image created by op-build not only contains the Petitboot bootloader, but its Skiroot filesystem, Skiboot (containing OPAL, the OpenPOWER Abstraction Layer, which handles PCIe, interrupt and operating system services), Hostboot (which initializes and trains RAM, buffers and the bus), and the Self-Boot Engine which initializes the CPUs. Even the fused-in first instructions the POWER9 executes from its OTPROM to run the Self-Boot Engine are open source, and other than the OTPROM itself (it is a One-Time Programmable ROM, after all), everything is inspectable and changeable. And before the POWER9 executes those very first instructions, the Baseboard Management Controller that powers the system on has its own open firmware too. You know what your computer is doing, and you don't have to trust anyone's firmware build if you don't want to because you can always build and flash the system yourself.

Contrast this against the gyrations that x86 "open" systems have to struggle with. Do not interpret this as a slam against vendors like System76 or Purism because they're doing the best they can to deliver the most frequently used architecture in workstations and servers, in as unlocked a fashion as possible from processor manufacturers who are going in exactly the opposite direction. And there have been great improvements in untangling the tendrils of the Intel Management Engine from the processor, primarily through Coreboot's steady evolution. But even with these improvements where significant portions of the Intel ME are disabled, secret sauce is still needed to bring up the CPU and you have to trust that the sauce is only and specifically doing what it says it is, in addition to the other partitions of the ME which activated or not are still not fully understood. The situation is even worse for AMD Ryzen processors with the Platform Security Processor, which (at least the 3000 and 4000 variants) aren't presently supported by Coreboot at all, though System76 is apparently working on a port.

Don't just take my word for it: as of this writing no recent x86 system appears on the FSF Respects Your Freedom list, but the Talos II and T2 Lite both do (and I imagine the Blackbird is soon to follow). The Vikings D8 is indisputably libre and has an FSF RYF certification, but it is built around the AMD Opteron 4200, which is about eight or nine years old. As it stands, I believe this is the most powerful x86 system still available on the FSF RYF list now that the (Opteron 6200-based) D16 is out of production.

I think there's a reasonable argument to be had about how "open" something needs to be to be considered "libre" and at what point you could be considered to have meaningful control of your machine, but there's no denying there are aspects of modern x86 machines which you are prohibited by policy from getting into, and that means putting more faith in the processor vendor than they may truly deserve. (Don't get me started on GPUs, either. Or, for that matter, ARM.) Again, Raptor won't say, but their public disenchantment with POWER10 suggests that some aspects of the processor firmware stack are not open. This is a situation which is no better than x86, and I'm hoping this is merely an oversight on IBM's part and not a future policy.

To be effective, OpenPOWER needs to be more open than just the ISA being royalty-free, even though that's huge. To be sure, I think there has to be room for processor manufacturers to distinguish themselves in the market or you run the risk of a race to the bottom where people simply rip off designs (this is, I think, a real concern for RISC-V). I think sharing reference designs is necessary to get systems bootstrapped, but I can't deny there's money in high performance applications, and high performance microarchitecture demands a return on investment to justify development costs. Similarly, to the extent that any pack-in hardware (like POWER9's Nest Accelerators) isn't part of the open ISA and is a separately managed device that simply shares the die, it seems logical to me to also make it part of how a processor manufacturer can stand out to potential customers.

But the firmware absolutely needs to be as clean and available as the ISA. If the ISA is open and the instructions the CPU is running are part of that open standard, then any firmware components, which (ought to) entirely consist of those instructions, must be open too. If the CPU has pack-in hardware on the die that isn't part of the open ISA, then you should be able to bring up the chip without it. The standard that was set for current OpenPOWER should be the same standard for POWER10 or it doesn't really deserve the OpenPOWER name, and I'm worried that Raptor's insinuations imply IBM's standard isn't the same. Similarly, arguing that the currently incomplete situation with x86 is functionally equivalent to OpenPOWER (or, for that matter, RISC-V) may be well-intentioned but is disingenuous. The FSF may be ideologues on binary blobs, but that doesn't make their position wrong, and the entire OpenPOWER ecosystem from IBM on down should recognize how much goodwill and prominence the openness of POWER8 and POWER9 has generated for the community.

I hope I'm wrong, but I'm concerned I'm not. Let's make sure we get POWER10 right or we won't be practicing what we preach, and that's going to kill us in the crib.

Vikings' upcoming OpenPOWER retail channel

Many Talospace readers are familiar with Vikings, who offer libre hosting as well as sales of libre-friendly devices, systems and peripherals certified by the FSF Respects Your Freedom program (for which Raptor systems qualify). However, the Vikings storefront now shows a new tab for OpenPOWER hardware, hopefully a public demonstration of a new retail channel coming soon for those ready to pull the trigger on an OpenPOWER workstation or server of their own. This is particularly of value to our readers outside North America, since it gets around a lot of the inconveniences of shipping and payment with United States businesses; Vikings is based in Germany, and accepts payments in euros, US dollars, British pounds, Australian dollars and New Zealand dollars. We have also heard that Vikings is working on a water-cooling system for POWER machines with an aim to reach the market in two months or less, a great option for people trying to run the 18- and 22-core parts in desktop environments (current BTO cooling options are air-only).

It is not yet known whether Vikings will sell full systems, parts and/or processors, whether the lineup will include OpenPOWER systems other than Raptor workstations and servers, or when general availability is expected. Still, the more retail options there are, the greater the volume of sales and the greater the economies of scale that will result. In the end, that can only be a good thing for growing our niche but very important market.

Will it build?

While I will always be big-endian at heart, ppc64le does get around a lot of the unfortunately pervasive endian assumptions in a cold blackboxed x86_64 world, and even things like MMX, SSE and SSSE3 can be automatically translated in many cases. It is therefore a happy result that even many software packages completely unaware of ppc64le will still build and function out of the box, assuming they don't do silly things like emit JITted x86_64 assembly code and try to run it, etc.

I ran across this project the other day which has over 1,000 build scripts for ppc64le (as shell scripts and/or Docker files) that you can either use directly, or as a hint whether your intended build will even work. Cursorily paging through a few I see IBM E-mail addresses, so no surprise much of it is tested on Red Hat (though largely RHEL 7.x), but there are also Ubuntu scripts there as well and I imagine they'd accept other distros. Keep in mind that this is generic ppc64le, so it will work on POWER8 and up but without any special optimizations (for example, I always build Firefox optimized with -O3 -mcpu=power9), and the selection favours server-side packages over workstation and client software. I also see relatively few platform-specific corrections, which could be good (they weren't needed) or bad (they weren't tested). Still, it's nice to see more resources to aid porting and platform compatibility, which can in turn encourage more packages to make ppc64le (and hopefully ppc64) a first-class citizen too.

Linux 5.8 on POWER

The 5.8 kernel milestone has arrived, with improvements to reduce thrashing (though with the amount of memory even a Blackbird can hold, there's no excuse not to load these suckers up), an API for receiving notifications of kernel events, support for hardware-assisted inline encryption at the block layer for storage devices and a nice convenience feature where you can put sysctl.something.or.other=999 right on the kernel command line.
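As a concrete (and hypothetical) illustration of that last feature, any sysctl can now be pinned before userspace even starts by prefixing it with sysctl. on the kernel command line; on a GRUB system, that might look like the sketch below. The key vm.swappiness and the value 10 are placeholders for illustration, not recommendations.

```shell
# Hypothetical /etc/default/grub snippet: set a sysctl at boot via the
# Linux 5.8+ "sysctl." command-line prefix, then regenerate the config
# with update-grub or grub2-mkconfig as appropriate for your distro.
GRUB_CMDLINE_LINUX="sysctl.vm.swappiness=10"
```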

On the Power ISA side, this kernel adds the first support for POWER10 and ISA 3.1, although our Raptor contacts have indicated some displeasure with IBM's management decisions and we suspect this is a way of saying firmware binary blobs might be required to enable maximal performance (though we don't know, and it's unclear how much is under NDA). Another nice feature is an ioctl to send gzip compression requests directly to POWER9's on-chip compression hardware via /dev/crypto/nx-gzip. This is part of the general family of Nest Accelerators (NXes) accessible through the Virtual Accelerator Switchboard. More about that in a later article, but in the meantime while we wait for compressors to add this support, here's an accelerated power-gzip userspace library that directly replaces zlib.

Finally, in addition to various improvements for the 40x and 8xx series, the most interesting commit was around prefixed instructions. These represent the first 64-bit instructions in the Power ISA (here's a code sample to show you the encoding) and allow much larger 34-bit displacements for load-store operations than the 16-bit ones in current 32-bit instructions. I'm not too wild about the fact this makes the Power ISA technically variable-length, but these prefixed D-form instructions are easy to identify and are always 64 bits in size, and they should make certain types of code generation a lot simpler on chips that support them.
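To make the displacement math concrete, here's a small Python sketch of how that 34-bit displacement is carried: the prefix word holds the high 18 bits (d0), the suffix word the low 16 bits (d1), and the concatenation is sign-extended when the instruction executes. The exact bit positions within each word are glossed over here; consult the Power ISA 3.1 manual for the real encoding.

```python
def split_d34(d):
    """Split a signed 34-bit displacement into the two fields a prefixed
    D-form instruction carries: d0 (high 18 bits, prefix word) and
    d1 (low 16 bits, suffix word)."""
    assert -(1 << 33) <= d < (1 << 33), "displacement out of 34-bit range"
    u = d & ((1 << 34) - 1)         # two's-complement view of the value
    return u >> 16, u & 0xFFFF      # (d0, d1)

def join_d34(d0, d1):
    """Reassemble d0||d1 and sign-extend from 34 bits, as the CPU does."""
    u = (d0 << 16) | d1
    return u - (1 << 34) if u & (1 << 33) else u
```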

Condor cancelled

Raptor has confirmed that, unfortunately but not unexpectedly, the LaGrange-based Condor that was announced at the OpenPOWER summit last year has been cancelled due to economic concerns. Certainly any new high-end product would be tough to launch in the present COVID-19 economy, and because its size (ATX) and capabilities (single CPU, OpenCAPI, four slots) would have slotted it between the Talos II and the Talos II Lite in our view, there just isn't much slack between those two existing products for it to soak up. It's probably just as well, because I think getting ready for POWER10 would mean more to many users (it certainly would to me), but that itself requires a lot of R&D capacity and Raptor's a small company. Rather than a niche POWER9 design, here's hoping the resources that would have gone to Condor will go to a really kick-a$$ new Rainier-based system instead.

Firefox 79 on POWER

Firefox 79 is out. There are many new web and developer-facing features introduced in this version, of which only a couple are of note to us in 64-bit PowerPC land specifically. The first is a migration of WebExtensions storage to a new Rust-based implementation; there was a bit of a pause while extension storage migrated, so don't panic if the browser seems to stall out for a few long seconds on first run. The second is a further rollout of WebRender to more Windows configurations, so this seemed like a good time to check again how well it's working on this side of the fence. With the Raptor BTO WX7100 installed in this Talos II, I've forced it on with gfx.webrender.enabled and layers.acceleration.force-enabled both set to true (restart the browser after), and after working with it all afternoon with no issues noted, this time I'm just going to leave it on and see how it goes. Any GCN-based AMD video card from Southern Islands on up (the WX7100 is Polaris) should work. about:support will show you whether WebRender and hardware acceleration are enabled, though currently no Linux configuration has WebRender on by default.
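If you'd rather persist those settings than flip them by hand, the equivalent user.js fragment (dropped into your profile directory) would look like this; the pref names are the ones given above:

```js
// user.js: force WebRender and hardware acceleration on
// (restart Firefox afterwards; check about:support to confirm)
user_pref("gfx.webrender.enabled", true);
user_pref("layers.acceleration.force-enabled", true);
```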

Unfortunately, it turns out relatively few of us build the browser ourselves from source, and it seems some distros are enabling features, most likely higher-level optimizations, that trigger broken builds on ppc64le (Ubuntu was mentioned by at least one user). It would be nice to whittle down the offending feature(s), both to get local fixes into the distro package configurations and then to look at why they don't work (or make the default not to enable them on our platform, solving the problem in both places). I suspect LTO and PGO are to blame, both of which have a long history of being troublesome, as well as various defects in gold (use GNU bfd as the linker instead). Meanwhile, the build I'm typing this blog post into is still happily running on the same .mozconfigs from Firefox 67.

The littlest POWER9 booter

In our previous article we talked about emulating an OpenPOWER system from Skiboot up through the Petitboot boot menu using extracts from pre-built PNOR firmware images (and/or QEMU) instead of having to build your own. Well, what if you want to build your own?

You can certainly download and build Skiboot and Skiroot/Petitboot from scratch, or naturally any of the firmware stages in PNOR flash since we're a fully open platform, and there is an entire (huge) build system to automate this process. It's big and intimidating to the uninitiated, and it also works just dandy. But for this simpler example, let's start with something a little smaller which can serve as an educational tool as well.

Recall that Skiboot is the lowest level emulated by QEMU presently, although in reality it is an intermediate phase started by an earlier boot stage, i.e., Hostboot (the pretty graphical boot you see in current versions of the Raptor firmware). Among other tasks, Skiboot's most important one is to offer the services provided by OPAL, the OpenPOWER Abstraction Layer, which the operating system will need to talk to the hardware. These services range from shutting down the machine to writing to the console, servicing interrupts, handling PCI devices and probably not doing your dishes. After OPAL is initialized, Skiboot starts the bootloader for Petitboot, which unpacks Petitboot's Linux kernel and initrd (a zImage containing Skiroot), and that image is what ultimately brings up Petitboot.

However, when you get right down to it, it's still just an ELF binary, so we can replace it as long as we understand how Skiboot calls and starts it.

Up to this point the CPU is in big-endian mode no matter what the eventual operating system is (as an old Power Mac user, this warms my grizzled cybernetic heart) and uses real physical memory addresses. When Skiboot finishes, it loads the single ELF binary stored in the PNOR flash partition BOOTKERNEL and runs it from its given entry point. This binary can be big-endian or little-endian. Skiboot also provides the binary the location of the flattened device tree (the FDT) in register r3, and two special addresses: the base address for OPAL in r8 (in physical memory, mind you), and the actual address to call for OPAL services in r9. This is more or less what kexec() does for a regular kernel, except these registers are guaranteed to be provided by Skiboot no matter the implementation.

OPAL calls assume the machine will stay that way (big-endian, real addresses, and also no external interrupts), so some leg work is required unless you just keep the system that way in the first place. In this simplest case, we'll do exactly that: the Skiboot source code even includes such a minimal boot image which simply says "Hello World!" to the console and shuts down the machine. Here, we see the code save the OPAL registers to non-volatile ones (so that calling OPAL won't clobber them) and use those to make the two OPAL calls themselves, setting the OPAL call number in r0, providing the OPAL base in r2 and any relevant arguments in the standard r3 through r10 registers, and then calling the OPAL entry point.

Let's see it in (brief) action. I will assume you already have QEMU set up to emulate an OpenPOWER machine as in the prior article (in particular, you should have either pnor.PAYLOAD or skiboot.lid available to provide Skiboot). To save you having to do so yourself, I added a little linker-assembler glue, some extra code to support both endian modes (more in a moment) and a trivial build system, and put it up on Github. If you're on an OpenPOWER system, as all right-thinking readers should be, then make should be sufficient to compile both the big and little endian versions, the latter of which I will come back to. If you are not, you will need a cross-building toolchain and should edit the Makefile to point to it.

Using what we learned last time, once you've run make, copy be_payload.elf into the same directory as skiboot.lid (QEMU's emulation doesn't work quite right with Raptor's PNOR Skiboot for this purpose), and let's kick it off:

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./be_payload.elf


Now, what about the little-endian case? This is trickier, because the system starts big-endian and expects big-endian instructions, and simply twiddling the endian bit in the Machine State Register isn't enough (if you do so via typical means like mtmsrd, it is ignored). In fact, only three instructions are allowed to change endianness, namely rfid, its hypervisor analogue hrfid and rfscv, which are all returns from privileged code (interrupt handlers and vectored system calls respectively). Vectored system calls, in fact, weren't even supported in the Linux kernel until 5.9. For our purpose here rfid will suffice.

Let's look at the version of hello_kernel.S I marked up. You will notice that in little endian mode, we are assembling several handwritten opcodes immediately in the macro GO_LITTLE_ENDIAN. These are big-endian instructions (since we're little-endian we can't specify the instructions directly) that set the link register after this little stanza, copy over the MSR and toggle the endian bit, load the link register and the new MSR into the save-restore registers and then act as if we returned from an interrupt handler (rfid). rfid sets the new MSR and jumps to the link register which we have already rigged to be the following instruction. We now continue in little-endian mode.

Now, how do we do OPAL calls? I abstracted the code here a bit for both situations with a OPAL_CALL macro. Big-endian just sets the registers and jumps to the OPAL entry point, since we're in real mode and no external interrupts are presently enabled, exactly the same as the test code in Skiroot. For little-endian, however, I added a little subroutine at the end called le_opal_call which is nearly the same idea as GO_LITTLE_ENDIAN, but in reverse. We save the MSR and the LR in non-volatile registers, turn off the little endian bit in the MSR, compute the new return address for the trampoline after the oncoming rfid and load that into LR, set up srr0 and srr1 — but point to the OPAL entry point instead — and "return from the interrupt."

The OPAL call is thus executed big-endian in real mode. However, when we return following the rfid, we're still big-endian, so we immediately GO_LITTLE_ENDIAN again, restore the old MSR and LR (the LE bit is politely ignored) and return via the link register to the calling routine.

The last trick here is that the length of the string Hello World! will be stored according to the endianness we set for the assembler. If we don't account for this, we'll get a nonsense value in big-endian mode and the OPAL routine that prints a string to the console will spew garbage. When assembling in little-endian mode we thus manually specify the necessary bytes explicitly.
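The failure mode is easy to reproduce in a few lines of Python: write a length word in one byte order and read it back in the other. An 8-byte length word is assumed here purely for illustration; the payload's actual width may differ.

```python
import struct

length = len(b"Hello World!")              # 12
le = struct.pack("<Q", length)             # the length as the LE assembler stored it
misread = struct.unpack(">Q", le)[0]       # what big-endian code reads back
# misread is 12 << 56: a huge nonsense length, so the console-write OPAL
# call would try to print gigabytes of garbage after the string
```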

After all that,

qemu-system-ppc64 -M powernv8 -cpu power8 \
-nographic \
-bios ./skiboot.lid \
-kernel ./le_payload.elf

A couple parting comments.

First, while you might think this would be sufficient to make something bootable from both Skiboot and Petitboot, it isn't; if you try to boot this as a kernel from Petitboot it will simply hang. We'll explore this further in a later article. Second, I have intentionally not described how you would actually flash this to PNOR on a real machine lest someone screw something up and blame me for it. In broad strokes, however, you would take either of the ELF binaries and turn it into a PNOR flash partition with fpart (not to be confused with other partition and file management utilities of the same name). Having done so, you would transfer this to the BMC and use pflash to replace the contents of PAYLOAD (after, hopefully, backing up the previous contents with pflash -r). At this point you may now start your machine so it can, um, shut down.

Finally, this entire exercise brings up an interesting question (to me, anyway): is there a performance ramification to running little-endian versus big-endian, given the additional overhead of flipping endianness every time OPAL is called? The answer is probably yes, but it's likely negligible in practice unless you're on the bare metal as we are here. Let's compare how little-endian Linux does this in opal-calls.S with big-endian OpenBSD's locore.S; in both listings, scroll down to opal_call and note the differences. Even though the big-endian side doesn't have to do quite as much song and dance setting up a trampoline and switching endianness, it still has to twiddle the MSR (in this case to turn off external interrupts and return to real mode), and a similar amount of instruction synchronization must still occur (using isync; rfid and hrfid do this as a natural consequence). From a practical perspective, unless you have some pathological case that makes lots of OPAL calls back to back, the few extra instructions required are probably below the noise threshold when considering everything else that affects performance in modern operating systems.