Showing posts from December, 2022

Fedora 37 mini-review on the Blackbird and Talos II

It's been kind of a wild ride getting the Talos II and the Blackbird upgraded to Fedora 37, but we're there, so it's finally time for a mini-review to summarize the current state. As I always like to remind folks, Fedora was one of the first mainstream distributions to support POWER9 out of the box, it's still one of the top distributions OpenPOWER denizens use and its position closest to the bleeding, ragged edge is where we see problems emerge first and get fixed (hopefully) before they move further downstream. That's why it's worth caring about it even if you yourself don't run it.

Another important reminder is that both my 'Bird and T2 are configured to come up in a text boot instead of gdm, and I start GNOME (Blackbird) or KDE (T2) manually from there. I still test GNOME on both systems, but I've pretty much entirely migrated over to KDE Plasma on the T2, so I'll talk about both the GNOME and KDE experience in this and future mini-reviews. I strongly recommend a non-graphical boot as a recovery mechanism in case your graphics card gets whacked by something or other. On Fedora this is easily done by ensuring the symlink /etc/systemd/system/ points to /lib/systemd/system/
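
The symlink paths above are cut off in this copy, so the following is only a sketch under an assumption: that the usual systemd target names apply (multi-user.target for a text boot, graphical.target for a display manager). `systemctl get-default` and `systemctl set-default` are standard systemd commands; `boot_mode` is a hypothetical helper name used just for illustration.

```shell
# Classify a systemd default target name as a text or graphical boot.
# boot_mode is a hypothetical helper; the target names are the standard
# systemd ones, assumed here because the path in the post is truncated.
boot_mode() {
    case "$1" in
        multi-user.target) echo text ;;
        graphical.target)  echo graphical ;;
        *)                 echo unknown ;;
    esac
}

# Usage (as root on a systemd machine):
#   boot_mode "$(systemctl get-default)"
#   systemctl set-default multi-user.target   # switch to a text boot
```

Using `systemctl set-default` rather than editing the symlink by hand should have the same effect, since it manages that default target link for you.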

As usual, the process is, from a root prompt:

dnf upgrade --refresh # upgrade prior system and DNF
dnf install dnf-plugin-system-upgrade # install upgrade plugin if not already done
dnf system-upgrade download --refresh --releasever=37 # download F37 packages
dnf system-upgrade reboot # reboot into upgrader

I did the Blackbird first, and immediately got a broken packages error because this was the system I had tested Pantheon on for Fedora 36. Just in case you didn't get the memo, new for F37 ain't no more Pantheon in Fedora (though there is a Copr). I just removed the offending packages instead (elementary-wallpapers-gnome, gala-libs, switchboard and switchboard-plug-tweaks). After that, the Blackbird booted into a graphical installer and rebooted without incident. The kernel release as of this writing is 6.0.13.

On the Talos II, which has the Raptor BTO option AMD WX7100 workstation card in it, even with GPU firmware loaded into the PNOR there was once again no graphical installation. If you manually select the kernel from Petitboot, it will at least show a text installation process. Alternatively, you can monitor on the serial port, or from a connected system viewing the serial console over the BMC's web server, or by logging in as root on another VTY (CTRL-ALT-F2 or as appropriate) and periodically issuing dnf system-upgrade log --number=-1 to watch log updates.
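
The "periodically issuing" part can be scripted. This is a minimal sketch, not anything dnf provides: `poll_cmd` is a hypothetical helper, and the 30-second interval, the repeat count and the `tail -n 5` are arbitrary choices.

```shell
# Run a command repeatedly: poll_cmd INTERVAL_SECONDS COUNT COMMAND [ARGS...]
# poll_cmd is a hypothetical helper, not part of dnf.
poll_cmd() {
    interval=$1
    count=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@"
        i=$((i + 1))
        # Don't sleep after the final run.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage, from a root shell on another VTY during the upgrade:
#   poll_cmd 30 20 sh -c 'dnf system-upgrade log --number=-1 | tail -n 5'
```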

Although installation on the T2 also seemed to successfully terminate and reboot, Petitboot subsequently puked because it couldn't handle the state the updater left it in (the root is XFS and it had a stuck journal log entry). Fortunately the Blackbird was able to rectify the filesystem once the SSD was moved to an external device for recovery.

Both systems failed to update the grub2 configuration, requiring me to do so manually with grub2-mkconfig -o /boot/grub2/grub.cfg. This is a regression from bug 1921479 and can be detected in dnf with the same error message during a kernel package update (/etc/grub.d/10_linux: line XXX: test: XXXXXXX-pXXXXXXX: integer expression expected).
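
If you keep a transcript of the update, the error above can be spotted mechanically. A minimal sketch follows, assuming the message has the shape quoted in the text; `grub_cfg_error_seen` and the file name `upgrade-output.txt` are hypothetical.

```shell
# Succeeds if stdin contains the 10_linux "integer expression expected"
# error quoted in the post. grub_cfg_error_seen is a hypothetical helper.
grub_cfg_error_seen() {
    grep -q '/etc/grub\.d/10_linux: line .*: test: .*: integer expression expected'
}

# Usage (as root), against a saved transcript of a kernel package update;
# upgrade-output.txt is a hypothetical file name:
#   if grub_cfg_error_seen < upgrade-output.txt; then
#       grub2-mkconfig -o /boot/grub2/grub.cfg
#   fi
```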

Additional problems were afoot when starting a GUI on the Blackbird, which is a basic 4-core using the ASPEED BMC for graphics. At least on the ASPEED, GNOME was rife with graphical abnormalities, worst of all in Wayland. (Starting Wayland from the command line also got more complex: I had to do something along the lines of XDG_SESSION_TYPE=wayland /usr/libexec/gnome-session-binary --builtin to get it to start; the regular gnome-session bombed out with an argument error.) Wayland is still restricted to 1024x768.

But that wasn't the worst of it. Trying to do even simple tasks yielded a lot of tearing and refresh problems. This is a picture of the screen because I couldn't even get the screenshot utility to work reliably.

So definitely a regression for Wayland. But even Xorg (still starts with startx, may need XDG_SESSION_TYPE=x11) had some unusual problems, like some apps' titlebars being transparent and causing graphical glitches.

Oddly, it wasn't all of them, though the Terminal was most affected. Newer or recently refreshed libadwaita-based apps seemed relatively immune, as did apps like Firefox that adhered to older GTK APIs. I keep the Blackbird as stock Fedora as possible, and I ran another update just prior to writing this up and didn't see any improvement.

KDE was not affected by that issue, though I observed — at least with Firefox and Thunderbird — that I had to grab the window and move it around a bit to get click point coordinates to be correctly reflected when the apps are full screen (then, after giving them a little shake by the titlebar, I could maximize them again and all would be well). I don't know whose bug this is exactly.

That particular irritation persisted on the Talos II, but none of GNOME's graphical problems that I saw on the Blackbird's BMC graphics. In fact, GNOME performed rather well for a change: I didn't need to force a rebuild of libgraphene this time to get improved performance and the UI was very smooth in Xorg. It also appears that the colour management issues I used to have where the screen would get blue-tinged have been rectified. Wayland GNOME had occasional animation stutters and a mild bit of lag but was otherwise useable, which is invariably the most I can say about Wayland.

Also in the positive category is that the bustage churn from the long double update in Fedora 36 is now almost all behind us, and the toolchain successfully built Firefox and other large projects without any new problems.

Overall F37 is a mixed update, more good than bad, but with some unwelcome regressions. In particular, if you are running BMC graphics only, even in Xorg GNOME has picked up some new glitches and in Wayland is once again a big mess. Systems with a GPU will largely be spared these issues — or just don't run GNOME. Likewise, be prepared with a second system to do any filesystem recovery if you're a long-time Fedora user and your root is still XFS; it may be time to convert it over to something else if you get pounded by it every time you do an upgrade.

Firefox 108 on POWER

Now that the Talos II is back in order and the Fedora 37 upgrade is largely behind me, it's time to upgrade Firefox to version 108. There are some nice performance improvements here plus a hotkey for about:processes with Shift-Escape. Support for WebMIDI seems a little gratuitous, but what the hey (haven't tried it yet, the Macs mostly handle my music stuff), and there are also new CSS features. As before, linking still requires Dan Horák's patch from bug 1775202 or the browser won't link on 64-bit Power ISA (alternatively, put --disable-webrtc in your .mozconfig if you don't need WebRTC). Otherwise, we were able to eliminate one of our patches from the PGO-LTO diff, so use the new one for Firefox 108 and the .mozconfigs from Firefox 105.
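
For the WebRTC alternative, the option goes into the mozconfig as an `ac_add_options` line. The sketch below appends it only if it isn't already present; `add_mozconfig_option` is a hypothetical helper, and `./.mozconfig` in the usage line stands in for whichever mozconfig you actually build with.

```shell
# Append an ac_add_options line to a mozconfig unless it is already there.
# add_mozconfig_option is a hypothetical helper name.
add_mozconfig_option() {
    file=$1
    option=$2
    line="ac_add_options $option"
    # -x: match the whole line, -F: fixed string, -q: quiet
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Usage:
#   add_mozconfig_option ./.mozconfig --disable-webrtc
```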

When Petitboot barfs, everything's vomit

Colourful, no? But it's true. I've not been able to write up my Fedora 37 experience, nor upgrade Firefox (nor do further work on the JIT) because the Petitboot boot menu couldn't stop touching the main NVMe drive and making its older (Linux 5.5) XFS kernel module hang. If Petitboot can't start, your expensive POWER9 system is a brick.

In its most literal sense this article is largely a cautionary tale, because unless you're a long-term Fedora user like me with a continuously updated older installation, it's very unlikely you have an XFS volume in your OpenPOWER box. But if the antique kernel in Petitboot ever starts barfing on your own filesystems or a device you install, you'll be in this state too, so here's how I got the Talos II working again.

It's pretty much been a constant that you need a second system to deal with glitches. For me, this is usually my trusty Quad G5 Power Mac sitting next to the T2 which is connected to its serial port (or to the BMC's), and this works when it's a problem you can resolve from the BMC side, which is many of them. It would be nice to power up a Talos or Blackbird and have the console automatically start up talking to the BMC instead of needing another system to do so but this is what we have, at least until Kestrel develops that capability.

Unfortunately, this wasn't one of those problems:

  [Disk: nvme1n1p2 / 19a5d4e3-19f7-423f-a75b-5b15c8ee0bff]
    Fedora (0-rescue-ee275f6a7d994c9981e4e1436b83172d) 30 (Workstation Edition)
    Fedora Linux (5.18.13-200.fc36.ppc64le) 36 (Workstation Edition)
    Fedora Linux (5.18.18-200.fc36.ppc64le) 36 (Workstation Edition)
(*) Fedora Linux (6.0.12-200.fc36.ppc64le) 36 (Workstation Edition)
  System configuration
  System status log
  Rescan devices
  Retrieve config from URL
 *Exit to shell

 [fedora-root] Processing new Disk device[    8.041704] XFS: Assertion failed: !
(fields & XFS_ILOG_DFORK) ||
 (len == in_f->ilf_dsize), file: fs/xfs/xfs_log_recover.c, line: 3103
cpu 0x26: Vector: 700 (Program Check) at [c0002007e33171c0]
    pc: c008000008dc46bc: assfail+0x54/0x60 [xfs]
    lr: c008000008dc4694: assfail+0x2c/0x60 [xfs]
    sp: c0002007e3317450
   msr: 900000000282b033
  current = 0xc0002007e32c3180
  paca    = 0xc0002007ff7f5900   irqmask: 0x03   irq_happened: 0x01
    pid   = 649, comm = pb-discover
kernel BUG at fs/xfs/xfs_message.c:110!

After the assertion appeared, Petitboot locked up (at least on the regular console) and the system wouldn't start from any device because Petitboot could not be coerced into ignoring it. I tried holding down the x key from the serial console to force it into the shell, and that worked — but it still tried to mount the volume anyway and died. This did bring up a live kernel debugging session as you can see in the screenshot, but since I wasn't sure what the XFS module would do at this point and didn't want to risk the filesystem, I just powered it down.

Something about the state of the root XFS volume after the Fedora 37 update was making it go wrong, and I'm not the first to observe this, either. Recovering cleanly would at minimum require a system that can mount and examine the XFS volume, and the G5, which runs Mac OS X Tiger, isn't that system. (Maybe the SGI Fuel next to it with IRIX 6.5.30 is, though that's something to explore some other time when it isn't my primary computer's boot volume at stake.)

Fortunately I've also got a Blackbird that did complete its F37 upgrade successfully. So it's time to do a little shopping.

I picked up two off-the-shelf NVMe-to-USB enclosures, one the Worst Best Buy Insignia store brand NS-PCNVMEHDE for about US$20, and a Sabrent EC-SNVE for about US$30 which also supports SATA. I was pretty sure the Sabrent would work due to their usual diligence about Linux, but I bought the Insignia anyway as a backstop in case the Sabrent was defective, and also because it came with a USB-C to USB-A converter since the Blackbird doesn't currently have any USB-C connectors.

Both devices are USB 3.2 Gen 2 and came up as "SuperSpeed USB" connected to the Blackbird's rear USB ports. The Sabrent is a much nicer unit with high-quality metal construction that folds open and has an integrated heat spreader in the top. The "tool free" part is there's a small clip that rotates to hold the M.2 stick in (with a stopper in the package for smaller-sized sticks). But even though the Insignia was kludgier (pulls out instead of folds open, requires you to stick on a heat spreader, really clumsy turn clip), it supports USB Attached SCSI Protocol; dmesg indicated the Sabrent didn't respond to a UAS probe. If I could have combined the chipset in the Insignia with the case of the Sabrent, we'd have the perfect enclosure.

Both devices also worked in Petitboot — by which I mean having the tainted NVMe SSD plugged in while Petitboot came up would also crash the Blackbird.

Bringing up Fedora first and then connecting the enclosure after, we next get the T2's root volume up so it can be checked. Because both the Blackbird's boot drive and the T2's boot drive have the volume group name fedora, we'll need to rename the T2's. We list the volume groups with vgdisplay; the T2's VG UUID starts with lO, so the commands are:

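# rename the T2's volume group, picked out by its UUID (starts with lO)
vgrename "$(vgdisplay | grep lO | awk '{print $3}')" tfed
# activate the root logical volume in the renamed group
lvchange -ay /dev/tfed/root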
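
A plain grep for lO will match those two characters anywhere in the vgdisplay output, so a slightly stricter selection may be worth having. This is a sketch assuming the usual "VG UUID" line in vgdisplay's output; `vg_uuid_with_prefix` is a hypothetical helper name.

```shell
# Print the UUID of every volume group whose UUID starts with PREFIX,
# reading vgdisplay-style output on stdin.
# vg_uuid_with_prefix is a hypothetical helper name.
vg_uuid_with_prefix() {
    awk -v prefix="$1" \
        '$1 == "VG" && $2 == "UUID" && index($3, prefix) == 1 { print $3 }'
}

# Usage (as root); check that exactly one UUID comes back before renaming:
#   uuid=$(vgdisplay | vg_uuid_with_prefix lO)
#   vgrename "$uuid" tfed
```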

But xfs_repair /dev/tfed/root wouldn't try to fix it: it said there was a log entry that had to be replayed first. This can be done simply by mounting it, so

mount /dev/tfed/root /mnt
umount /mnt
xfs_repair /dev/tfed/root
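
The same sequence can be wrapped so it can be rehearsed before touching the disk. This is a sketch with two additions that are mine, not the post's: a DRY_RUN switch that prints commands instead of running them, and xfs_repair's -n (no modify) flag for a read-only first look. `run` and `replay_and_check` are hypothetical helper names.

```shell
# Replay the XFS journal by mounting and unmounting, then check the
# filesystem read-only. With DRY_RUN set, commands are printed, not run.
# run and replay_and_check are hypothetical helper names.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

replay_and_check() {
    dev=$1
    mnt=$2
    run mount "$dev" "$mnt" || return 1    # mounting replays the log
    run umount "$mnt" || return 1
    run xfs_repair -n "$dev"               # -n: report problems, change nothing
}

# Usage (as root):
#   DRY_RUN=1; replay_and_check /dev/tfed/root /mnt   # rehearse
#   unset DRY_RUN; replay_and_check /dev/tfed/root /mnt
```

If the read-only pass reports problems, a plain xfs_repair as in the post would be the next step.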

This showed no errors, so I deactivated the root LV again with lvchange -an /dev/tfed/root, disconnected the NVMe stick, put it back in its PCIe carrier and reinstalled it in the T2. Petitboot didn't crash, but Fedora requires the volume group be named fedora, so we enter the Petitboot shell first and finish up with

vgrename "$(vgdisplay | grep lO | awk '{print $3}')" fedora

and then boot.

Whose bug was this? Well, arguably, Fedora might not have properly unmounted the drive after the update, but the error appears to be minor in that simply mounting the drive (with a later kernel, admittedly) fixed up the issue. It's more important that Petitboot have a stable, well-tested codebase, so the decision to use an older kernel (though 5.5 is a little excessive) is not an unreasonable one, and this older kernel appears not to be able to do that kind of recovery.

But if Petitboot can't do it, it shouldn't just brick the system. There should be a way for a user to hold down a key and bypass the menu without mounting anything, and try to recover in the shell at that point, which you can do from the console. Similarly, if it barfs on a filesystem or an installed device, it should simply say so and ignore it, not panic. These computers are just too expensive to have vomit everywhere when something goes wrong — and you shouldn't have to have a whole second system around to clean up the mess.

Linux 6.1

I'm a little behind on stuff since I'm waiting for parts to get my T2 booting again (doing everything on my Mac laptop and my long-suffering Quad G5), but kernel version 6.1 came out, and there's some really good stuff on Power to mention.

But first, the marquee general improvements. One is general support for Rust in the kernel; Rust itself is now fairly mature on Power ISA (every Firefox build I make has it) and has obvious security benefits, assuming you're on a platform it supports, that is. The other change I think is a big one, possibly even bigger than Rust support, is the enhanced multi-generational LRU (Least Recently Used) memory page evictor: it's not on by default, but it ships as a configurable option, and some of the reports show some impressive performance wins. Finally, the new implementation of in-kernel maple trees means better cache hit rates and less lock contention for those kernel structures reimplemented with them (if you're 64-bit and have an MMU, which naturally we do), and I know people will appreciate the updates to AMD GPU support.
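
Since the multi-generational LRU is off by default, it may help to see how to check it at runtime. This is a minimal sketch based on the sysfs control file described in the kernel's multi-gen LRU admin guide (/sys/kernel/mm/lru_gen/enabled), which isn't mentioned in the text above; `mglru_status` is a hypothetical helper name.

```shell
# Report whether the multi-gen LRU is available and switched on.
# mglru_status is a hypothetical helper; the optional argument overrides
# the sysfs path, which is only useful for testing.
mglru_status() {
    ctl=${1:-/sys/kernel/mm/lru_gen/enabled}
    if [ ! -r "$ctl" ]; then
        # No control file: older kernel, or built without CONFIG_LRU_GEN.
        echo "unavailable"
        return 0
    fi
    case "$(cat "$ctl")" in
        0x0000) echo "disabled" ;;
        *)      echo "enabled" ;;
    esac
}

# Usage:
#   mglru_status
#   echo y > /sys/kernel/mm/lru_gen/enabled    # as root: turn everything on
```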

However, the Power-specific improvements are particularly interesting. If you're using the POWER9-and-up radix MMU (not available on the POWER8, nor if you need to use KVM-PR), there's now the option of execute-only mapping as opposed to read-execute. (Thanks to Russell Currey for the correction: HPT already has this support.) Another important Power improvement is full support for 64-bit Power ISA under both hashed and radix MMUs with KFENCE, a "low-overhead sampling-based memory safety error detector of heap use-after-free, invalid-free, and out-of-bounds access errors." Interestingly, 32-bit PowerPC was supported first!
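
Whether a given kernel build actually has KFENCE turned on can be read from its config file. This is a sketch assuming the config lives at /boot/config-$(uname -r), as it does on Fedora; `kconfig_value` is a hypothetical helper name, and CONFIG_KFENCE and CONFIG_KFENCE_SAMPLE_INTERVAL are the upstream option names.

```shell
# Print the setting of a kernel config option (y, m, a value, or "not set").
# kconfig_value is a hypothetical helper; the config path defaults to
# /boot/config-<running kernel> and can be overridden by a second argument.
kconfig_value() {
    opt=$1
    cfg=${2:-/boot/config-$(uname -r)}
    val=$(sed -n "s/^${opt}=//p" "$cfg")
    echo "${val:-not set}"
}

# Usage on a 6.1 kernel:
#   kconfig_value CONFIG_KFENCE
#   kconfig_value CONFIG_KFENCE_SAMPLE_INTERVAL
```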

I'm most impressed, though, with the exceptionally hard but worthy work done to rework system calls to use the new shared syscall wrapper implemented for s390, arm and x86 and obsolete the old legacy layer. This causes syscall handlers to take their parameters off the stack rather than relying on the state of the argument registers and r0, which is an obvious benefit if the registers are already on stack (such as for exception handling) because an additional stack frame wouldn't be needed, and further offers the opportunity to zero or sanitize them to prevent them from being used as a means to influence speculative execution (where expedient, and likely coming in 6.2). This has at most a minor performance boost, but it seems to be a definite security and maintainability gain, and best of all the new wrappers work on all PowerPC and Power ISA CPUs except the IBM Cell.

Expect to see it soon in Fedora and other leading-edge builds, and trickling down to other distros near you (full change list).