
Fedora 38


Fedora 38 is out, a week early for a change. Fedora matters to us here at Orbiting Floodgap HQ because it's what we run on our Talos II and Blackbird systems, and it should matter to you because, being a bleeding-edge distro, it is where changes land first before they trickle down to other distributions. That's why we make efforts to do mini-reviews of each release. With F38's release, F36 will be End of Life in one month.

The changeset for 38 is typically extensive. Possibly the most controversial was the change to globally build with -fno-omit-frame-pointer to facilitate better profiling and debugging, particularly where debugging information is not available, but at a cost as this also takes a register out of circulation to hold the frame pointer. The performance impact seems to be limited on x86_64 but I doubt much testing was done on ppc64le, and it should be noted that PowerPC is one of the gcc targets where leaf functions wouldn't use a frame pointer anyway. Time will tell if this pays off. Builds are also now made with _FORTIFY_SOURCE=3 (up from 2) for better security, and another interesting though probably irrelevant change for most is reducing the shutdown timer in systemd to 45 seconds from 2 minutes.

On the back-end F38 ships with kernel 6.2.x and gcc 13, LLVM 16, gmake 4.4, binutils 2.39, glibc 2.37 and gdb 12.1. F38 also has a major upgrade to microdnf as dnf5, the "future of package management" that may ultimately replace dnf entirely. On the front-end F38 updates GNOME to version 44, finally with grid thumbnail view in the file picker, a big overhaul to the Settings app and many new applications, as well as more apps moving to the unthemable libadwaita (but I run KDE Plasma now, and haven't looked back). Xfce also updates to 4.18, there's a new spin for the Sway window manager, and the SDDM display manager now also defaults to Wayland (we use a text boot to log in and start X11 manually, avoiding any display manager completely).

This is the first release to include the change that, by default, blocks clients with different endianness from connecting to the X server, including XWayland, which means that the compositor has to support the configurable option too (GNOME 44 Mutter does, others may not). At least you still have the option!
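For anyone wondering how the server can tell, the first byte of an X11 client's connection setup request announces its byte order: 'B' (0x42) for most-significant-byte first, 'l' (0x6C) for least-significant-byte first. The following is only a sketch of the idea, not the X server's actual code; the function names and the allow_byte_swapped parameter are made up for illustration and stand in for whatever configurable option your server or compositor exposes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-order markers from the X11 protocol's connection setup request. */
enum { X11_MSB_FIRST = 0x42 /* 'B' */, X11_LSB_FIRST = 0x6C /* 'l' */ };

/* Marker a client built for this host would normally send. */
unsigned char host_x11_byte_order(void)
{
    const uint16_t probe = 0x0102;
    unsigned char first;

    /* Look at the lowest-addressed byte of a known 16-bit value. */
    memcpy(&first, &probe, 1);
    return first == 0x01 ? X11_MSB_FIRST : X11_LSB_FIRST;
}

/* Hypothetical admission policy: same-endian clients always get in,
 * opposite-endian clients only if the (assumed) option is turned on.
 * Returns 1 if admitted, 0 if refused. */
int client_admitted(unsigned char client_marker, int allow_byte_swapped)
{
    if (client_marker != X11_MSB_FIRST && client_marker != X11_LSB_FIRST)
        return 0; /* not a valid byte-order marker at all */
    if (client_marker == host_x11_byte_order())
        return 1;
    return allow_byte_swapped != 0;
}

/* Small self-check showing how the two helpers fit together. */
void demo_byte_order_policy(void)
{
    unsigned char host = host_x11_byte_order();
    unsigned char other =
        (host == X11_MSB_FIRST) ? X11_LSB_FIRST : X11_MSB_FIRST;

    assert(client_admitted(host, 0));   /* native client: fine */
    assert(!client_admitted(other, 0)); /* swapped client: refused */
    assert(client_admitted(other, 1));  /* ...unless explicitly allowed */
}
```

On a little-endian ppc64le install, a big-endian client (say, one displaying remotely from an old Power Mac) is the "other" case here, so it is the option that decides whether it connects.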

We'll give the mirrors a week or two to catch up on builds and then start the transition on our own machines, with the usual mini-review to follow. Stay tuned.

FreeBSD 13.2


And hot on the heels of the latest OpenBSD release is the latest FreeBSD iteration, 13.2-RELEASE. FreeBSD has a longer track record on OpenPOWER and by my cursory estimate is the most commonly installed BSD on modern Power ISA. One big jump is that the bhyve hypervisor now supports more than 16 virtual CPUs and by default can create the same number of vCPUs as physical CPUs, which is quite useful to us once you get away from the smallest single-4 machines given all our cores are SMT-4. Additionally, for those of you running FreeBSD in a VM (such as an LPAR or under KVM), nested POWER9 radix MMU mappings are now supported on the pseries flavour, substantially reducing hypercall overhead. The Linux compatibility ABI has also been expanded, and on the security side ASLR is now enabled for all 64-bit executables by default, configurable through proccontrol. Downloads are available for big-endian and little-endian. Note that the release notes indicate that, for now, all PowerPC and Power ISA systems must run kldxref /boot/kernel manually after successfully installing an upgraded kernel and world.

OpenBSD 7.3


OpenBSD 7.3 is released. While most of the improvements are not specific to Power ISA, there's a lot we benefit from, including many kernel calls which are now "lock-free" (improving SMP performance) like mmap(2) and select(2), more device support, immutable permissions on address ranges to prevent permissions from being changed in the future — much of a running program's static address space like stack, code and most libraries is now automatically immutable — and support for execute-only memory on both Power ISA and the PowerPC 970 ("G5"). LibreSSL is updated to 3.7.2, OpenSSH is updated to 9.3, and the OS ships with LLVM/clang 13.0.0 and Perl 5.36.0. Download and install when ready, Puffy.

Firefox 111 on POWER


This got a bit delayed due to $DAYJOB interfering with my important hacking and writing time (darn having to make a living), but Firefox 111 is out. As usual you'll need to deal with bug 1775202 either with this patch — but without the line containing desktop_capture/desktop_capture_gn, since that's been gone since the latest WebRTC update — or put --disable-webrtc in your .mozconfig if you don't need WebRTC. The workaround adding #pragma GCC diagnostic ignored "-Wnonnull" to js/src/irregexp/imported/regexp-parser.cc for optimized builds fortunately was addressed by bug 1810584, so you no longer need it, and the browser otherwise builds and works with the PGO-LTO patch for Firefox 110 and the .mozconfigs from Firefox 105.

Now your LLaMa is playing with POWER


Now that the invasion of the large language models has occurred and we will all bow to our GPT overlords, I just generated a pull request to add additional POWER9-specific optimizations to llama.cpp, which is what all the cool kids who aren't down with OpenAI are using for LLMs. This repo moves quickly but it's where the magic is happening if this is what you're into. It will work with both Alpaca and LLaMa models.

In a previous article we talked about autovectorization using conversion of Intel vector intrinsics to POWER9, but this is good old-fashioned assembly code and hand-written C. The part that really helped was changing their pure-C "F16" (half-precision) float conversion code to use VSX instead. The rolls-off-your-tongue POWER9-and-up xscvhpdp and xscvdphp instructions convert half-precision floats to and from double-precision respectively (xscvdphp will also work on single-precision, which is handy, because the explicit conversion is from single-precision "F32"), and we also use POWER8 mffprd and mtfprd for toll-free copies between general and float registers without requiring a spill to memory. That change alone is about 12 percent faster than the old pure-C compute and lookup code.

We also have our own vectorized version of quantize_row_q4_0, like the ARM NEON and AVX-256 ones, written with VMX/VSX intrinsics. It's even a little better, because we were able to use our VMX floating-point multiply-add and remove a couple of minor inefficiencies in the code. People used to G4 and G5-era AltiVec will also enjoy the fact that the newer intrinsics substantially map directly to ARM's, making conversion much more straightforward when you need to write your own code. I especially liked vec_extract as an all-purpose replacement for all of the NEON vget_lane_* variations, vec_signed for vcvtq_s32_f32 for converting floats in place, and the all-purpose simplified vec_splats for making a splat vector out of anything.
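To give a feel for what the hardware F16 conversion replaces, here is a rough sketch of a portable half-precision to single-precision conversion in plain C. This is not llama.cpp's actual code and not what is in the pull request; half_to_float is a name I made up for illustration, and it only assumes the usual IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). Every branch and shift below is work that a single xscvhpdp does for you.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: convert IEEE 754 binary16 bits to a float. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16; /* move sign to bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;             /* 5-bit exponent */
    uint32_t man  = h & 0x3FFu;                    /* 10-bit mantissa */
    uint32_t bits;
    float f;

    if (exp == 0) {
        if (man == 0) {
            bits = sign; /* signed zero */
        } else {
            /* Subnormal half: shift until the implicit 1 appears,
             * adjusting the exponent to match. */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign
                 | ((uint32_t)(127 - 15 + 1 - shift) << 23)
                 | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13); /* infinity or NaN */
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 127 - 15) << 23) | (man << 13);
    }

    memcpy(&f, &bits, sizeof f); /* reinterpret the bits as a float */
    return f;
}

/* A few known values as a sanity check. */
void demo_half_to_float(void)
{
    assert(half_to_float(0x3C00) == 1.0f);
    assert(half_to_float(0xC000) == -2.0f);
    assert(half_to_float(0x7BFF) == 65504.0f); /* largest finite half */
}
```

A lookup table indexed by all 65,536 possible half values avoids the branches at the cost of cache pressure, which is the "compute and lookup" trade-off mentioned above; the VSX instructions sidestep both.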

I did play with alpaca.cpp, the other older white meat, and the changes here should more or less apply to that codebase as well. However, given how quickly llama.cpp evolves and the greater development interest, llama.cpp seems the best way forward for continued evolution.

I will say in the spirit of full disclosure that despite these improvements my 16GB 4P/4E/8G M1 MacBook Air still pops out tokens several times faster than this 64GB dual-8 Talos II, even full-tilt with all 64 threads in use (the cat still looks startled every time the fans rev). On the other hand, we're also comparing a 2017 CPU with one from 2020, and one with specific hardware acceleration for neural networks that llama.cpp takes particular advantage of. Even with Power10's improved bfloat16 support and matrix math operations, specific work would be needed to support those features, and that work won't be coming from me (stay tuned for Power11, I guess). There are other opportunities for vectorization, though at the rate this code base evolves it would be better to wait for one of the mainstream architectures to pick up a SIMD version we can convert first. In the meantime, while you should be advised that going beyond the 7B or 13B models will require patience regardless of how much RAM you have, I think this is definitely better than what we started with.

Firefox 110 on POWER


Firefox 110 is out, with graphics performance improvements like GPU-accelerated 2D canvas and faster WebGL, and the usual under the hood updates. The record's still broken and bug 1775202 still is too, so you'll either need this patch — but this time without the line containing desktop_capture/desktop_capture_gn, since that's gone in the latest WebRTC update — or put --disable-webrtc in your .mozconfig if you don't need WebRTC at all. I also had to put #pragma GCC diagnostic ignored "-Wnonnull" into js/src/irregexp/imported/regexp-parser.cc for optimized builds to complete on this Fedora 37 system and I suspect this is a gcc bug; you may not need it if you're not using gcc 12.2.1 or build with clang. Finally, I trimmed yet another patch from the PGO-LTO diff, so use the new one for Firefox 110 and the .mozconfigs from Firefox 105.