Showing posts from August, 2020

Firefox 80 on POWER

Firefox 80 is available, and we're glad it's here considering Mozilla's recent layoffs. I've observed in this blog before that Firefox is particularly critical to free computing, not just because of Google's general hostility to non-mainstream platforms but also the general problem of Google moving the Web more towards Google.

I had no issues building Firefox 79 because I was still on rustc 1.44, but rustc 1.45 asserted while compiling Firefox, as reported by Dan Horák. This was fixed with an LLVM update, and with Fedora 32 up to date as of Sunday and using the most current toolchain available, Firefox 80 built out of the box with the usual .mozconfigs.

Since there was a toolchain update, I figured I would try link-time optimization again (export MOZ_LTO=1 in your .mozconfig), since a few releases had elapsed since my last failed attempt. This added about 15 minutes of build time to an optimized build on the dual-8 Talos II, and part of that was spent with the fans screaming, since it seemed to ignore my -j24 to make and just took over all 64 threads. However, not only does it build successfully, I'm typing this post in it, so it's clearly working. A cursory benchmark with Speedometer 2.0 indicated LTO yielded about a 4% improvement over the standard optimized build, which is not dramatic but is certainly noticeable. If this continues to stick, I might try profile-guided optimization for the next release. The toolchain on this F32 system is rustc 1.45.2, LLVM 10.0.1-2, gcc 10.2.1 and GNU ld.bfd 2.34-4; your mileage may vary with other versions.

There's not a lot new in this release, but WebRender is still working great with the Raptor BTO WX7100, and a new feature available in Fx80 (since Wayland is a disaster area without a GPU) is Video Acceleration API (VA-API) support for X11. The setup is a little involved. First, make sure WebRender and GPU acceleration are up and working with these prefs (set or create):

gfx.webrender.enabled true
layers.acceleration.force-enabled true

Restart Firefox and check in about:support that the video card shows up and that the compositor is WebRender, and that the browser works as you expect.

VA-API support requires EGL to be enabled in Firefox. Shut down Firefox again and bring it up with the environment variable MOZ_X11_EGL set to 1 (e.g., for us tcsh dweebs, setenv MOZ_X11_EGL 1 ; firefox &, or for the rest of you plebs using bash and descendants, MOZ_X11_EGL=1 firefox &). Now set (or create):

media.ffmpeg.vaapi-drm-display.enabled true
media.ffmpeg.vaapi.enabled true
media.ffvpx.enabled false

The idea is that VA-API will direct video decoding through ffmpeg and theoretically obtain better performance; this is the case for H.264, and the third setting makes it true for WebM as well. This sounds really great, but there's kind of a problem:

Reversing the last three settings fixed this (the rest of the acceleration seems to work fine). It's not clear whose bug this is (ffmpeg, or something about VA-API on OpenPOWER, or both, though VA-API seems to work just fine with VLC), but either way this isn't quite ready for primetime yet on our platform. No worries since the normal decoder seemed more than adequate even on my no-GPU 4-core "stripper" Blackbird. There are known "endian" issues with ffmpeg, presumably because it isn't fully patched yet for little-endian PowerPC, and I suspect once these are fixed then this should "just work."
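If you would rather not flip those three prefs by hand each time, here is a minimal sketch that renders them as user.js lines. Keeping them in a profile's user.js is my assumption, not something tested on this setup, and the helper name render_user_js is made up for illustration. Passing False gives the "reversed" values described above.

```python
# Sketch only: render the three VA-API prefs from this post as user.js lines.
# Using user.js (Firefox's per-profile pref override file) is an assumption;
# the post itself sets these prefs by hand.

# Values as listed in the post for turning VA-API decoding on.
VAAPI_PREFS = {
    "media.ffmpeg.vaapi-drm-display.enabled": True,
    "media.ffmpeg.vaapi.enabled": True,
    "media.ffvpx.enabled": False,
}


def render_user_js(enable_vaapi: bool) -> str:
    """Return user.js text; enable_vaapi=False reverses all three prefs."""
    lines = []
    for name, value in VAAPI_PREFS.items():
        effective = value if enable_vaapi else not value
        text = "true" if effective else "false"
        lines.append('user_pref("%s", %s);' % (name, text))
    return "\n".join(lines) + "\n"
```

Printing render_user_js(False) and pasting the result into user.js in a test profile should be enough to back the change out again.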

In the meantime, the LTO improvement with the updated toolchain is welcome, and WebRender continues to be a win. So let's keep evolving Firefox on our platform and supporting Mozilla in the process, because it's supported us and other less common platforms when the big 1000kg gorilla didn't, and we really ought to return that kindness.

POWER10 sounds really great, but ...

IBM took the wraps off POWER10 officially today, a (Samsung-manufactured) 7nm monster in 18 layers with up to 15 SMT-8 cores (120 threads) with 2MB of L2 per core, up to 120MB of L3, 1 TB/s memory access, OpenCAPI and PCIe 5. New on board are an embedded matrix math accelerator for specialized AI performance, support for multipetabyte memory clusters, and transparent memory encryption with four times as many AES engines as POWER9. Overall, IBM is touting that the processor is three times more energy efficient than POWER9 while being up to twice as fast at scalar and four times as fast at vector operations. General availability is announced for Q3 or Q4 of 2021.

First of all: damn. This sounds sweet. The dual-8 POWER9 Talos II under the desk with "just" 64 threads and PCIe 4 is already giving me sorrowful Eeyore eyes even though there's no guarantee what, if any, lower-end systems suitable as workstations will be available when the processor is. But what we do know right now is that Raptor has said there won't be POWER10 systems, and as it stands presently nobody else is making workstation-class OpenPOWER machines. Raptor, probably for reasons of NDAs, is playing this close to the vest, so what follows is merely my variably informed personal conjecture and may be completely inaccurate.

One of the truly incredible things about OpenPOWER — or at least POWER8 and POWER9 — is how far down you can see what the hardware is doing. In previous articles, we looked at emulating OpenPOWER at the bare metal level, and then even writing your own firmware bootkernel. But the bootloader and high-level firmware are really only the beginning: the build image created by op-build not only contains the Petitboot bootloader, but its Skiroot filesystem, Skiboot (containing OPAL, the OpenPOWER Abstraction Layer, which handles PCIe, interrupt and operating system services), Hostboot (which initializes and trains RAM, buffers and the bus), and the Self-Boot Engine which initializes the CPUs. Even the fused-in first instructions the POWER9 executes from its OTPROM to run the Self-Boot Engine are open source, and other than the OTPROM itself (it is a One-Time Programmable ROM, after all), everything is inspectable and changeable. And before the POWER9 executes those very first instructions, the Baseboard Management Controller that powers the system on has its own open firmware too. You know what your computer is doing, and you don't have to trust anyone's firmware build if you don't want to because you can always build and flash the system yourself.

Contrast this against the gyrations that x86 "open" systems have to struggle with. Do not interpret this as a slam against vendors like System76 or Purism because they're doing the best they can to deliver the most frequently used architecture in workstations and servers, in as unlocked a fashion as possible from processor manufacturers who are going in exactly the opposite direction. And there have been great improvements in untangling the tendrils of the Intel Management Engine from the processor, primarily through Coreboot's steady evolution. But even with these improvements where significant portions of the Intel ME are disabled, secret sauce is still needed to bring up the CPU and you have to trust that the sauce is only and specifically doing what it says it is, in addition to the other partitions of the ME which, activated or not, are still not fully understood. The situation is even worse for AMD Ryzen processors with the Platform Security Processor, which (at least the 3000 and 4000 variants) aren't presently supported by Coreboot at all, though System76 is apparently working on a port.

Don't just take my word for it: as of this writing no recent x86 system appears on the FSF Respects Your Freedom list, but the Talos II and T2 Lite both do (and I imagine the Blackbird is soon to follow). The Vikings D8 is indisputably libre, and has an FSF RYF certification, but is an AMD Opteron 4200, which is about eight or nine years old. As it stands I believe this is the most powerful x86 system still available on the FSF RYF list now that the D16 is out of production (Opteron 6200).

I think there's a reasonable argument to be had about how "open" something needs to be to be considered "libre" and at what point you could be considered to have meaningful control of your machine, but there's no denying there are aspects of modern x86 machines which you are prohibited by policy from getting into, and that means putting more faith in the processor vendor than they may truly deserve. (Don't get me started on GPUs, either. Or, for that matter, ARM.) Again, Raptor won't say, but their public disenchantment with POWER10 suggests that some aspects of the processor firmware stack are not open. This is a situation which is no better than x86, and I'm hoping this is merely an oversight on IBM's part and not a future policy.

To be effective, OpenPOWER needs to be more open than just the ISA being royalty-free, even though that's huge. To be sure, I think there has to be room for processor manufacturers to distinguish themselves in the market or you run the risk of a race to the bottom where people simply rip off designs (this is, I think, a real concern for RISC-V). I think sharing reference designs is necessary to get systems bootstrapped but I can't deny there's money in high performance applications, and high performance microarchitecture demands a return on investment to justify development costs. Similarly, to the extent that any pack-in hardware (like POWER9's Nest Accelerators) isn't part of the open ISA and is a separately managed device that simply shares a die, to me it seems logical to also make it part of how a processor manufacturer can stand out to potential customers.

But the firmware absolutely needs to be as clean and available as the ISA. If the ISA is open and the instructions the CPU is running are part of that open standard, then any firmware components, which (ought to) entirely consist of those instructions, must be open too. If the CPU has pack-in hardware on the die that isn't part of the open ISA, then you should be able to bring up the chip without it. The standard that was set for current OpenPOWER should be the same standard for POWER10 or it doesn't really deserve the OpenPOWER name, and I'm worried that Raptor's insinuations imply IBM's standard isn't the same. Similarly, arguing that the currently incomplete situation with x86 is functionally equivalent to OpenPOWER (or, for that matter, RISC-V) may be well-intentioned but is disingenuous. The FSF may be ideologues on binary blobs, but that doesn't make their position wrong, and the entire OpenPOWER ecosystem from IBM on down should recognize how much goodwill and prominence the openness of POWER8 and POWER9 has generated for the community.

I hope I'm wrong, but I'm concerned I'm not. Let's make sure we get POWER10 right or we won't be practicing what we preach, and that's going to kill us in the crib.

Vikings' upcoming OpenPOWER retail channel

Many Talospace readers are familiar with Vikings, who offer libre hosting as well as hardware sales for libre-friendly devices and systems and peripherals certified by the FSF Respects Your Freedom program (for which Raptor systems qualify). However, the Vikings storefront now shows a new tab for OpenPOWER hardware, hopefully a public demonstration of a new retail channel coming soon for those ready to pull the trigger on an OpenPOWER workstation or server of their own. This is particularly of value to our readers outside North America, since this gets around a lot of the inconveniences of shipping and payment with United States businesses; Vikings is based in Germany, and accepts payments in euros, US dollars, British pounds, Australian dollars and New Zealand dollars. We have also already heard that Vikings is working on a water-cooling system for POWER systems with an aim to reach the market in two months or less, a great option for people trying to run the 18 and 22-core parts in desktop environments (current BTO cooling options are air-cooled only).

It is not yet known whether Vikings will sell full systems, parts and/or processors, whether the lineup will include OpenPOWER systems other than Raptor workstations and servers, or when general availability is expected. Still, the more retail options there are, the greater the volume of sales and the greater the economies of scale that will result. In the end, that can only be a good thing for growing our niche but very important market.

Will it build?

While I will always be big-endian at heart, ppc64le does get around a lot of the unfortunately pervasive endian assumptions in a cold blackboxed x86_64 world, and even things like MMX, SSE and SSSE3 can be automatically translated in many cases. It is therefore a happy result that even many software packages completely unaware of ppc64le will still build and function out of the box, assuming they don't do silly things like emit JITted x86_64 assembly code and try to run it, etc.

I ran across this project the other day which has over 1,000 build scripts for ppc64le (as shell scripts and/or Docker files) that you can either use directly, or as a hint whether your intended build will even work. Cursorily paging through a few I see IBM E-mail addresses, so no surprise much of it is tested on Red Hat (though largely RHEL 7.x), but there are also Ubuntu scripts there as well and I imagine they'd accept other distros. Keep in mind that this is generic ppc64le, so it would work on POWER8 and up but without any special optimizations (for example, I always build optimized Firefox at -O3 -mcpu=power9), and the concentration favours server-side packages more than workstation and client software. I also see relatively few platform-specific corrections, which could be either good (they weren't needed) or bad (they weren't tested). Still, it's nice to see more resources to aid porting and platform compatibility, and that can only in turn get more packages thinking about making ppc64le (and hopefully ppc64) a first-class citizen too.

Linux 5.8 on POWER

The 5.8 kernel milestone has arrived, with improvements to reduce thrashing (though with the amount of memory even a Blackbird can hold, there's no excuse not to load these suckers up), an API for receiving notifications of kernel events, support for hardware-assisted inline encryption at the block layer for storage devices and a nice convenience feature where you can put sysctl.something.or.other=999 right on the kernel command line.
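To check which sysctls a machine was actually booted with, something like the following sketch would do. It only picks whitespace-separated sysctl.name=value tokens out of a command-line string; the function name is mine, vm.swappiness is just an example setting, and quoted values containing spaces are not handled.

```python
# Sketch: list the sysctl.* settings present on a kernel command line.
# Assumes plain whitespace-separated tokens; quoted values are not handled.

def sysctl_boot_params(cmdline: str) -> dict:
    """Map sysctl names (without the 'sysctl.' prefix) to their string values."""
    prefix = "sysctl."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


# Example with a made-up command line; on a running system you would
# pass in the contents of /proc/cmdline instead.
example = "root=/dev/sda2 ro sysctl.vm.swappiness=40 quiet"
print(sysctl_boot_params(example))
```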

On the Power ISA side, this kernel adds the first support for POWER10 and ISA 3.1, although our Raptor contacts have indicated some displeasure with IBM's management decisions and we suspect this is a way of saying firmware binary blobs might be required to enable maximal performance (though we don't know, and it's unclear how much is under NDA). Another nice feature is an ioctl to send gzip compression requests directly to POWER9's on-chip compression hardware via /dev/crypto/nx-gzip. This is part of the general family of Nest Accelerators (NXes) accessible through the Virtual Accelerator Switchboard. More about that in a later article, but in the meantime while we wait for compressors to add this support, here's an accelerated power-gzip userspace library that directly replaces zlib.
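As a quick sanity check before trying any of this, here is a small sketch that tests whether the device node is present. It does not open a window on the accelerator or submit any work, and it says nothing about permissions; it only checks that the path given above exists and is a character device. The helper name is my own.

```python
# Sketch: check for the NX gzip device node mentioned in the post.
# This only tests for presence; it does not talk to the accelerator.
import os
import stat

NX_GZIP_DEVICE = "/dev/crypto/nx-gzip"  # path as given in the post


def nx_gzip_available(path: str = NX_GZIP_DEVICE) -> bool:
    """Return True if path exists and is a character device."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return stat.S_ISCHR(mode)


print("nx-gzip device present:", nx_gzip_available())
```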

Finally, in addition to various improvements for the 40x and 8xx series, the most interesting commit was around prefixed instructions. These represent the first 64-bit instructions in the Power ISA (here's a code sample to show you the encoding) and allow much bigger 34-bit displacements for load-store operations than the 16-bit ones in current 32-bit instructions. I'm not too wild about the fact this makes Power ISA technically variable-length, but these D-form instructions are easy to identify and they are always 64 bits in size, and they should make certain types of code generation a lot simpler on chips that support them.
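To make the layout concrete, here is a rough sketch of how I read the Power ISA 3.1 encoding for one prefixed instruction, paddi: a prefix word with primary opcode 1 carrying the upper 18 bits of the immediate, followed by a suffix word that looks like an ordinary addi carrying the lower 16. The function names are my own, and the field positions should be checked against the ISA document before you rely on them.

```python
# Sketch of the paddi encoding as I understand Power ISA 3.1; verify
# the field positions against the spec before using this for real work.

PREFIX_OPCODE = 1  # primary opcode of every prefix word
MLS_TYPE = 2       # prefix type used by paddi
ADDI_OPCODE = 14   # the suffix reuses addi's primary opcode


def encode_paddi(rt: int, ra: int, imm: int, pc_relative: bool = False):
    """Return (prefix, suffix) 32-bit words for paddi rt,ra,imm."""
    if not (0 <= rt < 32 and 0 <= ra < 32):
        raise ValueError("register numbers must be 0..31")
    if not -(1 << 33) <= imm < (1 << 33):
        raise ValueError("immediate does not fit in 34 bits")
    if pc_relative and ra != 0:
        raise ValueError("RA must be 0 for the PC-relative form")
    imm &= (1 << 34) - 1   # two's complement, 34 bits wide
    d0 = imm >> 16         # upper 18 bits go in the prefix
    d1 = imm & 0xFFFF      # lower 16 bits go in the suffix
    prefix = (PREFIX_OPCODE << 26) | (MLS_TYPE << 24) | (int(pc_relative) << 20) | d0
    suffix = (ADDI_OPCODE << 26) | (rt << 21) | (ra << 16) | d1
    return prefix, suffix


def is_prefix_word(word: int) -> bool:
    """A word starts a prefixed instruction if its primary opcode is 1."""
    return (word >> 26) & 0x3F == PREFIX_OPCODE


prefix, suffix = encode_paddi(3, 0, 0x12345678)
print("%08x %08x" % (prefix, suffix))
```

The is_prefix_word check is the "easy to identify" part: a decoder only has to look at the top six bits of a word to know whether another 32 bits follow.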