
Updates: Fedora 33, FreeBSD 12.2, Ubuntu 20.10


Hot on the heels of Ubuntu 20.10 and 20.04.1 LTS (download the server flavour and convert it to desktop if you like) comes Fedora 33. Ubuntu 20.10 upgrades to kernel 5.8, GNOME 3.38, QEMU 5 and OpenStack Victoria, with an installer fix for OpenPOWER; Fedora 33 also remains on kernel 5.8 (5.9 likely to follow), includes GNOME 3.38, glibc 2.32 and LLVM 11, and now defaults to btrfs on Workstation (watch out if you switch to a 4K page size: Fedora uses 64K pages, and btrfs filesystems created on one page size are not currently compatible with the other). As previously mentioned, Fedora is important to me personally because it's what I run on my own T2 and Blackbird, so once the packages and late-breaking changes settle down I will do a mini-review (as I did for F32), but the change I've been waiting for (128-bit long doubles) is still not in F33 as they wait on glibc changes (maybe glibc 2.33).

And if you like your OpenPOWER systems but don't like Linux, FreeBSD 12.2 is out as well, with multiple security, bugfix and functionality upgrades for a wide variety of PowerPC and OpenPOWER-based systems. Big-endian is well-tested and little-endian is coming along (and snapshots should finally be in -CURRENT by the time you read this).

Firefox 82 on POWER goes PGO


You'll have noticed this post is rather tardy, since Firefox 82 has been out for the better part of a week, but I wanted to really drill down on a couple of variables in our Firefox build configuration for OpenPOWER and also see if it was time to blow away a few persistent assumptions.

But let's not bury the lede here: after several days of screaming, ranting and scaring the cat with various failures, this blog post is finally being typed in a fully profile-guided and link-time optimized Firefox 82 tuned for POWER9 little-endian. Although it multiplies compile time by nearly a factor of 3 and the build process can intermittently consume a terrifying amount of memory, the PGO-LTO build is roughly 25% faster than the LTO-only build, which was already 4% faster than the "baseline" -O3 -mcpu=power9 build. That's worth an 84-minute coffee break! (-j24 on a dual-8 Talos II [64 threads], 64GB RAM.)

The problem with PGO and gcc (at least gcc 10, anyway) is that all the .gcda files end up in the same directory as the built objects in an instrumented build. The build system, which is now heavily clang-centric (despite the docs, gcc is clearly Tier 2, since this and other things don't work), does not know how to handle or transfer the resulting profile data and bombs after running the test load. We don't build with clang because in previous attempts it never managed to fully build the browser on ppc64le and I'm sceptical of its code quality on this platform anyway, but since I wanted to verify against a presumably working configuration I did try a clang build first to see if anything had changed. It breaks fairly early now, interestingly while compiling a Rust component:

4:33.00 error: /home/censored/src/mozilla-release/obj-powerpc64le-unknown-linux-gnu/release/deps/libproc_macro_hack-b7d125d9ae0afae7.so: undefined symbol: __muloti4
4:33.00 --> /home/censored/src/mozilla-release/third_party/rust/phf_macros/src/lib.rs:227:5
4:33.00 227 | #[::proc_macro_hack::proc_macro_hack]
4:33.00    |      ^^^^^^^^^^^^^^^
4:33.00 error: aborting due to previous error
4:33.00 error: could not compile `phf_macros`.

So there's that. I'm not very proficient in Rust so I didn't do much more diagnosis at this point. Back to the hippo gcc.

What's needed is to hack the build system to copy the .gcda files generated during profiling out of instrumented/ into the regular build tree for the actual (second) build phase, which is essentially the solution proposed in bug 1601903, except without any explanation of how you actually do it. The PGO driver is fortunately a standalone Python script, so I decided to simply hijack that. At the end is code to coalesce the .profraw files from a successful instrumented clang build, which shouldn't run anyway if the compiler is gcc, so I threw in a couple of lines to terminate instead after running this shell script:

#!/bin/csh -f

set where=/tmp/mozgcda.tar

# all on one line yo
# pack up the .gcda profile data left behind in the instrumented build ...
cd /home/censored/src/mozilla-release/obj-powerpc64le-unknown-linux-gnu/instrumented || exit
tar cvf $where `find . -name '*.gcda' -print`
# ... and unpack it into the regular object directory for the second build pass
cd ..
tar xvf $where
rm -f $where
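
For the curious, the detour added to the PGO driver doesn't need to be anything clever. Here's a sketch of the idea in Python (illustrative only, not the actual patch; the gist linked below has the real, trivial changes, and the CC check and script path here are assumptions for the example):

# Hypothetical sketch of the gcc detour near the end of the PGO driver.
import os
import subprocess
import sys

if "gcc" in os.environ.get("CC", ""):
    # Let the stub script pack the .gcda files out of instrumented/ and
    # unpack them into the regular object directory, then stop here so the
    # clang-only .profraw merging below never runs.
    subprocess.check_call([os.path.expanduser("~/src/mozilla-release/gccpgostub.csh")])
    sys.exit(0)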

The stub script repopulates the .gcda files in the right place before we rebuild with the profile data, but because of this subterfuge, gcc thinks the generated profile is not consistent with the source and spams an incredible number of complaints ... which made it difficult to spot the internal compiler error that the profile-guided rebuild triggered. That required yet another rebuild with some tweaks to silence those messages and some other irrelevant warnings (I'll probably upstream at least one of these changes) so I could find the ICE in the scrollback. Fortunately, it was in a test binary, so I just commented it out of the moz.build and the build finally stuck. And so far, it's working impressively well. This may well be the fastest the browser can get while still lacking a JIT.

After all that, it's almost an anticlimax to mention that --disable-release is no longer needed in the build configs. You can put it in the Debug configuration if you want, but I now use --enable-release in optimized builds and it seems to work fine.

If you want to try compiling a PGO-LTO build yourself, here is a gist with the changes I made (they are all trivial). Save the shell script above as gccpgostub.csh in ~/src/mozilla-release and/or adjust paths as necessary, and make sure it is chmodded +x. Yes, there is no doubt a more elegant way to do this in Python itself but I hate Python and I was just trying to get it to work. Note that PGO builds can be exceptionally toolchain-dependent (and ICEs more so); while TestUtf8 was what triggered the ICE on my system (Fedora 32, gcc 10.2.1), it is entirely possible it will halt somewhere else in yours, and the PGO command line options may not work the same in earlier versions of the compiler.

Without further ado, the current .mozconfigs, starting with Optimized. Add ac_add_options MOZ_PGO=1 to enable PGO once you have patched your tree and deposited the script.

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9"
ac_add_options --enable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full

# this is implied by enable-release but left in to be explicit
export RUSTC_OPT_LEVEL=2

Debug

export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9"
ac_add_options --enable-debug
ac_add_options --enable-linker=bfd

export RUSTC_OPT_LEVEL=0

OpenBSD officially available for ppc64


OpenBSD 6.8 is now available, and with it the first official release of the big-endian ppc64 port (which they call powerpc64). The port is specifically advertised for PowerNV machines (i.e., bare metal) with POWER9, which naturally includes the Raptor family but should also cover IBM's own PowerNV systems. POWER8 support is described as "included but untested."

The installation directions are still not fully complete, though Petitboot should be able to start the installer from pretty much any standard medium, and the installer experience should be stock from there. What's more, it looks like a good selection of pre-built packages is available, though some large applications, such as Firefox, are still missing (WebKit is apparently available). The missing packages seem to be similar to what is missing for their 32-bit powerpc flavour, so this is not unexpected.

With OpenBSD's release and FreeBSD's well-regarded history, this leaves only NetBSD — ironically the BSD with the most emphasis on portability, and my personal preference — as the last major cross-platform BSD yet to arrive on OpenPOWER. Given OpenBSD and NetBSD's genetic history, however, this release makes future NetBSD support for OpenPOWER much more likely.

IBM splits


IBM today announced that the company will split into two, moving the Managed Infrastructure Services portion of IBM Global Technology Services into a new cloud-focused corporation tentatively called "NewCo" by the end of 2021. NewCo would also have a greater focus on AI, presumably through a distributed computing model rather than traditional hardware sales. The Technology Support Services piece of GTS that addresses data centre, hardware and software support would remain part of "old" IBM, along with Red Hat and presumably the R&D folks responsible for working on Power ISA like the great people at OzLabs.

It is interesting that this move was predicted as early as February, and a split in itself only means that a combined business strategy no longer makes sense for these units. But chairwoman Ginni Rometty missed the boat on cloud early, and despite the hype in IBM's investor release over the new company, "NewCo" is really the "old" services-oriented IBM, a frequent source of layoffs and cost-cutting manoeuvres over the years, with a fresh coat of paint. There are probably reasons for this, not least of which is their hidebound services-first mentality that wouldn't sell yours truly a brand new POWER7 in 2010 even when I had a $15,000 personal budget for the hardware, because I didn't (and don't: the used POWER6 I bought instead is self-maintained) need their services piece. As a result I apparently wasn't worth the sale to them, which tells you something right there: today's growth is not in the large institutional customers that used to be IBM's bread and butter but rather in the little folks looking for smaller solutions in bigger numbers, and Rometty's IBM failed to capitalize on that opportunity. In my mind, today's split is a late recognition of her tactical error.

Presumably the new company would preferentially use "OldCo" hardware and recommend "OldCo" solutions for their service-driven hybrid buildouts. But "OldCo" makes most of its money from mainframes, and even with robust virtualization options mainframes as a sector aren't growing. Although IBM is taking pains to talk about "one IBM" in their press release, that halcyon ideal exists only as long as either company isn't being dragged down by the other, and going separate directions suggests such a state of affairs won't last long.

What does this mean for us in OpenPOWER land? Well, we were only ever a small part of the equation, and even with a split this won't increase our influence on "OldCo" much. Though IBM still makes good money from Power ISA and there's still a compelling roadmap, we small individual users will need to continue making our voices heard through the OpenPOWER Foundation and others. Even if IBM chooses not to emphasize individual user applications (and in fairness they won't, because we're not where the money is), they should still recognize the public relations and engineering benefits of maintaining an open platform, and not get in the way of downstream vendors like Raptor attempting to capitalize on the "low-end" (relatively speaking) market. If spinning off MIS gets IBM a better focus on hardware and being a good steward of their engineering resources, then I'm all for it.

Where did the 64K page size come from?


Lots of people were excited by the news of Hangover's port to ppc64le, and while there's a long way to go, the fact it exists is a definite step toward improving the workstation experience on OpenPOWER. Except, of course, that many folks (including your humble author) can't run it: Hangover currently requires a kernel with a 4K memory page size, which is the page size of the majority of extant systems (certainly x86_64, which only offers a 4K base page size). ppc64 and ppc64le can certainly run with a 4K page size and some distributions do, yet probably the two most common distributions OpenPOWER users run — Debian and Fedora — default to a 64K page size.
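
(If you're not sure which your kernel is using, getconf PAGESIZE at a shell prompt will tell you; the equivalent check from Python, just as an illustration, is below.)

import os

# Report the kernel's base page size in bytes: 4096 on a 4K kernel,
# 65536 on the 64K kernels Debian and Fedora ship for ppc64le.
print(os.sysconf("SC_PAGE_SIZE"))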

And there are lots of things that glitch, and have glitched, when userspace makes assumptions about this. Besides Hangover, Firefox used to barf on 64K pages (on aarch64 too), and had an issue where binaries built on one page size wouldn't work on systems with a different one. (This also bit numpy.) Go and its runtime used to throw fatal errors. The famous nouveau driver for Nvidia GPUs assumes a 4K page size, and the compute-only binary driver that does exist (at least for POWER8) cheats by making 64K pages out of 16 copies of the "actual" 4K page. btrfs uses an on-disk block size that mirrors the page size of the host on which the filesystem was created, which means a btrfs filesystem made on a 4K page size system won't be readable on a 64K page system and vice versa (this is being fixed, but hasn't been yet).

With all these problems, why have a 64K page size at all, let alone default to it? There must be some reason to use it, because ppc64(le) isn't even unique in this regard; many of the bugs above also affected aarch64, which likewise has a 64K page option. As you might guess, it's all about performance. When a virtual memory page has to be attached to a process or mapped into its address space, a page fault is triggered and has to be handled by the operating system. Sometimes this is minor (the page is already in memory and just has to be attached to the process), sometimes it is major (the page is on disk, or swapped out), but either way a page fault has a cost. 64-bit systems naturally came about because of the need for larger memory address spaces, which benefits big applications like databases and high-performance computing generally, and these were the tasks early 64-bit systems were largely used for. As memory grows, subdividing it into proportionally larger pieces thus becomes more efficient: when an application faults less, it spends more time in its own code and less in the operating system's.
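
To put rough numbers on that (a back-of-the-envelope sketch, not a benchmark), consider how many first-touch faults it takes to populate a 1GB mapping at each page size:

# Back-of-the-envelope: first-touch page faults needed to populate 1 GiB.
GiB = 1024 ** 3
for page_size in (4 * 1024, 64 * 1024):
    faults = GiB // page_size
    print(f"{page_size // 1024}K pages: {faults} faults to touch 1 GiB")
# prints 262144 faults for 4K pages and 16384 for 64K pages

Sixteen times fewer faults for the same memory is nothing to sneeze at when the workload is a big database or an HPC job.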

A second performance improvement afforded by larger pages is higher efficiency from the translation lookaside buffer, or TLB. The TLB is essentially a mapping cache that allows the CPU to quickly find the physical memory page for a given virtual memory address. When the virtual address cannot be found in the TLB, the processor has to walk the entire page table to find the mapping (filling it into the TLB for later), assuming it exists at all. This can be a relatively expensive process if there are many entries to go through, and even worse if the page tables are nested in a virtualized setup. A larger page size not only allows more memory to be handled with a smaller page table, making table walks quicker, but also yields more hits for a TLB of the same size. It is fair to point out that differences in MMU design between processor architectures can magnify the need for this: performance, after all, was the reason POWER9 moved to a radix-based MMU instead of the less cache-friendly hashed page table scheme of earlier Power generations, and x86_64 has a radix tree per process while Power ISA's page table is global. (As an aside, some systems optionally or even exclusively have software-managed TLBs, where the operating system manages the TLB for the CPU and walks the page tables itself. Power ISA isn't one of them, but those architectures in particular would obviously benefit from a smaller page table.)
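
The effect on TLB reach is easy to quantify with the same sort of napkin math (the entry count here is a made-up round number, not any particular CPU's TLB size):

# Illustrative TLB "reach": memory covered without taking a TLB miss.
# The 1024-entry figure is hypothetical, purely for comparison.
TLB_ENTRIES = 1024
for page_size in (4 * 1024, 64 * 1024):
    reach_mib = TLB_ENTRIES * page_size // (1024 * 1024)
    print(f"{page_size // 1024}K pages: TLB covers {reach_mib} MiB")
# prints 4 MiB for 4K pages and 64 MiB for 64K pages

The same hardware covers sixteen times as much address space before it has to walk the page table again.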

64K page sizes, compatibility issues notwithstanding, naturally have a downside. The most important objection relates to memory fragmentation: many memory allocators have page alignment constraints for convenience, which could waste up to the remaining 60K if the memory actually in use would fit entirely within a 4K page instead. On bigger systems with large amounts of memory running tasks that allocate large memory blocks, this excess might be relatively low, but it could add up on a workstation-class system with smaller RAM running a mix of client applications making smaller allocations. In a somewhat infamous rebuttal, Linus Torvalds commented, "These absolute -idiots- talk about how they win 5% on some (important, for them) benchmark by doing large pages, but then ignore the fact that on other real-world loads they lose by sevaral HUNDRED percent because of the memory fragmentation costs [sic]." Putting Linus' opinion into more anodyne terms, if the architecture bears a relatively modest page fault penalty, then the performance improvements of a larger page size may not be worth the memory it can waste. This is probably why AIX, presently specific to ppc64, offers both 4K and 64K pages (and even larger 16MB and 16GB pages) and decides which to offer a process.
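
As a toy illustration of that waste (the request sizes are made up, and it assumes an allocator that rounds every allocation up to a whole page, which real allocators only do for larger requests):

# Toy model: bytes lost to rounding if every request gets whole pages.
requests = [3000, 10000, 70000, 5000]  # bytes actually needed (made up)

def rounding_waste(page_size, sizes):
    # Round each request up to a multiple of page_size and total the slack.
    return sum(-size % page_size for size in sizes)

for page_size in (4 * 1024, 64 * 1024):
    print(f"{page_size // 1024}K pages: {rounding_waste(page_size, requests)} bytes wasted")
# prints 10304 bytes wasted for 4K pages and 239680 for 64K pages

Same four allocations, over twenty times the slack.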

The 4K vs. 64K gulf is not unlike the endian debate. I like big endian and I cannot lie, but going little endian gave Power ISA a larger working software library by aligning with what those packages already assumed; going 4K is a similar situation. But while the performance difference between endiannesses has arguably never been significant, there really are performance reasons for a 64K page size and those reasons get more important as RAM and application size both increase. On my 16GB 4-core Blackbird, the same memory size as my 2005 Power Mac Quad G5, a 4K page size makes a lot more sense than a 64K one because I'm not running anything massive. In that sense the only reason I'm still running Fedora on it is to serve as an early warning indicator. But on my 64GB dual-8 Talos II, where I do run larger applications, build kernels and Firefoxen and run VMs, the performance implications of the larger page size under Fedora may well become relevant for those workloads.

For servers and HPC users, big pages can have big benefits, but for those of us using these machines as workstations I think we need to consider whether the performance improvement outweighs the inconvenience. And while Fedora has generally served me well, the lack of a 4K page option on ppc64le certainly hurts the value proposition for Fedora Workstation on OpenPOWER, since there are likely to be other useful applications that make these assumptions. More to the point, I don't see Red Hat-IBM doubling their maintenance burden to issue a 4K page version, and maintaining a downstream distro is typically an incredibly thankless task. While I've picked on Fedora a bit here, you can throw Debian and others into that mix as well for some of the same reasons. Until other operating systems adopt a hybrid approach like AIX's, the quibble over page size is probably the next major schism we'll have to deal with, because in my humble opinion OpenPOWER should not be limited to the server room, where big pages are king.

It's good to have a Hangover


One of the wishlist items for us OpenPOWER workstation users is better emulation for when we have to run an application that only comes as a Windows binary. QEMU is not too quick, largely because of its overhead, and although Bochs is faster in some respects it's worse in others and doesn't have a JIT. While things like HQEMU are fast, they also have their own unique problems, and many things that work in QEMU don't work in HQEMU. Unfortunately, because Wine Is Not an Emulator, it can't run Windows binaries on OpenPOWER by itself.

People then ask the question, what if we somehow put QEMU and Wine together like Slaughterhouse-Five and see if they breed? Somebody did that, at least for aarch64, and that is Hangover. And now it runs on ppc64le with material support for testing provided by Raptor.

Hangover is unabashedly imperfect and many things still don't work, and there are probably things that work on aarch64 that don't work on ppc64le, as the support is specifically advertised as "incomplete." (Big-endian need not apply, by the way: there are no thunks here for converting endianness. Sorry.) There is also the maintainability problem that the changes to Wine to support ppc64le (done by Raptor themselves, as we understand it) haven't been upstreamed, which will add to the rebasing burden.

With all that in mind, how's it work? Well ... I have no idea, because the other problem is that right now it's limited to kernels using a 4K page size, and not every ppc64le-compatible distribution uses one. Void Linux, for example, does support 4K pages on ppc64le, but Fedora only officially supports a 64K page size, and I'm typing this on Fedora 32. It may be possible to hack Hangover to add this support, but the maintainer ominously warns that "loading PE binaries, which have 4K aligned sections, into a 64K page comes with [lots] of problems, so currently the best approach is to avoid that." I could rebuild my Blackbird system, but I like using it as a Fedora tester before running upgrades on this daily driver Talos II, which has saved me some substantial inconvenience in the past. That said, a live CD that boots Void and then runs Hangover might be fun to work on.

If you've built Hangover on your machine and given it a spin, advise how well it works for you and how it compares to QEMU in the comments.

IBM makes available the POWER10 Functional Simulator


Simulators are an important way of testing compatibility with future architectures, and IBM has now released a functional simulator for POWER10. Now, we continue to watch POWER10 closely here at Talospace because of as-yet unaddressed concerns over just how "open" it is compared to POWER8 and POWER9, and we have not heard of any workstation-class hardware announced around it yet (from Raptor or anyone else). But we're always interested in the next generation of OpenPOWER, and the documentation states it provides "enough POWER10 processor complex functionality to allow the entire software stack to execute, including loading, booting and running a little endian Linux environment." Pretty cool, except you can't actually run it on OpenPOWER yet: there is no source code and no binaries for ppc64le, even though the page indicates ppc64le is supported; the only downloads as we go to press are for x86_64. IBM did eventually release ppc64le Debian packages for the POWER9 functional simulator, so we expect the same to happen here eventually, even though it would have been a nice gesture to have them available immediately, since we would be the very people most interested in trying it out. It includes a full instruction set model with SMP support, vector unit and the works, but as always you are warned that "it may not model all aspects of the IBM Power Systems POWER10 hardware and thus may not exactly reflect the behavior of the POWER10 hardware."