Updates: Fedora 33, FreeBSD 12.2, Ubuntu 20.10
And if you like your OpenPOWER systems but don't like Linux, FreeBSD 12.2 is out as well, with multiple security, bugfix and functionality upgrades for a wide variety of PowerPC and OpenPOWER-based systems. Big-endian is well-tested and little-endian is coming along (and snapshots should finally be in -CURRENT by the time you read this).
Firefox 82 on POWER goes PGO
But let's not bury the lede here: after several days of screaming, ranting and scaring the cat with various failures, this blog post is finally being typed in a fully profile-guided and link-time optimized Firefox 82 tuned for POWER9 little-endian. Although it multiplies compile time by nearly a factor of 3 and the build process can intermittently consume a terrifying amount of memory, the PGO-LTO build is roughly 25% faster than the LTO-only build, which was already 4% faster than the "baseline" -O3 -mcpu=power9 build. That's worth an 84-minute coffee break! (-j24 on a dual-8 Talos II [64 threads], 64GB RAM.)
The problem with PGO and gcc (at least gcc 10, anyway) is that all the .gcda files end up in the same directory as the built objects in an instrumented build. The build system, which is now heavily clang-centric (despite the docs, gcc is clearly Tier 2, since this and other things don't work), does not know how to handle or transfer the resulting profile data and bombs after running the test load. We don't build with clang because in previous attempts it never managed to fully build the browser on ppc64le and I'm sceptical of its code quality on this platform anyway, but since I wanted to verify against a presumably working configuration I did try a clang build first to see if anything had changed. It breaks fairly early now, interestingly while compiling a Rust component:
4:33.00 error: /home/censored/src/mozilla-release/obj-powerpc64le-unknown-linux-gnu/release/deps/libproc_macro_hack-b7d125d9ae0afae7.so: undefined symbol: __muloti4
4:33.00 --> /home/censored/src/mozilla-release/third_party/rust/phf_macros/src/lib.rs:227:5
4:33.00 227 | #[::proc_macro_hack::proc_macro_hack]
4:33.00 | ^^^^^^^^^^^^^^^
4:33.00 error: aborting due to previous error
4:33.00 error: could not compile `phf_macros`.
So there's that. I'm not very proficient in Rust so I didn't do much more diagnosis at this point. Back to the hippo, gcc.
What's needed is to hack the build system to copy the .gcda files generated during profiling out of instrumented/ into the regular build tree for the actual (second) build phase, which is essentially the solution proposed in bug 1601903 except without any explanation as to how you actually do it. The PGO driver is fortunately in a standalone Python script, so I decided to simply hijack that. At the end is code to coalesce the .profraw files from a successful instrumented clang build, which shouldn't be running anyway if the compiler is gcc, so I threw in a couple of lines to run this shell script and then terminate (a rough sketch of that hook follows the script):
#!/bin/csh -f
set where=/tmp/mozgcda.tar
# gather every .gcda profile from the instrumented build tree
# all on one line yo
cd /home/censored/src/mozilla-release/obj-powerpc64le-unknown-linux-gnu/instrumented || exit
tar cvf $where `find . -name '*.gcda' -print`
# unpack them into the main object directory for the second build phase
cd ..
tar xvf $where
rm -f $where
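For the curious, this is roughly what the Python side of that hijack could look like. It's a minimal sketch under assumptions, not the actual change from the gist linked later in this post; the placement in the driver, the environment check and the script path are all illustrative:

# Hypothetical hook near the end of the Firefox PGO driver: if we're
# profiling with gcc there are no clang .profraw files to merge, so run the
# copy-back script above and stop instead of falling through to llvm-profdata.
import os
import subprocess
import sys

if "gcc" in os.environ.get("CC", ""):
    # gccpgostub.csh tars up the .gcda files from instrumented/ and unpacks
    # them into the main object directory for the second build phase.
    subprocess.check_call([os.path.expanduser("~/src/mozilla-release/gccpgostub.csh")])
    sys.exit(0)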
This repopulates the .gcda files in the right place before we rebuild with the profile data, but because of this subterfuge, gcc thinks the generated profile is not consistent with the source and spams an incredible number of complaints, which made it difficult to spot the internal compiler error that the profile-guided rebuild triggered. This required another rebuild with some tweaks to silence those messages and some other irrelevant warnings (I'll probably upstream at least one of these changes) so I could determine where the ICE was in the scrollback. Fortunately, it was in a test binary, so I just commented it out in the moz.build and it finally stuck. And so far, it's working impressively well. This may well be the fastest the browser can get while still lacking a JIT.
After all that, it's almost an anticlimax to mention that --disable-release is no longer needed in the build configs. You can put it in the Debug configuration if you want, but I now use --enable-release in optimized builds and it seems to work fine.
If you want to try compiling a PGO-LTO build yourself, here is a gist with the changes I made (they are all trivial). Save the shell script above as gccpgostub.csh in ~/src/mozilla-release and/or adjust paths as necessary, and make sure it is chmodded +x. Yes, there is no doubt a more elegant way to do this in Python itself but I hate Python and I was just trying to get it to work. Note that PGO builds can be exceptionally toolchain-dependent (and ICEs more so); while TestUtf8 was what triggered the ICE on my system (Fedora 32, gcc 10.2.1), it is entirely possible it will halt somewhere else in yours, and the PGO command line options may not work the same in earlier versions of the compiler.
Without further ado, the current .mozconfigs, starting with Optimized. Add ac_add_options MOZ_PGO=1 to enable PGO once you have patched your tree and deposited the script.
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-O3 -mcpu=power9"
ac_add_options --enable-release
ac_add_options --enable-linker=bfd
ac_add_options --enable-lto=full
# this is implied by enable-release but left in to be explicit
export RUSTC_OPT_LEVEL=2
Debug
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++
mk_add_options MOZ_MAKE_FLAGS="-j24"
ac_add_options --enable-application=browser
ac_add_options --enable-optimize="-Og -mcpu=power9"
ac_add_options --enable-debug
ac_add_options --enable-linker=bfd
export RUSTC_OPT_LEVEL=0
OpenBSD officially available for ppc64
The installation directions are still not fully complete, though Petitboot should be able to start the installer from pretty much any standard medium, and the installer experience should be stock from there. What's more, it looks like a good selection of pre-built packages is available, though some large applications, like Firefox, are still missing (WebKit is apparently available). The missing packages seem to be similar to what is missing for their 32-bit powerpc flavour, so this is not unexpected.
With OpenBSD's release and FreeBSD's well-regarded history, this leaves only NetBSD — ironically the BSD with the most emphasis on portability, and my personal preference — as the last major cross-platform BSD yet to arrive on OpenPOWER. Given OpenBSD and NetBSD's genetic history, however, this release makes future NetBSD support for OpenPOWER much more likely.
IBM splits
It is interesting that this move was predicted as early as February, and a split in itself only means that a combined business strategy no longer makes sense for these units. But chairwoman Ginni Rometty missed the boat on cloud early, and despite the hype in IBM's investor release over the new company, "NewCo" is really the "old" services-oriented IBM with a fresh coat of paint, one that was a frequent source of layoffs and cost-cutting manoeuvres over the years. There are probably reasons for this, not least of which is their hidebound services-first mentality that wouldn't sell yours truly a brand new POWER7 in 2010, even when I had a $15,000 personal budget for the hardware, because I didn't (and don't: the used POWER6 I bought instead is self-maintained) need their services piece. As a result I apparently wasn't worth the sale to them, which tells you something right there: today's growth is not in the large institutional customers that used to be IBM's bread and butter but rather in the little folks looking for smaller solutions in bigger numbers, and Rometty's IBM failed to capitalize on this opportunity. In my mind, today's split is a late recognition of her tactical error.
Presumably the new company would preferentially use "OldCo" hardware and recommend "OldCo" solutions for their service-driven hybrid buildouts. But "OldCo" makes most of its money from mainframes, and even with robust virtualization options mainframes as a sector aren't growing. Although IBM is taking pains to talk about "one IBM" in their press release, that halcyon ideal holds only as long as neither company is being dragged down by the other, and going separate directions suggests such a state of affairs won't last long.
What does this mean to us in OpenPOWER land? Well, we were only ever a small part of the equation, and even with a split this won't increase our influence on "OldCo" much. Though IBM still makes good money from Power ISA and there's still a compelling roadmap, we small individual users will need to continue making our voices heard through the OpenPOWER Foundation and others, and even if IBM chooses not to emphasize individual user applications (and in fairness they won't, because we're not where the money is), they should still realize the public relations and engineering benefits of maintaining an open platform and not get in the way of downstream vendors like Raptor attempting to capitalize on the "low-end" (relatively speaking) market. If spinning off MIS gets IBM to focus better on hardware and be a good steward of their engineering resources, then I'm all for it.
Where did the 64K page size come from?
And there are lots of things that glitch, and have glitched, when userspace makes assumptions about this. Besides Hangover, Firefox used to barf on 64K pages (on aarch64 too), and had an issue where binaries built on one page size wouldn't work on systems with a different one. (This also bit numpy.) Golang and its runtime used to throw fatal errors. The famous nouveau driver for Nvidia GPUs assumes a 4K page size, and the compute-only binary driver that does exist (at least for POWER8) cheats by making 64K pages out of 16 copies of the "actual" 4K page. btrfs uses a filesystem page size that mirrors the page size of the host on which it was created. That means if you make a btrfs filesystem on a 4K page size system, it won't be readable on a 64K page system and vice versa (this is being fixed, but hasn't been yet).
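The fix in many of these cases boils down to asking the kernel for the page size instead of assuming it. As a quick illustration (a sketch using only the Python standard library, not code from any of the projects above), all of the following report the page size the running kernel actually uses:

import mmap
import os
import resource

# These should all agree: 65536 on a stock Fedora ppc64le install,
# 4096 on a typical x86_64 distribution.
print(os.sysconf("SC_PAGE_SIZE"))
print(mmap.PAGESIZE)
print(resource.getpagesize())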
With all these problems, why have a 64K page size at all, let alone default to it? There must be some reason to use it, because ppc64(le) isn't even unique in this regard; many of those bugs also involved aarch64, which has a 64K page option of its own. As you might guess, it's all about performance. When a virtual memory page has to be attached to a process or mapped into its addressing space, a page fault is triggered and has to be handled by the operating system. Sometimes this is minor (it's already in memory and just has to be added to the process), sometimes this is major (the page is on disk, or swapped out), but either way a page fault has a cost. 64-bit systems naturally came about because of the need for larger memory addressing spaces, which benefits big applications like databases and high-performance computing generally, and these were the tasks that early 64-bit systems were largely used for. As memory increases, subdividing it into proportionally larger pieces thus becomes more performance-efficient: when the application faults less, it spends more time in its own code and less in the operating system's.
A second performance improvement afforded by larger pages is higher efficiency from the translation lookaside buffer, or TLB. The TLB is essentially a mapping cache that allows a CPU to quickly get the physical memory page for a given virtual memory address. When the virtual memory address cannot be found in the TLB, the processor has to walk the entire page table and find the address (filling it in the TLB for later), assuming it exists. This can be a relatively expensive process if there are many entries to go through, and even worse if the page tables are nested in a virtualized setup. A larger page size not only allows more memory to be handled with a smaller page table, making table walks quicker, but also yields more hits for a TLB of the same size. It is fair to point out there are arguments over MMU performance between processor architectures which would magnify the need for this: performance, after all, was the reason why POWER9 moved to a radix-based MMU instead of the less-cache-friendly hashed page table scheme of earlier Power generations, and x86_64 has a radix tree per process while Power ISA's page table is global. (As an aside, some systems optionally or even exclusively have software-managed TLBs where the operating system manages the TLB for the CPU and walks the page tables itself. Power ISA isn't one of them, but these architectures in particular would obviously benefit from a smaller page table.)
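Some back-of-the-envelope arithmetic makes the TLB argument concrete. The 1024-entry TLB below is an assumed round number for illustration, not any particular core's:

# How much memory a TLB can cover ("reach"), and how many page table
# entries a 1GB mapping needs, for 4K versus 64K pages.
TLB_ENTRIES = 1024           # assumed round figure, not a specific CPU
MAPPING = 1 * 1024**3        # a 1GB mapping

for page in (4 * 1024, 64 * 1024):
    reach = TLB_ENTRIES * page
    ptes = MAPPING // page
    print(f"{page // 1024}K pages: TLB reach {reach // 1024**2} MB, "
          f"{ptes} PTEs for the 1GB mapping")

The same TLB covers sixteen times as much memory with 64K pages, and the page table describing the mapping is sixteen times smaller.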
64K page sizes, compatibility issues notwithstanding, naturally have a downside. The most important objection relates to memory fragmentation: many memory allocators have page alignment constraints for convenience, which could waste up to the remaining 60K if the memory actually in use fits entirely within a 4K page instead. On bigger systems with large amounts of memory running tasks that allocate large memory blocks, this excess might be relatively low, but it can add up on a workstation-class system with smaller RAM running a mix of client applications making smaller allocations. In a somewhat infamous rebuttal, Linus Torvalds commented, "These absolute -idiots- talk about how they win 5% on some (important, for them) benchmark by doing large pages, but then ignore the fact that on other real-world loads they lose by sevaral HUNDRED percent because of the memory fragmentation costs [sic]." Putting Linus' opinion into more anodyne terms, if the architecture bears a relatively modest page fault penalty, then the performance improvements of a larger page size may not be worth the memory it can waste. This is probably why AIX, presently specific to ppc64, offers both 4K and 64K pages (and even larger 16MB and 16GB pages) and determines what to offer to a process.
The 4K vs. 64K gulf is not unlike the endian debate. I like big endian and I cannot lie, but going little endian gave Power ISA a larger working software library by aligning with what those packages already assumed; going 4K is a similar situation. But while the performance difference between endiannesses has arguably never been significant, there really are performance reasons for a 64K page size and those reasons get more important as RAM and application size both increase. On my 16GB 4-core Blackbird, the same memory size as my 2005 Power Mac Quad G5, a 4K page size makes a lot more sense than a 64K one because I'm not running anything massive. In that sense the only reason I'm still running Fedora on it is to serve as an early warning indicator. But on my 64GB dual-8 Talos II, where I do run larger applications, build kernels and Firefoxen and run VMs, the performance implications of the larger page size under Fedora may well become relevant for those workloads.
For servers and HPCers big pages can have big benefits, but for those of us using these machines as workstations I think we need to consider whether the performance improvement outweighs the inconvenience. And while Fedora has generally served me well, lacking a 4K page option on ppc64le certainly hurts the value proposition for Fedora Workstation on OpenPOWER, since there are likely to be other useful applications that make these assumptions. More to the point, I don't see Red Hat-IBM doubling their maintenance burden to issue a 4K page version, and maintaining a downstream distro is typically an incredibly thankless task. While I've picked on Fedora a bit here, you can throw Debian and others into that mix as well for some of the same reasons. Until other operating systems adopt a hybrid approach like AIX's, the quibble over page size is probably the next major schism we'll have to deal with, because in my humble opinion OpenPOWER should not be limited to the server room, where big pages are king.
It's good to have a Hangover
People then ask the question, what if we somehow put QEMU and Wine together like Slaughterhouse-Five and see if they breed? Somebody did that, at least for aarch64, and that is Hangover. And now it runs on ppc64le with material support for testing provided by Raptor.
Hangover is unabashedly imperfect and many things still don't work, and there are probably things that work on aarch64 that don't work on ppc64le, as the support is specifically advertised as "incomplete." (Big-endian need not apply, by the way: there are no thunks here for converting endianness. Sorry.) There is also the maintainability problem that the changes to Wine to support ppc64le (done by Raptor themselves, as we understand) haven't been upstreamed, which will add to the rebasing burden.
With all that in mind, how's it work? Well ... I have no idea, because the other problem is that, right now, it's limited to kernels using a 4K page size, and not every ppc64le-compatible distribution uses one. Void Linux, for example, does support 4K pages on ppc64le, but Fedora only officially supports a 64K page size, and I'm typing this on Fedora 32. It may be possible to hack Hangover to add this support, but the maintainer ominously warns that "loading PE binaries, which have 4K aligned sections, into a 64K page comes with [lots] of problems, so currently the best approach is to avoid that." I'm rebuilding my Blackbird system, but I like using it as a Fedora tester before running upgrades on this daily driver Talos II, which has saved me some substantial inconvenience in the past. That said, a live CD that boots Void and then runs Hangover might be fun to work on.
If you've built Hangover on your machine and given it a spin, advise how well it works for you and how it compares to QEMU in the comments.
IBM makes available the POWER10 Functional Simulator