Showing posts from October, 2020

OpenBSD officially available for ppc64

OpenBSD 6.8 is now available, and with it the first official release of the big-endian ppc64 port (which they call powerpc64). The port is specifically advertised for PowerNV machines (i.e., bare metal) with POWER9, which naturally includes the Raptor family but should also cover IBM PowerNV systems. POWER8 support is described as "included but untested."

The installation directions are not yet complete, though Petitboot should be able to start the installer from pretty much any standard medium, and the installer experience should be stock from there. What's more, it looks like a good selection of pre-built packages is available, though some large applications, like Firefox, are still missing (WebKit is apparently available). The missing packages seem to be similar to what is missing for their 32-bit powerpc flavour, so this is not unexpected.

With OpenBSD's release and FreeBSD's well-regarded history, this leaves only NetBSD — ironically the BSD with the most emphasis on portability, and my personal preference — as the last major cross-platform BSD yet to arrive on OpenPOWER. Given OpenBSD and NetBSD's genetic history, however, this release makes future NetBSD support for OpenPOWER much more likely.

IBM splits

IBM today announced that the company will split into two, moving the Managed Infrastructure Services portion of IBM Global Technology Services into a new cloud-focused corporation tentatively called "NewCo" by the end of 2021. NewCo would also have a greater focus on AI, presumably through a distributed computing model rather than traditional hardware sales. The Technology Support Services piece of GTS that addresses data centre, hardware and software support would remain part of "old" IBM, along with Red Hat and presumably the R&D folks responsible for working on Power ISA like the great people at OzLabs.

It is interesting that this move was predicted as early as February, and a split in itself only means that a combined business strategy no longer makes sense for these units. But chairwoman Ginni Rometty missed the boat on cloud early, and despite the hype in IBM's investor release over the new company, "NewCo" is really the "old" services-oriented IBM with a fresh coat of paint, a unit that was a frequent source of layoffs and cost-cutting manoeuvres over the years. There are probably reasons for this, not least of which is their hidebound services-first mentality that wouldn't sell yours truly a brand new POWER7 in 2010, even when I had a $15,000 personal budget for the hardware, because I didn't (and don't: the used POWER6 I bought instead is self-maintained) need their services piece. Apparently I wasn't worth the sale to them, which tells you something right there: today's growth is not in the large institutional customers that used to be IBM's bread and butter but rather in the little folks looking for smaller solutions in bigger numbers, and Rometty's IBM failed to capitalize on this opportunity. In my mind, today's split is a late recognition of her tactical error.

Presumably the new company would preferentially use "OldCo" hardware and recommend "OldCo" solutions for their service-driven hybrid buildouts. But "OldCo" makes most of its money from mainframes, and even with robust virtualization options mainframes as a sector aren't growing. Although IBM is taking pains to talk about "one IBM" in their press release, that halcyon ideal exists only as long as neither company is being dragged down by the other, and going separate directions suggests such a state of affairs won't last long.

What does this mean to us in OpenPOWER land? Well, we were only ever a small part of the equation, and even with a split this won't increase our influence on "OldCo" much. Though IBM still makes good money from Power ISA and there's still a compelling roadmap, we small individual users will need to continue making our voices heard through the OpenPOWER Foundation and others. Even if IBM chooses not to emphasize individual user applications (and in fairness they won't, because we're not where the money is), they should still realize the public relations and engineering benefits of maintaining an open platform and not get in the way of downstream vendors like Raptor attempting to capitalize on the "low-end" (relatively speaking) market. If spinning off MIS gets IBM a better focus on hardware and being a good steward of their engineering resources, then I'm all for it.

Where did the 64K page size come from?

Lots of people were excited by the news over Hangover's port to ppc64le, and while there's a long way to go, the fact it exists is a definite step forward to improving the workstation experience on OpenPOWER. Except, of course, that many folks (including your humble author) can't run it: Hangover currently requires a kernel with a 4K memory page size, which is the page size of the majority of extant systems (certainly x86_64, whose base page size is 4K). ppc64 and ppc64le can certainly run on a 4K page size and some distributions do, yet the two probably most common distributions OpenPOWER users run — Debian and Fedora — default to a 64K page size.

And lots of things glitch, or have glitched, when userspace makes assumptions about this. Besides Hangover, Firefox used to barf on 64K pages (on aarch64 too), and had an issue where binaries built on one page size wouldn't work on systems with a different one. (This also bit numpy.) The Golang runtime used to throw fatal errors. The famous nouveau driver for Nvidia GPUs assumes a 4K page size, and the compute-only binary driver that does exist (at least for POWER8) cheats by making 64K pages out of 16 copies of the "actual" 4K page. btrfs uses a filesystem block size that mirrors the page size of the host on which the filesystem was created. That means if you make a btrfs filesystem on a 4K page size system, it won't be readable on a 64K page system and vice versa (this is being fixed, but hasn't been yet).
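The common thread in all these bugs is hard-coding 4096 instead of asking the kernel. A minimal sketch in Python (names and numbers are illustrative, not from any of the projects above) of doing it the portable way:

```python
import mmap
import resource

# Ask the kernel for its page size rather than assuming 4096:
# this returns 4096 on most x86_64 kernels but 65536 on a
# 64K-page ppc64le kernel such as Fedora's.
page_size = resource.getpagesize()
assert page_size == mmap.PAGESIZE  # two ways to ask the same question

def page_align(length: int, page: int = page_size) -> int:
    """Round a length up to a whole number of pages, whatever their size."""
    return (length + page - 1) // page * page

# Hard-coding 4096 here would produce mmap()-unfriendly lengths on a
# 64K-page kernel; using the queried size works on both.
print(page_align(10_000))  # 12288 on a 4K kernel, 65536 on a 64K one
```

The same query is a one-liner in C (`sysconf(_SC_PAGESIZE)`), and software that does this instead of baking in 4K generally runs unmodified on both kinds of kernels.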

With all these problems, why have a 64K page size at all, let alone default to it? There must be some reason to use it, because ppc64(le) isn't even unique in this regard; many of those bugs related to aarch64, which also has a 64K page option. As you might guess, it's all about performance. When a virtual memory page has to be attached to a process or mapped into its addressing space, a page fault is triggered and has to be handled by the operating system. Sometimes this is minor (the page is already in memory and just has to be added to the process), sometimes this is major (the page is on disk, or swapped out), but either way a page fault has a cost. 64-bit systems naturally came about because of the need for larger memory addressing spaces, which benefits big applications like databases and high-performance computing generally, and these were the tasks that early 64-bit systems were largely used for. As memory increases, subdividing it into proportionally larger pieces thus becomes more performance-efficient: when the application faults less, the application spends more time in its own code and less in the operating system's.

A second performance improvement afforded by larger pages is higher efficiency from the translation lookaside buffer, or TLB. The TLB is essentially a mapping cache that allows a CPU to quickly get the physical memory page for a given virtual memory address. When the virtual memory address cannot be found in the TLB, the processor has to walk the entire page table and find the address (filling it into the TLB for later), assuming it exists. This can be a relatively expensive process if there are many entries to go through, and even worse if the page tables are nested in a virtualized setup. A larger page size not only allows more memory to be handled with a smaller page table, making table walks quicker, but also yields more hits for a TLB of the same size. It is fair to point out there are arguments over MMU performance between processor architectures which would magnify the need for this: performance, after all, was the reason why POWER9 moved to a radix-based MMU instead of the less-cache-friendly hashed page table scheme of earlier Power generations, and x86_64 has a radix tree per process while Power ISA's page table is global. (As an aside, some systems optionally or even exclusively have software-managed TLBs, where the operating system manages the TLB for the CPU and walks the page tables itself. Power ISA isn't one of them, but these architectures in particular would obviously benefit from a smaller page table.)
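The TLB argument is easy to make concrete with back-of-the-envelope arithmetic. A sketch, assuming a hypothetical 1024-entry TLB (the entry count is illustrative, not any particular chip's):

```python
def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory addressable without taking a TLB miss."""
    return entries * page_size

KiB = 1024
MiB = 1024 * KiB

# The same hypothetical 1024-entry TLB covers sixteen times as much
# memory when each entry maps a 64K page instead of a 4K one.
reach_4k = tlb_reach(1024, 4 * KiB)    # 4 MiB of coverage
reach_64k = tlb_reach(1024, 64 * KiB)  # 64 MiB of coverage

print(reach_4k // MiB, reach_64k // MiB)  # prints: 4 64
```

For a working set larger than the TLB's reach, that sixteen-fold difference translates directly into fewer page table walks.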

64K page sizes, compatibility issues notwithstanding, naturally have a downside. The most important objection relates to memory fragmentation: many memory allocators have page alignment constraints for convenience, which could waste up to the remaining 60K if the memory actually in use fits entirely within a 4K page instead. On bigger systems with large amounts of memory running tasks that allocate large memory blocks, this excess might be relatively low, but it could add up on a workstation-class system with smaller RAM running a mix of client applications making smaller allocations. In a somewhat infamous rebuttal, Linus Torvalds commented, "These absolute -idiots- talk about how they win 5% on some (important, for them) benchmark by doing large pages, but then ignore the fact that on other real-world loads they lose by sevaral HUNDRED percent because of the memory fragmentation costs [sic]." Putting Linus' opinion into more anodyne terms, if the architecture bears a relatively modest page fault penalty, then the performance improvements of a larger page size may not be worth the memory it can waste. This is probably why AIX, presently specific to ppc64, offers both 4K and 64K pages (and even larger 16MB and 16GB pages) and determines which to offer a given process.
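The fragmentation objection is also easy to quantify. A toy sketch, assuming (purely for illustration) an allocator that rounds every allocation up to a whole page:

```python
def waste(alloc_bytes: int, page: int) -> int:
    """Bytes wasted when an allocation is rounded up to a whole page."""
    return -alloc_bytes % page

KiB = 1024

# Ten thousand small 1 KiB allocations, each forced onto its own page:
allocs = [1 * KiB] * 10_000
waste_4k = sum(waste(a, 4 * KiB) for a in allocs)    # ~29 MiB wasted
waste_64k = sum(waste(a, 64 * KiB) for a in allocs)  # ~615 MiB wasted
```

Real allocators pack small allocations together rather than giving each its own page, so this is a worst case, but it shows why the overhead scales with page size: every page-aligned mapping (stacks, guard regions, mmap'd files) pays its slack in 64K increments instead of 4K ones.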

The 4K vs. 64K gulf is not unlike the endian debate. I like big endian and I cannot lie, but going little endian gave Power ISA a larger working software library by aligning with what those packages already assumed; going 4K is a similar situation. But while the performance difference between endiannesses has arguably never been significant, there really are performance reasons for a 64K page size and those reasons get more important as RAM and application size both increase. On my 16GB 4-core Blackbird, the same memory size as my 2005 Power Mac Quad G5, a 4K page size makes a lot more sense than a 64K one because I'm not running anything massive. In that sense the only reason I'm still running Fedora on it is to serve as an early warning indicator. But on my 64GB dual-8 Talos II, where I do run larger applications, build kernels and Firefoxen and run VMs, the performance implications of the larger page size under Fedora may well become relevant for those workloads.

For servers and HPCers big pages can have big benefits, but for those of us using these machines as workstations I think we need to consider whether the performance improvement outweighs the inconvenience. And while Fedora has generally served me well, lacking a 4K page option on ppc64le certainly hurts the value proposition for Fedora Workstation on OpenPOWER since there are likely to be other useful applications that make these assumptions. More to the point, I don't see Red Hat-IBM doubling their maintenance burden to issue a 4K page version and maintaining a downstream distro is typically an incredibly thankless task. While I've picked on Fedora a bit here, you can throw Debian and others into that mix as well for some of the same reasons. Until other operating systems adopt a hybrid approach like AIX's, the quibble over page size is probably the next major schism we'll have to deal with because in my humble opinion OpenPOWER should not be limited to the server room where big pages are king.

It's good to have a Hangover

One of the wishlists for us OpenPOWER workstation users is better emulation for when we have to run an application that only comes as a Windows binary. QEMU is not too quick, largely because of its overhead, and although Bochs is faster in some respects it's worse in others and doesn't have a JIT. While things like HQEMU are fast, they also have their own unique problems, and many things that work in QEMU don't work in HQEMU. Unfortunately, because Wine Is Not an Emulator, it cannot be used to run Windows binaries directly.

People then ask the question, what if we somehow put QEMU and Wine together like Slaughterhouse-Five and see if they breed? Somebody did that, at least for aarch64, and that is Hangover. And now it runs on ppc64le with material support for testing provided by Raptor.

Hangover is unabashedly imperfect and many things still don't work, and there are probably things that work on aarch64 that don't work on ppc64le as the support is specifically advertised as "incomplete." (Big-endian need not apply, by the way: there are no thunks here for converting endianness. Sorry.) There is also the maintainability problem that the changes to Wine to support ppc64le (done by Raptor themselves, as we understand) haven't been upstreamed and that will contribute to the rebasing burden.

With all that in mind, how's it work? Well ... I have no idea, because the other problem is right now it's limited to kernels using a 4K page size and not every ppc64le-compatible distribution uses them. Void Linux, for example, does support 4K pages on ppc64le, but Fedora only officially supports a 64K page size, and I'm typing this on Fedora 32. It may be possible to hack Hangover to add this support but the maintainer ominously warns that "loading PE binaries, which have 4K aligned sections, into a 64K page comes with [lots] of problems, so currently the best approach is to avoid that." I'm rebuilding my Blackbird system but I like using it as a Fedora tester before running upgrades on this daily driver Talos II, which has saved me some substantial inconvenience in the past. That said, a live CD that boots Void and then runs Hangover might be fun to work on.

If you've built Hangover on your machine and given it a spin, advise how well it works for you and how it compares to QEMU in the comments.

IBM makes available the POWER10 Functional Simulator

Simulators are an important way of testing compatibility with future architectures, and IBM has now released a functional simulator for POWER10. Now, we continue to watch POWER10 closely here at Talospace because of as-yet unaddressed concerns over just how "open" it is compared to POWER8 and POWER9, and we have not heard of any workstation-class hardware announced around it yet (from Raptor or anyone else). But we're always interested in the next generation of OpenPOWER, and the documentation states it provides "enough POWER10 processor complex functionality to allow the entire software stack to execute, including loading, booting and running a little endian Linux environment."

Pretty cool, except you can't actually run it on OpenPOWER yet: there is no source code, and no binaries for ppc64le, although the page indicates it is supported; the only downloads as we go to press are for x86_64. IBM did eventually release ppc64le packages for Debian for the POWER9 functional simulator, so we expect the same to happen here eventually, even though it would have been a nice gesture to have it available immediately since we would be the very people most interested in trying it out.

The simulator includes a full instruction set model with SMP support, vector unit and the works, but as always you are warned "it may not model all aspects of the IBM Power Systems POWER10 hardware and thus may not exactly reflect the behavior of the POWER10 hardware."