Showing posts from April, 2020

Fedora 32 released

Fedora 32 is released (changelog). We care about Fedora a lot at Talospace Low Earth Orbit HQ because it was one of the earliest distributions to run on POWER9 "out of the box" and it's what we run locally. Plus, being more bleeding edge than many distributions, even if you don't run it you should still care about it because OpenPOWER-specific issues on other distros usually get identified on Fedora first.

Given that history and that Red Hat's been an IBM property for a year and a half by now, we'd have hoped to see OpenPOWER move out of the "alternative architectures" (or at least see a little more promotion of the OpenPOWER build). That said, it's got GNOME 3.36, primarily a UX update but with a new suspend option that probably won't work on Raptor workstations anyway, TRIM by default (finally! though I turned this on in 31 anyhow), support for Free Pascal on ppc64le, gcc 10 and glibc 2.31, LLVM 10, Python 3.8 (and retirement of Python 2) and Golang 1.14. Unfortunately, 128-bit floats on ppc64le, necessary to fix some compiler and builder edge cases, have slipped deadline yet again. However, there is a semi-official workstation ISO for ppc64le now, which is welcome for new owners trying to get their machines bootstrapped.

As usual, giving a little time for the mirrors and packages to catch up, I'll try it on my basic Blackbird and Talos II and report back as we have in previous mini-reviews. Meanwhile, download it yourself if you're adventurous.

Eight four two one, twice the cores is (almost) twice as fun

It took a little while but I'm now typing on my second Raptor Talos II workstation, effectively upgrading two years in from a 32GB RAM dual quad-core POWER9 DD2.2 to a 64GB RAM dual octo-core DD2.3. It's rather notable to think about how far we've come with the platform. A number of you have asked about how this changed things in practise, so let's do yet another semi-review.

Again, I say "semi-review" because if I were going to do this right, I'd have set up both the dual-4 and the dual-8 identically, had them do the same tasks and gone back if the results were weird. However, when you're buying a $7000+ workstation you economize where you can, which means I didn't buy any new NVMe cards, bought additional rather than spare RAM, and didn't buy another GPU; the plan was always to consolidate those into the new machine and keep the old chassis, board and CPUs/HSFs as spares. Plus, I moved over the case stickers and those totally change the entire performance characteristics of the system, you dig? We'll let the Phoronix guy(s) do that kind of exacting head-to-head because I pay for this out of pocket and we've all gotta tighten our belts in these days of plague. Here's the new beast sitting beside me under my work area:

(By the way, I'm still taking candidates for #ShowUsYourTalos. If you have pictures uploaded somewhere, I'll rebroadcast them here with your permission and your system's specs. Blackbirds, POWER8s and of course any other OpenPOWER systems welcome. Post in the comments.)

This new "consolidated" system has 64GB of RAM, Raptor's BTO option Radeon WX7100 workstation GPU and two NVMe main drives on the current Talos II 1.01 board, running Fedora 31 as before. In normal usage the dual-8 runs a bit hotter than the dual-4, but this is absolutely par for the course when you've just doubled the number of processors onboard. In my low-Earth-orbit Southern California office the dual-4's fastest fan rarely got above 2100rpm while the dual-8 occasionally spins up to 2300 or 2600rpm. Similarly, give or take system load, the infrared thermometer pegged the dual-4's "exhaust" at around 95 degrees Fahrenheit; the dual-8 puts out about 110 F. However, idle power usage is only about 20W more when sitting around in Firefox and gnome-terminal (130W vs 110W), and the idle fan speeds are about the same such that overall the dual-8 isn't appreciably louder than the very quiet dual-4 was with the most current firmware (with the standard Supermicro fan assemblies, though I replaced the dual-4's PSUs with "super-quiets" a while back and those are in the dual-8 now too).

Naturally the CPUs are the most notable change. Recall that the Sforza "scale out" POWER9 CPUs in Raptor family workstations are SMT-4, i.e., each core offers four hardware threads, which appear as discrete CPUs to the operating system. My dual-4 appeared to be a 32 CPU system to Fedora; this dual-8 appears to have 64. These threads come from "slices," and SMT-4 cores have four which are paired into two "super-slices." They look like this:

Each slice has a vector-scalar unit and an address generator feeding a load-store unit. The VSU has 64-bit integer, floating point and vector ALU components; two slices are needed to get the full 128-bit width of a VMX vector, hence their pairing as super-slices. The super-slices do not have L1 cache of their own, nor do they handle branch instructions or branch prediction; all of that is per-core, which also does instruction fetch and dispatch to the slices. (This has some similar strengths and pitfalls to AMD Ryzen, for example, which also has a single branch unit and caches per core, but the Ryzen execution units are not organized in the same fashion.) The upshot of all this is that certain parallel jobs, especially those that may be competing for scarcer per-core resources like L1 cache or the branch unit, may benefit more from full cores than from threads and this is true of pretty much any SMT implementation. In a like fashion, since each POWER9 slice is not a full vector unit (only the super-slices are), heavy use of VMX would soak up to twice the execution resources though amortized over the greater efficiency vector code would offer over scalar.

The biggest task I routinely do on my T2 are frequent smoke-test builds of Firefox to make sure OpenPOWER-specific bugs are found before they get to release. This was, in fact, where I hoped I would see the most improvement. Indeed, a fair bit of it can be run parallel, so if any of my typical workloads would show benefit, I felt it would likely be this one. Before I tore down the dual-4 I put all 64GB of RAM in it for a final time run to eliminate memory pressure as a variable (and the same sticks are in the dual-8, so it's exactly the same RAM in exactly the same slot layout). These snapshots were done building the Firefox 75 source code from mozilla-release (current as of this writing) with my standard optimized .mozconfig, varying only in the number of jobs specified. I'm only reporting wall time here because frankly that's the only thing I personally cared about. All build runs were done at the text console before booting X and GNOME to further eliminate variability, and I did three runs of each configuration back to back (./mach clobber && ./mach build) to account for any caching that might have occurred. Power was measured at the UPS. Default Spectre and Meltdown mitigations were in effect.

Dual-4 (-j24)

average draw 170W

Dual-8 (-j48)

average draw 230W

Dual-8 (-j24)

average draw 230W

The dual-8 is approximately 40% faster than the dual-4 on this task (or, said another way, the dual-4 was about 1.6x slower), but doubling the number of make processes from my prior configuration didn't seem to yield any improvement despite being well within the 64 threads available. This surprised me, so given that the dual-8 has 16 cores, I tried 16 processes directly:

Dual-8 (-j16)

average draw 215W

This proves, at least for this workload, that SMT does make some difference, just not as much as I would have thought. It also argues that the sweet spot for the dual-4 might have been around -j12, but I'm not willing to tear this box back down to try it. Still, cutting down my build times by over 10 minutes is nothing to sneeze at.

For other kinds of uses, though, I didn't see a lot different in terms of performance between DD2.2 and DD2.3 and to be honest you wouldn't expect to. DD2.3 does have improved Spectre mitigations and this would help the kind of branch-heavy code that would benefit least from additional slices, but the change is relatively minor and the difference in practise indeed seemed to be minimal. On my JIT-accelerated DOSBox build the benchmarks came in nearly exactly the same, as did QEMU running Mac OS 9. Booted into GNOME as I am right now, the extra CPU resources certainly do smooth out doing more things at once, but again, that's of course more a factor of the number of cores and slices than the processor stepping.

Overall I'm pretty pleased with the upgrade, and it's a nice, welcome boost that improves my efficiency further. Here are my present observations if you're thinking about a CPU upgrade too (or are a first time buyer considering how much you should get):

  • Upgrading is pretty easy with these machines: if you bought too little today, you can always drop in a beefier CPU or two tomorrow (assuming you have the dough and the sockets), and the Self-Boot Engine code is generic such that any Sforza POWER9 chip will work on any Raptor system that can accommodate it. I have been repeatedly assured no FPGA update is needed to use DD2.3 chips, even for "O.G." T2 systems. However, if you're running a Blackbird, you should think about case and cooling as well because the 8-core will run noticeably hotter than a 4-core. A lot of the more attractive slimline mATX cases are very thermally constrained, and the 8-core CPU really should be paired with the (3U) high speed fan assembly than the (2U) passive heatsink. This is a big reason why my personal HTPC Blackbird is still a little 4-core system.

  • The 4 and 8-core chips are familiar to most OpenPOWER denizens but the 18 and 22-core monsters have a complicated value proposition. People have certainly run them in Blackbirds; my personal thought is this is courting an early system demise, though I am impressed by how heroic some of those efforts are. I wouldn't myself, however: between the thermal and power demands you're gonna need a bigger boat for these sharks.

    The T2 is designed for them and will work fine, but after my experience here one wonders how loud and hot they would get in warmer environments. Plus, you really need to fill both of those sockets or you'll lose three slots (those serviced by the second CPU's PCIe lanes), which would make them even louder and hotter. The dual-8 setup here gets you 16 cores and all of the slots turned on, so I think it's the better workstation configuration even though it costs a little more than a single-18 and isn't nearly as performant. The dual-18 and dual-22 configurations are really meant for big servers and crazy people.

    With the T2 Lite, though, these CPUs make absolute sense and it would be almost a waste to run one with anything less. The T2 Lite is just a cut-down single-socket T2 board in the same form factor, so it will also easily accommodate any CPU but more cheaply. If you need the massive thread resources of a single-18 (72 thread) or single-22 (88 thread) workstation, and you can make do with an x16 and an x8 slot, it's really the overall best option for those configurations and it's not that much more than a Blackbird board. Plus, being a single CPU configuration it's probably a lot more liveable under one's desk.

  • Simply buying a DD2.3 processor to replace a DD2.2 processor of the same core count probably doesn't pay off for most typical tasks. Unless you need the features (and there are some: besides the Spectre mitigations, it also has Ultravisor support and proper hardware watchpoints), you'll just end up spending additional money for little or no observable benefit. However, if you're going to buy more cores at the same time, then you might as well get a DD2.3 chip and have those extra features just in case you'll need them later. The price difference is almost certainly worth a little futureproofing.

What to do when the BMC won't talk to you

Normally I'm typing TenFourFox posts from the Talos II, but today I'm typing a Talospace post from the old Quad G5. When my dual-8 second T2 arrived, it turned out the onboard NICs were duds (I'm not blaming Raptor for this, mind you; crib death happens with any system). When I started moving components off it to RMA the motherboard, however, one of the RAM sticks got jammed in my old T2 and when I tried to free it, the latch not only snapped off but with such force that it struck the NIC ROM chip and knocked one of the surface mount resistors connected to it off. I will simply say that the cat fled behind the washing machine and didn't come out for hours after I realized what had happened.

That untimely accident thus left me with two systems with dead network ports, and the BMC, the heartbeat of your OpenPOWER machine, the breath of life, the starter, the prime mover, only talks on one of them. Sure, I could plug in an (ugh) USB Ethernet dongle (protip: the old TrendNET USB NIC I dug out of storage doesn't work in Petitboot, so I had to unplug and replug it when Fedora booted), or waste a PCIe slot on another NIC or whathaveyou, but none of those things restores connectivity to the BMC that for security reasons is limited to that top onboard port. The BMC is now an island unto itself, never to be updated or talked to again.

Or ... is it? The BMC fortunately has a serial port. With an easily available 10-pin to DE-9 cable which you can get on eBay for a few bucks, you can talk to your BMC again.

There's no nice web interface (I'll have a later post about how to use TenFourFox and its built-in AppleScript-to-JavaScript bridge to support its later features if you're a Power Mac refugee like me) and everything is strictly by shell, but if you were already used to talking to your BMC over SSH then this won't seem at all unfamiliar. Here it is connected to a USB serial dongle back to the G5.

In ZTerm on this G5 I set the serial connection to 115200bps, 8-N-1. Assuming your system is running the 2.00 system package or later, hit RETURN/ENTER a couple times and you should get a "Phosphor" login prompt; log in as root and your BMC password. You'll get to the root prompt just as you would over SSH.

That suffices for logging in and changing your password if necessary, but what about updating the firmware? Fortunately the clever people who maintain OpenBMC, the standard firmware in Raptor and most other OpenPOWER machines, added rz and sz to the build. That means you're just a ZMODEM upload away from updating your BMC and PNOR! Just go into the correct directory (/tmp for PNOR, /run/initramfs for the BMC firmware), run rz from the shell, and then start a ZMODEM upload from the terminal program on your assisting computer (to update the BMC, you'll upload both image-kernel and image-rofs, and to update the PNOR, whatever file as named). Some terminal programs, or the sz utility if you use it on the client side, will allow you to specify an absolute path. Either way, once it's there, you can then finish the update the usual way.

The Blackbird has such a port as well and uses the same cable as the T2 and T2 Lite. You can even leave the cable connected and this will allow you to access the BMC at any time. Note that rebooting the BMC while the main processors are still running may cause your machine to act a little funny until they're back in sync.

As a postscript, those SMT resistors are damn small, and I'm pretty sure it's not right on the pads anyway assuming I didn't damage something else in the process. But the RMA board did arrive and does work, so I'm rebuilding the new system and we'll talk about whether the jump from a dual-4 to a dual-8 T2 is worth it when I do. The old system still works otherwise and at least with that serial cable and a quad-port PCIe NIC, it may be one slot less but I can still put it to good use.

Firefox 75 on POWER

Firefox 75 seems to build uneventfully on this Raptor Talos II and as always this post is being typed in the new version. I'm not particularly enamoured of the zooming address bar and I'm sure you won't be able to turn it off eventually, but for now you can. A number of the developer-facing features are quite compelling, though. In addition, if you're on Wayland (Xorg forever), Firefox on Wayland now has H.264 VA-API and full WebGL support; I don't know how well these work on Wayland on ppc64le and I'm not going to be the one to tell you, but I'm sure some of you folks will try.

Total build time for opt on this dual-4 DD2.2 system was 36:36.67 (with -j24). Why mention it? Well, my dual-8 DD2.3 is almost here and this sounds like a convenient real-world benchmark to try out on the new box. I'm thinking -j48 sounds nice and still gives me a whole 16 threads for serious business during the compile.

The opt and debug .mozconfigs I'm using are unchanged from Firefox 67.

Some updated notes on KVMPPC

A couple quick notes on KVMPPC that came out of the last article on Mac OS 9 in QEMU:

  • If you get a kernel Oops when unloading kvm_hv, you are bitten by this bug which should be fixed in recent kernel releases. (Thanks, Paul Mackerras.)

  • However, you don't have to do that anymore to use KVM-PR anyway. With QEMU 4.2 (at least on this Talos II running Fedora 31), if you use qemu-system-ppc64 with -M accel=kvm and explicitly specify a CPU which requires KVM-PR (such as, in our previous articles, a Motorola G4/7410 with -cpu nitro), then it will be used without having to unload KVM-HV. In the examples in the second half of the article, just change qemu-system-ppc to qemu-system-ppc64 and it will now "just work" (otherwise you'll get a weird error trying to start QEMU). That is, of course, assuming you have your kernel in HPT mode instead of radix, which is the default on POWER9.

  • Likewise, if you specify -cpu host (or power8 on a POWER9), KVM-HV should be used automatically, though you can explicitly specify you want HV with -M accel=kvm,kvm-type=HV as in our AIX on OpenPOWER example. Just don't tell IBM.

If you don't understand what all that meant, read our original article for a brief primer.