Showing posts from July, 2019

Qubes == Dollar$


Well, bitcoins, anyway. I'm of two minds on software bounties personally: there's nothing like money for bringing interest to a new platform and bounties do directly subsidize development, but they tend to attract mercenary types who may not have interest in the platform otherwise and they rarely cover the full actual development cost. Moreover, while they do usually yield software projects that work, at least for whatever the definition of "work" was, in many cases they subsequently bitrot and become unmaintained (or unmaintainable) due to the community lacking the technical expertise they put the bounty up for in the first place. As a relevant example, this happens a lot in the Amiga community, where people just try to throw money at the software gaps; many projects get finished but few have lasting significance (Timberwolf comes to mind but there are others), and these wrinkles clearly distinguish bounties from crowdfunding where a presumably already interested party just needs resources to finish the work they already want to do.

Nevertheless, it's still a step in the right direction, and there is lots of interest in our higher-security OpenPOWER world in running a higher-security operating system. Qubes OS certainly has the chops with its strict(er) security-by-isolation approach and its multiple operating domains. Qubes, however, is based on the Xen hypervisor and not KVM, and they make a cogent case for why: it doesn't rely on the Linux kernel to do proper isolation, and Xen is more self-contained, smaller and thus more auditable (see the PDF specification). Unfortunately, while Xen used to support PowerPC through version 3.2 (so-called "XenPPC"), it doesn't look like work has been done on Power ISA compatibility in almost a decade, and it certainly doesn't support the later features exploited by KVM-HV that are needed for high throughput on modern Power CPUs.

Some work on getting a KVM-based strategy "good enough" for Power has already been done, and there are some encouraging statements from Qubes developers on what they would consider an acceptable security target. (However, this work was started by Shaun "Mr. Chromium on POWER9" Anastasio, which sort of proves my point that people who are already interested will do the work, bounty or not.) My impression is that there is still a fair amount of work to be done and that brings us to the moolah.

While the "task" has not actually been well-defined in the GitHub issue referenced (it's not actually "deliver Qubes OS that can boot on POWER9 (and the head of John the Baptist);" it reads to me more like "do the systems work to either get KVMPPC up to snuff or deliver a working alternative foundation"), the task is certainly well-funded: 2 BTC, currently US$19,368, and the potential for a matching donation of another 2 BTC for a total of 4 BTC. Thirty-eight grand is definitely enough money to get anyone's attention, though don't ask me, because I don't know a great deal about Qubes' internals and I'm still trying to do this Firefox JIT thing in my "copious" spare time. But if you do, and you've got the hardware and you've got the need, step right up.

Meanwhile, Shaun struck again and ported BSNES. What was that I said about bounties and people who were already interested?

Easier Power ISA vectorizing for fun and profit with GCC x86 intrinsics


Oh, you kids. When I was a boy I had to write my TenFourFox's AltiVec VP9 decoder with compiler intrinsics by hand and do all the endian conversions from the Intel SSE2 version uphill, both ways, in the snow, naked. If we were good we got to eat broken glass for dessert. It's how we kept our teeth clean.

Yeah, with your cushy POWER9s and your sexy Blackbirds you just don't know how it used to be. You may not even know how it currently is thanks to x86intrin.h, a master include file that "magically" adds support for MMX, SSE and SSE2 on chips supporting AltiVec/VMX, as well as its Power-specialized subcomponents of mmintrin.h (MMX), xmmintrin.h (SSE) and emmintrin.h (SSE2) that many x86-centric software packages with vectorization will #include directly. Particularly for SSE and SSE2, which are well-covered by AltiVec/VMX, many SSE intrinsics can simply be compiled directly to VMX using these translation headers with little or no source code changes. The shims also include additional support for VSX and some later POWER7/8 instructions for better performance. Support for these headers first appeared in gcc 8.

Additional support is on the way. Besides the MMX, SSE and SSE2 support added by x86intrin.h, there is SSE3 (pmmintrin.h), SSSE3 (tmmintrin.h), and a presently incomplete implementation of SSE4.1 (smmintrin.h). (There are other x86 intrinsic shim headers that translate scalar x86 intrinsics into Power, but I won't talk about these further here except to say that x86intrin.h includes those too.)

Unfortunately, the semantics are not exact. Besides endianness concerns (Power Macs are big-endian, so in TenFourFox I had to swap around high and low merges, and some shuffles required different permute vectors), there are differences in exceptions (more in a moment), scalar floats in vector registers require VSX (sorry, G5 users), and of course there is currently no support for AES, AVX or AVX-512. Still, this is a substantial improvement.
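To make the idea concrete, here is a minimal sketch of the kind of code these headers are meant to handle. It is not taken from LAME or KANN; scale_add() is a hypothetical helper, and I'm assuming the handful of basic SSE intrinsics it uses are among those the Power translation headers cover (they should be, but check your compiler version). The same source builds natively on x86.

```c
/* Sketch only: plain SSE intrinsics, nothing Power-specific in the source.
 * scale_add() is a made-up helper for illustration. */
#include <stddef.h>
#include <xmmintrin.h>   /* SSE; on Power, gcc supplies a translation header */

/* y[i] += a * x[i] for n floats, four lanes at a time. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    const __m128 va = _mm_set1_ps(a);        /* broadcast a to all lanes */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);     /* unaligned loads */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                       /* scalar tail */
        y[i] += a * x[i];
}
```

On a POWER9 you would compile this with something like gcc -O3 -mcpu=power9 -DNO_WARN_X86_INTRINSICS; on x86 the extra define is harmless.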

As a real world example, let's compare two ways of approaching vectorization with LAME, the venerable MP3 encoder. On my Power Macs I use LAMEVMX, which incorporates tmkk's AltiVec patches, adds a little additional G5 sweetener, and then wraps it up into a three-headed "universal" Mach-O binary for G3, G4 and G5 processors. With the Quad G5 running at dim-the-lights energy usage, it encodes at about 25 to 30 times playback speed, over three times faster than the non-SIMD version. These patches are hand-written and fairly efficient, including different code paths for the G5 and the vagaries of its own vector unit for which the 32-bit versions are unproductive or less performant.

However, on the POWER9 with VSX, a simpler approach is just to use regular LAME's SSE intrinsics and let the headers sort it out. You need to set the "I know what I'm doing" define (-DNO_WARN_X86_INTRINSICS), but with the little bit of tweaking described in this article and some additional minor changes it pretty much "just works." The optimized version presented there is over seven times faster than a stock build already, but while gcc's autovectorization is pretty good, with the SSE shim headers runtime is cut by another 25 percent. Since that article was written there is now support for _MM_SHUFFLE, so that part may not be required depending on your compiler version (I use gcc 9.1), but I had to make some changes to the configure script instead to make it happier on ppc64le, plus a couple of 64-bit tweaks. With this patch I also observe about a 25% improvement. Apply it to the provided source for LAME 3.100, run configure with CFLAGS="-O3 -mcpu=power9 -DNO_WARN_X86_INTRINSICS", and then make -j24 (or your preference).

My POWER9 now encodes MP3 files at about 40x playback speed compared to 32x in the optimized scalar version, and chews through entire discs in record speed when I run a "LAMEVSX" process per hardware thread (take that, Phoronix). Could the hand-written version be ported to ppc64le or even VSX and be even faster? Perhaps, but it's a non-trivial amount of work and probably has some endian issues, while this quick-and-dirty build gives us a demonstrable improvement on existing code with relatively little effort.

But let's turn to an even more dramatic demonstration. One of my interests is recurrent neural networks and on my POWER6 is some partial code I wrote for an AltiVec-accelerated feed-forward net in C to speed up the math, since I don't really do Nvidia GPU AI work currently. (Raptor will take your money, though.) As a comparison I was looking at KANN, a simple C library for small to medium artificial neural nets such as multi-layer perceptrons, convolutional neural networks and, yes, recurrent neural networks. For pure CPU computation KANN is pretty fast as its benchmarks show; it will get stomped by a GPU-based solution as you scale up, but it's pretty good for projects that aren't massive and it will run on absolutely libre hardware. I went into its Makefile and changed the CFLAGS to -O3 -mcpu=power9 and built it with make -j24. I then tried the addition example where we'll teach an RNN how to do basic math:

% seq 30000 | awk -v m=10000 '{a=int(m*rand());b=int(m*rand());print a,b,a+b}' > numbers
% time ./examples/rnn-bit -m7 -o add.kan numbers
epoch: 1; cost: 0.0614594 (class error: 2.73%)
epoch: 2; cost: 0.000170362 (class error: 0.00%)
epoch: 3; cost: 8.32791e-05 (class error: 0.00%)
epoch: 4; cost: 7.43936e-05 (class error: 0.00%)
epoch: 5; cost: 4.07932e-05 (class error: 0.00%)
epoch: 6; cost: 3.74252e-05 (class error: 0.00%)
epoch: 7; cost: 2.82747e-05 (class error: 0.00%)
127.447u 0.076s 2:07.56 99.9% 0+0k 0+896io 0pf+0w

Seven training epochs using scalar code took about 127 wall clock seconds on this dual-4 Talos II, and now it knows how to add:

% echo 987654 321000 | ./examples/rnn-bit -Ai add.kan -
1308654
% perl -e 'print 987654 + 321000'
1308654

It then occurred to me that there were SSE results in the benchmarks. Ooooh! Sure enough, there are checks for __SSE__ in the code. Let's do a make clean, set CFLAGS in the Makefile to -O3 -mcpu=power9 -D__SSE__ -DNO_WARN_X86_INTRINSICS and see what happens:

kautodiff.c: In function ‘kad_trap_fe’:
kautodiff.c:2322:2: warning: implicit declaration of function ‘_MM_SET_EXCEPTION_MASK’ [-Wimplicit-function-declaration]
kautodiff.c:2322:25: warning: implicit declaration of function ‘_MM_GET_EXCEPTION_MASK’ [-Wimplicit-function-declaration]
kautodiff.c:2322:54: error: ‘_MM_MASK_INVALID’ undeclared (first use in this function)
kautodiff.c:2322:73: error: ‘_MM_MASK_DIV_ZERO’ undeclared (first use in this function)

See, I told you (they told us) it wasn't a perfect conversion. Currently it doesn't look like there's any support for SSE exceptions and they would probably not map properly onto VMX/VSX anyway, so the easiest solution here is to edit kautodiff.c, find kad_trap_fe(), and change

#if __SSE__

to

#if defined(__SSE__) && !defined(NO_WARN_X86_INTRINSICS)
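In context, the guarded function might look something like the following. This is a sketch reconstructed from the identifiers in the compiler errors, not a copy of KANN's source: trap_fp_errors() is a hypothetical stand-in for kad_trap_fe(), and the return value is my addition so you can tell which path was compiled in.

```c
/* Only pull in the real SSE header when we are not building via the shims. */
#if defined(__SSE__) && !defined(NO_WARN_X86_INTRINSICS)
#include <xmmintrin.h>
#endif

/* Hypothetical stand-in for a function like kad_trap_fe(): unmask the
 * invalid-operation and divide-by-zero exceptions where real SSE exists.
 * Returns 1 if trapping was enabled, 0 if the call was compiled out. */
int trap_fp_errors(void)
{
#if defined(__SSE__) && !defined(NO_WARN_X86_INTRINSICS)
    _MM_SET_EXCEPTION_MASK(_MM_GET_EXCEPTION_MASK()
                           & ~(_MM_MASK_INVALID | _MM_MASK_DIV_ZERO));
    return 1;
#else
    return 0;   /* shim build on Power: leave the FP environment alone */
#endif
}
```

Keying off NO_WARN_X86_INTRINSICS is a convenience, not a principled feature test; it works here because that define is only set when you are deliberately building through the translation headers.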

With that change, it compiles. But is it any better? Using our numbers file from before for consistency and doing seven epochs again,

% time ./examples/rnn-bit -m7 -o add.kan numbers
epoch: 1; cost: 0.0614577 (class error: 2.73%)
epoch: 2; cost: 0.000170445 (class error: 0.00%)
epoch: 3; cost: 8.33149e-05 (class error: 0.00%)
epoch: 4; cost: 7.44496e-05 (class error: 0.00%)
epoch: 5; cost: 4.08097e-05 (class error: 0.00%)
epoch: 6; cost: 3.74398e-05 (class error: 0.00%)
epoch: 7; cost: 2.82935e-05 (class error: 0.00%)
67.829u 0.075s 1:07.93 99.9% 0+0k 0+896io 0pf+0w
% echo 987654 321000 | ./examples/rnn-bit -Ai add.kan -
1308654

This runs in nearly half the time! In fact, this vectorized KANN is so good compared to my old hal-fassed AltiVec neural network experiment that I've completely scrapped it.
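If you want to check my arithmetic, the elapsed field in those time lines is minutes:seconds. A small sketch that converts the two and compares them follows; the function names are mine, and it only handles the M:SS.ss form shown above.

```c
#include <stdio.h>

/* Convert an elapsed field of the form "M:SS.ss" to seconds.
 * Returns a negative value if the string does not parse. */
double elapsed_seconds(const char *field)
{
    int minutes;
    double seconds;

    if (sscanf(field, "%d:%lf", &minutes, &seconds) != 2)
        return -1.0;
    return minutes * 60.0 + seconds;
}

/* Fraction of the baseline runtime that the improved build needs. */
double runtime_fraction(const char *baseline, const char *improved)
{
    return elapsed_seconds(improved) / elapsed_seconds(baseline);
}
```

Calling runtime_fraction("2:07.56", "1:07.93") gives about 0.53, so the SSE-shim build needs a little over half the scalar runtime, a speedup of just under 1.9 times.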

I should note that this was a particularly easy snag to fix because the exception checking here is probably not of major concern under normal usage, but it demonstrates that conversion is not always exact (or possible). I've also completely ignored the endian issue in this article because I'm conveniently running SSE code intended for a little-endian machine on a little-endian POWER9; even if it compiled properly on a big-endian system you may still need to do some additional work. However, the conversion shims are good enough that for many situations with basic vectorization code, Intel SIMD code can compile and "just work" on Power ISA, and can give you a starting point to determine whether it's worth doing additional conversion work to proper VMX/VSX sequences.

Yes, you kids and your fancy bi-endian machines and your new vector instructions and your smartypants compilers. You have it so much better than when we used to get only a bowl of hot molten lead slag for dinner. Sure, we got nerve damage and a low blood count, but it was something warm in our bellies and we could sleep for the 35 seconds or so before we had to have our kidneys removed. True story. Totes.

Power ISA improvements in 5.2 (and a Raptor tease)


I'm catching up on all the stuff that happened while I was semi-off-grid, and among it is kernel 5.2, which was declared released on July 7 and should be reaching your distribution soooooon (though Fedora 30 on this Talos II is still at 5.1.x as of this writing). Big general improvements are Sound Open Firmware, which is not an audio player for the ok prompt but rather open source firmware for audio devices, a (hopefully better) new mount(2) interface with new syscalls, performance improvements to the Budget Fair Queuing (BFQ) I/O scheduler, and additional CPU information leak protections using an architecture-independent mitigations= command line argument (it works on Power machines too, as well as x86, x86_64, ARM64 and s390). On PowerPC and 64-bit Power, mitigations=off sets nopti,nospectre_v1,nospectre_v2,spec_store_bypass_disable=off which respectively disable mitigations for user/kernel page table isolation (i.e., Meltdown), Spectre versions 1 and 2, and speculative store bypass (SSB). If set to auto, the default, then these mitigations are enabled in the kernel along with (on POWER8 and POWER9) mitigating SSB by inserting a store-forwarding barrier when entering and leaving kernel context. The particularly paranoid can set auto,nosmt to take the hit against L1TF and MDS attacks, but currently this disables SMT only on x86, because Power doesn't suck. ;)
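If you want to see what a given machine was actually booted with, the setting is just another token on the kernel command line. Here is a rough sketch of pulling it out of a command line string such as the contents of /proc/cmdline. get_mitigations() is a hypothetical helper of mine, it ignores quoted arguments, and treating the last occurrence as the winner is an assumption of this sketch, not something I've verified against the kernel's parser.

```c
#include <stddef.h>
#include <string.h>

/* Copy the value of the last mitigations= token in a kernel command line
 * into out (always NUL-terminated). Returns 1 if found, 0 otherwise.
 * A return of 0 means the kernel default applies, which is auto. */
int get_mitigations(const char *cmdline, char *out, size_t outlen)
{
    static const char key[] = "mitigations=";
    const size_t keylen = sizeof key - 1;
    const char *p = cmdline;
    int found = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';

    while (*p) {
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;                                  /* skip separators */
        size_t toklen = strcspn(p, " \t\n");      /* length of this token */
        if (toklen > keylen && strncmp(p, key, keylen) == 0) {
            size_t vlen = toklen - keylen;
            if (vlen >= outlen)
                vlen = outlen - 1;                /* truncate to fit */
            memcpy(out, p + keylen, vlen);
            out[vlen] = '\0';
            found = 1;
        }
        p += toklen;
    }
    return found;
}
```

Feed it the line read from /proc/cmdline; for a boot line containing mitigations=auto,nosmt it hands back auto,nosmt, and for one without the argument it returns 0 and leaves the buffer empty.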

Power-specific changes include the long-awaited (at least by me) YOLO DAWR support on POWER9, as well as support for Kernel Userspace Access/Execution Prevention (KUAP and KUEP). KUP features collectively are analogous to Intel Supervisor Mode Access Prevention (though I like this SMAP better) and prevent the kernel from accidentally accessing userspace outside copy_to/from_user() and/or executing code in userspace. Support is somewhat varied: most 32-bit CPUs except the 400, 440 and e500 series support both KUAP and KUEP (though the poor old PowerPC 601 lacks an NX segment bit, so no KUEP), but KUP on 64-bit Power currently requires the radix MMU, meaning only POWER9 CPUs in radix mode. You can see if your CPU is supported in this list, looking for CPU_HAVE_KUAP and CPU_HAVE_KUEP.

Meanwhile, who says 32-bit PowerPC is dead? 5.2 also adds 32-bit support for the Kernel Address Sanitizer (KASAN), further improving security, and some significant reductions in 32-bit syscall overhead (a 12 to 17% improvement on the null_syscall benchmark).

Although I won't be able to make OpenPOWER this year in San Diego, Raptor is going, and is teasing a new POWER9 product announcement. However, I will be at Vintage Computer Festival West exhibiting some of my PowerPC, PA-RISC and SPARC laptops and portable workstations. If you're going to be near the Computer History Museum in Mountain View (near the Google Death Star) on August 3 or 4, drop by, say hi, and play with the toys.

One big happy Void


The PowerPC Void Linux project has officially merged its 32-bit and 64-bit Power offerings, though to be fair this was expected for a while and just makes good sense. Meanwhile, substantial progress is being made on the ports and it looks like most packages are buildable, but actual package availability for the big-endian (32-bit and 64-bit) and musl flavours still lags ppc64le, at least right now, so that G5 under your desk may have to wait a bit. Live CDs are still available.

OCC is the sound you make when throttled


Back from distant climes to find an interesting tweet from Raptor relating to the POWER9 OCC. The OCC, or On-Chip Controller, monitors power usage and thermal stability, and can surface this information to the kernel via cpufreq. Raptor is asking users who get throttling warnings in dmesg to report them, though I haven't seen any such issues on my thermally constrained Blackbird or on this cool-running Talos II, and it's not clear how widespread the issue actually is.

Meanwhile, users who get weird OCC-related crashes when the POWER9 is in a stop state are encouraged to upgrade to the latest firmware release candidate to pick up this fix. This apparently is being triggered by recent kernel versions that enable deep power saving modes.

FreeBSD on POWER


We haven't covered BSD a great deal in this blog even though I personally run NetBSD on three systems myself (two of which are in regular service), mostly because my own system and, I suspect, the majority of the OpenPOWER install base run Linux. However, FreeBSD 11.3 is now officially released and has fairly good support for 32-bit and 64-bit PowerPC on Power Mac hardware, so it's worth pointing out that 12.0 (and 13.0) have also been tested on the Blackbird and thus should also work on the Talos II. That said, on the PowerPC wiki page -CURRENT is recommended for Blackbird, 12.0 is mandatory for OpenPOWER (thus 11.x won't work and presumably won't ever work), and X11 is currently listed "on Power8/Power9 [as] still a work in progress." Nevertheless, POWER8 systems also work, hardware support is improving, and the OS offers another big-endian option for people preferring to run their systems that way, so hopefully Justin or Mark, who are more versed in the FreeBSD world than I am, have some comments about how well it works for others to explore.

Firefox 68 on POWER


Firefox 68 is out. I haven't had a chance to exhaustively test it on my ppc64le Talos II due to business trips and some family obligations, but on cursory testing the browser seems to function normally. Unfortunately our last-minute workaround for (what is now clearly) a compiler bug in bug 1512162 did not make release, so you'll need to add it if you build from source; without it, builds at some optimization levels may crash or misbehave. We have not yet narrowed down the issue in gcc, and on my last check clang still can't build the browser fully. Fortunately the fix did land on the new Extended Support Release 68, so individuals who prefer the ESR should be able to build as-is from there, and the fix also does not appear to be necessary on big-endian. Thanks to Dan Horák's usual quick work, the patch is also in the standard Fedora packages. The configurations I'm using are unchanged from Firefox 67.

DIAF, Amazon Music (and DRM)


It used to be that Amazon Music was a decent choice for playing the music you purchased. Not only did the AutoRip feature mean you had an automatic digital copy of participating CDs you purchased, playable from any web browser (I used TenFourFox for this purpose up until recently), but you still had the physical disc and discs you bought before got automatically added to your AutoRip library if Amazon got rights to do so. It was cool to watch my music library just fill in over the years from past purchases and still have the original CD if I needed it.

Well, turns out I'll need those CDs after all, because guess what Amazon Music does now?

"Amazon Music Unlimited" my pasty sculpted white butt. The message is almost intentionally misleading. What I've "disabled" in my browser is the Google Widevine EME component, because it doesn't exist for ppc64le, and while Amazon's community staff are as useless as ever that "deficiency" appears to be the real reason it won't work. Amazon, in fact, is claiming Linux on any platform isn't supported for the browser version or the dedicated client at all.

I wasn't going to take no for an answer. I used uBlock Origin to remove as many of the elements as I could. I couldn't get the blurring away easily but I was able to get into my old albums library and try to play something. It looked like it was starting, but no music issued forth. In the Browser console was this damning message:

No, you lying sack of filth. I didn't pref anything off. I didn't do anything. You did.

How did this work before? Amazon Music would say it required Flash, but it actually didn't (TenFourFox hasn't supported NPAPI plugins for years). The music files were just MP3. You could stream them or download them, and while some of the tracks were watermarked, I considered that a reasonable tradeoff for the convenience. Now it won't even let you in to download them.

I'm no Stallmanite. I could live with a compromise where music I don't own requires some sort of DRM, because I'll just preview it (at least for as long as they'll still allow it, which currently they still seem to), and I'll buy it if I want it. The problem is that Amazon has now effectively defined everything I've ever bought from them (and I have, in fact, bought a few tracks that I don't have a disc for) as "music I don't own." You can't even download them again despite Amazon's instructions because the browser client doesn't let you get there, even if you block the restraining elements. I'm not going to stop buying CDs from Amazon if they have a decent price, but I won't consider AutoRip as part of the value calculation anymore, and I certainly won't buy any form of digital music from them until this changes.

If there are going to be choices in computing, then this kind of crap has to stop. DRM isn't compatible with open source by definition. Locking down a service that previously didn't enforce DRM is a still greater sin, and potentially even an actionable one. When DRM like Widevine is the only choice for playing content, then the only computers that can play it are the ones they control, and I wouldn't run some potentially untrustworthy blob on my Talos II anyway even if a ppc64le version were one day offered. Amazon Music can die in a fire.