Easier Power ISA vectorizing for fun and profit with GCC x86 intrinsics


Oh, you kids. When I was a boy I had to write TenFourFox's AltiVec VP9 decoder by hand with compiler intrinsics and do all the endian conversions from the Intel SSE2 version, uphill, both ways, in the snow, naked. If we were good we got to eat broken glass for dessert. It's how we kept our teeth clean.

Yeah, with your cushy POWER9s and your sexy Blackbirds you just don't know how it used to be. You may not even know how it currently is, thanks to x86intrin.h, a master include file that "magically" adds support for MMX, SSE and SSE2 on chips supporting AltiVec/VMX, along with its Power-specialized subcomponents mmintrin.h (MMX), xmmintrin.h (SSE) and emmintrin.h (SSE2), which many x86-centric software packages with vectorization will #include directly. Particularly for SSE and SSE2, which are well covered by AltiVec/VMX, many SSE intrinsics can simply be compiled directly to VMX using these translation headers with little or no source code change. The shims also include additional support for VSX and some later POWER7/8 instructions for better performance. Support for these headers first appeared in gcc 8.
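
To see just how transparent this can be, here's a minimal sketch of my own (not from any package discussed here) using garden-variety SSE intrinsics; it should build unmodified on a ppc64le box with gcc 8 or later:

/* sse-demo.c: SSE intrinsics through the translation headers.
   Build with something like:
   gcc -O3 -mcpu=power9 -DNO_WARN_X86_INTRINSICS -o sse-demo sse-demo.c */
#include <stdio.h>
#include <xmmintrin.h>  /* on Power, this is gcc's SSE shim header */

int main(void) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
    __m128 sum = _mm_add_ps(a, b);  /* compiles to a VMX/VSX vector add */
    float out[4] __attribute__((aligned(16)));
    _mm_store_ps(out, sum);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 6 8 10 12 */
    return 0;
}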

Additional support is on the way. Besides the MMX, SSE and SSE2 support added by x86intrin.h, there are SSE3 (pmmintrin.h), SSSE3 (tmmintrin.h) and a presently incomplete implementation of SSE4.1 (smmintrin.h). (There are also shim headers that translate scalar x86 intrinsics into Power, but I won't discuss those further here except to say that x86intrin.h includes them too.)

Unfortunately, the semantics are not exact. Besides endianness concerns (Power Macs are big-endian, so in TenFourFox I had to swap around high and low merges, and some shuffles required different permute vectors), there are differences in exception handling (more on that in a moment), scalar floats in vector registers require VSX (sorry, G5 users), and of course there is currently no support for AES, AVX or AVX-512. Still, this is a substantial improvement.
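
To give you a taste of what those endian fixups look like, here's a conceptual sketch (illustrative only, and emphatically not the actual TenFourFox code). Intel's "unpack low" interleaves the elements at the low addresses, but on a big-endian Power those same elements sit on the high (mergeh) side of the register, so the names cross over:

#include <altivec.h>

/* what _mm_unpacklo_epi8(a, b) computes, spelled in big-endian AltiVec */
static inline vector unsigned char
unpacklo_u8_be(vector unsigned char a, vector unsigned char b)
{
    return vec_mergeh(a, b);  /* Intel "low" unpack = VMX "high" merge */
}

/* and _mm_unpackhi_epi8(a, b) is the opposite merge */
static inline vector unsigned char
unpackhi_u8_be(vector unsigned char a, vector unsigned char b)
{
    return vec_mergel(a, b);
}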

As a real world example, let's compare two ways of approaching vectorization with LAME, the venerable MP3 encoder. On my Power Macs I use LAMEVMX, which incorporates tmkk's AltiVec patches, adds a little additional G5 sweetener, and then wraps it all up into a three-headed "universal" Mach-O binary for G3, G4 and G5 processors. With the Quad G5 running at dim-the-lights energy usage, it encodes at about 25 to 30 times playback speed, over three times faster than the non-SIMD version. These patches are hand-written and fairly efficient, including separate code paths for the G5, whose vector unit has its own vagaries that make some of the 32-bit sequences unproductive or slower.

However, on the POWER9 with VSX, a simpler approach is just to use regular LAME's SSE intrinsics and let the headers sort it out. You need to set the "I know what I'm doing" define (-DNO_WARN_X86_INTRINSICS), but with the tweaks in this article and some additional minor changes it pretty much "just works." The optimized version presented there is already over seven times faster than a stock build, but while gcc's autovectorization is pretty good, the SSE shim headers cut runtime by another 25 percent. Since that article was written, _MM_SHUFFLE is now supported, so that part may not be required depending on your compiler version (I use gcc 9.1); instead, I had to make some changes to the configure script to make it happier on ppc64le, plus a couple of 64-bit tweaks. With this patch I also observe about a 25% improvement. Apply it to the provided source for LAME 3.100, run configure with CFLAGS="-O3 -mcpu=power9 -DNO_WARN_X86_INTRINSICS" and then make -j24 (or your preference).
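
In full, the dance is something like this (assuming you saved the patch as lame-3.100-vsx.diff, a name I just made up; adjust to taste):

% tar xzf lame-3.100.tar.gz && cd lame-3.100
% patch -p1 < ../lame-3.100-vsx.diff
% ./configure CFLAGS="-O3 -mcpu=power9 -DNO_WARN_X86_INTRINSICS"
% make -j24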

My POWER9 now encodes MP3 files at about 40x playback speed, compared to 32x for the optimized scalar version, and chews through entire discs at record speed when I run a "LAMEVSX" process per hardware thread (take that, Phoronix). Could the hand-written version be ported to ppc64le or even VSX and be faster still? Perhaps, but it's a non-trivial amount of work and probably has some endian issues lurking, while this quick-and-dirty build gives us a demonstrable improvement on existing code with relatively little effort.

But let's turn to an even more dramatic demonstration. One of my interests is recurrent neural networks, and sitting on my POWER6 is some partial code I wrote for an AltiVec-accelerated feed-forward net in C to speed up the math, since I don't really do Nvidia GPU AI work currently. (Raptor will take your money, though.) As a comparison I was looking at KANN, a simple C library for small to medium artificial neural nets such as multi-layer perceptrons, convolutional neural networks and, yes, recurrent neural networks. For pure CPU computation KANN is pretty fast, as its benchmarks show; it will get stomped by a GPU-based solution as you scale up, but it's pretty good for projects that aren't massive and it will run on absolutely libre hardware. I went into its Makefile, changed the CFLAGS to -O3 -mcpu=power9 and built it with make -j24. I then tried the addition example, where we teach an RNN how to do basic math:

% seq 30000 | awk -v m=10000 '{a=int(m*rand());b=int(m*rand());print a,b,a+b}' > numbers
% time ./examples/rnn-bit -m7 -o add.kan numbers
epoch: 1; cost: 0.0614594 (class error: 2.73%)
epoch: 2; cost: 0.000170362 (class error: 0.00%)
epoch: 3; cost: 8.32791e-05 (class error: 0.00%)
epoch: 4; cost: 7.43936e-05 (class error: 0.00%)
epoch: 5; cost: 4.07932e-05 (class error: 0.00%)
epoch: 6; cost: 3.74252e-05 (class error: 0.00%)
epoch: 7; cost: 2.82747e-05 (class error: 0.00%)
127.447u 0.076s 2:07.56 99.9% 0+0k 0+896io 0pf+0w

Seven training epochs using scalar code took about 127 wall clock seconds on this dual 4-core Talos II, and now it knows how to add:

% echo 987654 321000 | ./examples/rnn-bit -Ai add.kan -
1308654
% perl -e 'print 987654 + 321000'
1308654

It then occurred to me that there were SSE results in the benchmarks. Ooooh! Sure enough, there are checks for __SSE__ in the code. Let's do a make clean, set CFLAGS in the Makefile to -O3 -mcpu=power9 -D__SSE__ -DNO_WARN_X86_INTRINSICS and see what happens:

kautodiff.c: In function ‘kad_trap_fe’:
kautodiff.c:2322:2: warning: implicit declaration of function ‘_MM_SET_EXCEPTION_MASK’ [-Wimplicit-function-declaration]
kautodiff.c:2322:25: warning: implicit declaration of function ‘_MM_GET_EXCEPTION_MASK’ [-Wimplicit-function-declaration]
kautodiff.c:2322:54: error: ‘_MM_MASK_INVALID’ undeclared (first use in this function)
kautodiff.c:2322:73: error: ‘_MM_MASK_DIV_ZERO’ undeclared (first use in this function)

See, I told you (they told us) it wasn't a perfect conversion. Currently it doesn't look like there's any support for SSE exceptions and they would probably not map properly onto VMX/VSX anyway, so the easiest solution here is to edit kautodiff.c, find kad_trap_fe(), and change

#if __SSE__

to

#if defined(__SSE__) && !defined(NO_WARN_X86_INTRINSICS)
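
As an aside, if you genuinely wanted trap-on-FP-error behaviour on Power, one possible substitute (a hedged sketch of mine, not part of KANN) is glibc's feenableexcept(), a GNU extension from <fenv.h>:

#define _GNU_SOURCE
#include <fenv.h>

/* hypothetical stand-in for kad_trap_fe() on Power: raise SIGFPE on
   invalid operations and divide-by-zero, roughly what the SSE
   exception-mask fiddling was after */
static void kad_trap_fe_power(void)
{
    feenableexcept(FE_INVALID | FE_DIVBYZERO);
}

For our purposes, though, the trap can simply stay disabled.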

With the guard change in place, it compiles. But is it any better? Using our numbers file from before for consistency and doing seven epochs again,

% time ./examples/rnn-bit -m7 -o add.kan numbers
epoch: 1; cost: 0.0614577 (class error: 2.73%)
epoch: 2; cost: 0.000170445 (class error: 0.00%)
epoch: 3; cost: 8.33149e-05 (class error: 0.00%)
epoch: 4; cost: 7.44496e-05 (class error: 0.00%)
epoch: 5; cost: 4.08097e-05 (class error: 0.00%)
epoch: 6; cost: 3.74398e-05 (class error: 0.00%)
epoch: 7; cost: 2.82935e-05 (class error: 0.00%)
67.829u 0.075s 1:07.93 99.9% 0+0k 0+896io 0pf+0w
% echo 987654 321000 | ./examples/rnn-bit -Ai add.kan -
1308654

This runs in nearly half the time! In fact, this vectorized KANN is so good compared to my old half-assed AltiVec neural network experiment that I've completely scrapped it.

I should note that this was a particularly easy snag to fix because the exception checking here is probably not of major concern under normal usage, but it demonstrates that the conversion is not always exact (or even possible). I've also completely ignored the endian issue in this article because I'm conveniently running SSE code intended for a little-endian machine on a little-endian POWER9; even if such code compiles properly on a big-endian system, you may still need to do some additional work. However, the conversion shims are good enough that in many situations basic Intel SIMD code can compile and "just work" on Power ISA, and they give you a starting point for deciding whether it's worth converting further to proper VMX/VSX sequences.

Yes, you kids and your fancy bi-endian machines and your new vector instructions and your smartypants compilers. You have it so much better than when we used to get only a bowl of hot molten lead slag for dinner. Sure, we got nerve damage and a low blood count, but it was something warm in our bellies and we could sleep for the 35 seconds or so before we had to have our kidneys removed. True story. Totes.
