Posts

Latest Posts

Now your LLaMa is playing with POWER


Now that the invasion of the large language models has occurred and we will all bow to our GPT overlords, I just generated a pull request to add additional POWER9-specific optimizations to llama.cpp, what all the cool kids are using for LLMs who aren't down with OpenAI. This repo moves quick but it's where the magic is happening if this is what you're into. It will work with both Alpaca and LLaMa models.

In a previous article we talked about autovectorization using conversion of Intel vector intrinsics to POWER9, but this is good old fashioned assembly code and hand-written C. The part that really helped was changing their pure-C "F16" (half-precision) float conversion code to use VSX instead. The rolls-off-your-tongue POWER9-and-up xscvhpdp and xscvdphp instructions convert half-precision floats to and from double-precision respectively (xscvdphp will also work on single-precision, which is handy, because the explicit conversion is from single-precision "F32"), and we also use POWER8 mffprd and mtfprd for toll-free copies between general and float registers without requiring a spill to memory. That change alone is about 12 percent faster than the old pure-C compute and lookup code. Additionally, we also have our own vectorized version of quantize_row_q4_0 like ARM NEON and AVX-256 written with VMX/VSX intrinsics. It's even a little better, because we were able to use our VMX floating-point multiply-add and remove a couple minor inefficiencies in the code. Additionally, people used to G4 and G5-era AltiVec will enjoy the fact that the newer intrinsics substantially map directly to ARM's — I especially liked vec_extract as an all-purpose replacement for all of the NEON vget_lane_* variations, as well as vec_signed for vcvtq_s32_f32 for converting floats in place, and the all-purpose simplified vec_splats for making a splat vector out of anything — making conversion much more straightforward when you need to write your own code.

I did play with alpaca.cpp, the other older white meat, and the changes here should more or less apply to that codebase as well. However, given how quickly llama.cpp evolves and the greater development interest, llama.cpp seems the best way forward for continued evolution.

I will say in the spirit of full disclosure that despite these improvements my 16GB 4P/4E/8G M1 MacBook Air still pops out tokens several times faster than this 64GB dual-8 Talos II, even full-tilt with all 64 threads in use (the cat still looks startled every time the fans rev). On the other hand, we're also comparing a 2017 CPU with one from 2020, and one with specific hardware acceleration for neural networks that llama.cpp takes particular advantage of. Even with Power10's improved bfloat16 support and matrix math operations, specific work would be needed to support those features which won't be coming from me (stay tuned for Power11, I guess). There are other opportunities for vectorization to be done, though at the rate this code base evolves it would be better waiting for one of the mainstream architectures to pick up a SIMD version we can convert first. In the meantime, while you should be advised that going beyond the 7B or 13B models will require patience regardless of how much RAM you have, I think this is definitely better than what we started with.

Firefox 110 on POWER


Firefox 110 is out, with graphics performance improvements like GPU-accelerated 2D canvas and faster WebGL, and the usual under the hood updates. The record's still broken and bug 1775202 still is too, so you'll either need this patch — but this time without the line containing desktop_capture/desktop_capture_gn, since that's gone in the latest WebRTC update — or put --disable-webrtc in your .mozconfig if you don't need WebRTC at all. I also had to put #pragma GCC diagnostic ignored "-Wnonnull" into js/src/irregexp/imported/regexp-parser.cc for optimized builds to complete on this Fedora 37 system and I suspect this is a gcc bug; you may not need it if you're not using gcc 12.2.1 or build with clang. Finally, I trimmed yet another patch from the PGO-LTO diff, so use the new one for Firefox 110 and the .mozconfigs from Firefox 105.

Vikings now has Blackbirds


If you're on the other side of that great pond called the Atlantic, Vikings' OpenPOWER store now lists Blackbirds starting at €3695 + VAT. Not just the board, the package includes a "4-core DD2.3 (v2) CPU, 2U heatsink, 16GB ECC RAM, bequiet! TFX power supply, all packaged nicely in a Antec slim desktop case." That's already a nice quiet basic system and more than enough to get you started with OpenPOWER, but if you want something almost silent, consider pairing it with their so far exclusive water block assembly for POWER9 for €155 + VAT, though you'll need to BYO pump, tubing, reservoir and fluid.

Linux 6.2


Linux 6.2 is out. Among its marquee updates are improved Rust-in-kernel support (strings, formatting and printing, memory allocation, macros, etc.), adding TCP Protective Load Balancing (PLB) for IPv6, reducing the overhead of read-copy update (RCU) operations using lazy callbacks, performance and RAID improvements for Btrfs, and userspace support for runtime verification with safety-critical systems. And, of course, support for Apple silicon and Retbleed sucks less on Skylake, but who cares about that around here anyway?

On the Power ISA side, probably the most noteworthy change is official support for big endian ELFv2 kernels. A nice upgrade for our Sir Mix-A-Lot brigade! Another interesting commit is the one to allow compile time support for the lharx and lbarx instructions (present on ISA v2.06/POWER7 and up). The lwarx (32-bit word) and ldarx (64-bit doubleword) load instructions, along with the corresponding store instructions stwcx. and stdcx. (and a conditional branch), are used to implement atomic load-store-compare/exchange operations by placing and checking reservations on particular memory locations. The newer instructions can do this at halfword (short) and byte level respectively (with sthcx. and stbcx.) instead of reserving at least an entire 32-bit word, reducing contention in tightly packed structs. In the future, it might also benefit the newly introduced Power ISA-specific spinlock implementation as well, which is also new in this release.

Expect 6.2 to make it to bleeding edge users and Fedora in the very near future.

Tonight's game on OpenPOWER: Shadow Warrior


Well, it's been awhile since we expanded our games library, so let's go back to our regular fast food diet of FPSes and select one from the Build side of the house this time: Shadow Warrior. Build games have a reputation starting with Duke Nukem 3D (a game for another day) and that reputation is well-deserved, so let's get this out of the way: if you found these games iffy in the 1990s, rest assured they've aged badly, because you'll find the content level positively radioactive now between the adult humor, graphic violence and (this game in particular) incredibly inappropriate cultural stereotypes. Stop reading this article now and look at some of our other game builds.

On the other hand, Shadow Warrior was probably the most technically superior of the Build games (with the possible exception of Monolith's Blood): more sophisticated sector effects, coloured lighting, true transparency (including water, though used sparingly to avoid spoilers and performance issues), fog and clouds, larger levels, room-over-room effects and the part I liked the most (and was curiously missing from the classic Mac OS port by MacPlay-Westlake Interactive), voxel-based objects that were truly 3D. All of these features plus OpenGL have made it to JonoF's Shadow Warrior Port (JFSW), using Ken Silverman's Build and Polymost engines (more info).

JFSW builds pretty much out of the box with SDL 2; just type make (or make -j24 or such to exercise your other cores), then copy the .GRP group file from either the 3DRealms shareware install or a registered or retail version to ~/.jfsw (I used my MacPlay CD and named it swmac.grp). Shadow Warrior used redbook audio for the retail version, so for music, rip the tracks and save them as track02.ogg (intro) to track14.ogg ("Lo Wang Raps") in the same directory. Then go to where you've built JFSW and start the game with ./sw, and a configuration window will appear to select your resolution. Note that while widescreen resolutions are supported (and look good), the game still uses 4:3 assets, so things like Lo Wang's sword will be cut off.

A note on resolutions and colour depth: 8bpp modes are rendered 100% in software, which is very fast even on Blackbirds with just BMC graphics, and works beautifully on virtually any system. If you select a 24bpp mode, the game will try to use OpenGL. On my system this caused a freeze (actually an infinite loop, once I stepped through it in a debugger) whenever it attempts to render reflections in a mirror. This appears to be related to non-POT texture support which virtually every card anybody would be running probably supports properly. If you get the same freeze, kill the game and edit jfbuild/src/polymost.c. On line 4903 or thereabouts you'll see if ((method & METH_POW2XSPLIT) && (tsizx != xx)) which if you change to if (0) will get around the code that glitches. I can't tell if this is specific to my card, to OpenPOWER or to gcc, and it doesn't happen in software mode, which plays 100% fine all the way to the end including nuking Zilla himself.

Don't mess with Lo Wang.

Firefox 109 on POWER


Firefox 109 is out with new support for Manifest V3 extensions, but without the passive-aggressive deceitful crap Google was pushing (yet another reason not to use Chrome). There are also modest HTML, CSS and JS improvements.

As before linking still requires patching for bug 1775202 using this updated small change or the browser won't link on 64-bit Power ISA (alternatively put --disable-webrtc in your .mozconfig if you don't need WebRTC). Otherwise the browser builds and runs fine with the LTO-PGO patch for Firefox 108 and the .mozconfigs from Firefox 105.