Now your LLaMa is playing with POWER


Now that the invasion of the large language models has occurred and we will all bow to our GPT overlords, I've submitted a pull request to add additional POWER9-specific optimizations to llama.cpp, which is what all the cool kids who aren't down with OpenAI are using to run LLMs locally. The repo moves quickly, but it's where the magic is happening if this is what you're into, and it works with both LLaMa and Alpaca models.

In a previous article we talked about autovectorization by converting Intel vector intrinsics to POWER9, but this is good old fashioned assembly code and hand-written C. The part that really helped was changing their pure-C "F16" (half-precision) float conversion code to use VSX instead. The rolls-off-your-tongue POWER9-and-up xscvhpdp and xscvdphp instructions convert half-precision floats to and from double precision respectively (xscvdphp also works on single precision, which is handy, because the explicit conversion is from single-precision "F32"), and we use the POWER8 mffprd and mtfprd mnemonics for toll-free copies between general-purpose and float registers without requiring a spill to memory. That change alone is about 12 percent faster than the old pure-C compute-and-lookup code (a sketch of the approach appears below).

We also have our own vectorized version of quantize_row_q4_0, written with VMX/VSX intrinsics in the same spirit as the existing ARM NEON and AVX2 versions. It's even a little better, because we were able to use the VMX floating-point multiply-add and remove a couple of minor inefficiencies in the code. People used to G4 and G5-era AltiVec will enjoy the fact that the newer intrinsics map almost directly to ARM's: I especially liked vec_extract as an all-purpose replacement for all of the NEON vget_lane_* variations, vec_signed standing in for vcvtq_s32_f32 to convert floats in place, and the simplified vec_splats for making a splat vector out of anything. That makes conversion much more straightforward when you need to write your own code.
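
To give the flavor of the F16 change, here's a minimal sketch of the approach in GCC extended inline assembly. This is illustrative rather than the literal code from the pull request: the helper names and the bare uint16_t typedef are mine, but the instruction sequence is the general idea.

    #include <stdint.h>

    typedef uint16_t fp16_t;   /* raw IEEE 754 binary16 bits */

    /* half -> single: mtfprd moves the bits into a float register with no
       spill to memory, xscvhpdp widens to double, frsp rounds to single. */
    static inline float fp16_to_fp32(fp16_t h) {
        float f;
        double d;
        __asm__(
            "mtfprd   %0,%2\n"
            "xscvhpdp %0,%0\n"
            "frsp     %1,%0\n"
            : "=d"(d), "=f"(f)
            : "r"(h));
        return f;
    }

    /* single -> half: xscvdphp narrows (it accepts single or double
       precision), then mffprd moves the bits back to a general register. */
    static inline fp16_t fp32_to_fp16(float f) {
        double d;
        fp16_t r;
        __asm__(
            "xscvdphp %0,%2\n"
            "mffprd   %1,%0\n"
            : "=d"(d), "=r"(r)
            : "f"(f));
        return r;
    }

Built with -mcpu=power9, each direction is just a few instructions with no table lookup and no store/reload round trip, which is where the improvement over the old compute-and-lookup code comes from.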
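
And as an illustration of how those intrinsics line up with their NEON counterparts, here's a tiny made-up fragment (not the actual quantize_row_q4_0 code) that scales a vector of floats with a fused multiply-add, converts it to integers in place and pulls out a lane:

    #include <altivec.h>

    /* Illustrative only. The comments show the rough NEON equivalent. */
    static inline int scale_and_extract(vector float x, float inv_d) {
        vector float vid    = vec_splats(inv_d);                   /* vdupq_n_f32    */
        vector float scaled = vec_madd(x, vid, vec_splats(0.0f));  /* vmlaq_f32      */
        vector signed int q = vec_signed(scaled);                  /* vcvtq_s32_f32  */
        return vec_extract(q, 0);                                  /* vgetq_lane_s32 */
    }

Nearly every line has a one-to-one NEON analogue, which is why porting the existing SIMD paths is mostly mechanical once you know the mapping.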

I did play with alpaca.cpp, the other, older white meat, and the changes here should more or less apply to that codebase as well. However, given how quickly llama.cpp evolves and the greater development interest behind it, llama.cpp seems the better place to focus continued work.

I will say in the spirit of full disclosure that despite these improvements my 16GB 4P/4E/8G M1 MacBook Air still pops out tokens several times faster than this 64GB dual-8 Talos II, even full-tilt with all 64 threads in use (the cat still looks startled every time the fans rev). On the other hand, we're also comparing a 2017 CPU with one from 2020, and one with specific hardware acceleration for neural networks that llama.cpp takes particular advantage of. Even with Power10's improved bfloat16 support and matrix math (MMA) operations, specific work would be needed to support those features, and that work won't be coming from me (stay tuned for Power11, I guess). There are other opportunities for vectorization, though at the rate this code base evolves it would be better to wait for one of the mainstream architectures to pick up a SIMD version we can convert first. In the meantime, be advised that going beyond the 7B or 13B models will require patience regardless of how much RAM you have, but I think this is definitely better than what we started with.

Comments

  1. Man, every time I wonder what you’re up to, something like this is the answer. Awesome work, and super cool to see.

  2. That's beautiful, great job getting the intrinsics right, and much appreciated as always! Would you mind elaborating on the timing differential you observed with your MBA vs. the 64-thread POWER, or at least being a little more specific, i.e. are we talking about 2x slower, or 10x slower than the Apple Silicon implementation? Also: I wonder if AMD Instinct series cards like the MI50 (32GB VRAM) could work with POWER for running a 30B version of llama at a substantial pace. Even if it doesn't exactly break benchmarks, combined with a competent free software stack I would expect it to do great. Are you aware of anybody running these Instinct cards on Talos? It certainly seems like the current dash for CPU-only inference of LLMs is only going to get us so far (as 4-bit precision drops the context substantially), especially considering that the up-and-coming open source models are likely only going to grow in size. Whoever is able to bring an affordable big-memory accelerator to market is going to make a lot of money here.


