Big and little POWER shouldn't just be endian


While the majority of OpenPOWER installations by this point are probably running little-endian, every single POWER chip runs big: big power usage, that is. POWER9 is still performance-competitive with x86_64, a situation that continues to improve as more software gets better optimized, and there have been huge gains since POWER4/the PowerPC 970 in particular, but POWER chips still run relatively hot and relatively hungry. AnandTech tried to normalize this for POWER8 systems by estimating transactions per watt; power measurements can be very imprecise and depend on more than just the system architecture, but even with that consideration the tested Tyan POWER8 in particular was outclassed by nearly a factor of three by a Xeon E5-2699. Possibly in response, POWER9 is more aggressive with power savings than POWER8 and makes a lot of microarchitectural improvements, using 25% less juice for 50% more zip (so roughly a doubling of performance per watt), and Power10 supposedly improves on POWER9's performance per watt by a further factor of at least 2.6 according to IBM's figures.
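The "roughly a doubling" figure is easy to check. Here is a minimal sketch in Python that takes the 25% and 50% numbers above at face value; the function name is made up for illustration and the result is only as good as those inputs.

```python
def perf_per_watt_ratio(perf_gain: float, power_change: float) -> float:
    """Return new performance-per-watt divided by old performance-per-watt.

    perf_gain: fractional change in performance (0.50 means 50% faster).
    power_change: fractional change in power draw (-0.25 means 25% less).
    """
    return (1.0 + perf_gain) / (1.0 + power_change)


# POWER9 relative to POWER8, using the figures quoted in the text.
ratio = perf_per_watt_ratio(perf_gain=0.50, power_change=-0.25)
print(f"{ratio:.2f}x performance per watt")  # 1.5 / 0.75
```

The same function applies to any pair of generations, provided both figures are measured on the same workload, which vendor numbers rarely guarantee.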

But IBM's playbook for improving perf per watt hasn't really changed. Either you're boosting performance by juicing the microarch, jimmying IPC with more instructions and more cores, or both, or you're trying to diminish power usage with heavier clock speed throttling or turning off cores. While shooting the die budget at lower-wattage pack-in accelerators is a clever hybrid approach, their application-specific nature also means they're rather less useful in typical situations than their marketing would allege (look at how little currently uses the gzip accelerator in every POWER9, for example). You can do a lot with strategies like these — AMD certainly does — but sooner or later you'll hit a wall somewhere, either against the particular limitations of the design you're working with or against the intrinsic physical limitations of making a hippo do gymnastics while eating fewer calories.

Apple Silicon has a lot of concerning issues with it from a free computing perspective, but its performance is impressive, and its performance per watt is jaw-dropping. A lot of this is the secret sauce in their microarch, which ironically came from P.A. Semi, originally a Power ISA licensee, and some may be due to details of the on-board GPU. But a good portion is also due to the big core-little core approach largely pioneered by ARM's big.LITTLE, introduced with the Cortex-A7, and used to great effect in the M1 series. After all, if you want to get the best of both worlds, make some of the cores use less power and give those cores tasks that require less oomph (efficiency or E-cores), reserving the heavy tasks for the big ones (power or P-cores). Intel thinks so too: Lakefield and Alder Lake both attempt the same sort of heterogeneous CPU topology for x86_64, and it would be hard to believe AMD isn't looking to make the same jump for their next iteration.

The chief issue with going that route is making sure that the cores are getting work commensurate with their capabilities. This is easy for Apple since they control the whole banana: macOS Quality of Service is all about doing just that (you'd think they would do something based on nice levels as well, but I guess all the sweet talk about being desktop Un*x went out the window somewhere around Mavericks). Linux added initial support for big.LITTLE with kernel 3.10, but it took years for other improvements to the Linux scheduler to make it meaningful. Intel made things worse for themselves in Lakefield and Alder Lake by using lower-power Atom-based E-cores that didn't support AVX-512 (and the Tremont E-cores in Lakefield didn't even support AVX2, meaning such tasks couldn't be run by them at all). Rather than hinting Windows 11 or the internal hardware not to send AVX-512 code to the Gracemont E-cores, Alder Lake just doesn't support AVX-512, full stop, on any core. Kernel 5.13 supports Alder Lake, but kernel 5.15 has dawned and there is no specific Intel Thread Director support so far, though there is scheduler support for AArch64 E-cores that can't run 32-bit code. And Alder Lake is turning out to be very power-hungry, which calls some of the design into question, in addition to various compatibility issues when software unwittingly puts tasks on the E-cores that don't work as expected.
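To make "work commensurate with their capabilities" concrete, here is a toy sketch in Python of placing tasks by a quality-of-service hint. It is not the macOS QoS API or the Linux scheduler: the class names are only loosely modeled on the idea of QoS tiers, and `Core`, `pick_core` and `dispatch` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class QoS(IntEnum):
    """Hypothetical priority tiers, lowest to highest."""
    BACKGROUND = 0
    UTILITY = 1
    USER_INITIATED = 2
    USER_INTERACTIVE = 3


@dataclass
class Core:
    name: str
    kind: str  # "P" for a performance core, "E" for an efficiency core
    queue: list = field(default_factory=list)


def pick_core(cores, qos, threshold=QoS.USER_INITIATED):
    """Prefer P-cores at or above the threshold, E-cores below it.

    Falls back to every core if the preferred kind is absent, then
    picks the shortest run queue (first core wins a tie).
    """
    wanted = "P" if qos >= threshold else "E"
    candidates = [c for c in cores if c.kind == wanted] or list(cores)
    return min(candidates, key=lambda c: len(c.queue))


def dispatch(cores, tasks):
    """Place each (task_name, qos) pair and return the resulting queues."""
    for task_name, qos in tasks:
        pick_core(cores, qos).queue.append(task_name)
    return {c.name: list(c.queue) for c in cores}


cores = [Core("P0", "P"), Core("E0", "E"), Core("E1", "E")]
tasks = [
    ("render", QoS.USER_INTERACTIVE),
    ("indexer", QoS.BACKGROUND),
    ("backup", QoS.UTILITY),
]
placement = dispatch(cores, tasks)
print(placement)
```

A real scheduler also weighs load, thermals and cache locality, and it needs the hint to exist in the first place, which is the part Apple gets for free by controlling both ends.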

Still, the time is coming where Power ISA should start thinking about a big-little CPU, maybe even for Power11. We already have big cores (if IBM will ever get their heads out of their rear ends and release the firmware source), but we also have an already extant little OpenPOWER core: Microwatt. While Microwatt doesn't support everything that POWER9 or Power10's large cores do, it's still intended to be a fully compliant OpenPOWER core, and since the Linux kernel is already starting to cater to heterogeneous designs, a set of POWER8-compliant Microwatt E-cores could still execute on the same die along with a set of Power11 full-fat P-cores. Add logic on-chip to move threads to the P-cores if they hit an instruction the E-cores don't support and you're already most of the way there with relatively minor changes to the Linux kernel.
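The "move threads to the P-cores if they hit an instruction the E-cores don't support" idea can be sketched as a toy simulation in Python. Everything here is an assumption made for illustration: the instruction labels are placeholders rather than real Power ISA mnemonics, and `Thread`, `step` and `run` are hypothetical names, not anything in Microwatt or the Linux kernel.

```python
from dataclasses import dataclass

# Placeholder instruction sets: the P-core runs everything the E-core
# does, plus one newer operation the E-core lacks.
E_CORE_OPS = frozenset({"add", "load", "store", "branch"})
P_CORE_OPS = E_CORE_OPS | {"matrix_op"}


@dataclass
class Thread:
    name: str
    program: list
    pc: int = 0          # index of the next instruction to execute
    core: str = "E"      # threads start on an efficiency core
    migrations: int = 0


def step(thread):
    """Run one instruction; return False once the program is finished."""
    if thread.pc >= len(thread.program):
        return False
    op = thread.program[thread.pc]
    supported = E_CORE_OPS if thread.core == "E" else P_CORE_OPS
    if op not in supported:
        if thread.core == "P":
            # No bigger core to fall back on: a genuine illegal instruction.
            raise ValueError(f"illegal instruction: {op}")
        # Trap on the E-core: migrate and retry the same instruction.
        thread.core = "P"
        thread.migrations += 1
        return True
    thread.pc += 1
    return True


def run(thread):
    """Run a thread to completion and return it."""
    while step(thread):
        pass
    return thread


heavy = run(Thread("heavy", ["add", "load", "matrix_op", "store"]))
light = run(Thread("light", ["add", "load", "store"]))
print(heavy.core, heavy.migrations, light.core, light.migrations)
```

In this sketch migration is sticky: once a thread lands on a P-core it stays there. A real design would also need a policy for sending threads back to the E-cores, or one stray instruction would pin a thread to the hungry cores for good.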

What IBM (or any future OpenPOWER chip builder, though so far no one else is in the performance category) needs to avoid is what seems to be dooming Alder Lake: Intel has managed to hit the bad luck jackpot with a chip that not only uses more power but also has more compatibility problems. Software updates will fix this issue somewhat, but a little more forethought might have staved it off, and the apparent greater wattage draw should have been noticed long before it left the lab. But IBM has already shown wattage improvements over the last two generations, and if the P- and E-core functionalities are made appropriately comparable, a big-little Power11 (with open firmware, please!) could be a very compelling next upgrade for the next generation of Power-based workstations and servers. Apple has clearly demonstrated that highly efficient and powerful computing experiences are possible when hardware and software align. There's no reason OpenPOWER and Linux or *BSD can't do the same on open platforms.

Comments

  1. Probably what needs to happen is that someone besides IBM has to enter the market. Big ask, but I don't see IBM catering to the desktop market for this. Sure, the money saved on electricity is a nice to have, but the institutions likely to have a POWER server or datacenter are likely to not really be hurting for the money.

  2. Friend Classic, in my opinion the case of Power is very different. Power was born and has grown to deliver maximum performance at all times. I believe the big-little architecture, which has existed for years anyway, cannot fit Power, because Power is a pure performance processor that needs all of its cores to be high-performance. Even if you manage to use P-cores and E-cores to the maximum, E-cores can never be as powerful as P-cores, so you will not have all the performance that servers need. IBM is an avant-garde company that invests millions and millions of dollars in research and development; if it had wanted to, it would have implemented this approach a long time ago. I think it is evident that it is not compatible with the continuous, intensive calculations Power is required to do. And considering that IBM has not made desktop CPUs for years now (see the old and mythical PPCs), I think we will hardly see such an approach on our Power ...

  3. The Microwatt is a rudimentary core. If you actually wanted a meaningful big.LITTLE, you can't use just any bottom-of-the-barrel design; the little core needs to be as sophisticated as the big one to make it all work decently.

    That said, for the market Power is in (server, and repurposed for workstation by Raptor), neither Intel, nor AMD actually intend to use big.LITTLE.

    You don't need hybrid to do decently with power efficiency on the desktop; improving performance/watt of the big cores first is a better idea. After you have picked the low-hanging fruit there, there is in theory an opportunity to improve throughput-bound multi-threaded tasks by using hybrid, but it is actually tricky, and half-assing it with Microwatt would not work.

    In any case, I don't think you need IBM to hop on this bandwagon at all.
    Power ISA mostly only needs IBM to keep iterating competent big cores (and if they make them scale to low idle power in desktop/WS use, that is all that is needed) and not give up on the market. Will be hard enough already even without them trying to follow concepts that are not useful for their market niche.

    Replies
    1. That's my point exactly: the Microwatt is the only core that's even close to ISA 3.0 compliance. The other open cores are various versions of ISA 2.x. We don't want E-cores that are less capable on the ISA level as well as performance.

      As far as big-little not being useful to their market niche, I dispute this, since performance/watt is becoming more and more relevant to enterprise users. I agree more can be done with current designs, but I don't agree that heterogeneous cores have no part of the strategy.

  4. The really weird thing about Alder Lake is that while the little cores don't support AVX-512, the big ones explicitly do - to the point that some manufacturers' UEFI settings screens explicitly allow you to enable AVX-512, which inherently involves disabling the little cores outright. Apparently a number of the engineers at Intel were positively livid that AVX-512 was going to be disabled for the release, too. For lower-end Alder Lake CPUs that have disabled the little cores outright, I'd think turning AVX-512 back on would be a no-brainer. Of all Intel's bungled steps in the last decade or so, the ongoing Alder Lake business is one of the most unnecessary.

    I don't really have an answer for whether Power should go BIG.little itself. Microwatt could be a cheerful basis for that strategy, and it would win a lot of goodwill, but time will tell whether IBM or anyone else will jump onto that as a solution.

  5. I would much rather see AMD create a hybrid BIG.little CPU architecture where the BIG cores are x86 and the little cores are ARM, though given the software complexity, it is perhaps best suited for an application like an always-on gaming console.

