Showing posts matching the search for power11

Power11 hits the market this month


IBM made the date official: Power11 launches July 25, with the 32 AI-core Spyre Accelerator expected to follow in the fourth quarter. IBM's launch products will be the full-rack Power E1180 with up to 256 SMT-8 Power11 cores with 2MB L2 each and up to 128MB of shared L3 (8MB per core, per the corrected figures in the Red Book) with 64TB of DDR5 memory, the midrange 4U Power E1150 with up to 120 Power11 cores and 16TB of DDR5, the junior 4U Power S1124 with up to 60 Power11 cores with 8MB of L3 per core and 8TB of DDR5, and the "low-end" 2U Power S1122 with up to 60 Power11 cores and 4TB of DDR5. The processors come in 16, 24 or 30-core versions; the E systems have four sockets (with up to four nodes in the E1180) and the S systems have two. All four systems can run AIX and Linux, and all systems except for the E1150 can run IBM i. As is usual for IBM's initial offerings, internally they look like straight-up implementations of the Blueridge reference platform and should be expected to scale accordingly. And if you have to ask how much they are, well ...

It's notable that the "meet the family" document IBM links from the press release — so we can assume it's officially blessed — says nothing about OMI, only DDR5 RAM. However, IBM has made it clear that Power11 will continue to have OMI, since enterprise Power10 customers would certainly have had an investment in it, and the upper-tier datasheets reference OMI channel capacity. We don't know if the OMI firmware for Power11 is open and libre (it was not in Power10), nor if the Synopsys IP blocks reportedly used in Power10's I/O are still present (IBM didn't say), nor whether the "low-end" binned CPUs are different in this regard.

If there are going to be third-party Power11 systems, and IBM didn't say anything in the press release about them either, they generally follow six to twelve months after. We have heard little from Raptor since then about the SolidSilicon S1 and X1, and because all indications suggest the S1 is a Power10 implementation without the crap, this already puts them behind the curve. That said, adapting Power11 to a next-generation Power ISA workstation should be possible: the Talos II and T2 Lite are fairly straightforward reworks of the reference POWER9 Romulus design, and the Blackbird is still Romulus, just in a much smaller form factor. These first-generation Power11 boxes, as presumably performant as they are, wouldn't be nice to have in an office, and IBM just doesn't do end-user sales. Creating a T3 based on Blueridge would seem to be the best way forward for Raptor to regain the top slot in OpenPOWER workstations — assuming the architecture is still open.

[UPDATE: I have been advised by an anonymous individual with knowledge of the situation that a new Raptor announcement on products under development is scheduled for Q1 2026 ... which would be "six to twelve months after" as predicted. "Open firmware" is specifically mentioned and absolutely planned. It's worth pointing out that both Raptor and SolidSilicon are now listed as top-tier Platinum members of OpenPOWER alongside IBM itself, which implies SolidSilicon is still in the mix and IBM is still backing OpenPOWER. They stressed this is not an official announcement, so take it for what it's worth.]

Early Power11 signals in the kernel


A number of people have alerted me to some new activity around Power11 in the Linux kernel, such as this commit and a PVR (processor version register) value of 0x0F000007. It should be pointed out that all this is very preliminary work and is likely related to simulation testing; we don't even know for certain what node size it's going to be. It almost certainly does not mean such a CPU is imminent, nor does it tell us when one will arrive. Previous estimates had said 2024-5, but the smart money says no earlier than next calendar year and probably at the later end of that timeframe.

That said, the reputed pressures around Power10 that caused closed IP to be incorporated are hopefully no longer as acute for Power11, and off-the-books discussions I've had suggest IBM internally acknowledges its strategic mistake. That would be good news for Power11, but it's not exactly clear what this means for Solid Silicon and the S1 because S1's entire value proposition is being Power10 without the crap. While S1 will certainly come out before Power11, we still don't know when, and if there's a short window between S1 and a fully open Power11 then S1 could go like Osborne.

"Short" here will be defined in terms of how much work it takes to adapt the Power11 reference system. IBM understandably likes to sell its launch systems first and exclusively before the chips and designs trickle down. The Talos II and, to a lesser extent, the Blackbird are relatively straightforward reworks of Romulus (POWER9's reference design), so one would think adapting Power11 would similarly require little adjustment, though Romulus used the ASPEED BMC and any Raptor Power11 would undoubtedly use (Ant)arctic Tern/Solid Silicon's X1. In contrast, there'd be a bit more work to port Rainier (Power10's reference) to the S1, since the RAM would be direct-attach instead of OMI and there may be differences to account for with PCIe, plus the BMC change. The last estimate we had for the S1 machines was late 2024; putting this all together and assuming that date is at all accurate, such a system may have a year or two on the market before Power11 exits its IBM-exclusive phase.

That could still be worth it, but all of this could be better answered if we had a little more insight into S1 and its progress, and I've still got my feelers out to talk to the Solid Silicon folks. You'll see it here first when I get a bite.

Enter the IBM z17 mainframe with Telum II (more clues for Power11?)


IBM is announcing their new z17 mainframe, based on the Telum II (see our notes on the original Telum CPU). IBM first announced the Telum II last year and the z17, its intended first deployment, has now emerged just about bang on time.
Still, we're obviously more interested in Power ISA around here, and IBM has yet to say much substantive about Power11 other than the usual assertions of additional power efficiency, more cores and higher clock. It is also expected to offer DDR5 support for enhanced memory bandwidth, though this is all but certain to require OMI DDR5, not direct-attached RAM as in our Raptor boxes. But it's often instructive to look at what's going on with IBM mainframes for microarchitectural clues now that Z-machines and IBM "big" Power chips often have the same underlying design.

The first Telum strongly emphasized cache. Interestingly, it did so by dropping dedicated L3 and L4 altogether: instead, IBM developed a strategy where cores could reach into the L2 of other cores and treat that as L3, and reach into other chips' cache and treat that as L4. Each chip had eight cores and 32MB of L2 per core, giving lots of opportunity for more efficient utilization. The picture of the Telum II die above shows that IBM has not substantially deviated from this plan, using the same 128K/128K L1 but increasing L2 to 36MB per core. IBM's documentation says that there are eight cores per chip, but at a cursory glance there appear to be ten on the die, likely for yield reasons (two cores would be fused off). Assuming these dud cores still have usable cache, however, that matches IBM's specs of up to 360MB of effective L3 per chip and a whopping 2.88GB of L4 per system.

The cores top out at 5.5GHz with various microarchitectural improvements such as better branch prediction and faster store writeback and address translation, all the typical kinds of tweaks that would also likely show up in Power11. Power11 is also expected to remain on 7nm with a "refined" process instead of moving to 5nm. (It's possible that Power12, whenever that arrives, may skip 5nm entirely.)

Of course, the marketing material on z17 is all AI, all the time. IBM's claimed AI improvements seem to descend from an enhanced "DPU" ("data processing unit") with its own 64K (32K instruction/32K data) L1 cache, capable of 24 trillion INT8 operations per second, the kind of bolt-on hardware that could also be incorporated into or scaled down for other products. In fact, such a product exists already, shown above: IBM's Spyre Accelerator, which is basically 32 more DPUs. These attach over PCIe and would be a good alternative to our having to scrabble around with iffy GPU support, assuming that IBM supports this in Linux (but they already do for LinuxONE systems, so it shouldn't be much of a stretch).

If you have the money and a convenient IBM salesdroid who actually answers the phone, you too can horrify your electrical utility starting in June. As for those of us on the small systems side, Power11 in whatever form it ends up taking is not anticipated to emerge until Q3 2025, presumably as what will be the E1100 series starting with the E1180 and going down. This further shrinks the production and sales window for the long-anticipated Raptor S1 systems, however, and there hasn't been a lot of news about those — to say nothing of what the Trump tariffs could mean for rolling out a new system.

Intel gets worse, but Power11 might get better


Just in case we needed any more reassurance we made the right move with OpenPOWER: Phoronix is reporting that Intel is about to get even more restrictive with firmware. For as much flak as Intel (deservedly) takes over the Intel Management Engine and other closed highly-privileged blobs, the actual Firmware Support Package has so far been open source and royalty-free (it's what's layered on top that's the problem). There isn't a smoking gun or significant direct context in this Twitter thread, but the issue seems to be around the upcoming "Scalable FSP" architecture. Previously, open source firmware had control of initialization and could call into the closed blobs (or not) as necessary, but FSP 3.0 seems to invert this, giving a new closed blob control to call into the open source firmware (or not). This lets Intel cut projects like Coreboot on x86_64 out of the picture, and can only be seen as a way to directly subvert their operation. A lot of this stuff is under NDA currently, but as systems incorporating FSP 3.0 start appearing we should begin to get a clearer understanding.

By the way, don't expect AMD to act any better. Remember that they're the company bringing you Pluton: quoted from the article, "Pluton will also prevent people from running software that has been modified without the permission of developers." It wouldn't be surprising to see AMD's Platform Security Processor pick up additional lock-in capabilities to reinforce this and other vendor controls.

Meanwhile here in the computing underground, we have our own problems with Power10, but there may be some light on the horizon for Power11. It was always a mystery after POWER8 and POWER9's completely open firmware why IBM would take a sudden wrong turn with Power10, but this unsubstantiated post from the same thread (if it's not wishful thinking) suggests COVID staffing issues rather than philosophical concerns were to blame for IBM using off-the-shelf vendored IP blocks requiring the existing blobs in its firmware.

I don't know who that is, or what internal events at IBM they're privy to, so it should be taken with a grain of salt. (If they read this blog, feel free to follow up in the comments or with me in E-mail.) Still, it makes more sense than IBM suddenly slamming the door on OpenPOWER after the tremendous goodwill built up with POWER8 and especially POWER9. It does also suggest, however, that the situation with Power10 is more or less baked in. The roadmap for POWER9, currently the OpenPOWER architecture with the widest install base, basically blew up and the long-promised POWER9 AIO "Axon" or "Axone" never arrived. I'm predicting that Power10 will have a smaller install base than POWER9 because it's still IBM-exclusive, no other vendors so far have announced machines, and Raptor (the only "low-end" vendor of OpenPOWER workstations) has said they won't ship a Power10 system with blobs. If there wasn't enough money on the table to release Axon for IBM's biggest OpenPOWER ecosystem, there won't be for a newly-freed "Power10+."

But there's plenty of time for Power11, possibly landing in the 2024-5 timeframe, just in time for POWER9's technological ebb. And if simple humanpower really was the reason IBM took shortcuts, hopefully their staffing and design teams will be in a much better place by then (wars, pestilence, locusts and inflation notwithstanding). It would come just in time because what makes OpenPOWER a compelling alternative to x86_64 and Apple ARM (and what so far has eluded RISC-V) is performance. I'd like to see Power11 continue to keep us in the game — but without compromises this time.

Big and little POWER shouldn't just be endian


While the majority of OpenPOWER installations by this point are probably running little-endian, every single POWER chip runs big — big power usage, that is. While POWER9 is still performance-competitive with x86_64 and this situation continues to improve as more software gets better optimized (there have been huge gains since POWER4/the PowerPC 970 in particular), POWER chips still run relatively hot and relatively hungry. AnandTech tried to normalize this for POWER8 systems by estimating transactions per watt; power measurements can be very imprecise and depend on more than just the system architecture, but even with that consideration the tested Tyan POWER8 in particular was outclassed by nearly a factor of three by a Xeon E5-2699. Possibly in response, POWER9 is more aggressive with power savings than POWER8 and makes a lot of microarchitectural improvements, using 25% less juice for 50% more zip (so roughly a doubling of performance per watt), and Power10 supposedly improves on POWER9's performance per watt by at least another 2.6 times, according to IBM's figures.

But IBM's playbook for improving perf per watt hasn't really changed. Either you're boosting performance by juicing the microarch, jimmying IPC with more instructions and more cores, or both, or you're trying to diminish power usage with heavier clock speed throttling or turning off cores. While shooting the die budget at lower-wattage pack-in accelerators is a clever hybrid approach, their application-specific nature also means they're rather less useful in typical situations than their marketing would allege (look at how little currently uses the gzip accelerator in every POWER9, for example). You can do a lot with strategies like these — AMD certainly does — but sooner or later you'll hit a wall somewhere, either against the particular limitations of the design you're working with or against the intrinsic physical limitations of making a hippo do gymnastics while eating fewer calories.

Apple Silicon has a lot of concerning issues with it from a free computing perspective, but its performance is impressive, and its performance per watt is jaw-dropping. A lot of this is the secret sauce in their microarch, which ironically came from P.A. Semi, originally a Power ISA licensee, and some may be due to details of the on-board GPU. But a good portion is also due to the big core-little core approach largely pioneered with the ARM big.LITTLE Cortex A7 and used to great effect in the M1 series. After all, if you want to get the best of both worlds, make some of the cores use less power and give those cores tasks that require less oomph (efficiency or E-cores), reserving the heavy tasks for the big ones (power or P-cores). Intel thinks so too: Lakefield and Alder Lake both attempt the same sort of heterogeneous CPU topology for x86_64, and it's hard to believe AMD isn't looking to make the same jump for their next iteration.

The chief issue with going that route is making sure that the cores are getting work commensurate with their capabilities. This is easy for Apple since they control the whole banana: macOS Quality of Service is all about doing just that (you'd think they would do something based on nice levels as well, but I guess all the sweet talk about being desktop Un*x went out the window somewhere around Mavericks). Linux added initial support for big.LITTLE with kernel 3.10, but it took years for other improvements to the Linux scheduler to make it meaningful. Intel made things worse for themselves in Lakefield and Alder Lake by using lower-power Atom-based E-cores that didn't support AVX-512 (and the Tremont E-cores in Lakefield didn't even support AVX2, meaning such tasks couldn't be run by them at all). Rather than hinting Windows 11 or the internal hardware not to send AVX-512 code to the Gracemont E-cores, Alder Lake just doesn't support AVX-512, full stop — on any core. Kernel 5.13 supports Alder Lake, but kernel 5.15 has dawned and there is still no specific Intel Thread Director support, though there is scheduler support for AArch64 E-cores that can't run 32-bit code. And Alder Lake is turning out to be very power-hungry, which calls some of the design into question, in addition to various compatibility issues when the scheduler unwittingly puts tasks on E-cores where they don't work as expected.

Still, the time is coming where Power ISA should start thinking about a big-little CPU, maybe even for Power11. We already have big cores (if IBM will ever get their heads out of their rear ends and release the firmware source), but we also have an already extant little OpenPOWER core: Microwatt. While Microwatt doesn't support everything that POWER9 or Power10's large cores do, it's still intended to be a fully compliant OpenPOWER core, and since the Linux kernel is already starting to cater to heterogeneous designs, a set of POWER8-compliant Microwatt E-cores could still execute on the same die along with a set of full-fat Power11 P-cores. Add logic on-chip to move threads to the P-cores if they hit an instruction the E-cores don't support and you're already most of the way there with relatively minor changes to the Linux kernel.

What IBM — or any future OpenPOWER chip builder, though so far no one else is in the performance category — needs to avoid is what seems to be dooming Alder Lake: they've managed to hit the bad luck jackpot with a chip that not only uses more power but has more compatibility problems. Software updates will fix this issue somewhat but a little more forethought might have staved it off, and the apparent greater wattage draw should have been noticed long before it left the lab. But IBM has already shown wattage improvements over the last two generations and if the P- and E-core functionalities are made appropriately comparable, a big-little Power11 — with open firmware please! — could be a very compelling next upgrade for the next generation of Power-based workstations and servers. Apple has clearly demonstrated that highly efficient and powerful computing experiences are possible when hardware and software align. There's no reason OpenPOWER and Linux or *BSD can't do the same on open platforms.

Cache splash in Telum means seventh heaven for POWER11?


AnandTech has a great analysis of IBM's new z/Architecture mainframe processor Telum, the successor to z15 (so you could consider it the "z16" if you like) scheduled for 2022. The most noteworthy part of that article is Telum's unusual approach to cache.

Most conventional CPUs (keeping in mind mainframes are hardly conventional, at least in terms of system design), including OpenPOWER chips, have multiple levels of cache; so did z15. L1 cache (divided into instruction and data) is private to the core and closest to it, usually measured in double-digit kilobytes on contemporary designs. It then fans out into L2, which is also usually private to an individual core and in triple-digit kilobyte range, and then some level of L3 (plus even L4) cache which is often shared by an entire processor and measured in megabytes. Cache size and how cache entries may be placed (i.e., associativity) involve a tradeoff between the latency of searching a larger cache, die space considerations and power usage, versus the performance advantages of fewer cache misses and reduced use of slower peripheral memory.

While every design has some amount of L1, there certainly have been processors that dispensed with other tiers of cache. Most of Hewlett-Packard's late lamented PA-RISC architecture had no L2 cache at all, with the L1 cache being unusually large in some units (the 1997 PA-8200 had 4MB of total L1, 2MB each for data and instructions). Closer to home, the PowerPC 970 "G5" (derived from the POWER4) carried no L3; the 2005 dual-core 970MP, used in the Power Mac G5 Quad, IBM POWER 185 and YDL PowerStation, instead had 1MB of L2 per core which was on the large side for that era. Conversely, the Intel Itanium 2 could have up to 64MB of L4 cache; Haswell CPUs with GT3e Iris Pro Graphics can use the integrated GPU's eDRAM as a L3 victim cache for the same purpose as an L4, though this feature was removed in Skylake. However, the Sforza POWER9 in Raptor workstations is more typical of modern chips with three levels of cache: the dual-8 02CY649 in this machine I'm typing on has 32/32KB L1, 512KB L2 and 10MB L3 for each of the eight CPU cores. In contrast, AMD Zen 3 uses a shared 32MB L3 between up to eight cores, with fewer cores splitting the pot in more upmarket parts.

With money and power consumption being little object in mainframes, however, large multi-level caches rule the day. The IBM z15 processor "drawer" (there are five drawers in a typical system) divides itself into four Compute Processors, each CP containing 12 cores with 128/128K L1 (compare to Apple M1 with 192/192K) and split 4MB/4MB L2 per core paired with 256MB of shared L3, overseen by a single System Controller which provides a whopping 960MB of shared L4. This gives it the kind of throughput and redundancy expected by IBM's large institutional customers who depend on transaction processing reliability. The SC services the four CPs almost like an old-school northbridge, but to L4 cache instead of main RAM.

Telum could have doubled down on this the way z15 literally doubled down on z14 (twice the L3, nearly half again as much L4), but instead it dispenses with L3 and L4 altogether. L1 jumps to 256/256K, and in shades of PA-RISC L2 balloons to 32MB per core, with eight cores per chip. Let's zoom in on the die.
The 7nm, 530mm² die shows the L2 cache in the centre of the eight cores, which is already a tipoff as to how IBM's arranged it: cores can reach into other cores' cache. If a cache line gets evicted from a core's L2 and the core can find space for it within another core, then the cache line goes to that core's L2 and is marked as L3. This process is not free and does incur more latency than a traditional L3 when an L3 line stored elsewhere must be retrieved, but the ample L2 makes this condition less frequent, and in the less common case where a core needs data that another core had already evicted into it as L3, it can simply adopt the line. Overall, this strategy means better utilization of cache that adapts better to more diverse workloads, because the large total L2 space can be flexibly redirected as "virtual L3" to cores with greater bandwidth demands.

It doesn't stop there, though, because Telum has another trick for "virtual L4." Recall that the z15 uses five drawers in a typical system; each drawer has an SC that maintains the L4 cache. Telum is two chips to a package, with four packages to a unit (the equivalent of a z15 "drawer") and four units to a system. If you can reach into other cores' L2 to use them as L3, then it's a simple conceptual leap to reach into other chips (even in different units) and use their L2 as L4. Again, latency jumps over a more traditional L4 approach, but this means theoretically a typical Telum system has a total of 8GB that could be redirected as L4 (7936MB if you don't count the requesting chip's own 256MB of L2). With 256 cores in this system, there's bound to be room somewhere faster than main memory.

What makes this interesting for OpenPOWER is that z/Architecture and POWER naturally tend to cross-pollinate. (History favours POWER, too. POWER chips already took over IBM i first with the RS64-based A35 and finally with the eCLipz project; IBM AS/400 a/k/a i5/OS a/k/a i hardware used to be its own bespoke AS/400 architecture.) z/Architecture is decidedly not Power ISA but some microarchitectural features are sometimes shared, such as POWER6 and z10, which emerged from a common development process and as a result had similar fabrication technologies, execution units, floating-point units, busses and pipelines.

POWER10 is almost certainly already taped out if IBM is going to be anywhere close to a Q4 2021 release, so whatever influence Telum had on its creation has already happened. But Telum at the microarchitecture level sure looks more like POWER than z15 did: there is no more CP/SC division but rather general purpose cores in a NUMA topology more like POWER9, more typical PCIe controllers (in this case PCIe 5.0) for I/O and more reliance on specialized pack-in accelerators (Telum's headline feature is an AI accelerator for SIMD, matrix math and fast activation function computation; no doubt some of its design started with POWER10's own accelerator). Frankly, that reads like a recipe for POWER11. While a dual-CPU POWER11 workstation might not have much need for L4, the "virtual L3" strategy could really pay off for the variety of workloads workstations and non-mainframe servers have to do, and on a four or eight-socket server, the availability of virtual L4 starts outweighing any disadvantage in latency.

The commonalities should not be overstated, as Telum is also "only" SMT-2 (versus SMT-4 or SMT-8 for POWER9 and POWER10) and the deep 5GHz-plus pipeline the reduced SMT count facilitates doesn't match up with the shorter pipeline and lower clockspeeds on current POWER generations. But that's just part of the chips being customized for their respective markets, and if IBM can pull this trick off for z/Architecture it's a short jump to making the technology work on POWER. Assuming we don't have OMI to worry about by then, that could really be something to look forward to in future processor generations, and a genuinely unique advance for the architecture.

CXL is going to eat OMI's lunch


The question is whether that's a bad thing. And as it stands right now, maybe it's not.

High I/O throughput has historically been the shiny IBM dangled to keep people in the Power fold, and was a featured part of the POWER9 roadmap even though those parts never emerged. IBM's solution to the memory throughput problem was the Centaur buffer used in POWER8 and scale-up Cumulus POWER9 systems (as opposed to our scale-out Nimbus POWER9s, which use conventional DDR4 RAM and an on-chip controller), and then for Power10 the Open Memory Interface, or OMI, a subset of OpenCAPI. In these systems, the memory controller-buffer accepts high-level commands from the CPU(s), abstracting away the details of where the underlying physical memory actually is and reordering, fusing or splitting those requests as required. Notoriously, OMI has an on-board controller, and its firmware isn't open-source.

But why should the interconnect be special-purpose? Compute Express Link (CXL) defines three classes of protocol: CXL.io, a CPU-to-device interconnect based on PCIe 5.0 with protocol enhancements; CXL.cache, allowing peripheral devices to coherently access CPU memory; and CXL.mem, an interface for low-latency access to both volatile and non-volatile memory. CXL.cache and CXL.mem are closely related and themselves transmit over a standard PCIe 5.0 PHY. Memory would be an instance of a CXL Type 3 device, implementing both the CXL.io and CXL.mem specifications (Type 1 devices implement CXL.io and CXL.cache, and rely on access to CPU memory; Type 2 devices, such as GPUs or other accelerators, implement all three protocols). The memory topology is highly flexible. If this sounds familiar, you might be thinking of Gen-Z, which aimed for an open, royalty-free "memory semantic" protocol; Gen-Z started the merge into the CXL Consortium, led by Intel, in January.

IBM was part of Gen-Z, but eventually let it dangle in favour of OpenCAPI and OMI, and while it is a contributing member of CXL, this seems to have been a consequence of its earlier involvement with Gen-Z. But really, what's OMI's practical future anyway? So far we've seen exactly one chipset implementation from one vendor, and that implementation has directly harmed Power10's wider adoption apart from IBM's own hardware. OMI promises 25Gbps per lane at a 5ns latency, but Samsung's new CXL memory module puts 512GB of DDR5 RAM on the bus at nearly 32Gbps. It's a cinch that Power11, whenever it gets on the roadmap, would support at least PCIe 5.0 or whatever it is by then, and CXL would appear to be a better overlay on that baseline. Devices of all sorts could share a huge memory pool, even GPUs. A lot more companies are on board as well, which would mean more choices, greater staying power, and a better chance of open driver support as more devices emerge.

There are still some aspects of CXL that aren't clear. Although it's advertised as an open industry standard, there's nothing saying it's royalty or patent-free (Gen-Z explicitly was, or at least the former), and the download for the specification has an access agreement. The open aspect may not be much better either: Samsung has an ASIC controller in their memory device but it still may need a blob to drive it, either internally or as part of CPU firmware (earlier prototypes used an FPGA), and nothing says that another manufacturer might not require one either.

Still, OMI has the growing stench of death around it, and it never got the ecosystem support IBM was hoping for; CXL currently looks like everything technologically OMI was to be and more, and at least so far not substantially worse from a policy perspective. Other than as a sop to their legacy customers, one may easily conclude there's no technological nor practical reason to keep OMI in future IBM processors. With nothing likely changing on the horizon for Power10's firmware, that may be cautiously good news for us for a future Power11 option.

Microwatt goes multiprocessor


It's been a while since we dropped in on Microwatt, the OpenPOWER VHDL softcore. Microwatt now runs on multiple FPGA boards or can be run (slowly) in simulation, and is capable of booting Linux. Raptor uses Microwatt for the Arctic Tern soft BMC. Although it still doesn't support vector instructions, recent commits have added an FPU and many of the standard special-purpose registers, and the newest ones now add support for SMP.

The newest pull request, not yet merged at this writing, allows more than one processor core to be created by adding an NCPUS option to soc.vhdl. These cores can be debugged separately over JTAG, share the same view of memory and the same timebase value, and can be individually activated. For interrupts, each has its own presentation controller in the XICS.

Although Microwatt cores are currently of only modest performance, more cores — if you have the space — can certainly improve its throughput and the range of applications it could be practical for. Unfortunately, we've still yet to hear anything new about the Solid Silicon S1 or how libre Power11 will end up being. Hopefully, as the Microwatt design gets more efficient, at least the very smallest Power ISA systems will now have some additional flexibility to work with.

Now your LLaMa is playing with POWER


Now that the invasion of the large language models has occurred and we will all bow to our GPT overlords, I just submitted a pull request to add additional POWER9-specific optimizations to llama.cpp, which is what all the cool kids who aren't down with OpenAI are using for LLMs. This repo moves quick, but it's where the magic is happening if this is what you're into. It will work with both Alpaca and LLaMa models.

In a previous article we talked about autovectorization by converting Intel vector intrinsics to POWER9, but this is good old-fashioned assembly code and hand-written C. The part that really helped was changing the pure-C "F16" (half-precision) float conversion code to use VSX instead. The rolls-off-your-tongue POWER9-and-up xscvhpdp and xscvdphp instructions convert half-precision floats to and from double precision respectively (xscvdphp will also work on single precision, which is handy, because the explicit conversion is from single-precision "F32"), and we also use POWER8 mffprd and mtfprd for toll-free copies between general and float registers without requiring a spill to memory. That change alone is about 12 percent faster than the old pure-C compute-and-lookup code. We also have our own vectorized version of quantize_row_q4_0, like the ARM NEON and AVX-256 versions, written with VMX/VSX intrinsics. It's even a little better, because we were able to use our VMX floating-point multiply-add and remove a couple of minor inefficiencies in the code. Finally, people used to G4- and G5-era AltiVec will enjoy the fact that the newer intrinsics map substantially onto ARM's — I especially liked vec_extract as an all-purpose replacement for all of the NEON vget_lane_* variations, vec_signed for vcvtq_s32_f32 for converting floats in place, and the all-purpose simplified vec_splats for making a splat vector out of anything — which makes conversion much more straightforward when you need to write your own code.

I did play with alpaca.cpp, the other older white meat, and the changes here should more or less apply to that codebase as well. However, given how quickly llama.cpp evolves and the greater development interest around it, it seems the better bet for continued evolution.

I will say, in the spirit of full disclosure, that despite these improvements my 16GB 4P/4E/8G M1 MacBook Air still pops out tokens several times faster than this 64GB dual-8 Talos II, even full-tilt with all 64 threads in use (the cat still looks startled every time the fans rev). On the other hand, we're comparing a 2017 CPU against one from 2020 with specific hardware acceleration for neural networks that llama.cpp takes particular advantage of. Even with Power10's improved bfloat16 support and matrix math operations, specific work would be needed to support those features, and it won't be coming from me (stay tuned for Power11, I guess). There are other opportunities for vectorization, though at the rate this code base evolves it would be better to wait for one of the mainstream architectures to pick up a SIMD version we can convert first. In the meantime, be advised that going beyond the 7B or 13B models will require patience regardless of how much RAM you have, but I think this is definitely better than what we started with.

GlobalFoundries stops all 7nm development


As reported in AnandTech, GlobalFoundries, which includes the former chip manufacturing foundries of AMD and, notably for our reporting, IBM, has scuttled their 7nm process roadmap. Instead, the company will be concentrating on their 14nm and 12nm FinFET technology, including the 14nm FinFET process GlobalFoundries uses to manufacture the POWER9.

The POWER10, scheduled for 2020 in the IBM product roadmap, is supposedly being designed on a 10nm process. Assuming IBM doesn't redesign the POWER10 for 12nm, their other options are Intel (unlikely), TSMC or Samsung, all of whom have 10nm processes. POWER11 was planned for 7nm, but has no timeframe. Meanwhile, in a possible sign of what's to come, AMD has moved to TSMC.

First POWER10 machine announced


IBM turns up the volume to 10 (and their server numbers to four digits) with the Power E1080 server, the launch system for POWER10. POWER10 is a 7nm chip fabbed by Samsung with up to 15 SMT-8 cores (a 16th core is disabled for yield) for up to 120 threads per chip. IBM bills POWER10 as having 2.5 times more performance per core than Intel Xeon Platinum (based on an HPE Superdome system running Xeon Platinum 8380H parts), 2.5 times the AES crypto performance per core of POWER9 (no doubt due to quadruple the crypto engines present), five times "AI inferencing per socket" (whatever that means) over Power E980 via the POWER10's matrix math and AI accelerators, and 33% less power usage than the E980 for the same workload. AIX, Linux and IBM i are all supported.

IBM targets its launch hardware at its big institutional customers, and true to form the E1080 can scale up to four nodes, each with four processors, for a capacity of 240 cores (that's 1,920 hardware threads for those of you keeping score at home). The datasheet lists 10, 12 and 15 core parts as available, with asymmetric 48/32K L1 and 2MB of L2 cache per core. Chips are divided into two hemispheres (the 15-core version has 8- and 7-core hemispheres) sharing a pool of 8MB of L3 cache per core per side, so the largest 15-core part has 120MB of L3 cache split into shared 64MB and 56MB pools respectively. This is somewhat different from POWER9, which divvies up L3 per two-core slice (but recall that the lowest binned 4- and 8-core parts, like the ones in most Raptor systems, fuse off the other cores in a slice such that each active core gets the L3 all to itself). Compared with Telum's virtual L3 approach, POWER10's cache strategy seems like an interim step to what we suspect POWER11 might have.

I/O doesn't disappoint, as you would expect. Each node has 8 PCIe Gen5 slots on board and can add up to four expansion drawers, each contributing another twelve slots. You do the math for a full four-node behemoth (all right, fine: 56 slots per node, 224 in all).

However, memory and especially OMI is what we've been watching most closely with POWER10 because OMI DIMMs have closed-source firmware. Unlike the DDIMMs announced at the 2019 OpenPOWER Summit, the E1080 datasheet specifies buffered DDR4 CDIMMs. This appears to be simply a different form factor; the datasheet intro blurb indicates they are also OMI-based. Each 4-processor node can hold 16TB of RAM for 64TB in the largest 16-socket configuration. IBM lists no directly-attached RAM option currently.

IBM is taking orders now and shipments are expected to begin before the end of September. Now that POWER10 is actually a physical product, let's hope there's news on the horizon about a truly open Open Memory Interface in the meantime. Just keep in mind that if you have to ask how much this machine costs you clearly can't afford it, and IBM doesn't do retail sales anyway.