Cache splash in Telum means seventh heaven for POWER11?


AnandTech has a great analysis of IBM's new z/Architecture mainframe processor Telum, the successor to z15 (so you could consider it the "z16" if you like) scheduled for 2022. The most noteworthy part of that article is Telum's unusual approach to cache.

Most conventional CPUs (keeping in mind mainframes are hardly conventional, at least in terms of system design), including OpenPOWER chips, have multiple levels of cache; so did z15. L1 cache (divided into instruction and data) is private to the core and closest to it, usually measured in double-digit kilobytes on contemporary designs. It then fans out into L2, which is also usually private to an individual core and in the triple-digit kilobyte range, and then some level of L3 (plus even L4) cache which is often shared by an entire processor and measured in megabytes. Cache size and how cache entries may be placed (i.e., associativity) are a tradeoff between the latency of searching a larger cache, die space considerations and power usage on one side, and the performance advantages of fewer cache misses and reduced use of slower main memory on the other.

While every design has some amount of L1, there certainly have been processors that dispensed with other tiers of cache. Most of Hewlett-Packard's late lamented PA-RISC architecture had no L2 cache at all, with the L1 cache being unusually large in some units (the 1997 PA-8200 had 4MB of total L1, 2MB each for data and instructions). Closer to home, the PowerPC 970 "G5" (derived from the POWER4) carried no L3; the 2005 dual-core 970MP, used in the Power Mac G5 Quad, IBM POWER 185 and YDL PowerStation, instead had 1MB of L2 per core which was on the large side for that era. Conversely, the Intel Itanium 2 could have up to 64MB of L4 cache; Haswell CPUs with GT3e Iris Pro Graphics can use the integrated GPU's eDRAM as a victim cache for L3, which serves the same purpose as an L4, though this feature was removed in Skylake. However, the Sforza POWER9 in Raptor workstations is more typical of modern chips with three levels of cache: the dual-8 02CY649 in this machine I'm typing on has 32/32KB L1, 512KB L2 and 10MB L3 for each of the eight CPU cores. In contrast, AMD Zen 3 uses a shared 32MB L3 between up to eight cores, with fewer cores splitting the pot in more upmarket parts.

With money and power consumption being less of an object in mainframes, however, large multi-level caches rule the day. The IBM z15 processor "drawer" (there are five drawers in a typical system) divides itself into four Compute Processors, each CP containing 12 cores with 128/128K L1 (compare to Apple M1 with 192/128K) and split 4MB/4MB L2 per core paired with 256MB of shared L3, overseen by a single System Controller which provides a whopping 960MB of shared L4. This gives it the kind of throughput and redundancy expected by IBM's large institutional customers who depend on transaction processing reliability. The SC services the four CPs almost like an old-school northbridge, but to L4 cache instead of main RAM.

Telum could have doubled down on this the way z15 literally doubled down on z14 (twice the L3, nearly half again as much L4), but instead it dispenses with L3 and L4 altogether. L1 jumps to 256/256K, and in shades of PA-RISC L2 balloons to 32MB per core, with eight cores per chip. Let's zoom in on the die.

The 7nm 530mm² die shows the L2 cache in the centre of the eight cores, which is already a tipoff as to how IBM's arranged it: cores can reach into other cores' cache. If a cache line gets evicted from a core's L2 and the core can find space for it within another core, then the cache line goes to that core's L2, and is marked as L3. This process is not free: retrieving an L3 line stored in another core incurs more latency than a traditional L3 would. But the ample L2 makes this condition less frequent, and in the less common case where a core needs data that some other core has already evicted into its L2 as L3, it can just adopt the line. Overall, this strategy means better cache utilization that adapts to more diverse workloads, because the large total L2 space can be flexibly redirected as "virtual L3" to cores with greater bandwidth demands.
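To make the eviction flow concrete, here is a toy model of it in Python. It is only a sketch under loud assumptions: the `Core` and `Chip` classes, the `access` method, the LRU replacement policy and the tiny capacities are all invented for illustration, and it ignores coherence and latency entirely. It is not IBM's actual policy, just the behaviour described above.

```python
from collections import OrderedDict


class Core:
    """Hypothetical core with a tiny L2, measured in lines rather than MB."""

    def __init__(self, capacity):
        self.capacity = capacity
        # address -> "L2" (own line) or "L3" (hosted for another core), LRU first
        self.lines = OrderedDict()


class Chip:
    def __init__(self, n_cores=8, capacity=4):
        self.cores = [Core(capacity) for _ in range(n_cores)]

    def access(self, core_id, addr):
        """Return where the line was found: 'L2', 'virtual L3' or 'memory'."""
        core = self.cores[core_id]
        if addr in core.lines:
            # Own line, or a line another core parked here as L3: adopt it.
            core.lines[addr] = "L2"
            core.lines.move_to_end(addr)
            return "L2"
        for other in self.cores:
            if other is not core and other.lines.get(addr) == "L3":
                del other.lines[addr]  # pull the line back from its host
                self._install(core, addr)
                return "virtual L3"
        self._install(core, addr)
        return "memory"

    def _install(self, core, addr):
        if len(core.lines) >= core.capacity:
            victim, tag = core.lines.popitem(last=False)  # evict LRU line
            if tag == "L2":
                self._spill(core, victim)  # hosted L3 lines are just dropped
        core.lines[addr] = "L2"

    def _spill(self, source, addr):
        # Park the evicted line in the first other core with free space.
        for other in self.cores:
            if other is not source and len(other.lines) < other.capacity:
                other.lines[addr] = "L3"
                return


def demo():
    chip = Chip(n_cores=2, capacity=2)
    trace = [(0, "A"), (0, "B"), (0, "C"), (0, "A"), (1, "B")]
    return [chip.access(core_id, addr) for core_id, addr in trace]


print(demo())
```

In the demo, core 0 overflows its two-line L2, so line A is parked in core 1 as L3 and later retrieved from there; line B is then parked in core 1 in turn, and when core 1 itself asks for B it simply adopts it.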

It doesn't stop there, though, because Telum has another trick for "virtual L4." Recall that the z15 uses five drawers in a typical system; each drawer has an SC that maintains the L4 cache. Telum is two chips to a package, with four packages to a unit (the equivalent of a z15 "drawer") and four units to a system. If you can reach into other cores' L2 to use them as L3, then it's a simple conceptual leap to reach into other chips (even in different units) and use their L2 as L4. Again, latency is higher than with a more traditional L4 approach, but this means theoretically a typical Telum system has a total of 8GB that could be redirected as L4 (7936MB, if you don't count the 256MB of L2 on the requesting chip itself). With 256 cores in this system, there's bound to be room somewhere faster than main memory.
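The arithmetic behind those totals is easy to check. The sketch below uses only the figures quoted in this post; the function and key names are made up for illustration.

```python
def telum_cache_totals():
    """Recompute the cache totals from the figures quoted in the post."""
    l2_per_core_mb = 32
    cores_per_chip = 8
    chips_per_package = 2
    packages_per_unit = 4
    units_per_system = 4

    chips = chips_per_package * packages_per_unit * units_per_system
    cores = chips * cores_per_chip
    l2_per_chip_mb = l2_per_core_mb * cores_per_chip
    total_l2_mb = cores * l2_per_core_mb

    return {
        "chips": chips,
        "cores": cores,
        "l2_per_chip_mb": l2_per_chip_mb,  # the pool usable as "virtual L3"
        "total_l2_mb": total_l2_mb,
        # Everything outside the requesting chip is potential "virtual L4".
        "virtual_l4_mb": total_l2_mb - l2_per_chip_mb,
        # z15 for comparison: five drawers, 960MB of L4 per System Controller.
        "z15_l4_mb": 5 * 960,
    }


for name, value in telum_cache_totals().items():
    print(f"{name}: {value}")
```

Note that 7936MB is what results from setting aside one whole chip's 256MB of L2; setting aside only a single core's 32MB would leave 8160MB. Either way, it is well over the 4800MB of dedicated L4 in a five-drawer z15.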

What makes this interesting for OpenPOWER is that z/Architecture and POWER naturally tend to cross-pollinate. (History favours POWER, too. POWER chips already took over IBM i, first with the RS64-based A35 and finally with the eCLipz project; IBM AS/400 a/k/a i5/OS a/k/a i hardware used to be its own bespoke architecture.) z/Architecture is decidedly not Power ISA, but some microarchitectural features are sometimes shared, as with POWER6 and z10, which emerged from a common development process and as a result had similar fabrication technologies, execution units, floating-point units, busses and pipelines.

POWER10 is almost certainly already taped out if IBM is going to be anywhere close to a Q4 2021 release, so whatever influence Telum had on its creation has already happened. But Telum at the microarchitecture level sure looks more like POWER than z15 did: there is no more CP/SC division but rather general purpose cores in a NUMA topology more like POWER9, more typical PCIe controllers (in this case PCIe 5.0) for I/O and more reliance on specialized pack-in accelerators (Telum's headline feature is an AI accelerator for SIMD, matrix math and fast activation function computation; no doubt some of its design started with POWER10's own accelerator). Frankly, that reads like a recipe for POWER11. While a dual-CPU POWER11 workstation might not have much need for L4, the "virtual L3" strategy could really pay off for the variety of workloads workstations and non-mainframe servers have to do, and on a four or eight-socket server, the availability of virtual L4 starts outweighing any disadvantage in latency.

The commonalities should not be overstated, as Telum is also "only" SMT-2 (versus SMT-4 or SMT-8 for POWER9 and POWER10) and the deep 5GHz-plus pipeline the reduced SMT count facilitates doesn't match up with the shorter pipeline and lower clockspeeds on current POWER generations. But that's just part of the chips being customized for their respective markets, and if IBM can pull this trick off for z/Architecture it's a short jump to making the technology work on POWER. Assuming we don't have OMI to worry about by then, that could really be something to look forward to in future processor generations, and a genuinely unique advance for the architecture.

Comments

  1. FYI, the "per core" cache count for POWER9 listed on Wikipedia seems to be based on a misreading of an article covering the POWER9 announcement. As described on Raptor's wiki and in the POWER9 Processor User's Manual, there is 512KB of L2 and 10MB of L3 per *slice*, where a slice is either two SMT4 cores or one SMT8 core depending on the part.

    1. Even worse, the Wikipedia page is using a definition of "slice" from that same article that refers to sub-core execution units; those should be qualified as "execution slices" while the core-pair unit is a "cache slice" according to the manual. (There's a number of different kinds of "slices" in there.) That makes the mistakes on Wikipedia difficult to correct quickly -_-

    2. This is a good point to make, which is why I was careful to state this particular Sforza part. Each of the slices in the 4/8 core parts has one core fused off, so the remaining core gets everything. But you're right to point out this doesn't hold for higher core counts.

    3. Right, I forgot about how binning makes it more complicated than that. Is the "one core to a slice" rule consistent for the 4/8-core parts?

      Anyway, thanks for the nice blog as always :)

    4. That's my understanding, though I was "told this" and I don't have a nice spec sheet or anything laying it out. It does match what I observe on the 4- and 8-core parts I have in the three machines here (2 T2s, 1 Blackbird).

    5. If you check all 4-core & 8-core "OpenPOWER" variants of POWER9, I expect you will find that all are 10MB-L3 per core.

      Also, if you check memory latencies very carefully on P9, do you find it behaves similarly to Telum? E.g., does an 8-core POWER9 with 7 cores idle offer the active core ~70MB of "virtual L4"?

