Showing posts from January, 2020

Bonuses for big-endian

As I recover from the flu and from my apoplexy over a local plumber who has stood me up for four days in a row, there's at least some good news for those of you who like big ends and cannot lie.

First, Void Linux for PowerPC reports that all available 64-bit big-endian (ppc64) and 32-bit PowerPC packages have been built, bringing them to near-parity with ppc64le. Of all total Void packages available, 32-bit ppc-musl has the lowest percentage of buildable packages, but even that is an impressively robust 88.28%. 32-bit ppc with glibc is at 88.79%, ppc64-musl at 90.15% and ppc64 at 90.49% (compare to 64-bit little-endian musl at 93.76% and glibc at 94.64%). As many 32-bit PowerPC users, particularly Power Mac owners, are looking for current supported browser options, Firefox isn't available on 32-bit (due to an xptcall issue), but WebKit and various derivatives (such as Midori and various other shells) are as well as NetSurf. (G5 owners can run Firefox on ppc64.) Chromium isn't available on any of the ports, but this seems fitting.

Secondly, someone pointed me at a distribution I hadn't encountered before called PowerEL, from VanTosh. PowerEL seems to be ultimately a Red Hat descendant and states it is ABI compatible with RHEL, but claims it derives other attributes from other distributions and has a list of repos you can browse. What I found most notable is that it offers support for big-endian POWER8 and up as well as little-endian POWER8/9 and Intel x86_64 with AVX. Note that I haven't tried booting this on a Talos II or Blackbird and it isn't clear to me how workstation-suitable it is; additionally, the internal version number suggests it's derived from RHEL 7 (not the current 8) which possibly explains why it still has big-endian support. If you use PowerEL or have tried it, post your impressions in the comments.

CentOS updated to 8.1.1911

CentOS has been updated to 8.1.1911, the distro for those of you who want Red Hat Enterprise Linux but don't want to pay for it. While not linked from the main AltArch Downloads page, ISOs that should be compatible with all OpenPOWER systems are available from the various mirrors.

Alternatively, CentOS Stream (the Freshmaker!) is also available for ppc64le.

Another Amiga you don't want

The Amigaphile community is possibly more rabid about its ecosystem than even us OpenPOWER dweebs, so right off I present my Amiga bona fides before getting stuck into it: in this house is a Amiga Technologies '060 A4000T running AmigaOS 3.9, an Amiga 3000 (with the tape drive, so I might even try Amix on it someday), and several A500s.

My disenchantment with the current crop of PowerPC-based Amigas, however, is well-known. At their various asking prices you got embedded-class CPUs at mid-high desktop level prices with performance less than a generation-old Power Mac. (Many used and easily available models of which, I might add, can run MorphOS for considerably less cash out of pocket.) A-EON's AmigaOne X5000 in particular attracts my ire for actually running a CPU that doesn't have AltiVec to replace the X1000, which did. Part of this problem was the loss of availability of the PA6T-1682M CPU, the one and only CPU P. A. Semi ever officially produced under that corporate name before Apple bought them, but there were plenty better choices than the NXP P5020 which wasn't ever really intended for this application.

Since 2014 or thereabouts a "small" Amiga (presumably along the lines of the earlier Sam440ep and Sam460ex family, which used PowerPC 400-series CPUs) was allegedly in development by A-EON and unveiled in 2015 as the original A1222 "Tabor" board. At that time it was specced as a mini-ITX board with 8GB of RAM and most controversially another grossly underpowered embedded core, a QorIQ P1022 at 1.2GHz. Based on the 32-bit e500v2 microarchitecture, it not only also lacks AltiVec but even a conventional FPU, something that for many years was not only standard on PowerPC CPUs but considered one of its particular strengths. As an embedded CPU, this was forgivable. As an embedded CPU in a modern general-purpose computing environment, however, this was unconscionable. Even by Amiga standards this was an odd choice and one with potential impacts for compatibility, and one a G5 or high-end Power Mac G4 would mop the floor with.

The original Tabor was apparently manufactured in a small batch of 50 for testers. Between then and 2018 it's not clear what happened on A-EON's end, but their hardware partner had a change of management and various other components needed updating. Strangely, the CPU was not among them. Today, six years later, people may now publicly pre-order the "A1222 Plus," with no change in specs other than changing some board components, and expected to ship Q2 2020. If you're lucky and they're not sold out, your $128 will get you a AAA Bundle Package with some software and a certificate in a pretty box as a 20 percent down payment on the "special low" $450 purchase price. If you get your $128 in after they're gone, then it's just a deposit. The AAA Bundle Package was supposed to be available December 24 but the website wasn't even up until January 11.

I'm willing to make some allowances for a cheap(er) modern Amiga because like our OpenPOWER systems, getting the price down some expands the market (ergo, Blackbird). However, that $450 price is almost certainly not the intended MSRP (nor has it been reported what that eventual MSRP is), and for low production volumes the CPU is not the major cost of a system. I don't know what actual considerations went into its design, but if A-EON chose this CPU deliberately as an attempt to make the price of the system lower, they chose wrong.

I certainly don't want to pick on other boutique systems unnecessarily. After all, in many people's minds the Talos II I'm typing on is itself a pretty darn boutique system, and one where the out-the-door price can easily eclipse even the priciest AmigaOne configuration. If people want to have their fun and pay a crapload of money for it even if other people think it's junk, then as long as it's not hurting anyone else praise the Lord and pass the credit card.

However, this is not that situation. Amiga has a long history in computing consciousness and most computer nerds would at least recognize the brand. Similarly, people still remember Power Macs, and while recollections vary in accuracy and fondness, there's a rather pernicious and commonly-held belief that Apple's migration to Intel somehow "proved" the inferiority of Power ISA. Today, along comes yet another underwhelming PowerPC-based Amiga to confirm their preconceived notions: because it's Amiga, it sticks in people's minds, and because it confirms their own beliefs about PowerPC, it reinforces their unjustified biases. Regardless of the fact that POWER9 systems like this one outpace ARM and RISC-V and compete with even high-end x86_64 offerings, the presence of Tabor and X5000 et al simply gives the resolutely underinformed yet another stick to beat Power desktops with. These Amigas suck compared to PCs, they reason, so OpenPOWER must suck too.

At the very least modern Amiga systems need to beat these decade-plus-old Power Macs to be even vaguely taken seriously. If they must go embedded then at least an e6500-series processor would do better than anything that they're running, and while I certainly do appreciate the considerable AmigaOS porting work that would be necessary, going with a minimally-redesigned reference design board for a POWER8 or POWER9 in big-endian mode could still free up development resources for such a task that would otherwise be sunk into yet another quixotic product. As it is, these Amiga systems don't do the Amiga community any favours, and from this outsider's view their performance even seems to be going backwards. If cost were the main consideration, there are other ways of dealing with it, and it's a certainty that Tabor's eventual price will be much greater than $450. Given all that, plus the protracted gestation time for "Tabor Plus," don't the Amiga owners who would be willing to buy such a machine deserve something with a little more zip?

The Amiga community needs new hardware, to be sure, and A-EON has at least filled the business opportunity however slow and abortive its progress in doing so; whatever small number they end up producing this time I imagine they'll eventually sell. That doesn't change the fact that weaksauce systems like this being sold as desktop machines continues to tar the architecture as underpowered and does nothing to expand the market beyond the shrinking circle of faithfuls. When there exist today at this very moment Power ISA workstations that can be every bit as utilitarian and functional as Intel workstations, the last thing any of our two communities need is yet another Amiga that people don't want.

Fedora Workstation Live ISOs up (and hopefully sticking)

After a little glitch getting them off the ground, Fedora Workstation Live ISOs are now a thing for ppc64le, and are exactly what they say on the tin: a try-before-you-metaphorically-buy image to check things out before you install. A bootable disk image of Rawhide (what will currently become Fedora 32) is ready now for your entertainment and possible utility, and they should turn up as a standard download presumably when F32 goes live late April.

Firefox 72 on POWER

Firefox 72 builds out of the box and uneventfully on OpenPOWER. The marquee feature this time around is picture-in-picture, which is now supported in Linux and works just fine for playing Trooper Clerks ("salsa shark! we're gonna need a bigger boat!"). The blocking of fingerprinting scripts should also be very helpful since it will reduce the amount of useless snitchy JavaScript that gets executed. The irony of that statement on a Blogger site is not lost on me, by the way.

The bug that mashed Firefox 71 (ultimately fallout from bug 1601707 and its many dupes) did not get fixed in time for Firefox 72 and turned out to be a compiler issue. The lifetime change that the code in question relies upon is in Clang 7 and up, but unless you are using a pre-release build this fix is not (yet) in any official release of gcc 9 or 10. As Clang is currently unable to completely build the browser on ppc64le, if your extensions are affected (mine aren't) you may want to add this patch which was also landed on the beta release channel for Firefox 73.

The debug and opt configurations are, again, otherwise unchanged from Firefox 67.

DOSBox JIT on ppc64le (and how you can write your own)

Apparently the quickest way to make software moar faster is to turn it into a tiny compiler and lots of things are doing it. As I get time between iterations of TenFourFox and smoke-testing Firefox builds on ppc64le, slow work on the Firefox JIT continues, but that doesn't mean we can't be JITting all the other things in the meantime.

One of my favourite games is LucasArts' Dark Forces, an FPS set in the Star Wars universe (but now apparently non-canon after the events of Rogue One). Although projects like XLEngine can run it, that, too, requires a code generator to be written (because of AngelScript). I decided that if I had to write a backend after all, a better approach would be to add a backend to DOSBox, the famous DOS emulator. That would happily run the copy of PC Dark Forces I already have and any of my other old DOS games like Extreme Pinball and Pinball Illusions and Death Rally and all those other great titles in my office besides. (The classic Mac version of Dark Forces is better, by the way, not least of which because of its beautiful high-resolution graphics that are double those of the PC release.)

Fortunately, a 32-bit big-endian PowerPC version of DOSBox already existed as unofficial patches (play it on your old Power Mac), which took only a few days for me to convert to 64-bit little-endian. While DOSBox in strictly interpreted mode on the Talos II is no slouch, this JIT, which is for DOSBox's dynamic recompiling "dynrec" core, increases performance roughly by a factor of four. This makes even the most demanding games playable and makes most other games run like butter (in fact, it's so fast it even destabilizes some timing loops, like the credits scroller in Descent). If I could shoot and take screenshots at the same time, you'd see me do better at blowing away Imperial officers, too.

You can build this yourself. Download the patch and backend, which we will call ppc64le_dosbox.diff and risc_ppc64le.h. This patch is intended to apply against SVN trunk.

svn checkout svn:// dosbox-code-0
cd dosbox-code-0
patch -p0 < ../ppc64le_dosbox.diff
cp ../risc_ppc64le.h src/cpu/core_dynrec
./configure CFLAGS="-O3 -mcpu=power9" CXXFLAGS="-O3 -mcpu=power9"
make -j24 (or as you like)

Copy over your favourite games or install media, and src/dosbox to start your fast-powered DOS machine. Or try this benchmark pack and see for yourself. (I like the PC Player one.) See the DOSBox documentation and DOSBox wiki for more.

But let's say you'd like to work on a JIT backend of your very own for some other ppc64le port. This is hardly the place for a tutorial on writing ppc64 assembly language -- you can read IBM's -- but I will talk about how to get generated code into memory and how to execute it, and how this might differ from x86_64 or ARM. The following examples should run on any 64-bit Power CPU from the 970/G5 to the POWER9 regardless of pagesize or endianness under recent Linux kernel versions, but they're tested on my ppc64le Talos II in Fedora 31, of course.

Being able to emit generated code to memory and run it is actually a rather big security hole unless it is done carefully and correctly. Indeed, SELinux will halt you right in your tracks (as it should) unless you do the correct mating dance. The basic steps are:

  1. Allocate writeable memory.
  2. Emit machine code to that memory.
  3. Flush the caches.
  4. Make that tract of memory executable. (This is the dangerous bit. We'll talk about how to mitigate it.)
  5. Run the now executable code.
  6. Profit!

For steps 1 (and, indirectly, step 4), you need to know what the OS believes the memory page size is. (On Fedora 31, this Talos II has a pagesize of 64K. Some Linuces like Adelie use 4K pages.) For step 3, you need to know how large a cache line on your CPU is (for all current 64-bit Power ISA processors, this number is 128 bytes). We will handwave away these a bit here by keeping the example code at or less than a cache line's length, and reading the OS's page size ourselves.

Let's look at the first example. You'll notice a couple blocks of code commented out. If you are not using SELinux for whatever reason, you may be able to get away with posix_memalign() and mprotect() to allocate your memory. However, if you use this on an SELinux system (Red Hat, recent Debians, etc.), you will have to modify your policy or temporarily disable some protection features to run that code.

The better way is to create the tract of memory as an in-RAM file and use mmap() to make it writeable. This only works in integral page numbers, hence the need to know your page size (we ask the kernel via syscall). You may be able to use the second commented block to call memfd_create() directly, but the syscall approach I've used here doesn't require setting _GNU_SOURCE. Once we have mmap()ed the memory, we can write to it. We do so as simply an array of 32-bit integers (even 64-bit Power still has 32-bit opcodes).

I've ripped off the assembler macros from DOSBox's backend because they match up nicely with the numbers in the Power ISA book and pretty much any Power assembly language reference. Our first example runs a very simple program:

li 3,2020

This code loads the integer 2020 into r3, the first argument register and the standard return register in the ELFv2 ABI (and indeed for any PowerPC using either PowerOpen ABI like AIX and Mac OS X, or SysV ABIs like Linux or the BSDs). It then branches back to the return address in the link register, terminating the program. We emit this code to the mmap()ed in-memory file.

The Linux kernel flushes the data and instruction caches of the processor separately, so we do the same. For every cache line that needs invalidation, we use the dcbst instruction to invalidate the data cache line in question, and once they are all flushed, we use the sync instruction so that each CPU's view of memory is consistent. Then we flush the instruction cache in the same way, using the icbi instruction and finally the isync instruction to ensure consistency of the I-cache this time. Because this all fits into a single cache line we just do each instruction once.

As our last step, we do another mmap() to make the tract of memory executable. Since we are using a named file to store our code rather than just memory we managed to grab, SELinux does not block it the same way it would ordinarily block an anonymous allocation. The second mmap() adds further security, because we are not making the memory executable until the program is fully assembled in RAM, and while the two mmap()s are linked and reference the same memory area as far as our program is concerned, assuming randomization is in effect an attacker would now have to derive two completely random memory addresses to do any funny business. We then execute the code as if were a C function (more on this in a moment).

Run the code like this:

% gcc -o jitlab1 jitlab1.c
% ./jitlab1
jitcode at 0x7fffac650000
result = 2020

Excellent! Note that the actual address of the jitcode may vary from run to run.

However, an example this trivial can only be this trivial because it didn't need to pull and manage a stack frame or maintain all but the most token adherence to the C ABI. We need a stack frame if we modify, or might modify, any non-volatile register or the link register (i.e., make any calls to other subroutines). Frankly, a JIT isn't much good if it can't call into its host somehow, either to run higher level operations or exchange data, so for all practical purposes you'll probably want to pull a stack frame in your code and be ABI compliant.

With that in mind, let's turn to our second contrived example. This slightly more complex demonstration will receive an integer value which it will perform basic math on (add 2020 to it), then call a function to display that result, double the result, and return that. As we are doing very little actual work in this JIT code (essentially messing with r3 as both the in and out argument, which is volatile, so it need not be saved in the stack frame we created), that makes control flow simpler. The function prologue, which is broken up a bit for performance reasons, can be as simple as stdu 1,-size(1):mflr 0:std 0,size+16(1), where size is the desired size of the stack frame (we use 256 here just for laziness).

Having pulled the frame and computed the first value, we now want to call the display routine to show our work. A minor complication is that there is no PowerPC/Power ISA instruction to directly branch to an address in a regular general purpose register. Instead, the ISA only supports indirect branching to either the address in the link register "LR" (blr) or the counter register "CTR" (bctr), both of which are special-purpose registers. As a practical measure, for branching other than returns from a routine, we prefer the counter register which does not need to be saved.

A more significant complication is setting up the actual function call itself. Without getting too deep into the weeds, 64-bit ELFv2 compliant functions can have two entry points, called the "global entry point" and the "local entry point." The global entry point is called when r2, the register customarily used for global symbol access through the Table of Contents, must be computed. The TOC, a holdover from PowerOpen systems, contains a conventional ELF global offset table "GOT" and optionally a small data section. The global entry point invariably consists of two instructions that compute r2 from the address of the routine itself, which is conventionally maintained in r12, and the local entry point follows at eight bytes after the global entry point. This scheme facilitates position-independent code generation as the compiler can emit code referencing globals as relative indexes on the TOC base stored in r2.

At link time the linker looks at branches and determines whether the caller function and callee function will share the same TOC. If they don't, then the linker points the branch at the global entry point either directly or through a procedure linkage table "PLT" stub. If they do, however, then this call is considered "local" (from the perspective of the global context), and the linker calls the local entry point instead. However, if it turns out the callee actually does no global access, the compiler generates only a single entry point because there is no need to compute r2, and the linker calls that.

The linker can juggle this because it has time to burn and the breadth of the codebase to scan. However, in our example here, we are the linker, and we have much less visibility into the codebase we're trying to call. Think of it as working in the basement with a flashlight trying not to walk into the walls and only a limited time to finish our job.

A hard branch (using b or bl) will always be faster, even if imperceptibly, and is less susceptible to Spectre-style attacks. However, if we actually turn out to be calling a function's global entry point and r12 is not set correctly, then r2 will also not be set correctly, and global access will either fault or just be plain wrong. If we end up doing all this computation, we might as well just branch to r12 via CTR, which is slower by a minor degree but will always work. Plus, if we're able, whatever speed hit is incurred can be mitigated by hoisting the mtctr up a few instructions ahead of the bctr or bctrl that uses it. Either way, whether we hit the global entry, the local entry or a non-global function, everything will be ABI compliant.

That does not mean you can never branch directly. Directly branching to a function with b or bl will work if you are calling a pure local function that accesses no globals, or you are calling the local entry point and nothing has modified r2 (though this is an awfully big gamble in complex codebases), or it's a function you generated (i.e., another JIT function) that you can guarantee to be purely local because it never touches r2. Or you can just set r12 and directly branch, I suppose.

The last consideration you need to keep in mind with JIT function calls is whether they need to be patchable. If they need to be patched or redirected as code addresses change, then you need to have a fixed-size branch stanza that can be changed on the fly. The minimum size of a branch stanza in 64-bit Power ISA is seven instructions (28 bytes) because we may call a 64-bit address and we need four immediate loads to compose it plus a rotation step. For example, to call a routine at 0x1234567876543210, the branch stanza looks like this:

lis 12,0x1234 (of course actually addis 12,0,0x1234)
ori 12,12,0x5678
rldicr 12,12,32,31
oris 12,12,0x7654
ori 12,12,0x3210
mtctr 12

This yields a full 64-bit quantity, 0x1234567876543210. We can then mtctr 12 and bctrl to call the routine and come back to the generated code. Repatching this stanza is "merely" a matter of changing the bottom 16 bits of the four immediate loads and flushing the caches. You can have a direct branch in a stanza if the location of the routine is within the available displacement for those instructions, but then everything else should be nops, e.g.,

bl 0x1234567876543210
(ori 0,0,0)

so that if the new address no longer fits into the branch instruction's displacement, you can rewrite it as a full stanza. If you choose to set r12 and end in a hard-coded branch at the same time, remember that you'll need to repatch both things if the address changes, which might make your code generator a bit more hairy. I consider the aggressive promotion of branch stanzas to hard branches to be a form of premature optimization and you shouldn't be doing this until you are sure the rest of your code generator is working correctly.

This gets even more complex, by the way, if your JIT code must itself access globals; in that case you may need to save r2 yourself and/or do additional linkage work. I should note that in the JITs I've personally written this was never necessary. Let the C host code handle it and debugging your generated code will then be much less of a hassle.

Returning to our example, when the function call returns to the JIT code it finishes its "work," dismantles the stack frame, recovers the previous link register value and branches back to it to exit. The result remains in r3. We use decimal 1111 as our passed value in this example, so the resulting values should be 3131 (1111+2020) and 6262 (3131+3131). Since our intermediate print function helpfully returns the value it was called with, in our example here we don't need to worry about stashing it anywhere. You may not be so lucky calling someone else's work.

% gcc -o jitlab2 jitlab2.c
% ./jitlab2
jitcode at 0x7fff99dc0000
called with 3131
result = 6262

These were obviously toy examples generating small blebs of code. For code blocks greater than one cache line in size, which you will almost certainly generate, you need to dcbst (or similar) and icbi each cache line in whatever setup function gets called to flush the cache. The Z constraint we use here can make this very easy by having gcc do the work of setting up the register for each address for you. See the cache_block_closing() function in the DOSBox backend for a real-world example. In a like fashion, if your code spans multiple memory pages, with the method we have used here you will need to make each memory page writeable, and then ultimately executable, in turn.

You should also get used to tossing trap (tw 31,0,0) instructions in code blocks you're not sure about. This lets you use the debugger to verify that you actually assembled the opcodes in memory correctly and allows you to trap at the point you believe might be problematic and single-step from there. Oddly, in Fedora 31, when a trap instruction is hit gdb does pause but doesn't register a SIGTRAP (whereas lldb does seem to trap correctly, but I'm much more used to gdb personally). In this case, when the trap is sprung in gdb, you would need to press CTRL-C and induce an interrupt when the trap is hit to actually get into the debugger. Debugging JITs can be a real nightmare especially when the bug is very subtle, so only write the minimum you need to get the JIT off the ground, use traps liberally during building and never optimize prematurely. In particular ensure that your actual assembler steps that write instructions to memory are doing so accurately, as this can yield some rather humiliating issues later if you hit edge cases.

Let's make all the ppc64le things faster!

Void Linux ppc64le packages at 100%

Happy New Year! The Power ISA port of Void Linux (currently supporting 32-bit PowerPC and 64-bit Power ISA in both big-endian and little-endian with musl or glibc) has hit an important milestone, at least for the musl and glibc flavours of ppc64le: 100% of the Void repo packages are either built or blacklisted. As of this writing, that means almost 95% of the Void repo runs on ppc64le-glibc and almost 94% on ppc64le-musl. This includes Firefox and Epiphany, but not Chromium. This is great news for folks who want another systemd-free alternative.

The numbers quickly fall off for big-endian systems, though. Both ppc64 and regular 32-bit ppc are building about 54% of the repo for either libc flavour. Notably, Firefox is available for big-endian ppc64, but not for 32-bit PowerPC. However, Epiphany and Midori are available all the way down to your beloved G3 or G4.