Sometimes it's necessary: running x86_64 binaries on the Talos II


Yes, it's gross, but sometimes it's necessary. There's a lot of software for Intel processors, and a lot of it can't be recompiled, so once in a while you've got to get a little dirty to run what you need to.

In prior articles we've used QEMU with KVMPPC to emulate virtual Power Macs and IBM pSeries, but this time around we'll exclusively use its TCG JITted software CPU emulation to run x86_64 programs and evaluate their performance on the Talos II. For this entry we will be using QEMU 3.0, compiled from source with gcc -O3 -mcpu=power9. Make sure you have built it with (at least) x86_64-linux-user,x86_64-softmmu in your --target-list for these examples, or if using your distro's package, you'll need the qemu-x86_64 and qemu-system-x86_64 binaries.
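If you are relying on packaged binaries, a quick check that both emulators are on your PATH might look like the sketch below. The helper name missing_tools is made up for illustration; only the two binary names come from the text above.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper name, not part of QEMU.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: no output means both emulators were found.
# missing_tools qemu-x86_64 qemu-system-x86_64
```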

There is also a new experimental fork of QEMU to try, called HQEMU. HQEMU uses LLVM to further optimize the code generated by TCG in the background and can yield impressive performance benefits, and the latest version 2.5.2 now supports ppc64le as a host architecture. Despite its obvious performance gains, however, HQEMU is not currently suitable as a total QEMU replacement: it's based on an older QEMU (2.5.x) that is missing some later features and improvements, it still has some notable (possibly ppc64le-specific) bugs, and because it needs a modified LLVM it currently has to be built from source. For these reasons I recommend you have both available and select whichever works best for the job at hand.

The step by step instructions for building HQEMU on ppc64le (PDF) work pretty much as is, except for the following:

  • LLVM 7.0 is not supported; I used LLVM 6.0.1. The included patch for 6.0 does not apply against 7.x. I already have clang on the system and the LLVM build system uses it by default (though ordinarily I'm one of those people who prefer gcc), so I don't know if it will build with gcc, though it should. Rather than install into /usr/local, I prefer to install into hqemu/llvm in my source directory to avoid tainting the system's version. This makes your cmake command look like this, assuming you followed the steps in the manual exactly and are in the proper directory:

    cmake -G "Unix Makefiles" \
    -DCMAKE_INSTALL_PREFIX=../../llvm \
    -DCMAKE_BUILD_TYPE=Release ..

    It takes about 15 minutes to build LLVM and clang with make -j24.

  • Not all QEMU targets are supported. x86_64-linux-user and i386-linux-user compile more or less as is, but you cannot compile x86_64-softmmu in HQEMU 2.5.2 (or any of the other software MMU targets) on ppc64le without this patch. I haven't tried any of the ARM targets, but I have no reason to believe they don't work. None of the other architectures or targets are supported. My recommended configuration command line for the T2 family is:

    ../hqemu-2.5.2/configure --prefix=`pwd` \
    --extra-cflags="-O3 -mcpu=power9" \
    --target-list=x86_64-linux-user,x86_64-softmmu \
    --enable-llvm

    It takes a couple minutes for a build with make -j24.

  • If you rebuild HQEMU and get a weird compile error while building some of the LLVM-related files, make sure that the llvm-config from the modified LLVM suite is first in your PATH (not the system one).
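To make that last check less error-prone, here is a minimal sketch that tests whether the first llvm-config on PATH lives under a given install prefix. The helper name is hypothetical, and the prefix in the usage line assumes the modified LLVM was installed into ~/src/hqemu/llvm; adjust it to wherever yours actually went.

```shell
# Succeed only if the first llvm-config on PATH lives under the given prefix.
# (llvm_config_is_under is a hypothetical helper name.)
llvm_config_is_under() {
    prefix=$1
    found=$(command -v llvm-config 2>/dev/null) || return 1
    case "$found" in
        "$prefix"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (the prefix is an assumption about where you installed the modified LLVM):
# llvm_config_is_under "$HOME/src/hqemu/llvm" || echo "wrong llvm-config is first in PATH"
```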

I'll include a couple screenshots of QEMU 3.0 and HQEMU 2.5.2 running a benchmark in ReactOS 0.4.9 under full system emulation here; you should be familiar with using QEMU by now, so I won't discuss its use further in this article. I used the CrystalMark benchmark not because it's particularly good but because most of the typical Windows benchmarking programs don't like ReactOS. First is QEMU, second is HQEMU.

You'll notice there were some zeroes in the benchmark under HQEMU. That's because they made HQEMU segfault! Also, oddly, the ALU score was worse, but the D2D and OpenGL scores -- done purely in software -- were two to four times higher. Indeed, ReactOS is a lot more responsive under HQEMU assuming you don't do anything it doesn't like. If you need to run a Windows application on your T2 and HQEMU doesn't poop its pants while running it, it is a much faster option. Note that some or all of these numbers can improve if you have any VirtIO support in your OS and/or appropriate drivers, which I have intentionally not used here to demonstrate the worst case. You may also be able to use your local GPU with virgl. We might look into that in a future article to see how practical non-native gaming is.

Instead of dithery benchmarks in a full system emulator, however, let's try to better quantify the actual CPU emulation penalty by running a simple math benchmark under QEMU's user mode. One of the easiest is to use the command line calculator bc to compute the digits of π, which can be done by taking the arctangent of 1 and multiplying it by 4. You can then use the scale= variable to set the difficulty level, such as echo "scale=5000;4*a(1)" | bc -l, which will (slowly) compute 5000 digits of π. (It takes around 30 seconds on modern systems.)
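As a small sketch of that idea, the bc one-liner can be wrapped in a function so different difficulty levels are easy to time. The name pi_bc is made up for illustration, and the timing loop in the usage comment assumes bash (where time can wrap a shell function) rather than tcsh.

```shell
# Compute pi to the given number of decimal places with bc's arctangent.
# (pi_bc is a hypothetical helper name; requires bc to be installed.)
pi_bc() {
    echo "scale=$1;4*a(1)" | bc -l
}

# Usage in bash: time a few difficulty levels.
# for digits in 1000 2000 5000; do time pi_bc "$digits" >/dev/null; done
```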

However, when you run a foreign architecture binary, you also need each and every one of the libraries it links to from that architecture, and current versions of bc have several additional dependencies. This somewhat unnecessarily complicates our little benchmark test. Fortunately, its ancestor, the venerable dc utility, has no dependencies other than libc, as shown by the output of objdump -p:

[...]
Dynamic Section:
  NEEDED               libc.so.6
[...]
Version References:
  required from libc.so.6:

To port this simple benchmark we will take advantage of the little-known fact that bc used to be simply a front end to dc; systems such as AIX and apparently some BSDs can still "compile" bc scripts to dc scripts with the -c option. I've provided the stripped-down output from my own POWER6 AIX server to compute digits of π in dc as a gist. Download that and put it somewhere convenient (for the examples in this article I saved it to ~/pi.dc). Note that versions of GNU dc prior to 1.06 or so will not properly parse this script, but most of the dc binaries of non-GNU provenance I've encountered will run it fine. Get a baseline by running it against your system (here, my own Talos II):

% time dc ~/pi.dc
3.14159265358979323846264338327950288419716939937508
0.004u 0.001s 0:00.00 0.0% 0+0k 0+0io 0pf+0w

(The exact format of the time command's output will depend on your shell; this is the one built into tcsh.)

Next, increase the number of digits computed by changing the line 50k to 500k (i.e., 500 digits), and time that.

% time dc ~/pi.dc
3.1415926535897932384626433832795028841971693993751058209749445923078\
164062862089986280348253421170679821480865132823066470938446095505822\
317253594081284811174502841027019385211055596446229489549303819644288\
109756659334461284756482337867831652712019091456485669234603486104543\
266482133936072602491412737245870066063155881748815209209628292540917\
153643678925903600113305305488204665213841469519415116094330572703657\
595919530921861173819326117931051185480744623799627495673518857527248\
9122793818301194912
2.393u 0.001s 0:02.39 100.0% 0+0k 0+0io 0pf+0w

Assuming those numbers look accurate, finally bump it to 1000k (1000 digits) to get a less noisy test. I'll spare you the digits.

[...]
20.833u 0.007s 0:20.84 99.9% 0+0k 0+0io 0pf+0w
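Editing the digit count by hand between runs gets tedious, so here is a minimal sketch for scripting it. It assumes the script sets its precision on a line of its own (such as 50k), as described above; the helper name set_dc_digits is hypothetical.

```shell
# Replace the precision line "<old>k" with "<new>k" in a dc script.
# Assumes the precision is set on a line of its own, e.g. "50k".
# (set_dc_digits is a hypothetical helper name.)
set_dc_digits() {
    file=$1 old=$2 new=$3
    sed "s/^${old}k\$/${new}k/" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage: set_dc_digits ~/pi.dc 50 500
```

Requiring the old value keeps the substitution from touching any other line that happens to end in k.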

Call it about 20 seconds of wall time natively (though I should note that Fedora 28 ppc64le is compiled for POWER8, not POWER9). Now, let's set up our x86_64 library root for the emulator test. Your distro may offer you these files in some fashion as a package, but I'll assume it doesn't and show you how to do this manually.

  1. Create a folder called debian-lib-x86_64. Our libraries will live here.
  2. Download the desired x86_64 (a.k.a. amd64) .deb of libc. I used the one from Jessie, but any later version should work.
  3. Uncompress it and find data.tar.xz within the .deb. Uncompress that.
  4. Within the data subfolder thus created, drill down to lib/x86_64-linux-gnu. Move that folder to debian-lib-x86_64/lib.
  5. Within debian-lib-x86_64/, create a symlink named lib64 that points to lib (i.e., ln -s lib lib64).

If you did this correctly, you should have a debian-lib-x86_64/lib with a whole mess of files and symlinks in it, and a debian-lib-x86_64/lib64 that points to the same place. Any additional libraries you need can simply be thrown into debian-lib-x86_64/lib.
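The manual steps above can be sketched as a shell function. This is only an illustration under stated assumptions: the .deb contains a data.tar.xz member with lib/x86_64-linux-gnu inside it (as described for the Jessie package), and ar, tar and xz are installed. The function name and the .deb filename in the usage line are placeholders.

```shell
# Sketch: build a library root from a Debian amd64 libc .deb.
# Assumes the package contains data.tar.xz with lib/x86_64-linux-gnu inside,
# and that ar, tar and xz are installed. (make_lib_root is a hypothetical name.)
make_lib_root() {
    # Resolve the .deb to an absolute path, since we change directory below.
    deb=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    root=$2
    work=$(mktemp -d) || return 1
    mkdir -p "$root" || return 1
    # A .deb is an ar archive; pull out just the data tarball.
    (cd "$work" && ar x "$deb" data.tar.xz) || return 1
    mkdir "$work/data" || return 1
    tar -xJf "$work/data.tar.xz" -C "$work/data" || return 1
    # The x86_64 libraries become lib/, and lib64 points at the same place.
    mv "$work/data/lib/x86_64-linux-gnu" "$root/lib" || return 1
    ln -s lib "$root/lib64" || return 1
    rm -rf "$work"
}

# Usage (the .deb filename here is only a placeholder):
# make_lib_root ./libc6_amd64.deb ~/src/debian-lib-x86_64
```

It expects the lib directory not to exist yet in the target, so run it against a fresh library root.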

Next, grab the x86_64/amd64 build of dc. I used the version from Buster since it matched the one on my Fedora 28 install, 1.07.1. It will work fine with the Jessie libs, at least as of this writing. Uncompress the .deb, find data.tar.xz, uncompress that, and find the dc binary within the created data folder. Move it somewhere convenient. For the examples in this article I saved it to ~/dc.amd64 and my x86_64 Debian libraries are in ~/src/debian-lib-x86_64.
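Before running any foreign binary it can help to confirm which libraries it wants, using the same objdump -p output shown earlier. A minimal sketch of a filter for that output follows; the helper name is hypothetical, and it assumes your objdump can read x86_64 ELF files (readelf -d reports the same information in a different format, which this filter does not parse).

```shell
# Print the shared libraries listed as NEEDED in `objdump -p` output on stdin.
# (needed_libs is a hypothetical helper name.)
needed_libs() {
    awk '$1 == "NEEDED" { print $2 }'
}

# Usage: objdump -p ~/dc.amd64 | needed_libs
```

Every name it prints has to exist under debian-lib-x86_64/lib for the emulated binary to start.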

First, let's test with QEMU itself. This assumes your pi.dc script is still set to 1000k.

% time ~/src/qemu-3.0.0/x86_64-linux-user/qemu-x86_64 -L ~/src/debian-lib-x86_64 ~/dc.amd64 ~/pi.dc
[...]
62.736u 0.026s 1:02.77 99.9% 0+0k 0+0io 0pf+0w

This is about three times slower than native dc, which isn't as dismal as you might have expected because all the syscalls are native instead of being emulated as well. We already know HQEMU will do this faster, but it'll be interesting to see how much so.

% time ~/src/hqemu/build/bin/qemu-x86_64 -L ~/src/debian-lib-x86_64 ~/dc.amd64 ~/pi.dc
[...]
45.181u 1.976s 0:27.40 172.0% 0+0k 0+0io 0pf+0w

Yes, that's 172% CPU utilization because of HQEMU's background optimization threads, but wall clock time is only 27 seconds! That's "only" about a third higher than native!
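The slowdown figures quoted here are simply ratios of wall clock times. A small sketch for recomputing them from the timings above (the helper name slowdown is made up for illustration):

```shell
# Print how many times slower an emulated run was than the native one.
# (slowdown is a hypothetical helper name.)
slowdown() {
    awk -v emu="$1" -v native="$2" 'BEGIN { printf "%.2f\n", emu / native }'
}

slowdown 62.77 20.84   # QEMU 3.0 vs. native
slowdown 27.40 20.84   # HQEMU 2.5.2 vs. native
```

With the exact timings this prints 3.01 and 1.31; rounding the times to 27 and 20 seconds first gives 1.35 for HQEMU.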

Do note that HQEMU's optimization isn't free. If we reduce the number of digits back down to 50 (i.e., 50k), we see this:

% time ~/src/qemu-3.0.0/x86_64-linux-user/qemu-x86_64 -L ~/src/debian-lib-x86_64 ~/dc.amd64 ~/pi.dc
3.14159265358979323846264338327950288419716939937508
0.048u 0.002s 0:00.05 80.0% 0+0k 0+0io 0pf+0w
% time ~/src/hqemu/build/bin/qemu-x86_64 -L ~/src/debian-lib-x86_64 ~/dc.amd64 ~/pi.dc
3.14159265358979323846264338327950288419716939937508
0.164u 0.016s 0:00.17 100.0% 0+0k 0+0io 0pf+0w

In this case, HQEMU is about three times slower than regular QEMU because of the LLVM optimization overhead over a very brief runtime. This example is still a nearly imperceptible seventeen hundredths of a second in wall clock terms, but if your workload consists of running a short-lived alien architecture binary over and over, HQEMU will cost you more. Admittedly, I can't think of too many workloads in this category, but I'm sure there are some.
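Since both emulators take the same arguments in user mode, a small wrapper makes it easy to pick one per invocation. This is a sketch only: QEMU_X86 and X86_LIBROOT are hypothetical variable names, and the defaults simply mirror the paths used in this article.

```shell
# Run an x86_64 Linux binary under a user-mode emulator.
# QEMU_X86 and X86_LIBROOT are hypothetical variable names for this sketch;
# the library root default mirrors the path used in this article.
run_x86() {
    "${QEMU_X86:-qemu-x86_64}" -L "${X86_LIBROOT:-$HOME/src/debian-lib-x86_64}" "$@"
}

# Usage:
# run_x86 ~/dc.amd64 ~/pi.dc
# QEMU_X86=~/src/hqemu/build/bin/qemu-x86_64 run_x86 ~/dc.amd64 ~/pi.dc
```

That way short jobs can stay on plain QEMU while long-running ones go to HQEMU.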

The take-away from this is that if you have a Linux binary from an x86_64 system and you can collect all the needed libraries, it has an excellent chance of at least working, and if it's something HQEMU can run, of working with a relatively low performance penalty. The trick, of course, is collecting all those libraries, which could be a quick trip to dependency hell, and messing around with binfmt for transparent execution is left as an exercise for the reader. Full system emulation still has a fair bit of overhead but it's easy to set up and generally liveable, even in pure TCG QEMU, so you can do what you need to if you have to. Now go take a shower and wash all that Intel off.

Comments

  1. Interesting article. I hope all these optimizations will be upstreamed into QEMU so everyone will benefit from them without particular knowledge.

  2. Noticed a little error:

    "taking the arctangent of 1 radian"

    As far as I know, the argument to atan is dimensionless, not in radians, so that should just be "taking the arctangent of 1".

    Nothing of consequence, just something that set off my math alarms.

  3. Really curious if you have tried PCI passthrough with something like vfio. I've been using it for years on my x86 system for my workstation guest (video card and USB 3 card) but I'm not sure how well my workstation would run on this kind of hardware, if at all. It's very tempting though and the Talos II board looks really good; looks very similar to a Supermicro board like my x10dri.

    Replies
    1. Not yet. I want to see how well it does as a gamer and that should facilitate testing something like that.

    2. I suspect you're gonna have some problems when it comes to using the iommu with a VM that isn't running native instructions but I may be wrong about that. If it all works with respect to a difference in endianness between the guest and host and the addresses that are remapped to the guest then that would be really really cool. I've been meaning to try something similar on my dual e5-2630 v4 xeon build but IIRC the guest CPU depends on some instructions to be available like the timestamp counter and vt-d in order for it to work with video cards. FWIW I'm seriously considering moving to Power 9 or 10 within the next 2 years if the hardware keeps up like this. I gotta say though the motherboard price for what it is (it looks like a supermicro board) is still too high for me; $500 was too much for my x10dri but it was doable. Component-wise and layout-wise the talos 2 is almost identical to my x10dri.

    3. also I like how it looks like they did away with the x16 slot clips on the talos, those broke off on mine ;p overall I see a lot of good characteristics on the talos2 like the socketed boot and bmc flash chips. The diff of $2,000+ still isn't worth it to me though.

    4. later on I'll see if I can get my video card to pass through on a qemu-system-ppc64le guest and play around with that. I was playing around with it some yesterday trying to pass through a 10gbe nic without much luck but then again I couldn't get it to boot the debian installer discs I provided at all either. But, give me some time and I'll get back to you on that.

  4. CyberMonday special from RaptorCS:
    Blackbird mATX mainboard with 4-core POWER9

    $999

    https://twitter.com/RaptorCompSys/status/1066112465304543232

