Low-level change to Firefox 70 and ESR coming


If you are using Firefox on 64-bit Power, you'll want to know about bug 1576303 which will be landing soon on the beta and ESR68 trees to be incorporated into 70 and the next ESR respectively. This fixes a long-standing issue with intermittent and difficult to trace crashes (thanks to Ted Campbell at Mozilla for figuring out the root cause and Dan Horák for providing the hardware access) due to what in retrospect was a blatant violation of the ELF ABI in xpconnect, which glues JavaScript to native XPCOM. This needed several dodgy workarounds until we found the actual culprit.

The patch is well tested on multiple little-endian systems including this Talos II, but because it's an issue with register allocation in function calls the issue also theoretically affects big-endian Power even though we haven't seen any reports. I'm pretty sure the code I wrote will work for big-endian but none of my big-endian Power systems run mainline Firefox (and TenFourFox even on the G5 is 32-bit, where the problem isn't present). If you're using a big-endian system, you may want to pull a current release and make sure there is no regression in the browser with the changes; if there is and you can bisect to it, post in the bug so we can do a follow-up fix. On the other hand, if you're building from an old ESR such as 52 (the last non-Rust-required one), you may want to backport this fix because the problem has been there pretty much since it was first written.

Stuff like this actually proves Linus Torvalds' point that "as long as everybody does cross-development, the platform won't be all that stable." Linus was talking about ARM-based servers being undercut by a dearth of ARM-based PCs, but the point is also true here: 64-bit Power may do well in the data center but it was rarely used for workstations other than the Power Mac G5 and the small number of non-Apple PowerPC 970 towers, meaning this bug went undiscovered until people like us finally started dogfooding Power-based desktops again. (For that matter, the official PowerPC Mac OS X builds of Firefox were also always 32-bit, even on the G5, so no one would have noticed it there.) There's just no substitute for improving the quality and quantity of software for Power ISA like having one under your desk, and as the number of machines increases I expect we'll get more of these ugly corner bugs ironed out in other packages too.

Comments

  1. Big thanks to you and Ted, it was a difficult one.

    ReplyDelete
  2. That reminds me, I recently tried ff68 on ppc32 and got a segfault on start like this https://gist.github.com/q66/0aefa2d42a846cfceb84607f19018c86

    Of course, I had to get it to compile it at all first; it involved patching xptcall: https://github.com/void-ppc/void-packages/blob/master/srcpkgs/firefox/patches/ppc32-fix.patch

    Think that's related? Ignoring the fact that attempting to build firefox *on* ppc32 reliably crashes rustc when building style (cross-compiling from a 64-bit machine works alright).

    ReplyDelete
    Replies
    1. The crash looks like you're dereferencing a null pointer somewhere. Rather than messing around with ptr, why not do as 32-bit x86 does with s->val, which is still supported (i.e., https://hg.mozilla.org/mozilla-central/rev/08339a56f3ae, which you reference)? Why are you treating jsvals differently?

      Delete
    2. The different treatment of jsvals was introduced here https://hg.mozilla.org/mozilla-central/rev/d16fcad6aa60 which was a consequence of bug https://bugzilla.mozilla.org/show_bug.cgi?id=961488 - before this, it was not treated specially, and this is also the reason the ptr field is needed, because you need to take an address of it.

      I referenced all the relevant changes in the patch...

      Delete
    3. Oh, my goodness, THAT bug. I didn't remember this because PowerOpen ABI doesn't have this problem, so it never affected TenFourFox, so I plead innocent :)

      However, since all the function does is take &val, why not do the double dereference in xptcinvoke instead? i.e., instead of your xptcall.h patch, in the xptcinvoke function change IsPtrData() to IsIndirect(), and then void *ptr = &s->val; . Then work with ptr as before depending if it's a jsval or not.

      Delete
    4. The problem with doing something like, void *ptr = &s->val; tempu32 = (uint32_t) &ptr; in invoke_copy_to_stack is that in this case the pointer would be on the stack in invoke_copy_to_stack and you'd be taking an address to it (i.e. once invoke_copy_to_stack is done calling, you'd be left with an address to a dangling pointer). By storing it in nsXPTCVariant, it becomes persistent, its lifetime becomes the same as the lifetime of the nsXPTCVariant itself. Either way, shouldn't matter for the crash; I thought I'd make that work first, and then deal with cleaning up...

      Delete
    5. Ah, I see what you're saying. In that case, I don't have a good solution there.

      Delete
  3. Debian builds Firefox natively for big-endian ppc64, 32-bit powerpc currently is not supported due to the lack of WASM support for 32-bit powerpc plus there is no NodeJS for powerpc and we haven't worked on a mechanism yet to cross-build the Javascript code for powerpc with NodeJS on x86_64.

    > https://buildd.debian.org/status/package.php?p=firefox-esr&suite=sid

    ReplyDelete
    Replies
    1. We also build Firefox for ppc64 (though unlike Debian we're ELFv2). Lack of WASM and WebRTC shouldn't prevent it (can just disable what doesn't work) and Node technically still works fine on ppc32 in its LTS version (10.x, 12.x sadly broke things). The rustc crashing is the worst blocker right now

      Delete

Post a Comment

Comments are subject to moderation. Be nice.