It's a plug and pray night


The Talos II got an upgrade today but not without a lot of messing around. Some of you have noted I've said little about further Firefox JIT updates and that's because of 1) $DAYJOB and 2) running dangerously low on space on the 1TB Samsung 960 EVO NVMe SSD mounted on /home, so having lots more source code sitting around wasn't happening.

Well, as of Monday I'm now officially between jobs (don't worry, I'll be getting a paycheck again in October) and I finally got the T2 to cooperate with a new 2TB Samsung 980 PRO. At the same time I also replaced the Raptor BTO Marvell 88SE9235 SATA card with a JMicron JMB582 card. It's two ports instead of four, but I'm only using it for the two optical drives, and it seems to be much more reliable than the Marvell which would sometimes come up with drives and other times stall out.

But getting them to all work together was unexpectedly tricky. Here's what's in there now.

% lspci
0000:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon Pro WX 7100]
0000:01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
0001:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0001:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
0002:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0003:01:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0004:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0004:01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0004:01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0005:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0005:01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 04)
0005:02:00.0 Multimedia video controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
0030:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0030:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963
0031:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0032:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0033:00:00.0 PCI bridge: IBM POWER9 Host Bridge (PHB4)
0033:01:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller

Originally, the NVMe drive I bought was a 2TB WD Black SN850X. This worked great in an external USB3 enclosure. I rsynced /home to it and put it into the PCIe carrier and restarted, and it failed to show up in Petitboot, lspci or Fedora. I tried a different passive adaptor on the off-chance that had something to do with it and moved it around the available slots, but nothing made it work. Later I found an old post on the Raptor forums reporting a similar problem, nor was I willing to get one of those pricey switched multi-M-2 cards to try.

Since the boot drive is (still) a Raptor BTO Samsung 960 and the drive I was replacing was also a Samsung, I decided the cheaper option would be to just buy another Samsung, though the TLC flash in the 980 PRO makes it more like a 980 EVO in my book. Anyway, I left everything copying overnight, came back this morning, pulled the PCIe carrier, swapped the SSD sticks and fired it back up, and Petitboot wouldn't see it either. (I had installed the JMicron card at the same time and it did show up, but that wasn't too helpful just then.) At this point two drives acting exactly the same way, including one that would be very likely to work, made me suspicious this was a configuration problem.

This is a fully-populated dual-8 T2, so both CPUs and all five PCIe slots are live. At this point, other than the BMC, Ethernet and USB, the only device on the first CPU's slots was the Raptor BTO AMD Radeon Pro WX7100 workstation card in the 16x (the 8x is open); the original NVMe SSDs and the Marvell SATA card occupied the three lower slots (16x, 16x, 8x) handled by the second CPU.

I started off by pulling the SATA card completely and just leaving the two SSDs and the WX7100. Sforza POWER9s support three PCIe controllers (PECs), 48 PCIe lanes and six PCIe host bridges (PHBs) per processor module. In theory this looks like three x16 slots per CPU, but in practice PEC1 on each CPU is always bifurcated x8x8 and PEC2 is optionally trifurcatable x8x4x4. Plus, there are on-board resources also competing for those lanes, so some of the T2 slots are necessarily bifurcated and others aren't. The x8 slot on CPU0 is actually a bifurcated PEC1, with x4 allocated to the Microsemi PM8068 BTO option (whether present or not), and its phantom PEC2 is split between the BMC, the Broadcom Ethernet controllers and the TI USB 3.0 host controller. On CPU1, slot three is PEC2 x16, slot four is PEC0 x16 (never bifurcated on either CPU), and slot five is PEC1 x8, with x4 allocated to OCuLink.

The old 960 EVO and the unresponsive new 980 were originally in slot 5 (CPU1, PEC1) and the boot 960 in slot 4 (CPU1, PEC0). Moving the new 980 into slot 1 (CPU0, PEC1) finally allowed it to coexist and be mountable with the boot 960, so next I put the JMicron SATA card into the newly available slot 5 and restarted ... and the x16 video card in slot two (CPU0, PEC0) failed to come up. Petitboot was fine on serial.

Figuring I had exceeded some maximum on CPU0 somehow, I moved the WX7100 to the open x16 on slot 3 (CPU1, PEC2). Not only did I still get a black screen, but I also got the dreaded PHB Freeze/Fence detected ! on that slot in the Hostboot output, which usually means the system planar is not happy with you now.

I returned to the immediately preceding configuration, returning the video card to CPU0 PEC0 in slot 2. Putting the 960 into slot 5 and the JMicron card into slot 4 also failed, so within those constraints the only thing I hadn't tried yet was putting the JMicron SATA card in slot 3 instead of slot 5. This seemed technically disgusting since I was wasting an entire x16 slot on a miserable little x1 card, and of course it worked.

The JMicron has a bright blue LED, so with the final configuration and the existing LEDs on the board I suppose I could slap a Honda hood ornament on the front and go street racing. This configuration leaves slot 5 unoccupied, which is an x8, so in the end it's no worse than what I started with (slot 1 was originally free).

I'm not sure what the moral of the story is here. It's possible the Marvell was misbehaving because of conflicts too, but it never failed to show up in lspci even though the drives connected to it sometimes did, whereas the new NVMe drive didn't even show up as a device, let alone a mountable filesystem. It also seems like the Radeon doesn't like being in a bifurcatable slot, though to test this I'd have to try it in slot 4. On the other hand, slot 5 should have been exactly the same as slot 1, yet neither the new 980 nor the SN850X would work in slot 5, nor would the JMicron card. Perhaps the SN850X would work fine in slot 1 as well. If I'm inside the case again to do something else, I should test some or all of these theories.

One thing that is worth remembering is that a PCIe device that initially fails to work or be recognized when installed may simply be picky about where you put it depending on what other devices are present. Unfortunately that means a whole lot more trial and error when you have multiple devices, and tonight I'm not interested in pushing my luck further. Once I've built the next Firefox (article to follow), then it's time to get back to work with a terabyte more space to expand into. And that'll hold a lot of source trees.

Comments

  1. It's also written here:
    https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility_List/PCIe_Devices#Known_issues

    ReplyDelete
  2. Commenting on "pricey switched multi-M-2 cards", I can suggest a relatively affordable option.

    The Supermicro AOC-SHG3-4M2P is available for $161 each. I've successfully bought several of these here:
    https://acmemicro.com/Product/16895/Supermicro-AOC-SHG3-4M2P-Full-Height-Quad-port-M-2-NVMe-SSD-PCI-E-3-0-add-on-card?c_id=116

    One really nice feature is that it is PCIe 3.0 x8 and fits in smaller slots. It does not fit in slot 1 of the Talos II because it physically interferes with the DIMM slots. Mine runs happily in slot 5 loaded with Samsung 970 EVO and 980 pro SSDs.

    Also note that there is a 4 pin ATX 12V power connector on the card. This is optional and can be left unconnected. The PCIe switch on the card consumes under 10W by my measurement, but you should have a least a little airflow over the card. The PCIe bracket is nicely perforated you don't need very high static pressure.

    ReplyDelete

Post a Comment

Comments are subject to moderation. Be nice.