The case of the disappearing core


Here's a fun pro-tip: what do you do when one of your system's cores goes out to lunch and never comes back? On my original dual 4-core Talos II, compile times had become abnormally long and the whole system felt sluggish. In dmesg I noted with alarm that it was reporting numa: Node 0 CPUs: 4-15 instead of starting at CPU 0. That means an entire core (four hardware threads, because these are SMT-4 parts) had somehow gone offline! What gives?
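If you want to check for this without eyeballing dmesg, here's a rough Python sketch that expands the CPU list from that NUMA line and reports which cores are absent. The function names are made up for illustration, and the expected range of 0-15 for node 0 is my assumption (four SMT-4 cores per node); adjust it for your own configuration.

```python
def parse_cpu_list(spec):
    """Expand a kernel-style CPU list such as '4-15' or '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def missing_cores(spec, expected_cpus, threads_per_core=4):
    """Return the sorted core indices that have at least one thread absent from spec."""
    present = parse_cpu_list(spec)
    missing = set(expected_cpus) - present
    # With SMT-4, logical CPUs 0-3 belong to core 0, 4-7 to core 1, and so on.
    return sorted({cpu // threads_per_core for cpu in missing})


# The line as it appeared in dmesg on the affected machine.
line = "numa: Node 0 CPUs: 4-15"
spec = line.split("CPUs:")[1].strip()

# Assumption: node 0 should carry logical CPUs 0-15 (4 cores x 4 threads).
print(missing_cores(spec, range(0, 16)))  # prints [0]
```

An empty list means every expected thread is present; anything else tells you which core to go looking for.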

The answer turns out to be related to Hostboot. The GUARD partition of the PNOR records which hardware components have been disabled (this can include RAM sticks and individual cores), presumably due to a defect. It can also happen spuriously if Hostboot mistakes a driver glitch for an actual hardware failure and erroneously adds a guard entry for that component. With main power off, a simple pflash -P GUARD -c at the BMC root prompt will clear the guard entries, and indeed the prodigal core returned forthwith when I powered the system back on. Thanks to Tim Pearson at Raptor for the #protip.
