PCIe Problems

HP z840 Dead Slot

I just had an issue with the HP z840 system I’m using as a build server [1]. I had to take it to a site about 20 minutes’ drive away, and when I got there it wouldn’t boot: it just gave 6 beeps and the red LED on the power button flashed. The beeps indicate a video issue, which points to the Intel Arc B580 card (which is annoyingly large) [2]. I swapped it for another video card I had lying around (which I knew to be reliable) and got the same result.

It turned out that the PCIe*16 slot I was using for it had broken; maybe bumps during transport with the big heavy GPU did it. I plugged the card into the next slot along, which is a PCIe*8 slot that’s open ended so it takes larger cards. The upside is that the system is still working well; the downside is that the issues I already had with the GPU being unreasonably large are exacerbated by losing one of the *16 slots. Having it in a PCIe 3.0*8 slot is not a problem for me as I only plan to use it for 8K display and for ML stuff, and I think that *8 speed (7.8GB/s) is sufficient for both those tasks. That is roughly the data rate of uncompressed 8K video at 60Hz with 32bpp (something that I don’t anticipate ever doing). The slot could also transfer the largest LLM that fits in the card’s VRAM in under 2 seconds, which isn’t an unreasonable delay for starting an LLM.
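
The bandwidth figures above can be checked with some quick arithmetic. This is only a rough sketch: it uses the theoretical PCIe 3.0 rate (8 GT/s per lane with 128b/130b encoding) and ignores packet overhead, so real throughput will be somewhat lower. The 12GB VRAM size is my assumption for the B580, and all sizes are decimal gigabytes.

```python
# Rough PCIe 3.0 bandwidth arithmetic (theoretical rates, no protocol overhead).
GT_PER_LANE = 8e9        # PCIe 3.0 signalling rate: 8 GT/s per lane
ENCODING = 128 / 130     # 128b/130b line encoding used by PCIe 3.0


def link_bytes_per_sec(lanes):
    """Theoretical payload rate of a PCIe 3.0 link in bytes per second."""
    return GT_PER_LANE * ENCODING * lanes / 8


def video_bytes_per_sec(width, height, bytes_per_pixel, fps):
    """Data rate of an uncompressed video stream in bytes per second."""
    return width * height * bytes_per_pixel * fps


link = link_bytes_per_sec(lanes=8)
video = video_bytes_per_sec(7680, 4320, 4, 60)   # 8K, 32bpp, 60Hz
vram_bytes = 12e9                                # assumed VRAM size (12GB)
load_seconds = vram_bytes / link

print(f"PCIe 3.0 x8 link: {link / 1e9:.2f} GB/s")
print(f"8K60 32bpp video: {video / 1e9:.2f} GB/s")
print(f"Time to fill VRAM: {load_seconds:.2f} s")
```

By this arithmetic the link comes to about 7.88GB/s and the uncompressed 8K stream to about 7.96GB/s, so that case is borderline rather than comfortably inside the link rate. Filling 12GB takes about 1.5 seconds.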

The question now is, should I remove PCIe cards before transport in future?

HP z640 Intermittent Errors

The next issue is with my HP z640, which is now my main workstation [3]. I started getting the errors below; then the kwin_wayland session hung, and on another occasion I got video corruption in mpv.

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00001040/00002000
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent is reported first
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error status/mask=00001000/00002000
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] error status/mask=00001000/00002000
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details for 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00001040/00002000
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent is reported first
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error status/mask=00001100/00002000
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [ 8] Rollover
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] error status/mask=00001100/00002000
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [ 8] Rollover
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout
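
The status/mask values in those messages are bitmasks over the PCIe AER correctable error status register, and the bracketed numbers (6, 8, 12) are the bit positions that were set. Below is a minimal sketch of decoding them. The bit names are the short labels that I believe the Linux kernel prints, and the `decode` and `decode_line` helpers are my own illustration rather than anything from the kernel.

```python
import re

# Bit positions in the AER Correctable Error Status register, with the
# short names the Linux kernel appears to use in its log messages.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

STATUS_RE = re.compile(r"status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode(value):
    """Return the names of the known bits set in a status or mask value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_line(line):
    """Decode a 'status/mask=XXXXXXXX/XXXXXXXX' log line, or return None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status, mask = (int(group, 16) for group in match.groups())
    return decode(status), decode(mask)


line = "pcieport 0000:00:02.0:   device [8086:2f04] error status/mask=00001040/00002000"
print(decode_line(line))
```

For the root port this gives BadTLP and Timeout as the reported errors, matching the bracketed lines in the log, and shows that the mask value 00002000 corresponds to the Advisory Non-Fatal bit.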

On that system I took the CPU out and reinstalled it with new heatsink paste on the theory that it might not have made good contact with some of the pins. The system also has one DIMM slot not working which can be a symptom of poor seating of the CPU. Doing that made no difference to the DIMM slot (I had bought the system for $50 in “unknown condition”) but the video has worked correctly since. It has been suggested to me that reseating the CPU didn’t directly affect the issue and that just taking the system apart could have addressed an issue of the GPU not making good contact in the PCIe slot.

It has been suggested that I could try “contact cleaner” which can be obtained from automotive supply stores among other places. I’m hesitant to put that in a PCIe slot but putting it on the connector of the card and then polishing it off seems like something to consider. Another suggestion was to use isopropyl alcohol to wash the contacts. I guess washing a PCIe slot out with isopropyl alcohol and leaving it for hours to dry is an option as a last resort.

For the moment it seems to be fine, but I am not certain that the problem is gone forever. My main aim is to keep these systems working until after DDR6 workstations are released, which is when I expect DDR5 workstations to become affordable on the second-hand sites.
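
One way to keep an eye on whether the problem comes back is to read the per-device AER counters that recent kernels expose in sysfs. The following is a sketch only: it assumes the kernel provides an `aer_dev_correctable` file for the device, the exact counter names vary between kernel versions, and the function names are my own.

```python
from pathlib import Path


def parse_aer_counters(text):
    """Parse 'name count' lines into a dict; names may contain spaces."""
    counters = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_correctable(address, sysfs_root="/sys/bus/pci/devices"):
    """Read the correctable error counters for one PCI device address."""
    path = Path(sysfs_root) / address / "aer_dev_correctable"
    return parse_aer_counters(path.read_text())


# Example with made-up counter values in the format described above.
sample = "Bad TLP 3\nReplay Timer Timeout 1\nTOTAL_ERR_COR 4\n"
print(parse_aer_counters(sample))
```

On a real system something like `read_correctable("0000:00:02.0")` could be run periodically and the totals compared, which would show the errors returning before they are bad enough to hang the session.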

etbe – Russell Coker

