Archives

Categories

AMD Video Driver Issues

I have had some graphics hangs on my HP z640 workstation which seem to always be after about 4 days of uptime, in one instance running Debian kernel 6.16.12+deb14+1 I got the following kernel error:

kernel: amdgpu 0000:02:00.0: [drm] *ERROR* [CRTC:58:crtc-0] flip_done timed out

Then I got the following errors from kwin_wayland:

kwin_wayland_wrapper[19598]: kwin_wayland_drm: Pageflip timed out! This is a bug in the amdgpu kernel driver
kwin_wayland_wrapper[19598]: kwin_wayland_drm: Please report this at https://gitlab.freedesktop.org/drm/amd/-/issues
kwin_wayland_wrapper[19598]: kwin_wayland_drm: With the output of 'sudo dmesg' and 'journalctl --user-unit plasma-kwin_wayland --boot 0'

In another instance running Debian kernel 6.12.48+deb13 I got the kernel errors at the bottom of the post (not in the RSS feed).

A google result suggested putting the following on the kernel command line which has the downside of increasing the idle power, but given that it’s a low power GPU (that I selected when I was using a system without a PCIe power cable) a bit of extra power use shouldn’t matter much. But it didn’t seem to change anything.

amdgpu.runpm=0 amdgpu.dcdebugmask=0x10

I had tried out the Debian/Unstable kernel 6.16.12-2 which didn’t work with my USB speakers and had problems with the HDMI sound through my monitor but still had AMD GPU issues.

This all seemed to start with the PCIe errors being reported on this system [1]. So I’m now wondering if the PCIe errors were from the GPU not the socket/motherboard. The GPU in question is a Radeon RX560 4G which cost $246.75 back in about 2021 [2]. I could buy a new one of those on ebay for $149 or one of the faster AMD cards like Radeon RX570 that are around the same price. I probably have a Radeon R7 260X in my collection of spare parts that would do the job too (2G of VRAM is more than sufficient for my desktop computing needs).

Any suggestions on how I should proceed from here?

Continue reading AMD Video Driver Issues

PCIe Problems

HP z840 Dead Slot

I just had an issue with the HP z840 system I’m using as a build server [1]. I had to take it to a site that was about 20 minutes drive away and after getting there it didn’t work and just gave 6 beeps and the red LED on the power button flashed. The beeps indicate a video issue, which refers to the Intel Arc B580 card (which is annoyingly large) [2]. I swapped the card with another video card I had lying around (which I knew to be reliable) and got the same result.

It turned out that the PCIe*16 slot that I was using for it had broken, maybe bumps during transport with the big heavy GPU had broken it. I plugged it into the next slot along which is a PCIe*8 slot that’s open ended so it takes larger cards. The upside of this is that the system is still working well, the downside is that the issues I already had with the GPU being unreasonably large are exacerbated by losing one of the *16 slots. Having it in a PCIe 3.0*8 slot is not a problem for me as I only plan to use it for 8K display and for ML stuff and I think that *8 speed (7.8GB/s) is sufficient for both those tasks. In that slot the card could display 8K video at 60Hz with 32bpp and no compression (something that I don’t anticipate ever doing). It could also transfer the maximum size LLM in under 2 seconds which isn’t an unreasonable delay for starting a LLM.

The question now is, should I remove PCIe cards before transport in future?

HP z640 Intermittant Errors

The next issue I have is with my HP z640 workstation which is now my main workstation [3]. I started getting the below errors and then I had the kwin_wayland session hang and another time I started getting video corruption with mpv.

Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: Correctable error 
message received from 0000:00:02.0
Oct 10 20:46:36 xev kernel: pcieport 0000:00:02.0: AER: found no error details 
for 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable 
error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: 
severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error 
status/mask=00001040/00002000
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP                
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout               
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent 
is reported first
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: 
severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error 
status/mask=00001000/00002000
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout               
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: 
severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] 
error status/mask=00001000/00002000
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout               
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Correctable error 
message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details 
for 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable 
error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: found no error details 
for 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER: Multiple Correctable 
error message received from 0000:00:02.0
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: PCIe Bus Error: 
severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:   device [8086:2f04] error 
status/mask=00001040/00002000
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [ 6] BadTLP                
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0:    [12] Timeout               
Oct 10 20:46:37 xev kernel: pcieport 0000:00:02.0: AER:   Error of this Agent 
is reported first
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0: PCIe Bus Error: 
severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:   device [1002:6987] error 
status/mask=00001100/00002000
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [ 8] Rollover              
Oct 10 20:46:37 xev kernel: amdgpu 0000:02:00.0:    [12] Timeout               
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1: PCIe Bus Error: 
severity=Correctable, type=Data Link Layer, (Transmitter ID)
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:   device [1002:aae0] 
error status/mask=00001100/00002000
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [ 8] Rollover              
Oct 10 20:46:37 xev kernel: snd_hda_intel 0000:02:00.1:    [12] Timeout        

On that system I took the CPU out and reinstalled it with new heatsink paste on the theory that it might not have made good contact with some of the pins. The system also has one DIMM slot not working which can be a symptom of poor seating of the CPU. Doing that made no difference to the DIMM slot (I had bought the system for $50 in “unknown condition”) but the video has worked correctly since. It has been suggested to me that reseating the CPU didn’t directly affect the issue and that just taking the system apart could have addressed an issue of the GPU not making good contact in the PCIe slot.

It has been suggested that I could try “contact cleaner” which can be obtained from automotive supply stores among other places. I’m hesitant to put that in a PCIe slot but putting it on the connector of the card and then polishing it off seems like something to consider. Another suggestion was to use isopropyl alcohol to wash the contacts. I guess washing a PCIe slot out with isopropyl alcohol and leaving it for hours to dry is an option as a last resort.

For the moment it seems to be fine but I am not certain that the problem is gone forever. At the moment my main aim is to have these systems keep working until after the release of DDR6 workstations which is when I expect DDR5 workstations to become affordable on all the second hand sites.

Links October 2025

Informative video about the way corporations charge different rates based on location and even type of device used on the web site [1]. This should be illegal everywhere!

Bruce Schneier with Heather Adkins and Gadi Evron wrote an insightful post about AI Hacking and the Future of Cybersecurity, the future seems grim [2].

Slaughterbots is an interesting Dust SciFi movie exploring the future of autonoms weapons [3].

Arstechnica has an intersting article on a genetically engineered plant with a more efficient system for photosynthesis [4]. If this goes to plan it could revolutionise agriculture!

David Brin wrote an insightful blog post about the Seldon Paradox [5]. Also he wrote the final book in the Foundation series so he is the current living export on Hari Seldon.

Charles Stross wrote an insightful blog post about the pivot away from fossil fuels and the future of computers without Moore’s Law [6].

Audrey Woods wrote an insightful article about the end of Moore’s Law, we can get more transistors from multichip modules bit it’s only small linear improvements not exponential [7].

Bruce Schneier and Barath Raghavan wrote an interesting article about AI’s OODA loop problem, it’s a good way of thinking about some of these issues [8]. I think that LLM security is a losing game. Getting them to mostly not tell people how to commit crimes is the limit of controls.

Manga telling the story of Revelations in all it’s drug inspired madness [9]. When I read the entire Bible I skipped Revelations because it’s too obviously the product of mental illness.

CNN has an interesting article about bitcoin ARMs which are almost exclusively used for crime [10].

Internode NBN500

I have just converted to the Internode NBN500 plan which is now the same price as the NBN100 plan. I’m in a HFC area so they won’t let me get fiber to the home (due to Malcolm Turnbull breaking the NBN to help Murdoch) so I’m limited to what HFC can do.

I first tried it out on a 100mbit card and got speeds of 96/47 mb/s according to speedtest.net. I’ve always had the MTU set to 1492 for the PPPoE connection (something I forgot to mention in my blog post about connecting to the Arris CM8200 on Debian [1]) but when run on the 100mbit card I had to set it to 1488. Apparently 1488 is the number because 4 bytes are taken for the VLAN header and 8 bytes for the PPPoE header. But it seems that when using gigabit ethernet it doesn’t take 4 bytes for the VLAN (comments explaining that would be appreciated).

when connected via gigabit with a MTU of 1492 I got speeds of 534/46 which are quite good. When I tested with my laptop on a Wifi link while sitting next to the main node of my Kogan Wifi6 mesh [2] via 2.4GHz wifi I got 172/45. When using 5GHz I got 514/41. When using 5GHz at the far end of my home over the mesh I got 200/45.

Here’s a table summarising the speeds. I rounded all speeds off to 1Mbit/s because I don’t think that the results are even that accurate. I think that Wifi5 over mesh reporting a faster upload speed than Wifi5 near the AP is because of random factors not an actual benefit to being further away, but I will do more tests later on.

Connection Receive Mbit/s Send Mbit/s
100baseT 96 47
Gigabit 535 46
2.4GHz Wifi 172 45
Wifi5 514 41
Wifi5 Over Mesh 200 45

WordPress Spam Users

Just over a year ago I configured my blog to only allow signed in users to comment to reduce spam [1]. This has stopped all spam comments, it was even more successful than expected but spammers keep registering accounts. I’ve now got almost 5000 spam accounts, an average of more than 10 per day. I don’t know why they keep creating them without trying to enter comments. At first I thought that they were trying to assemble a lot of accounts for a deluge of comment spam but that hasn’t happened.

There are some WordPress plugins for bulk deletion of users but I couldn’t find one with support for “delete all users who haven’t submitted a comment”. So I do it a page at a time, but of course I don’t want to do it 100 at a time so I used the below SQL to change it to 400 at a time. I initially tried larger numbers like 2000 but got Chrome timeouts when trying to click the check-box to select all users. From experimenting it seems that the time taken to check that is worse than linear. Doing it for 2000 users is obviously much more than 5* the duration of doing it for 400. 800 users was one attempt which resulted in it being possible to select them all but then it gave an error about the URL being too long when it came to actually delete them. After a binary search I found that 450 was too many but 400 worked. So now it’s 12 operations to delete all the spam accounts. Each bulk delete operation is 5 GUI operations so it’s 60 operations to delete 15 months of spam users. This is annoying, but less than the other problems of spam.

UPDATE `wp_usermeta` SET `meta_value` = 400 WHERE `user_id` = 2 AND `meta_key` = 'users_per_page';

Deleting the spam users reduced the size of the backup (zstd -9 of a mysql dump) for my blog by 6.5%. Then changing from zstd -9 to -19 reduced it by another 13%. After realising this difference I configured all my mysql backups to be compressed with zstd -19, this will make a difference on the system with over 30G of zstd compressed mysql backups.

Links September 2025

Werdahias wrote an informative blog post about Dark Mode for QT programs on non-QT environments (mostly GNOME based), we need more blog posts about this sort of thing [1].

Astral Codex Ten has an interesting blog post about the rise of Christianity, trying to work out why it took over so quickly [2].

Frances Haugen’s whistleblower statement about Facebook is worth reading, Facebook seems to be one of the most evil companies in the world [3].

Interesting blog post by Philip Bennett about trying to repair a 28 player Galaxian game from 1990 [4].

Bruce Schneier and Nathan E. Sanders wrote an insightful article about AI in Government [5].

Krebs has an interesting analysis of Conservatives whinging about alleged discrimination due to their use of spam lists [6].

Nick Cheesman wrote an insightful article on the failures of Meritocracy with ANU as a case study [7]. I am mystified as to why ABC categorised it under Religion.

David Brin wrote an interesting short SciFi story about dealing with blackmail [8].

Charles Stross has an interesting take on AI economics etc [9].

Corey Doctorow wrote an interesting blog post about the impending economic crash becuase of all the money tied up in AI investments [10].

More About the Colmi P80

The FOSS Android program for communicating with smart watches is Gadget Bridge which now has support for the Colmi P80 [1].

I first blogged about the Colmi P80 just over a month ago [2]. Now I have a couple of relatives using it happily on Android with the proprietary app. I couldn’t use it myself because I require more control over which apps have their notifications go to the watch than the Colmi app offers. Also I’m trying to move away from non-free software.

Yesterday the f-droid repository informed me that there was a new version of Gadget Bridge and the changelog indicated support for the Colmi P80 so I connected the P80 and disconnected the PineTime.

The first problem I noticed is that the vibrator on the P80 when on it’s maximum setting is much weaker than that on the PineTime, so weak that I often didn’t notice it. Maybe if I wore it for a few weeks I would teach myself to notice it but it should just be able to work with me on this. If it could be set to have multiple bursts of vibrating then that would work.

The next problem is that the P80 by default does not turn the screen on when there’s a notification and there seems to be no way to configure it to do so. I configured it to turn on when I raise my arm which can mostly work but that still relies on me noticing the vibration. Vibration and the screen light turning on would be harder to miss than vibration on it’s own.

I don’t recall seeing any review of smart watches ever that stated whether the screen would turn on when there’s a notification or whether the vibration was easy to notice.

One problem with both the PineTime (running InfiniTime) and the P80 is that when the screen is turned on (through gesture, pushing the button, or a notification in the case of the Pinetime) it is active for swiping to change the settings. I would like to have some other action required before settings can be changed so that if the screen turns on when I’m asleep my watch won’t brush against something and change it’s settings (which has happened).

It’s neat how Gadget Bridge supports talking to multiple smart watches at the same time. One useful feature for that would be to have different notification settings for each watch. I can imagine someone changing between a watch for jogging and a watch for work and wanting different settings.

Colmi P80 Problems

No authentication for Bluetooth connections.

Runs non-free software so no chance to fix things.

Battery life worse than PineTime (but not really bad).

Vibration weak.

Screen doesn’t turn on when notification is sent.

Conclusion

I’m using the PineTime as my daily driver again. While it works well enough for some people (even with the Colmi proprietary app) it doesn’t do what I want. It is however a good test device for FOSS work on the phone side, it has a decent feature set and is cheap.

Apart from lack of authentication and running non-free software the problems are mostly a matter of taste. Some people might think it’s great the way it works.

Links August 2025

Dimitri John Ledkov wrote an informative blog post about self encrypting disks and UEFI with Linux [1].

This Coffeezilla video highlights an interesting scam, run a broker for day traders and don’t execute trades, just be the counterparty for every trade and rely on day traders losing [2].

First Sight is a Dust SciFi short film that’s worthy of Black Mirror [3].

Apple published an interesting article about the operation of the Secure Enclave in iPhone, Mac, and all other significant Apple hardware [4].

Reese Waters made an amusing video about a conservative catfight that’s going on, nice to see horrible people attacking each other [5].

Chengyuan Ma wrote an informative summary of the history of the Great Firewall of China [6]. The V2Ray proxy has a nice feature set!

Interesting article about the JP Morgan Workplace Activity Data Utility (WADU) AI spyware system [7]. Corporate work is going to become even more horrible.

Veritasium has a great video about the history of vulcanised rubber and the potential for significant problems if rubber leaf blight spreads to other countries [8].

Jalopnik has an interesting article about how Reagan killed the safest car ever built [9].

David Brin wrote an interesting article “Tolkien Enemy of Progress” about fiction that justifies autocratic rule [10].

ZRAM and VMs

I’ve just started using zram for swap on VMs. The use of compression for swap in Linux apparently isn’t new, it’s been in the Linux kernel since version 3.2 (since 2012). But until recent years I hadn’t used it. When I started using Mobian (the Debian distribution for phones) zram was in the default setup, it basically works and I never needed to bother with it which is exactly what you want from such a technology. After seeing it’s benefits in Mobian I started using it on my laptops where it worked well.

Benefits of ZRAM

ZRAM means that instead of paging data to storage it is compressed to another part of RAM. That means no access to storage which is a significant benefit if storage is slow (typical for phones) or if storage wearing out is a problem.

For servers you typically have SSDs that are fast and last for significant write volumes, for example the 120G SSDs referenced in my blog post about swap (not) breaking SSD [1] are running well in my parents’ PC because they outlasted all the other hardware connected to them and 120G isn’t usable for anything more demanding than my parents use nowadays. Those are Intel 120G 2.5″ DC grade SATA SSDs. For most servers ZRAM isn’t a good choice as you can just keep doing IO on the SSDs for years.

A server that runs multiple VMs is a special case because you want to isolate the VMs from each other. Support for quotas for storage IO in Linux isn’t easy to configure while limiting the number of CPU cores is very easy. If a system or VM using ZRAM for swap starts paging excessively the bottleneck will be CPU, this probably isn’t going to be great on a phone with a slow CPU but on a server class CPU it will be less of a limit. Whether compression is slower or faster than SSD is a complex issue but it will definitely be just a limit for that VM. When I setup a VM server I want to have some confidence that a DoS attack or configuration error on one VM isn’t going to destroy the performance of other VMs. If the VM server has 4 cores (the smallest VM server I run) and no VM has more than 2 cores then I know that the system can still run adequately if half the CPU performance is being wasted.

Some servers I run have storage limits that make saving the disk space for swap useful. For servers I run in Hetzner (currently only one server but I have run up to 6 at various times in the past) the storage is often limited, Hetzner seems to typically have storage that is 8* the size of RAM so if you have many VMs configured with the swap that they might need in the expectation that usually at most one of them will be actually swapping then it can make a real difference to usable storage. 5% of storage used for swap files isn’t uncommon or unreasonable.

Big Servers

I am still considering the implications of zram on larger systems. If I have a ML server with 512G of RAM would it make sense to use it? It seems plausible that a system might need 550G of RAM and zram could make the difference between jobs being killed with OOM and the jobs just completing. The CPU overhead of compression shouldn’t be an issue as when you have dozens of cores in the system having one or two used for compression is no big deal. If a system is doing strictly ML work there will be a lot of data that can’t be compressed, so the question is how much of the memory is raw input data and the weights used for calculations and how much is arrays with zeros and other things that are easy to compress.

With a big server nothing less than 32G of swap will make much difference to the way things work and if you have 32G of data being actively paged then the fastest NVMe devices probably won’t be enough to give usable performance. As zram uses one “stream” per CPU code if you have 44 cores that means 44 compression streams which should handle greater throughput. I’ll write another blog post if I get a chance to test this.

Dell T320 H310 RAID and IT Mode

The Problem

Just over 2 years ago my Dell T320 server had a motherboard failure [1]. I recently bought another T320 that had been gutted (no drives, PSUs, or RAM) and put the bits from my one in it.

I installed Debian and the resulting installation wouldn’t boot, I tried installing with both UEFI and BIOS modes with the same result. Then I realised that the disks I had installed were available even though I hadn’t gone through the RAID configuration (I usually make a separate RAID-0 for each disk to work best with BTRFS or ZFS). I tried changing the BIOS setting for SATA disks between “RAID” and “AHCI” modes which didn’t change things and realised that the BIOS setting in question probably applies to the SATA connector on the motherboard and that the RAID card was in “IT” mode which means that each disk is seen separately.

If you are using ZFS or BTRFS you don’t want to use a RAID-1, RAID-5, or RAID-6 on the hardware RAID controller, if there are different versions of the data on disks in the stripe then you want the filesystem to be able to work out which one is correct. To use “IT” mode you have to flash a different unsupported firmware on the RAID controller and then you either have to go to some extra effort to make it bootable or have a different device to boot from.

The Root Causes

Dell has no reason to support unusual firmware on their RAID controllers. Installing different firmware on a device that is designed for high availability is going to have some probability of data loss and perhaps more importantly for Dell some probability of customers returning hardware during the support period and acting innocent about why it doesn’t work. Dell has a great financial incentive to make it difficult to install Dell firmware on LSI cards from other vendors which have equivalent hardware as they don’t want customers to get all the benefits of iDRAC integration etc without paying the Dell price premium.

All the other vendors have similar financial incentives so there is no official documentation or support on converting between different firmware images. Dell’s support for upgrading the Dell version is pretty good, but it aborts if it sees something different.

The Attempts

I tried following the instructions in this document to flash back to Dell firmware [2]. This document is about the H310 RAID card in my Dell T320 AKA a “LSI SAS 9211-8i”. The sas2flash.efi program didn’t seem to do anything, it returned immediately and didn’t give an error message.

This page gives a start of how to get inside the Dell firmware package but doesn’t work [3]. It didn’t cover the case where sasdupie aborts with an error because it detects the current version as “00.00.00.00” not something that the upgrade program is prepared to upgrade from. But it’s a place to start looking for someone who wants to try harder at this.

This forum post has some interesting information, I gave up before trying it, but it may be useful for someone else [4].

The Solution

Dell tower servers have as a standard feature an internal USB port for a boot device. So I created a boot image on a spare USB stick and installed it there and it then loads the kernel and mounts the filesystem from a SATA hard drive. Once I got that working everything was fine. The Debian/Trixie installer would probably have allowed me to install an EFI device on the internal USB stick as part of the install if I had known what was going to happen.

The system is now fully working and ready to sell. Now I just need to find someone who wants “IT” mode on the RAID controller and hopefully is willing to pay extra for it.

Whatever I sell the system for it seems unlikely to cover the hours I spent working on this. But I learned some interesting things about RAID firmware and hopefully this blog post will be useful to other people, even if only to discourage them from trying to change firmware.