AMD Video Driver Issues

I have had some graphics hangs on my HP Z640 workstation, which always seem to happen after about 4 days of uptime. In one instance, running Debian kernel 6.16.12+deb14+1, I got the following kernel error:

kernel: amdgpu 0000:02:00.0: [drm] *ERROR* [CRTC:58:crtc-0] flip_done timed out

Then I got the following errors from kwin_wayland:

kwin_wayland_wrapper[19598]: kwin_wayland_drm: Pageflip timed out! This is a bug in the amdgpu kernel driver
kwin_wayland_wrapper[19598]: kwin_wayland_drm: Please report this at https://gitlab.freedesktop.org/drm/amd/-/issues
kwin_wayland_wrapper[19598]: kwin_wayland_drm: With the output of 'sudo dmesg' and 'journalctl --user-unit plasma-kwin_wayland --boot 0'

In another instance running Debian kernel 6.12.48+deb13 I got the kernel errors at the bottom of the post (not in the RSS feed).

A Google result suggested putting the following on the kernel command line. It has the downside of increasing idle power use, but given that it’s a low power GPU (that I selected when I was using a system without a PCIe power cable) a bit of extra power use shouldn’t matter much. However it didn’t seem to change anything.

amdgpu.runpm=0 amdgpu.dcdebugmask=0x10
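
To confirm that the options actually reached the running kernel, something like the following can be run after a reboot. This is a minimal sketch: has_param is a helper name made up for this example, and it only inspects /proc/cmdline. On a Debian system that boots via GRUB the options would typically go in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub followed by running update-grub, but that is an assumption about the boot setup.

```shell
# has_param CMDLINE PARAM: succeed if PARAM is one of the space separated
# words in CMDLINE (hypothetical helper for this example)
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line the running kernel was booted with
cmdline=$(cat /proc/cmdline 2>/dev/null)
for p in amdgpu.runpm=0 amdgpu.dcdebugmask=0x10; do
    if has_param "$cmdline" "$p"; then
        echo "$p: active"
    else
        echo "$p: not on the kernel command line"
    fi
done
```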

I also tried the Debian/Unstable kernel 6.16.12-2, which didn’t work with my USB speakers, had problems with HDMI sound through my monitor, and still had the AMD GPU issues.

This all seemed to start with the PCIe errors being reported on this system [1]. So I’m now wondering if the PCIe errors were from the GPU not the socket/motherboard. The GPU in question is a Radeon RX560 4G which cost $246.75 back in about 2021 [2]. I could buy a new one of those on ebay for $149 or one of the faster AMD cards like Radeon RX570 that are around the same price. I probably have a Radeon R7 260X in my collection of spare parts that would do the job too (2G of VRAM is more than sufficient for my desktop computing needs).

Any suggestions on how I should proceed from here?


ZRAM and VMs

I’ve just started using zram for swap on VMs. The use of compression for swap in Linux apparently isn’t new, it’s been in the Linux kernel since version 3.2 (since 2012). But until recent years I hadn’t used it. When I started using Mobian (the Debian distribution for phones) zram was in the default setup. It basically works and I never needed to bother with it, which is exactly what you want from such a technology. After seeing its benefits in Mobian I started using it on my laptops where it worked well.

Benefits of ZRAM

ZRAM means that instead of paging data to storage it is compressed to another part of RAM. That means no access to storage which is a significant benefit if storage is slow (typical for phones) or if storage wearing out is a problem.
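
For readers who haven’t set it up before, here is a rough sketch of creating a zram swap device by hand. The choice of half of RAM for the device size, the zstd algorithm, and the swap priority of 100 are all assumptions for illustration, and half_ram_kib is a helper name made up for this example. The commands that need root are left as comments.

```shell
# half_ram_kib: read /proc/meminfo format on stdin, print half of MemTotal in KiB
half_ram_kib() {
    awk '/^MemTotal:/ { print int($2 / 2) }'
}

if [ -r /proc/meminfo ]; then
    size_kib=$(half_ram_kib < /proc/meminfo)
    echo "zram swap size would be ${size_kib} KiB"
fi

# The remaining steps need root and change system state, so they are shown
# as comments rather than run:
#   modprobe zram
#   dev=$(zramctl --find --size "${size_kib}KiB" --algorithm zstd)
#   mkswap "$dev"
#   swapon --priority 100 "$dev"
```

Giving the zram device a higher priority than any disk based swap means the kernel fills it first and only falls back to storage when it is full.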

For servers you typically have SSDs that are fast and last for significant write volumes, for example the 120G SSDs referenced in my blog post about swap (not) breaking SSD [1] are running well in my parents’ PC because they outlasted all the other hardware connected to them and 120G isn’t usable for anything more demanding than my parents use nowadays. Those are Intel 120G 2.5″ DC grade SATA SSDs. For most servers ZRAM isn’t a good choice as you can just keep doing IO on the SSDs for years.

A server that runs multiple VMs is a special case because you want to isolate the VMs from each other. Support for quotas for storage IO in Linux isn’t easy to configure while limiting the number of CPU cores is very easy. If a system or VM using ZRAM for swap starts paging excessively the bottleneck will be CPU. This probably isn’t going to be great on a phone with a slow CPU but on a server class CPU it will be less of a limit. Whether compression is slower or faster than SSD is a complex issue, but either way the limit will only apply to that VM. When I set up a VM server I want to have some confidence that a DoS attack or configuration error on one VM isn’t going to destroy the performance of other VMs. If the VM server has 4 cores (the smallest VM server I run) and no VM has more than 2 cores then I know that the system can still run adequately if half the CPU performance is being wasted.

Some servers I run have storage limits that make saving the disk space for swap useful. For servers I run in Hetzner (currently only one server but I have run up to 6 at various times in the past) the storage is often limited, Hetzner seems to typically have storage that is 8* the size of RAM so if you have many VMs configured with the swap that they might need in the expectation that usually at most one of them will be actually swapping then it can make a real difference to usable storage. 5% of storage used for swap files isn’t uncommon or unreasonable.

Big Servers

I am still considering the implications of zram on larger systems. If I have a ML server with 512G of RAM would it make sense to use it? It seems plausible that a system might need 550G of RAM and zram could make the difference between jobs being killed with OOM and the jobs just completing. The CPU overhead of compression shouldn’t be an issue as when you have dozens of cores in the system having one or two used for compression is no big deal. If a system is doing strictly ML work there will be a lot of data that can’t be compressed, so the question is how much of the memory is raw input data and the weights used for calculations and how much is arrays with zeros and other things that are easy to compress.
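
One way to get a number for how compressible the swapped data really is would be to read the mm_stat file for each zram device, where the first field is the original data size and the second is the compressed size. The following is a small sketch that prints the ratio for every zram device present; zram_ratio is a helper name made up for this example.

```shell
# zram_ratio: read one mm_stat line on stdin and print the ratio of
# original data size (field 1) to compressed data size (field 2)
zram_ratio() {
    awk '$2 > 0 { printf "%.2f\n", $1 / $2 }'
}

for dev in /sys/block/zram*; do
    # skip when there is no zram device or it hasn't been set up
    [ -r "$dev/mm_stat" ] || continue
    echo "${dev##*/}: $(zram_ratio < "$dev/mm_stat")"
done
```

A ratio close to 1 would suggest that the workload is mostly incompressible data such as weights, in which case zram is unlikely to help much.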

With a big server nothing less than 32G of swap will make much difference to the way things work and if you have 32G of data being actively paged then the fastest NVMe devices probably won’t be enough to give usable performance. As zram uses one “stream” per CPU core, if you have 44 cores that means 44 compression streams which should handle greater throughput. I’ll write another blog post if I get a chance to test this.

8k Video Cards

I previously blogged about getting an 8K TV [1]. Now I’m working on getting 8K video out for a computer that talks to it. I borrowed an NVidia RTX A2000 card which according to its specs can do 8K [2] with a mini-DisplayPort to HDMI cable rated at 8K but on both Windows and Linux the two highest resolutions on offer are 3840*2160 (regular 4K) and 4096*2160 which is strange and not useful.

The various documents on the A2000 differ on whether it has DisplayPort version 1.4 or 1.4a. According to the DisplayPort Wikipedia page [3] both versions 1.4 and 1.4a have a maximum of HBR3 speed and the difference is what version of DSC (Display Stream Compression [4]) is in use. DSC apparently causes no noticeable loss of quality for movies or games but can be bad for text. According to the DisplayPort Wikipedia page version 1.4 can do 8K uncompressed at 30Hz or 24Hz with high dynamic range. So this should be able to work.
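
As a rough sanity check of those numbers, the pixel data rate can be compared with what HBR3 carries. HBR3 is 8.1 Gbit/s per lane over 4 lanes, which after 8b/10b encoding leaves 25.92 Gbit/s for data. The sketch below ignores blanking intervals, so real modes need somewhat more than it prints, and it assumes 24 bits per pixel for standard colour and 30 bits per pixel for HDR. The function name pixel_gbps is made up for this example.

```shell
# pixel_gbps WIDTH HEIGHT HZ BITS_PER_PIXEL: pixel data rate in Gbit/s,
# ignoring blanking intervals (so real links need a bit more)
pixel_gbps() {
    awk -v w="$1" -v h="$2" -v hz="$3" -v bpp="$4" \
        'BEGIN { printf "%.2f\n", w * h * hz * bpp / 1e9 }'
}

# HBR3: 4 lanes * 8.1 Gbit/s * 8/10 encoding efficiency = 25.92 Gbit/s usable
hbr3=25.92

for mode in "30 24" "24 30" "60 24"; do
    set -- $mode
    rate=$(pixel_gbps 7680 4320 "$1" "$2")
    echo "8K at ${1}Hz with ${2} bits per pixel: ${rate} Gbit/s (HBR3 carries ${hbr3})"
done
```

The 30Hz and 24Hz HDR cases both come to a little under 24 Gbit/s, which fits within HBR3 before blanking is counted, while 60Hz needs about twice that and so can’t work without DSC.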

My theories as to why it doesn’t work are:

  • NVidia specs lie
  • My 8K cable isn’t really an 8K cable
  • Something weird happens converting DisplayPort to HDMI
  • The video card can only handle refresh rates for 8K that don’t match supported input for the TV

To get some more input on this issue I posted on Lemmy, here is the Lemmy post [5]. I signed up to lemmy.ml because it was the first one I found that seemed reasonable and was giving away free accounts, I haven’t tried any others and can’t review it but it seems to work well enough and it’s free. It’s described as “A community of privacy and FOSS enthusiasts, run by Lemmy’s developers” which is positive, I recommend that everyone who’s into FOSS create an account there or some other Lemmy server.

My Lemmy post was about what video cards to buy. I was looking at the Gigabyte RX 6400 Eagle 4G as a cheap card from a local store that does 8K, it also does DisplayPort 1.4 so might have the same issues, also apparently FOSS drivers don’t support 8K on HDMI because the people who manage HDMI specs are jerks. It’s a $200 card at MSY and a bit less on ebay so it’s an amount I can afford to risk on a product that might not do what I want, but it seems to have a high probability of getting the same result. The NVidia cards have the option of proprietary drivers which allow using HDMI and there are cards with DisplayPort 1.4 (which can do 8K@30Hz) and HDMI 2.1 (which can do 8K@50Hz). So HDMI is a better option for some cards just based on card output and has the additional benefit of not needing DisplayPort to HDMI conversion.

The best option apparently is the Intel cards which do DisplayPort internally and convert to HDMI in hardware which avoids the issue of FOSS drivers for HDMI at 8K. The Intel Arc B580 has nice specs [6]: HDMI 2.1a and DisplayPort 2.1 output, 12G of RAM, and it’s faster than low end cards like the RX 6400. But the local computer store price is $470 and the ebay price is a bit over $400. If it turns out to not do what I need it still will be a long way from the worst way I’ve wasted money on computer gear. But I’m still hesitating about this.

Any suggestions?

Creating a Micro Users’ Group

Fosdem had a great lecture Building an Open Source Community One Friend at a Time [1]. I recommend that everyone who is involved in the FOSS community watches this lecture to get some ideas.

For some time I’ve been periodically inviting a few friends to visit for lunch, chat about Linux, maybe do some coding, and watch some anime between coding. It seems that I have accidentally created a micro users’ group.

LUGs were really big in the mid to late 90s and still quite vibrant in the early 2000s. But they seem to have decreased in popularity even before Covid19 and since Covid19 a lot of people have stopped attending large meetings to avoid health risks. I think that a large part of the decline of users’ groups has been due to the success of YouTube. Being able to choose from thousands of hours of lectures about computers on YouTube is a disincentive to spending the time and effort needed to attend a meeting with content that’s probably not your first choice of topic. A formal meeting where someone you don’t know has arranged a lecture might not have a topic that’s really interesting to you. Having lunch with a couple of friends and watching a YouTube video that one of your friends assures you is really good is something more people will find interesting.

In recent times homeschooling [2] has become more widely known. The same factors that allow learning about computers at home also make homeschooling easier. The difference between the traditional LUG model of having everyone meet at a fixed time for a lecture and a micro LUG of a small group of people having an informal meeting is similar to the difference between traditional schools and homeschooling.

I encourage everyone to create their own micro LUG. All you have to do is choose a suitable time and place and invite some people who are interested. Have a BBQ in a park if the weather is good, meet at a cafe or restaurant, or invite people to visit you for lunch on a weekend.

Kitty and Mpv

6 months ago I switched to Kitty for terminal emulation [1]. So far there’s only been one thing that I couldn’t effectively do with Kitty that I did with Konsole in the past, that is watching a music video in 1/4 of the screen while using the rest for terminals. I could set up multiple Kitty windows taking up the rest of the screen but I wanted to keep using a single Kitty with multiple terminals and just have mpv go over one of them. Kitty supports its own graphics protocol so “mpv --vo=kitty” works but took 6* the CPU power in my tests which isn’t good for a laptop.

For X11 there’s a --ontop option for mpv that does what you expect, but that doesn’t work on Wayland. Not working is mostly Wayland’s fault as there is a long tail of less commonly used graphical operations that work in X11 but aren’t yet implemented in Wayland. I have filed a Debian bug report about this, the mpv man page should note that it’s only going to work on X11 on Linux.

I have discovered a solution to that, in the KDE settings there’s a “Window Rules” section, I created an entry for “Window class” exactly matching “mpv” and then added a rule “Keep above other windows” and set it for “force” and “yes”.

After that I can just resize mpv to occlude just one terminal and keep using the rest. Also one noteworthy thing with this is that it makes mpv go on top of the KDE taskbar, which can be a feature.

BOINC and Idle Users

The BOINC distributed computing client in Debian (Bookworm and previous releases) can check the idle time via the X11 protocol and run GPU jobs when the interactive user is idle, so the user gets GPU power for graphics when they need it and when it’s idle BOINC uses it. This doesn’t work for Wayland and unfortunately no-one has written a Wayland equivalent of xprintidle (which shows the number of milliseconds that the X11 session has been idle).

In the Debian bug system there is bug #813728 about a message every second due to failed attempts to find X11 idle time [1]. On my main workstation with Wayland it logs “Authorization required, but no authorization protocol specified”.

There is also bug #775125 about BOINC not detecting mouse movements [2], I added to it about the issues with Wayland. There’s the package swayidle in Debian that is designed to manage the screen-save process on Wayland, below is an example of how to use it to display output after 5 seconds and 10 seconds of idle time.

swayidle -w timeout 5 'echo 5' timeout 10 'echo 10' resume 'echo resume' before-sleep 'echo before-sleep'

The code for swayidle has only 7 comments and isn’t easy to read. I looked into writing a Wayland equivalent of xprintidle but it would take more work than I’m prepared to invest in it. So it seems to me that the best option might be to have BOINC receive SIGUSR1 and SIGUSR2 for the start and stop of idle time and then have scripts call xprintidle, swayidle, a wrapper for “w” (for systems without graphics) or other methods. To run swayidle as root you can set WAYLAND_DISPLAY=../$USER_ID/wayland-0.
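
Until BOINC has something like that signal interface, a similar effect might be possible by having swayidle call the existing boinccmd tool. Note that this swaps in “boinccmd --set_gpu_mode” in place of the SIGUSR1/SIGUSR2 idea and I haven’t tested it. The idle_cmd helper and the 300 second timeout are made up for this sketch, and it assumes that boinccmd is run somewhere it can authenticate to the BOINC client. The sketch only prints the command so that it can be checked before use.

```shell
# idle_cmd SECONDS: print a swayidle command line that allows BOINC GPU work
# after SECONDS of idle time and stops it again when the user returns
idle_cmd() {
    printf "swayidle -w timeout %d 'boinccmd --set_gpu_mode always' resume 'boinccmd --set_gpu_mode never'\n" "$1"
}

# Print the command for 5 minutes of idle, to be run inside the Wayland session
idle_cmd 300
```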


Strange X11 Grabbing

A couple of days ago I upgraded my home server from Debian/Bullseye to Debian/Testing (soon to be Bookworm). Since then KDE sessions on that system have had problems of locking the input queue, the mouse can move and mouse-over events work but clicking the mouse or pressing the keyboard does nothing. Various web pages suggested that the xdotool program (in the xdotool package in Debian) can address this. The problem is apparently programs “grabbing” the input and not letting it go.

The command “xdotool key XF86LogGrabInfo” causes the xorg server to dump information on its “grabs”. After running that command I looked in /var/log/Xorg.0.log and found that active grabs were only held by /usr/bin/kwin_x11 and /usr/bin/kglobalaccel5. So it seems like a KDE issue. Other systems running X11 with Debian/Testing (such as the laptop I’m using to write this blog post) don’t have the problem, so it could be something related to the KDE configuration of the account used on that system.

The command “xdotool key XF86Ungrab” is supposed to break out of such a grab, but for me it didn’t do so.

On the same system running KDE with Wayland works fine in this regard. Does Wayland do things differently and not allow this “grabbing” to block everything? Does KDE have an X11 specific bug? Is there a race condition that just gets triggered by the speed of Xorg on that system but not by the slightly different timings of Wayland? I might never find out.

I previously wrote about problems with Wayland/KDE on laptops [1]. Fortunately this bug happened to occur on a server so inability to reconfigure monitors isn’t necessarily a deal breaker, although being unable to use some of the high-DPI settings for the 4K monitor it has may be an issue. It will be really annoying if some of the laptop configurations I support get this grabbing problem. But since that time I have learned of the kscreen-doctor command which is included in Debian/Testing and can do some of the necessary things, it doesn’t have a man page so you have to run “kscreen-doctor -h” for documentation.

Hyper Threading on the E5-2696v3

I just did some quick tests of hyper-threading on my new E5-2696v3 CPU. I compiled the Linux 6.0.10 kernel with and without hyper-threading enabled. Here are the times for “make -j36 bzImage” and “make -j36 modules” with HT enabled:

real    2m26.540s
user    55m25.121s
sys     9m56.443s

real    10m57.374s
user    309m21.531s
sys     58m1.070s

Here are the times for “make -j18 bzImage” and “make -j18 modules” with HT disabled:

real    2m40.501s
user    31m35.295s
sys     5m43.523s

real    11m39.313s
user    170m46.840s
sys     31m37.756s

With HT enabled that’s about 9.5% faster for bzImage and 6.4% faster for modules (comparing the elapsed “real” times).
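
For anyone who wants to check the arithmetic or repeat it with their own timings, here is a small sketch that works out the speedup from the “real” times above. The function names are made up for this example, and the last digit can differ slightly depending on how the times are rounded before dividing.

```shell
# to_seconds TIME: convert the "XmY.YYYs" format printed by time into seconds
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# faster_pct SLOW FAST: how much faster FAST is than SLOW, as a percentage
faster_pct() {
    awk -v slow="$(to_seconds "$1")" -v fast="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", (slow / fast - 1) * 100 }'
}

echo "bzImage: $(faster_pct 2m40.501s 2m26.540s)% faster with HT"
echo "modules: $(faster_pct 11m39.313s 10m57.374s)% faster with HT"
```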

So for a performance boost that’s between 5% and 10% I get greater exposure to kernel security issues and more difficulty tracking CPU time. That doesn’t seem like a good trade-off so I’ve put the “nosmt” kernel command-line option back.

Wayland in Bookworm

We are getting towards the freeze for Debian/Bookworm so the current state of packages isn’t going to change much before the release. Bugs will get fixed but missing features will mostly be missing until the next release.

Anarcat wrote an excellent blog post about using Wayland with the Sway window manager [1]. It seems pretty good if you like Sway, but I like KDE and plan to continue using it. Several of the important utility programs referenced by Anarcat won’t run with KDE/Wayland and give errors such as “Compositor doesn’t support wlr-output-management-unstable-v1”. One noteworthy thing about Wayland is that the window manager and the equivalent to the X server are the same program so KDE has different Wayland code than Sway and doesn’t support some features. The lack of these features limits my ability to manage multiple displays and therefore makes KDE/Wayland unsuitable for many laptop uses. My work laptop runs Ubuntu 22.04 with KDE and wouldn’t correctly display on the pair of monitors on a USB-C dock that’s the standard desktop configuration where I work.

In my previous post about Wayland [2] I wrote about converting 2 of my systems to Wayland. Since then I had changed them back to X because of problems with supporting strange monitor configurations on laptops and also due to the KDE window manager crashing occasionally which terminates the session in Wayland but merely requires restarting the window manager in X. More recently I had a problem with the GPU in my main workstation sometimes not being recognised by the system (reporting no PCIe device), when I got a new one I couldn’t get X to work with the error “Cannot run in framebuffer mode. Please specify busIDs for all framebuffer devices” so I tried Wayland again. Now in the later stage of the Bookworm development process it seems that the problem with the KDE window manager crashing has been solved or mitigated and there is a new problem of the plasmashell process crashing. As I can restart plasmashell without logging out that’s much less annoying. So now my main workstation is running on Wayland with a slower GPU than I previously had while also giving a faster user experience so Wayland is providing a definite performance benefit.

Maybe for Trixie (the next release of Debian after Bookworm) we should have a release goal of having full Wayland support in all the major GUI systems.


DDC as a KVM Switch

With the recent resurgence in Covid19 I’ve been working from home a lot and using both my work laptop and personal PC on the same monitor. HDMI KVM switches start at $150 and I didn’t feel like buying one. So I wrote a script to change inputs on my monitor. The following script locks the session on the local machine and switches the monitor’s input to the other machine. I ran the command “ddcutil vcpinfo| grep Input” which shows that (on my monitor at least) 60 is the VCP for input. Then I ran the command “ddcutil getvcp 60” to get the current value and tried setting values sequentially to find the value for the other port.

Below is the script I’m using on one system, the other is the same but setting the different port via setvcp. The loginctl command is to lock the screen to prevent accidental keyboard or mouse input from messing anything up.

# lock the session, assumes that seat0 is the only session
loginctl lock-session $(loginctl list-sessions|grep "seat0 *$"|cut -c1-7)
# 0xf is DisplayPort, 0x11 is HDMI-1
ddcutil setvcp 60 0x11
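
A variation that could run unchanged on both machines is to read the current input and switch to the other one. This is only a sketch: it assumes that the “ddcutil getvcp 60” output contains the current value in the form “sl=0x11”, which is what I’d expect but may vary between ddcutil versions, and it uses the same two input values as the script above. The function names are made up for this example and it leaves out the session locking step.

```shell
# current_input: read "ddcutil getvcp 60" output on stdin, print the sl= value
current_input() {
    sed -n 's/.*sl=\(0x[0-9a-fA-F]*\).*/\1/p'
}

# other_input VALUE: map DisplayPort (0x0f) to HDMI-1 (0x11) and back
other_input() {
    case "$1" in
        0x0f|0xf) echo 0x11 ;;
        0x11)     echo 0x0f ;;
        *)        return 1 ;;
    esac
}

# only touch the monitor when ddcutil is installed
if command -v ddcutil >/dev/null 2>&1; then
    cur=$(ddcutil getvcp 60 | current_input)
    if new=$(other_input "$cur"); then
        ddcutil setvcp 60 "$new"
    else
        echo "unrecognised input value: $cur" >&2
    fi
fi
```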

For keyboard, mouse, and speakers I’m using a USB 2.0 hub that I can switch between computers. I idly considered getting a three-pole double-throw switch (four pole switches aren’t available at my local electronic store) to switch USB 2.0 as I only need to switch 3 of the 4 wires. But for the moment just plugging the hub into different systems is enough, I only do that a couple of times a day.