etbe - Russell Coker

Links January 2021

Krebs on Security has an informative article about web notifications and how they are being used for spamming and promoting malware [1]. He also includes links for how to permanently disable them. If nothing else clicking “no” on each new site that wants to send notifications is annoying.

Michael Stapelberg wrote an insightful posts about inefficiencies in the Debian development processes [2]. While I agree with most of his assessment of Debian issues I am not going to decrease my involvement in Debian. Of the issues he mentions the 2 that seem to have the best effort to reward ratio are improvements to mailing list archives (to ideally make it practical to post to lists without subscribing and read responses in the archives) and the issues of forgetting all the complexities of the development process which can be alleviated by better Wiki pages. In my Debian work I’ve contributed more to the Wiki in recent times but not nearly as much as I should.

Jacobin has an insightful article “Ending Poverty in the United States Would Actually Be Pretty Easy” [3].

Mark Brown wrote an interesting blog post about the Rust programming language [4]. He links to a couple of longer blog posts about it. Rust has some great features and I’ve been meaning to learn it.

Scientific America has an informative article about research on the spread of fake news and memes [5]. Something to consider when using social media.

Bruce Schneier wrote an insightful blog post on whether there should be limits on persuasive technology [6].

Jonathan Dowland wrote an interesting blog post about git rebasing and lab books [7]. I think it’s an interesting thought experiment to compare the process of developing code worthy of being committed to a master branch of a VCS to the process of developing a Ph.D thesis.

CBS has a disturbing article about the effect of Covid19 on people’s lungs [8]. Apparently it usually does more lung damage than long-term smoking and even 70%+ of people who don’t have symptoms of the disease get significant lung damage. People who live in heavily affected countries like the US now have to worry that they might have had the disease and got lung damage without knowing it.

Russ Allbery wrote an interesting review of the book “Because Internet” about modern linguistics [9]. The topic is interesting and I might read that book at some future time (I have many good books I want to read).

Jonathan Carter wrote an interesting blog post about CentOS Streams and why using a totally free OS like Debian is going to be a better option for most users [10].

Linus has slammed Intel for using ECC support as a way of segmenting the market between server and desktop to maximise profits [11]. It would be nice if a company made a line of Ryzen systems with ECC RAM support, but most manufacturers seem to be in on the market segmentation scam.

Russ Allbery wrote an interesting review of the book “Can’t Even” about millenials as the burnout generation and the blame that the corporate culture deserves for this [12].

PSI and Cgroup2

In the comments on my post about Load Average Monitoring [1] an anonymous person recommended that I investigate PSI. As an aside, why do I get so many great comments anonymously? Don’t people want to get credit for having good ideas and learning about new technology before others?

PSI is the Pressure Stall Information subsystem for Linux that is included in kernels 4.20 and above, if you want to use it in Debian then you need a kernel from Testing or Unstable (Bullseye has kernel 4.19). The place to start reading about PSI is the main Facebook page about it, it was originally developed at Facebook [2].

I am a little confused by the actual numbers I get out of PSI, while for the load average I can often see where they come from (EG have 2 processes each taking 100% of a core and the load average will be about 2) it’s difficult to work out where the PSI numbers come from. For my own use I decided to treat them as unscaled numbers that just indicate problems, higher number is worse and not worry too much about what the number really means.

With the cgroup2 interface which is supported by the version of systemd in Testing (and which has been included in Debian backports for Buster) you get PSI files for each cgroup. I’ve just uploaded version 1.3.5-2 of etbemon (package mon) to Debian/Unstable which displays the cgroups with PSI numbers greater than 0.5% when the load average test fails.

System CPU Pressure: avg10=0.87 avg60=0.99 avg300=1.00 total=20556310510
/system.slice avg10=0.86 avg60=0.92 avg300=0.97 total=18238772699
/system.slice/system-tor.slice avg10=0.85 avg60=0.69 avg300=0.60 total=11996599996
/system.slice/system-tor.slice/tor@default.service avg10=0.83 avg60=0.69 avg300=0.59 total=5358485146

System IO Pressure: avg10=18.30 avg60=35.85 avg300=42.85 total=310383148314
 full avg10=13.95 avg60=27.72 avg300=33.60 total=216001337513
/system.slice avg10=2.78 avg60=3.86 avg300=5.74 total=51574347007
/system.slice full avg10=1.87 avg60=2.87 avg300=4.36 total=35513103577
/system.slice/mariadb.service avg10=1.33 avg60=3.07 avg300=3.68 total=2559016514
/system.slice/mariadb.service full avg10=1.29 avg60=3.01 avg300=3.61 total=2508485595
/system.slice/matrix-synapse.service avg10=2.74 avg60=3.92 avg300=4.95 total=20466738903
/system.slice/matrix-synapse.service full avg10=2.74 avg60=3.92 avg300=4.95 total=20435187166

Above is an extract from the output of the loadaverage check. It shows that tor is a major user of CPU time (the VM runs a ToR relay node and has close to 100% of one core devoted to that task). It also shows that Mariadb and Matrix are the main users of disk IO. When I installed Matrix the Debian package told me that using SQLite would give lower performance than MySQL, but that didn’t seem like a big deal as the server only has a few users. Maybe I should move Matrix to the Mariadb instance. to improve overall system performance.

So far I have not written any code to display the memory PSI files. I don’t have a lack of RAM on systems I run at the moment and don’t have a good test case for this. I welcome patches from people who have the ability to test this and get some benefit from it.

We are probably about 6 months away from a new release of Debian and this is probably the last thing I need to do to make etbemon ready for that.

RISC-V and Qemu

RISC-V is the latest RISC architecture that’s become popular. It is the 5th RISC architecture from the University of California Berkeley. It seems to be a competitor to ARM due to not having license fees or restrictions on alterations to the architecture (something you have to pay extra for when using ARM). RISC-V seems the most popular architecture to implement in FPGA.

When I first tried to run RISC-V under QEMU it didn’t work, which was probably due to running Debian/Unstable on my QEMU/KVM system and there being QEMU bugs in Unstable at the time. I have just tried it again and got it working.

The Debian Wiki page about RISC-V is pretty good [1]. The instructions there got it going for me. One thing I wasted some time on before reading that page was trying to get a netinst CD image, which is what I usually do for setting up a VM. Apparently there isn’t RISC-V hardware that boots from a CD/DVD so there isn’t a Debian netinst CD image. But debootstrap can install directly from the Debian web server (something I’ve never wanted to do in the past) and that gave me a successful installation.

Here are the commands I used to setup the base image:

apt-get install debootstrap qemu-user-static binfmt-support debian-ports-archive-keyring

debootstrap --arch=riscv64 --keyring /usr/share/keyrings/debian-ports-archive-keyring.gpg --include=debian-ports-archive-keyring unstable /mnt/tmp http://deb.debian.org/debian-ports

I first tried running RISC-V Qemu on Buster, but even ls didn’t work properly and the installation failed.

chroot /mnt/tmp bin/bash
# ls -ld .
/usr/bin/ls: cannot access '.': Function not implemented

When I ran it on Unstable ls works but strace doesn’t work in a chroot, this gave enough functionality to complete the installation.

chroot /mnt/tmp bin/bash
# strace ls -l
/usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not implemented
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
/usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented
/usr/bin/strace: detach: waitpid(1602629): No child processes
/usr/bin/strace: Process 1602629 detached

When running the VM the operation was noticably slower than the emulation of PPC64 and S/390x which both ran at an apparently normal speed. When running on a server with equivalent speed CPU a ssh login was obviously slower due to the CPU time taken for encryption, a ssh connection from a system on the same LAN took 6 seconds to connect. I presume that because RISC-V is a newer architecture there hasn’t been as much effort made on optimising the Qemu emulation and that a future version of Qemu will be faster. But I don’t think that Debian/Bullseye will give good Qemu performance for RISC-V, probably more changes are needed than can happen before the freeze. Maybe a version of Qemu with better RISC-V performance can be uploaded to backports some time after Bullseye is released.

Here’s the Qemu command I use to run RISC-V emulation:

qemu-system-riscv64 -machine virt -device virtio-blk-device,drive=hd0 -drive file=/vmstore/riscv,format=raw,id=hd0 -device virtio-blk-device,drive=hd1 -drive file=/vmswap/riscv,format=raw,id=hd1 -m 1024 -kernel /boot/riscv/vmlinux-5.10.0-1-riscv64 -initrd /boot/riscv/initrd.img-5.10.0-1-riscv64 -nographic -append net.ifnames=0 noresume security=selinux root=/dev/vda ro -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-device,rng=rng0 -device virtio-net-device,netdev=net0,mac=02:02:00:00:01:03 -netdev tap,id=net0,helper=/usr/lib/qemu/qemu-bridge-helper

Currently the program /usr/sbin/sefcontext_compile from the selinux-utils package needs execmem access on RISC-V while it doesn’t on any other architecture I have tested. I don’t know why and support for debugging such things seems to be in early stages of development, for example the execstack program doesn’t work on RISC-V now.

RISC-V emulation in Unstable seems adequate for people who are serious about RISC-V development. But if you want to just try a different architecture then PPC64 and S/390 will work better.

[1] https://wiki.debian.org/RISC-V

Monopoly the Game

The Smithsonian Mag has an informative article about the history of the game Monopoly [1]. The main point about Monopoly teaching about the problems of inequality is one I was already aware of, but there are some aspects of the history that I learned from the article.

Here’s an article about using modified version of Monopoly to teach Sociology [2].

Maria Paino and Jeffrey Chin wrote an interesting paper about using Monopoly with revised rules to teach Sociology [3]. They publish the rules which are interesting and seem good for a class.

I think it would be good to have some new games which can teach about class differences. Maybe have an “Escape From Poverty” game where you have choices that include drug dealing to try and improve your situation or a cooperative game where people try to create a small business. While Monopoly can be instructive it’s based on the economic circumstances of the past. The vast majority of rich people aren’t rich from land ownership.

Planet Linux Australia

Linux Australia have decided to cease running the Planet installation on planet.linux.org.au. I believe that blogging is still useful and a web page with a feed of Australian Linux blogs is a useful service. So I have started running a new Planet Linux Australia on https://planet.luv.asn.au/. There has been discussion about getting some sort of redirection from the old Linux Australia page, but they don’t seem able to do that.

If you have a blog that has a reasonable portion of Linux and FOSS content and is based in or connected to Australia then email me on russell at coker.com.au to get it added.

When I started running this I took the old list of feeds from planet.linux.org.au, deleted all blogs that didn’t have posts for 5 years and all blogs that were broken and had no recent posts. I emailed people who had recently broken blogs so they could fix them. It seems that many people who run personal blogs aren’t bothered by a bit of downtime.

As an aside I would be happy to setup the monitoring system I use to monitor any personal web site of a Linux person and notify them by Jabber or email of an outage. I could set it to not alert for a specified period (10 mins, 1 hour, whatever you like) so it doesn’t alert needlessly on routine sysadmin work and I could have it check SSL certificate validity as well as the basic page header.

Weather and Boinc

I just wrote a Perl script to look at the Australian Bureau of Meteorology pages to find the current temperature in an area and then adjust BOINC settings accordingly. The Perl script (in this post after the break, which shouldn’t be in the RSS feed) takes the URL of a Bureau of Meteorology observation point as ARGV[0] and parses that to find the current (within the last hour) temperature. Then successive command line arguments are of the form “24:100” and “30:50” which indicate that at below 24C 100% of CPU cores should be used and below 30C 50% of CPU cores should be used. In warm weather having a couple of workstations in a room running BOINC (or any other CPU intensive task) will increase the temperature and also make excessive noise from cooling fans.

To change the number of CPU cores used the script changes /etc/boinc-client/global_prefs_override.xml and then tells BOINC to reload that config file. This code is a little ugly (it doesn’t properly parse XML, it just replaces a line of text) and could fail on a valid configuration file that wasn’t produced by the current BOINC code.

The parsing of the BoM page is a little ugly too, it relies on the HTML code in the BoM page – they could make a page that looks identical which breaks the parsing or even a page that contains the same data that looks different. It would be nice if the BoM published some APIs for getting the weather. One thing that would be good is TXT records in the DNS. DNS supports caching with specified lifetime and is designed for high throughput in aggregate. If you had a million IOT devices polling the current temperature and forecasts every minute via DNS the people running the servers wouldn’t even notice the load, while a million devices polling a web based API would be a significant load. As an aside I recommend playing nice and only running such a script every 30 minutes, the BoM page seems to be updated on the half hour so I have my cron jobs running at 5 and 35 minutes past the hour.

If this code works for you then that’s great. If it merely acts as an inspiration for developing your own code then that’s great too! BOINC users outside Australia could replace the code for getting meteorological data (or even interface to a digital thermometer). Australians who use other CPU intensive batch jobs could take the BoM parsing code and replace the BOINC related code. If you write scripts inspired by this please blog about it and comment here with a link to your blog post.

Continue reading Weather and Boinc

MPV vs Mplayer

After writing my post about VDPAU in Debian [1] I received two great comments from anonymous people. One pointed out that I should be using VA-API (also known as VAAPI) on my Intel based Thinkpad and gave a reference to an Arch Linux Wiki page, as usual Arch Linux Wiki is awesome and I learnt a lot of great stuff there. I also found the Debian Wiki page on Hardware Video Acceleration [2] which has some good information (unfortunately I had already found all that out through more difficult methods first, I should read the Debian Wiki more often.

It seems that mplayer doesn’t suppoer VAAPI. The other comment suggested that I try the mpv fork of Mplayer which does support VAAPI but that feature is disabled by default in Debian.

I did a number of tests on playing different videos on my laptop running Debian/Buster with Intel video and my workstation running Debian/Unstable with ATI video. The first thing I noticed is that mpv was unable to use VAAPI on my laptop and that VDPAU won’t decode VP9 videos on my workstation and most 4K videos from YouTube seem to be VP9. So in most cases hardware decoding isn’t going to help me.

The Wikipedia page about Unified Video Decoder [3] shows that only VCN (Video Core Next) supports VP9 decoding while my R7-260x video card [4] has version 4.2 of the Unified Video Decoder which doesn’t support VP9, H.265, or JPEG. Basically I need a new high-end video card to get VP9 decoding and that’s not something I’m interested in buying now (I only recently bought this video card to do 4K at 60Hz).

The next thing I noticed is that for my combination of hardware and software at least mpv tends to take about 2/3 the CPU time to play videos that mplayer does on every video I tested. So it seems that using mpv will save me 1/3 of the power and heat from playing videos on my laptop and save me 1/3 of the CPU power on my workstation in the worst case while sometimes saving me significantly more than that.

Conclusion

To summarise quite a bit of time experimenting with video playing and testing things: I shouldn’t think too much about hardware decoding until VP9 hardware is available (years for me). But mpv provides some real benefits right now on the same hardware, I’m not sure why.

SMART and SSDs

The Hetzner server that hosts my blog among other things has 2*256G SSDs for the root filesystem. The smartctl and smartd programs report both SSDs as in FAILING_NOW state for the Wear_Leveling_Count attribute. I don’t have a lot of faith in SMART. I run it because it would be stupid not to consider data about possible drive problems, but don’t feel obliged to immediately replace disks with SMART errors when not convenient (if I ordered a new server and got those results I would demand replacement before going live).

Doing any sort of SMART scan will cause service outage. Replacing devices means 2 outages, 1 for each device.

I noticed the SMART errors 2 weeks ago, so I guess that the SMART claims that both of the drives are likely to fail within 24 hours have been disproved. The system is running BTRFS so I know there aren’t any unseen data corruption issues and it uses BTRFS RAID-1 so if one disk has an unreadable sector that won’t cause data loss.

Currently Hetzner Server Bidding has ridiculous offerings for systems with SSD storage. Search for a server with 16G of RAM and SSD storage and the minimum prices are only 2E cheaper than a new server with 64G of RAM and 2*512G NVMe. In the past Server Bidding has had servers with specs not much smaller than the newest systems going for rates well below the costs of the newer systems. The current Hetzner server is under a contract from Server Bidding which is significantly cheaper than the current Server Bidding offerings, so financially it wouldn’t be a good plan to replace the server now.

Monitoring

I have just released a new version of etbe-mon [1] which has a new monitor for SMART data (smartctl). It also has a change to the sslcert check to search all IPv6 and IPv4 addresses for each hostname, makes freespace check look for filesystem mountpoint, and makes the smtpswaks check use latest swaks command-line.

For the new smartctl check there is an option to treat “Marginal” alert status from smartctl as errors and there is an option to name attributes that will be treated as marginal even if smartctl thinks they are significant. So now I have my monitoring system checking the SMART data on the servers of mine which have real hard drives (not VMs) and aren’t using RAID hardware that obscures such things. Also it’s not alerting me about the Wear_Leveling_Count on that particular Hetzner server.

[1] https://doc.coker.com.au/projects/etbe-mon/

Testing VDPAU in Debian

VDPAU is the Video Decode and Presentation API for Unix [1]. I noticed an error with mplayer “Failed to open VDPAU backend libvdpau_i965.so: cannot open shared object file: No such file or directory“, Googling that turned up Debian Bug #869815 [2] which suggested installing the packages vdpau-va-driver and libvdpau-va-gl1 and setting the environment variable “VDPAU_DRIVER=va_gl” to enable VPDAU.

The command vdpauinfo from the vdpauinfo shows the VPDAU capabilities, which showed that VPDAU was working with va_gl.

When mplayer was getting the error about a missing i915 driver it took 35.822s of user time and 1.929s of system time to play Self Control by Laura Branigan [3] (a good music video to watch several times while testing IMHO) on my Thinkpad Carbon X1 Gen1 with Intel video and a i7-3667U CPU. When I set “VDPAU_DRIVER=va_gl” mplayer took 50.875s of user time and 4.207s of system time but didn’t have the error.

It’s possible that other applications on my Thinkpad might benefit from VPDAU with the va_gl driver, but it seems unlikely that any will benefit to such a degree that it makes up for mplayer taking more time. It’s also possible that the Self Control video I tested with was a worst case scenario, but even so taking almost 50% more CPU time made it unlikely that other videos would get a benefit.

For this sort of video (640×480 resolution) it’s not a problem, 38 seconds of CPU time to play a 5 minute video isn’t a real problem (although it would be nice to use less battery). For a 1600*900 resolution video (the resolution of the laptop screen) it took 131 seconds of user time to play a 433 second video. That’s still not going to be a problem when playing on mains power but will suck a bit when on battery. Most Thinkpads have Intel video and some have NVidia as well (which has issues from having 2 video cards and from having poor Linux driver support). So it seems that the only option for battery efficient video playing on the go right now is to use a tablet.

On the upside, screen resolution is not increasing at a comparable rate to Moore’s law so eventually CPUs will get powerful enough to do all this without using much electricity.

Thinkpad Storage Problem

For a while I’ve had a problem with my Thinkpad X1 Carbon Gen 1 [1] where storage pauses for 30 seconds, it’s become more common recently and unfortunately everything seems to depend on storage (ideally a web browser playing a video from the Internet could keep doing so with no disk access, but that’s not what happens).

echo 3 > /sys/block/sda/device/timeout

I’ve put the above in /etc/rc.local which should make it only take 3 seconds to recover from storage problems instead of 30 seconds, the problem hasn’t recurred since that change so I don’t know if it works as desired. The Thinkpad is running BTRFS and no data is reported as being lost, so it’s not a problem to use it at this time.

The bottom of this post has an example of the errors I’m getting. A friend said that his searches for such errors shows people claiming that it’s a “SATA controller or cable” problem, which for a SATA device bolted directly to the motherboard means it’s likely to be a motherboard problem (IE more expensive to fix than the $289 I paid for the laptop almost 3 years ago).

A Lenovo forum discussion says that the X1 Carbon Gen1 uses an unusual connector that’s not M.2 SATA and not mSATA [2]. This means that a replacement is going to be expensive, $100US for a replacement from eBay when a M.2 SATA or NVMe device would cost about $50AU from my local store. It also means that I can’t use a replacement for anything else. If my laptop took a regular NVMe I’d buy one and test it out, if it didn’t solve the problem a spare NVMe device is always handy to have. But I’m not interested in spending $100US for a part that may turn out to be useless.

I bought the laptop expecting that there was nothing I could fix inside it. While it is theoretically possible to upgrade the CPU etc in a laptop most people find that the effort and expense makes it better to just replace the entire laptop. With the ultra light category of laptops the RAM is soldered on the motherboard and there are even less options for replacing things. So while it’s annoying that Lenovo didn’t use a standard interface for this device (they did so for other models in the range) it’s not a situation that I hadn’t expected.

Now the question is what to do next. The Thinkpad has a SD socket and micro SD cards are getting large capacities nowadays. If it won’t boot from a SD card I could boot from USB and then run from SD. So even if the internal SATA device fails entirely it should still be useful. I wonder if I could find a Thinkpad with a similar problem going cheap as I’m pretty sure that a Windows user couldn’t make any storage device usable the way I can with Linux.

I’ve looked at prices on auction sites but haven’t seen a Thinkpad X1 Carbon going for anywhere near the price I got this one. The nearest I’ve seen is around $350 for a laptop with significant damage.

While I am a little unhappy at Lenovo’s choice of interface, I’ve definitely got a lot more than $289 of value out of this laptop. So I’m not really complaining.

[315041.837612] ata1.00: status: { DRDY }
[315041.837613] ata1.00: failed command: WRITE FPDMA QUEUED
[315041.837616] ata1.00: cmd 61/20:48:28:1e:3e/00:00:00:00:00/40 tag 9 ncq dma 
16384 out
                         res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 
(timeout)                
[315041.837617] ata1.00: status: { DRDY }
[315041.837618] ata1.00: failed command: READ FPDMA QUEUED
[315041.837621] ata1.00: cmd 60/08:50:e0:26:84/00:00:00:00:00/40 tag 10 ncq 
dma 4096 in
                         res 40/00:01:00:00:00/00:00:00:00:00/e0 Emask 0x4 
(timeout)                
[315041.837622] ata1.00: status: { DRDY }
[315041.837625] ata1: hard resetting link
[315042.151781] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[315042.163368] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) 
succeeded
[315042.163370] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) 
filtered out
[315042.163372] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered 
out
[315042.183332] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) 
succeeded
[315042.183334] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) 
filtered out
[315042.183336] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered 
out
[315042.193332] ata1.00: configured for UDMA/133
[315042.193789] sd 0:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_SENSE
[315042.193791] sd 0:0:0:0: [sda] tag#10 Sense Key : Illegal Request [current] 
[315042.193793] sd 0:0:0:0: [sda] tag#10 Add. Sense: Unaligned write command
[315042.193795] sd 0:0:0:0: [sda] tag#10 CDB: Read(10) 28 00 00 84 26 e0 00 00 
08 00
[315042.193797] print_req_error: I/O error, dev sda, sector 8660704
[315042.193810] ata1: EH complete

etbe – Russell Coker

Archives

Categories

Links January 2021

PSI and Cgroup2

RISC-V and Qemu

Monopoly the Game

Planet Linux Australia

Weather and Boinc

MPV vs Mplayer

Conclusion

SMART and SSDs

Monitoring

Testing VDPAU in Debian

Thinkpad Storage Problem

Archives

Email and RSS

Archives

Categories

Tags

Conclusion

Monitoring

Archives

Email and RSS