TSIG Error From SSSD

A common error when using the sssd daemon to authenticate via Active Directory on Linux seems to be:

sssd[$PID]: ; TSIG error with server: tsig verify failure

This is from sssd launching the command “nsupdate -g” to do dynamic DNS updates. It is possible to specify the DNS server in /etc/sssd/sssd.conf but that will only be used AFTER the default servers have been attempted, so it seems impossible to stop this error from happening. It doesn’t appear to do any harm as the correct server is discovered and used eventually. The commands piped to the nsupdate command will be something like:

server $SERVERIP
realm $DOMAIN
update delete $HOSTNAME.$DOMAIN. in A
update add $HOSTNAME.$DOMAIN. 3600 in A $HOSTIP
send
update delete $HOSTNAME.$DOMAIN. in AAAA
send
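
To reproduce the update by hand (for example to see which server gives the TSIG error) the same kind of input can be generated and piped to nsupdate. The following is only a sketch: the function name and the example addresses are made up for illustration, and it assumes that a valid Kerberos ticket for the host is available when nsupdate is run.

```shell
#!/bin/bash
# nsupdate_input is a hypothetical helper: it prints the same kind of
# input that sssd pipes to "nsupdate -g".
# Arguments: DNS server IP, domain, short host name, host IPv4 address.
nsupdate_input() {
    local server=$1 domain=$2 host=$3 ip=$4
    cat <<EOF
server $server
realm $domain
update delete $host.$domain. in A
update add $host.$domain. 3600 in A $ip
send
EOF
}

# Example (needs a Kerberos ticket, so it is left commented out):
# nsupdate_input 192.0.2.53 example.com myhost 192.0.2.10 | nsupdate -g
```

Running it against each DNS server in turn should show which of them produce the error.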

AMT/MEBX on Debian

I’ve just been playing with Intel’s Active Management Technology (AMT) [1] which is also known as Management Engine BIOS Extension (MEBX).

Firstly a disclaimer, using this sort of technology gives remote access to your system at a level that allows in some ways overriding the OS. If this gets broken then you have big problems. Also all the code that matters is non-free. Please don’t comment on this post saying that AMT is bad, take it as known that it has issues and that people are forced to use it anyway.

I tested this out on a HP Z420 workstation. The first thing is to enable AMT via Intel “MEBX”, the default password is “admin”. On first use you are compelled to set a new password which must be 8+ characters containing upper and lower case, number, and punctuation characters.

The Debian package “amtterm” (which needs the package “libsoap-lite-perl“) has basic utilities for AMT. The amttool program connects to TCP port 16992 and the amtterm program connects to TCP port 16994. Note that these programs seem a little rough, you can get Perl errors (as opposed to deliberate help messages) if you enter bad command-line parameters. They basically work but could do with some improvement.

If you use DHCP for the IP address the DHCP hostname will be “DESKTOP-$AssetID” and you can find the IP address by requesting an alert be sent to the sysadmin.

Here are some examples of amttool usage:

# get AMT info
AMT_PASSWORD="$PASS" amttool $IP
# reset the system and redirect BIOS messages to serial over lan
AMT_PASSWORD="$PASS" amttool $IP reset bios
# access serial over lan console
amtterm -p "$PASS" $IP

The following APT configuration enables the Ubuntu package wsmancli which had some features not in any Debian packages last time I checked.

deb http://us.archive.ubuntu.com/ubuntu/ bionic-updates universe
deb http://us.archive.ubuntu.com/ubuntu/ bionic universe

This Cyberciti article has information on accessing KVM over AMT [2], I haven’t tried to do that yet.

Feedburner Seems to be Dying

Many years ago Feedburner was a useful service. It proxied the RSS feed of your blog and gave you analytics of what happened with it. Now feeds using Feedburner randomly give HTTP error 404s. The Feedburner Twitter account is inactive and recommends that people Tweet at Google instead. It seems that Google wants to get rid of the service and random 404s probably aren’t a high priority for them.

I’ve just gone through the config for Planet Linux Australia [1] and changed as many Feedburner URLs as possible to direct feed URLs. I did this by loading the Feedburner feed, getting the URL for the site, and then guessing the feed URL (usually just appending “/feed” to the domain name).
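
The guessing step can be sketched as a small shell function. The function name and the example URL are made up, and the “/feed” suffix is only the common case described above, so the result still has to be checked by loading it.

```shell
#!/bin/bash
# guess_feed_url is a hypothetical helper: given the site URL taken from a
# Feedburner feed, keep the scheme and host and append "/feed".
guess_feed_url() {
    local url=$1 scheme rest host
    scheme=${url%%://*}
    rest=${url#*://}
    host=${rest%%/*}
    printf '%s://%s/feed\n' "$scheme" "$host"
}

guess_feed_url "https://blog.example.com/2020/01/some-post/"
# prints https://blog.example.com/feed
```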

I recommend that everyone abandon Feedburner, it’s not reliable enough to use and doesn’t seem to have any active support.

Hangouts Replacement

Google is currently in the process of killing Hangouts. Last year Hangouts was quite a nice IM system with integrated video chat and voice calling. Now they have decided to kill it and replace it with “Google Chat” and “Google Meet” both of which are integrated with the Gmail app on Android. To start getting people off the old platform they have disabled video and audio chats with more than 2 people in Hangouts. To do a video call you have to use Meet which has a worse user interface and isn’t integrated with text chat, so if in a text discussion someone says “let’s have a video call” you have to open a new app. Meet also doesn’t appear to have a facility to notify group members that someone has joined a group call so it’s required that Chat (or something else) is used to tell people they can join Meet.

Many of my relatives use Hangouts because they are forced to have it installed on their Android phones and because it worked quite well. Now it doesn’t work well and will soon be going away. So another option is needed.

I’m considering Matrix as a replacement. Matrix has a good feature set and is being worked on a lot. The video conferencing is through a connection to a Jitsi server and is well integrated giving functionality more like Hangouts than Chat/Meet.

For the LUV Matrix server the URL https://luv.asn.au/.well-known/matrix/client has the following contents:

{
  "m.homeserver": {
    "base_url": "https://luv.asn.au"
  },
  "jitsi": {
    "preferredDomain": "jitsi.perthchat.org"
  },
  "im.vector.riot.jitsi": {
    "preferredDomain": "jitsi.perthchat.org"
  }
}

This specifies the Jitsi server to be used for chats started from that Matrix server. The PerthChat.org people seem to be leading the way for self hosted Matrix in Australia. Note that other people shouldn’t link to their Jitsi server without discussing it with them first. I only included real data because it’s published on the web so there’s no point in keeping it secret.

The Flounder free software users’ group [1] uses Matrix a lot. We will probably discuss Matrix at the next meeting on Saturday.

There is also Element Call [2] which is apparently more integrated with Matrix (and also newer and possibly buggier). Jitsi works and we can change to a different service easily enough at a later time.

Installing NextCloud

NextCloud and OwnCloud History

Some time ago I tried OwnCloud and it wasn’t a positive experience for me. Since that time I’ve got a server with a much faster CPU and a faster Internet connection, and the NextCloud code is newer and running on a newer version of PHP. I didn’t make good notes so I’m not sure which factors were most responsible for having a better experience this time. According to the NextCloud Wikipedia page [1] the fork of NextCloud from the OwnCloud base happened in 2016, so it’s obviously been a while since I tried it; it was probably long before 2016.

Recently the BBC published an interesting article on “Turnover contagion”, which is when one resignation can trigger many more [2]. It is interesting to read in the context of OwnCloud losing critical staff after one key developer resigned.

I mentioned OwnCloud in a 2012 blog post about Liberty and Mobile Phones [3], since then I haven’t done well at achieving those goals. A few days ago I decided to try NextCloud and found it a much better experience than I recall OwnCloud being in the past.

Installation

I installed NextCloud on an Oracle Cloud ARM VM (see my previous blog post about the Oracle Cloud Free Tier [4]).

This CloudCone article on installing NextCloud on Debian 10 (Buster) covers the basics well [5].

Here is the NextCloud URL for downloading the PHP files (a large ZIP archive) [6]. You have to extract it to where Apache is configured to have its webroot and then run “chown -R www-data nextcloud/lib/private/Log nextcloud/config nextcloud/apps” (or if you use php-fpm then chown it to the user for that). NextCloud recommend having all of the NextCloud files owned by www-data, but that’s just a bad idea; allowing it to rewrite some of its program files is bad, allowing it to rewrite all of them is worse.

For my installation I used the Apache modules macro, rewrite, ssl, php7.4, and headers (this is more about how I configure Apache than about NextCloud). Also I edited /etc/php/7.4/apache2/php.ini and changed memory_limit to 512M (the default of 128M is not enough). I’m currently only testing it, for production use I would use php-fpm and run it under its own UID so that it can’t interact with other PHP apps.

After that it was just a matter of visiting the configuration URL and giving it the details of the database etc.

After setting it up the command “php -d memory_limit=512M occ app:install richdocumentscode_arm64” when run from the root of the NextCloud installation installs the Collabora components for editing LibreOffice documents in NextCloud. This is the command for the ARM64 architecture, I presume the command for other architectures is similar.

Conclusion

NextCloud is very usable, it has a decent feature set built in and the option to download modules such as the components for editing LibreOffice files on the web is useful. But I am hesitant to install things that require the sort of access it requires. I think it would be better if there was a documented and supported way of installing things and then locking them down so that at runtime it can only write to data files not any program files or configuration files. It would also be better if it was packaged for Debian and had the Debian update process for security fixes. I can imagine many people installing it, forgetting to update it, and ending up with insecure systems.

Netflix and IPv6

It seems that Netflix has an ongoing issue of not working well with IPv6, apparently they have some sort of region checking code that doesn’t correctly identify IPv6 prefixes. To fix this I wrote the following script to make a small zone file with only A records for Netflix and no AAAA records. The $OUT.header file just has the SOA record for my fake netflix.com domain.

#!/bin/bash

OUT=/etc/bind/data/netflix.com
HEAD=$OUT.header

cp "$HEAD" "$OUT"
dig -t a www.netflix.com @8.8.8.8|sed -n -e "s/^.*IN/www IN/p"|grep '[0-9]$' >> "$OUT"
dig -t a android.prod.cloud.netflix.com @8.8.8.8|sed -n -e "s/^.*IN/android.prod.cloud IN/p"|grep '[0-9]$' >> "$OUT"
/usr/sbin/rndc reload > /dev/null
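
A variation on the same idea is to use “dig +short”, which prints only the answer data, and keep only the lines that look like IPv4 addresses. This is a sketch rather than what the script above does; the function name is made up and the addresses used to test it are documentation addresses, not real Netflix ones.

```shell
#!/bin/bash
# a_records is a hypothetical helper: it reads "dig +short" output on stdin
# and prints one A record per IPv4 address, using $1 as the record name.
# CNAME targets (which end in a dot) are dropped by the grep.
a_records() {
    grep -E '^[0-9]+(\.[0-9]+){3}$' | sed -e "s/^/$1 IN A /"
}

# Example (needs network access, so it is left commented out):
# dig +short -t a www.netflix.com @8.8.8.8 | a_records www >> "$OUT"
```

Matching on the whole line avoids depending on the column layout of the full dig output.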

Update

I updated this post to add a line for android.prod.cloud.netflix.com which is the address used by Android devices.

Internode NBN with Arris CM8200 on Debian

I’ve recently signed up for Internode NBN while using the Arris CM8200 device supplied by Optus (previously used for a regular phone service). I took the configuration mostly from Dean’s great blog post on the topic [1]. One thing I changed was the /etc/network/interfaces configuration, I used the following:

# VLAN ID 2 for Internode's NBN HFC.
auto eth1.2
iface eth1.2 inet manual
  vlan-raw-device eth1

auto nbn
iface nbn inet ppp
    pre-up /bin/ip link set eth1.2 up
    provider nbn

There is no need to have a section for eth1 when you have a section for eth1.2.

IPv6

IPv6 for only one system

With a line in /etc/ppp/options containing only “ipv6 ,” you get an IPv6 address automatically for the ppp0 interface after starting pppd.

IPv6 for your lan

Internode has documented how to configure the WIDE DHCPv6 client to get an IPv6 “prefix” (subnet) [2]. Just install the wide-dhcpv6-client package and put your interface names in a copy of the Internode example config and that works. That gets you a /64 assigned to your local Ethernet. Here’s an example of /etc/wide-dhcpv6/dhcp6c.conf:

interface ppp0 {
    send ia-pd 0;
    script "/etc/wide-dhcpv6/dhcp6c-script";
};

id-assoc pd {
    prefix-interface br0 {
        sla-id 0;
        sla-len 8;
    };
};

For providing addresses to other systems on your LAN they recommend radvd version 1.1 or greater, Debian/Bullseye will ship with version 2.18. Here is an example /etc/radvd.conf that will work with it. It seems that you have to manually (or with a script) set the value to use in place of “xxxx:xxxx:xxxx:xxxx” from the value that is assigned to eth0 (or whichever interface you are using) by the wide-dhcpv6-client.

interface eth0 { 
        AdvSendAdvert on;
        MinRtrAdvInterval 3; 
        MaxRtrAdvInterval 10;
        prefix xxxx:xxxx:xxxx:xxxx::/64 { 
                AdvOnLink on; 
                AdvAutonomous on; 
                AdvRouterAddr on; 
        };
};
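
Setting the prefix with a script could look something like the following sketch. The function names are made up, and it assumes that the address on the interface is displayed with its first four groups written out (if “::” falls inside the first four groups the helper gives up rather than guessing).

```shell
#!/bin/bash
# prefix64 is a hypothetical helper: it prints the first four groups of an
# IPv6 address such as 2001:db8:1234:5600::1/64. It fails when "::" falls
# inside the first four groups, as that form would need full expansion.
prefix64() {
    local addr=${1%/*} p
    p=$(printf '%s\n' "$addr" | cut -d: -f1-4)
    case "$p" in
        *::*|*:) return 1 ;;
    esac
    printf '%s\n' "$p"
}

# radvd_conf prints a radvd.conf like the one above for interface $1
# and /64 prefix $2.
radvd_conf() {
    cat <<EOF
interface $1 {
        AdvSendAdvert on;
        MinRtrAdvInterval 3;
        MaxRtrAdvInterval 10;
        prefix $2::/64 {
                AdvOnLink on;
                AdvAutonomous on;
                AdvRouterAddr on;
        };
};
EOF
}

# Example use as root (commented out as it rewrites /etc/radvd.conf):
# addr=$(ip -6 -o addr show dev eth0 scope global | awk '{print $4; exit}')
# radvd_conf eth0 "$(prefix64 "$addr")" > /etc/radvd.conf && systemctl restart radvd
```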

Either the configuration of the wide dhcp client or radvd removes the default route from ppp0, so you need to run a command like “ip -6 route add default dev ppp0” to put it back. Probably having “ipv6 ,” is the wrong thing to do when using wide-dhcpv6-client and radvd.

On a client machine with bridging I needed to have “net.ipv6.conf.br0.accept_ra=2” in /etc/sysctl.conf to allow it to accept router advertisement messages on the interface (in this case eth0), for machines without bridging I didn’t need that.

Firewalling

The default model for firewalling nowadays seems to be using NAT and only configuring specific ports to be forwarded to machines on the LAN. With IPv6 on the LAN every system can directly communicate with the rest of the world which may be a bad thing. The following lines in a firewall script will drop all inbound packets that aren’t in response to packets that are sent out. This will give an equivalent result to the NAT firewall people are used to and you can always add more rules to allow specific ports in.

ip6tables -A FORWARD -i ppp+ -m state --state ESTABLISHED,RELATED -j ACCEPT
ip6tables -A FORWARD -i ppp+ -j DROP
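
As an example of allowing a specific port in, here is a sketch that accepts established traffic, lets one TCP port through to one host, and drops everything else. The function name, the host address, and the port are placeholders. Setting IP6TABLES to “echo” gives a dry run that only prints the rule arguments.

```shell
#!/bin/bash
# A sketch of a forwarding policy with one inbound exception. IP6TABLES can
# be set to "echo" for a dry run; the host address and port are placeholders.
IP6TABLES=${IP6TABLES:-ip6tables}

forward_rules() {
    local host=$1 port=$2
    $IP6TABLES -A FORWARD -i ppp+ -m state --state ESTABLISHED,RELATED -j ACCEPT
    # the exception has to come before the final DROP rule
    $IP6TABLES -A FORWARD -i ppp+ -p tcp -d "$host" --dport "$port" -j ACCEPT
    $IP6TABLES -A FORWARD -i ppp+ -j DROP
}

# Dry run that only prints the arguments:
# IP6TABLES=echo forward_rules 2001:db8:1234:5600::10 22
```

Rules are checked in order, so anything added after the DROP rule will never match.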

Wifi Performance on Linux

Wifi usually just works. In the past I haven’t had to worry much about performance as for home use things have always been bearable and at work it’s never been my job so I just file a bug report with the relevant people when things go wrong. But a few years ago I had some problems.

For my home network I got a free Wifi AP which wasn’t performing well.

My AP supported 802.11 modes b/g or g/n (b, g, and n are slow, medium, and fast speeds). I initially had the AP running in b/g mode because I had an 802.11b USB wifi device that I used. When I replaced that with one that did 802.11g I tried changing the AP to g/n mode but performance was even worse on my laptop (although quite good on phones) so I switched back.

For phones it appeared to work well giving 54Mb/s while on my laptop (a second hand Thinkpad X1 Carbon) it was giving 11Mb/s at best and often much less than that. The best demonstration of problems was to start transferring a large file while pinging a system on the LAN the AP was connected to. Usually it would give ping times of 1s or more, sometimes 5s+ ping times. While this was happening the “Invalid misc” count increased rapidly, often by more than 100 per second.
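
The “Invalid misc” count is part of the iwconfig output, so the rate of increase can be watched with a small loop. This is a sketch: the function names are made up and wlan0 is an assumed interface name.

```shell
#!/bin/bash
# invalid_misc is a hypothetical helper: it reads iwconfig output on stdin
# and prints the "Invalid misc" counter.
invalid_misc() {
    sed -n -e 's/.*Invalid misc:\([0-9]*\).*/\1/p'
}

# Print the increase once per second; wlan0 is an assumed interface name.
watch_invalid_misc() {
    local iface=${1:-wlan0} prev cur
    prev=$(iwconfig "$iface" | invalid_misc)
    while sleep 1; do
        cur=$(iwconfig "$iface" | invalid_misc)
        echo "$((cur - prev)) per second"
        prev=$cur
    done
}
```

Running watch_invalid_misc during a large file transfer gives a number that can be compared between channels.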

The results of Google searches suggest that “Invalid misc” is due to interference and recommend changing the channel. My AP had been on channel 1 which had performed poorly, channels 2-8 were ok, and channel 9 seemed reasonably good. As an aside trying all channels manually is not a good idea, it takes a lot of time and gives little useful data. After changing to channel 9 it still only gave about 500KB/s when transferring large files with ping times of about 100ms, but that’s a big improvement. I tried running “iwlist scanning” to scan the Wifi network for other APs, that showed that channel 1 was used a lot but didn’t make it clear what I should do other than that.

The next thing I tried was the Wifi Analyser app on Android [1] (which doesn’t work on my latest phone, I don’t know if it’s still being actively maintained, it will definitely work on older phones). That has a nice graph mode that shows which channels are used and how the frequencies spread and interfere with other channels. One thing I hadn’t realised before I looked at the graphs is that 802.11n uses 4 channels and interferes past that. If you have two 802.11n devices you don’t have much space left out of the 14 channels available. To make more space I configured the Wifi AP in my ADSL modem to 802.11b/g mode and assigned it a channel away from the others making 4 channels available with no interference.

After that iwconfig reported between 60 and 120Mb/s and I got consistent transfer rates over 1.5MB/s while ping times remained below 100ms.

The 5GHz frequency range is less congested. But at the time I didn’t feel like buying 5GHz equipment.

Since that time I have signed up with an ISP that had a good deal on a Wifi AP that had 5GHz. Now I have all my devices configured to use 5GHz or 2.4GHz depending on which they think is best. So there are fewer devices on 2.4GHz and the AP is configured for “20MHz channel width” in the 2.4GHz range (which means 802.11b/g).

Conclusion

802.11n seems to be a bad idea unless you run the only AP in an area. In a suburban area you will have 3 other houses broadcasting in your area and 802.11n is bad for everyone. The worst case scenario would be one person using 802.11n and interfering with everyone else’s 802.11g and then having everyone else turn on 802.11n to try and make things faster.

5GHz is less congested as most people run old hardware. It also has a shorter range which has the upside of getting less interference from other people. I’m considering installing 5GHz APs at both ends of my house and configuring all my new devices to not use 2.4GHz.

Wifi spectrum analysis software is much better than manual testing of channels or trying to deduce things from the output of “iwlist scanning”.

DNS, Lots of IPs, and Postal

I decided to start work on repeating the tests for my 2006 OSDC paper on Benchmarking Mail Relays [1] and discover how the last 15 years of hardware developments have changed things. There have been software changes in that time too, but nothing that compares with going from single core 32bit systems with less than 1G of RAM and 60G IDE disks to multi-core 64bit systems with 128G of RAM and SSDs. As an aside the hardware I used in 2006 wasn’t cutting edge and the hardware I’m using now isn’t either. In both cases it’s systems I bought second hand for under $1000. Pedants can think of this as comparing 2004 and 2018 hardware.

BIND

I decided to make some changes to reflect the increased hardware capacity and use 2560 domains and IP addresses, which gave the following errors as well as a startup time of a minute on a system with two E5-2620 CPUs.

May  2 16:38:37 server named[7372]: listening on IPv4 interface lo, 127.0.0.1#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.2.45#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.40.1#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.40.2#53
May  2 16:38:37 server named[7372]: listening on IPv4 interface eno4, 10.0.40.3#53
[...]
May  2 16:39:33 server named[7372]: listening on IPv4 interface eno4, 10.0.47.0#53
May  2 16:39:33 server named[7372]: listening on IPv4 interface eno4, 10.0.48.0#53
May  2 16:39:33 server named[7372]: listening on IPv4 interface eno4, 10.0.49.0#53
May  2 16:39:33 server named[7372]: listening on IPv6 interface lo, ::1#53
[...]
May  2 16:39:36 server named[7372]: zone localhost/IN: loaded serial 2
May  2 16:39:36 server named[7372]: all zones loaded
May  2 16:39:36 server named[7372]: running
May  2 16:39:36 server named[7372]: socket: file descriptor exceeds limit (123273/21000)
May  2 16:39:36 server named[7372]: managed-keys-zone: Unable to fetch DNSKEY set '.': not enough free resources
May  2 16:39:36 server named[7372]: socket: file descriptor exceeds limit (123273/21000)

The first thing I noticed is that a default configuration of BIND with 2560 local IPs (when just running in the default recursive mode) takes a minute to start and needed to open over 100,000 file handles. BIND also had some errors in that configuration which led to it not accepting shutdown requests. I filed Debian bug report #987927 [2] about this. One way of dealing with the errors in this situation on Debian is to edit /etc/default/named and put in the following line to allow BIND access to many file handles:

OPTIONS="-u bind -S 150000"

But the best thing to do for BIND when there are many IP addresses that aren’t going to be used for DNS service is to put a directive like the following in the BIND configuration to specify the IP address or addresses that are used for the DNS service:

listen-on { 10.0.2.45; };

I have just added the listen-on and listen-on-v6 directives to one of my servers with about a dozen IP addresses. While 2560 IP addresses is an unusual corner case it’s not uncommon to have dozens of addresses on one system.

dig

When doing tests of Postfix for relaying mail I noticed that mail was being deferred with DNS problems (the error was “Host or domain name not found. Name service error for name=a838.example.com type=MX: Host not found, try again”). I tested the DNS lookups with dig which failed with errors like the following:

dig -t mx a704.example.com
socket.c:1740: internal_send: 10.0.2.45#53: Invalid argument
socket.c:1740: internal_send: 10.0.2.45#53: Invalid argument
socket.c:1740: internal_send: 10.0.2.45#53: Invalid argument

; <<>> DiG 9.16.13-Debian <<>> -t mx a704.example.com
;; global options: +cmd
;; connection timed out; no servers could be reached

Here is a sample of the strace output from tracing dig:

bind(20, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
recvmsg(20, {msg_namelen=128}, 0)       = -1 EAGAIN (Resource temporarily unavailable)
write(4, "\24\0\0\0\375\377\377\377", 8) = 8
sendmsg(20, {msg_name={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.45")}, msg_namelen=16, msg_iov=[{iov_base="86\1 \0\1\0\0\0\0\0\1\4a704\7example\3com\0\0\17\0\1\0\0)\20\0\0\0\0\0\0\f\0\n\0\10's\367\265\16bx\354", iov_len=57}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = -1 EINVAL (Invalid argument)
write(2, "socket.c:1740: ", 15)         = 15
write(2, "internal_send: 10.0.2.45#53: Invalid argument", 45) = 45
write(2, "\n", 1)                       = 1
futex(0x7f5a80696084, FUTEX_WAIT_PRIVATE, 0, NULL) = 0
futex(0x7f5a80696010, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f5a8069809c, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f5a80698020, FUTEX_WAKE_PRIVATE, 1) = 1
sendmsg(20, {msg_name={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.0.2.45")}, msg_namelen=16, msg_iov=[{iov_base="86\1 \0\1\0\0\0\0\0\1\4a704\7example\3com\0\0\17\0\1\0\0)\20\0\0\0\0\0\0\f\0\n\0\10's\367\265\16bx\354", iov_len=57}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = -1 EINVAL (Invalid argument)
write(2, "socket.c:1740: ", 15)         = 15
write(2, "internal_send: 10.0.2.45#53: Invalid argument", 45) = 45
write(2, "\n", 1)

Ubuntu bug #1702726 claims that an insufficient ARP cache was the cause of dig problems [3]. At the time I encountered the dig problems I was seeing lots of kernel error messages “neighbour: arp_cache: neighbor table overflow” which I solved by putting the following in /etc/sysctl.d/mine.conf:

net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024
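
For a different scale the same settings can be generated from a single number. The following sketch keeps the 4:2:1 ratio between the three thresholds used above; the function name is made up and keeping that ratio is an assumption, not a kernel requirement.

```shell
#!/bin/bash
# gc_thresh_conf is a hypothetical helper: given a value for gc_thresh3 it
# prints sysctl settings that keep the 4:2:1 ratio used above.
gc_thresh_conf() {
    local max=$1
    echo "net.ipv4.neigh.default.gc_thresh3 = $max"
    echo "net.ipv4.neigh.default.gc_thresh2 = $((max / 2))"
    echo "net.ipv4.neigh.default.gc_thresh1 = $((max / 4))"
}

# Example: write a larger table size and load it (as root):
# gc_thresh_conf 8192 > /etc/sysctl.d/mine.conf && sysctl --system
# Compare the number of entries in use with the limit:
# ip -4 neigh show | wc -l
```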

Making that change (and having rebooted because I didn’t need to run the server overnight) didn’t entirely solve the problems. I have seen some DNS errors from Postfix since then but they are less common than before. When they happened I didn’t have that error from dig. At this stage I’m not certain that the ARP change fixed the dig problem although it seems likely (it’s always difficult to be certain that you have solved a race condition instead of made it less common or just accidentally changed something else to conceal it). But it is clearly a good thing to have a large enough ARP cache so the above change is probably the right thing for most people (with the possibility of changing the numbers according to the required scale). Also people having that dig error should probably check their kernel message log, if the ARP cache isn’t the cause then some other kernel networking issue might be related.

Preliminary Results

With Postfix I’m seeing around 24,000 messages relayed per minute with more than 60% CPU time idle. I’m not sure exactly how to count idle time when there are 12 CPU cores and 24 hyper-threads as having only 1 process scheduled for each pair of hyperthreads on a core is very different to having half the CPU cores unused. I ran my script to disable hyper-threads by telling the Linux kernel to disable each processor core that has the same core ID as another, it was buggy and disabled the second CPU altogether (better than finding this out on a production server). Going from 24 hyper-threads of 2 CPUs to 6 non-HT cores of a single CPU didn’t change the throughput and the idle time went to about 30%, so I have possibly halved the CPU capacity for these tasks by disabling all hyper-threads and one entire CPU which is surprising given that I theoretically reduced the CPU power by 75%. I think my focus now has to be on hyper-threading optimisation.
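
Core IDs are only unique within one CPU package, which is how matching on the core ID alone can take out an entire second CPU. The following is a sketch of matching on the package ID and core ID together. It is not the script mentioned above, the function name is made up, and it relies on the topology files that Linux provides under /sys/devices/system/cpu.

```shell
#!/bin/bash
# ht_siblings is a hypothetical helper: it prints the number of every CPU
# that shares a (package ID, core ID) pair with a lower-numbered CPU.
# $1 is the sysfs CPU directory, overridable so the logic can be tested.
ht_siblings() {
    local root=${1:-/sys/devices/system/cpu}
    local seen=" " dir n key
    for dir in $(ls -d "$root"/cpu[0-9]* | sort -V); do
        n=${dir##*/cpu}
        # skip CPUs without topology data
        [ -r "$dir/topology/core_id" ] || continue
        key="$(cat "$dir/topology/physical_package_id"):$(cat "$dir/topology/core_id")"
        case "$seen" in
            *" $key "*) echo "$n" ;;
            *) seen="$seen$key " ;;
        esac
    done
}

# To take the listed CPUs offline (as root, left commented out):
# for n in $(ht_siblings); do echo 0 > /sys/devices/system/cpu/cpu$n/online; done
```

Recent kernels also have /sys/devices/system/cpu/smt/control, and writing “off” to it disables hyper-threading without a script.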

Since 2006 the performance has gone from ~20 messages per minute on relatively commodity hardware to 24,000 messages per minute on server equipment that is uncommon for home use but which is also within range of home desktop PCs. I think that a typical desktop PC with a similar speed CPU, 32G of RAM and SSD storage would give the same performance. Moore’s Law (that transistor count doubles approximately every 2 years) is often misquoted as having performance double every 2 years. In this case more than 1024* the performance over 15 years means the performance doubling every 18 months. Probably most of that is due to SATA SSDs massively outperforming IDE hard drives but it’s still impressive.
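
The doubling arithmetic can be checked with a few lines of awk. This is just the calculation from the numbers above (about 20 messages per minute then, 24,000 now, 15 years apart) and the function name is made up.

```shell
#!/bin/bash
# doubling_period is a hypothetical helper.
# Arguments: old rate, new rate, number of years between them.
doubling_period() {
    awk -v lo="$1" -v hi="$2" -v years="$3" 'BEGIN {
        doublings = log(hi / lo) / log(2)
        printf "%.1f doublings, one every %.1f months\n", doublings, years * 12 / doublings
    }'
}

doubling_period 20 24000 15
# prints: 10.2 doublings, one every 17.6 months
```

That is slightly more than the 10 doublings of a 1024* increase, which matches the figure of about 18 months per doubling.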

Notes

I’ve been using example.com for test purposes for a long time, but RFC2606 specifies .test, .example, and .invalid as reserved top level domains for such things. On the next iteration I’ll change my scripts to use .test.

My current test setup has a KVM virtual machine running my bhm program to receive mail which is taking between 20% and 50% of a CPU core in my tests so far. While that is happening the kvm process is reported as taking between 60% and 200% of a CPU core, so kvm takes as much as 4* the CPU of the guest due to the virtual networking overhead, even though I’m using the virtio-net-pci driver (the most efficient form of KVM networking for emulating a regular ethernet card). I’ve also seen this in production with a virtual machine running a Tor relay node.

I’ve fixed a bug where Postal would try to send the SMTP quit command after encountering a TCP error which would cause an infinite loop and SEGV.

First Attempt at Gnocchi-Statsd

I’ve been investigating the options for tracking system statistics to diagnose performance problems. The idea is to track all sorts of data about the system (network use, disk IO, CPU, etc) and look for correlations at times of performance problems. DataDog is pretty good for this but expensive, it’s apparently based on or inspired by the Etsy Statsd. It’s claimed that the gnocchi-statsd is the best implementation of the protocol used by the Etsy Statsd, so I decided to install that.

I use Debian/Buster for this as that’s what I’m using for the hardware that runs KVM VMs. Here is what I did:

# it depends on a local MySQL database
apt -y install mariadb-server mariadb-client
# install the basic packages for gnocchi
apt -y install gnocchi-common python3-gnocchiclient gnocchi-statsd uuid

In the Debconf prompts I told it to “setup a database” and not to manage keystone_authtoken with debconf (because I’m not doing a full OpenStack installation).

This gave a non-working configuration as it didn’t configure the MySQL database for the [indexer] section and the sqlite database that was configured didn’t work for unknown reasons. I filed Debian bug #971996 about this [1]. To get this working you need to edit /etc/gnocchi/gnocchi.conf and change the url line in the [indexer] section to something like the following (where the password is taken from the [database] section).

url = mysql+pymysql://gnocchi-common:PASS@localhost:3306/gnocchidb

To get the statsd interface going you have to install the gnocchi-statsd package and edit /etc/gnocchi/gnocchi.conf to put a UUID in the resource_id field (the Debian package uuid is good for this). I filed Debian bug #972092 requesting that the UUID be set by default on install [2].
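
Setting the UUID can be scripted. The following sketch assumes that the config file already has a resource_id line (possibly commented out) and that no other section has an option of the same name, so check the result by hand. The function name is made up.

```shell
#!/bin/bash
# set_resource_id is a hypothetical helper: it replaces the (possibly
# commented out) resource_id line in the given config file with $2.
set_resource_id() {
    local conf=$1 id=$2
    sed -i -E "s/^#? *resource_id *=.*/resource_id = $id/" "$conf"
}

# Example (as root), using the uuid program from the Debian package "uuid":
# set_resource_id /etc/gnocchi/gnocchi.conf "$(uuid)"
```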

Here’s an official page about how to operate Gnocchi [3]. The main thing I got from this was that the following commands need to be run from the command-line (I ran them as root in a VM for test purposes but would do so with minimum privs for a real deployment).

gnocchi-api
gnocchi-metricd

To communicate with Gnocchi you need the gnocchi-api program running, which uses the uwsgi program to provide the web interface by default. It seems that this was written for a version of uwsgi different than the one in Buster. I filed Debian bug #972087 with a patch to make it work with uwsgi [4]. Note that I didn’t get to the stage of an end to end test, I just got it to basically run without error.

After getting “gnocchi-api” running (in a terminal not as a daemon as Debian doesn’t seem to have a service file for it), I ran the client program “gnocchi” and then gave it the “status” command which failed (presumably due to the metrics daemon not running), but at least indicated that the client and the API could communicate.

Then I ran the “gnocchi-metricd” and got the following error:

2020-10-12 14:59:30,491 [9037] ERROR    gnocchi.cli.metricd: Unexpected error during processing job
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/gnocchi/cli/metricd.py", line 87, in run
    self._run_job()
  File "/usr/lib/python3/dist-packages/gnocchi/cli/metricd.py", line 248, in _run_job
    self.coord.update_capabilities(self.GROUP_ID, self.store.statistics)
  File "/usr/lib/python3/dist-packages/tooz/coordination.py", line 592, in update_capabilities
    raise tooz.NotImplemented
tooz.NotImplemented

At this stage I’ve had enough of gnocchi. I’ll give the Etsy Statsd a go next.

Update

Thomas has responded to this post [5]. At this stage I’m not really interested in giving Gnocchi another go. There’s still the issue of the indexer database which should be different from the main database somehow and sqlite (the config file default) doesn’t work.

I expect that if I was to persist with Gnocchi I would encounter more poorly described error messages from the code which either don’t have Google hits when I search for them or have Google hits to unanswered questions from 5+ years ago.

The Gnocchi systemd config files are in different packages to the programs, this confused me and I thought that there weren’t any systemd service files. I had expected that installing a package with a daemon binary would also get the systemd unit file to match.

The cluster features of Gnocchi are probably really good if you need that sort of thing. But if you have a small instance (EG a single VM server) then it’s not needed. Also one of the original design ideas of the Etsy Statsd was that UDP was used because data could just be dropped if there was a problem. I think for many situations the same concept could apply to the entire stats service.

If the other statsd programs don’t do what I need then I may give Gnocchi another go.