Shortly before 9AM this morning I discovered that the IP address for my mail server was not being routed, according to my logs the problem started shortly after midnight. It’s on a TPG ADSL connection, there is one IP address for the PPPOE link and 6 addresses in a /29 routed to it – one of the addresses in the /29 is for my mail server.
It wasn’t until 3PM that I was able to visit the server to sort the problem out. It turned out that the main IP address was working but the /29 wasn’t being routed to it. So TPG had somehow dropped the route from my routing table. I pinged all the addresses from a 3G broadband connection on my EeePC while running tcpdump on the server, and no packets for the /29 came through – but the IP address for the PPP link worked fine. I was even able to ssh in to the server once I knew the IP address of the ppp0 device – for future use I need to keep ALL IP addresses of all network gear on my EeePC not just the ones used for providing services.
So I phoned the helpdesk, and naturally they asked me inane questions. My patience extended to telling them the broadcast address etc that was being used on the Ethernet device (actually a bridge for Xen but I wasn’t going to confuse them). The system been power-cycled before I got there in the hope that it might fix the problem – so I could honestly answer the question “have you rebooted it” (usually I lie – rebooting systems to fix network problems is a Windows thing). But my patience started to run out when they asked me to check my DNS settings, I explained very clearly that my problem was that IP packets couldn’t get through and that I wasn’t using DNS and demanded that they fix it.
I didn’t get anyone technical to look at the problem until I firmly demanded that the help-desk operator test the routing by pinging my systems. The help-desk people don’t have Internet access so that actually testing the connection required escalating the issue. It seems that the algorithm used for help-desk people is to just repeatedly tell people to check various things on their own system, and that continues until the customer’s patience runs out. Either the customer goes away or makes requests firmly enough to get something done about it.
So their technician did some tests and proclaimed that there was no problem. While said tests were being done things started working, so obviously their procedure is to fix problems and then blame it on the customer. It is not plausible to believe that a problem in their network which had persisted for more than 15 hours would accidentally disappear during the 5 minute window that the technician was investigating the problem.
In the discussion that followed the help-desk operator tried to trick me into admitting that it was my fault. They claimed that because I had used multiple IP addresses I must have reconfigured my system and had therefore fixed a problem on my end, my response was “I HAVE A HEAP OF MACHINES HERE RUNNING ALL THE TIME, I USE WHICHEVER ONE I FEEL LIKE, I CHANGED NOTHING“. I didn’t mention that the machines in question are DomUs on the same Xen server, someone who doesn’t understand how ping works or what routing is wouldn’t have been able to cope with that.
I stated clearly several times that I don’t like being lied to. Either the help-desk operator was lying to me or their technician was lying to them. In either case they were not going to trick me – I know more about how the Internet works than they do.
TPG was unable to give me any assurance that such problems won’t happen again. The only thing I can be sure of is that when they lie they will stick to their story regardless of whether it works.