Shortly before 9AM this morning I discovered that the IP address for my mail server was not being routed; according to my logs the problem started shortly after midnight. The server is on a TPG ADSL connection, with one IP address for the PPPoE link and a /29 (six usable addresses) routed to it – one of the addresses in the /29 is used by my mail server.
It wasn’t until 3PM that I was able to visit the server to sort the problem out. It turned out that the main IP address was working but the /29 wasn’t being routed to it – TPG had somehow dropped the route from their routing tables. I pinged all the addresses from a 3G broadband connection on my EeePC while running tcpdump on the server, and no packets for the /29 came through – but the IP address for the PPP link worked fine. I was even able to ssh in to the server once I knew the IP address of the ppp0 device – for future reference I need to keep a record of ALL the IP addresses of my network gear on my EeePC, not just the ones used for providing services.
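For anyone wanting to repeat that sort of check, here is a minimal sketch of the external half of the test in Python – the PPP address and the /29 below are hypothetical placeholders, and it just shells out to ping each address while tcpdump runs on the server’s bridge to show whether anything arrives:

#!/usr/bin/env python3
# Minimal sketch: ping the PPP link address and every host in the /29 from
# an outside connection (e.g. a 3G link) while tcpdump runs on the server.
# The addresses below are hypothetical placeholders, not my real ones.
import ipaddress
import subprocess

PPP_ADDRESS = "203.0.113.10"        # hypothetical address of the ppp0 link
ROUTED_BLOCK = "198.51.100.24/29"   # hypothetical /29 routed to the link

def pingable(addr: str) -> bool:
    """Return True if a single ping to addr gets a reply within 2 seconds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    targets = [PPP_ADDRESS] + [str(h) for h in ipaddress.ip_network(ROUTED_BLOCK).hosts()]
    for addr in targets:
        print(f"{addr}: {'reply' if pingable(addr) else 'NO reply'}")

On the server side something like “tcpdump -n -i br0 net 198.51.100.24/29” (with the real bridge name and block) shows whether the pings are even reaching the Ethernet device – in my case they weren’t.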
So I phoned the helpdesk, and naturally they asked me inane questions. My patience extended to telling them the broadcast address etc that was being used on the Ethernet device (actually a bridge for Xen, but I wasn’t going to confuse them). The system had been power-cycled before I got there in the hope that it might fix the problem – so I could honestly answer the question “have you rebooted it” (usually I lie – rebooting systems to fix network problems is a Windows thing). But my patience started to run out when they asked me to check my DNS settings; I explained very clearly that my problem was that IP packets couldn’t get through and that I wasn’t using DNS, and demanded that they fix it.
I didn’t get anyone technical to look at the problem until I firmly demanded that the help-desk operator test the routing by pinging my systems. The help-desk people don’t have Internet access, so actually testing the connection required escalating the issue. It seems that the algorithm used by help-desk people is to just repeatedly tell people to check various things on their own systems, and that continues until the customer’s patience runs out – either the customer goes away or makes requests firmly enough to get something done about it.
So their technician did some tests and proclaimed that there was no problem. While said tests were being done things started working, so obviously their procedure is to fix problems and then blame them on the customer. It is not plausible that a problem in their network which had persisted for more than 15 hours would accidentally disappear during the 5 minute window in which the technician was investigating it.
In the discussion that followed the help-desk operator tried to trick me into admitting that it was my fault. They claimed that because I had used multiple IP addresses I must have reconfigured my system and had therefore fixed a problem on my end. My response was “I HAVE A HEAP OF MACHINES HERE RUNNING ALL THE TIME, I USE WHICHEVER ONE I FEEL LIKE, I CHANGED NOTHING”. I didn’t mention that the machines in question are DomUs on the same Xen server – someone who doesn’t understand how ping works or what routing is wouldn’t have been able to cope with that.
I stated clearly several times that I don’t like being lied to. Either the help-desk operator was lying to me or their technician was lying to them. In either case they were not going to trick me – I know more about how the Internet works than they do.
TPG was unable to give me any assurance that such problems won’t happen again. The only thing I can be sure of is that when they lie they will stick to their story regardless of whether it works.
“TPG Sucks” might be a better title — I often search for “$company sucks” to determine the quality of a company before I sign up with them.
I agree with the TPG verdict though – I have only had negative experiences with them.
Seriously, Internode…
“It is not plausible …”
One of the tests the ‘engineer’ ran may also have changed something in the routing that fixed the problem, without the ‘engineer’ realizing/understanding this.
Ivo: The first test when someone says “IP address X is not pingable” should be to ping IP address X. Once a lack of ping responses is noticed the engineer can’t be unaware of the problem.
Besides, if routine tests change the routing tables then they have some serious problems in their network configuration!
lol, they are a realllly cheap ISP and tech support isn’t really qualified to do anything other than walk nana through getting the blue ‘e’ on the wallpaper to open the email.
would be nice if there was a secret geek password (maybe “ping my fsking server ya knob”) to bypass 1st level support and skip to the lying liars who lie / fix connection :p
I’ve seen issues like this with proxy ARP before, where a connection out from a server with multiple IP addresses to a router caused the router to learn the IP and start working again, but where the router couldn’t seem to learn the IP from incoming packets despite everything seeming to be correct. No doubt I messed up some detail of that not-so-transparent bridge.
I wouldn’t rule out it being fixed by the use of diagnostic tools, even good old-fashioned ping. Sure, routers shouldn’t be that flaky…
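If that proxy ARP theory is right then just generating outbound traffic from each address should be enough for the router to re-learn the ARP entries. A rough sketch of that in Python – the local addresses and the outside target are hypothetical placeholders, and it assumes the addresses are already configured on the server:

#!/usr/bin/env python3
# Rough sketch: send one UDP packet sourced from each local address so the
# upstream router sees outbound traffic for every address and re-learns the
# corresponding ARP entries.  Addresses and target are hypothetical.
import socket

LOCAL_ADDRESSES = ["198.51.100.25", "198.51.100.26", "198.51.100.27"]
OUTSIDE_TARGET = ("192.0.2.1", 33434)   # any reachable outside host/port

for addr in LOCAL_ADDRESSES:
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind((addr, 0))                    # source the probe from this address
    s.sendto(b"arp-refresh", OUTSIDE_TARGET)
    s.close()
    print(f"sent probe from {addr}")

Whether the router actually relearns anything from that is another matter – it just exercises the “connection out from the server” case described above.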
me: It does seem reasonable that help-desk people should be trained to recognise skilled people. If someone starts talking about pings etc. then they know enough to talk to someone skilled. Also, if someone is paying for services that aren’t common (such as having extra IP addresses routed to an ADSL connection), it is an indication that someone other than a help-desk person should be assigned to the case.
Simon: I pinged the outside world from the server and pinged the server from the outside world. During the period when the IP addresses were unavailable, various mail servers and web clients around the world were trying to make connections, and the mail server had some messages in its queue that it was trying to deliver.
I can’t imagine how anything could have fixed this without something being changed at their end.
Let’s face it, it’s not exactly uncommon in the IT industry to fix things and then deny there was a problem (“maybe something went wrong on your end”). But when called on it, most people are smart enough to say “oh, I did have to restart a router to perform the tests, maybe that changed it” – that’s a way of admitting that the customer wasn’t at fault while also not admitting that they were lying when they claimed to have not done anything to fix it.