Really slow ssh uncovered by a simple tcpdump
Here's another story about knowing exactly where to hit something with a hammer. This one is also from my days working tech support. One thing about web hosting with a sufficiently large group of customers is that you will see everything sooner or later. The real question is whether you learn from it and improve.
Anyway, this one came to me as a random escalation type question from someone working on a ticket. The tech mentioned that server number such and such "takes forever to connect to". I noticed it would send a quick SSH banner but had a relatively slow password prompt. As I couldn't log in very quickly, I asked if it was being beaten to death by some process. The other tech said that it had a 0 load average.
As I waited, he added that typing something like "foo" would give "f", "o" and "o" with about 3 seconds worth of delay between characters. He threw in that the port had already been checked, and was running at 100 Mbps full duplex, and that datacenter techs had swapped its cable. Moreover, the switch port was free of collisions.
I guess I started up tcpdump at some point on my workstation to figure out just what was going on, and that's when I saw something odd.
13:52:32.436749 IP a.b.c.d.22 > z.y.x.w.52585: P 24353:24401(48) ack 2736 win 1 <nop,nop,timestamp 543811750 2526691683>
That just looked ... wrong. I couldn't be 100% sure, but that "win 1" looked dubious and gave me something to check on once I got on the machine and finally got a shell running. A couple of minutes later, I had my answer.
Kernel IP routing table Destination Gateway Genmask Flags MSS Window irtt Iface xx.y.zzz.128 0.0.0.0 255.255.255.192 U 0 0 0 eth0 aa.bbb.0.0 0.0.0.0 255.255.224.0 U 0 0 0 eth1 169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth0 aa.ccc.0.0 aa.bbb.0.1 255.255.0.0 UG 0 0 0 eth1 0.0.0.0 xx.y.zzz.129 0.0.0.0 UG 0 1 0 eth0
Check that out. Someone or something crammed an actual value in there for the window field! I've never seen that used before. I added a host route back to my machine which would "win" over default and opened a new connection. It was nice and quick, just like it should have been all along. That confirmed it for me, so I dropped and re-added the default route (since 'route change' didn't twiddle window settings), dropped my temporary host route, and that was it.
We weren't able to figure out how it got there, like a rogue kickstart script, or some really clueless tech, or something else entirely. I added it to my home-grown ticket index and went on with life. That way, there'd be some chance of someone possibly finding it when it would come up in the future.