The Department of Water and Power is doing work near the office, and over the weekend, there was a sustained power outage. I came in Monday to shrieking UPSes and had to power up the firewall and a few other machines. It was the normal stupid kind of stuff.
We have a few virtual servers out in “the cloud,” and we use point-to-point VPNs to make them seem local to our network. Those VPNs also needed restarting.
Through the course of the day, however, one VPN connection kept unceremoniously disconnecting. Looking at logs on the various servers was unenlightening. Everything was running normally, other than the surprise disconnects.
In the evenings, I’ve been watching the old Grenada TV/Jeremy Brett Sherlock Holmes series, so I had to apply Holmes’ deductive process. The virtual servers had experienced no changes except being disconnected, so I needed to focus on the firewall. The firewall had experienced no change, except being restarted. What could have happened?
I finally found a configuration that was incorrect (it was a netmask that was insufficiently restrictive, allowing devices not on the VPN to collide with VPN IP addresses). I fixed the netmask, and the VPN has been up and stable ever since.
But how could this be? It had been running properly literally for years. It had to be something to do with the power outage. But if that had corrupted the configuration, it wouldn’t have been a single IP netmask changing. “[W]hen you have eliminated the impossible, whatever remains, however improbable, must be the truth.” The bad configuration file could not have been in use.
The best theory is that the configuration file had been (accidentally?) modified at some point in the past, but never loaded. When the firewall was restarted, it loaded this modified configuration for the first time.