Thursday, 21 March 2019

The names have changed, the mistakes are the same

I decided not to stay on the monthly IT department catch-up call this afternoon.  Luckily I had plenty to keep me occupied a few minutes later.

> "Hey, does anyone know why the site's down?"
< "It's not down", "Ooh, it's slow, what's that about"
Long story short, somebody updated a security group config so one of the apps couldn't reach its cache.  A few years ago it would have been a firewall, or iptables, or ipchains; now we're in the cloud.

Before we could get around to identifying and fixing that, we had a bigger problem - users started to see server errors instead of just slow responses and timeouts.

Second long story short - there was an expired SSL certificate on another back-end service.
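
For what it's worth, this kind of expiry is easy to catch ahead of time. Here is a minimal sketch in Python of the sort of check that could run on a schedule. It is not what we actually run: the function names are mine, and it assumes the service is reachable over TLS on the given port and presents a certificate that the default trust store accepts.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate's notAfter time.

    not_after is the string format used by ssl.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means the
    certificate has already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port and return the certificate's notAfter string.

    Note: with the default context, a certificate that has already
    expired fails the handshake with an ssl.SSLError, so this is only
    useful for warning before the expiry date, not after it.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `days_until_expiry(fetch_not_after("backend.example.internal"))` (a made-up hostname) could then raise an alert when the answer drops below, say, 30.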

This is 2019, and we're still making the same types of mistake as I used to see in the 1990s.