Tuesday 29 April 2014

Automation considered harmful

Intro
I've had a particularly unproductive couple of days in the office so far this week.  Two quite separate projects that my current project depends on have broken in ways that prevent me from testing code changes or checking in merged content.

We are still investigating one of the problems, so I will focus here on the one that has been fully identified and resolved, saving the continuous integration issue for another post.

Background
I have a development machine which sits under my desk and acts as a local content management system server.  I use this to try out functionality and present demonstrations without fear of scheduled or unscheduled downtime.

To ensure that this system doesn't fall out of sync with new functionality that is being developed by another team, I periodically call on the automation magic of chef to obtain the latest binaries and configuration.

A week or so ago the chef update failed partway through.  This wasn't a major problem, as the application would still run; however, I was not in a position to identify the cause of the failure or how to fix it.

The problem
This week I decided to try the chef update again, as a colleague had agreed to address the earlier problem.  This seemed like a good idea at the time, but the update failed once more - and this time left the content management applications unresponsive.

"Here we go again," I thought - except now it was a higher-priority issue for me, as this week's development needed all of this to be running.

Thankfully the remote team responsible for managing the chef configuration had some availability to look into the issue.  Unfortunately the individual involved didn't seem to have enough context to appreciate my non-virtualized setup, so it was time for me to take another dig around the chef-managed installation myself.

Ultimately it came down to the usual diagnostic approach of checking the log files for the various applications involved.  One of the content management server processes failed to initialise with a duplicate column error.

Like many modern extensible software products, this particular content management system automatically manages the structure of an underlying relational database.  When new properties need to be represented, a developer can specify that in a configuration file - which will ultimately trigger an alteration to a database table.
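To make the mechanism concrete, here is a minimal sketch of that kind of config-driven schema synchronisation.  It is not the actual CMS code - the table, the property names, and the use of SQLite are all invented for illustration - but it shows the general shape: compare the declared properties against the live table, and add a column for anything new.

```python
import sqlite3

# Hypothetical property declarations, standing in for the CMS configuration
# file.  Names and types here are invented for illustration.
desired_properties = {"title": "TEXT", "publishDate": "TEXT", "authorName": "TEXT"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (id INTEGER PRIMARY KEY, title TEXT)")

# Read the columns that already exist (PRAGMA table_info row[1] is the name).
existing = {row[1] for row in conn.execute("PRAGMA table_info(content)")}

for name, sql_type in desired_properties.items():
    if name not in existing:
        # A newly declared property triggers an alteration to the table.
        conn.execute(f"ALTER TABLE content ADD COLUMN {name} {sql_type}")

columns = [row[1] for row in conn.execute("PRAGMA table_info(content)")]
print(columns)  # ['id', 'title', 'publishDate', 'authorName']
```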

Tracing back through the git commits revealed what was special about this duplicate column: it was actually an attempt to rename an existing column to a different, non-camel, casing.
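That failure mode is easy to reproduce in miniature.  Column names are case-insensitive in most SQL engines, so re-adding a property whose name differs only in casing is rejected as a duplicate - which is exactly the error the server process logged.  The snippet below uses SQLite and invented names as a stand-in for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (publishDate TEXT)")

# Attempt to add the "renamed" column, which differs only in casing.
try:
    conn.execute("ALTER TABLE content ADD COLUMN publishdate TEXT")
    error = None
except sqlite3.OperationalError as e:
    error = str(e)

print(error)  # e.g. "duplicate column name: publishdate"
```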

Summary
I wouldn't criticise any of the technological approaches involved:

  • It made sense for the content management system to flag up the unusual database structure change
  • It made sense to use chef for managing updates to the binaries and configuration

It just feels quite strange that I started off simply as a consumer of a web application or web service with my own local installation, but ended up having to delve into multiple relational databases to rename some columns.
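For the record, the manual fix amounts to a column rename applied directly to the database, so that the stored name matches the casing the new configuration expects.  Again a hedged sketch with invented names, using SQLite (whose ALTER TABLE ... RENAME COLUMN requires version 3.25 or later) rather than the real CMS database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE content (publishDate TEXT)")

# Rename the column in place so the schema matches the new configuration.
conn.execute("ALTER TABLE content RENAME COLUMN publishDate TO publish_date")

columns = [row[1] for row in conn.execute("PRAGMA table_info(content)")]
print(columns)  # ['publish_date']
```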

To tie back to the cheesy "considered harmful" title: using tools such as chef without understanding what they are doing, or without access to examine their effects, can result in problems and delays.  It relates to a concept I keep coming across: sometimes you don't know what you don't know.
