More than few years ago I’ve read the book Agile Modeling, by Scott Ambler, and it was quite a revelation for me. I was beginning to look into extreme programming and TDD at the time, and the fact that you could (and should) write software in an evolving manner instead of the usual big architecture up front that I had studied in university was quite refreshing and empowering.
(as a side note, I actually was going to use the title Agile infrastructure for this, but I’ve promised I’m not to use the word anymore)
Not sure how many people have read the book, but it basically goes through principles and techniques that enable you to write software one piece at a time, solving the problem at hand today, and worrying about tomorrow’s one tomorrow.
If I remember correctly, there was a sentence that went something like this (please don’t quote me on that):
Your first responsibility is to make the code work for the problem you are solving now. The second one is the problem you are solving next.
Many years have passed since the book has been written. Nowadays growing software while developing is what (almost) everyone is doing. The idea of drawing some kind of detailed architecture that will be implemented in a few months or years is completely foreign in most sensible organisations.
Basically, evolving software is almost not interesting anymore. People do it, and know how to do it (as I wrote this I’ve realised it isn’t actually true, a lot of companies don’t do it or know how to, but let’s keep in mind the ones that do…).
In the meantime, a lot has evolved and new areas that were completely static in the past are becoming increasingly dynamic, the current trendy one being IT infrastructure.
The uprise of virtual infrastructure and the so called devops movement have developed tools and practices that make it possible to create thousands of instances on demand and automatically deploy packages whenever and wherever you want. However the thinking behind infrastructure within most IT departments is the equivalent of waterfall for software.
I’m not just talking about auto-scaling here, since that seems to be a concept that’s easy to grasp. What I don’t quite get is why the same thinking that we have when writing software can’t be applied when creating the servers that will run it.
In other words:
- Start writing your application in one process, one server*, and put it live to a few users.
- Try to increase the number of users until you hit a performance bottleneck
- Solve the problem by making it better. Maybe multiple processes? Maybe more servers? Maybe you need some kind of service that will scale separately from the main app?
- Repeat until you get to the next bottleneck
* ok, two for redundancy…
The tools and practices are definitely there. We can automate every part of the deployment process, we can test it to make sure it’s working and we can refactor without breaking everything. However, there are a few common themes that come back when talking about this idea:
“If we do something like this we will do things too quickly and create low quality infrastructure”
This is the equivalent of “if we don’t write an UML diagram, how do we know what we are building?” argument that used to happen when evolving software was still mystery to most people. It’s easy to misunderstand simplicity as low quality, but that doesn’t need to (and shouldn’t) be the case. As with application code, once you put complexity in, is a major pain to take it out, and unnecessary complexity just increases the chance for problems. Simple solutions are and will always be more reliable and robust.
“We have lots of users so we know what we need in terms of performance”
If a new software project is being developed, it is pretty much understood nowadays that nobody knows what is going to happen and how it is going to evolve over time. So pretending that we know it in infrastructure land is just a pipe dream in my opinion.
“We have SLA’s to comply to”
SLA’s are the IT infrastructure equivalent of software regulations and certifications, sometimes useful, sometimes just a something we can use to justify spending money. If there are SLA’s, deal with it, but still in the simplest possible way. If you need 99.9% uptime, then provide 99.9% uptime, but don’t do that and also use a CDN to make things faster (or cooler) just in case.
As it’s said about test code, infrastructure code is code. Treat it the same way.