I recently had cause to reflect on the fact that our engineering teams regularly release code changes into production multiple times a day, and felt it worthwhile to elaborate on what that means and how it has changed the way we work as a business.
This practice is known as Continuous Delivery. Continuous Delivery is a logical extension of the Agile philosophy, especially the practices of Extreme Programming. It requires a high degree of automation and a trivially easy deployment process. It also requires courage, and a willingness not to take the easy way out in your overall processes.
Most deployment mechanisms can technically be run multiple times within a single day. It would be rare for a website deployment to take more than a full day to complete. However, that capability quite often doesn't translate into the mechanism actually being used that often. The challenge is that to release changes confidently as often as your mechanisms allow, many parts of your process need to be aligned.
I will give a brief walkthrough of our overall process, starting at the tail end and tracking back to the beginning.
To release a new version of the application, we run the following command:
cap production deploy
This kicks off a series of operations on the deployment environment:
- Update the Git repositories on each machine
- Copy the code tree to a newly created release directory
- Link in the environmental configuration
- Bundle gems
- Compile the assets
- Make the new release the ‘current’ one
- Notify New Relic
- Restart unicorns
- Restart background worker processes
- Notify Airbrake
- Tag the revision
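The sequence above is driven by Capistrano hooks. Here is a minimal sketch of the kind of configuration involved; the application name, repository URL, and restart command are illustrative, not our exact setup:

```ruby
# config/deploy.rb -- an illustrative sketch, not our actual configuration
require "bundler/capistrano"          # bundles gems as part of the deploy

set :application, "ourapp"            # hypothetical application name
set :scm,         :git
set :repository,  "git@example.com:ourapp.git"
set :deploy_via,  :remote_cache       # update the cached repo on each machine,
                                      # then copy the tree into a new release dir

namespace :deploy do
  task :compile_assets do
    run "cd #{release_path} && bundle exec rake assets:precompile"
  end

  task :restart_workers do
    run "sudo restart workers"        # placeholder for the real restart command
  end
end

after "deploy:update_code", "deploy:compile_assets"
after "deploy:restart",     "deploy:restart_workers"
```

The notifications to New Relic and Airbrake, and the revision tagging, hang off the same hook mechanism as their own small tasks.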
If we look in New Relic, we see the release marked on the chart, which allows us to confirm that everything is going okay. New Relic tracks an error rate, so we can see if there is an increase in errors as a result of the release and take appropriate action (this happens rarely in practice). Here is an example of a typical day:
Verification of the release is done by looking at Airbrake and New Relic. Airbrake captures all exceptions that the system raises, collates them, and sends an alert email for the first instance of each error. By notifying Airbrake as part of the deploy, the known error list is cleared, so you can see which errors have occurred since the latest release. By default there is code which captures controller errors, and there is an API so that you can explicitly send errors, which we use for background processes. As a result, we very rarely need to look in logs for errors. Here are some screenshots of the overall list and what an error looks like:
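The explicit error reporting we use in background processes is a simple rescue-and-notify pattern. Here is a sketch, with a stand-in `Notifier` module in place of the real Airbrake client and a made-up job:

```ruby
# Sketch of explicit error reporting from a background process;
# Notifier stands in for the real Airbrake client API.
module Notifier
  def self.notify(error)
    errors << error                  # Airbrake would collate and email this
  end

  def self.errors
    @errors ||= []
  end
end

def run_job
  yield
rescue => e
  Notifier.notify(e)                 # explicit send, as we do for workers
  raise                              # re-raise so the worker framework sees it
end

begin
  run_job { raise "nightly report failed" }
rescue RuntimeError
  # the error was reported before being re-raised
end
```

Wrapping every worker entry point this way is what lets us treat Airbrake, rather than log files, as the source of truth for errors.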
When the release contains a database schema change, we run the following command instead:
cap production deploy:migrations
Before we can push out a release, we need to go through a validation process, which looks like this:
- Commit changes on the feature branch
- Continuous Integration server build succeeds for the feature branch
- Create Github pull request to stage the change
- Test on staging environment
- Merge pull request to master
- Continuous Integration server build succeeds on master branch
- Ready to deploy to production
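For a concrete sense of the rhythm, the flow above maps onto a handful of commands (the branch name here is made up; the pull request and staging steps happen in GitHub and on our staging environment):

```shell
git checkout -b fix-order-totals     # short-lived feature branch (made-up name)
# ...make the change, commit it...
git push origin fix-order-totals     # CI builds the feature branch
# open a GitHub pull request, deploy the branch to staging and test it, then:
git checkout master
git merge fix-order-totals           # CI builds master
cap production deploy                # ready for production
```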
We use feature branches for every change; we keep them very short-lived (days, not weeks), and even one-line changes are done in a branch. The isolation is important when you might be coordinating five or more changes in a day, and the overhead of branching is negligible with Git. The actual steps in the preparation are not innovative; the trick is to keep the flow to less than an hour, so that even the tiniest change can follow the same approach as bigger ones. That also requires a fast build and a fast Continuous Integration server.
Primarily though, it requires you to develop in the smallest increments possible, and this is where you need to avoid taking the easy way out.
In practice, the chunking of development work tends to gravitate over time towards the shortest time between deployments. If your release cycle is four weeks, you'll end up thinking about work in granularities of two to four weeks. If you can reduce the minimum time between deployments to zero, you end up inverting your thinking. Instead of working out what you can fit into the release, you start to think about how small you can make the releases.
This means for example that you can:
- perform a refactor and release it before you start the story that required it
- release a two-line change five minutes after you discovered you need it
- incrementally performance tune controller actions with real traffic
- dark launch features so that you can use them with production data or show them to beta users
Dark launching with feature toggles is what helps alleviate the problems of large features and long-lived branches. We isolate the feature behind a switch (or just don’t link to it anywhere) and keep pushing out updates without affecting users or diverging from master.
Here is roughly what the code for the switch looks like in practice (the helper and feature names here are illustrative):

if feature_enabled?(:new_feature, current_user)
  # variant code goes here
else
  # normal code goes here
end
We have a screen which shows the features that exist and allows an administrator to turn each one on for themselves. It also contains information about the experiment, which I won't cover in greater detail at this point.
The other powerful capability that small releases open up is multi-stage releases. What are these? Primarily they occur around database refactorings when you want zero downtime. Let's take the example of breaking a table into two (or more) tables, such as might occur when you have too many columns and your model is starting to get bloated. Here are the releases you need to do:
- Release migration to create the new table
- Release code which double writes to both the new and old tables
- Backfill the new tables by copying across the rows written before double-writing began
- Release code which only reads from the new tables
- Release migration to drop old columns (being mindful of locks on large tables)
Voila! Table refactoring with zero downtime.
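To make the double-write step concrete, here is a minimal sketch of the idea in plain Ruby. Arrays stand in for the two tables, and the names are made up; the real code writes through ActiveRecord models:

```ruby
# Step 2 (double-write) and step 4 (read from the new table only),
# sketched with in-memory arrays standing in for database tables.
OLD_ORDERS = []   # the bloated original table
NEW_ORDERS = []   # the narrower replacement table

def create_order(attrs)
  OLD_ORDERS << attrs                          # existing write path stays live
  NEW_ORDERS << attrs.slice(:id, :total)       # the new table gets its columns too
end

def find_order(id)
  NEW_ORDERS.find { |row| row[:id] == id }     # reads now hit the new table only
end

create_order(id: 1, total: 42, legacy_notes: "only in the old table")
```

Because both tables are kept in sync while double-writing is live, the read path can be switched over in a later release without any downtime window.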
When you have a pipeline to production, your planning complexity drops dramatically. Depending on your user base, you can roll out multi-page changes progressively rather than in one hit. You may decide to be continually releasing behind a feature toggle (dark launching) and have your staff using the features as they develop, in which case the final release becomes a single-line change to permanently set the toggle to on. We've built up a reasonable set of patterns around dark launching: administration screens for toggling features on and off for staff users, and landing pages with switches so that we can share the beta (or alpha!) feature with trusted users.
Planning begins to focus much more on validating ideas, which inevitably drives how we do product development. Experimentation becomes part and parcel of our regular cycle of development. I will cover this in more detail in a subsequent post.
There is a really important aspect to making this whole thing work that should be clearly stated. We have no operations staff who get handed a release to deploy. There are no testers taking a developer's code and making sure it works. I haven't seen a business analyst for nearly four years. Nor are there project managers wielding timelines and telling people what to do.
What we have is a strong belief that those who release the code should feel ownership of the change that has just gone out. To achieve that everyone has to feel responsibility for the end to end process. An engineer studies a problem, devises a solution and then validates that the problem is now solved. To that end, every member of the team performs all parts of the software development process. Engineers analyse and research user behaviour to look for opportunities.
The outcome of this team dynamic is that very little coordination is required for a release. Teams can be the smallest possible size, allowing for pairing, as every member sees analysis, coding, testing and releasing as their responsibility. The standard we hold ourselves to is that a good idea had in the morning should be able to be in front of users before the day ends.