This past week I met with a mature company that has been delivering software solutions for more than 20 years. They know software. They know how to build it, package it, represent its value, and sell it. After a round of introductions, one of their senior technologists turned to me and asked, “What’s a Docker?”
This simple, incorrectly formed question highlights where we are in our industry today. It mirrors the innocent question of a child who knows a thing is important without knowing what that thing even is.
We’re going through a sea change, with the waterfall system of software development transforming into parallel systems backed by containers and continuous integration and delivery. Organizational processes that have long been set in stone are now anchors tied to the feet of productivity, and that metamorphosis seemingly happened overnight.
System administrators are now SREs or DevOps engineers, and everyone wants a full-stack employee who knows everything, does everything, and has already mastered the technology that someone else will invent tomorrow. Managers and senior executives have woken up to a world with words that don’t obviously define their concept, and many are scrambling to catch up.
It’s a brave new world, and it’s daunting.
“Google made Kubernetes, so it must be good. We should use it.” – Unnamed CIO
The business space has two types of people looking at these changes. One group holds the people who decide if the use of the technology will further the business goals. Will it make the business more competitive? Will the business be able to build and sell widgets faster or with lower cost and higher margins? The other group will implement and manage the technology. They are the ones who are expected to move the business reliably into the new workflows, design the processes, implement the solutions, and train everyone on how to use the tools.
Most companies have one side or the other, or they might have both sides with different views of the benefits and how to leverage them. Your CIO might have heard that Google created Kubernetes and, biased by the fact that Google makes tons of money, decide that he wants to use Kubernetes. In reality, Kubernetes is a beast and not suited to most environments. Once installed, it runs well, but it requires significant maintenance and specialized knowledge, and it is only a framework. Unless you use a tool like Rancher, you still have to install and build the services that run within it, monitor them, deliver fault tolerance, restart them when they fail, and configure them for downstream services to use. Kubernetes is not for the faint of heart.
Maybe you champion containers because you know that they will make your job easier, deliver a more reliable architecture, and streamline product delivery. You’re up against deep inertia, the will of a company not to change the way it does things because change is hard. Change breaks things. Change can cost people their jobs. The larger an organization, the slower it moves. Defenders of this slowness believe that it protects the organization, but ironically it is this very slowness that can bring an established player to its knees.
Competition requires agility, and the entire process being developed around DevOps, CI/CD, and containers allows an organization to be more agile, move faster, release faster, and deliver products and features faster. This translates to more revenue, lower costs, and loyal customers.
It doesn’t matter which side of the fence you’re on; you have the same fight ahead of you. You have to convince others that this new paradigm delivers true value. If you want to sell someone on the benefit of something, you have to show them the direct benefits to their bottom line.
The fight over containers often comes down to the tools before it ever considers the process. Let’s take a step back and explore the process first. Out of this we’ll find the tools that are right for the job.
A Game of Ownership
Developers and Operations staff have always struggled to get along. Operations people crave stability in the environment because stability reduces downtime. We achieve stability by having standards across all systems. Developers, by the very nature of what they do, create problems for stability. They develop software for release. This software is different from the last release. That’s a change. A change threatens stability.
Businesses long sought to mitigate the risk of change by developing processes and procedures to control it. This became known as Change Management, which wrapped change up in rules that sought to manage it like a process on an assembly line. It’s not a bad idea, and considering the tools available at the time, it worked great for what it was intended to do: slow the process of change down enough that people can review it, write procedures for it, plan for it, and be ready to execute actions when it falls apart.
The waterfall system of software development defines a process where an object leaves one department and moves to another department in isolation. Developers might have a new release of the application. They package it up and hand it to the quality assurance folks. The QA team puts it through a battery of tests, signs off that it’s good to go, and hands it off to the operations people. Operations does whatever it needs to do with it and then schedules it for deployment, which involves procedures for downtime, maintenance, and possibly the upgrade of other components on the system to support the new release.
This whole process might take weeks or months.
Even if everyone does everything perfectly, the application might have a bug. It might expose the system to attack. People have to respond. Operations people have to decide how to isolate the app, whether to roll it back to a previous version, and how to get it fixed. Developers have to figure out what caused the bug, write tests for it, fix it, and fast-track it through the release process again.
Does this sound stressful? Can you imagine how someone in Ops might resent someone in Dev for writing code with a bug in it? An operator might feel that the developers are forcing him to release buggy code that jeopardizes the stability of his environment. A developer might feel that the operations team is forcing them into constraints that limit their ability to write code the way they want to because of limitations that they see as unreasonable.
Each party feels that they can’t do their job right because the other party won’t let them.
No one likes to be responsible for someone else’s mistakes, but the truth of the matter is that people are fallible, and software has bugs. The belief that we can prevent bugs by slowing down the process is flawed. What we need is a process where we can embrace mistakes and mitigate their impact by making it easy and fast to fix them.
DevOps Is a Process
Ask ten people who work in DevOps what their responsibility is, and you’ll get ten different answers. Most often you’ll find developers now given the job of being system administrators, when they aren’t qualified to make decisions about system health, integrity, and security. Being able to launch a cloud server does not make you a system administrator, but is that worse than finding system administrators tasked with doing development when their approach to solving issues is completely different? Scripting up some stuff in Python or Ruby does not make you a developer.
Some companies embrace DevOps because they think they’ll get one super employee for less than the cost of two specialized employees, but they’re also wrong. Go look on job boards for how many companies want “full stack developers” with a decade of experience and are offering lower salaries than entry level positions. This highlights a failure in expectations and drives everyone to the lowest common denominator of skill.
If you hire mediocre people, you’ll be a mediocre company.
The best description I’ve found of the DevOps process is by HashiCorp, makers of tools that fit into the DevOps model and make it easier to manage. They define seven stages of software development and point to the roles that each member of the team satisfies according to his skill set. Their model then leverages tools to streamline each party’s workload and to make it easier to move an object between the stages.
I’m a system administrator at heart. I’m also a developer, but I’m a much better system administrator than I am a developer because I’ve spent more time doing it. I want to own my portion of the process completely, because by doing so I can confidently promise that it will work. I want to give that same level of confidence to developers with whom I work, so that they can confidently promise that code that worked in their development environment will work in production.
The best way to do this is with containers.
A Tale of Two Deployments
Imagine the following two scenarios. Which one would you prefer?
ABC Legacy Widgets
Frank is a developer for ABC Legacy Widgets, Inc. He and his team of developers and QA specialists are developing the next release of Application X, which they’ve been working on for several months. It has features that users have been asking about for more than a year, and he’s excited to finally be deploying those. The application depends on third-party libraries, and he’s gone to great lengths to leverage their strengths to give users what they want. He’s grateful that his manager was able to buy some dedicated hardware for their team so that they can build the application and make sure that everything works before they hand it off. After putting in some late nights to avoid missing the deadline, they’re ready to release. He commits the code to the repository and notifies the operations team.
Bob is a system administrator and has been tasked with releasing version 2.0 of Application X to production. The developers have said that everything looks good, but he has to run his own tests to make sure that they’re telling the truth.
Bob begins the process of figuring out what’s involved to push it live. He has a staging server that is a copy of the production environment, so he installs the application there. The first thing that he notices is that it has a dependency on a system library that he needs to update. He updates the library, and in doing so discovers that he now has to update a bunch of other software on the system. He does so and is able to install the application, but now Application Z that runs in the same environment crashes. It makes a call to a function in one of the libraries he updated, and that call was deprecated two versions ago and has now been removed. The developers of Application Z haven’t updated their code to handle the deprecation, so if he installs version 2.0 of Application X, it will break Application Z.
He could deploy Application X on its own server farm, but that would cost the company more money and leave him supporting two different environments. He rejects the deployment and asks the team from Application X to work with the team from Application Z to match library dependencies and make whatever changes are necessary to run both apps on the same system. He knows it will be weeks before they figure it out and he has to look at this again.
Lightning Apps
Chuck is a developer for Lightning Apps, LLC. They have a cloud solution that does some stuff that their users love. His team is active in the company’s Slack and support channels, listening to the users for feature requests and incorporating them as quickly as possible into the application. He has three developers working with him, but they’re all remote. They work off of a private GitLab installation and run Docker containers inside of Vagrant on their development machines. The containers hold all of the dependencies for their application, and he has his development folder mounted within the Vagrant VM so that he can test his changes in real time on his local machine, even when he’s not connected to the Internet. He writes unit tests before he writes any code, so he knows that future changes don’t break previous work. When he completes a feature, he commits the code and pushes it to GitLab. GitLab CI runs the unit tests and then runs the code within a Docker container that allows it to run functional tests of the final application. If these tests fail, Chuck gets a notification with information on what failed. Once all tests pass, GitLab builds a container for the application and pushes it out to their private Docker registry.
Amy is a quality assurance specialist for Lightning Apps. After GitLab CI finishes building the container, it opens a ticket saying that a new release is ready for testing and assigns it to Amy automatically. She logs into Rancher and deploys the container into the company’s QA environment. She goes through all of the usability tests to ensure that it meets the company’s requirements, which only takes her a few minutes. She writes a note that the testing is complete and closes her ticket.
Dave is a system administrator for Lightning Apps. He oversees their entire cloud environment and manages their Docker installation using Rancher. When Amy closes her ticket, the system opens a new ticket in his queue that says the next version of the app is ready to release. He knows that he wouldn’t have received this ticket if the release hadn’t passed all of its tests, so he proceeds with confidence. He checks the release notes to make sure that he doesn’t have to do any additional work with the database. This release is a cosmetic release, so he can proceed without downtime.
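The gating in Chuck’s pipeline is easy to picture in code. What follows is only a minimal Python sketch of the idea, not how GitLab CI is actually configured: the stage names mirror the story, while run_pipeline, PipelineResult, and the stand-in stage functions are hypothetical names invented for illustration. Each stage runs only if the one before it succeeded, and the first failure stops everything and records which stage failed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


@dataclass
class PipelineResult:
    passed: bool
    completed: List[str] = field(default_factory=list)
    failed_stage: str = ""


def run_pipeline(stages: List[Stage]) -> PipelineResult:
    """Run stages in order and stop at the first one that fails."""
    result = PipelineResult(passed=True)
    for name, step in stages:
        if not step():
            # A real CI system would notify the committer at this point.
            result.passed = False
            result.failed_stage = name
            return result
        result.completed.append(name)
    return result


# Hypothetical stand-ins for the real work each stage would do.
def unit_tests() -> bool:
    return True


def functional_tests() -> bool:
    return True


def build_image() -> bool:
    return True


def push_to_registry() -> bool:
    return True


PIPELINE: List[Stage] = [
    ("unit tests", unit_tests),
    ("functional tests", functional_tests),
    ("build image", build_image),
    ("push to registry", push_to_registry),
]
```

Calling run_pipeline(PIPELINE) returns a result whose completed list names every stage that ran. Nothing reaches the registry unless every earlier stage passed, which is the property the rest of the team relies on.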
He logs into Rancher and connects to the production environment. Once there, he selects the “Upgrade” button for the service that uses this container. Rancher pulls the new container, starts it, and then swaps it in for the old container while keeping the old one running. If there’s any issue with the new container, Dave can press “Rollback” to return to the previous version in a few seconds. As usual, though, the upgrade goes through without any issue, and he clicks “Finish Upgrade” to terminate the old containers. All of this happens in under a minute, with the container swap happening almost instantaneously.
He could have also stopped backends in the load balancer and shown the “site down for maintenance” page, but all state in their application is handled outside of the containers, so the impact to active users is minimal.
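Dave’s upgrade is a small state machine: the new container takes over while the old one is held in reserve until he either rolls back or finishes. Here is a minimal Python sketch of that flow. It is a model of the idea only and not Rancher’s API; the Service class, its method names, and the image tags in the usage note are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Tracks which image serves traffic and which one is held in reserve."""
    active: str
    previous: Optional[str] = None  # old image, kept until the upgrade is finished

    def upgrade(self, new_image: str) -> None:
        """Swap in the new image while keeping the old one available."""
        if self.previous is not None:
            raise RuntimeError("finish or roll back the current upgrade first")
        self.previous, self.active = self.active, new_image

    def rollback(self) -> None:
        """Return to the image that was running before the upgrade."""
        if self.previous is None:
            raise RuntimeError("no upgrade in progress")
        self.active, self.previous = self.previous, None

    def finish_upgrade(self) -> None:
        """Discard the old image once the new one has proven itself."""
        if self.previous is None:
            raise RuntimeError("no upgrade in progress")
        self.previous = None
```

For example, Service(active="app:1.9") followed by upgrade("app:2.0") leaves "app:1.9" in reserve, so rollback() is a swap rather than a redeploy. That is why returning to the previous version takes seconds.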
The Container Is the Unit of Deployment
What makes it possible to accelerate development and release schedules with confidence is the use of the container as the unit of deployment. The container encapsulates everything necessary to run the application, and in the case of service-oriented applications, it encapsulates everything necessary to run each specific piece of the application. The container starts in development, where every developer is using the same containers, built from the same instruction set. These containers move along the chain from development to production, and at every stop along the way, everyone knows that they’re using the same object that everyone before them used. Moving faster requires confidence, and while containers deliver that confidence, they don’t do it on their own.
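One way to make “the same object” concrete is to fingerprint the artifact when it is built and check that fingerprint at every later stop. The Python sketch below does this with a SHA-256 digest from the standard library. The Release class, the stage names, and the byte string standing in for an image are hypothetical; a real registry tracks image digests for you, so treat this as an illustration of the principle rather than a tool to deploy.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def digest(artifact: bytes) -> str:
    """Return a content fingerprint for the artifact."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


@dataclass
class Release:
    expected_digest: str  # recorded once, when the artifact is built
    stages_passed: List[str] = field(default_factory=list)

    def promote(self, stage: str, artifact: bytes) -> None:
        """Admit the artifact to a stage only if it is byte-for-byte unchanged."""
        if digest(artifact) != self.expected_digest:
            raise ValueError(f"{stage}: artifact differs from the one that was built")
        self.stages_passed.append(stage)
```

If QA and production both call promote with the bytes that were built, both succeed. If anyone rebuilds or patches the artifact along the way, the digest changes and promotion fails, so every stage can trust what the previous one signed off on.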
To properly leverage containers requires that you implement tools that support them. This includes a system for building containers, storing them, launching them, testing code that runs within them, and moving them along to deployment. After that you still need tools to monitor them and provide a method for automated or manual recovery when they fail.
HashiCorp has its own tool set for each of these phases, but others exist. You can use Jenkins, Travis CI, GitLab CI, CircleCI, or any others, as long as they support the features you require. For deployment you can use straight Docker, Docker Compose, Docker Swarm, Kubernetes, Mesos, or any supported orchestration engine within Rancher. For management you can use custom solutions built within Ansible, Puppet, Chef, or a fully featured solution like Rancher. You can address monitoring at the application level using common tools, and you can monitor things at the container level using Prometheus, Grafana, ELK, or others.
There are dozens of tools to accomplish your goals, but that’s the point. They all do the same thing: they empower you to move quickly and deploy with confidence. The tool doesn’t matter if you understand the process.
The process is simple: encapsulate everything you can within one object and move it along the chain from development to release, giving every party along the way the tools not only to do their job well but also to own their part of the process.