Shipping to Production
Since the book is out, I contacted Gergely to inquire whether he would be willing to share a chapter with the newsletter audience. To my delight, he kindly agreed. The chapter I've chosen is 'Shipping to Production.' I hope you enjoy it.
You can check out the book here: The Software Engineer's Guidebook
As a tech lead, you’re expected to get your team’s work into production quickly and reliably. But how does this happen, and which principles should you follow? This depends on several factors: the environment, the maturity of the product being worked on, how expensive outages are, and whether moving fast or having no reliability issues is more important.
This chapter covers shipping to production reliably in different environments. It highlights common approaches across the industry, and helps you refine how your team thinks about this process. We cover:
Extremes in shipping to production
Typical shipping processes at different types of companies
Principles and tools for shipping to production responsibly
Additional verification layers and protections
Taking pragmatic risks to move faster
Additional considerations for defining a deployment process
Selecting an approach
1. EXTREMES IN SHIPPING TO PRODUCTION
Let’s start with two “extremes” in shipping to production:
The You Only Live Once (YOLO) approach is used for many prototypes, side projects, and unstable products like alpha/beta versions. It’s also how some urgent changes make it into production.
The idea is simple: make a change in production and check whether it works there. Examples of YOLO shipping include:
SSH into a production server → open an editor (e.g. vim) → make a change in a file → save the file and/or restart the server → see if the change works.
Make a change to a source code file → force land this change without a code review → push a new deployment of a service.
Log on to the production database → execute a production query to fix a data issue (e.g. modifying records with issues) → hope this fixes the problem.
YOLO shipping is as fast as it gets when shipping a change to production. However, it also has the highest risk of introducing new issues into production because there is no safety net. For products with few to zero production users, the damage done by introducing bugs into production can be low, so this approach is justifiable.
YOLO releases are common for:
Early-stage startups with no customers
Mid-sized companies with poor engineering practices
Resolving urgent incidents at places without well-defined incident handling practices
As a software product grows and more customers rely on it, code changes need to go through extra validation before production. Let’s go to the other extreme: a team obsessed with doing everything possible to ship zero bugs into production.
Thorough verification through multiple stages
This is an approach used for mature products with many valuable customers, where a single bug can cause major problems. This rigorous approach is used if bugs could result in customers losing money, or make them switch to a competitor’s offering.
Several verification layers are in place, with the goal of simulating the real world with greater accuracy, such as:
Local validation. Tooling for software engineers to catch obvious issues.
CI validation. Automated tests like unit tests and linting on every pull request.
Automation before deploying to a test environment. More expensive tests such as integration tests or end-to-end tests, before deployment to the next environment.
Test environment #1. More automated testing, like smoke tests. Quality assurance engineers might manually exercise the product, running manual tests and doing exploratory testing.
Test environment #2. An environment where a subset of real users – such as internal company users or paid beta testers – exercise the product. The environment is coupled with monitoring and the rollout is halted upon sign of a regression.
Pre-production environment. An environment in which the final set of validations is run. This often means running another set of automated and manual tests.
Staged rollout. A small subset of users gets the changes, while the team monitors key metrics to confirm they remain healthy and checks customer feedback. The staged rollout strategy depends on the riskiness of the change being made.
Full rollout. As the staged rollout increases, at some point changes are pushed to all customers.
Post-rollout. Issues will still arise in production, so monitoring and alerting are set up, along with a feedback loop with customers. If there’s an issue, it’s dealt with by the standard oncall process. We discuss this process more in Part 5: “Reliable software engineering.”
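One common way to pick the "small subset of users" in a staged rollout is to hash each user into a stable bucket and enable the change for buckets below the current rollout percentage. The sketch below is a minimal illustration under that assumption, not a description of any particular company's tooling; the function names, the feature name, and the user IDs are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99.

    The feature name is mixed into the hash so that different features
    do not always pick the same users first.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if this user is inside the current rollout percentage."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return rollout_bucket(user_id, feature) < rollout_percent


# Hypothetical usage: start at 5% and widen the rollout while metrics stay healthy.
if is_enabled("user-42", "new-checkout", rollout_percent=5):
    pass  # serve the new code path to this user
```

Because the bucket depends only on the user and the feature, a user who receives the change at 5% keeps receiving it as the percentage grows, which keeps the experience consistent during the rollout.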
A heavyweight release process is used by:
Highly regulated industries, such as healthcare, aviation or automotive.
Telecommunications providers, where it’s common to have ~6 months of thorough testing of changes before major changes are shipped to customers.
Banks, where bugs could cause financial losses.
Traditional companies with legacy codebases with little automated testing. These places want to keep quality high and are happy to slow down releases by adding verification stages.
2. TYPICAL SHIPPING PROCESSES
Different companies tend to take different steps in shipping to production. Below is a summary of typical approaches, highlighting the variety of processes:
Startups typically do fewer quality checks. These companies tend to prioritize moving fast and iterating quickly, and often do so without much of a safety net. This makes perfect sense if they don't have customers yet. As customers arrive, teams need to find ways to avoid regressions and the shipping of bugs.
Startups are usually too small to invest in automation, and so most do manual QA – including the founders being the ‘ultimate’ testers, while some places hire dedicated QA folks. As a company finds its product-market fit, it’s more common to invest in automation. And at tech startups that hire strong engineering talent, these teams can put automated tests in place from day one.
Traditional companies tend to rely more heavily on QA teams. Automation is sometimes present at these companies, but typically they rely on large QA teams to verify what is built. Working on branches is also common; it's rare to have trunk-based development.
Code mostly gets pushed to production on a weekly schedule or even less frequently, after the QA team verifies functionality.
Staging and UAT (User Acceptance Testing) environments are more common, as are larger, batched changes shipped between environments. Sign-off is required from the QA team, the product manager, or the project manager, in order to progress the release to the next stage.
Large tech companies
These places typically invest heavily in infrastructure and automation related to shipping with confidence. Such investments often include automated tests which run quickly and deliver rapid feedback, canarying, feature flags and staged rollouts.
These companies aim for a high quality bar, but also to ship immediately when quality checks are complete, working on trunk. Tooling to deal with merge conflicts becomes important, given that some places can make over 100 changes on trunk per day. For details on QA at Big Tech, see the article How Big Tech does QA.
Meta’s core product
Facebook, as a product and engineering team, merits a separate mention, because this organization has a sophisticated and effective approach few other companies use.
This Meta product has fewer automated tests than many would assume. On the other hand, Facebook has exceptional automated canarying: code is rolled out through four environments, starting with a testing environment with automation, then one that all employees use, then a test market in a smaller geographical region, and finally all users. At each stage, the rollout automatically halts if the metrics are off.
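The general shape of a staged rollout that halts automatically can be sketched in a few lines. This is an illustration of the idea only, not Meta's actual system: the stage names, the choice of error rate as the health metric, the thresholds, and the `deploy` and `read_error_rate` callables are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Stage:
    name: str
    max_error_rate: float  # assumed health threshold for this stage


@dataclass
class RolloutResult:
    completed: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None  # None means every stage passed


def run_rollout(
    stages: List[Stage],
    deploy: Callable[[str], None],
    read_error_rate: Callable[[str], float],
) -> RolloutResult:
    """Deploy stage by stage and stop at the first unhealthy stage."""
    result = RolloutResult()
    for stage in stages:
        deploy(stage.name)
        if read_error_rate(stage.name) > stage.max_error_rate:
            result.halted_at = stage.name  # later stages never receive the change
            return result
        result.completed.append(stage.name)
    return result


# Hypothetical stages, loosely mirroring the four environments described above.
STAGES = [
    Stage("test-environment", max_error_rate=0.05),
    Stage("employees", max_error_rate=0.02),
    Stage("test-market", max_error_rate=0.01),
    Stage("all-users", max_error_rate=0.01),
]
```

In a real system, `deploy` would trigger a deployment and `read_error_rate` would query monitoring after a soak period; here both are injected as plain callables, so the halting logic can be exercised with fake metrics.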
3. PRINCIPLES AND TOOLS
What are principles and approaches worth following for shipping changes to production responsibly? Consider these:
Use a local or isolated development environment. Engineers should be able to make changes on their local machine, or in an isolated environment unique to them. It’s more common for developers to work in local environments. However, places like Meta are shifting to remote servers for each engineer. From the article, Inside Facebook’s Engineering culture:
“Most developers work with a remote server, not locally. Starting from around 2019, all web and backend development is done remotely, with no code copied locally, and Nuclide facilitating this workflow. In the background, Nuclide was using virtual machines (VMs) at first, later moving to OnDemand instances – similar to how GitHub Codespaces works today – years before GitHub launched Codespaces.
Mobile development is still mostly done on local machines, as doing this in a remote setup, as with web and backend, has tooling challenges.”
Verify locally. After writing the code, do a local test to ensure it works as expected.
Testing and verification
Consider edge cases and test for them. Which obscure cases does your code change need to account for? Which real-world use cases haven’t you accounted for yet?
Before finalizing work on the change, compile a list of edge cases. Consider writing automated tests for them, if possible. At least do manual testing. Coming up with a list of unconventional edge cases is a task for which QA engineers or testers can be very helpful.
Write automated tests to validate your changes. After manually verifying your changes, exercise them with automated tests. If following a methodology like test-driven development (TDD), you might do this the other way around by writing automated tests first, then checking that your code change passes them.
Another pair of eyes: a code review. With your code changes complete, put up a pull request and get somebody with context to look at your code changes. Write a clear, concise description of the changes and of which edge cases are tested for, and get a code review.
All automated tests pass, minimizing the risk of regressions. Before pushing the code, run all the existing tests for the codebase. This is typically done automatically, via the CI/CD system (continuous integration/continuous deployment).
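To make the advice on edge cases and automated tests concrete, here is a minimal sketch using Python's built-in `unittest` module. The function under test, `split_evenly`, is a hypothetical example invented for illustration; the point is that each edge case from the list becomes one small, named test.

```python
import unittest
from typing import List


def split_evenly(total_cents: int, people: int) -> List[int]:
    """Split an amount in cents across people without losing any cents."""
    if people <= 0:
        raise ValueError("people must be positive")
    if total_cents < 0:
        raise ValueError("total_cents must not be negative")
    base, remainder = divmod(total_cents, people)
    # The first `remainder` people each pay one extra cent.
    return [base + 1 if i < remainder else base for i in range(people)]


class SplitEvenlyEdgeCases(unittest.TestCase):
    def test_uneven_total_loses_no_cents(self):
        self.assertEqual(split_evenly(100, 3), [34, 33, 33])

    def test_zero_total(self):
        self.assertEqual(split_evenly(0, 2), [0, 0])

    def test_more_people_than_cents(self):
        self.assertEqual(split_evenly(2, 3), [1, 1, 0])

    def test_zero_people_is_rejected(self):
        with self.assertRaises(ValueError):
            split_evenly(10, 0)

    def test_negative_total_is_rejected(self):
        with self.assertRaises(ValueError):
            split_evenly(-1, 2)
```

Saved in a test file, these tests can be run locally with `python -m unittest` before putting up the pull request, and again automatically in CI.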