This is another addition to our ongoing Agile in Practice series. You can get up to speed on where we were by looking at where we began.
Issue
Background: The State of the Product
Nuve had developed an increasingly complicated product that linked code across 3 different codebases: Front End Interfaces, Back End Orchestration, and SAP Systems running on cloud infrastructure.
We had done a good job writing automated tests to check functionality within each codebase, which kept the number of bugs low within the front end code, back end code, and SAP systems.
Background: Our Git Workflow
We were using GitHub Flow, which essentially was:
- Our `main` branch was regularly deployed to production
- Our `feature` branches were always branched off of `main`, then merged back into `main`
- Automated tests were run prior to releasing `main` to production after `feature` branches were merged in
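For readers who want to see the concrete commands, GitHub Flow day to day looks roughly like this (a minimal sketch; the branch name `new-feature` is a placeholder and the final deploy step is shown as a comment, since deploy tooling varies by team):

```bash
# Start a feature branch off of the latest main
git checkout main
git pull origin main
git checkout -b new-feature

# ...commit work and open a pull request; automated tests run against the branch...

# Once tests pass and the pull request is approved, merge back into main
git checkout main
git merge --no-ff new-feature
git push origin main

# main is then released to production
```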
The Issue that Arose
While our automated tests would pass, signifying that our Back End, Front End, and SAP Systems were essentially bug free, we would sometimes see user-facing bugs after release. That’s because, while the code within each codebase was well tested and therefore largely bug free, the interactions between codebases were not always well tested.
For example: A user might do something that wasn’t allowed - something that should show a helpful error message. Our back end codebase would send an error message to the (separate) front end codebase, but the front end didn’t know how to handle that message. So it showed nothing to the user, and the user would get stuck. They got no feedback that they had done something incorrectly, which led to frustration.
There isn’t really a way to write a unit test for this type of bug because the issue sits at the intersection between two codebases. It’s an “integration test” issue because it arises at the integration point between two systems.
Solution
Agile Brainstorming: Options for Solving
We considered a number of possible solutions:
Possibility 1: Write Automated Integration Tests
Good idea, but in practice, we had seen automated integration tests become very time-consuming to build and maintain, so we filed that away as something we might do in the future.
Possibility 2: Test all Integrations in Your Isolated Development Environment
We decided to do this, because it’s a good practice to test your own code end to end. But it wasn’t quite enough to give us peace of mind because a) it put all of the responsibility onto one developer to think of every test case, and b) it didn’t allow us to test complicated releases that could cause issues.
Possibility 3: Create a Production-Like Quality Assurance (QA) Environment
Basically, we would clone our production environment (with the exception of sensitive production customer data), and we would release changes there prior to releasing them to production.
This was the idea that really met our needs. It enabled us to do the following:
- Test our release to a production-like environment to ensure our production release would go smoothly
- After releasing to QA, all developers could test each other’s work prior to the production release - more eyes meant more opportunities to catch bugs
- And it had additional benefits:
  - If you had never released to production before, this was a good place to practice
  - If you had to make changes to infrastructure (e.g. web or database servers), you could test those changes before making them in production, reducing the chances of an infrastructure issue
Implementation: A New Git Workflow
This led to an interesting question: Should we just release our `main` git branch to QA and then later release it to production after testing on QA? Or should we do something else?
There were good arguments in both directions, but given how easy it is to merge one branch into another using GitHub, we settled on creating a new `develop` branch, which led to the following equivalency:
- `main` branch = Production
- `develop` branch = QA
In other words, everything that is on `main` should be well tested enough to release to production (and production releases should happen quickly after changes are merged into `main`).
Also, everything that is on `develop` should be well tested enough to release to QA (and QA releases should happen quickly after changes are merged into `develop`).
In summary, our git workflow became:
- `develop` is branched off of `main`, and a production release is merging `develop` back into `main` and deploying to production
- `feature` branches should be branched off of `develop` and merged back into `develop` for release to QA
Simply put, code started to flow like this:
`feature-branch` (Dev environment) → `develop` (QA) → `main` (Production)
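In terms of day-to-day git commands, the new workflow looked roughly like this (a minimal sketch; the branch name `new-feature` is a placeholder and the deploy steps are shown as comments, since actual deploy tooling varies):

```bash
# Feature work branches off of develop
git checkout develop
git pull origin develop
git checkout -b new-feature

# ...build and test the feature in your isolated development environment...

# Merging into develop corresponds to a QA release
git checkout develop
git merge --no-ff new-feature
git push origin develop
# -> deploy develop to the QA environment

# A production release merges develop back into main
git checkout main
git pull origin main
git merge --no-ff develop
git push origin main
# -> deploy main to production
```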
Implementation: When Do We Merge Features to QA?
We suggested the following readiness principle:
If you merge a feature into `develop` (i.e. release to QA), you should be very confident that it’s tested well enough that you will be able to merge it into `main` and release to production within 24 hours.
Essentially, it’s bad form to break QA for a long time because you didn’t test well enough in your own isolated system. That behavior blocks other people, and while they have their own isolated systems, they can only develop in isolation for so long. They are pushing to integrate their code into QA in prep for a production release.
Results
This change, arguably more than anything else, reduced the number of user-facing bugs in production releases. The bugs became obvious when we tested our changes in a near-identical-to-production environment.
The change also enforced a degree of discipline leading up to releases. We knew that QA had to be “stable” prior to starting a production release, so people could rally to make that happen if there were bugs on QA prior to a planned release. We also could delay a release (generally for less than 24 hours based on our readiness principle) if we knew QA wasn’t stable.
As a final comment, it might sound like this adds some overhead/maintenance costs to the development process. That is true, but we’d argue that it’s a “good” kind of maintenance cost. Because QA is so similar to production, arguably, every issue that arises on QA could have arisen in production instead if we didn’t see it in QA first! In our opinion, it’s better to have those issues arise in QA than in production and to get valuable practice fixing them.