Whether your software is helping doctors make medical decisions, alerting sleepy drivers, or just selling dog food, there are consequences for failure. It could mean missed revenue for your company, loss of customers’ confidence, or worse.
Having spent over 15 years working on mission- and safety-critical software such as medical and financial applications, I have learned a lot about what can make or break your software’s reliability. Some of the following strategies may seem obvious, or are already seen to have value on their own, but I strongly believe that their impact on software reliability is vastly underestimated, particularly the interplay between them.
1. Reduce Technical Debt
Technical debt is the kiss of death for the reliability of your application, not to mention your budget, timeline, and agility. Here’s my own definition:
Technical Debt is when a software system or data representation model no longer makes sense, or is difficult to reason about.
If you have technical debt, your code and/or data model is lying to you — it no longer represents reality. Tech debt creates unnecessary complexity and confusion, and will often lead to mistakes (bugs). Even worse, this can become a vicious cycle, in which engineers are forced to take further shortcuts to meet deadlines.
Some amount of tech debt is unavoidable. It’s an organic part of the software development process. Sometimes, you need to make a tight deadline to land a big contract or fix a security issue. That’s fine — as long as you pay back that debt at a later date — or are willing to throw away that code when changes are required.
2. Keep it Simple
When NASA designs a spacecraft, there are generally two basic approaches to ensuring reliability. One is redundancy, i.e., backup computers, backup power, etc. When redundancy isn’t possible, they employ simplicity.
The ascent stage engine on the Apollo Lunar Module, the rocket engine that brought astronauts from the lunar surface back to their mothership, was surprisingly simple. It used “hypergolic” propellants that ignite as soon as they come into contact with each other. That’s basically it. Unfortunately, these were very toxic materials to work with, but the design ensured that the engine worked every time.
If simplicity works for spacecraft, then it can work for software. In fact, I think it works even better for software.
- The simpler something is, the less that can go wrong.
- The simpler something is, the fewer things that need to be tested, and the more thoroughly it can be tested.
- The simpler something is, the easier it is to understand, and the less likely mistakes will be made.
3. Make Your Code Readable
At some point in my career, I noticed that there is a widely held belief that senior programmers write more complicated code than junior programmers, and that you also need to be senior to understand it. It’s seen as a badge of honor to write such code.
Sure, I’ve seen some pretty interesting TypeScript code that twisted my brain into a Möbius strip, but for the most part, this isn’t true. In my opinion, the best programmers are the ones who take complex behavior and are able to break it down and write readable, reasonable, communicative code that most programmers can understand. Writing readable code is an art and a skill in itself that takes years to master.
One of my favorite talks on the subject is “Seven Ineffective Coding Habits of Many Programmers” by Kevlin Henney.
4. Use Proper Error Handling and Follow an Incident Management Strategy
Mistakes are inevitable, but even if they weren’t, not all errors are bugs. Someone may log in with a bad password, or a third-party service could be down. You’ll want to communicate this across your application layers and downstream infrastructure in a standardized way. Errors that degrade your service in some way (e.g., a dependent service being down, timeouts, or a bug) should become an incident and be handled according to your team’s Incident Management Strategy.
No matter your strategy, good incident handling starts at the application level (the programmer). Remember the axiom, “Garbage In, Garbage Out.”
If your programming language has an Error construct, use it. Strings do not represent errors well and lack critical features, such as typing, structured data, context, and stack traces. Errors should describe clearly what is happening, along with any context and a stack trace. Ideally, they should be typed or standardized in some way.
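To make that concrete, here is a minimal sketch of a typed error that carries structured context. The `PaymentDeclinedError` class, its `code` field, and the `chargeCard` function are all hypothetical names invented for this illustration; they do not come from any library.

```typescript
// Hypothetical error type for illustration; not part of any library.
class PaymentDeclinedError extends Error {
  // A stable, machine-readable code that logs and alerts can filter on.
  readonly code: string = "PAYMENT_DECLINED";
  // Structured context travels with the error instead of being baked into a string.
  readonly context: { orderId: string; amountCents: number };

  constructor(message: string, context: { orderId: string; amountCents: number }) {
    super(message);
    this.name = "PaymentDeclinedError";
    this.context = context;
    // Keeps `instanceof` working even when compiled to an older JavaScript target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Stand-in for a real payment call: declines any amount above 100000 cents.
function chargeCard(orderId: string, amountCents: number): void {
  if (amountCents > 100000) {
    throw new PaymentDeclinedError("Card was declined", { orderId, amountCents });
  }
}

try {
  chargeCard("order-42", 250000);
} catch (err) {
  if (err instanceof PaymentDeclinedError) {
    // The handler branches on the type and reads typed fields; no string parsing.
    console.log(err.code, err.context.orderId, err.context.amountCents);
  } else {
    throw err;
  }
}
```

Because it extends the built-in `Error`, the object still gets a message and a stack trace, while the type and the `context` field give downstream handlers something more reliable than a string to work with.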
Be very careful about swallowing unexpected errors. Errors that are not handled by the immediate code should always propagate up the call stack, and every layer should only handle errors that it knows what to do with. Eventually, if an error makes it all the way to a global “catch-all” error handler, it should still be dealt with as gracefully as possible and, more importantly, become an incident that alerts someone on your team before your users report it.
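A small sketch of that layering might look like the following. Everything here (`NotFoundError`, `loadUser`, `handleGetUser`, `serve`, and the in-memory `incidents` list) is a made-up name for illustration, and the `incidents` array is only a stand-in for whatever alerting you actually use.

```typescript
// Hypothetical names throughout; this is a sketch, not a framework API.
class NotFoundError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NotFoundError";
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

type Reply = { status: number; body: string };

const users = new Map<string, string>([["1", "Ada"]]);

// Data layer: throws, and never decides how the error is presented.
function loadUser(id: string): string {
  const name = users.get(id);
  if (name === undefined) throw new NotFoundError(`No user with id ${id}`);
  return name;
}

// Request layer: handles the one error it understands and rethrows the rest.
function handleGetUser(id: string): Reply {
  try {
    return { status: 200, body: loadUser(id) };
  } catch (err) {
    if (err instanceof NotFoundError) return { status: 404, body: err.message };
    throw err; // unexpected, so let it keep propagating
  }
}

// Stand-in for paging someone or reporting to an error tracker.
const incidents: unknown[] = [];

// Global catch-all: respond gracefully and record an incident.
function serve(handler: () => Reply): Reply {
  try {
    return handler();
  } catch (err) {
    incidents.push(err);
    return { status: 500, body: "Something went wrong" };
  }
}

console.log(serve(() => handleGetUser("1")).status);
console.log(serve(() => handleGetUser("2")).status);
```

The missing user is an expected error, so it becomes a 404 and never reaches the catch-all. Anything the request layer does not recognize falls through to `serve`, which is the only place that turns an error into an incident.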
I highly recommend this talk given by Lewis Ellis. Although it is tailored to Node applications, his advice translates well to other platforms and is very easy to follow.
5. Close The Feedback Loop
If an Error Log falls in the forest... Seriously though, if your errors aren’t making noise, their value is greatly diminished. Do you have a noisy Slack channel where many errors and warnings are dumped every day? Or do they just sit on a server somewhere?
If you don’t already have one, come up with an Incident Management Strategy for your application. Think of it as a pipeline, where errors and warnings are captured (standardized, structured data is helpful), and then something happens, depending on the service and error level. Maybe it integrates with PagerDuty to wake up an engineer at 4am.
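As a rough sketch of the “something happens, depending on the service and error level” step, the routing rule can be a small, testable function. The levels, actions, and service names below are assumptions made up for this example; your own pipeline will have its own.

```typescript
// Hypothetical shapes and names, used only to illustrate the pipeline idea.
type Level = "warning" | "error" | "critical";
type Action = "page-on-call" | "create-ticket";

interface IncidentEvent {
  service: string;
  level: Level;
  message: string;
}

// Services whose failures directly affect customers (an assumption for this sketch).
const customerFacing = new Set<string>(["checkout", "login"]);

// Decides what happens to a captured event based on its service and level.
function route(event: IncidentEvent): Action {
  if (event.level === "critical") return "page-on-call";
  if (event.level === "error" && customerFacing.has(event.service)) return "page-on-call";
  // Everything else still gets tracked, so nothing just sits on a server.
  return "create-ticket";
}

const sample: IncidentEvent = { service: "checkout", level: "error", message: "Payment API timeout" };
console.log(route(sample));
```

Keeping the rule in one pure function makes it easy to review who gets woken up, and for what.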
Once engineers are tired of waking up at 4am, they will realize that any error, regardless of customer impact, is a big problem. This is the signal vs. noise problem, or “The Boy Who Cried Wolf.” If you have lots of errors and warnings, then they all lose meaning. Every error and warning should be taken seriously and prioritized accordingly. If they are not really warnings or errors after all, then perhaps it’s time to revisit your team’s error handling or coding practices.
6. Use a Type-Safe Programming Language
Modern web applications mostly pass data around in different shapes, and type-safe languages make it easy to define and declare those data structures as first-class citizens in the language.
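For example, here is a minimal TypeScript sketch of declaring data shapes up front. The `Customer` and `Order` types and the `summarize` function are hypothetical, chosen only to show the idea.

```typescript
// Hypothetical domain types for illustration.
interface Customer {
  id: string;
  email: string;
}

// A discriminated union: each order state carries only the fields that make sense for it.
type Order =
  | { status: "pending"; customer: Customer }
  | { status: "shipped"; customer: Customer; trackingNumber: string }
  | { status: "cancelled"; customer: Customer; reason: string };

function summarize(order: Order): string {
  switch (order.status) {
    case "pending":
      // Reading order.trackingNumber here would be a compile-time error.
      return `Order for ${order.customer.email} is pending`;
    case "shipped":
      return `Shipped, tracking ${order.trackingNumber}`;
    case "cancelled":
      return `Cancelled: ${order.reason}`;
  }
}

const ada: Customer = { id: "c1", email: "ada@example.com" };
console.log(summarize({ status: "shipped", customer: ada, trackingNumber: "TRK-1" }));
```

The compiler rejects a “shipped” order without a tracking number, so that whole class of mistakes is caught before the code ever runs.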
7. Code with GUTs (Good Unit Tests)
Unit Tests provide a “green light” which tells programmers that they haven’t broken anything. Clever programmers run unit tests frequently during the development process, especially during a refactor. Without this green light, it can be daunting to do these refactors, which are absolutely necessary to prevent the accumulation of technical debt (see #1).
Unfortunately, many unit tests I’ve seen are brittle, confusing, and overwhelming. These tests end up breaking every time there is a change and are a headache to deal with. Programmers end up neglecting these tests, or just force them to pass so they can go home. These tests actually become worse than useless — they give a false sense of security. GUTs test behavior, not implementation. They are easy to read and maintain. GUTs read like a straightforward specification, helping the reader to understand what the code being tested should be doing.
Here are a few suggestions:
- Leverage functional programming and layered architecture to minimize the dependencies you need to mock out. Pure functions are much easier to write unit tests for. Consider separating application layers, such as with hexagonal architecture. This way, you can push code that is difficult to unit test out to the edges of your application.
- Test behavior, not implementation. This is known as Behavior-Driven Development (BDD). You don’t need to test every internal function directly to get to 100% coverage.
- Simplify your assertions. You shouldn’t need to learn a new language and write a novella to assert that two things are equal. In fact, some suggest one assertion is all you really need.
- Break your tests apart into sections. I like the “given, when, then” pattern. Your code should read in a straightforward manner.
- Keep it simple. Don’t try to do too much in one test. A little repetition is OK. I’d rather see some repetition than a test that goes on for hundreds of lines.
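Here is a small sketch that puts several of these suggestions together: a pure function, one assertion per test, and the “given, when, then” layout. The `applyDiscount` function is a hypothetical example, and `spec` and `expectEqual` are tiny stand-ins for a real test runner so that the snippet has no dependencies.

```typescript
// Pure function under test (hypothetical example): no I/O, so nothing to mock.
function applyDiscount(totalCents: number, loyaltyYears: number): number {
  const percent = Math.min(loyaltyYears * 5, 20); // 5% per year, capped at 20%
  return totalCents - Math.round((totalCents * percent) / 100);
}

// Minimal stand-ins for a test runner; a real project would use its own framework.
function spec(name: string, body: () => void): void {
  body();
  console.log(`ok - ${name}`);
}
function expectEqual(actual: number, expected: number): void {
  if (actual !== expected) throw new Error(`expected ${expected}, got ${actual}`);
}

spec("a two-year customer gets 10% off", () => {
  // Given
  const totalCents = 10000;
  const loyaltyYears = 2;
  // When
  const discounted = applyDiscount(totalCents, loyaltyYears);
  // Then
  expectEqual(discounted, 9000);
});

spec("the discount is capped at 20%", () => {
  // Given
  const totalCents = 10000;
  const loyaltyYears = 10;
  // When
  const discounted = applyDiscount(totalCents, loyaltyYears);
  // Then
  expectEqual(discounted, 8000);
});
```

Neither test knows how the discount is calculated internally, so the implementation can be refactored freely as long as the behavior stays the same.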
8. Use End-to-End Integration Testing Tools
Just like unit tests, end-to-end integration tests help programmers know that they haven’t caused any regressions during the development process. Unlike unit tests, however, they test the entire system. There are many parts of an application that are difficult to write unit tests for, particularly things like UIs and database writes. For these, end-to-end integration tests might be the best bang for your buck. Of course, they also greatly reduce the amount of repetitive manual testing required, and they increase the chances of catching a bug before deployment.
When talking about web-based applications, this usually means using a tool like Selenium to drive a website, similarly to a human, but much faster.
It’s important that these tests be as time-efficient as possible. Waiting on a slow test suite is a productivity killer, and it discourages programmers from running or maintaining the tests. One such optimization is to run tests in parallel.
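One common way to parallelize, though not the only one, is to split the suite into shards and give each shard to its own worker or CI job. The sketch below shows only the splitting step; `shard` is a hypothetical helper, the file names are made up, and actually launching the workers depends on your test tool.

```typescript
// Hypothetical helper: round-robin split of test files into `workerCount` shards.
function shard(files: string[], workerCount: number): string[][] {
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new RangeError("workerCount must be a positive integer");
  }
  const shards: string[][] = [];
  for (let worker = 0; worker < workerCount; worker++) {
    // Worker N takes every file whose position leaves remainder N.
    shards.push(files.filter((_, index) => index % workerCount === worker));
  }
  return shards;
}

const testFiles = ["login.test", "checkout.test", "search.test", "profile.test", "cart.test"];
console.log(shard(testFiles, 2));
```

This only works well if the tests are independent of each other, for example by not sharing user accounts or database rows between shards.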
It’s also important that tests be easy to write and maintain. One suggestion is to avoid relying on element IDs or CSS classes in your selectors, since these can change along with implementation and styling changes.
9. Do Manual Testing
Even if you were to have 100% unit test coverage and extensive end-to-end automated integration test coverage, I would still recommend doing at least some basic manual smoke testing — in both test and production environments — focusing around the areas of impact.
Unless you are employing some kind of visual perceptual diff tool, things can still be visually wrong in your UI and pass all of your tests. These kinds of problems will be much more obvious to humans.
While we’re on the topic, you should always test in production after launch. “But why test in production when you’ve already tested in staging?” you ask. “Isn’t staging the same as production?” Not exactly. They are different environments, running different configurations, pointing to different physical databases (even if they have identical data), and probably with a bunch of other subtle differences, too (e.g., front-end compression, logging, debug mode, and security settings). Before you fire up vim and start writing production tests, I must warn you that I generally do not recommend running automated tests in production. Any kind of bug in your tests could have disastrous effects.
By the way, manual testing doesn’t have to be entirely the burden of engineers or your QA department. You can outsource this! There are many companies and independent professionals around the world that can do manual testing for you at a very affordable rate, 24/7. In fact, I worked for a scrappy startup (Advizr) where we outsourced manual testing in lieu of automated integration tests, and it happened to work out well for us.
10. Learn From Your Mistakes
Mistakes are inevitable, but as long as you learn from them, you have a chance to be successful. At my last company, I created a policy that for every production bug, the engineer who diagnosed and solved the issue must research and write a “Post Mortem,” along with their recommendations for how to prevent a similar issue in the future.
If you would like to read more about this topic and nerdy space things, please consider following me! This is my first Medium article, so any feedback is appreciated.
I have over 15 years of experience building mission-critical applications across medical, financial, and e-commerce industries. This past year, I worked on improving reliability in clinical trial software. I am passionate about pushing our profession towards a higher level of accountability. Visit my website.