Two weeks ago, my colleague and I did battle with a rather insidious bug related to ODBC database connections failing. It delayed our software release by a week, but we managed to find a workaround/solution and there was much rejoicing. The root cause was a defect in a DB2 driver where the act of installing a Windows service caused it to drop connections that were established in other processes on the same host. Yikes.
Solving problems like these can be exhausting, but it's an incredibly important skill to master on your software engineering career journey. I certainly have not mastered it yet, but I thought I'd share my hard-earned experiences here. These tips apply to bugs anywhere on the difficulty spectrum:
- Easy: repeatable bug in your application code
- Medium: non-repeatable bug in your application code that is environment/data-specific
- Hard: non-repeatable bug in closed-source 3rd party libraries, operating systems, or other black box dependencies
- Nightmare: non-repeatable bug in lower layers of abstraction: networking or firmware/hardware. My personal favorite is bad RAM.
You won't get far if your application code isn't set up with adequate logging. This can be done many ways and there are lots of foot-guns so my advice is to focus on simply providing yourself with enough meaningful information at the right time with the right context.
You also must understand how your host OS and its services/daemons log things in addition to your own application. Ignore logs from your OS or other dependencies at your own peril.
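As a minimal sketch of "enough meaningful information at the right time with the right context", here is one way to do it with Python's standard logging module. The logger name, the describe_failure helper, and the host/attempt fields are all hypothetical and only there for illustration; the point is that every line carries a timestamp, a source location, and the values you will want when something breaks.

```python
import logging


def build_logger(name: str) -> logging.Logger:
    """Return a logger whose lines carry a timestamp, level, and source location."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s %(funcName)s:%(lineno)d %(message)s"
        ))
        logger.addHandler(handler)
    return logger


def describe_failure(operation: str, **context: object) -> str:
    """Format an operation name plus key=value context into one searchable line."""
    details = " ".join(f"{key}={value!r}" for key, value in sorted(context.items()))
    return f"{operation} failed {details}".rstrip()


log = build_logger("orders")  # hypothetical application name

try:
    raise ConnectionError("connection dropped")  # stand-in for a real failure
except ConnectionError:
    # log.exception writes the message at ERROR level and appends the traceback
    log.exception(describe_failure("db_query", host="db01", attempt=3))
```

Keeping the context as key=value pairs on a single line makes it easier to search your logs later and to line them up against what the OS was logging at the same moment.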
Eliminate the usual suspects
As you gain experience with problem solving and debugging, you will build up your own list of outlier things to check. Eliminate these usual suspects first to save yourself time and suffering. Here are my go-to things to rule out:
- Make sure your libraries, OS patches, and other configuration are consistent: Seems obvious, but major differences in platform versions usually need to be ruled out. These are judgment calls, but be prepared to eliminate them as variables quickly. Ideally your non-production environment(s) match production, but if not, you need to be able to replicate your production environment without much effort. Testing your hypotheses on production systems is your last resort.
- Check out time zones: Many nasty bugs I've encountered have been due to time zone issues so just be aware of it, store stuff in UTC, and pray that the powers that be never create time zones on Mars for the sake of future generations of programmers.
- Disable any kind of antivirus/antimalware security software: These programs are highly invasive and tend to cause issues in surprising ways. Typically they are locking access to files, executables, or network resources.
- Make sure your system clock is in sync: These days everybody syncs with NTP but clocks can still potentially go wrong if the NTP configuration is wrong or if an NTP server is flaky and the hardware clock has a problem. I have two stories on this subject from recent memory:
- Recently I had a system suddenly start throwing errors when users would SSO to another system via SAML. Turns out the application's system clock was behind by a couple minutes which caused the validity timestamps in the SAML payload to be incorrect and rejected by the receiving system. The fix was to point the server at a different NTP server.
- Many years ago, I was seeing random failures submitting credit card transactions to a payment processor. After exhausting all options on the host, I ran the time command a few dozen times and noticed that the system clock would go back in time by one second every five seconds and then fix itself. This broke some assumptions in the payment processor's logic and caused the errors. The fix was to move to a different host.
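The second clock story above boils down to reading the clock repeatedly and looking for a reading that is earlier than the one before it. Here is a rough sketch of automating that check in Python; the function names and sampling parameters are my own invention for illustration, not what I actually ran back then.

```python
import time
from typing import List, Sequence, Tuple


def find_backward_jumps(readings: Sequence[float]) -> List[Tuple[int, float]]:
    """Return (index, seconds_lost) for every reading earlier than the one before it."""
    jumps = []
    for i in range(1, len(readings)):
        delta = readings[i] - readings[i - 1]
        if delta < 0:  # the wall clock should never move backwards between reads
            jumps.append((i, -delta))
    return jumps


def sample_wall_clock(count: int, interval: float) -> List[float]:
    """Read the wall clock `count` times, sleeping `interval` seconds between reads."""
    readings = []
    for _ in range(count):
        readings.append(time.time())  # wall-clock time, which NTP or bad hardware can adjust
        time.sleep(interval)
    return readings
```

Sampling for a minute or so, for example find_backward_jumps(sample_wall_clock(120, 0.5)), should return an empty list on a healthy host. Anything else is worth chasing before you blame your application.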
Prioritize reproducibility, eliminate variables, and inventory assumptions
Once you have eliminated the usual suspects the next step is to do everything in your power to ensure you have repeatable steps to reproduce the issue. This almost always involves the systematic elimination of variables one-at-a-time, which is the core skill in any kind of debugging. Don't ever get tempted to alter several variables at once for the sake of time; for me it's always been a net negative.
The variables you home in on need to be scrutinized using Occam's razor. If you find yourself googling for an obscure error message and find no results, it either means that nobody in the entire universe has ever had this problem or, far more likely, that you are barking up the wrong tree.
You should also document any assumptions you are making along the way. For example, it's easy to assume that your issue doesn't depend on which Linux distro you are using, but this is not always true. As you progress through eliminating variables, revisit your key assumptions and consider them as variables to eliminate.
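To make the one-variable-at-a-time idea concrete, here is a small sketch that takes a known-good configuration and a failing one and produces a list of trials, each differing from the known-good baseline in exactly one variable. The variable names and values are made up for illustration; substitute whatever actually differs between your environments.

```python
from typing import Dict, Iterator, Tuple

Config = Dict[str, str]


def one_change_at_a_time(baseline: Config, suspect: Config) -> Iterator[Tuple[str, Config]]:
    """Yield (variable, config) pairs, each differing from baseline in one variable."""
    for key in sorted(suspect):
        if baseline.get(key) != suspect[key]:
            trial = dict(baseline)      # start from the known-good setup
            trial[key] = suspect[key]   # change only this one variable
            yield key, trial


# Hypothetical environments: one that works and one that fails.
working = {"driver": "11.1", "os_patch": "2023-05", "antivirus": "off"}
failing = {"driver": "11.5", "os_patch": "2023-05", "antivirus": "on"}

for variable, trial in one_change_at_a_time(working, failing):
    print(variable, trial)
```

Each trial is one experiment to run. If none of them reproduces the bug on its own, that is a hint that either two variables interact or one of your assumptions belongs on the list of variables too.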
Prepare to hit brick walls
Just like learning computer programming, debugging this stuff usually involves dead ends, rabbit holes, and failed experiments. It's incredibly important to have the right frame-of-mind when you encounter such things and take appropriate action. Here is what works for me:
- Don't take it personally: Don't mistake your temporary inability to solve an esoteric issue for a judgment on your self-worth. If you have not yet read The Growth Mindset and struggle with this then I strongly recommend reading it. The short version: people are more successful overall when they treat failures as learning opportunities instead of judgments.
- Don't get discouraged: One of the best words of encouragement I ever got was at my first job when a colleague noticed I hit a brick wall and said, "and you're gonna let that stop you?" This little ego challenge gave me the energy to try new things. It doesn't work for everybody but sometimes a little push is what you need.
- Take a walk: A change of venue, mindless manual labor, shower, trip to a coffee shop, or other such activity can give your brain the escape it needs to provide you with new ideas. It's truly amazing how well this works. Just don't stare at your screen from the same seated position for hours on end.
You are not an electrical engineer designing circuit boards dealing with physical constraints. Your constraints are mostly in your mind so it's incredibly important to learn emotional intelligence, avoid traps like the sunk cost fallacy, and figure out what works best for you.
Ask for help
Your ego should not be so fragile that you take things personally, but it also should not be so big that it prevents you from asking for help. The best engineers I have ever worked with knew precisely when to ask for help. This may mean asking a friend, posting a question on Stack Overflow, or contacting a vendor's support team. Try to master this timing by looking back on your own experiences and asking yourself if reaching out for help sooner would have been the right thing to do (or not).
We software engineers operate on top of many layers of abstraction. Your problem may be caused by a layer you don't understand, and you can never be expected to deeply understand all of them. While it's often good to have some level of understanding of the layers beneath yours (to give you hints and instincts), you will need to seek help from experts in their own respective layers.
Communicate with the outside world appropriately
The context in which you are trying to solve these bugs matters a lot. Your blood pressure readings will be different when fixing a bug in a prototype app versus a mission-critical business application with no safety net on Friday at 6:00pm. Who is affected by the issue and how to manage their expectations is very important, almost as important as how efficiently you resolve it.
People are surprisingly understanding when it comes to technical glitches. Things happen, especially in the complex world we live in. They primarily want to know that it is being worked on and, once fixed, that steps are being taken to prevent it from happening again. So post something on your status page (or whatever channel is appropriate) and follow up with a postmortem once everything is fixed. You may be tempted to move on with life once the issue is fixed (because it can be exhausting), but taking the time to communicate effectively will maintain the trust you are working hard to build with your users and colleagues.
Too often I see vendors magically fix critical issues with no acknowledgement or explanation. This erodes my confidence in their ability to manage their own systems and makes me think they aren't interested in learning from their mistakes.
It's also important to properly deal with the knowledge you acquired while solving the issue when you are a member of an engineering team. That could be writing up a root-cause fix to be prioritized later or a document on how to resolve an issue if it comes up again. The worst thing you can do is hoard this knowledge so that you're the only person who can fix it in the future. Doing so is unprofessional and whatever job security you think you will get will be negated by your inability to take sick/vacation time and your team's inability to make progress because you will become a process bottleneck.
There are many things to master when it comes to solving software bugs in the real world. Having a systematic/scientific approach, using common sense, maintaining a healthy mindset, knowing when to ask for help, and communicating effectively are all skills worth developing.
To quote Bill Paxton from the 1986 movie Aliens: "Is this going to be a stand-up fight, sir, or another bug hunt?"
We will all face many bugs in our travels, some of which will push us to our limits... And there will always be more. So be prepared, take ownership, apply what you learn, and enjoy the challenge.