Debugging On Ancient Game Platforms
When you run into obscure programming bugs on ancient game platforms, it's often a challenge to get useful diagnostic information out of them, especially compared to modern development environments. This article provides some helpful general advice for debugging, as well as suggesting some ways you can narrow that gap between old and new development tools.
Contents
- 1 Don't Panic
- 2 What Changed?
- 3 Use the Background Color
- 4 ASSERT that certain conditions aren't happening (or crash to confirm they are)
- 5 Disable Code Blocks
- 6 Unit Testing
- 7 Use Modern Debugging
- 8 Common Causes of Catastrophic Problems
- 9 When Problems Manifest Only On One Particular Console, Or Only With One Tester
- 10 When Nothing Else Works To Squash The Bug, Trust Nothing
- 11 Avoid Bugs In the First Place
Don't Panic
When your game is doing things that it shouldn't be doing, and you can't figure out why, it's natural to feel some gloom and despair. This only gets worse the more you throw yourself at the problem without any results. Try not to give in to despair. You're better than the problem, and with a structured approach and a bit of persistence, you'll eventually understand it and squash it.
What Changed?
If the bug is new, the first thing you should be asking yourself is what code was changed or added recently. It won't always guide you to the bug with 100% accuracy, since the new code can highlight some deficiency in your older code, but the new code is a very good place to start looking.
This also highlights the reason to keep older copies of your game around, whether that's manually backing up your project, or working on it in a system with revision control. Being able to test new and older versions in tandem can give you insight into when the bug first occurred, and what changes were added at that time. If you've just been editing the same source file over and over again, you're at the mercy of your best recollection.
Use the Background Color
A quick way to confirm that a particular code block executes, or that a certain program condition is true, is to set the background color to a unique color index when that happens. This technique requires minimal code change to your game, and sends a clear and obvious signal.
If your bug is due to running out of CPU time, changing the background color is also helpful for measuring CPU utilisation on systems that execute code during the visible screen. If you change the background color before a given routine, and change it back after, the height of the colored bar on the screen will be proportional to the CPU time taken.
If you know how many CPU cycles it takes to draw a scanline, you could even figure out about how many cycles your routine takes. I don't bother with that, as it's more helpful to think of your routine's CPU time in relation to how much CPU time you have for the whole frame.
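For anyone who does want the numbers, the arithmetic is small. The sketch below assumes the NTSC Atari 2600 timing of 76 CPU cycles per scanline and 262 scanlines per frame; other platforms differ, so treat both constants as placeholders. The function names are hypothetical.

```c
/* Assumed timing for an NTSC Atari 2600; substitute your platform's values. */
#define CYCLES_PER_SCANLINE 76
#define SCANLINES_PER_FRAME 262

/* Approximate CPU cycles used by a routine whose colored bar is
   bar_height scanlines tall. */
long bar_to_cycles(int bar_height)
{
    return (long)bar_height * CYCLES_PER_SCANLINE;
}

/* The same measurement as a percentage of the whole frame's CPU time. */
double bar_to_frame_percent(int bar_height)
{
    return 100.0 * bar_height / SCANLINES_PER_FRAME;
}
```

Under those assumptions, a bar 10 scanlines tall is roughly 760 cycles. The percentage form matches the frame-relative way of thinking described above.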
ASSERT that certain conditions aren't happening (or crash to confirm they are)
The assert command is found in many modern languages, and is used to abort the program if a certain condition isn't true.
assert(myvariable > 0); // myvariable should always be greater than zero.
You can do the old-hardware equivalent by executing an infinite loop when certain conditions occur. This makes the unanticipated failure spectacular, instead of leaving you to chase some subtle bug caused by a knock-on effect.
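Here is a minimal sketch of such a trap, written in C for readability; on the real platform you would write the same few steps in assembly or BASIC. It combines the halt with the background color trick from earlier. The variable fake_background_color is a stand-in for a memory-mapped color register, and trap_if is a hypothetical name.

```c
/* Stand-in for a memory-mapped background color register. On real hardware
   this would be a fixed address from the platform's documentation. */
volatile unsigned char fake_background_color = 0x00;

/* If the "impossible" condition has occurred, flag it with a distinctive
   color and halt in an infinite loop so the failure can't be missed. */
void trap_if(int bad_condition, unsigned char color)
{
    if (bad_condition) {
        for (;;) {
            /* Spin forever, rewriting the color in case other code
               (e.g. an interrupt) tries to change it back. */
            fake_background_color = color;
        }
    }
}
```

When the condition is false the call returns immediately and changes nothing, so the traps can be left in place while you hunt the bug.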
Disable Code Blocks
If the bug doesn't obviously belong to any particular game functionality, then by skipping over certain blocks (or replacing them with simpler, less-functional versions) you may get some insight into which blocks are triggering the bug.
If you have a lot of code blocks to work through, you can use a divide and conquer approach (aka binary search). Out of all of the possible blocks you can disable, test disabling the first half of them. If the bug doesn't occur, the culprit is in the first half; to confirm, re-enable the first half and disable the second half, and the bug should manifest again. If the bug still occurs with the first half disabled, the culprit is in the second half. Either way, you now know which set of code blocks your problem is in. Then split that set into two halves, test each of those in turn, and repeat the splitting and testing until you've narrowed the bug down to one specific code block.
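The halving procedure can be sketched as a loop. In practice you are the one rebuilding and playing the game at each step, so the callback below is a hypothetical stand-in for "disable these blocks, run the game, and report whether the bug still shows up". The sketch assumes exactly one block is responsible and that disabling it makes the bug disappear.

```c
#include <stddef.h>

/* Hypothetical callback: run the game with blocks [first, last) disabled and
   return nonzero if the bug still shows up. */
typedef int (*bug_test_fn)(size_t first, size_t last);

/* Narrow a bug down to one block out of block_count, assuming exactly one
   block is responsible and disabling it makes the bug disappear. */
size_t find_buggy_block(size_t block_count, bug_test_fn bug_occurs_without)
{
    size_t lo = 0, hi = block_count;   /* culprit is somewhere in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (bug_occurs_without(lo, mid)) {
            lo = mid;   /* bug survived disabling [lo, mid): look in [mid, hi) */
        } else {
            hi = mid;   /* bug vanished: culprit is in [lo, mid) */
        }
    }
    return lo;
}

/* Simulated project for demonstration: block 5 of 8 contains the bug. */
int simulated_bug_occurs_without(size_t first, size_t last)
{
    const size_t culprit = 5;
    return !(culprit >= first && culprit < last);
}
```

With 8 blocks this takes 3 test runs instead of up to 8, and the saving grows quickly as the number of blocks increases.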
Unit Testing
Unit testing means testing a particular code block with specific inputs, and checking that it produces the expected output. Code that may look 100% right on first, second, and third glance can have subtle bugs you may not have considered. In the face of a tough bug, don't assume routines are correct if you haven't proven them correct. If you've recently modified a routine, however trivially, and you haven't performed a new unit check, you can't assume it's still good.
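As an illustration, here is a small sketch in C. The routine add_score_clamped is hypothetical, invented for this example; the point is the shape of the test, which pairs specific inputs with expected outputs and pays special attention to the edge cases.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical routine under test: add points to an 8-bit score,
   clamping at 255 instead of wrapping around to 0. */
uint8_t add_score_clamped(uint8_t score, uint8_t points)
{
    uint16_t sum = (uint16_t)score + points;
    return (sum > 255) ? 255 : (uint8_t)sum;
}

/* Specific inputs paired with the outputs we expect, including the edges. */
void test_add_score_clamped(void)
{
    assert(add_score_clamped(0, 0) == 0);
    assert(add_score_clamped(10, 5) == 15);
    assert(add_score_clamped(250, 5) == 255);   /* exactly at the limit */
    assert(add_score_clamped(250, 6) == 255);   /* one past the limit */
    assert(add_score_clamped(255, 255) == 255); /* worst case */
}
```

Re-running test_add_score_clamped after every change to the routine, however trivial, is what lets you keep trusting it.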
Use Modern Debugging
Many emulators for old systems provide modern debuggers, which among other things allow you to:
- set watch points: When a particular variable is read from or written to, code execution will pause, the assembly code involved will be highlighted, and you'll have the opportunity to manually step through it, observing how register values and memory locations change.
- set break points: When a particular part of the program is reached, code execution will pause, the assembly code involved will be highlighted, and again you can manually step through it.
- dump memory: At any point in your program execution, you can write the contents of some memory to a file on your PC, for analysis.
- log a CPU trace: Every instruction the CPU executes will be written to a log. This can be used to see where a program went off the rails.
- view memory in real-time: This allows you to see memory values as the program runs, or as you step through it instruction-by-instruction.
For the Atari 7800 platform, knowing your way around the A7800/MAME debugger is an essential skill (see Introduction to the MAME debugger). On the Atari 2600, the same can be said for the Stella debugger.
Common Causes of Catastrophic Problems
- Stack abuse: If your program has unmatched gosub/return or jsr/rts statements, you'll wind up with the stack overwriting memory it shouldn't, and unexpected crashing. Same thing if you have unmatched stack pushes and pops.
- Stack exhaustion: If you're using a portion of the stack memory as regular variables, as is often the case with 6502 based platforms, using too many nested subroutines may lead to those variables being unintentionally overwritten.
- Interrupts not correctly starting or exiting: When a 6502 interrupt starts up, you should save all of the register values immediately, as well as clearing decimal mode, if you do any addition or subtraction in the interrupt. Before you exit the interrupt, the same register values should be restored.
- Interrupts reusing general purpose variables: Interrupts shouldn't be using temp variables that your main code uses, because your main code might just find those values have magically changed mid-routine, due to a triggered interrupt. Similarly, you need to be careful with other variable changes, and always consider that the interrupt code may become active anywhere in the body of your main code.
- Memory map overlap: It's common to reuse memory locations for two different purposes, on old platforms, because often there's not enough memory to go around. You need to be careful to ensure both memory uses don't accidentally overlap. Sometimes your code will start out without overlap, and then you later expand one of those uses with the faulty assumption the memory is dedicated.
- Off-by-one errors: It's a common programming error to have a loop run for one more iteration than intended. This can be disastrous if the code in question sets up jump tables, causes a write to unintended RAM, or causes a bank-switch hotspot to be hit.
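Memory map overlap in particular lends itself to a mechanical check. The sketch below is one possible approach, run on your PC rather than on the console: list every region that is meant to be in use at the same time, and count the pairs that share a byte. The struct, function names, and addresses are all hypothetical.

```c
#include <stddef.h>

/* One use of RAM: a starting address and a length in bytes. */
struct mem_region {
    const char *name;
    unsigned start;
    unsigned length;
};

/* Return 1 if the two regions share at least one byte, else 0. */
int regions_overlap(const struct mem_region *a, const struct mem_region *b)
{
    return a->start < b->start + b->length &&
           b->start < a->start + a->length;
}

/* Count overlapping pairs in a table of regions that are meant to be
   in use at the same time. */
size_t count_overlaps(const struct mem_region *regions, size_t count)
{
    size_t overlaps = 0;
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (regions_overlap(&regions[i], &regions[j])) {
                overlaps++;
            }
        }
    }
    return overlaps;
}
```

For example, an 8-byte table at 0x40 followed by another at 0x48 reports no overlap, but growing the first table to 9 bytes without moving the second is caught as one overlapping pair.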
When Problems Manifest Only On One Particular Console, Or Only With One Tester
On 6502 based platforms, a common assembly language typo is to use something like "lda 0" when you meant to use "lda #0". The former loads the accumulator with the value at memory location 0, while the latter means to load the literal value 0 into the accumulator.
On many of those same platforms, location 0 is a register that often happens to be 0, or it has floating (undriven) bits, or both. In most cases, the value read from location 0 will be 0, because 0 was the last value on the bus, being the second byte of the instruction. On some rarer consoles, the floating bits don't return the last value that was on the bus prior to the read, which makes the "lda 0" mistake strikingly obvious.
If your emulator supports it, you should test your program with undriven bits randomized (e.g. Stella, the 2600 emulator, supports this).
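To see why the typo can hide so well, here is a toy model in C. It is not a description of any specific console: it assumes that every bit at location 0 is undriven, and it stands in for console-specific behavior with a hypothetical noise_mask parameter.

```c
#include <stdint.h>

/* Toy model of a read from an undriven (floating) address: the CPU sees
   whatever was last on the data bus, optionally disturbed by console-specific
   noise. noise_mask = 0 models the common, well-behaved case. */
uint8_t read_floating(uint8_t last_bus_value, uint8_t noise_mask)
{
    return last_bus_value ^ noise_mask;
}

/* "lda #0": the operand itself is the value loaded. */
uint8_t lda_immediate(uint8_t operand)
{
    return operand;
}

/* "lda 0" by mistake, where location 0 is undriven: the operand byte (0x00)
   was the last thing on the bus, so that is what a quiet bus returns. */
uint8_t lda_zero_page_undriven(uint8_t operand_address, uint8_t noise_mask)
{
    return read_floating(operand_address, noise_mask);
}
```

With a noise_mask of 0, both forms load 0 and the typo is invisible. With any other mask, the mistaken form loads a non-zero value, which is the situation that randomizing undriven bits in an emulator is meant to provoke.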
You should also keep in mind that some bugs tend to be triggered by certain play-styles more than others. As a programmer, you know the correct and ideal way to play the game, so you unintentionally avoid triggering the bug. After all, if you had triggered it before, you would have fixed it. When you get an unusual report from a tester, always make a point of getting all of the circumstances under which it occurred, so you can learn to reproduce it yourself.
When Nothing Else Works To Squash The Bug, Trust Nothing
When you've done all of the stuff written about here, and you still can't find the bug, you've made some bad underlying assumption somewhere. If the bad assumption isn't in your code per se, then it may come from assumptions you've made about the underlying platform or development tools. Verify all assumptions inherent in your code. Trust nothing that you haven't personally experimentally verified.
Avoid Bugs In the First Place
The easiest way to squash bugs is to not introduce them in the first place. Ensure your code is self-documenting with expressive variable names, that you use indentation to relay semantic meaning, and that you use program structure like subroutines to make it easier to read the code flow. The latter also makes unit testing easier!