Debugging On Ancient Game Platforms

From 8BitDev.org - Atari 7800 Development Wiki
Revision as of 07:24, 20 March 2021

When you run into obscure programming bugs on ancient game platforms, it's often a challenge to get useful diagnostic information out of them, especially compared to modern development environments. This article provides some helpful general advice for debugging, as well as suggesting some ways you can narrow that gap between old and new development tools.


Don't Panic

When your game is doing things that it shouldn't be doing, and you can't figure out why, it's natural to feel some gloom and despair. This only gets worse the more you throw yourself at the problem without any results. Try not to give in to despair. You've managed to build up your code to its current state, so you're smart enough to squash the bug. If you take a structured approach and dig deep for a bit of persistence, you'll eventually understand why the bug is happening, and how to squash it.


What Changed?

If the bug is new, the first thing you should be asking yourself is what code was changed or added recently. It won't always guide you to the bug with 100% accuracy, since the new code can highlight some deficiency in your older code, but the new code is a very good place to start looking.

This also highlights the reason to keep older copies of your game around, whether that's manually backing up your project, or working on it in a system with revision control. Being able to test new and older versions in tandem can give you insight into when the bug first occurred, and what changes were added at that time. If you've just been editing the same source file over and over again, you're at the mercy of your best recollection.


Use the Score Display

If your platform supports some kind of score display, this is a natural place to output numeric information and double-check variables, without perturbing your game code a great amount. [1] Assuming your platform's score display works with Binary Coded Decimal (BCD) representation, as many do, you have three options: convert your ordinary binary variables to BCD before updating the score with them, live with diagnostic values limited to the range 0-9 (where BCD and binary are identical), or ensure your score display supports hexadecimal digits.
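
As a minimal sketch of the first option, the C function below converts a small binary value to packed BCD. It assumes the score display takes two decimal digits per byte (tens in the high nibble, ones in the low nibble); the name to_packed_bcd is made up for this example, and your platform's score format may differ.

```c
#include <stdint.h>

/* Convert a binary value to packed BCD: tens digit in the high nibble,
   ones digit in the low nibble. Values above 99 do not fit in two
   digits, so they are clamped to 99 (an arbitrary choice for this sketch). */
uint8_t to_packed_bcd(uint8_t value)
{
    if (value > 99) {
        value = 99;
    }
    return (uint8_t)(((value / 10) << 4) | (value % 10));
}
```

For example, to_packed_bcd(42) yields $42, which a BCD score display shows as "42". On the 6502 itself you would more likely use a lookup table or a short subtraction loop, since there is no divide instruction.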


Use the Background Color

A quick way to confirm execution of a particular code block, or to confirm that a certain program condition is true, is to set the background color to a unique value when that happens. This technique requires minimal code change to your game, and sends a clear and obvious signal.

If your bug is due to running out of CPU time, changing the background color is also helpful for measuring CPU utilisation on systems that execute code during the visible screen. If you change the background color before a given routine, and change it back after, the height of the colored bar on the screen will be proportional to the CPU time taken.

If you know how many CPU cycles it takes to draw a scanline, you could even figure out about how many cycles your routine takes. I don't bother with that, as it's more helpful to think of your routine's CPU time in relation to how much CPU time you have for the whole frame.
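
For anyone who does want the numbers, the arithmetic can be sketched in C as below. The figures of 76 CPU cycles per scanline and 262 scanlines per frame are for an NTSC Atari 2600 and are assumptions here; other consoles, and systems where video DMA steals CPU cycles, need different values. The function names are invented for this example.

```c
#include <stdint.h>

/* Assumed figures for an NTSC Atari 2600; substitute your platform's. */
#define CYCLES_PER_SCANLINE 76u
#define SCANLINES_PER_FRAME 262u

/* Approximate CPU cycles used by a routine, from the height of its
   colored bar measured in scanlines. */
uint32_t estimate_cycles(uint32_t bar_height, uint32_t cycles_per_scanline)
{
    return bar_height * cycles_per_scanline;
}

/* Share of the whole frame taken by the routine, as a whole percentage
   (rounded down). Returns 0 if the frame height is given as 0. */
uint32_t percent_of_frame(uint32_t bar_height, uint32_t scanlines_per_frame)
{
    if (scanlines_per_frame == 0) {
        return 0;
    }
    return (bar_height * 100u) / scanlines_per_frame;
}
```

With these assumed figures, a bar 10 scanlines tall works out to roughly 760 cycles. The percentage form is usually the more useful of the two, for the reason given above.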


ASSERT that certain conditions aren't happening (or crash to confirm they are)

The assert command is found in many modern languages, and is used to abort the program if a certain condition isn't true.

 assert(myvariable > 0); // myvariable should always be a non-zero value.

You can do the old-hardware equivalent by executing an infinite loop when certain conditions occur. This makes the unanticipated failure spectacular and immediate, instead of leaving you to chase some subtle bug caused by a knock-on effect.
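
One possible shape for this, sketched in C for compilers that target old hardware. The macro name TRAP_UNLESS and the background_color variable are hypothetical; on a real console background_color would be the memory-mapped color register. Setting a color before hanging is an optional extra borrowed from the background color technique, so that different traps can be told apart.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped background color register. */
static volatile uint8_t background_color;

/* Old-hardware style assert: if the condition is false, paint the
   background a distinctive color and spin forever, so the failure is
   impossible to miss. If the condition is true, nothing happens. */
#define TRAP_UNLESS(condition, color)       \
    do {                                    \
        if (!(condition)) {                 \
            background_color = (color);     \
            for (;;) { }                    \
        }                                   \
    } while (0)
```

A call such as TRAP_UNLESS(myvariable > 0, 0x44); then behaves like the assert above, except that the program hangs on a colored screen rather than aborting.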


Disable Code Blocks

If the bug doesn't obviously belong to any particular game functionality, then by skipping over certain blocks (or replacing them with simpler, less-functional versions) you may get some insight into which blocks are triggering the bug.

If you have a lot of code blocks to work through, you can use a divide and conquer approach (aka binary search). Out of all of the possible blocks you can disable, test disabling the first half of them. If the bug doesn't occur, the culprit is probably in the first half; to confirm, re-enable the first half and disable the second half, and the bug should manifest again. Either way, you now know which set of code blocks your problem is in. Then split that set into two halves, testing each of those in turn, and repeat the splitting and testing until you've narrowed the bug down to one specific code block.
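
The halving procedure can be sketched in C as follows. This is only a model of what you would do by hand: the callback stands in for "rebuild with these blocks disabled, run the game, and see whether the bug still appears". It assumes exactly one block is responsible and that disabling that block makes the bug vanish. All names are invented for this example.

```c
#include <stddef.h>

/* Hypothetical callback: returns nonzero if the bug still shows up when
   blocks [first_disabled, first_disabled + count) are disabled and all
   other blocks are enabled. */
typedef int (*bug_test_fn)(size_t first_disabled, size_t count);

/* Narrow a single misbehaving block down by repeated halving.
   Returns the index of the block that triggers the bug. */
size_t find_buggy_block(size_t block_count, bug_test_fn bug_occurs)
{
    size_t first = 0;
    size_t count = block_count;

    while (count > 1) {
        size_t half = count / 2;
        if (bug_occurs(first, half)) {
            /* Bug survives with the first half disabled,
               so the culprit is in the second half. */
            first += half;
            count -= half;
        } else {
            /* Bug vanished, so the culprit is in the first half. */
            count = half;
        }
    }
    return first;
}

/* Stand-in for a manual test run: pretends block 5 is the culprit. */
static int demo_bug_occurs(size_t first_disabled, size_t count)
{
    const size_t culprit = 5;
    int culprit_disabled = culprit >= first_disabled &&
                           culprit < first_disabled + count;
    return !culprit_disabled;
}
```

Calling find_buggy_block(8, demo_bug_occurs) takes three test runs to settle on block 5, where checking the blocks one at a time could take up to eight.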


Unit Testing

Unit testing means testing a particular code block with specific inputs, and checking that it produces the expected output. Code that may look 100% right on first, second, and third glance can have subtle bugs you may not have considered. In the face of a tough bug, don't assume routines are correct if you haven't proven them correct. If you've recently modified a routine (however trivially) and you haven't performed a new unit check, you can't assume it's still good.
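
Here is a minimal sketch of a table-driven unit test in C. The routine add_saturating and its test cases are invented for illustration; what matters is that the table includes the boundary values (0, 255, and the sums on either side of the limit) where subtle bugs tend to hide.

```c
#include <stddef.h>
#include <stdint.h>

/* Routine under test: add to an 8-bit value, saturating at 255
   instead of wrapping around to a small number. */
uint8_t add_saturating(uint8_t value, uint8_t amount)
{
    uint16_t sum = (uint16_t)(value + amount);
    return sum > 255 ? 255 : (uint8_t)sum;
}

/* Runs every case in the table and returns the number of failures;
   0 means the routine produced the expected output for all inputs. */
int test_add_saturating(void)
{
    static const struct {
        uint8_t value, amount, expected;
    } cases[] = {
        {   0,   0,   0 },  /* nothing added to nothing */
        {  10,  20,  30 },  /* ordinary case */
        { 255,   0, 255 },  /* already at the limit */
        { 250,   5, 255 },  /* lands exactly on the limit */
        { 250,   6, 255 },  /* one past the limit */
        { 255, 255, 255 },  /* largest possible inputs */
    };
    int failures = 0;

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        if (add_saturating(cases[i].value, cases[i].amount)
                != cases[i].expected) {
            failures++;
        }
    }
    return failures;
}
```

Re-running test_add_saturating() after every change to the routine, however trivial, is what lets you keep trusting it.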


Use Modern Debugging

Many emulators for old systems provide modern debuggers, which among other things allow you to:

  • set watch points: When a particular variable is read from or written to, code execution will pause, the assembly code involved will be highlighted, and you'll have the opportunity to manually step through it, observing how register values and memory locations change.
  • set break points: When a particular part of the program is reached, code execution will pause, the assembly code involved will be highlighted, and again you can manually step through it.
  • dump memory: At any point in your program execution, you can write the contents of some memory to a file on your PC, for analysis. Some even let you change and reload this memory block.
  • CPU trace logging: Every instruction the CPU executes will be written to a log. This can be used to see where a program went off the rails.
  • view memory in real-time: This allows you to see memory values as the program runs, or as you step through your program instruction-by-instruction.

For the Atari 7800 platform, knowing your way around the A7800/MAME debugger is an essential skill (see Introduction to the MAME debugger). On the Atari 2600, the same can be said for the Stella debugger.


Common Causes of Catastrophic Problems

Here are some common things that can cause spectacular and difficult-to-trace failure in game code. Take these as a check-list, and consider if any of them might be the culprit with your bug.

  • Stack abuse: If your program has unmatched gosub/return or jsr/rts statements, you'll wind up with the stack overwriting memory it shouldn't, and unexpected crashing. The same thing happens if you have unmatched stack pushes and pops. One method of checking for stack abuse is to use a Stack Canary in tandem with a debugger watch point, to ensure the canary isn't written to. [2] Another method, for detecting too many rts/return, is to push the return address of an exception handler to the base of the stack, which will eventually be executed if such a condition repeatedly happens. [3] Lastly, if your stack issue is caused by too many gosub/jsr, the address of the problem gosub/jsr will almost certainly be duplicated again and again in stack memory; check for it in a debugger, and look up what program code is at that location in your assembly list file.
  • Stack exhaustion: If you're using a portion of the stack memory as regular variables, as is often the case with 6502 based platforms, using too many nested subroutines may lead to those variables being unintentionally overwritten.
  • Interrupts not correctly starting or exiting: When a 6502 interrupt starts up, you should save all of the register values immediately, as well as clearing decimal mode if you do any addition or subtraction in the interrupt. Before you exit the interrupt, the same register values should be restored.
  • Interrupts reusing general purpose variables: Interrupts shouldn't be using temp variables that your main code uses, because your main code might just find those values have magically changed mid-routine, due to a triggered interrupt. Similarly, you need to be careful with other variable changes, and always consider that the interrupt code may become active anywhere in the body of your main code.
  • Runaway code execution: If you're not otherwise using interrupts, setting the IRQ vector to an exception handler can provide helpful diagnostic information. [4] In addition to providing feedback on unintentional interrupts, it can alert you to runaway code execution on 6502-based platforms, since the interrupt-generating BRK opcode is represented by a byte value of $00, and $00 is often found in sections of data or empty ROM. If you are already using interrupts in your game, you can get a similar effect by filling your empty program space with NOP instructions, and ending banks/program sections with a jump to the exception handler. [5]
  • Memory map overlap: It's common to reuse memory locations for two different purposes, on old platforms, because often there's not enough memory to go around. You need to be careful to ensure both memory uses don't accidentally overlap. Sometimes your code will start out without overlap, and then you later expand one of those uses with the faulty assumption the memory is dedicated.
  • Off-by-one errors: It's a common programming error to have a loop run for one more iteration than intended. This can be disastrous if the code in question sets up jump tables, causes a write to unintended RAM, or causes a bank-switch hotspot to be hit.
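
To illustrate the stack canary idea from the first item, here is a small C model of a 6502-style stack page. It is a sketch under assumptions, not a recipe for any particular console: the array stack_page, the canary value $A5, and the function names are all invented, and the check is done by polling the byte rather than with the debugger watch point suggested above. Polling is weaker, since it cannot notice a write that happens to store the same value as the canary.

```c
#include <stdint.h>

#define CANARY_VALUE 0xA5u

/* Hypothetical model of a 256-byte stack page. As on the 6502, the
   stack grows downward, so a canary placed just below the deepest
   expected stack use is overwritten if the stack grows too far. */
static uint8_t stack_page[256];

/* Write the canary byte at the chosen position in the stack page. */
void canary_plant(uint8_t canary_index)
{
    stack_page[canary_index] = CANARY_VALUE;
}

/* Returns nonzero while the canary byte still holds its value. */
int canary_intact(uint8_t canary_index)
{
    return stack_page[canary_index] == CANARY_VALUE;
}

/* Simulated push: store at the stack pointer, then decrement it. */
void model_push(uint8_t *stack_pointer, uint8_t value)
{
    stack_page[*stack_pointer] = value;
    (*stack_pointer)--;
}
```

In a real game you would plant the canary once at startup and check it once per frame, trapping as soon as it changes.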


When Problems Manifest Only On One Particular Console, Or Only With One Tester

On 6502 based platforms, a common assembly language typo is to use something like "lda 0" when you meant to use "lda #0". The former loads the accumulator with the value at memory location 0, while the latter means to load the literal value 0 into the accumulator.

On many of those same platforms, location 0 is a register that often happens to be 0, and/or has floating bits at that location. In most cases, the value returned from 0 will be 0, as it was the last value on the bus, due to being the second byte of the opcode. On some rarer consoles, the floating bits return a different value, which makes the "lda 0" mistake strikingly obvious.

If possible on your emulated platform, you should test your program with undriven bits being randomized. (e.g. Stella, the 2600 emulator, supports this.)

You should also keep in mind that some bugs tend to be triggered by certain play-styles more than others. As a programmer, you know the correct and ideal way to play the game, so you unintentionally avoid triggering the bug. After all, if you had triggered it before, you would have fixed it. When you get an unusual report from a tester, always make a point of getting all of the circumstances under which it occurred, so you can learn to reproduce it yourself.


When Nothing Else Works To Squash The Bug, Trust Nothing

When you've done all of the stuff written about here, and you still can't find the bug, you've made some bad underlying assumption somewhere. You need to figure out what basic assumptions you've made, and test all of them to ensure they're valid.

The bad assumption doesn't have to be in your code. You may have made some assumptions about the underlying platform, other people's code that you've tied into, or the development tools themselves.

Trust no assumption that you haven't personally verified through unit testing and experimentation.


Avoid Bugs In the First Place

The easiest way to squash bugs is to not introduce them in the first place. Ensure your code is self-documenting with expressive variable names, that you use indentation to relay semantic meaning, and that you use program structure like subroutines to make it easier to read the code flow. The latter also makes unit testing easier!

On 6502 based platforms, the use of hi+lo bytes for indirection/pointers can be a common source of potential bugs. The common convention of accessing the hi and lo bytes as VARIABLENAME+1 and VARIABLENAME means it's easy to miss an accidental reference to the hi byte when you meant to access the lo byte, or vice versa. It's better to reference pointers either with offset constants, i.e. VARIABLENAME+HI and VARIABLENAME+LO [6], or with explicit names that spell out which byte of the pointer they are, i.e. VARIABLENAME_HI and VARIABLENAME_LO.
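
The offset-constant convention carries over to other languages too. As a minimal sketch in C, with LO, HI, and the function names invented for this example, a two-byte little-endian pointer can be accessed like this:

```c
#include <stdint.h>

/* Named offsets into a two-byte pointer stored low byte first,
   as the 6502 expects. */
enum { LO = 0, HI = 1 };

/* Store a 16-bit address into a two-byte pointer variable. */
void pointer_set(uint8_t pointer[2], uint16_t address)
{
    pointer[LO] = (uint8_t)(address & 0xFF);
    pointer[HI] = (uint8_t)(address >> 8);
}

/* Read the 16-bit address back out of a two-byte pointer variable. */
uint16_t pointer_get(const uint8_t pointer[2])
{
    return (uint16_t)(pointer[LO] | (pointer[HI] << 8));
}
```

Because every access names the byte it wants, a mix-up such as writing the high byte of the address into pointer[LO] stands out when reading the code, in a way that a bare +1 or missing +1 does not.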


Authorship

Debugging On Ancient Game Platforms was written by Mike Saarna (aka RevEng) as original content for 7800.8bitdev.org.

Some important points were contributed by the following helpful AtariAge 7800 forum regulars:
  1. score display tip contributed by Perry Thuente, aka Tep392.
  2. stack canary tip contributed by Karl Garrison.
  3. exception routine at the base of the stack tip contributed by TailChao.
  4. irq runaway code handler tip contributed by TailChao.
  5. non-irq runaway code handler tip contributed by TailChao.
  6. hi/lo naming tip contributed by TailChao.