Yesterday, I had some fun playing around with remote debugging and examining crash dumps. In the lab, we’re currently running a lot of noise map simulations for a project we’re involved in. These simulations are distributed over all available computers in the lab, a mixture of old and new, often actively used by other people as desktops. The fun part starts when things do not run as smoothly as planned, which is unfortunately often the case. What else to expect from software that is simultaneously developed and used in production? We’re a research lab, after all.
Our current noise mapper is pretty young and has only recently gained full multi-threading support. Since we use this to squeeze out that extra bit of performance on machines that have more than one core, we started noticing, not surprisingly, inconsistent and hard-to-reproduce crashes. Reproducing the crashes on the development box was not feasible, as you could spend hours waiting for a crash to happen. Debugging the crashed programs on the lab machines themselves wasn’t an option either, as most of those machines didn’t have any debugging tools installed whatsoever.
Last week, I found some time to investigate the problem further. I first started poking around with remote debugging in Visual Studio 2005, but it wasn’t really helpful. Normal use requires that you have identical accounts on both PCs. In the heterogeneous environment of our lab, without central user management and where everyone administers their own desktop PC, this is not an easy thing. However, the remote debugger also has a no-authorization mode (who needs security anyway =), and after disabling the firewall (the rule automatically created by the remote debugger somehow wasn’t enough) and modifying some obscure security setting, I was finally able to connect to it from my dev box. Unfortunately, I only got corrupted call stacks, so it wasn’t of any help at all.
But that’s when I noticed something really wonderful: minidumps. When your program crashes the hard way on WinXP and later, it pops up a dialog box asking if you want to send debug information to Microsoft. You of course don’t want to do that, but at that moment, a minidump has already been created. You just need to copy it somewhere safe before you dismiss the dialog box. Then you copy it to your dev box, open it in Visual Studio and press F5. Et voilà, you get an instant reproduction of the crash, ready to be debugged. After that, solving the problem was almost a breeze.
Why have I never noticed this before? It’s so helpful for investigating problems that happen on machines other than yours. It’s plain magic. I’m thinking of building it into our software so that the crash dumps are automatically sent to me. If I can also avoid the dialog box, I can run the simulations unattended and still be notified of any issue. When a crash happens, the program simply sends me the crash dump and tries to run the next job. I already have something similar for problems that arise in the Python code of the simulations, but this would make it complete.
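To make the idea a bit more concrete, here is a rough sketch of what the unattended part could look like on the Python side. It is not what we run today. The names `suppress_crash_dialog`, `run_jobs` and `dump_dir` are made up for illustration, and I’m assuming that the crashing program writes its own `*.dmp` file into `dump_dir` (for example through `MiniDumpWriteDump`), since the dialog box is switched off here with the Windows `SetErrorMode` call. Sending the dump to me is left out.

```python
import subprocess
import sys
from pathlib import Path

# Windows error-mode flag that turns off the crash dialog box.
SEM_NOGPFAULTERRORBOX = 0x0002


def suppress_crash_dialog():
    """On Windows, ask the OS not to show the crash dialog box.

    Child processes inherit the error mode by default, so calling this once
    in the job runner also covers the simulations it starts.
    Does nothing on other platforms.
    """
    if sys.platform == "win32":
        import ctypes
        ctypes.windll.kernel32.SetErrorMode(SEM_NOGPFAULTERRORBOX)


def run_jobs(commands, dump_dir):
    """Run each command in turn and never stop on a failure.

    Returns a list of (command, returncode, new_dumps) tuples, where
    new_dumps are the *.dmp files that appeared in dump_dir during the job.
    """
    dump_dir = Path(dump_dir)
    results = []
    for command in commands:
        before = set(dump_dir.glob("*.dmp"))
        returncode = subprocess.call(command)
        new_dumps = sorted(set(dump_dir.glob("*.dmp")) - before)
        # Assumption: this is where the dumps would be mailed or uploaded.
        results.append((command, returncode, new_dumps))
    return results
```

A job with a non-zero return code and a fresh dump file is then a crash worth looking at, while the rest of the queue keeps running.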
Perhaps to be continued =)