Superbug!

(This entry is for software developers only. If you’re not a software developer, stop reading now, or you risk terminal boredom. πŸ˜‰ )

As I’ve mentioned before, I wrote a program (which I refer to as Project Badger here) that was sold to a larger company, which keeps my company on call to advise on its continued development.

Customers have been complaining about a minor but annoying problem, off and on, for well over a year. We could never reproduce it, and after they complained, the customers couldn’t either. It seemed to happen only under extremely rare conditions, only some of which were under the user’s control. Fortunately, one of the testers in the company found a way to reliably reproduce it (on his machine, at least) a few weeks ago, so the senior developer and I have been on its trail ever since.

The more experienced C++ programmers among you, reading that description, are probably already nodding sagely and muttering phrases like “rogue pointer” or “double delete.” I thought the same from the first report, but tracking it down was another story — one that finally concluded Friday afternoon.

At the time that I started writing the software, more than a decade ago now, I hadn’t had a lot of experience with designing large programs. Despite many years of refactoring, there are parts of it that still show that. This bug was lurking in one of those parts. A C++ object was, in one of its member functions, calling a function that (in a convoluted and round-about way) ended up deleting the calling object before returning to the object’s member function — which was still referring to that object, not knowing that its memory had been freed. This went undetected for so long because it only showed itself when the OS re-used the memory that it had previously occupied and stored a non-zero value in one specific portion of it. If the program had crashed when that happened, we’d have been able to track it down easily, but it would only pop up a nonsensical dialog box before continuing on as if nothing had happened.

Okay, we had the cause, but how do we fix it? We came up with several possible solutions.

The best way would have been to redesign the program so that it never took that path, but that wasn’t a viable option. There were just too many interdependencies; it would have taken at least a couple weeks to redesign and rewrite it. If this had been a major problem, and that code was likely to cause continued difficulties, we could have justified the time and manpower required for that. But such a minor bug, in a place that wasn’t likely to produce any other problems, just wasn’t worth it.

The second-best way was to move the deletion, so that the object would still exist when that function returned. Not very difficult, and I could immediately see a design that, while not very elegant, would have handled it. But it occurred to us that this sounded like a perfect job for a smart pointer, so we hashed out a different system.

We changed the program to use a boost::shared_ptr for those objects instead of a raw pointer, storing a boost::weak_ptr to the object within the object itself. An object pointed to by a shared_ptr is reference-counted, and is automatically deleted when the last shared_ptr reference to it goes out of scope; a weak_ptr is exactly the same except that it doesn’t add to the reference count, so putting a weak_ptr in the object itself wouldn’t hold the object in perpetual existence the way that putting a shared_ptr to it there would have.

Then, at the beginning of the function that triggered the problem, we created a stack-allocated shared_ptr to the object from the stored weak_ptr. The result: when the program followed the previously-problematic path, the object would now remain in existence, held there by the function’s locally-created shared_ptr. When that function returned, if the object wasn’t being used elsewhere, it would now be automatically deleted. On every other path it acted exactly the same as before, except that the object no longer needed to be explicitly deleted. ViolΓ !

Not exactly the problem that the creators of the Boost library’s smart pointers set out to solve, but it works, and very elegantly too.

The programming world has one less superbug now. πŸ˜€

12 Comments

  1. That’s pretty funny, you’re creatively mis-using C++ memory management… πŸ˜‰ I hope you documented that for the poor sod who might have to mess with it. πŸ™‚

  2. We did it on Friday evening, so it hasn’t been reviewed or entered into the version-control system yet, but I’d already planned to make sure there’s a comment explaining the purpose behind the function’s local shared_ptr. Otherwise some well-meaning future coder will inevitably see that it’s not used in the function and “helpfully” delete it. πŸ™‚

  3. I’ve never understood that criticism. Maybe it’s just my inner control freak, but I like having full control over where things get deleted. Relying on someone else’s code to read my mind seems a bit… over-optimistic, to say the least.

    It probably helps that I haven’t had to use a language with a built-in garbage collector for quite a while, though. I haven’t been spoiled by it. πŸ™‚

  4. Lisp has a built-in garbage collector, you use that sometimes. Java’s garbage collector for a long time had been… garbage, however.

  5. I don’t know how Lisp does it, but the way I’ve heard that languages like Java do it is by scanning the stack and other places every so often, to see what’s no longer being used. I don’t know if it’s true, but it seems to me that keeping track of what has a handle to a piece of allocated memory, and freeing/destroying the thing automatically as soon as the last handle is released, would be more efficient.

  6. Lisp has several different algorithms, historically garbage collection was so bad that people would temporarily turn it off, but it’s gotten to the point of being a science now and the typical Lisp’s garbage collection tends to be more advanced than Java’s. There’s nothing that performs better than manually handling memory of course, assuming that you’re doing it right. (The problem of course is that the larger the program, the more chance there is that somewhere you’re not doing it right.)

  7. That’s a symptom of a development process in need of work. All you have to do is make sure that when you use new to allocate something, that something is deleted. The two main ways to do that are by assigning it to a smart pointer (which handles the cleanup automatically), or ensuring that all new calls are in constructors (and putting the corresponding delete call in the corresponding destructor).

    (The smart-pointer method is preferable, because if you throw an exception in the constructor, the destructor will never get called. But for most programs, that’s rarely a problem.)

    If you follow those rules, you won’t have any problem with manual memory handling in C++. You just have to be careful on the few occasions where you find a need to do something different.

  8. There’s that “make sure” bit, if “making sure” were so easy in programming, there wouldn’t be such a thing as a memory leak; but there is.

  9. If “making sure” were easy for most people, I’d have a lot more competition. πŸ˜‰ But it is that easy — just ensure that as soon as you add a new, you also add something to delete it, like a smart pointer.

  10. To each his own. I personally prefer the added control. It requires almost no extra effort, if you follow the guidelines I mentioned above, and you know exactly what’s going on at every point in the program.

Comments are closed.