Rewrites Are Dangerous

I don’t think I’m the first one to ever say this, but rewriting software is perhaps one of the riskiest things you can do in a project, not to mention a business.

Rewrites are risky because a human has to transfer knowledge from one system to another, and the two systems end up behaving subtly differently. I’m about to show you exactly what I just caught myself doing while rewriting a part of my “super top-secret” project that I’ve been working on for a few years.

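if (headers.Count == 0 && data.Count == 2) {
  if (!table.ContainsKey(data[0]))
    table.Add(data[0], new List<string>()); // element type assumed to be string
  // assumed from the description below: always store the data under the label
  table[data[0]].Add(data[1]);
}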


This is from a mammoth table parser I’ve saddled myself with supporting. It walks any given HTML table, tries to figure out how a human would parse it, and then stores the result in a dictionary. For the non-programmers out there, it basically says:

If there are no headers on the table and there are exactly two cells in the row, assume that the first cell is the label and the second is the data for that label. We do this by first checking whether there is already an entry for that label. If there is not, create one. Once we are sure the entry exists, store the data under that label.

This effectively means that if a table looks like:

hello world
hello universe

“hello” will be the label with “world” and “universe” as the data.


In a perfect world, everyone would use table tags… but for some reason a certain group of people felt that divs make for better styling of tabular data. That’s a discussion for another day…

Point is, I had to rewrite this guy to handle div-based layouts of tables… and this is what that same part turned into:

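if (titles.Count == 0 && contents.Count == 2)
  if (!table.ContainsKey(contents[0]))
    // assumption: the list is created already holding contents[1], per the description below
    table.Add(contents[0], new List<string> { contents[1] });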

At first glance it looks like it does about the same thing with new words. Instead of headers, they’re called titles. Instead of data, it’s called contents… Read quickly, the two versions look and sound like they do the same thing:

If there are no titles in the table and there are exactly two pieces of content, assume that the first piece of content is the title and the second is the real content we want to store. We do this by first checking that there isn’t already a title matching the first piece of content. Only if there is not do we add the title, with the second piece of content stored as its data.

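This means, given the table above, we’d have “hello” as the title with a single “world” as the data. The second row’s “universe” is silently dropped, because the title already exists by the time we get to it.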
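To make the difference concrete, here is a minimal sketch of both behaviors as described above. It is not the real parser: the names RowGrouping, AddOriginal, AddRewritten and Demo are made up for illustration, the values are assumed to be strings, and the “no headers/titles” check is left out on the assumption that it has already passed.

```csharp
using System;
using System.Collections.Generic;

public static class RowGrouping
{
    // Behavior described for the original parser: create the list for a
    // label only if it is missing, then always append the data.
    public static void AddOriginal(Dictionary<string, List<string>> table, IList<string> data)
    {
        if (data.Count != 2) return;
        if (!table.ContainsKey(data[0]))
            table.Add(data[0], new List<string>());
        table[data[0]].Add(data[1]);
    }

    // Behavior described for the rewrite: the data is stored only while
    // creating the label, so a repeated label drops its data.
    public static void AddRewritten(Dictionary<string, List<string>> table, IList<string> contents)
    {
        if (contents.Count != 2) return;
        if (!table.ContainsKey(contents[0]))
            table.Add(contents[0], new List<string> { contents[1] });
    }

    // Runs both versions over the two-row "hello" table from above.
    public static void Demo()
    {
        var rows = new[]
        {
            new[] { "hello", "world" },
            new[] { "hello", "universe" },
        };
        var original = new Dictionary<string, List<string>>();
        var rewritten = new Dictionary<string, List<string>>();
        foreach (var row in rows)
        {
            AddOriginal(original, row);
            AddRewritten(rewritten, row);
        }
        Console.WriteLine(string.Join(", ", original["hello"]));  // world, universe
        Console.WriteLine(string.Join(", ", rewritten["hello"])); // world
    }
}
```

Calling RowGrouping.Demo() prints both results side by side. The two methods differ in a single place: whether the append happens outside or inside the “does this label exist yet?” check.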

This was a subtle bug that took a little while to track down. As I was transcribing the logic, I likely saw folding the “create” and “store” steps into one as a small optimization. The reality is that I was subtly changing the behavior of an edge case.

An “edge case” is when the input of a function sits at the boundaries of what the solution expects. In this particular instance, you hardly expect a table to have multiple “titles” that are the same but carry different data. However, when it happens, you have to be ready for it. This is an edge case.

Rewrites are dangerous for this reason. It is quite difficult to remember why you did something the way you did it originally. Even if you have detailed notes, you are extremely unlikely to have noted something that seemed obvious to you at the time, when it was actually paramount to how the thing worked in the first place.

Most software doesn’t even come with notes explaining why it is the way it is. People forget, they get in the zone, and they don’t note why they did what they did… ourselves included.

Sometimes, things need rewriting… I get it. You’ll learn why the little things are the way they are, and hopefully, this time it will get documented. In the meantime, you have to deal with the subtle bugs. You may never even find the cause of the problem and end up rewriting something else altogether.

Until next time…