The Data Science Paradox

Let’s talk about organisms for a moment. All multicellular organisms on earth have DNA. It’s a long chain of nucleotides that tells a cell in the organism how to grow, when to divide, and what proteins to build. You can read the DNA to discover what color eyes the organism might have, or how tall it might be.

DNA is a fundamental building block of complex life on earth. However, during its life, the organism might decide not to actually implement some aspects of its DNA, or to implement them in unexpected ways.

Even more amazingly, DNA actually contains a bunch of “junk” that appears to be completely unused.

If you were to unfairly compare this to software, you might consider the data in the database to be the DNA, and the code to be the living implementation of that data.

Just as we see junk DNA, we have junk data. Sometimes the data doesn’t even make sense. Yet, the application still functions, quite possibly with a low error rate and only a few open bugs at any given time.

Why? How? Is there a bug in the application or is there something wrong with the data?

This gets into the core of data science, and the paradox:

Given a database, the data is always correct.

This is both patently false and patently true.

Anyone who’s ever tried to answer even simple questions over a sufficiently large data set knows that the data may not make any sense, or may contain outright erroneous entries.

Anyone who’s ever written code knows the model just loaded from the database is obviously correct because we have unit tests that test the insertion of the data and the loading of the data. Plus, there’d be bug reports if it were wrong.

There, right there, is the disconnect. The code knows how to “treat” the “bad data”, how to make it right and sensible. Perhaps a chunk of expiration dates are Unix time 0 (January 1, 1970). The code that loads those records might not even use expiration dates.
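To make the disconnect concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the rows, the Coupon entity, and the rule that a timestamp of 0 means "never expires" are assumptions about what an application might do, not a description of any real system. It reads the same four rows twice, once literally and once through the application's loader.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical raw rows as they might sit in a table: (id, expires_at in Unix seconds).
RAW_ROWS = [
    (1, 1735689600),  # 2025-01-01 00:00:00 UTC
    (2, 0),           # Unix time 0: looks like bad data when read literally
    (3, 0),
    (4, 1767225600),  # 2026-01-01 00:00:00 UTC
]


@dataclass
class Coupon:
    """Hypothetical entity, as the application would see it."""
    id: int
    expires_at: Optional[datetime]  # None means "never expires"


def load_coupon(row):
    """Load a row the way the application might (assumed rule: 0 means no expiry)."""
    coupon_id, expires_ts = row
    if expires_ts == 0:
        return Coupon(coupon_id, None)
    return Coupon(coupon_id, datetime.fromtimestamp(expires_ts, tz=timezone.utc))


# Reading the table literally: convert every timestamp as-is.
raw_years = [datetime.fromtimestamp(ts, tz=timezone.utc).year for _, ts in RAW_ROWS]

# Reading through the loader: the zero timestamps become "no expiration".
coupons = [load_coupon(row) for row in RAW_ROWS]
entity_years = [c.expires_at.year if c.expires_at else None for c in coupons]

print(raw_years)
print(entity_years)
```

Read literally, two of the four coupons expired in 1970. Read through the loader, those same two coupons simply never expire. Neither reading is a bug; they are two different answers to the question of what the data means.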

Data science seems to be about reading databases and making models based on the data. Software engineering seems to be all about taking models and making databases based on the computed data.

Perhaps we’re doing data science wrong? Perhaps we should use the same language the application developers use, so we can load the same entity the application does, and study that? Would that mean we lose the power of scikit-learn if the application isn’t written in Python?