Many of you may be familiar with the problem, especially if you use database applications in your everyday jobs. Sometimes the system stores things that are not entirely correct, or even blatantly untrue. Data pollution: that is what we call it. Sometimes you have to look really hard to find the real truth, if you can find it at all… Occasionally, when something big is at stake, you will find your entire department in total distress. Wouldn’t it be wonderful if we allowed only true facts into the system? But that is just theory, isn’t it? Well, that very theory has some pretty solid solutions in store.
It is not hard to imagine that what is true today may be false tomorrow. Today I am 54 years of age, but next year I will not be. Each fact that depends on time can cease to be true at any moment, just like that, without any human intervention. Once such facts are stored, the data pollution caused by their aging is unstoppable. However, you can prevent it by storing only time-independent facts. For example: my birthday is April 12th, 1959. That will still be true 200 years from now. With that fact, a system can compute that I am 54 years old today. That is the kind of system I can rely on to give me the correct age at any time. Maybe this is the right time to look into your own database application to see if there are any time-dependent statements in it. Do the “address check”: send word that you have moved to another city. Does your system all of a sudden show an address that you no longer live at?
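To make the birthday example concrete, here is a minimal sketch in Python. The function name `age_on` and the sample dates are illustrative assumptions, not taken from any particular system. The only stored fact is the birthday; the age is derived each time it is asked for.

```python
from datetime import date


def age_on(birthday: date, today: date) -> int:
    """Derive the age in whole years from the stored, time-independent birthday."""
    years = today.year - birthday.year
    # Subtract one year if this year's birthday has not yet been reached.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


# The only fact kept in the "database" (hypothetical sample record).
birthday = date(1959, 4, 12)

print(age_on(birthday, date(2013, 6, 1)))  # 54
print(age_on(birthday, date(2014, 6, 1)))  # 55
```

Nobody has to update anything when a year passes; the answer simply changes because the question is asked on a different date.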
Multiple storage is also a great source of data pollution. It is not hard to see why: a computer that copies my address an arbitrary number of times may end up holding different copies of my address, and may no longer know which one is correct. Storing everything just once is the proper remedy. But why don’t you ask an administrator at your office how many addresses your favorite customer has in the customer database, in the contract database, in the invoice database, or in the order-and-delivery administration? Could a customer occur in multiple databases? What are the odds that these registrations contain contradictory information?
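The "store everything just once" remedy can be sketched as follows. This is a simplified illustration under stated assumptions: the `Customer` and `Invoice` classes, the in-memory `customers` dictionary, and the sample addresses are all hypothetical stand-ins for real databases. The point is that an invoice holds a reference to the customer, not its own copy of the address.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: int
    name: str
    address: str  # the one and only place the address is stored


@dataclass
class Invoice:
    invoice_id: int
    customer_id: int  # a reference to the customer, not a copy of the address


# Hypothetical sample data standing in for the customer and invoice databases.
customers = {1: Customer(1, "Favorite Customer", "1 Old Street, Oldtown")}
invoices = [Invoice(100, 1), Invoice(101, 1)]


def invoice_address(invoice: Invoice) -> str:
    """Look the address up at the single source instead of reading a copy."""
    return customers[invoice.customer_id].address


def move_customer(customer_id: int, new_address: str) -> None:
    """A move is recorded in exactly one place."""
    customers[customer_id].address = new_address


move_customer(1, "2 New Road, Newtown")
for invoice in invoices:
    print(invoice.invoice_id, invoice_address(invoice))
```

Because no second copy exists, the invoices cannot disagree with the customer record after the move.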
I could carry on and on, but my guess is that you are pretty fed up by now, for I have told you nothing you cannot find in the college textbooks on this topic. How hard can it be? I admit, this is theory, but it is downright practical! So isn’t it odd that things still go wrong from time to time? What strikes me as even stranger is that the office workers among you seem to have learned to live with it. Some of you seem to think nothing of polluted data in the data sets you work with on a daily basis. Some people think it is in the natural order of things to have polluted data in their computers. But wouldn’t all hell break loose if that happened to their bank account or airline reservation?
So from this blog I call upon you not to settle for anything less than the truth, from this day on. The very existence of polluted data is utterly unacceptable, for crying out loud. My claim is that data pollution can be prevented. A good designer is capable of doing so, and must be able to explain how and why. So please accept only and exclusively the truth, the whole truth, and nothing but the truth, in each and every information system…
(Stef Joosten is professor of computer science, and works on sustainably designing information systems and business processes.)