Over the past decade or so, I’ve been heavily involved with clients who process large quantities of sensitive data — things like credit history, billing information, employee benefits data, and the like. In two cases, the data was aggregated from dozens or hundreds of other companies. So I’ve had more of an opportunity than most people to get a broad sampling of data quality across entire industries, in a variety of companies, large and small.
When I started out, I naively assumed that most companies, especially large ones, are very careful with important data and work hard to ensure that it’s accurate. Now, I know better. Most companies are astoundingly careless, often to the point of total fiduciary misconduct. Don’t believe it? Neither did I. But I’ve seen the soft underbelly of corporate American data processing, and it ain’t pretty.
If you want your company to play like the big boys and generate data that is total crap, here are my insider’s guidelines:
1. Don’t validate data entry. The companies that are most successful at capturing data inaccurately have absolutely no data entry validation in place. This allows data entry employees to not only make all kinds of stupid mistakes, but also to put insulting or incriminating comments about customers into name, address or phone number fields (lawsuits, anyone?). But more subtly, it allows an infinite number of ways to (mis)spell and punctuate names and addresses, so that it’s a cinch to create multiple entries for the same customer — not just internal to the database, but when the data is merged with other information from other company departments or with sources outside the company.
2. Don’t have any kind of QA or review process. At every step in your processes, create a culture that is accustomed to throwing stuff in and never looking at it again. For God’s sake, don’t ever have anyone review sorted records in context and notice untoward patterns. If they do, you might have to correct them. Although, you could take a page from a client I worked for briefly in the mid-’90s. When I pointed out serious inconsistencies in their data, he looked annoyed and said, “The data is the data. It has its own integrity.”
3. Assume it’s right. This is closely related to (2) above. This technique is particularly effective if you are buying data. After all, you’ve just spent fifteen grand for fifteen thousand records … “they” wouldn’t let “them” sell you rotten data, right? Skepticism is your worst enemy; it exposes “issues” that you’ll have to deal with. What a waste of your time! And it’s painful to think!
4. Make sure that accuracy is no one’s responsibility. This is easiest at large companies, where everyone works in isolated silos and you deal with lots of outside services and contractors. This creates an environment where people throw data “over the wall” at each other and any problems are always someone else’s fault, upstream or downstream from you. A recent poster child for this is Hewitt Associates, who botched payments to ex-Enron employees and then made the novel defense in court that they were somehow not actually responsible for making sure the funds were correctly distributed, even though they were in fact the contractor hired to do exactly that.
5. Be the ugly American. If you are a U.S. company dealing with foreign customers, make sure you use U.S.-centric software that assumes the whole world has U.S.-style ZIP codes and city names that never exceed 18 characters or require extended character sets. Tell employees to put region and country names and “non-standard” postal codes into the city field or some other location. Tell them to spell and abbreviate however the mood strikes them. Heck, who cares if they put complete info in there at all? It’s not your fault those damn foreigners don’t know how to format addresses.
6. Don’t automate. If you share data with service vendors, credit bureaus, and the like, make sure that the task of exporting the data falls to some receptionist using a data view into Excel. Make sure she figures out how to do it herself and never documents the process. This will ensure that the data will be slightly different each time it’s sent: this week an xls, next week a csv, then a zipped xls, then tab-delimited with a column missing, that sort of thing. It’s important to be creative. It’s also important to be late. If that receptionist is on vacation, let the data pile up. Then, when she catches things up, she’ll think she’s sending last month’s summary two weeks late when in fact she is sending the past four weeks counted from the date the file was generated. Finally, when the employee quits or is promoted, the expected data will mysteriously stop flowing and no one will know why.
7. Don’t encrypt. Send sensitive raw data via unprotected FTP (cleartext passwords for a false sense of security) or better yet, email. Encryption just gets in the way.
8. Be baroque. Try to create inherently fragile processes such as sending variable numbers of files, say one from each department, and randomly leaving off one or two each month. That will create errors in the receiving system or process that no one will ever figure out.
9. Be vague. If you want the highest A/R balance on a customer record, for example, don’t explain whether that’s highest balance this year, or this quarter, trailing 12 months, or since the last system upgrade. Let the developer interpret it how they like, and never check it for accuracy. When asked for this info, don’t clarify what the requestor wants. They probably have no real idea, anyway. They’re just trying to get an odious task off their plate. If forced to be specific, introduce latency issues where the figure is updated only once a month, so that the current A/R balance can end up higher than the highest historical A/R balance.
10. When all else fails, stonewall. Once in a while some asshole will call you on some bit of sloppiness, as if it actually matters. For example, they might point out that you changed Aging.xls to Ageing.xls and could you please keep the name consistent or at least notify them when you change it? In this case the correct reply is, “It’s the same as it’s always been. Do you want it changed to Aging.xls?”
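If, against all of the advice above, you would actually like to catch some of this, here is a minimal Python sketch of two cheap checks. The first builds a crude matching key so that differently spelled and punctuated versions of the same name (guideline 1) land in the same bucket. The second flags records where the current A/R balance is higher than the supposed "highest" balance (guideline 9). The function names and the field names `customer_id`, `current_balance`, and `highest_balance` are my own placeholders, not anything from a real system, and the normalization is deliberately simplistic. Real name and address matching needs far more than this.

```python
import re
import unicodedata
from decimal import Decimal


def normalize_name(raw: str) -> str:
    """Build a crude matching key: drop accents, case, punctuation, extra spaces."""
    # Decompose accented characters, then discard the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and remove anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def find_duplicates(names):
    """Group names by matching key; return only keys with more than one spelling."""
    groups = {}
    for name in names:
        groups.setdefault(normalize_name(name), []).append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


def find_stale_highs(records):
    """Return IDs of records whose current balance exceeds the recorded high.

    Assumes each record is a dict with the hypothetical keys
    'customer_id', 'current_balance', and 'highest_balance',
    with balances given as decimal strings.
    """
    return [
        rec["customer_id"]
        for rec in records
        if Decimal(rec["current_balance"]) > Decimal(rec["highest_balance"])
    ]


if __name__ == "__main__":
    print(find_duplicates(["Acme, Inc.", "ACME  Inc", "Widget Co"]))
    print(find_stale_highs([
        {"customer_id": "C1", "current_balance": "1200.00", "highest_balance": "900.00"},
        {"customer_id": "C2", "current_balance": "300.00", "highest_balance": "450.00"},
    ]))
```

Neither check fixes anything by itself. All they do is put the untoward patterns in front of a human, which is exactly what guideline 2 warns you against.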
I hope I’ve helped you increase the randomness of your data in creative new ways. Remember, technology no longer allows us to “fold, spindle or mutilate” data on physical punch cards, so we have to move with the times and find ways to do those things digitally.