Bad Data is the Real Problem

Big data is the buzzword du jour, and why not?  Companies like Google with huge server farms are doing amazing things leveraging huge amounts of data and processing power.  It’s all very sexy, but these researchers get to pick and choose the data they work with.  They can maximize their research gains by pushing the cutting edge with data that is amenable to their task.

Meanwhile, the reality for most companies is that they are being crushed under their own small mountains of legacy data.  Data that has been merged together over decades from different machines with different fields and formatting constraints.  Decades where new laws were passed about which data can and must be collected.  These companies are desperate for data science.  Not is-it-a-cat data science, but discover-things-and-the-links-between-them data science.

Big data is the exciting research frontier where gains come relatively easily, but the more daunting task is coming up with a solution for bad data.  Unfortunately, dealing with bad data means difficult tradeoffs all the way down, and this makes it intractable to build a one-size-fits-all solution.  Even a one-size-fits-many solution is difficult.  For example, the US Census focuses on aggregate statistics, so somewhat sloppy methods will suffice, but what about the casino trying to keep track of card counters?  Or, even more pointedly, the TSA computer trying to determine if you’re on a terrorist watch list?

Simplifying things a bit, here are the four basic categories of tasks within entity resolution:

  1. Low risk, where errors have a low cost, as with similar products on a shopping site
  2. High false-positive risk, where false-positives have a high cost but false-negatives have a moderate to low cost, as with merging customer databases
  3. High false-negative risk, where false-negatives have a high cost but false-positives have a moderate cost, as with anti-money laundering
  4. High risk, where both false-positives and false-negatives have a high cost, as with the FBI hunting down someone who is sending anthrax in the mail
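
As a rough illustration, the four categories can be read as a two-by-two grid over the cost of each error type.  The Python sketch below is only that: the `categorize` function, the 0-to-1 cost scale, and the `high` cutoff are hypothetical names and values chosen for the example, and "moderate" is simply treated as "not high".

```python
from enum import Enum


class RiskCategory(Enum):
    LOW = "low risk"
    HIGH_FALSE_POSITIVE = "high false-positive risk"
    HIGH_FALSE_NEGATIVE = "high false-negative risk"
    HIGH = "high risk"


def categorize(fp_cost: float, fn_cost: float, high: float = 0.7) -> RiskCategory:
    """Place a task in one of the four categories.

    fp_cost and fn_cost are assumed to be rough costs on a 0-to-1 scale;
    the `high` cutoff is an arbitrary illustrative value.
    """
    fp_high = fp_cost >= high
    fn_high = fn_cost >= high
    if fp_high and fn_high:
        return RiskCategory.HIGH
    if fp_high:
        return RiskCategory.HIGH_FALSE_POSITIVE
    if fn_high:
        return RiskCategory.HIGH_FALSE_NEGATIVE
    return RiskCategory.LOW


# Illustrative cost guesses only, loosely following the examples in the list.
print(categorize(fp_cost=0.1, fn_cost=0.1).value)  # shopping-site style task
print(categorize(fp_cost=0.5, fn_cost=0.9).value)  # anti-money-laundering style task
```

In practice the costs are rarely this easy to put a number on, which is part of why the categories are only a starting point.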

With low risk the more data the merrier.  The good will drown out the bad with traditional machine learning techniques.

For high false-positive risk you need to be extremely careful and manually review those records which have even a moderate probability of representing different entities.  Thankfully, as the cost of keeping duplicates around is low, you can start with combining databases in a straightforward manner and then slowly work on merging records over time as an iterative process from most probable matches to least.
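
The iterative process above might be sketched like this in Python, assuming you already have some matcher that gives each candidate pair an estimated match probability.  The `plan_merges` name and both thresholds are hypothetical and would need tuning against your own data.

```python
from typing import List, Tuple

# (record id from database A, record id from database B, estimated match probability)
ScoredPair = Tuple[str, str, float]


def plan_merges(
    pairs: List[ScoredPair],
    auto_merge_at: float = 0.99,  # assumed cutoff: merge without review at or above this
    review_at: float = 0.50,      # assumed cutoff: below this, just keep both records
) -> Tuple[List[ScoredPair], List[ScoredPair]]:
    """Split scored pairs into an auto-merge list and a manual review queue.

    Both lists are ordered from most probable match to least, so reviewers
    work through the queue top-down over time.
    """
    ranked = sorted(pairs, key=lambda pair: pair[2], reverse=True)
    auto_merge = [p for p in ranked if p[2] >= auto_merge_at]
    manual_review = [p for p in ranked if review_at <= p[2] < auto_merge_at]
    return auto_merge, manual_review


pairs = [("a1", "b1", 0.995), ("a2", "b2", 0.70), ("a3", "b3", 0.20), ("a4", "b4", 0.90)]
auto_merge, manual_review = plan_merges(pairs)
```

Pairs that fall below the review threshold are left unmerged, which is acceptable here precisely because duplicates are cheap to keep around.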

The high false-negative risk style problem is much more challenging.  Consider that you have two datasets of size N and M.  The number of all possible pairs of records is then N * M, which is far too many for manual review, and low-cost Mechanical Turk style reviewers do a poor job of finding needles in haystacks.  So one approach, similar to high false-positive risk, is to order your matches by probability but also include a measure of quantified false-negative risk on a per-record match basis.  You must also go to extreme lengths to find matches here, for example by building algorithms which understand how different cultures use, write and pass down names.  It’s a constant struggle to further refine the quality of your results while pushing down the false-negative rate.
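
To make the scale concrete, and to show one way the ordering could work, here is a minimal Python sketch.  Multiplying match probability by the cost of a miss is just one simple way to fold false-negative risk into the ordering, not an established method; the `Candidate` fields, `miss_cost`, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    left_id: str
    right_id: str
    match_probability: float  # estimated by some matcher, assumed given
    miss_cost: float          # assumed per-pair cost of missing a true match


def pair_count(n: int, m: int) -> int:
    """Number of possible record pairs between datasets of size n and m."""
    return n * m


def review_priority(candidate: Candidate) -> float:
    """Expected cost of ignoring this pair: probability times cost of a miss."""
    return candidate.match_probability * candidate.miss_cost


def rank_for_review(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates so the costliest potential misses are reviewed first."""
    return sorted(candidates, key=review_priority, reverse=True)


print(pair_count(100_000, 50_000))  # 5000000000 pairs from two modest datasets

ranked = rank_for_review([
    Candidate("a", "x", match_probability=0.9, miss_cost=1.0),
    Candidate("b", "y", match_probability=0.3, miss_cost=10.0),
])
# The unlikely but costly pair ("b", "y") is ranked ahead of the likely but cheap one.
```

The point of the weighting is that a low-probability match can still deserve a reviewer's attention first when missing it would be expensive.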

High risk is the hardest problem of all and is better left mostly unautomated at this point.  Perhaps you could use a high false-negative risk style approach to glean candidates, but you’ll still need a lot of intelligently applied elbow grease and a large team to get there.

These gross categories don’t take into account other factors like data quantity, class size imbalances, lossy data formatting, input errors, and poor data management.  I’m afraid that until the singularity no magic bullet will solve this problem.  For getting started, my best recommendation is John Talburt’s Entity Resolution and Information Quality.  Unlike many books on the subject, it’s quite accessible to non-academics.


  1. Apparently I started one of my blog posts with the same exact words as you. Great minds think alike? Or, simple minds seldom differ?

    I think the long term problem is not bad data but blind data. I feel like a lot of machine learning methods are applied just to make predictions without really trying to glean an underlying theory of why the world behaves that way. In the short term, this is a quick fix to get nice predictions, but in the long term you realize that you have no theoretical understanding from which to build radically different approaches. Of course, this isn’t always the case.

    To add to your categorization of tasks, I think it is also nice to look at mathbabe’s two cultures of dataminers. Are certain risk types more common among one of the two cultures? Or are the two measures orthogonal?

  2. [...] Richard Minerich wrote “Bad Data is the Real Problem“. [...]
