How Machine Learning Products are Different (Part 2, Entity Resolution Checklists)

Last time I talked a bit about the context of my experience with Machine Learning products and the high-level issues we had getting customers to switch to what was clearly a better product. This time I get into some technical examples from those checklists and try to show where the conflict comes from.

The first and most important distinction between our product and our competitors' was that we did “Entity Resolution” while they mostly did “Name Matching with Rules”. What does this look like in practice?

When we built our system, our competitors were using name matching systems with rules layered on top. These systems would first generate matches based on the textual similarity of the name (using some fast text matching data structure like a suffix tree). Then this huge volume of matches would be slimmed down with rules like “only hit if the last names start with the same letter and the dates of birth are the same”. A typical customer would have hundreds of these rules to manage, along with a large team dedicated to changing them, running simulations of those changes, and reviewing huge numbers of false positives among the potential matches.
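To make the contrast concrete, here’s a minimal sketch of that style of pipeline: over-generate candidates on name similarity, then prune them with hand-maintained rules. Everything here is illustrative; the field names, the two rules, and the use of difflib as a stand-in for a fast text-matching structure are my own assumptions, not any vendor’s actual code.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Cheap textual similarity, standing in for a fast matching structure like a suffix tree."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rule_same_last_initial(candidate: dict, listed: dict) -> bool:
    return candidate["last_name"][:1].lower() == listed["last_name"][:1].lower()

def rule_same_dob(candidate: dict, listed: dict) -> bool:
    return candidate["dob"] == listed["dob"]

RULES = [rule_same_last_initial, rule_same_dob]   # real deployments had hundreds of these

def rule_based_matches(customers, watchlist, threshold=0.8):
    """Over-generate on name similarity, then keep only pairs that pass every rule."""
    for c in customers:
        for w in watchlist:
            if name_similarity(c["name"], w["name"]) >= threshold:
                if all(rule(c, w) for rule in RULES):
                    yield (c, w)
```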

Our Entity Resolution system would take into account many different facets of a person at once: name, date of birth, address, phone number, etc., and decide, based on the total evidence available, including frequency information, whether they were likely to be the same person. For example, a match on a very common name like “John Smith” does not carry as much evidence as one on a rare name like “Richard Minerich”. Similarly, “Dick” vs “Richard” can indicate a match, but it’s not as strong as “Richard” vs “Richard”. There are variations on this theme for every data element. Once two records were considered to have enough evidence of similarity, they could be linked and considered to represent the same entity. The highest probability records would be linked with no human intervention, while those in the middle would require some amount of human review. The resulting impact for the customer is many times fewer false positives to look at and, since there are fewer knobs to turn, fewer simulations required.
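A rough sketch of evidence-based scoring in this spirit is below. The name frequencies, nickname table, facet weights, and thresholds are all invented for illustration; the point is only that agreement on rare values counts for more, nicknames count for less than exact matches, and the combined score drives automatic linking versus human review.

```python
import math

NAME_FREQUENCY = {"john": 0.010, "smith": 0.012, "richard": 0.004, "minerich": 0.000001}
NICKNAMES = {("dick", "richard"), ("bill", "william")}

def name_evidence(a: str, b: str) -> float:
    """Agreement on a rare name is stronger evidence than agreement on a common one."""
    a, b = a.lower(), b.lower()
    if a == b:
        return -math.log(NAME_FREQUENCY.get(a, 1e-4))
    if (a, b) in NICKNAMES or (b, a) in NICKNAMES:
        return 0.5 * -math.log(NAME_FREQUENCY.get(b, 1e-4))   # nickname: weaker evidence
    return 0.0

def total_evidence(rec_a: dict, rec_b: dict) -> float:
    """Sum evidence across facets: name tokens, date of birth, phone number, and so on."""
    score = sum(name_evidence(x, y) for x, y in zip(rec_a["names"], rec_b["names"]))
    if rec_a.get("dob") and rec_a.get("dob") == rec_b.get("dob"):
        score += 4.0      # illustrative weight for an agreeing date of birth
    if rec_a.get("phone") and rec_a.get("phone") == rec_b.get("phone"):
        score += 6.0      # illustrative weight for an agreeing phone number
    return score

AUTO_LINK, NEEDS_REVIEW = 12.0, 6.0    # illustrative thresholds

def decide(rec_a: dict, rec_b: dict) -> str:
    s = total_evidence(rec_a, rec_b)
    if s >= AUTO_LINK:
        return "link automatically"
    if s >= NEEDS_REVIEW:
        return "queue for human review"
    return "no link"
```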

In both cases, “is this match relevant?” is the next most important question, which I may get into down the road. It doesn’t matter whether a match is true if no one cares, but no one knows whether it might suddenly become relevant tomorrow. This dimension of relevance is worth talking about because it’s missed by far too many projects.

Now that you have the background, let’s get on to some examples from those checklists. I hope you’ll understand that I’ve only included examples from long enough ago not to conflict with my time at my previous employer.

Can your product match all of the following names (see attached list of name pairs)?

Ronald McDonald vs Monald DcDonald
John Smith vs J1o2h3n4 S5m6i7t8h9
Osama bin Mohammed bin Awad bin Laden vs Muhammed
George Bush vs Eorge Ush

Invariably this would be a list of several hundred name pairs someone had made up on the fly and typed into Excel by hand. Almost nothing in it resembled the issues we’d see in real data, but that didn’t matter. We always had to try our best to match these, and later we would show the customer what the cost was in terms of increased false positives. Often we won these battles, but not always. In the end, a configurable matching system combined with an ad-hoc data cleaning step was invaluable for dealing with cases like this.
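The kind of ad-hoc cleaning step I mean is nothing exotic. Here’s a toy version that handles the digit-stuffed example above; the specific rules are hypothetical, but the flavor is right: normalize, strip the junk, collapse whitespace, and only then hand the name to the matcher.

```python
import re
import unicodedata

def clean_name(raw: str) -> str:
    """Ad-hoc cleaning pass run before matching: strip digits and stray punctuation,
    drop accents, and collapse whitespace. The rules here are illustrative only."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[\d_]+", "", text)          # "J1o2h3n4" -> "John"
    text = re.sub(r"[^\w\s'-]", " ", text)      # keep letters, spaces, apostrophes, hyphens
    return re.sub(r"\s+", " ", text).strip()

assert clean_name("J1o2h3n4 S5m6i7t8h9") == "John Smith"
```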

Does your product allow scans to be rerun with different settings?

This was an interesting one that we received many variations of; often it was an implicit expectation. In fact, our platform depended on the model configuration being consistent in production, for both efficiency and correctness.

Efficiency was important, and a consistent configuration allowed us to screen only the data that had changed. In practice this is the difference between screening millions of records per client per day and screening thousands. We could then spend those saved cycles on fancier Entity Resolution techniques, dramatically reducing false positives.
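In practice this amounts to change detection: compute a stable fingerprint of each record together with the active configuration version, and only rescreen records whose fingerprint moved. The sketch below is my own simplification, not the actual mechanism; note how bumping the configuration version forces everything to be rescreened, which is exactly the cost described next.

```python
import hashlib
import json

CONFIG_VERSION = "2017-03-01"   # illustrative: changing this invalidates every fingerprint

def record_fingerprint(record: dict) -> str:
    """Stable hash of the record's fields plus the active configuration version."""
    payload = json.dumps(record, sort_keys=True) + CONFIG_VERSION
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def records_to_screen(records, last_fingerprints: dict):
    """Yield only records whose content (or the configuration) changed since the last run."""
    for rec in records:
        fp = record_fingerprint(rec)
        if last_fingerprints.get(rec["id"]) != fp:
            yield rec
            last_fingerprints[rec["id"]] = fp
```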

The process depended on its history. If the settings changed, so would what had matched previously and all the saved scores; different records would fall above or below different thresholds, triggering different downstream workflows and remediation queues. Any change made the history you were comparing against suspect, so all of the records would need to be rescreened, and for a big customer rescreening the entire database took a lot of compute and potentially several days. That’s not to say we never changed settings and rescreened, but it was a planned event we’d coordinate with the customer and scale up hardware for.

Correctness was very important as well. Our database was the “system of record”, and we strove to have the data always reflect what actually happened on a given day, because we needed to be able to answer ad-hoc regulatory queries about what happened, when, and why. Multiple screening runs with different settings would call the validity of those runs into question and make it very hard to give clear answers.

To help, we would often start with what we considered very low thresholds and many, many examples of what might or might not match, to build comfort. In some cases we would even end up setting up dedicated test environments for bigger customers and giving them the ability to roll back the data there on request. This turned out to be a deeper issue, and later we addressed it with additional product-specific features.

Can we control who can see which matches?

This one caught us off guard early on, and it goes a lot deeper than you might expect. While we did have data controls from day one, we expected that data would be relatively open for review inside compliance departments; in retrospect that was silly to think, given the risk involved in some of these matches. In reality, at the bigger institutions our customers were often carved up into different teams with different levels of access, and often those teams didn’t talk to each other at all.

The workflows that determined which team’s queue a match was put into took into account many upstream factors that could change, including the match’s likelihood score. Often the more probable or riskier matches were routed directly to “level 2” analysts. Certain facets of a record, sanctions listings for example, could also cause a match to be routed to a different team.
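A routing rule in this spirit might look something like the toy function below; the team names, the sanctions check, and the 0.9 cutoff are all made up, but they show how a score change or a new facet on the record can silently change which queue a match lands in.

```python
def route_match(match: dict) -> str:
    """Pick a queue from the match's facets and likelihood score (illustrative only)."""
    if "sanctions" in match.get("lists", []):
        return "sanctions-team"
    if match["score"] >= 0.9:            # riskier, more probable matches go straight to level 2
        return "level-2-analysts"
    return "level-1-analysts"
```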

Consider the case where a record is updated with new evidence and the match now belongs to an entirely different department. Even worse, what happens when the evidence goes down? Do you take the record away from the level 2 reviewers? We could leave the old match in the old queue and put the new one in the new queue, but this would cause cross-team confusion about who should take action.

Additionally, one of the most important views in our industry is the “alert” view, which is simply a roll-up of all of the matches for a given customer record. The customer record is the unit customers act on (i.e. file reports on), and typically at most one of the matches on a customer record can be true, so it’s easy to see why they often review in terms of their own customer records. If one match is deemed clearly true, the rest can usually be discarded as false.
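Conceptually the alert view is just a group-by on the customer record, plus the “one true match closes the rest” convention. A rough sketch, with hypothetical field names:

```python
from collections import defaultdict

def build_alerts(matches):
    """Roll matches up by the customer record they hit; the alert is the unit of review."""
    alerts = defaultdict(list)
    for m in matches:
        alerts[m["customer_record_id"]].append(m)
    return alerts

def close_siblings(alert_matches):
    """If one match on the alert is confirmed true, the remaining open matches can usually
    be closed as false."""
    if any(m.get("disposition") == "true" for m in alert_matches):
        for m in alert_matches:
            if m.get("disposition") is None:
                m["disposition"] = "false"
    return alert_matches
```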

But what happens when this is spread across teams that can’t see the matches in each other’s queues? You have remediated matches in the system but can’t see the decision or the reasoning behind it, and so can’t consider the full set of potential matches when judging whether a record is true. On top of all that, as records bounced between teams, certain elements of a record’s history had to be omitted depending on who had remediated it and who was looking at it.

Different kinds of customers wanted to solve this differently, so our solutions were myriad and we ended up building a very configurable system. Some customers allowed ownership of the entire history to move between teams. Others wanted to show that the data was there but keep its contents omitted. Some just opened things up as long as it wasn’t a particularly risky data segment.
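You can think of those options as a per-customer visibility policy over a match’s history. The sketch below names three such policies corresponding to the behaviors above; the enum values and field names are hypothetical, not our actual configuration model.

```python
from enum import Enum

class HistoryPolicy(Enum):
    TRANSFER_OWNERSHIP = "transfer"   # the whole history moves with the match
    SHOW_REDACTED = "redacted"        # show that entries exist, but omit their content
    OPEN_UNLESS_RISKY = "open"        # visible unless the data segment is flagged risky

def visible_history(history, viewer_team, policy, risky_segments=frozenset()):
    """Filter a match's history according to the customer's configured policy (illustrative)."""
    if policy is HistoryPolicy.TRANSFER_OWNERSHIP:
        return [h for h in history if h["owner_team"] == viewer_team]
    if policy is HistoryPolicy.SHOW_REDACTED:
        return [h if h["owner_team"] == viewer_team else {"redacted": True} for h in history]
    return [h for h in history
            if h.get("segment") not in risky_segments or h["owner_team"] == viewer_team]
```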

I hope this peek into some of the pitfalls I ran into as a technical leader for a machine learning based product line was useful. Please let me know if you’re finding this series interesting by leaving a comment or giving me a shoutout on Twitter, and I’ll keep at it. Thanks!
