Record Linkage in F# – Token Matching, Stable Marriages and the Gale-Shapley algorithm

This post was originally published September 13th, 2011.

Initially, one of the biggest problems I found when trying to marry records was the god awful quality of much of data I often have to work with. It’s mostly old mainframe and database data with truncated fields, limited character sets and fields with nonsensical contents. Even worse, much of the data is the result of years of merged account lists from different sources and so global statistical techniques often fail quite spectacularly.

While naive compared to doing something along the lines of clustering, I have found that bag-of-words token matching has proven to be quite robust when comparing names. The idea is that if you consider a name as an unordered set of tokens, you can then find the best global alignment with some target name and that this will indicate the proper ordering in the majority of cases. Finally you can score these ordered tokens with an algorithm (like Jaro-Winker for example), average them, and end up with a good representation of the similarity of the two given names.

(Of course, name is just one of many things to look at. After all, I’m pretty sure that all these John Smiths aren’t the same guy.)

At first you might think finding this alignment is rather simple. Just score every permutation of the two sets of tokens and return the maximum, right? Well, this will work great for most traditional American names as they have only 2-3 tokens. Three factorial is only 6 different token arrangements to try, not a big deal at all. However, this method has serious limitations in practice. What happens when you run up against a 9 token Arabic name? Nine factorial is 362,880 different permutations to try, that’s a lot of grinding for just one record. That’s also just considering real names, if someone were to accidentally dump a paragraph into your algorithm you may just want to grab a snickers.

Don’t despair though, the stable marriages problem and the Gale-Shapely algorithm are here to help.

In the stable marriages problem you have two sets of things you want to match up with each other (described in an unambiguously sexist way as men and women), and each has a way of rating the others. The question is, how might we match up all the men and women so that there’s no two people who would rather be with each other than who they are already with. Note that this does not mean everyone gets their first choice, what we are looking for is global optimality of the pairs.

This sounds familiar right? It’s a slightly more complex version of exactly what we’re trying to do with tokens! So, the solution to this problem is called the Gale-Shapley algorithm, and get this, it’s only worst case O(M*W) where M is the number of men tokens and W is the number of women!

To turn this all into plain English, we can use this algorithm to align that 9-token name in at most 9 * 9 = 81 steps, and we’re guaranteed to get a result that is just as good!

It works like this: Men take turns proposing to each woman in order of preference. If she’s single they are immediately “engaged”, if she’s already engaged the woman picks the man she likes best and the loser goes back into the singles pool. This happens over and over until all single men are engaged or out of options.

In our case we’re actually creating sorted lists for each man of each woman ahead of time, so it’s O(M*WLog(W)), but it’s only a very minor increase in complexity, and how would you do that lazily anyhow?

First, let’s take a look at the Wikipedia article’s very imperative pseudocode:

function stableMatching {
    Initialize all m ? M and w ? W to free
    while ? free man m who still has a 
            woman w to propose to {
       w = m's highest ranked such woman 
               who he has not proposed to yet
       if w is free
         (m, w) become engaged
       else some pair (m', w) already exists
         if w prefers m to m'
           (m, w) become engaged
           m' becomes free
         else
           (m', w) remain engaged
    }
}

Now, I’m not a big fan of imperative code so I wrote a pure functional version instead. Anyhow, having only two immutable data structures makes for a much easier read. It’s pretty much a long list of cases.

open System
// a Bachelor is an identity index and an 
// ordered list of women indicies to approach.
type Bachelor = int * int list
// Some notation:
// wi = woman index (int)
// mi = man index (int)
// mi' = woman's current partner index (int)
// m = man with index and unapproached women indices (Bachelor)
// mSingle = men that are single (Bachelor list)
// wEngaged = engagements from women to men (int, Bachelor)
let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
  let Windices = [ 0 .. W.Length - 1 ]
  // List of men with women in order of desire  
  let Munproposed = 
    List.init M.Length 
      (fun mi -> 
           let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
           mi, Windices |> List.sortBy sortFun)
  // Recursively solve stable marriages
  let rec findMarriages mSingle wEngaged =
    match mSingle with
    // No single guys left with desired women, we're done
    | [] -> wEngaged
    // Guy is out of luck, remove from singles
    | (mi, []) :: bachelors -> findMarriages bachelors wEngaged
    // He's got options!
    | (mi, wi :: rest) :: bachelors -> 
      let m = mi, rest
      match wEngaged |> Map.tryFind wi with
      // She's single! m is now engaged!
      | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
      // She's already engaged, let the best man win!
      | Some (m') -> 
        let mi', _ = m'
        if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then 
          // Congrats mi, he is now engaged to wi
          // The previous suitor (mi') is bested 
          findMarriages 
            (m' :: bachelors) 
            (wEngaged |> Map.add wi m)
        else
          // The current bachelor (mi) lost, better luck next time
          findMarriages 
            (m :: bachelors) 
            wEngaged
  findMarriages Munproposed Map.empty
  // Before returning, remove unproposed lists from man instances  
  |> Map.map (fun wi m -> let mi, _ = m in mi)  
// By the supreme power of partial application I give you 
// Jaro-Winkler Token Alignment with Gale-Shapely in one line!
let alignJaroWinkler = funGS jaroWinkler

So break out your F# Interactive and try something like:





alignJaroWinkler 
  [|"DUDE"; "DUDERSON"|] 
  [|"D"; "RONNY"; "DUDERSON"|]

and you’ll hopefully get something back like this:

val it : Map = map [(0, 0); (2, 1)]

This is just a map from women tokens to men tokens. (0,0) indicates that “D” matched best with “DUDE” and (2,1) indicates that “DUDERSON” matched best with “DUDERSON”. Poor “RONNY” ended up being the old maid.

That’s it for today, but of course, there’s a whole lot more to cover here. We need to look at how to manipulate this output into something a bit nicer. I also have a bunch of tweaks and improvements to this pair of algorithms that I want to talk about. Finally, I want to sidetrack for a bit and tunnel into how one goes from from the horrible imperative pseudocode above, to the magnificent pure functional code underneath it.

As per usual, all of the code use in this post (and previous posts in the series) can be found on github.

Leave a Reply

Blog at WordPress.com.

Discover more from Inviting Epiphany

Subscribe now to keep reading and get access to the full archive.

Continue reading