13
Sep 11

Record Linkage in F# – Token Matching, Stable Marriages and the Gale-Shapley algorithm

Initially, one of the biggest problems I found when trying to marry records was the god awful quality of much of data I often have to work with. It’s mostly old mainframe and database data with truncated fields, limited character sets and fields with nonsensical contents. Even worse, much of the data is the result of years of merged account lists from different sources and so global statistical techniques often fail quite spectacularly.

While naive compared to doing something along the lines of clustering, I have found that bag-of-words token matching has proven to be quite robust when comparing names. The idea is that if you consider a name as an unordered set of tokens, you can then find the best global alignment with some target name and that this will indicate the proper ordering in the majority of cases. Finally you can score these ordered tokens with an algorithm (like Jaro-Winker for example), average them, and end up with a good representation of the similarity of the two given names.

(Of course, name is just one of many things to look at. After all, I’m pretty sure that all these John Smiths aren’t the same guy.)

At first you might think finding this alignment is rather simple. Just score every permutation of the two sets of tokens and return the maximum, right? Well, this will work great for most traditional American names as they have only 2-3 tokens. Three factorial is only 6 different token arrangements to try, not a big deal at all. However, this method has serious limitations in practice. What happens when you run up against a 9 token Arabic name? Nine factorial is 362,880 different permutations to try, that’s a lot of grinding for just one record. That’s also just considering real names, if someone were to accidentally dump a paragraph into your algorithm you may just want to grab a snickers.

Don’t despair though, the stable marriages problem and the Gale-Shapely algorithm are here to help.

In the stable marriages problem you have two sets of things you want to match up with each other (described in an unambiguously sexist way as men and women), and each has a way of rating the others. The question is, how might we match up all the men and women so that there’s no two people who would rather be with each other than who they are already with. Note that this does not mean everyone gets their first choice, what we are looking for is global optimality of the pairs.

This sounds familiar right? It’s a slightly more complex version of exactly what we’re trying to do with tokens! So, the solution to this problem is called the Gale-Shapley algorithm, and get this, it’s only worst case O(M*W) where M is the number of men tokens and W is the number of women!

To turn this all into plain English, we can use this algorithm to align that 9-token name in at most 9 * 9 = 81 steps, and we’re guaranteed to get a result that is just as good!

It works like this: Men take turns proposing to each woman in order of preference. If she’s single they are immediately “engaged”, if she’s already engaged the woman picks the man she likes best and the loser goes back into the singles pool. This happens over and over until all single men are engaged or out of options.

In our case we’re actually creating sorted lists for each man of each woman ahead of time, so it’s O(M*WLog(W)), but it’s only a very minor increase in complexity, and how would you do that lazily anyhow?

First, let’s take a look at the Wikipedia article’s very imperative pseudocode:

```
function stableMatching {
Initialize all m ? M and w ? W to free
while ? free man m who still has a
woman w to propose to {
w = m's highest ranked such woman
who he has not proposed to yet
if w is free
(m, w) become engaged
else some pair (m', w) already exists
if w prefers m to m'
(m, w) become engaged
m' becomes free
else
(m', w) remain engaged
}
}
```

Now, I’m not a big fan of imperative code so I wrote a pure functional version instead. Anyhow, having only two immutable data structures makes for a much easier read. It’s pretty much a long list of cases.

``` 1: open System
2:
3: // a Bachelor is an identity index and an
4: // ordered list of women indicies to approach.
5: type Bachelor = int * int list
6:
7: // Some notation:
8: // wi = woman index (int)
9: // mi = man index (int)
10: // mi' = woman's current partner index (int)
11: // m = man with index and unapproached women indices (Bachelor)
12: // mSingle = men that are single (Bachelor list)
13: // wEngaged = engagements from women to men (int, Bachelor)
14:
15: let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
16:   let Windices = [ 0 .. W.Length - 1 ]
17:   // List of men with women in order of desire
18:   let Munproposed =
19:     List.init M.Length
20:       (fun mi ->
21:            let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
22:            mi, Windices |> List.sortBy sortFun)
23:   // Recursively solve stable marriages
24:   let rec findMarriages mSingle wEngaged =
25:     match mSingle with
26:     // No single guys left with desired women, we're done
27:     | [] -> wEngaged
28:     // Guy is out of luck, remove from singles
29:     | (mi, []) :: bachelors -> findMarriages bachelors wEngaged
30:     // He's got options!
31:     | (mi, wi :: rest) :: bachelors ->
32:       let m = mi, rest
33:       match wEngaged |> Map.tryFind wi with
34:       // She's single! m is now engaged!
35:       | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
36:       // She's already engaged, let the best man win!
37:       | Some (m') ->
38:         let mi', _ = m'
39:         if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then
40:           // Congrats mi, he is now engaged to wi
41:           // The previous suitor (mi') is bested
42:           findMarriages
43:             (m' :: bachelors)
44:             (wEngaged |> Map.add wi m)
45:         else
46:           // The current bachelor (mi) lost, better luck next time
47:           findMarriages
48:             (m :: bachelors)
49:             wEngaged
50:   findMarriages Munproposed Map.empty
51:   // Before returning, remove unproposed lists from man instances
52:   |> Map.map (fun wi m -> let mi, _ = m in mi)
53:
54: // By the supreme power of partial application I give you
55: // Jaro-Winkler Token Alignment with Gale-Shapely in one line!
56:
57: let alignJaroWinkler = funGS jaroWinkler```

namespace System
type Bachelor = int * int list

Full name: Snippet.Bachelor

Multiple items

val int : 'T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–

type int<'Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

type: int<'Measure>
implements: IComparable
implements: IConvertible
implements: IFormattable
implements: IComparable<int<'Measure>>
implements: IEquatable<int<'Measure>>
inherits: ValueType

——————–

type int = int32

Full name: Microsoft.FSharp.Core.int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

type 'T list = List<'T>

Full name: Microsoft.FSharp.Collections.list<_>

type: 'T list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'T>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'T>
implements: Collections.IEnumerable

val funGS : ('a -> 'a -> float) -> 'a array -> 'a array -> Map<int,int>

Full name: Snippet.funGS

val comp : ('a -> 'a -> float)
Multiple items

val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

type: float<'Measure>
implements: IComparable
implements: IConvertible
implements: IFormattable
implements: IComparable<float<'Measure>>
implements: IEquatable<float<'Measure>>
inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

Multiple items

val M : 'a array

type: 'a array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'a>
implements: Collections.Generic.ICollection<'a>
implements: seq<'a>
implements: Collections.IEnumerable
inherits: Array

——————–

val M : 'a array

type: 'a array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'a>
implements: Collections.Generic.ICollection<'a>
implements: seq<'a>
implements: Collections.IEnumerable
inherits: Array

type 'T array = 'T []

Full name: Microsoft.FSharp.Core.array<_>

type: 'T array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'T>
implements: Collections.Generic.ICollection<'T>
implements: seq<'T>
implements: Collections.IEnumerable
inherits: Array

Multiple items

val W : 'a array

type: 'a array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'a>
implements: Collections.Generic.ICollection<'a>
implements: seq<'a>
implements: Collections.IEnumerable
inherits: Array

——————–

val W : 'a array

type: 'a array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'a>
implements: Collections.Generic.ICollection<'a>
implements: seq<'a>
implements: Collections.IEnumerable
inherits: Array

val Windices : int list

type: int list
implements: Collections.IStructuralEquatable
implements: IComparable<List<int>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<int>
implements: Collections.IEnumerable

val W : 'a array

type: 'a array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'a>
implements: Collections.Generic.ICollection<'a>
implements: seq<'a>
implements: Collections.IEnumerable
inherits: Array

property Array.Length: int
val Munproposed : (int * int list) list

type: (int * int list) list
implements: Collections.IStructuralEquatable
implements: IComparable<List<int * int list>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<int * int list>
implements: Collections.IEnumerable

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
| ( [] )
| ( :: ) of 'T * 'T list
with
interface Collections.IEnumerable
interface Collections.Generic.IEnumerable<'T>
member IsEmpty : bool
member Item : index:int -> 'T with get
member Length : int
member Tail : 'T list
static member Cons : head:'T * tail:'T list -> 'T list
static member Empty : 'T list
end

Full name: Microsoft.FSharp.Collections.List<_>

type: List<'T>
implements: Collections.IStructuralEquatable
implements: IComparable<List<'T>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'T>
implements: Collections.IEnumerable

val init : int -> (int -> 'T) -> 'T list

Full name: Microsoft.FSharp.Collections.List.init

val M : 'a array

type: 'a array
implements: ICloneable
implements: Collections.IList
implements: Collections.ICollection
implements: Collections.IStructuralComparable
implements: Collections.IStructuralEquatable
implements: Collections.Generic.IList<'a>
implements: Collections.Generic.ICollection<'a>
implements: seq<'a>
implements: Collections.IEnumerable
inherits: Array

val mi : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val sortFun : (int -> float)
val wi : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val sortBy : ('T -> 'Key) -> 'T list -> 'T list (requires comparison)

Full name: Microsoft.FSharp.Collections.List.sortBy

val findMarriages : ((int * int list) list -> Map<int,(int * int list)> -> Map<int,(int * int list)>)
val mSingle : (int * int list) list

type: (int * int list) list
implements: Collections.IStructuralEquatable
implements: IComparable<List<int * int list>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<int * int list>
implements: Collections.IEnumerable

val wEngaged : Map<int,(int * int list)>

type: Map<int,(int * int list)>
implements: IComparable
implements: Collections.Generic.IDictionary<int,(int * int list)>
implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<int,(int * int list)>>
implements: seq<Collections.Generic.KeyValuePair<int,(int * int list)>>
implements: Collections.IEnumerable

val bachelors : (int * int list) list

type: (int * int list) list
implements: Collections.IStructuralEquatable
implements: IComparable<List<int * int list>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<int * int list>
implements: Collections.IEnumerable

val rest : int list

type: int list
implements: Collections.IStructuralEquatable
implements: IComparable<List<int>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<int>
implements: Collections.IEnumerable

val m : int * int list
Multiple items

module Map

from Microsoft.FSharp.Collections

——————–

type Map<'Key,'Value (requires comparison)> =
class
interface Collections.IEnumerable
interface IComparable
interface Collections.Generic.IEnumerable<Collections.Generic.KeyValuePair<'Key,'Value>>
interface Collections.Generic.ICollection<Collections.Generic.KeyValuePair<'Key,'Value>>
interface Collections.Generic.IDictionary<'Key,'Value>
new : elements:seq<'Key * 'Value> -> Map<'Key,'Value>
member Add : key:'Key * value:'Value -> Map<'Key,'Value>
member ContainsKey : key:'Key -> bool
override Equals : obj -> bool
member Remove : key:'Key -> Map<'Key,'Value>
member TryFind : key:'Key -> 'Value option
member Count : int
member IsEmpty : bool
member Item : key:'Key -> 'Value with get
end

Full name: Microsoft.FSharp.Collections.Map<_,_>

type: Map<'Key,'Value>
implements: IComparable
implements: Collections.Generic.IDictionary<'Key,'Value>
implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<'Key,'Value>>
implements: seq<Collections.Generic.KeyValuePair<'Key,'Value>>
implements: Collections.IEnumerable

val tryFind : 'Key -> Map<'Key,'T> -> 'T option (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.tryFind

union case Option.None: Option<'T>
val add : 'Key -> 'T -> Map<'Key,'T> -> Map<'Key,'T> (requires comparison)

union case Option.Some: 'T -> Option<'T>
val m' : int * int list
val mi' : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val empty<'Key,'T (requires comparison)> : Map<'Key,'T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.empty

val map : ('Key -> 'T -> 'U) -> Map<'Key,'T> -> Map<'Key,'U> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.map

val alignJaroWinkler : ('a array -> 'a array -> Map<int,int>)

Full name: Snippet.alignJaroWinkler

So break out your F# Interactive and try something like:

```alignJaroWinkler
[|"DUDE"; "DUDERSON"|]
[|"D"; "RONNY"; "DUDERSON"|]```

and you’ll hopefully get something back like this:

`val it : Map = map [(0, 0); (2, 1)]`

This is just a map from women tokens to men tokens. (0,0) indicates that “D” matched best with “DUDE” and (2,1) indicates that “DUDERSON” matched best with “DUDERSON”. Poor “RONNY” ended up being the old maid.

That’s it for today, but of course, there’s a whole lot more to cover here. We need to look at how to manipulate this output into something a bit nicer. I also have a bunch of tweaks and improvements to this pair of algorithms that I want to talk about. Finally, I want to sidetrack for a bit and tunnel into how one goes from from the horrible imperative pseudocode above, to the magnificent pure functional code underneath it.

As per usual, all of the code use in this post (and previous posts in the series) can be found on github.