Record Linkage in F# – Token Matching, Stable Marriages and the Gale-Shapley algorithm

Initially, one of the biggest problems I found when trying to marry records was the god awful quality of much of data I often have to work with. It’s mostly old mainframe and database data with truncated fields, limited character sets and fields with nonsensical contents. Even worse, much of the data is the result of years of merged account lists from different sources and so global statistical techniques often fail quite spectacularly.

While naive compared to doing something along the lines of clustering, I have found that bag-of-words token matching has proven to be quite robust when comparing names. The idea is that if you consider a name as an unordered set of tokens, you can then find the best global alignment with some target name and that this will indicate the proper ordering in the majority of cases. Finally you can score these ordered tokens with an algorithm (like Jaro-Winker for example), average them, and end up with a good representation of the similarity of the two given names.

(Of course, name is just one of many things to look at. After all, I’m pretty sure that all these John Smiths aren’t the same guy.)

At first you might think finding this alignment is rather simple. Just score every permutation of the two sets of tokens and return the maximum, right? Well, this will work great for most traditional American names as they have only 2-3 tokens. Three factorial is only 6 different token arrangements to try, not a big deal at all. However, this method has serious limitations in practice. What happens when you run up against a 9 token Arabic name? Nine factorial is 362,880 different permutations to try, that’s a lot of grinding for just one record. That’s also just considering real names, if someone were to accidentally dump a paragraph into your algorithm you may just want to grab a snickers.

Don’t despair though, the stable marriages problem and the Gale-Shapely algorithm are here to help.

In the stable marriages problem you have two sets of things you want to match up with each other (described in an unambiguously sexist way as men and women), and each has a way of rating the others. The question is, how might we match up all the men and women so that there’s no two people who would rather be with each other than who they are already with. Note that this does not mean everyone gets their first choice, what we are looking for is global optimality of the pairs.

This sounds familiar right? It’s a slightly more complex version of exactly what we’re trying to do with tokens! So, the solution to this problem is called the Gale-Shapley algorithm, and get this, it’s only worst case O(M*W) where M is the number of men tokens and W is the number of women!

To turn this all into plain English, we can use this algorithm to align that 9-token name in at most 9 * 9 = 81 steps, and we’re guaranteed to get a result that is just as good!

It works like this: Men take turns proposing to each woman in order of preference. If she’s single they are immediately “engaged”, if she’s already engaged the woman picks the man she likes best and the loser goes back into the singles pool. This happens over and over until all single men are engaged or out of options.

In our case we’re actually creating sorted lists for each man of each woman ahead of time, so it’s O(M*WLog(W)), but it’s only a very minor increase in complexity, and how would you do that lazily anyhow?

First, let’s take a look at the Wikipedia article’s very imperative pseudocode:

function stableMatching { Initialize all m ? M and w ? W to free while ? free man m who still has a woman w to propose to { w = m's highest ranked such woman who he has not proposed to yet if w is free (m, w) become engaged else some pair (m', w) already exists if w prefers m to m' (m, w) become engaged m' becomes free else (m', w) remain engaged } }

Now, I’m not a big fan of imperative code so I wrote a pure functional version instead. Anyhow, having only two immutable data structures makes for a much easier read. It’s pretty much a long list of cases.

 1: open System
 2: 
 3: // a Bachelor is an identity index and an 
 4: // ordered list of women indicies to approach.
 5: type Bachelor = int * int list
 6: 
 7: // Some notation:
 8: // wi = woman index (int)
 9: // mi = man index (int)
10: // mi' = woman's current partner index (int)
11: // m = man with index and unapproached women indices (Bachelor)
12: // mSingle = men that are single (Bachelor list)
13: // wEngaged = engagements from women to men (int, Bachelor)
14: 
15: let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
16:   let Windices = [ 0 .. W.Length - 1 ]
17:   // List of men with women in order of desire  
18:   let Munproposed = 
19:     List.init M.Length 
20:       (fun mi -> 
21:            let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
22:            mi, Windices |> List.sortBy sortFun)
23:   // Recursively solve stable marriages
24:   let rec findMarriages mSingle wEngaged =
25:     match mSingle with
26:     // No single guys left with desired women, we're done
27:     | [] -> wEngaged
28:     // Guy is out of luck, remove from singles
29:     | (mi, []) :: bachelors -> findMarriages bachelors wEngaged
30:     // He's got options!
31:     | (mi, wi :: rest) :: bachelors -> 
32:       let m = mi, rest
33:       match wEngaged |> Map.tryFind wi with
34:       // She's single! m is now engaged!
35:       | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
36:       // She's already engaged, let the best man win!
37:       | Some (m') -> 
38:         let mi', _ = m'
39:         if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then 
40:           // Congrats mi, he is now engaged to wi
41:           // The previous suitor (mi') is bested 
42:           findMarriages 
43:             (m' :: bachelors) 
44:             (wEngaged |> Map.add wi m)
45:         else
46:           // The current bachelor (mi) lost, better luck next time
47:           findMarriages 
48:             (m :: bachelors) 
49:             wEngaged
50:   findMarriages Munproposed Map.empty
51:   // Before returning, remove unproposed lists from man instances  
52:   |> Map.map (fun wi m -> let mi, _ = m in mi)  
53: 
54: // By the supreme power of partial application I give you 
55: // Jaro-Winkler Token Alignment with Gale-Shapely in one line!
56: 
57: let alignJaroWinkler = funGS jaroWinkler

namespace System
type Bachelor = int * int list

Full name: Snippet.Bachelor

Multiple items

val int : 'T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–

type int<'Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

  type: int<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<int<'Measure>>
  implements: IEquatable<int<'Measure>>
  inherits: ValueType

——————–

type int = int32

Full name: Microsoft.FSharp.Core.int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

type 'T list = List<'T>

Full name: Microsoft.FSharp.Collections.list<_>

  type: 'T list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val funGS : ('a -> 'a -> float) -> 'a array -> 'a array -> Map<int,int>

Full name: Snippet.funGS

val comp : ('a -> 'a -> float)
Multiple items

val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<float<'Measure>>
  implements: IEquatable<float<'Measure>>
  inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Multiple items

val M : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

——————–

val M : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

type 'T array = 'T []

Full name: Microsoft.FSharp.Core.array<_>

  type: 'T array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'T>
  implements: Collections.Generic.ICollection<'T>
  implements: seq<'T>
  implements: Collections.IEnumerable
  inherits: Array

Multiple items

val W : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

——————–

val W : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val Windices : int list

  type: int list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int>
  implements: Collections.IEnumerable

val W : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

property Array.Length: int
val Munproposed : (int * int list) list

  type: (int * int list) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int list>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int list>
  implements: Collections.IEnumerable

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface Collections.IEnumerable
    interface Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val init : int -> (int -> 'T) -> 'T list

Full name: Microsoft.FSharp.Collections.List.init

val M : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val mi : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val sortFun : (int -> float)
val wi : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val sortBy : ('T -> 'Key) -> 'T list -> 'T list (requires comparison)

Full name: Microsoft.FSharp.Collections.List.sortBy

val findMarriages : ((int * int list) list -> Map<int,(int * int list)> -> Map<int,(int * int list)>)
val mSingle : (int * int list) list

  type: (int * int list) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int list>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int list>
  implements: Collections.IEnumerable

val wEngaged : Map<int,(int * int list)>

  type: Map<int,(int * int list)>
  implements: IComparable
  implements: Collections.Generic.IDictionary<int,(int * int list)>
  implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<int,(int * int list)>>
  implements: seq<Collections.Generic.KeyValuePair<int,(int * int list)>>
  implements: Collections.IEnumerable

val bachelors : (int * int list) list

  type: (int * int list) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int list>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int list>
  implements: Collections.IEnumerable

val rest : int list

  type: int list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int>
  implements: Collections.IEnumerable

val m : int * int list
Multiple items

module Map

from Microsoft.FSharp.Collections

——————–

type Map<'Key,'Value (requires comparison)> =
  class
    interface Collections.IEnumerable
    interface IComparable
    interface Collections.Generic.IEnumerable<Collections.Generic.KeyValuePair<'Key,'Value>>
    interface Collections.Generic.ICollection<Collections.Generic.KeyValuePair<'Key,'Value>>
    interface Collections.Generic.IDictionary<'Key,'Value>
    new : elements:seq<'Key * 'Value> -> Map<'Key,'Value>
    member Add : key:'Key * value:'Value -> Map<'Key,'Value>
    member ContainsKey : key:'Key -> bool
    override Equals : obj -> bool
    member Remove : key:'Key -> Map<'Key,'Value>
    member TryFind : key:'Key -> 'Value option
    member Count : int
    member IsEmpty : bool
    member Item : key:'Key -> 'Value with get
  end

Full name: Microsoft.FSharp.Collections.Map<_,_>

  type: Map<'Key,'Value>
  implements: IComparable
  implements: Collections.Generic.IDictionary<'Key,'Value>
  implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<'Key,'Value>>
  implements: seq<Collections.Generic.KeyValuePair<'Key,'Value>>
  implements: Collections.IEnumerable

val tryFind : 'Key -> Map<'Key,'T> -> 'T option (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.tryFind

union case Option.None: Option<'T>
val add : 'Key -> 'T -> Map<'Key,'T> -> Map<'Key,'T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.add

union case Option.Some: 'T -> Option<'T>
val m' : int * int list
val mi' : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val empty<'Key,'T (requires comparison)> : Map<'Key,'T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.empty

val map : ('Key -> 'T -> 'U) -> Map<'Key,'T> -> Map<'Key,'U> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.map

val alignJaroWinkler : ('a array -> 'a array -> Map<int,int>)

Full name: Snippet.alignJaroWinkler

So break out your F# Interactive and try something like:

alignJaroWinkler [|"DUDE"; "DUDERSON"|] [|"D"; "RONNY"; "DUDERSON"|]

and you’ll hopefully get something back like this:

val it : Map = map [(0, 0); (2, 1)]

This is just a map from women tokens to men tokens. (0,0) indicates that “D” matched best with “DUDE” and (2,1) indicates that “DUDERSON” matched best with “DUDERSON”. Poor “RONNY” ended up being the old maid.

That’s it for today, but of course, there’s a whole lot more to cover here. We need to look at how to manipulate this output into something a bit nicer. I also have a bunch of tweaks and improvements to this pair of algorithms that I want to talk about. Finally, I want to sidetrack for a bit and tunnel into how one goes from from the horrible imperative pseudocode above, to the magnificent pure functional code underneath it.

As per usual, all of the code use in this post (and previous posts in the series) can be found on github.

Enjoy this post? Continue the conversation with me on twitter.

Tags: , , ,

One comment

  1. […] writing the previous article on tokenized matching I realized I left out some important background information on Jaro-Winkler distance. The Vector […]

Leave a comment