Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 1)

When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is two fold. First, know that many are outdated and have newer and better implementations, so they can be ignored. Second, of the newest extension of each algorithm, each has a particular use depending on the kind of data you’re comparing.

Jaro-Winkler distance is a particularly good choice when comparing short strings of human entered data, such as names. This is due to it’s relative robustness against letter transpositions and it’s weighting of similarity toward the beginning of the string. When it comes to comparing names, a slightly customized version of tokenized Jaro-Winkler is one of the first things in my toolbox that I reach for.

Jaro-Winkler distance also has the benefit of being a rather easy to understand algorithm. It’s composed of two parts, Jaro’s original algorithm and Winkler’s extension.  In this installment we’ll roll up our sleeves and dig into the first part of this algorithm, Jaro distance.

The Jaro Algorithm

This is the equation for Jaro distance. It’s made up of the average of three sub-calculations.

  1. The ratio of matching characters to the length of the first string.
  2. The ratio of matching characters to the length of the second string.
  3. The ratio of non-transpositions to the number matching of characters.

My implementation has been optimized a bit, but I think it’s still rather easy to understand.

open System

/// Given an offset and a radius from that offset, 
/// does mChar exist in that part of str?
let inline existsInWin (mChar: char) (str: string) offset rad =
    let startAt = max 0 (offset - rad)
    let endAt = min (offset + rad) (String.length str - 1)
    if endAt - startAt < 0 then false
    else
        let rec exists index =
            if str.[index] = mChar then true
            elif index = endAt then false
            else exists (index + 1)
        exists startAt

/// The jaro distance between s1 and s2
let jaro s1 s2 =

    // The radius is half of the lesser 
    // of the two string lengths rounded up.
    let matchRadius =
        let minLen =
            min (String.length s1) (String.length s2) in
              minLen / 2 + minLen % 2

    // An inner function which recursively finds the number  
    // of matched characters within the radius.
    let commonChars (chars1: string) (chars2: string) =
        let rec inner i result =
            match i with
            | -1 -> result
            | _ -> if existsInWin chars1.[i] chars2 i matchRadius
                   then inner (i - 1) (chars1.[i] :: result)
                   else inner (i - 1) result
        inner (chars1.Length - 1) []

    // The sets of common characters and their lengths as floats 
    let c1 = commonChars s1 s2
    let c2 = commonChars s2 s1
    let c1length = float (List.length c1)
    let c2length = float (List.length c2)

    // The number of transpositions within 
    // the sets of common characters.
    let transpositions =
        let rec inner cl1 cl2 result =
            match cl1, cl2 with
            | [], _ | _, [] -> result
            | c1h :: c1t, c2h :: c2t ->
                if c1h <> c2h
                then inner c1t c2t (result + 1.0)
                else inner c1t c2t result
        let mismatches = inner c1 c2 0.0
        // If one common string is longer than the other
        // each additional char counts as half a transposition
        (mismatches + abs (c1length - c2length)) / 2.0

    let s1length = float (String.length s1)
    let s2length = float (String.length s2)
    let tLength = max c1length c2length

    // The jaro distance as given by 
    // 1/3 ( m2/|s1| + m1/|s2| + (mc-t)/mc )
    let result = (c1length / s1length +
                  c2length / s2length +
                  (tLength - transpositions) / tLength)
                 / 3.0

    // This is for cases where |s1|, |s2| or m are zero 
    if Double.IsNaN result then 0.0 else result

We’re quite lucky to have examples of the algorithm in action available right in the Wikipedia article to use as unit tests. When optimizing, you couldn’t ask for more.

Take ‘DWAYNE’ and ‘DUANE’ for example. According to the article we should end up with the following results:

DWAYNE vs DUANE

So, let’s break it down. The lesser of the two strings, DUANE, is 5 characters long. When we set minLen to 5 in (minLen / 2 + minLen % 2 is 3), we get a radius of 3.

Next we find the common characters in each direction. Comparing at DWAYNE to DUANE with a radius of 3 we can see that D, A, N and E will match, giving us a c1 = “DANE”. Similarly, comparing at DUANE to DWAYNE we get exactly the same thing. That is, c2 = “DANE” as well.

As you might have guessed by the fact both sets of common strings are the same, we have zero transpositions in this instance.

So now, we just plug in the numbers and get 1/3 * (4/6 + 4/5 + (4-0)/4) = 0.822. Not too difficult, eh?

For a better understanding of how transpositions end up working out, try walking through the debugger with the MARTHA vs MARHTA test below.

open Xunit

[<Fact>]
let ``Jaro identity test`` () =
    let result = jaro "RICK" "RICK"
    Assert.Equal("1.000", String.Format("{0:0.000}", result))

[<Fact>]
let ``Jaro martha test`` () =
    let result = jaro "MARTHA" "MARHTA"
    Assert.Equal("0.944", String.Format("{0:0.000}", result))

[<Fact>]
let ``Jaro dwayne test`` () =
    let result = jaro "DWAYNE" "DUANE"
    Assert.Equal("0.822", String.Format("{0:0.000}", result))

[<Fact>]
let ``Jaro dixon test`` () =
    let result = jaro "DIXON" "DICKSONX"
    Assert.Equal("0.767", String.Format("{0:0.000}", result))

I hope you enjoyed this installment of Record Linkage Algorithms in F#. Next time we’ll explore the Winkler extension to this algorithm and take a look at why weighting errors earlier in the string more heavily ends up giving significantly better results.

Edit: All current code for this series is now available on github.

Leave a Reply

Blog at WordPress.com.

Discover more from Inviting Epiphany

Subscribe now to keep reading and get access to the full archive.

Continue reading