Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 2)

Last time we dove into the Jaro distance algorithm and picked apart how each of its components are calculated. However, from a modern perspective Jaro alone is a rather weak method of string matching. It was Winkler’s extension that brought this algorithm into widespread modern use.

Matthew Jaro’s insight when inventing the Jaro distance algorithm was that humans tend to make certain kinds of errors. However, while working on the 1990 census William Winkler had an additional insight: these kinds of errors are much more likely to occur later in the string. By simply wrapping Jaro’s algorithm to forgive some penalty according to the similarity of the first few characters, Winkler was able to get significantly better results with little additional overhead.

Winker's Extension

Here we see that the Jaro-Winkler distance (dw) is equal to the result of the Jaro distance (dj) plus one minus that same value times some weighted metric (lp). This is a great mathematical trick for two reasons. First, as long as the weighted metric (lp) doesn’t exceed 1, the final result will stay within the 0-1 range of the Jaro metric. Second, it guarantees that the result of Jaro-Winkler will never be lower than the result of Jaro alone. It effectively lets Winkler enrich the results of Jaro by filling in the remaining gap.

The meaning of the weighted metric (lp) is even more simple. Here l is just the number of initial characters which match, up to a maximum of four. Meanwhile p is a weight which can be up to 1/4, or one over the maximum possible value of l.

As p gets larger and larger, more and more of the gap will be filled in by Winkler’s extension. When p is 1/4, strings where first four characters match will always be given a perfect score. Just as we’d never want strings like “JOHN” and “JOHNSON” to be considered a perfect match, we’d never want to use a p value of 1/4. After much experimentation, Winkler recommends using a p value of 0.1, which is also what I use.

Now that we’ve covered how the math works, let’s take a look at an actual implementation. If you’d like to see an implementation of the Jaro distance algorithm take a look at the previous installment of Record Linkage Algorithms in F#.

/// Calculate the Jaro-Winkler distance of s1 and s2
let jaroWinkler s1 s2 = 
    let jaroScore = jaro s1 s2
    // Accumulate the number of matching initial characters
    let maxLength = (min s1.Length s2.Length) - 1
    let rec calcL i acc =
        if i > maxLength || s1.[i] <> s2.[i] then acc
        else calcL (i + 1) (acc + 1.0)
    let l = min (calcL 0 0.0) 4.0
    // Calculate the JW distance
    let p = 0.1
    jaroScore + (l * p * (1.0 - jaroScore))

In my implementation I’ve allowed for some slight inefficiency in order to make it easier to play with the p value and number of characters examined. So far I’ve found Winkler’s choice of four characters to be the best, which incidentally has been shown as a great number of initial characters to look at when using a number of record linkage algorithms on things in the English language. However, I suspect that other values may work better when working in other languages.

The math here is so simple that I don’t feel it’s worth breaking down further, but I’ve included my tests built from the examples in the Jaro-Winkler distance Wikipedia article. For a fuller understanding just break out the debugger and play a bit on your own.

open Xunit

[<Fact>]
let ``Jaro-Winkler identity test`` () = 
    let result = jaroWinkler "RICK" "RICK"
    Assert.Equal("1.000", String.Format("{0:0.000}", result))

[<Fact>]
let ``Jaro-Winkler martha test`` () = 
    let result = jaroWinkler "MARTHA" "MARHTA"
    Assert.Equal("0.961", String.Format("{0:0.000}", result))

[<Fact>]
let ``Jaro-Winkler dwayne test`` () = 
    let result = jaroWinkler "DWAYNE" "DUANE"
    Assert.Equal("0.840", String.Format("{0:0.000}", result))

[<Fact>]
let ``Jaro-Winkler dixon test`` () =
    let result = jaroWinkler "DIXON" "DICKSONX"
    Assert.Equal("0.813", String.Format("{0:0.000}", result))

In the next installment of Record Linkage Algorithms in F# we’ll take a look a method for efficient token based matching with Jaro-Winkler. Stay tuned!

Edit: All current code for this series is now available on github.

Leave a Reply

Blog at WordPress.com.

%d bloggers like this: