Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 1)

When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is two fold. First, know that many are outdated and have newer and better implementations, so they can be ignored. Second, of the newest extension of each algorithm, each has a particular use depending on the kind of data you’re comparing.

Jaro-Winkler distance is a particularly good choice when comparing short strings of human entered data, such as names. This is due to it’s relative robustness against letter transpositions and it’s weighting of similarity toward the beginning of the string. When it comes to comparing names, a slightly customized version of tokenized Jaro-Winkler is one of the first things in my toolbox that I reach for.

Jaro-Winkler distance also has the benefit of being a rather easy to understand algorithm. It’s composed of two parts, Jaro’s original algorithm and Winkler’s extension.  In this installment we’ll roll up our sleeves and dig into the first part of this algorithm, Jaro distance.

The Jaro Algorithm

This is the equation for Jaro distance. It’s made up of the average of three sub-calculations.

  1. The ratio of matching characters to the length of the first string.
  2. The ratio of matching characters to the length of the second string.
  3. The ratio of non-transpositions to the number matching of characters.

My implementation has been optimized a bit, but I think it’s still rather easy to understand.

 1: open System
 2: 
 3: /// Given an offset and a radius from that offset, 
 4: /// does mChar exist in that part of str?
 5: let inline existsInWin (mChar: char) (str: string) offset rad =
 6:     let startAt = max 0 (offset - rad)
 7:     let endAt = min (offset + rad) (String.length str - 1)  
 8:     if endAt - startAt < 0 then false
 9:     else
10:         let rec exists index =
11:             if str.[index] = mChar then true
12:             elif index = endAt then false
13:             else exists (index + 1)
14:         exists startAt
15: 
16: /// The jaro distance between s1 and s2
17: let jaro s1 s2 =
18:     
19:     // The radius is half of the lesser 
20:     // of the two string lengths rounded up.
21:     let matchRadius = 
22:         let minLen = 
23:             min (String.length s1) (String.length s2) in
24:               minLen / 2 + minLen % 2
25: 
26:     // An inner function which recursively finds the number  
27:     // of matched characters within the radius.
28:     let commonChars (chars1: string) (chars2: string) =
29:         let rec inner i result = 
30:             match i with
31:             | -1 -> result
32:             | _ -> if existsInWin chars1.[i] chars2 i matchRadius
33:                    then inner (i - 1) (chars1.[i] :: result)
34:                    else inner (i - 1) result
35:         inner (chars1.Length - 1) []
36: 
37:     // The sets of common characters and their lengths as floats 
38:     let c1 = commonChars s1 s2
39:     let c2 = commonChars s2 s1
40:     let c1length = float (List.length c1)
41:     let c2length = float (List.length c2)
42:     
43:     // The number of transpositions within 
44:     // the sets of common characters.
45:     let transpositions = 
46:         let rec inner cl1 cl2 result = 
47:             match cl1, cl2 with
48:             | [], _ | _, [] -> result
49:             | c1h :: c1t, c2h :: c2t -> 
50:                 if c1h <> c2h
51:                 then inner c1t c2t (result + 1.0)
52:                 else inner c1t c2t result
53:         let mismatches = inner c1 c2 0.0
54:         // If one common string is longer than the other
55:         // each additional char counts as half a transposition
56:         (mismatches + abs (c1length - c2length)) / 2.0
57: 
58:     let s1length = float (String.length s1)
59:     let s2length = float (String.length s2)
60:     let tLength = max c1length c2length
61: 
62:     // The jaro distance as given by 
63:     // 1/3 ( m2/|s1| + m1/|s2| + (mc-t)/mc )
64:     let result = (c1length / s1length +
65:                   c2length / s2length + 
66:                   (tLength - transpositions) / tLength)
67:                  / 3.0
68: 
69:     // This is for cases where |s1|, |s2| or m are zero 
70:     if Double.IsNaN result then 0.0 else result

We’re quite lucky to have examples of the algorithm in action available right in the Wikipedia article to use as unit tests. When optimizing, you couldn’t ask for more.

Take ‘DWAYNE’ and ‘DUANE’ for example. According to the article we should end up with the following results:

DWAYNE vs DUANE

So, let’s break it down. The lesser of the two strings, DUANE, is 5 characters long. When we set minLen to 5 in (minLen / 2 + minLen % 2 is 3), we get a radius of 3.

Next we find the common characters in each direction. Comparing at DWAYNE to DUANE with a radius of 3 we can see that D, A, N and E will match, giving us a c1 = “DANE”. Similarly, comparing at DUANE to DWAYNE we get exactly the same thing. That is, c2 = “DANE” as well.

As you might have guessed by the fact both sets of common strings are the same, we have zero transpositions in this instance.

So now, we just plug in the numbers and get 1/3 * (4/6 + 4/5 + (4-0)/4) = 0.822. Not too difficult, eh?

For a better understanding of how transpositions end up working out, try walking through the debugger with the MARTHA vs MARHTA test below.

72: open Xunit
73: 
74: [<Fact>]
75: let ``Jaro identity test`` () = 
76:     let result = jaro "RICK" "RICK"
77:     Assert.Equal("1.000", String.Format("{0:0.000}", result))
78: 
79: [<Fact>]
80: let ``Jaro martha test`` () =
81:     let result = jaro "MARTHA" "MARHTA"
82:     Assert.Equal("0.944", String.Format("{0:0.000}", result))
83: 
84: [<Fact>]
85: let ``Jaro dwayne test`` () = 
86:     let result = jaro "DWAYNE" "DUANE"
87:     Assert.Equal("0.822", String.Format("{0:0.000}", result))
88: 
89: [<Fact>]
90: let ``Jaro dixon test`` () =
91:     let result = jaro "DIXON" "DICKSONX"
92:     Assert.Equal("0.767", String.Format("{0:0.000}", result))
93: 
94: 

namespace System
val existsInWin : char -> string -> 'a -> 'b -> bool (requires member ( – ) and member ( + ))

Full name: Snippet.existsInWin

Given an offset and a radius from that office,
 does mChar exist in that part of str?

val mChar : char

  type: char
  implements: IComparable
  implements: IConvertible
  implements: IComparable<char>
  implements: IEquatable<char>
  inherits: ValueType

Multiple items

val char : 'T -> char (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.char

——————–

type char = Char

Full name: Microsoft.FSharp.Core.char

  type: char
  implements: IComparable
  implements: IConvertible
  implements: IComparable<char>
  implements: IEquatable<char>
  inherits: ValueType

val str : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = String

Full name: Microsoft.FSharp.Core.string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val offset : 'a (requires member ( – ) and member ( + ))
val rad : 'b (requires member ( – ) and member ( + ))
val startAt : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val max : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.max

val endAt : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

type String =
  class
    new : char -> string
    new : char * int * int -> string
    new : System.SByte -> string
    new : System.SByte * int * int -> string
    new : System.SByte * int * int * System.Text.Encoding -> string
    new : char [] * int * int -> string
    new : char [] -> string
    new : char * int -> string
    member Chars : int -> char
    member Clone : unit -> obj
    member CompareTo : obj -> int
    member CompareTo : string -> int
    member Contains : string -> bool
    member CopyTo : int * char [] * int * int -> unit
    member EndsWith : string -> bool
    member EndsWith : string * System.StringComparison -> bool
    member EndsWith : string * bool * System.Globalization.CultureInfo -> bool
    member Equals : obj -> bool
    member Equals : string -> bool
    member Equals : string * System.StringComparison -> bool
    member GetEnumerator : unit -> System.CharEnumerator
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member IndexOf : char -> int
    member IndexOf : string -> int
    member IndexOf : char * int -> int
    member IndexOf : string * int -> int
    member IndexOf : string * System.StringComparison -> int
    member IndexOf : char * int * int -> int
    member IndexOf : string * int * int -> int
    member IndexOf : string * int * System.StringComparison -> int
    member IndexOf : string * int * int * System.StringComparison -> int
    member IndexOfAny : char [] -> int
    member IndexOfAny : char [] * int -> int
    member IndexOfAny : char [] * int * int -> int
    member Insert : int * string -> string
    member IsNormalized : unit -> bool
    member IsNormalized : System.Text.NormalizationForm -> bool
    member LastIndexOf : char -> int
    member LastIndexOf : string -> int
    member LastIndexOf : char * int -> int
    member LastIndexOf : string * int -> int
    member LastIndexOf : string * System.StringComparison -> int
    member LastIndexOf : char * int * int -> int
    member LastIndexOf : string * int * int -> int
    member LastIndexOf : string * int * System.StringComparison -> int
    member LastIndexOf : string * int * int * System.StringComparison -> int
    member LastIndexOfAny : char [] -> int
    member LastIndexOfAny : char [] * int -> int
    member LastIndexOfAny : char [] * int * int -> int
    member Length : int
    member Normalize : unit -> string
    member Normalize : System.Text.NormalizationForm -> string
    member PadLeft : int -> string
    member PadLeft : int * char -> string
    member PadRight : int -> string
    member PadRight : int * char -> string
    member Remove : int -> string
    member Remove : int * int -> string
    member Replace : char * char -> string
    member Replace : string * string -> string
    member Split : char [] -> string []
    member Split : char [] * int -> string []
    member Split : char [] * System.StringSplitOptions -> string []
    member Split : string [] * System.StringSplitOptions -> string []
    member Split : char [] * int * System.StringSplitOptions -> string []
    member Split : string [] * int * System.StringSplitOptions -> string []
    member StartsWith : string -> bool
    member StartsWith : string * System.StringComparison -> bool
    member StartsWith : string * bool * System.Globalization.CultureInfo -> bool
    member Substring : int -> string
    member Substring : int * int -> string
    member ToCharArray : unit -> char []
    member ToCharArray : int * int -> char []
    member ToLower : unit -> string
    member ToLower : System.Globalization.CultureInfo -> string
    member ToLowerInvariant : unit -> string
    member ToString : unit -> string
    member ToString : System.IFormatProvider -> string
    member ToUpper : unit -> string
    member ToUpper : System.Globalization.CultureInfo -> string
    member ToUpperInvariant : unit -> string
    member Trim : unit -> string
    member Trim : char [] -> string
    member TrimEnd : char [] -> string
    member TrimStart : char [] -> string
    static val Empty : string
    static member Compare : string * string -> int
    static member Compare : string * string * bool -> int
    static member Compare : string * string * System.StringComparison -> int
    static member Compare : string * string * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
    static member Compare : string * string * bool * System.Globalization.CultureInfo -> int
    static member Compare : string * int * string * int * int -> int
    static member Compare : string * int * string * int * int * bool -> int
    static member Compare : string * int * string * int * int * System.StringComparison -> int
    static member Compare : string * int * string * int * int * bool * System.Globalization.CultureInfo -> int
    static member Compare : string * int * string * int * int * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
    static member CompareOrdinal : string * string -> int
    static member CompareOrdinal : string * int * string * int * int -> int
    static member Concat : obj -> string
    static member Concat : obj [] -> string
    static member Concat<'T> : System.Collections.Generic.IEnumerable<'T> -> string
    static member Concat : System.Collections.Generic.IEnumerable<string> -> string
    static member Concat : string [] -> string
    static member Concat : obj * obj -> string
    static member Concat : string * string -> string
    static member Concat : obj * obj * obj -> string
    static member Concat : string * string * string -> string
    static member Concat : obj * obj * obj * obj -> string
    static member Concat : string * string * string * string -> string
    static member Copy : string -> string
    static member Equals : string * string -> bool
    static member Equals : string * string * System.StringComparison -> bool
    static member Format : string * obj -> string
    static member Format : string * obj [] -> string
    static member Format : string * obj * obj -> string
    static member Format : System.IFormatProvider * string * obj [] -> string
    static member Format : string * obj * obj * obj -> string
    static member Intern : string -> string
    static member IsInterned : string -> string
    static member IsNullOrEmpty : string -> bool
    static member IsNullOrWhiteSpace : string -> bool
    static member Join : string * string [] -> string
    static member Join : string * obj [] -> string
    static member Join<'T> : string * System.Collections.Generic.IEnumerable<'T> -> string
    static member Join : string * System.Collections.Generic.IEnumerable<string> -> string
    static member Join : string * string [] * int * int -> string
  end

Full name: System.String

  type: String
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val length : string -> int

Full name: Microsoft.FSharp.Core.String.length

val exists : (int -> bool)
val index : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val jaro : string -> string -> float

Full name: Snippet.jaro

The jaro distance between s1 and s2

val s1 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val s2 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val matchRadius : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val minLen : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val commonChars : (string -> string -> char list)
val chars1 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val chars2 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val inner : (int -> char list -> char list)
val i : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val result : char list

  type: char list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<char>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<char>
  implements: Collections.IEnumerable

property String.Length: int
val c1 : char list

  type: char list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<char>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<char>
  implements: Collections.IEnumerable

val c2 : char list

  type: char list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<char>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<char>
  implements: Collections.IEnumerable

val c1length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Multiple items

val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<float<'Measure>>
  implements: IEquatable<float<'Measure>>
  inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface Collections.IEnumerable
    interface Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val length : 'T list -> int

Full name: Microsoft.FSharp.Collections.List.length

val c2length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val transpositions : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val inner : ('a list -> 'a list -> float -> float) (requires equality)
val cl1 : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val cl2 : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val result : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val c1h : 'a (requires equality)
val c1t : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val c2h : 'a (requires equality)
val c2t : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val mismatches : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val abs : 'T -> 'T (requires member Abs)

Full name: Microsoft.FSharp.Core.Operators.abs

val s1length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val s2length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val tLength : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

type Double =
  struct
    member CompareTo : obj -> int
    member CompareTo : float -> int
    member Equals : obj -> bool
    member Equals : float -> bool
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member ToString : unit -> string
    member ToString : string -> string
    member ToString : System.IFormatProvider -> string
    member ToString : string * System.IFormatProvider -> string
    static val MinValue : float
    static val MaxValue : float
    static val Epsilon : float
    static val NegativeInfinity : float
    static val PositiveInfinity : float
    static val NaN : float
    static member IsInfinity : float -> bool
    static member IsNaN : float -> bool
    static member IsNegativeInfinity : float -> bool
    static member IsPositiveInfinity : float -> bool
    static member Parse : string -> float
    static member Parse : string * System.Globalization.NumberStyles -> float
    static member Parse : string * System.IFormatProvider -> float
    static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> float
    static member TryParse : string * float -> bool
    static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * float -> bool
  end

Full name: System.Double

  type: Double
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Double.IsNaN(d: float) : bool
Multiple overloads

String.Format(format: string, args: obj []) : string

String.Format(format: string, arg0: obj) : string

String.Format(provider: IFormatProvider, format: string, args: obj []) : string

String.Format(format: string, arg0: obj, arg1: obj) : string

String.Format(format: string, arg0: obj, arg1: obj, arg2: obj) : string

I hope you enjoyed this installment of Record Linkage Algorithms in F#. Next time we’ll explore the Winkler extension to this algorithm and take a look at why weighting errors earlier in the string more heavily ends up giving significantly better results.

Edit: All current code for this series is now available on github.

Enjoy this post? Continue the conversation with me on twitter.

Tags: , , , , , , ,

3 comments

  1. Good job!
    Both interesting, well explained and useful!
    Waiting for Winkler :)

  2. […] Last time we dove into the Jaro distance algorithm and picked apart how each of its components are calculated. However, from a modern perspective Jaro alone is a rather weak method of string matching. It was Winkler’s extension that brought this algorithm into widespread modern use. […]

  3. […] Damerau–Levenshtein distance (or restricted edit distance). Damerau-Levenshtien adds simple Jaro-Winkler like transpositions to the original algorithm. However, in the process of adding just a minor tweak […]

Leave a comment