When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is two fold. First, know that many are outdated and have newer and better implementations, so they can be ignored. Second, of the newest extension of each algorithm, each has a particular use depending on the kind of data you’re comparing.
Jaro-Winkler distance is a particularly good choice when comparing short strings of human entered data, such as names. This is due to it’s relative robustness against letter transpositions and it’s weighting of similarity toward the beginning of the string. When it comes to comparing names, a slightly customized version of tokenized Jaro-Winkler is one of the first things in my toolbox that I reach for.
Jaro-Winkler distance also has the benefit of being a rather easy to understand algorithm. It’s composed of two parts, Jaro’s original algorithm and Winkler’s extension. In this installment we’ll roll up our sleeves and dig into the first part of this algorithm, Jaro distance.
This is the equation for Jaro distance. It’s made up of the average of three sub-calculations.
- The ratio of matching characters to the length of the first string.
- The ratio of matching characters to the length of the second string.
- The ratio of non-transpositions to the number matching of characters.
My implementation has been optimized a bit, but I think it’s still rather easy to understand.
1: open System
2:
3: /// Given an offset and a radius from that offset,
4: /// does mChar exist in that part of str?
5: let inline existsInWin (mChar: char) (str: string) offset rad =
6: let startAt = max 0 (offset - rad)
7: let endAt = min (offset + rad) (String.length str - 1)
8: if endAt - startAt < 0 then false
9: else
10: let rec exists index =
11: if str.[index] = mChar then true
12: elif index = endAt then false
13: else exists (index + 1)
14: exists startAt
15:
16: /// The jaro distance between s1 and s2
17: let jaro s1 s2 =
18:
19: // The radius is half of the lesser
20: // of the two string lengths rounded up.
21: let matchRadius =
22: let minLen =
23: min (String.length s1) (String.length s2) in
24: minLen / 2 + minLen % 2
25:
26: // An inner function which recursively finds the number
27: // of matched characters within the radius.
28: let commonChars (chars1: string) (chars2: string) =
29: let rec inner i result =
30: match i with
31: | -1 -> result
32: | _ -> if existsInWin chars1.[i] chars2 i matchRadius
33: then inner (i - 1) (chars1.[i] :: result)
34: else inner (i - 1) result
35: inner (chars1.Length - 1) []
36:
37: // The sets of common characters and their lengths as floats
38: let c1 = commonChars s1 s2
39: let c2 = commonChars s2 s1
40: let c1length = float (List.length c1)
41: let c2length = float (List.length c2)
42:
43: // The number of transpositions within
44: // the sets of common characters.
45: let transpositions =
46: let rec inner cl1 cl2 result =
47: match cl1, cl2 with
48: | [], _ | _, [] -> result
49: | c1h :: c1t, c2h :: c2t ->
50: if c1h <> c2h
51: then inner c1t c2t (result + 1.0)
52: else inner c1t c2t result
53: let mismatches = inner c1 c2 0.0
54: // If one common string is longer than the other
55: // each additional char counts as half a transposition
56: (mismatches + abs (c1length - c2length)) / 2.0
57:
58: let s1length = float (String.length s1)
59: let s2length = float (String.length s2)
60: let tLength = max c1length c2length
61:
62: // The jaro distance as given by
63: // 1/3 ( m2/|s1| + m1/|s2| + (mc-t)/mc )
64: let result = (c1length / s1length +
65: c2length / s2length +
66: (tLength - transpositions) / tLength)
67: / 3.0
68:
69: // This is for cases where |s1|, |s2| or m are zero
70: if Double.IsNaN result then 0.0 else result
We’re quite lucky to have examples of the algorithm in action available right in the Wikipedia article to use as unit tests. When optimizing, you couldn’t ask for more.
Take ‘DWAYNE’ and ‘DUANE’ for example. According to the article we should end up with the following results:
So, let’s break it down. The lesser of the two strings, DUANE, is 5 characters long. When we set minLen to 5 in (minLen / 2 + minLen % 2 is 3), we get a radius of 3.
Next we find the common characters in each direction. Comparing at DWAYNE to DUANE with a radius of 3 we can see that D, A, N and E will match, giving us a c1 = “DANE”. Similarly, comparing at DUANE to DWAYNE we get exactly the same thing. That is, c2 = “DANE” as well.
As you might have guessed by the fact both sets of common strings are the same, we have zero transpositions in this instance.
So now, we just plug in the numbers and get 1/3 * (4/6 + 4/5 + (4-0)/4) = 0.822. Not too difficult, eh?
For a better understanding of how transpositions end up working out, try walking through the debugger with the MARTHA vs MARHTA test below.
72: open Xunit
73:
74: [<Fact>]
75: let ``Jaro identity test`` () =
76: let result = jaro "RICK" "RICK"
77: Assert.Equal("1.000", String.Format("{0:0.000}", result))
78:
79: [<Fact>]
80: let ``Jaro martha test`` () =
81: let result = jaro "MARTHA" "MARHTA"
82: Assert.Equal("0.944", String.Format("{0:0.000}", result))
83:
84: [<Fact>]
85: let ``Jaro dwayne test`` () =
86: let result = jaro "DWAYNE" "DUANE"
87: Assert.Equal("0.822", String.Format("{0:0.000}", result))
88:
89: [<Fact>]
90: let ``Jaro dixon test`` () =
91: let result = jaro "DIXON" "DICKSONX"
92: Assert.Equal("0.767", String.Format("{0:0.000}", result))
93:
94:
namespace System
val existsInWin : char -> string -> 'a -> 'b -> bool (requires member ( – ) and member ( + ))
Full name: Snippet.existsInWin
Given an offset and a radius from that office,
does mChar exist in that part of str?
val mChar : char
type: char
implements: IComparable
implements: IConvertible
implements: IComparable<char>
implements: IEquatable<char>
inherits: ValueType
Multiple items
val char : 'T -> char (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.char
——————–
type char = Char
Full name: Microsoft.FSharp.Core.char
type: char
implements: IComparable
implements: IConvertible
implements: IComparable<char>
implements: IEquatable<char>
inherits: ValueType
val str : string
type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
Multiple items
val string : 'T -> string
Full name: Microsoft.FSharp.Core.Operators.string
——————–
type string = String
Full name: Microsoft.FSharp.Core.string
type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
val offset : 'a (requires member ( – ) and member ( + ))
val rad : 'b (requires member ( – ) and member ( + ))
val startAt : int
type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType
val max : 'T -> 'T -> 'T (requires comparison)
Full name: Microsoft.FSharp.Core.Operators.max
val endAt : int
type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType
val min : 'T -> 'T -> 'T (requires comparison)
Full name: Microsoft.FSharp.Core.Operators.min
type String =
class
new : char -> string
new : char * int * int -> string
new : System.SByte -> string
new : System.SByte * int * int -> string
new : System.SByte * int * int * System.Text.Encoding -> string
new : char [] * int * int -> string
new : char [] -> string
new : char * int -> string
member Chars : int -> char
member Clone : unit -> obj
member CompareTo : obj -> int
member CompareTo : string -> int
member Contains : string -> bool
member CopyTo : int * char [] * int * int -> unit
member EndsWith : string -> bool
member EndsWith : string * System.StringComparison -> bool
member EndsWith : string * bool * System.Globalization.CultureInfo -> bool
member Equals : obj -> bool
member Equals : string -> bool
member Equals : string * System.StringComparison -> bool
member GetEnumerator : unit -> System.CharEnumerator
member GetHashCode : unit -> int
member GetTypeCode : unit -> System.TypeCode
member IndexOf : char -> int
member IndexOf : string -> int
member IndexOf : char * int -> int
member IndexOf : string * int -> int
member IndexOf : string * System.StringComparison -> int
member IndexOf : char * int * int -> int
member IndexOf : string * int * int -> int
member IndexOf : string * int * System.StringComparison -> int
member IndexOf : string * int * int * System.StringComparison -> int
member IndexOfAny : char [] -> int
member IndexOfAny : char [] * int -> int
member IndexOfAny : char [] * int * int -> int
member Insert : int * string -> string
member IsNormalized : unit -> bool
member IsNormalized : System.Text.NormalizationForm -> bool
member LastIndexOf : char -> int
member LastIndexOf : string -> int
member LastIndexOf : char * int -> int
member LastIndexOf : string * int -> int
member LastIndexOf : string * System.StringComparison -> int
member LastIndexOf : char * int * int -> int
member LastIndexOf : string * int * int -> int
member LastIndexOf : string * int * System.StringComparison -> int
member LastIndexOf : string * int * int * System.StringComparison -> int
member LastIndexOfAny : char [] -> int
member LastIndexOfAny : char [] * int -> int
member LastIndexOfAny : char [] * int * int -> int
member Length : int
member Normalize : unit -> string
member Normalize : System.Text.NormalizationForm -> string
member PadLeft : int -> string
member PadLeft : int * char -> string
member PadRight : int -> string
member PadRight : int * char -> string
member Remove : int -> string
member Remove : int * int -> string
member Replace : char * char -> string
member Replace : string * string -> string
member Split : char [] -> string []
member Split : char [] * int -> string []
member Split : char [] * System.StringSplitOptions -> string []
member Split : string [] * System.StringSplitOptions -> string []
member Split : char [] * int * System.StringSplitOptions -> string []
member Split : string [] * int * System.StringSplitOptions -> string []
member StartsWith : string -> bool
member StartsWith : string * System.StringComparison -> bool
member StartsWith : string * bool * System.Globalization.CultureInfo -> bool
member Substring : int -> string
member Substring : int * int -> string
member ToCharArray : unit -> char []
member ToCharArray : int * int -> char []
member ToLower : unit -> string
member ToLower : System.Globalization.CultureInfo -> string
member ToLowerInvariant : unit -> string
member ToString : unit -> string
member ToString : System.IFormatProvider -> string
member ToUpper : unit -> string
member ToUpper : System.Globalization.CultureInfo -> string
member ToUpperInvariant : unit -> string
member Trim : unit -> string
member Trim : char [] -> string
member TrimEnd : char [] -> string
member TrimStart : char [] -> string
static val Empty : string
static member Compare : string * string -> int
static member Compare : string * string * bool -> int
static member Compare : string * string * System.StringComparison -> int
static member Compare : string * string * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
static member Compare : string * string * bool * System.Globalization.CultureInfo -> int
static member Compare : string * int * string * int * int -> int
static member Compare : string * int * string * int * int * bool -> int
static member Compare : string * int * string * int * int * System.StringComparison -> int
static member Compare : string * int * string * int * int * bool * System.Globalization.CultureInfo -> int
static member Compare : string * int * string * int * int * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
static member CompareOrdinal : string * string -> int
static member CompareOrdinal : string * int * string * int * int -> int
static member Concat : obj -> string
static member Concat : obj [] -> string
static member Concat<'T> : System.Collections.Generic.IEnumerable<'T> -> string
static member Concat : System.Collections.Generic.IEnumerable<string> -> string
static member Concat : string [] -> string
static member Concat : obj * obj -> string
static member Concat : string * string -> string
static member Concat : obj * obj * obj -> string
static member Concat : string * string * string -> string
static member Concat : obj * obj * obj * obj -> string
static member Concat : string * string * string * string -> string
static member Copy : string -> string
static member Equals : string * string -> bool
static member Equals : string * string * System.StringComparison -> bool
static member Format : string * obj -> string
static member Format : string * obj [] -> string
static member Format : string * obj * obj -> string
static member Format : System.IFormatProvider * string * obj [] -> string
static member Format : string * obj * obj * obj -> string
static member Intern : string -> string
static member IsInterned : string -> string
static member IsNullOrEmpty : string -> bool
static member IsNullOrWhiteSpace : string -> bool
static member Join : string * string [] -> string
static member Join : string * obj [] -> string
static member Join<'T> : string * System.Collections.Generic.IEnumerable<'T> -> string
static member Join : string * System.Collections.Generic.IEnumerable<string> -> string
static member Join : string * string [] * int * int -> string
end
Full name: System.String
type: String
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
val length : string -> int
Full name: Microsoft.FSharp.Core.String.length
val exists : (int -> bool)
val index : int
type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType
val jaro : string -> string -> float
Full name: Snippet.jaro
The jaro distance between s1 and s2
val s1 : string
type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
val s2 : string
type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
val matchRadius : int
type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType
val minLen : int
type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType
val commonChars : (string -> string -> char list)
val chars1 : string
type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
val chars2 : string
type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>
val inner : (int -> char list -> char list)
val i : int
type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType
val result : char list
type: char list
implements: Collections.IStructuralEquatable
implements: IComparable<List<char>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<char>
implements: Collections.IEnumerable
property String.Length: int
val c1 : char list
type: char list
implements: Collections.IStructuralEquatable
implements: IComparable<List<char>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<char>
implements: Collections.IEnumerable
val c2 : char list
type: char list
implements: Collections.IStructuralEquatable
implements: IComparable<List<char>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<char>
implements: Collections.IEnumerable
val c1length : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
Multiple items
val float : 'T -> float (requires member op_Explicit)
Full name: Microsoft.FSharp.Core.Operators.float
——————–
type float<'Measure> = float
Full name: Microsoft.FSharp.Core.float<_>
type: float<'Measure>
implements: IComparable
implements: IConvertible
implements: IFormattable
implements: IComparable<float<'Measure>>
implements: IEquatable<float<'Measure>>
inherits: ValueType
——————–
type float = Double
Full name: Microsoft.FSharp.Core.float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
Multiple items
module List
from Microsoft.FSharp.Collections
——————–
type List<'T> =
| ( [] )
| ( :: ) of 'T * 'T list
with
interface Collections.IEnumerable
interface Collections.Generic.IEnumerable<'T>
member Head : 'T
member IsEmpty : bool
member Item : index:int -> 'T with get
member Length : int
member Tail : 'T list
static member Cons : head:'T * tail:'T list -> 'T list
static member Empty : 'T list
end
Full name: Microsoft.FSharp.Collections.List<_>
type: List<'T>
implements: Collections.IStructuralEquatable
implements: IComparable<List<'T>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'T>
implements: Collections.IEnumerable
val length : 'T list -> int
Full name: Microsoft.FSharp.Collections.List.length
val c2length : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
val transpositions : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
val inner : ('a list -> 'a list -> float -> float) (requires equality)
val cl1 : 'a list (requires equality)
type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable
val cl2 : 'a list (requires equality)
type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable
val result : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
val c1h : 'a (requires equality)
val c1t : 'a list (requires equality)
type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable
val c2h : 'a (requires equality)
val c2t : 'a list (requires equality)
type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable
val mismatches : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
val abs : 'T -> 'T (requires member Abs)
Full name: Microsoft.FSharp.Core.Operators.abs
val s1length : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
val s2length : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
val tLength : float
type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
type Double =
struct
member CompareTo : obj -> int
member CompareTo : float -> int
member Equals : obj -> bool
member Equals : float -> bool
member GetHashCode : unit -> int
member GetTypeCode : unit -> System.TypeCode
member ToString : unit -> string
member ToString : string -> string
member ToString : System.IFormatProvider -> string
member ToString : string * System.IFormatProvider -> string
static val MinValue : float
static val MaxValue : float
static val Epsilon : float
static val NegativeInfinity : float
static val PositiveInfinity : float
static val NaN : float
static member IsInfinity : float -> bool
static member IsNaN : float -> bool
static member IsNegativeInfinity : float -> bool
static member IsPositiveInfinity : float -> bool
static member Parse : string -> float
static member Parse : string * System.Globalization.NumberStyles -> float
static member Parse : string * System.IFormatProvider -> float
static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> float
static member TryParse : string * float -> bool
static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * float -> bool
end
Full name: System.Double
type: Double
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType
Double.IsNaN(d: float) : bool
Multiple overloads
String.Format(format: string, args: obj []) : string
String.Format(format: string, arg0: obj) : string
String.Format(provider: IFormatProvider, format: string, args: obj []) : string
String.Format(format: string, arg0: obj, arg1: obj) : string
String.Format(format: string, arg0: obj, arg1: obj, arg2: obj) : string
I hope you enjoyed this installment of Record Linkage Algorithms in F#. Next time we’ll explore the Winkler extension to this algorithm and take a look at why weighting errors earlier in the string more heavily ends up giving significantly better results.
Edit: All current code for this series is now available on github.