6
Sep 11

## Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 2)

Last time we dove into the Jaro distance algorithm and picked apart how each of its components are calculated. However, from a modern perspective Jaro alone is a rather weak method of string matching. It was Winkler’s extension that brought this algorithm into widespread modern use.

Matthew Jaro’s insight when inventing the Jaro distance algorithm was that humans tend to make certain kinds of errors. However, while working on the 1990 census William Winkler had an additional insight: these kinds of errors are much more likely to occur later in the string. By simply wrapping Jaro’s algorithm to forgive some penalty according to the similarity of the first few characters, Winkler was able to get significantly better results with little additional overhead.

Here we see that the Jaro-Winkler distance (dw) is equal to the result of the Jaro distance (dj) plus one minus that same value times some weighted metric (lp). This is a great mathematical trick for two reasons. First, as long as the weighted metric (lp) doesn’t exceed 1, the final result will stay within the 0-1 range of the Jaro metric. Second, it guarantees that the result of Jaro-Winkler will never be lower than the result of Jaro alone. It effectively lets Winkler enrich the results of Jaro by filling in the remaining gap.

The meaning of the weighted metric (lp) is even more simple. Here l is just the number of initial characters which match, up to a maximum of four. Meanwhile p is a weight which can be up to 1/4, or one over the maximum possible value of l.

As p gets larger and larger, more and more of the gap will be filled in by Winkler’s extension. When p is 1/4, strings where first four characters match will always be given a perfect score. Just as we’d never want strings like “JOHN” and “JOHNSON” to be considered a perfect match, we’d never want to use a p value of 1/4. After much experimentation, Winkler recommends using a p value of 0.1, which is also what I use.

Now that we’ve covered how the math works, let’s take a look at an actual implementation. If you’d like to see an implementation of the Jaro distance algorithm take a look at the previous installment of Record Linkage Algorithms in F#.

``` 1: /// Calculate the Jaro-Winkler distance of s1 and s2
2: let jaroWinkler s1 s2 =
3:     let jaroScore = jaro s1 s2
4:     // Accumulate the number of matching initial characters
5:     let maxLength = (min s1.Length s2.Length) - 1
6:     let rec calcL i acc =
7:         if i > maxLength || s1.[i] <> s2.[i] then acc
8:         else calcL (i + 1) (acc + 1.0)
9:     let l = min (calcL 0 0.0) 4.0
10:     // Calculate the JW distance
11:     let p = 0.1
12:     jaroScore + (l * p * (1.0 - jaroScore))```

In my implementation I’ve allowed for some slight inefficiency in order to make it easier to play with the p value and number of characters examined. So far I’ve found Winkler’s choice of four characters to be the best, which incidentally has been shown as a great number of initial characters to look at when using a number of record linkage algorithms on things in the English language. However, I suspect that other values may work better when working in other languages.

The math here is so simple that I don’t feel it’s worth breaking down further, but I’ve included my tests built from the examples in the Jaro-Winkler distance Wikipedia article. For a fuller understanding just break out the debugger and play a bit on your own.

```14: open Xunit
15:
16: [<Fact>]
17: let ``Jaro-Winkler identity test`` () =
18:     let result = jaroWinkler "RICK" "RICK"
19:     Assert.Equal("1.000", String.Format("{0:0.000}", result))
20:
21: [<Fact>]
22: let ``Jaro-Winkler martha test`` () =
23:     let result = jaroWinkler "MARTHA" "MARHTA"
24:     Assert.Equal("0.961", String.Format("{0:0.000}", result))
25:
26: [<Fact>]
27: let ``Jaro-Winkler dwayne test`` () =
28:     let result = jaroWinkler "DWAYNE" "DUANE"
29:     Assert.Equal("0.840", String.Format("{0:0.000}", result))
30:
31: [<Fact>]
32: let ``Jaro-Winkler dixon test`` () =
33:     let result = jaroWinkler "DIXON" "DICKSONX"
34:     Assert.Equal("0.813", String.Format("{0:0.000}", result))
35: ```

val jaroWinkler : 'a -> 'b -> float

Full name: Snippet.jaroWinkler

Calculate the Jaro-Winkler distance of s1 and s2

val s1 : 'a
val s2 : 'b
val jaroScore : float

type: float
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<float>
implements: System.IEquatable<float>
inherits: System.ValueType

val maxLength : int

type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

val calcL : (int -> float -> float)
val i : int

type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType

val acc : float

type: float
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<float>
implements: System.IEquatable<float>
inherits: System.ValueType

val l : float

type: float
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<float>
implements: System.IEquatable<float>
inherits: System.ValueType

val p : float

type: float
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<float>
implements: System.IEquatable<float>
inherits: System.ValueType

val result : float

type: float
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<float>
implements: System.IEquatable<float>
inherits: System.ValueType

module String

from Microsoft.FSharp.Core

Multiple items

type Format<'Printer,'State,'Residue,'Result,'Tuple> = PrintfFormat<'Printer,'State,'Residue,'Result,'Tuple>

Full name: Microsoft.FSharp.Core.Format<_,_,_,_,_>

type: Format<'Printer,'State,'Residue,'Result,'Tuple>
inherits: PrintfFormat<'Printer,'State,'Residue,'Result>

——————–

type Format<'Printer,'State,'Residue,'Result> = PrintfFormat<'Printer,'State,'Residue,'Result>

Full name: Microsoft.FSharp.Core.Format<_,_,_,_>

In the next installment of Record Linkage Algorithms in F# we’ll take a look a method for efficient token based matching with Jaro-Winkler. Stay tuned!

Edit: All current code for this series is now available on github.

2
Sep 11

## Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 1)

When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is two fold. First, know that many are outdated and have newer and better implementations, so they can be ignored. Second, of the newest extension of each algorithm, each has a particular use depending on the kind of data you’re comparing.

Jaro-Winkler distance is a particularly good choice when comparing short strings of human entered data, such as names. This is due to it’s relative robustness against letter transpositions and it’s weighting of similarity toward the beginning of the string. When it comes to comparing names, a slightly customized version of tokenized Jaro-Winkler is one of the first things in my toolbox that I reach for.

Jaro-Winkler distance also has the benefit of being a rather easy to understand algorithm. It’s composed of two parts, Jaro’s original algorithm and Winkler’s extension.  In this installment we’ll roll up our sleeves and dig into the first part of this algorithm, Jaro distance.

This is the equation for Jaro distance. It’s made up of the average of three sub-calculations.

1. The ratio of matching characters to the length of the first string.
2. The ratio of matching characters to the length of the second string.
3. The ratio of non-transpositions to the number matching of characters.

My implementation has been optimized a bit, but I think it’s still rather easy to understand.

``` 1: open System
2:
3: /// Given an offset and a radius from that offset,
4: /// does mChar exist in that part of str?
5: let inline existsInWin (mChar: char) (str: string) offset rad =
6:     let startAt = max 0 (offset - rad)
7:     let endAt = min (offset + rad) (String.length str - 1)
8:     if endAt - startAt < 0 then false
9:     else
10:         let rec exists index =
11:             if str.[index] = mChar then true
12:             elif index = endAt then false
13:             else exists (index + 1)
14:         exists startAt
15:
16: /// The jaro distance between s1 and s2
17: let jaro s1 s2 =
18:
19:     // The radius is half of the lesser
20:     // of the two string lengths rounded up.
22:         let minLen =
23:             min (String.length s1) (String.length s2) in
24:               minLen / 2 + minLen % 2
25:
26:     // An inner function which recursively finds the number
27:     // of matched characters within the radius.
28:     let commonChars (chars1: string) (chars2: string) =
29:         let rec inner i result =
30:             match i with
31:             | -1 -> result
32:             | _ -> if existsInWin chars1.[i] chars2 i matchRadius
33:                    then inner (i - 1) (chars1.[i] :: result)
34:                    else inner (i - 1) result
35:         inner (chars1.Length - 1) []
36:
37:     // The sets of common characters and their lengths as floats
38:     let c1 = commonChars s1 s2
39:     let c2 = commonChars s2 s1
40:     let c1length = float (List.length c1)
41:     let c2length = float (List.length c2)
42:
43:     // The number of transpositions within
44:     // the sets of common characters.
45:     let transpositions =
46:         let rec inner cl1 cl2 result =
47:             match cl1, cl2 with
48:             | [], _ | _, [] -> result
49:             | c1h :: c1t, c2h :: c2t ->
50:                 if c1h <> c2h
51:                 then inner c1t c2t (result + 1.0)
52:                 else inner c1t c2t result
53:         let mismatches = inner c1 c2 0.0
54:         // If one common string is longer than the other
55:         // each additional char counts as half a transposition
56:         (mismatches + abs (c1length - c2length)) / 2.0
57:
58:     let s1length = float (String.length s1)
59:     let s2length = float (String.length s2)
60:     let tLength = max c1length c2length
61:
62:     // The jaro distance as given by
63:     // 1/3 ( m2/|s1| + m1/|s2| + (mc-t)/mc )
64:     let result = (c1length / s1length +
65:                   c2length / s2length +
66:                   (tLength - transpositions) / tLength)
67:                  / 3.0
68:
69:     // This is for cases where |s1|, |s2| or m are zero
70:     if Double.IsNaN result then 0.0 else result```

We’re quite lucky to have examples of the algorithm in action available right in the Wikipedia article to use as unit tests. When optimizing, you couldn’t ask for more.

Take ‘DWAYNE’ and ‘DUANE’ for example. According to the article we should end up with the following results:

So, let’s break it down. The lesser of the two strings, DUANE, is 5 characters long. When we set minLen to 5 in (minLen / 2 + minLen % 2 is 3), we get a radius of 3.

Next we find the common characters in each direction. Comparing at DWAYNE to DUANE with a radius of 3 we can see that D, A, N and E will match, giving us a c1 = “DANE”. Similarly, comparing at DUANE to DWAYNE we get exactly the same thing. That is, c2 = “DANE” as well.

As you might have guessed by the fact both sets of common strings are the same, we have zero transpositions in this instance.

So now, we just plug in the numbers and get 1/3 * (4/6 + 4/5 + (4-0)/4) = 0.822. Not too difficult, eh?

For a better understanding of how transpositions end up working out, try walking through the debugger with the MARTHA vs MARHTA test below.

```72: open Xunit
73:
74: [<Fact>]
75: let ``Jaro identity test`` () =
76:     let result = jaro "RICK" "RICK"
77:     Assert.Equal("1.000", String.Format("{0:0.000}", result))
78:
79: [<Fact>]
80: let ``Jaro martha test`` () =
81:     let result = jaro "MARTHA" "MARHTA"
82:     Assert.Equal("0.944", String.Format("{0:0.000}", result))
83:
84: [<Fact>]
85: let ``Jaro dwayne test`` () =
86:     let result = jaro "DWAYNE" "DUANE"
87:     Assert.Equal("0.822", String.Format("{0:0.000}", result))
88:
89: [<Fact>]
90: let ``Jaro dixon test`` () =
91:     let result = jaro "DIXON" "DICKSONX"
92:     Assert.Equal("0.767", String.Format("{0:0.000}", result))
93:
94: ```

namespace System
val existsInWin : char -> string -> 'a -> 'b -> bool (requires member ( – ) and member ( + ))

Full name: Snippet.existsInWin

Given an offset and a radius from that office,
does mChar exist in that part of str?

val mChar : char

type: char
implements: IComparable
implements: IConvertible
implements: IComparable<char>
implements: IEquatable<char>
inherits: ValueType

Multiple items

val char : 'T -> char (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.char

——————–

type char = Char

Full name: Microsoft.FSharp.Core.char

type: char
implements: IComparable
implements: IConvertible
implements: IComparable<char>
implements: IEquatable<char>
inherits: ValueType

val str : string

type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = String

Full name: Microsoft.FSharp.Core.string

type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

val offset : 'a (requires member ( – ) and member ( + ))
val rad : 'b (requires member ( – ) and member ( + ))
val startAt : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val max : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.max

val endAt : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

type String =
class
new : char -> string
new : char * int * int -> string
new : System.SByte -> string
new : System.SByte * int * int -> string
new : System.SByte * int * int * System.Text.Encoding -> string
new : char [] * int * int -> string
new : char [] -> string
new : char * int -> string
member Chars : int -> char
member Clone : unit -> obj
member CompareTo : obj -> int
member CompareTo : string -> int
member Contains : string -> bool
member CopyTo : int * char [] * int * int -> unit
member EndsWith : string -> bool
member EndsWith : string * System.StringComparison -> bool
member EndsWith : string * bool * System.Globalization.CultureInfo -> bool
member Equals : obj -> bool
member Equals : string -> bool
member Equals : string * System.StringComparison -> bool
member GetEnumerator : unit -> System.CharEnumerator
member GetHashCode : unit -> int
member GetTypeCode : unit -> System.TypeCode
member IndexOf : char -> int
member IndexOf : string -> int
member IndexOf : char * int -> int
member IndexOf : string * int -> int
member IndexOf : string * System.StringComparison -> int
member IndexOf : char * int * int -> int
member IndexOf : string * int * int -> int
member IndexOf : string * int * System.StringComparison -> int
member IndexOf : string * int * int * System.StringComparison -> int
member IndexOfAny : char [] -> int
member IndexOfAny : char [] * int -> int
member IndexOfAny : char [] * int * int -> int
member Insert : int * string -> string
member IsNormalized : unit -> bool
member IsNormalized : System.Text.NormalizationForm -> bool
member LastIndexOf : char -> int
member LastIndexOf : string -> int
member LastIndexOf : char * int -> int
member LastIndexOf : string * int -> int
member LastIndexOf : string * System.StringComparison -> int
member LastIndexOf : char * int * int -> int
member LastIndexOf : string * int * int -> int
member LastIndexOf : string * int * System.StringComparison -> int
member LastIndexOf : string * int * int * System.StringComparison -> int
member LastIndexOfAny : char [] -> int
member LastIndexOfAny : char [] * int -> int
member LastIndexOfAny : char [] * int * int -> int
member Length : int
member Normalize : unit -> string
member Normalize : System.Text.NormalizationForm -> string
member PadLeft : int -> string
member PadLeft : int * char -> string
member PadRight : int -> string
member PadRight : int * char -> string
member Remove : int -> string
member Remove : int * int -> string
member Replace : char * char -> string
member Replace : string * string -> string
member Split : char [] -> string []
member Split : char [] * int -> string []
member Split : char [] * System.StringSplitOptions -> string []
member Split : string [] * System.StringSplitOptions -> string []
member Split : char [] * int * System.StringSplitOptions -> string []
member Split : string [] * int * System.StringSplitOptions -> string []
member StartsWith : string -> bool
member StartsWith : string * System.StringComparison -> bool
member StartsWith : string * bool * System.Globalization.CultureInfo -> bool
member Substring : int -> string
member Substring : int * int -> string
member ToCharArray : unit -> char []
member ToCharArray : int * int -> char []
member ToLower : unit -> string
member ToLower : System.Globalization.CultureInfo -> string
member ToLowerInvariant : unit -> string
member ToString : unit -> string
member ToString : System.IFormatProvider -> string
member ToUpper : unit -> string
member ToUpper : System.Globalization.CultureInfo -> string
member ToUpperInvariant : unit -> string
member Trim : unit -> string
member Trim : char [] -> string
member TrimEnd : char [] -> string
member TrimStart : char [] -> string
static val Empty : string
static member Compare : string * string -> int
static member Compare : string * string * bool -> int
static member Compare : string * string * System.StringComparison -> int
static member Compare : string * string * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
static member Compare : string * string * bool * System.Globalization.CultureInfo -> int
static member Compare : string * int * string * int * int -> int
static member Compare : string * int * string * int * int * bool -> int
static member Compare : string * int * string * int * int * System.StringComparison -> int
static member Compare : string * int * string * int * int * bool * System.Globalization.CultureInfo -> int
static member Compare : string * int * string * int * int * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
static member CompareOrdinal : string * string -> int
static member CompareOrdinal : string * int * string * int * int -> int
static member Concat : obj -> string
static member Concat : obj [] -> string
static member Concat<'T> : System.Collections.Generic.IEnumerable<'T> -> string
static member Concat : System.Collections.Generic.IEnumerable<string> -> string
static member Concat : string [] -> string
static member Concat : obj * obj -> string
static member Concat : string * string -> string
static member Concat : obj * obj * obj -> string
static member Concat : string * string * string -> string
static member Concat : obj * obj * obj * obj -> string
static member Concat : string * string * string * string -> string
static member Copy : string -> string
static member Equals : string * string -> bool
static member Equals : string * string * System.StringComparison -> bool
static member Format : string * obj -> string
static member Format : string * obj [] -> string
static member Format : string * obj * obj -> string
static member Format : System.IFormatProvider * string * obj [] -> string
static member Format : string * obj * obj * obj -> string
static member Intern : string -> string
static member IsInterned : string -> string
static member IsNullOrEmpty : string -> bool
static member IsNullOrWhiteSpace : string -> bool
static member Join : string * string [] -> string
static member Join : string * obj [] -> string
static member Join<'T> : string * System.Collections.Generic.IEnumerable<'T> -> string
static member Join : string * System.Collections.Generic.IEnumerable<string> -> string
static member Join : string * string [] * int * int -> string
end

Full name: System.String

type: String
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

val length : string -> int

Full name: Microsoft.FSharp.Core.String.length

val exists : (int -> bool)
val index : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val jaro : string -> string -> float

Full name: Snippet.jaro

The jaro distance between s1 and s2

val s1 : string

type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

val s2 : string

type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val minLen : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val commonChars : (string -> string -> char list)
val chars1 : string

type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

val chars2 : string

type: string
implements: IComparable
implements: ICloneable
implements: IConvertible
implements: IComparable<string>
implements: seq<char>
implements: Collections.IEnumerable
implements: IEquatable<string>

val inner : (int -> char list -> char list)
val i : int

type: int
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<int>
implements: IEquatable<int>
inherits: ValueType

val result : char list

type: char list
implements: Collections.IStructuralEquatable
implements: IComparable<List<char>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<char>
implements: Collections.IEnumerable

property String.Length: int
val c1 : char list

type: char list
implements: Collections.IStructuralEquatable
implements: IComparable<List<char>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<char>
implements: Collections.IEnumerable

val c2 : char list

type: char list
implements: Collections.IStructuralEquatable
implements: IComparable<List<char>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<char>
implements: Collections.IEnumerable

val c1length : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

Multiple items

val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

type: float<'Measure>
implements: IComparable
implements: IConvertible
implements: IFormattable
implements: IComparable<float<'Measure>>
implements: IEquatable<float<'Measure>>
inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
| ( [] )
| ( :: ) of 'T * 'T list
with
interface Collections.IEnumerable
interface Collections.Generic.IEnumerable<'T>
member IsEmpty : bool
member Item : index:int -> 'T with get
member Length : int
member Tail : 'T list
static member Cons : head:'T * tail:'T list -> 'T list
static member Empty : 'T list
end

Full name: Microsoft.FSharp.Collections.List<_>

type: List<'T>
implements: Collections.IStructuralEquatable
implements: IComparable<List<'T>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'T>
implements: Collections.IEnumerable

val length : 'T list -> int

Full name: Microsoft.FSharp.Collections.List.length

val c2length : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

val transpositions : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

val inner : ('a list -> 'a list -> float -> float) (requires equality)
val cl1 : 'a list (requires equality)

type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable

val cl2 : 'a list (requires equality)

type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable

val result : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

val c1h : 'a (requires equality)
val c1t : 'a list (requires equality)

type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable

val c2h : 'a (requires equality)
val c2t : 'a list (requires equality)

type: 'a list
implements: Collections.IStructuralEquatable
implements: IComparable<List<'a>>
implements: IComparable
implements: Collections.IStructuralComparable
implements: Collections.Generic.IEnumerable<'a>
implements: Collections.IEnumerable

val mismatches : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

val abs : 'T -> 'T (requires member Abs)

Full name: Microsoft.FSharp.Core.Operators.abs

val s1length : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

val s2length : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

val tLength : float

type: float
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

type Double =
struct
member CompareTo : obj -> int
member CompareTo : float -> int
member Equals : obj -> bool
member Equals : float -> bool
member GetHashCode : unit -> int
member GetTypeCode : unit -> System.TypeCode
member ToString : unit -> string
member ToString : string -> string
member ToString : System.IFormatProvider -> string
member ToString : string * System.IFormatProvider -> string
static val MinValue : float
static val MaxValue : float
static val Epsilon : float
static val NegativeInfinity : float
static val PositiveInfinity : float
static val NaN : float
static member IsInfinity : float -> bool
static member IsNaN : float -> bool
static member IsNegativeInfinity : float -> bool
static member IsPositiveInfinity : float -> bool
static member Parse : string -> float
static member Parse : string * System.Globalization.NumberStyles -> float
static member Parse : string * System.IFormatProvider -> float
static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> float
static member TryParse : string * float -> bool
static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * float -> bool
end

Full name: System.Double

type: Double
implements: IComparable
implements: IFormattable
implements: IConvertible
implements: IComparable<float>
implements: IEquatable<float>
inherits: ValueType

Double.IsNaN(d: float) : bool

String.Format(format: string, args: obj []) : string

String.Format(format: string, arg0: obj) : string

String.Format(provider: IFormatProvider, format: string, args: obj []) : string

String.Format(format: string, arg0: obj, arg1: obj) : string

String.Format(format: string, arg0: obj, arg1: obj, arg2: obj) : string

I hope you enjoyed this installment of Record Linkage Algorithms in F#. Next time we’ll explore the Winkler extension to this algorithm and take a look at why weighting errors earlier in the string more heavily ends up giving significantly better results.

Edit: All current code for this series is now available on github.

29
Apr 10

## The Ted Neward F# Folding Challenge

My friend, and fellow Professional F# 2.0 author, Ted Neward recently challenged me to a bit of a Code Kata.   Take a list of numbers and compress it in a particular simple way but without any mutable state.  What makes this problem interesting is that a tech interviewer mentioned that that he hadn’t seen a functional solution to this problem.  I also wanted to share this because I think it’s a great example of how to convert an imperative loop into a functional fold.

So, on to the problem.

Given a set of numbers like [4; 5; 5; 5; 3; 3], compress it to be an alternating list of  counts and numbers.

For example, the solution for [4; 5; 5; 5; 3; 3] would be [1; 4; 3; 5; 2; 3], as the list consists of one four, then three fives and finally two threes.  Ordering counts as the initial list must be reconstructable from the compressed list.  The answer to [4; 5; 4] would be [1; 4; 1; 5; 1; 4]

The imperative solution is rather simple, we use three variables outside of a loop:  a last value, a count and an accumulator list.

```let clImp list =
let mutable value = 0
let mutable count = 0
let output = new System.Collections.Generic.List<_>()
for element in list do
if element <> value && count > 0 then
count <- 0
value <- element
count <- count + 1
if count > 0 then
output```

Using the normal .NET mutable list type, this version works efficiently and produces the output we expect.

```> let compressed = clImp numbers
Seq.iter (fun e -> printf "%i " e) compressed;;

1 4 3 5 2 3```

How might we convert this to a functional style?  In this example, the general type of operation could be thought of as the gradual building of a data structure while walking over a list.  F# just happens to have a list processing function designed just for this task.  This function is named fold and it is one of the most useful constructs in any functional programmer’s tool chest.

` `

```let clFold input =
let (count, value, output) =
List.fold
(fun (count, value, output) element ->
if element <> value && count > 0 then
1, element, output @ [count; value]
else
count + 1, element, output)
(0 , 0, [])
input
output @ [count; value]```

Here, we are doing almost exactly the same thing as in the imperative version but with a fold instead of a loop.   The secret is that instead of putting variables outside of our loop and changing them with mutation, we have added them as elements in our accumulator tuple.  In this way, the values are updated when each element is visited with no mutation.

However, there is one serious problem with this example.  Appending to the end of a linked list requires recreating every node in that list.  This will make our algorithm grow exponentially slower approximately in proportion to the length of the input list.  To correct this we have two choices: do a head append with a normal fold and reverse the list when we are done, or use foldBack.  The foldBack version is a rather small step from here and looks much nicer, so let’s go in that direction.

` `

```let clFoldBack input =
let (count, value, output) =
List.foldBack
(fun element (count, value, output) ->
if element <> value && count > 0 then
1, element, [count; value] @ output
else
count + 1, element, output)
input
(0, 0, [])
[count; value] @ output```

There are only two real changes here.  First, we are using foldBack instead of fold.  This change causes some argument reordering.  Second, we are appending to the head of the output list instead of the tail.  It works well, is rather fast and is easy to understand if you are comfortable with folds.

However, there is a bit of a dirty secret here.  Under the hood foldBack converts its input list into an array when the size is large.  As arrays have linear element look up time, they can be walked through backwards very quickly.  Does this make the solution not functional?  You’d never know unless you looked at the underlying implementation.  Anyway, however  you want to label it,  it sure works well.

If you liked this example and want to see more check out our book, Professional F# 2.0.   It’s just about to be done.  In fact, I better get back to editing.