Levenshtein distance is one of my favorite algorithms. On the surface it seems so very simple, but when you spend some time thinking hard on it deep insights are waiting to be had.

The first and most important thing about Levenshtein distance is it’s actually a metric distance. That is, it obeys the triangle inequality. For most other string distance measurements this property doesn’t hold.

The Vector Triangle Inequality

This might not seem like such a big deal, but this property gives the measurements meaning in a context larger than just the pair. For example, it allows you to embed your pair distance measurements into a higher dimensional space and so use it for things like clustering.

This was one of the first insights that Levenshtein distance gave me: A measurement doesn’t need to give you an absolute location in space to be useful for telling you where you are, it just has to tell you how far away everything else is. But what is it about Levenshtein distance that gives it this property? It’s not immediately obvious to most people, at least it wasn’t to me.

First, let’s consider a naive implementation of the Wagner-Fisher algorithm for Levenshtein distance. As stated above, here the triangle inequality holds.

1: let wagnerFischer (s: string) (t: string) =
2: let m = s.Length
3: let n = t.Length
4: let d = Array2D.create (m + 1) (n + 1) 0
5:
6: for i = 0 to m do d.[i, 0] <- i
7: for j = 0 to n do d.[0, j] <- j
8:
9: for j = 1 to n do
10: for i = 1 to m do
11: if s.[i-1] = t.[j-1] then
12: d.[i, j] <- d.[i-1, j-1]
13: else
14: d.[i, j] <-
15: List.min
16: [
17: // a deletion
18: d.[i-1, j ] + 1;
19: // an insertion
20: d.[i , j-1] + 1;
21: // a substitution
22: d.[i-1, j-1] + 1;
23: ]
24: d.[m,n]

Now compare this with an incorrect version of an extension called Damerau–Levenshtein distance (or restricted edit distance). This change adds support for Jaro-Winkler like transpositions to the original algorithm. However, in the process of adding just this minor tweak we lose the triangle inequality.

26: let damerauLevenshtein (s: string) (t: string) =
27: let m = s.Length
28: let n = t.Length
29: let d = Array2D.create (m + 1) (n + 1) 0
30:
31: for i = 0 to m do d.[i, 0] <- i
32: for j = 0 to n do d.[0, j] <- j
33:
34: for j = 1 to n do
35: for i = 1 to m do
36: // 1 if a substitution
37: // 0 if no change
38: let cost = if s.[i-1] = t.[j-1] then 0 else 1
39: d.[i, j] <-
40: List.min
41: [
42: // a deletion
43: d.[i-1, j ] + 1;
44: // an insertion
45: d.[i , j-1] + 1;
46: // a substitution or nothing
47: d.[i-1, j-1] + cost;
48: ]
49: if // boundary check
50: i > 1 && j > 1
51: // transposition check
52: && s.[i-1] = t.[j-2] && s.[i-2] = t.[j-1]
53: then // the lesser of a transposition or current cost
54: d.[i, j] <- min d.[i,j] (d.[i-2, j-2] + cost)
55: d.[m,n]

val wagnerFischer : string -> string -> int

Full name: Snippet.wagnerFischer

val s : string

type: string

implements: System.IComparable

implements: System.ICloneable

implements: System.IConvertible

implements: System.IComparable<string>

implements: seq<char>

implements: System.Collections.IEnumerable

implements: System.IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = System.String

Full name: Microsoft.FSharp.Core.string

type: string

implements: System.IComparable

implements: System.ICloneable

implements: System.IConvertible

implements: System.IComparable<string>

implements: seq<char>

implements: System.Collections.IEnumerable

implements: System.IEquatable<string>

val t : string

type: string

implements: System.IComparable

implements: System.ICloneable

implements: System.IConvertible

implements: System.IComparable<string>

implements: seq<char>

implements: System.Collections.IEnumerable

implements: System.IEquatable<string>

val m : int

type: int

implements: System.IComparable

implements: System.IFormattable

implements: System.IConvertible

implements: System.IComparable<int>

implements: System.IEquatable<int>

inherits: System.ValueType

property System.String.Length: int

val n : int

type: int

implements: System.IComparable

implements: System.IFormattable

implements: System.IConvertible

implements: System.IComparable<int>

implements: System.IEquatable<int>

inherits: System.ValueType

val d : int [,]

type: int [,]

implements: System.ICloneable

implements: System.Collections.IList

implements: System.Collections.ICollection

implements: System.Collections.IEnumerable

implements: System.Collections.IStructuralComparable

implements: System.Collections.IStructuralEquatable

inherits: System.Array

module Array2D

from Microsoft.FSharp.Collections

val create : int -> int -> 'T -> 'T [,]

Full name: Microsoft.FSharp.Collections.Array2D.create

val i : int

type: int

implements: System.IComparable

implements: System.IFormattable

implements: System.IConvertible

implements: System.IComparable<int>

implements: System.IEquatable<int>

inherits: System.ValueType

val j : int

type: int

implements: System.IComparable

implements: System.IFormattable

implements: System.IConvertible

implements: System.IComparable<int>

implements: System.IEquatable<int>

inherits: System.ValueType

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =

| ( [] )

| ( :: ) of 'T * 'T list

with

interface System.Collections.IEnumerable

interface System.Collections.Generic.IEnumerable<'T>

member Head : 'T

member IsEmpty : bool

member Item : index:int -> 'T with get

member Length : int

member Tail : 'T list

static member Cons : head:'T * tail:'T list -> 'T list

static member Empty : 'T list

end

Full name: Microsoft.FSharp.Collections.List<_>

type: List<'T>

implements: System.Collections.IStructuralEquatable

implements: System.IComparable<List<'T>>

implements: System.IComparable

implements: System.Collections.IStructuralComparable

implements: System.Collections.Generic.IEnumerable<'T>

implements: System.Collections.IEnumerable

val min : 'T list -> 'T (requires comparison)

Full name: Microsoft.FSharp.Collections.List.min

val damerauLevenshtein : string -> string -> int

Full name: Snippet.damerauLevenshtein

val cost : int

type: int

implements: System.IComparable

implements: System.IFormattable

implements: System.IConvertible

implements: System.IComparable<int>

implements: System.IEquatable<int>

inherits: System.ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

It seems like such a simple and obvious addition to the algorithm. Just what is it about about the way we’ve added transpositions that ruins the magic? We’ve just put something like wormholes to our little universe. That’s right, the simple addition of transpositions in this way implies a universe where some combinations of characters treat space differently than everything else. The easiest way to prove this is the case is to give the definition of the triangle inequality for metric spaces a read.

From Wikipedia’s Triangle Inequality article:

*In a metric space M with metric d, the triangle inequality is a requirement upon distance: d(x, z) <= d(x, y) + d(y, z)*

for all x, y, z in M. That is, the distance from x to z is at most as large as the sum of the distance from x to y and the distance from y to z.

From this, it’s easy to construct a counterexample for our broken Damerau-Levenshtein distance simply by exploiting the transpositions.

Damerau-Levenshtein distance not satisfying the triangle inequality.

As you can see in this picture 4 is most certainly greater than 1 + 2, and so the triangle inequality is broken. Consider also the pathology that this example shows in the algorithm. Why didn’t it just go along the irkc –> rick –> rcik path when it’s obviously less expensive?

Levenshtein distance satisfying the triangle inequality.

For comparison, if we measure those same pairs with standard Levenshtein distance everything is just peachy.

So we now know of at least one case which causes the triangle inequality to fail, does this imply what causes it to succeed? I think yes, at least in a limited sense. We can see that with Levenshtein distance any given pair of characters is considered independently, changes at each only happen once and are exactly one character in size. As each pair in the strings is considered, small changes push them further and further apart, but in discrete and equal units for discrete and equal changes. While in our Damerau-Levenshtein distance implementation we greedily perform operations of a larger size and then never revisit their implications, standard Levenshtein is reversibly symmetric both in how it treats locations over the string as well as in how it treats the characters themselves due to its uniform granularity. The uniform granularity of the changes ensures all important paths are explored.

Can transpositions be reliably counted with a different approach? We’ll find out the answer to this question next time.