Levenshtein distance is one of my favorite algorithms. On the surface it seems so very simple, but when you spend some time thinking hard on it deep insights are waiting to be had.
The first and most important thing about Levenshtein distance is it’s actually a metric distance. That is, it obeys the triangle inequality. For most other string distance measurements this property doesn’t hold.
This might not seem like such a big deal, but this property gives the measurements meaning in a context larger than just the pair. For example, it allows you to embed your pair distance measurements into a higher dimensional space and so use it for things like clustering.
This was one of the first insights that Levenshtein distance gave me: A measurement doesn’t need to give you an absolute location in space to be useful for telling you where you are, it just has to tell you how far away everything else is. But what is it about Levenshtein distance that gives it this property? It’s not immediately obvious to most people, at least it wasn’t to me.
First, let’s consider a naive implementation of the Wagner-Fisher algorithm for Levenshtein distance. As stated above, here the triangle inequality holds.
1: let wagnerFischer (s: string) (t: string) = 2: let m = s.Length 3: let n = t.Length 4: let d = Array2D.create (m + 1) (n + 1) 0 5: 6: for i = 0 to m do d.[i, 0] <- i 7: for j = 0 to n do d.[0, j] <- j 8: 9: for j = 1 to n do 10: for i = 1 to m do 11: if s.[i-1] = t.[j-1] then 12: d.[i, j] <- d.[i-1, j-1] 13: else 14: d.[i, j] <- 15: List.min 16: [ 17: // a deletion 18: d.[i-1, j ] + 1; 19: // an insertion 20: d.[i , j-1] + 1; 21: // a substitution 22: d.[i-1, j-1] + 1; 23: ] 24: d.[m,n]
Now compare this with an incorrect version of an extension called Damerau–Levenshtein distance (or restricted edit distance). This change adds support for Jaro-Winkler like transpositions to the original algorithm. However, in the process of adding just this minor tweak we lose the triangle inequality.
26: let damerauLevenshtein (s: string) (t: string) = 27: let m = s.Length 28: let n = t.Length 29: let d = Array2D.create (m + 1) (n + 1) 0 30: 31: for i = 0 to m do d.[i, 0] <- i 32: for j = 0 to n do d.[0, j] <- j 33: 34: for j = 1 to n do 35: for i = 1 to m do 36: // 1 if a substitution 37: // 0 if no change 38: let cost = if s.[i-1] = t.[j-1] then 0 else 1 39: d.[i, j] <- 40: List.min 41: [ 42: // a deletion 43: d.[i-1, j ] + 1; 44: // an insertion 45: d.[i , j-1] + 1; 46: // a substitution or nothing 47: d.[i-1, j-1] + cost; 48: ] 49: if // boundary check 50: i > 1 && j > 1 51: // transposition check 52: && s.[i-1] = t.[j-2] && s.[i-2] = t.[j-1] 53: then // the lesser of a transposition or current cost 54: d.[i, j] <- min d.[i,j] (d.[i-2, j-2] + cost) 55: d.[m,n]
Full name: Snippet.wagnerFischer
type: string
implements: System.IComparable
implements: System.ICloneable
implements: System.IConvertible
implements: System.IComparable<string>
implements: seq<char>
implements: System.Collections.IEnumerable
implements: System.IEquatable<string>
val string : 'T -> string
Full name: Microsoft.FSharp.Core.Operators.string
——————–
type string = System.String
Full name: Microsoft.FSharp.Core.string
type: string
implements: System.IComparable
implements: System.ICloneable
implements: System.IConvertible
implements: System.IComparable<string>
implements: seq<char>
implements: System.Collections.IEnumerable
implements: System.IEquatable<string>
type: string
implements: System.IComparable
implements: System.ICloneable
implements: System.IConvertible
implements: System.IComparable<string>
implements: seq<char>
implements: System.Collections.IEnumerable
implements: System.IEquatable<string>
type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType
type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType
type: int [,]
implements: System.ICloneable
implements: System.Collections.IList
implements: System.Collections.ICollection
implements: System.Collections.IEnumerable
implements: System.Collections.IStructuralComparable
implements: System.Collections.IStructuralEquatable
inherits: System.Array
from Microsoft.FSharp.Collections
Full name: Microsoft.FSharp.Collections.Array2D.create
type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType
type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType
module List
from Microsoft.FSharp.Collections
——————–
type List<'T> =
| ( [] )
| ( :: ) of 'T * 'T list
with
interface System.Collections.IEnumerable
interface System.Collections.Generic.IEnumerable<'T>
member Head : 'T
member IsEmpty : bool
member Item : index:int -> 'T with get
member Length : int
member Tail : 'T list
static member Cons : head:'T * tail:'T list -> 'T list
static member Empty : 'T list
end
Full name: Microsoft.FSharp.Collections.List<_>
type: List<'T>
implements: System.Collections.IStructuralEquatable
implements: System.IComparable<List<'T>>
implements: System.IComparable
implements: System.Collections.IStructuralComparable
implements: System.Collections.Generic.IEnumerable<'T>
implements: System.Collections.IEnumerable
Full name: Microsoft.FSharp.Collections.List.min
Full name: Snippet.damerauLevenshtein
type: int
implements: System.IComparable
implements: System.IFormattable
implements: System.IConvertible
implements: System.IComparable<int>
implements: System.IEquatable<int>
inherits: System.ValueType
Full name: Microsoft.FSharp.Core.Operators.min
It seems like such a simple and obvious addition to the algorithm. Just what is it about about the way we’ve added transpositions that ruins the magic? We’ve just put something like wormholes to our little universe. That’s right, the simple addition of transpositions in this way implies a universe where some combinations of characters treat space differently than everything else. The easiest way to prove this is the case is to give the definition of the triangle inequality for metric spaces a read.
From Wikipedia’s Triangle Inequality article:
In a metric space M with metric d, the triangle inequality is a requirement upon distance: d(x, z) <= d(x, y) + d(y, z)
for all x, y, z in M. That is, the distance from x to z is at most as large as the sum of the distance from x to y and the distance from y to z.
From this, it’s easy to construct a counterexample for our broken Damerau-Levenshtein distance simply by exploiting the transpositions.
As you can see in this picture 4 is most certainly greater than 1 + 2, and so the triangle inequality is broken. Consider also the pathology that this example shows in the algorithm. Why didn’t it just go along the irkc –> rick –> rcik path when it’s obviously less expensive?
For comparison, if we measure those same pairs with standard Levenshtein distance everything is just peachy.
So we now know of at least one case which causes the triangle inequality to fail, does this imply what causes it to succeed? I think yes, at least in a limited sense. We can see that with Levenshtein distance any given pair of characters is considered independently, changes at each only happen once and are exactly one character in size. As each pair in the strings is considered, small changes push them further and further apart, but in discrete and equal units for discrete and equal changes. While in our Damerau-Levenshtein distance implementation we greedily perform operations of a larger size and then never revisit their implications, standard Levenshtein is reversibly symmetric both in how it treats locations over the string as well as in how it treats the characters themselves due to its uniform granularity. The uniform granularity of the changes ensures all important paths are explored.
Can transpositions be reliably counted with a different approach? We’ll find out the answer to this question next time.
Enjoy this post? Continue the conversation with me on twitter.
Tags: Damerau-Levenshtein distance, Edit Distance, F#, fsharp, Levenshtein distance, Machine Learning, Record Linkage, Restricted Edit Distance
I also wrote a post about string metrics (although it delves into some more abstract mathematics). Perhaps you’d enjoy reading it.
Forgot the link :)
http://jeremykun.wordpress.com/2011/12/19/metrics-on-words/
I suppose I should also mention, your example of the triangle inequality failing isn’t quite correct, and in fact the Damerau-Levenshtein distance is in fact a metric. In particular:
rcik
rick
irck
irkc
Gives three transformations for which rcik goes to irkc. This is just the distance computed by going along “the shorter path” in your picture (but in fact, if your algorithm doesn’t take this path then it’s not computing the true Damerau-Levensthein distance! O_o)
The proof that the triangle inequality always holds is not particularly hard: if we transform a string x to a string y, and then to a third string z, then the composition of the operations we performed will always be a valid edit from x right to z, and hence d(x,y) + d(y,z) <= d(x,z). However, if we restrict to the case where we are only allowed to edit each substring *once* (which makes it quite silly to speak of the triangle inequality), then this no longer holds.
Shame on me for that typo.
d(x,z) <= d(x,y) + d(y,z)
Hi Jeremy,
The proof makes sense to me intuitively, but I am struggling to understand how you make the jump from “a->b, b->c is a valid path from a->c” to the actual inequality. It makes complete sense to me but is there a way to explicitly explain that the indirect path can only be as or less optimal than the direct path from a to c?
Excellent post. I was going for something more beginner friendly here and ended up running short on time. Will certainly revisit it once my Coursera work load lightens :).
Refrain from calling something a distance that isn’t a distance. Confusion disappears.
any alternative to levenshtein for batch processing of thousands of records?
Generally, once your dataset gets large you need to move to a 2-phase blocking and then grading approach. Check out this book: http://www.amazon.com/Entity-Resolution-Information-Quality-Talburt/dp/0123819725/ref=sr_1_1?ie=UTF8&qid=1372894519&sr=8-1&keywords=talburt+information It will help quite a lot, I promise.
[…] Hacker News http://richardminerich.com/2012/09/levenshtein-distance-and-the-triangle-inequality/ Shoe Outlet This entry was posted in Uncategorized by admin. Bookmark the […]
[…] String Distances F# on Algorithms – Wagner–Fischer algorithm Edit Distance (EDIST) with F# Levenshtein Distance and the Triangle Inequality Metrics on […]
[…] Interesting discussion on the Levenshtein distance on @Rickasaurus’ blog. […]