While writing the previous article on tokenized matching I realized I left out some important background information on Jaro-Winkler distance.
First, there’s something important to know about the Jaro-Winkler distance: it’s not a metric distance because it does not obey the triangle inequality. That is, if you found the JW distance between strings A and B, and then found the JW distance between strings B and C, those two results would place no bound on the JW distance between strings A and C. This may not seem like a big deal, but it means Jaro-Winkler distance can’t be used to embed strings in a metric space and so is a poor algorithm choice for many types of clustering. This will be an important point in future articles.
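If you want to probe this on your own data, a minimal sketch of a triangle inequality check might look like the following. The helper name `violatesTriangle` and the stand-in `lengthDistance` are made up for illustration; they are not part of any library.

```fsharp
// Hypothetical helper: returns true when the triple (a, b, c) violates the
// triangle inequality d(a, c) <= d(a, b) + d(b, c) for the given distance.
let violatesTriangle (dist: string -> string -> float) (a: string) (b: string) (c: string) =
    dist a c > dist a b + dist b c

// A stand-in distance for demonstration: absolute difference in length.
// It obeys the triangle inequality, so no triple is ever flagged.
let lengthDistance (s1: string) (s2: string) =
    float (abs (s1.Length - s2.Length))

let flagged = violatesTriangle lengthDistance "ab" "abcd" "abcdefg"
```

To test Jaro-Winkler you would pass in your own function instead. Assuming your `jaroWinkler` returns a similarity where 1.0 means identical, as the code in this article does, something like `fun x y -> 1.0 - jaroWinkler x y` turns it into a distance first.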
Second, it can be very helpful to extend the results of Jaro-Winkler based on the nature of your own data and your use of the algorithm. To better support my own use case I’ve made changes that put the emphasis on better token alignment.
    let jaroWinklerMI (t1:string) (t2:string) =
        // Optimizations for easy to calculate cases
        if t1.Length = 0 || t2.Length = 0 then 0.0
        elif t1 = t2 then 1.0
        else
            // Even more weight for the first char
            let score = jaroWinkler t1 t2
            let p = 0.2 //percentage of score from new metric
            let b = if t1.[0] = t2.[0] then 1.0 else 0.0
            ((1.0 - p) * score) + (p * b)
Beyond the optimizations for empty strings and for strings which are exactly the same, you can see here that I weight the first character even more heavily. This is because my data is very initial-heavy.
To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is based entirely on whether the first characters match. The value of p here was determined by heavy experimentation and hair pulling. Before making this extension, initials would frequently align incorrectly.
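To make the weighting concrete, here is a small sketch of the blend on its own. The helper name `blend` and the example score of 0.9 are made up for illustration; only the formula and p = 0.2 come from the code above.

```fsharp
// Hypothetical helper: the same blend used in jaroWinklerMI, isolated so the
// effect of the first-character term is easy to see.
let blend (p: float) (score: float) (firstCharMatches: bool) =
    let b = if firstCharMatches then 1.0 else 0.0
    ((1.0 - p) * score) + (p * b)

// An illustrative Jaro-Winkler score of 0.9 (not measured from real data):
let withMatch = blend 0.2 0.9 true      // roughly 0.92
let withoutMatch = blend 0.2 0.9 false  // roughly 0.72
```

With the same underlying score, a first-character mismatch costs a full 0.2, which is enough to push a stray initial away from a token it would otherwise have aligned with.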
    let scoreNamePairs (t1:string) (t2:string) =
        //Raise jaro to a power in order to over-weight better matches
        jaroWinklerMI t1 t2 ** 2.0
I also take the square of the result of jaroWinklerMI to weight better matches even more heavily. I found that in doing this I was able to get much more reliable matching. To understand how this works take a gander at this plot.
As you already know, multiplying any number greater than 0 but less than 1 by itself will give you a smaller number. However, as you might intuit, the smaller the number the greater the proportional reduction. As you can see here, anything less than 1 takes a hit, but worse matches get dragged down significantly more.
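Here is a quick sketch of that effect using a few made-up scores. Since x - x² = x(1 - x), the fraction of a score lost to squaring works out to exactly 1 - x.

```fsharp
// Illustrative scores only; none of these come from real matching data.
let scores = [ 0.95; 0.8; 0.5 ]

// For each score, compute the squared value and the fraction of the
// original score that squaring removes: (x - x^2) / x = 1 - x.
let reductions =
    scores
    |> List.map (fun x ->
        let squared = x ** 2.0
        x, squared, (x - squared) / x)

for (x, squared, lost) in reductions do
    printfn "score %.2f -> squared %.4f (loses %.0f%%)" x squared (lost * 100.0)
```

A score of 0.95 gives up about 5% of its value, while a score of 0.5 gives up half, so the gap between a good match and a mediocre one widens.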
Initially I was frustrated by bad alignments which would sometimes be chosen over better ones when two or more tokens were both fairly close, but not great. After seeing a variation on this squaring technique used for matrix convergence the thought occurred to me: why not see if it helps with token alignment? After implementing this I saw a huge improvement in results: incorrect alignments completely disappeared!
It’s often surprising where inspiration will come from.
Edit: The above code and its composition with Gale-Shapley is now available in my GitHub repository.