• 2011 In Retrospect: A Year of Writing F# Professionally

    For the past year I’ve been working almost entirely in F# and have found the experience to be everything I hoped it would be and more. In just six months I was able to bring a research project to fruition which has since made our company millions of dollars. F#’s terseness made algorithms a joy to implement, while F#’s type system made changes easy and regression errors non-existent. In fact, the few bugs we have had were due either to unexpected behavior in the GUI API or to bugs in external C# libraries. Never before have I been so happy with the programming language I use on a daily basis. Never before have I felt so confident that when I make a change new bugs won’t be introduced. And no, we’re not using strict TDD or any other strict development methodology. Simply leveraging the type system well, in addition to algorithmic testing, has proven to be extremely robust in F#.

    It’s also been a great year for making new friends and meeting interesting people. The NYC F# community is thriving, with an average of about fifty people attending any given NYC F# User Group meeting, many of them using F# at work for at least some subset of their projects. I hear the same experiences from many: yes, learning F# isn’t simple, but the benefits in speed of implementation, LoC reduction, and correctness are huge. The excitement over type providers is also palpable. Finally, we’ll be able to extend our types out into data sources and eliminate whole classes of error which currently occur at the border between our data and our applications. If you’re interested in following what we’ve been up to at the NYC F# User Group, check out our Vimeo stream.

    I’ve also had the pleasure of speaking at quite a few interesting events this year. One I haven’t yet written about is Monospace 2011.

    Click For Video

    In retrospect, it shouldn’t have been surprising that many in the Mono community hadn’t seen much F#. They tend to go to different conferences and events than your standard .NET user and so hadn’t yet heard much about its benefits. I will say though that it was a very bright group, and I think many who speak on F# would find them just as receptive as I have. I do hope to see (and participate in) more outreach into the Mono world, as I think we’ll find there are quite a few folks who would be happy to embrace a better way to program, if only they were shown the way.

  • Advice for Getting Started with F#

    I had a great time at NYC Code Camp this past weekend. About half the people in my talk already knew F# and were there to hear about Type Providers; the other half just came to see what this F# thing was all about. This post is to help those in the second half begin their F# learning adventure.

    The first thing anyone who is looking to get started with F# should do is work through the tutorial examples on tryfsharp.org. Getting some hands-on time with the basics is extremely important, especially if you haven’t done much functional programming before.

    Once you’ve got the basics down, the next step is thinking through some real problems in F#. For me, this was writing an Ant Colony simulation. With the version linked to here you can actually compile and then try your own ant behavior code against mine! Other great options include the Coding Dojo Katas or Project Euler (for the mathematically inclined).

    It’s best if you do most of your early play using .fsx files and F# Interactive. Just highlighting code and hitting Alt-Enter makes the process of learning so much faster. However, if you insist on working with unit tests, I suggest xUnit as it doesn’t require the use of classes for hosting your tests.
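
    For instance, a minimal xUnit test in F# is just a [<Fact>]-attributed function sitting at module level (a sketch, assuming the xunit package is referenced):

    open Xunit

    // No test class needed; a module-level function marked
    // with [<Fact>] is a complete xUnit test in F#.
    [<Fact>]
    let ``addition should work`` () =
        Assert.Equal(4, 2 + 2)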

    At first it’s also best to ignore most of F#’s features; it can be overwhelming otherwise. Try your best to focus on comprehensions. These may not always be the best way to solve a problem, but they allow you to write very imperative code right in the middle of a bunch of functional stuff, and so act as a kind of escape hatch for when you don’t know how to do what you want functionally. Eventually you’ll want to move on to recursion and folding though. When you’re ready to make the jump, read this post which talks about how to go about it.
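
    For example, here’s a tiny list comprehension that mixes loop-and-if style code into an otherwise functional expression; this is the escape hatch in action:

    // Squares of the even numbers from 1 to 20, written
    // imperatively inside a list comprehension.
    let evenSquares =
        [ for i in 1 .. 20 do
            if i % 2 = 0 then
                yield i * i ]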

    As you’re playing with F#, don’t be afraid to look things up along the way when you get stuck. There are quite a few good resources around for F#, including the F# Programming WikiBook, the code samples provided by the F# team, and a number of print books including Professional F# 2.0 (shameless plug). When all else fails, the MSDN documentation for F# is actually very good!

    I can say from experience that at first you will likely run into quite a few syntax errors. Don’t worry too much about it; your brain needs to build up a model of how the types and syntax work. Once you’ve gotten over that hurdle, I promise you’ll be amazed by how different the programming world looks.

  • For whom the proteins fold

    I know this post isn’t of my usual technical type, but I hope you’ll bear with me while I share an idea I’ve been thinking about.

    Starting way back with SETI@Home, scientists have been borrowing our computer time in exchange for awesomely nerdy screen savers for years. However, it’s only fairly recently that they’ve discovered the promise of crowdsourcing. You may have seen the headlines for the two largest successes just recently.

    Crowd Sourced Protein Folding
    Crowd Sourced Planet Finding

    Most interestingly, scientists have found that gamers are actually better at figuring these things out than the scientists themselves! That’s right: all of those years you thought you were wasting time, you were actually honing your science-generating skills.

    But would it be possible to build on these discoveries in order to turn some proportion of the 150+ billion hours of time spent gaming a year into something good for society?

    If you’re like me you didn’t have a very big allowance when you were young, maybe enough to buy a few video games a year. I’m guessing that in the current recession it’s even worse for most kids. Also, have you noticed all of those games downloadable directly to a console these days? That must be torture for the modern day low-allowance kid.

    So here’s the idea: we turn these kids into an army of protein folding, planet finding scientists by paying them in virtual currency to do these tasks. They can then save up and use the virtual currency to buy the games their little hearts are burning for.

    Even better, think of the secondary effects:

    1. More exposure to science for kids, some might grow up and become scientists.
    2. It’s an opportunity for great corporate publicity (especially if they give away some virtual currency for free or let scientists turn grants into it at a discounted rate).
    3. I expect that children’s very plastic brains will learn the tasks quickly and be better at them than adults.
    4. Even if it’s done for-profit with drug companies we will still get new and better medicines out of the deal.

    All that’s left is for someone to lock some scientists and game console manufacturers in a room so they can figure the details out. Any takers?

  • Record Linkage Algorithms in F# – Extensions to Jaro-Winkler Distance (Part 3)

    While writing the previous article on tokenized matching I realized I left out some important background information on Jaro-Winkler distance.

    The Vector Triangle Inequality

    First, there’s something important to know about the Jaro-Winkler distance: it’s not a metric distance and so does not obey the triangle inequality. That is, if you found the JW distance between strings A and B, and then found the JW distance between strings B and C, those two results would tell you nothing reliable about the JW distance between strings A and C. This may not seem like a big deal, but it means Jaro-Winkler distance can’t be used to embed strings in a metric space and so is a poor algorithm choice for many types of clustering. This will be an important point in future articles.
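
    If you want to see this for yourself, here’s a quick F# Interactive experiment (a sketch, assuming the jaroWinkler function from Part 2 is in scope) that treats 1.0 - jaroWinkler as a distance and spot-checks the triangle inequality on a few triples:

    // For a true metric d we'd need d(A,C) <= d(A,B) + d(B,C) for
    // every triple; with Jaro-Winkler there is no such guarantee.
    let triangleHolds a b c =
        let d s1 s2 = 1.0 - jaroWinkler s1 s2
        d a c <= d a b + d b c

    [ "MARTHA", "MARHTA", "MART"
      "DWAYNE", "DUANE", "DIXON" ]
    |> List.iter (fun (a, b, c) ->
        printfn "triangle holds for (%s, %s, %s): %b" a b c (triangleHolds a b c))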

    Second, it can be very helpful to extend the results of Jaro-Winkler based on the nature of your own data and your use of the algorithm. To better support my own use case I’ve made changes that put the emphasis on better token alignment.

     let jaroWinklerMI (t1:string) (t2:string) = 
         // Optimizations for easy to calculate cases
         if t1.Length = 0 || t2.Length = 0 then 0.0
         elif t1 = t2 then 1.0
         else
             // Even more weight for the first char
             let score = jaroWinkler t1 t2
             let p = 0.2 //percentage of score from new metric
             let b = if t1.[0] = t2.[0] then 1.0 else 0.0
             ((1.0 - p) * score) + (p * b)
    

    Beyond the optimization for empty strings and those which are exactly the same, you can see here that I weight the first character even more heavily. This is due to my data being very initial-heavy.

    To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension, initials would frequently align incorrectly.

    let scoreNamePairs (t1:string) (t2:string) =  
        //Raise jaro to a power in order to over-weight better matches        
        jaroWinklerMI t1 t2 ** 2.0
    

    I also take the square of the result of jaroWinklerMI to weight better matches even more heavily. I found that in doing this I was able to get much more reliable matching. To understand how this works, consider what squaring does to a score between 0 and 1.

    As you already know, multiplying any number greater than 0 but less than 1 by itself will give you a smaller number. However, as you might intuit, the smaller the number, the greater the proportional reduction. Anything less than 1 takes a hit, but worse matches get dragged down significantly more.
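
    A quick run in F# Interactive makes the same point numerically:

    // Squaring keeps scores in [0, 1] but hits weak matches
    // proportionally harder than strong ones.
    [ 0.95; 0.80; 0.60 ]
    |> List.iter (fun s ->
        printfn "%.2f -> %.4f (a %.0f%% hit)" s (s ** 2.0) ((1.0 - s) * 100.0))
    // 0.95 -> 0.9025 (a 5% hit)
    // 0.80 -> 0.6400 (a 20% hit)
    // 0.60 -> 0.3600 (a 40% hit)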

    Initially I was frustrated by bad alignments which would sometimes be chosen over better ones when two or more tokens were both fairly close, but not great. After seeing a variation on this squaring technique used for matrix convergence the thought occurred to me: why not see if it helps with token alignment? After implementing this I saw a huge improvement in results: incorrect alignments completely disappeared!

    It’s often surprising where inspiration will come from.

    Edit: The above code and its composition with Gale-Shapley is now available in my GitHub repository.

  • Imperative Pseudocode to Pure Functional Algorithm with Gale-Shapley and F#

    Continuing from last time, let’s look at how one goes from imperative pseudocode to pure functional using Gale-Shapley as an example.

    Overall, to convert an algorithm from imperative to functional is a fairly simple process once you understand how to convert from a while loop to recursion with accumulators. This post is just a more advanced version of what I talked about in my From Imperative to Computation Expressions post. In fact, if you haven’t already I suggest you read it first. I’ll be assuming you understand how to perform that conversion in this article.  If you’re still having trouble, another good article which is a bit easier is The Ted Neward F# Folding Challenge.
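
    As a one-glance refresher, the basic move those posts teach is turning a while loop with mutable state into a tail-recursive function whose parameters carry that state (a toy example):

    // Imperative: mutable accumulator driven by a while loop.
    let sumToImperative n =
        let mutable acc = 0
        let mutable i = 1
        while i <= n do
            acc <- acc + i
            i <- i + 1
        acc

    // Functional: the loop variable and accumulator become
    // parameters of a tail-recursive inner function.
    let sumTo n =
        let rec loop i acc =
            if i > n then acc
            else loop (i + 1) (acc + i)
        loop 1 0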

    The conversion of an algorithm is a two stage process: first identify algorithmic cases, implied mutable data structures, and alternate immutable data structures.  Then convert the given loops into recursion while leveraging pattern matching.  I know it may seem difficult at first, but after reading this I think you’ll have a good idea of how to get there.

    So, let’s identify the cases.

    function stableMatching {   
        Initialize all m ∈ M and w ∈ W to free
        // Loop Termination Case on
        // check for “free m with unproposed w” and 
        // implicit ranking of W for each m in M   
        while ∃ free man m who still has a
                woman w to propose to {      
           w = m’s highest ranked such woman
                   who he has not proposed to yet
           // She’s Single Case 
           if w is free
             (m, w) become engaged
           // She’s Engaged Case 
           else some pair (m’, w) already exists
             // She Picks m Sub-Case 
             if w prefers m to m’
               (m, w) become engaged
               m’ becomes free
             // She Picks m’ Sub-Case 
             else
               (m’, w) remain engaged
        }
    }

    Starting from the top, we can just ignore the imperative initialization.  However, you’ll notice that a great deal is going on in that while clause. 

    First, it defines our loop termination by looking for a “free m who still has a w to propose to”.  This is interesting because it seems to be begging for some kind of data structure lookup.  Second, it defines an implicit ranking of each member of W by each member of M.  Note that this ranking is not mentioned in the initialization phase.

    The rest of this algorithm reads like a simple list of cases.  To make this plain, let’s list them out along with state changes.

    1. no single m in M with unproposed w 
      — execution terminates
    2. m proposes to unengaged w
      — w is no longer in m proposals
      — m is no longer single
      — m and w are engaged
    3. m proposes to w engaged to m’
      a.  She switches to m
            — w is no longer in m proposals
            — m is no longer single
            — m and w are engaged
            — m’ is single
      b.  She stays with m’
            — w is no longer in m proposals
            — m’ and w stay engaged
            — m remains single

    The mechanics of what exactly we do in these cases depends on our data structures.  So we can’t get much further without figuring that out.  To identify what data structures are needed we’ll need to identify what exactly it is that needs to be kept track of.  Only once we know that can we pick our immutable data structures appropriately. 

    Looking above, it appears that we’ll need to keep track of:

    1. Given a woman, is she engaged and, if so, who is she engaged to? (need to look it up)
    2. Which men are single? (just need one at a time, order doesn’t matter)
    3. For each man, which is his next most desirable match that he has not yet proposed to? (just need one at a time, order matters)

    Note that we don’t have to keep track of which men are engaged to which women; that question is never asked.  This is a minimal set of requirements.  No extra stuff allowed.  I’m serious.

    Moving on, in an imperative mutable version you might use these data structures:

    1. An array indexed by w to the m index engaged to (-1 if single)
    2. An array indexed by m to the w index engaged to (-1 if single)
    3. An array indexed by m where each element is an ordered linked list of unproposed w indices.

    However, what would satisfy these same requirements but be immutable?  Well, if we step out of the mutation mindset it becomes obvious pretty quickly:

    1. An engagement map of women to men
    2. An immutable linked list of single men
    3. Define a man to include his index as well as an ordered unproposed woman immutable linked list

    Instead of scanning arrays or querying mutable structures, we simply decompose our data structures as much as needed and recompose them into the form we need on the next call.   This may sound difficult, but it’s really not.  The fact that we have pattern matching, and that our cases can be defined in terms of the shape of our data structures, actually makes this very simple to do.  That is, once you get the hang of it.

    To see this in action, let’s break up the code in the last article and discuss each chunk.

    open System
     
    // a Bachelor is an identity index and an 
    // ordered list of women indices to approach.
    type Bachelor = int * int list
    

    Here we see the definition of a Bachelor: it’s a tuple of an integer and a list of integers. This is effectively a pair containing an index and a list of woman indices. Note that in this example all men are of type Bachelor.

    // Some notation:
    // wi = woman index (int)
    // mi = man index (int)
    // mi' = woman's current partner index (int)
    // m = man with index and unapproached women indices (Bachelor)
    // mSingle = men that are single (Bachelor list)
    // wEngaged = engagements from women to men (Map<int, Bachelor>)
    

    We’ll keep this notation around just so you have easy reference.

    let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
      let Windices = [ 0 .. W.Length - 1 ]
      // List of men with women in order of desire  
      let Munproposed = 
        List.init M.Length 
          (fun mi -> 
               let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
               mi, Windices |> List.sortBy sortFun)
    

    Windices is just a list of the numbers from 0 to W.Length - 1, one index per woman token. Computing it once up front makes the calculation of Munproposed less expensive.

    Munproposed is our initial set of all single men. Here List.init is used to call the given lambda to generate each element. As you can see by the last line within that lambda, each element of the list is a tuple containing the index of that man and a list of women indices sorted by the given desirability function, sortFun.

      // Recursively solve stable marriages
      let rec findMarriages mSingle wEngaged =
        match mSingle with
        // No single guys left with desired women, we're done
        | [] -> wEngaged
        // Guy is out of luck, remove from singles
        | (mi, []) :: bachelors -> findMarriages bachelors wEngaged
    

    These are our first two cases. If the mSingle list pattern matches with [], we know it’s empty and so we are done.

    In the second case we are pattern matching on the structure of the singles list as well as its first element.  The syntax (mi, []) matches on a tuple of an index and an empty inner list.  This inner list is the list of yet-unproposed-to women, and so if it’s empty we know that this guy is out of options and it’s ok to drop him from the list of bachelors.  We do this by simply recursing on bachelors, which is the tail of the mSingle list.

        // He's got options!
        | (mi, wi :: rest) :: bachelors -> 
          let m = mi, rest
    

    This is our general purpose match, and is the structure we expect to see in most cases.  Here the head is fully decomposed into a man index (mi), his next first choice woman index (wi), the rest of the women (rest) and the rest of the single men (bachelors).

    Immediately afterward we define m to be that given man index (mi) and the rest of the women tokens (rest).  As two tokens are only ever compared once, m should no longer contain wi in his ordered list of proposals.  This is the form he will take until the algorithm ends or he is evaluated again after being placed in the mSingle pool.

          match wEngaged |> Map.tryFind wi with
          // She's single! m is now engaged!
          | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
          // She's already engaged, let the best man win!
          | Some (m') -> 
            let mi', _ = m'
    

    Now that we have fully matched out everything masculine, it’s time to turn to the feminine. 

    Here we look in the wEngaged map to find out if wi is single.  This is done by using Map.tryFind, which returns None if the given key is not in the map and Some(value) if it is.

    If the Map.tryFind call returns None we know she is single and so recurse with the rest of our single men (bachelors) and our wEngaged map is extended via Map.add to contain a new mapping from the woman index (wi) to the given man (m).

    If she’s engaged our pattern match automatically decomposes the given option type and we know exactly who she is engaged to (m’).

            if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then 
              // Congrats mi, he is now engaged to wi
              // The previous suitor (mi') is bested 
              findMarriages 
                (m' :: bachelors) 
                (wEngaged |> Map.add wi m)
            else
              // The current bachelor (mi) lost, better luck next time
              findMarriages 
                (m :: bachelors) 
                wEngaged
    

    This is the core comparison logic.  Here we determine which bachelor gets her hand.  This is pretty straightforward from an imperative perspective as we’re using the oh-so familiar if statement.

    In the first case the new contender (m) wins.  We push the loser (m’) back onto the head of what will be mSingle next time around with the syntax (m’ :: bachelors).  Similar to what happens if she’s single, we add a mapping from the woman index (wi) to the man instance (m) via the call to Map.add.  Note that this will override the former mapping from wi to m’.
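
    That overriding behavior is just how F#’s immutable maps work: adding an existing key yields a new map in which the old binding is replaced. A quick illustration (the strings here are stand-ins for Bachelor values):

    let engagements = Map.empty |> Map.add 0 "m-old"
    let updated = engagements |> Map.add 0 "m-new"
    updated |> Map.tryFind 0  // Some "m-new" (the old binding is gone)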

    If the current bachelor loses we simply append him back to the head of the bachelors list and try again next time around. 

      findMarriages Munproposed Map.empty
      // Before returning, remove unproposed lists from man instances  
      |> Map.map (fun wi m -> let mi, _ = m in mi)  
    

    The final section is just how we set up our recursion.  We start with the full list of single men and their ranked lady lists (Munproposed) and an empty map of relationships (Map.empty).  When the algorithm exits we make one last pass through our returned wEngaged map in order to strip out all of the lists of unproposed to women.  It’s just baggage now anyway.

    Well, that’s how you get from imperative pseudocode to a pure functional implementation.  All of the code here, as well as a really nasty imperative version to compare it to, is available on my GitHub page.

    If you liked this post, have any questions, or feel like I missed something important, I hope you’ll comment on it here.  We’ll all benefit from the knowledge shared. 

  • Record Linkage in F# – Token Matching, Stable Marriages and the Gale-Shapley algorithm

    This post was originally published September 13th, 2011.

    Initially, one of the biggest problems I found when trying to marry records was the god-awful quality of much of the data I often have to work with. It’s mostly old mainframe and database data with truncated fields, limited character sets, and fields with nonsensical contents. Even worse, much of the data is the result of years of merged account lists from different sources, and so global statistical techniques often fail quite spectacularly.

    While naive compared to doing something along the lines of clustering, I have found that bag-of-words token matching has proven to be quite robust when comparing names. The idea is that if you consider a name as an unordered set of tokens, you can find the best global alignment of those tokens with some target name, and in the majority of cases this alignment will indicate the proper ordering. Finally, you can score these ordered tokens with an algorithm (like Jaro-Winkler, for example), average them, and end up with a good representation of the similarity of the two given names.

    (Of course, name is just one of many things to look at. After all, I’m pretty sure that all these John Smiths aren’t the same guy.)

    At first you might think finding this alignment is rather simple. Just score every permutation of the two sets of tokens and return the maximum, right? Well, this will work great for most traditional American names, as they have only 2-3 tokens. Three factorial is only 6 different token arrangements to try, not a big deal at all. However, this method has serious limitations in practice. What happens when you run up against a 9-token Arabic name? Nine factorial is 362,880 different permutations to try; that’s a lot of grinding for just one record. And that’s just considering real names. If someone were to accidentally dump a paragraph into your algorithm, you may just want to grab a Snickers.

    Don’t despair though, the stable marriages problem and the Gale-Shapley algorithm are here to help.

    In the stable marriages problem you have two sets of things you want to match up with each other (described in an unambiguously sexist way as men and women), and each has a way of rating the others. The question is: how might we match up all the men and women so that there are no two people who would rather be with each other than the partners they already have? Note that this does not mean everyone gets their first choice; what we are looking for is global optimality of the pairs.

    This sounds familiar right? It’s a slightly more complex version of exactly what we’re trying to do with tokens! So, the solution to this problem is called the Gale-Shapley algorithm, and get this, it’s only worst case O(M*W) where M is the number of men tokens and W is the number of women!

    To turn this all into plain English, we can use this algorithm to align that 9-token name in at most 9 * 9 = 81 steps, and we’re guaranteed to get a result that is just as good!

    It works like this: men take turns proposing to each woman in order of preference. If she’s single they are immediately “engaged”; if she’s already engaged the woman picks the man she likes best and the loser goes back into the singles pool. This happens over and over until every man is either engaged or out of options.

    In our case we’re actually creating sorted lists for each man of each woman ahead of time, so it’s O(M*W*log(W)), but it’s only a very minor increase in complexity, and how would you do that lazily anyhow?

    First, let’s take a look at the Wikipedia article’s very imperative pseudocode:

    function stableMatching {
        Initialize all m ∈ M and w ∈ W to free
        while ∃ free man m who still has a 
                woman w to propose to {
           w = m's highest ranked such woman 
                   who he has not proposed to yet
           if w is free
             (m, w) become engaged
           else some pair (m', w) already exists
             if w prefers m to m'
               (m, w) become engaged
               m' becomes free
             else
               (m', w) remain engaged
        }
    }

    Now, I’m not a big fan of imperative code so I wrote a pure functional version instead. Anyhow, having only two immutable data structures makes for a much easier read. It’s pretty much a long list of cases.

    open System
    // a Bachelor is an identity index and an 
    // ordered list of women indices to approach.
    type Bachelor = int * int list
    // Some notation:
    // wi = woman index (int)
    // mi = man index (int)
    // mi' = woman's current partner index (int)
    // m = man with index and unapproached women indices (Bachelor)
    // mSingle = men that are single (Bachelor list)
    // wEngaged = engagements from women to men (Map<int, Bachelor>)
    let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
      let Windices = [ 0 .. W.Length - 1 ]
      // List of men with women in order of desire  
      let Munproposed = 
        List.init M.Length 
          (fun mi -> 
               let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
               mi, Windices |> List.sortBy sortFun)
      // Recursively solve stable marriages
      let rec findMarriages mSingle wEngaged =
        match mSingle with
        // No single guys left with desired women, we're done
        | [] -> wEngaged
        // Guy is out of luck, remove from singles
        | (mi, []) :: bachelors -> findMarriages bachelors wEngaged
        // He's got options!
        | (mi, wi :: rest) :: bachelors -> 
          let m = mi, rest
          match wEngaged |> Map.tryFind wi with
          // She's single! m is now engaged!
          | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
          // She's already engaged, let the best man win!
          | Some (m') -> 
            let mi', _ = m'
            if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then 
              // Congrats mi, he is now engaged to wi
              // The previous suitor (mi') is bested 
              findMarriages 
                (m' :: bachelors) 
                (wEngaged |> Map.add wi m)
            else
              // The current bachelor (mi) lost, better luck next time
              findMarriages 
                (m :: bachelors) 
                wEngaged
      findMarriages Munproposed Map.empty
      // Before returning, remove unproposed lists from man instances  
      |> Map.map (fun wi m -> let mi, _ = m in mi)  
    // By the supreme power of partial application I give you 
    // Jaro-Winkler Token Alignment with Gale-Shapley in one line!
    let alignJaroWinkler = funGS jaroWinkler
    

    So break out your F# Interactive and try something like:

    alignJaroWinkler 
      [|"DUDE"; "DUDERSON"|] 
      [|"D"; "RONNY"; "DUDERSON"|]
    
    and you’ll hopefully get something back like this:
    
    val it : Map<int,int> = map [(0, 0); (2, 1)]

    This is just a map from women tokens to men tokens. (0,0) indicates that “D” matched best with “DUDE” and (2,1) indicates that “DUDERSON” matched best with “DUDERSON”. Poor “RONNY” ended up being the old maid.
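
    If you’d rather see tokens than indices, one quick way to project the result back onto the original arrays (a sketch using the example above):

    let M = [| "DUDE"; "DUDERSON" |]
    let W = [| "D"; "RONNY"; "DUDERSON" |]

    // Turn the women-to-men index map into actual token pairs.
    alignJaroWinkler M W
    |> Map.toList
    |> List.map (fun (wi, mi) -> W.[wi], M.[mi])
    // [("D", "DUDE"); ("DUDERSON", "DUDERSON")]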

    That’s it for today but, of course, there’s a whole lot more to cover here. We need to look at how to manipulate this output into something a bit nicer. I also have a bunch of tweaks and improvements to this pair of algorithms that I want to talk about. Finally, I want to sidetrack for a bit and tunnel into how one goes from the horrible imperative pseudocode above to the magnificent pure functional code underneath it.

    As per usual, all of the code used in this post (and previous posts in the series) can be found on GitHub.

  • Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 2)

    Last time we dove into the Jaro distance algorithm and picked apart how each of its components are calculated. However, from a modern perspective Jaro alone is a rather weak method of string matching. It was Winkler’s extension that brought this algorithm into widespread modern use.

    Matthew Jaro’s insight when inventing the Jaro distance algorithm was that humans tend to make certain kinds of errors. However, while working on the 1990 census William Winkler had an additional insight: these kinds of errors are much more likely to occur later in the string. By simply wrapping Jaro’s algorithm to forgive some penalty according to the similarity of the first few characters, Winkler was able to get significantly better results with little additional overhead.

    Winkler's Extension

    Here we see that the Jaro-Winkler distance (dw) is equal to the result of the Jaro distance (dj) plus one minus that same value times some weighted metric (lp). This is a great mathematical trick for two reasons. First, as long as the weighted metric (lp) doesn’t exceed 1, the final result will stay within the 0-1 range of the Jaro metric. Second, it guarantees that the result of Jaro-Winkler will never be lower than the result of Jaro alone. It effectively lets Winkler enrich the results of Jaro by filling in the remaining gap.

    The meaning of the weighted metric (lp) is even simpler. Here l is just the number of initial characters which match, up to a maximum of four. Meanwhile p is a weight which can be up to 1/4, or one over the maximum possible value of l.

    As p gets larger and larger, more and more of the gap will be filled in by Winkler’s extension. When p is 1/4, strings whose first four characters match will always be given a perfect score. Just as we’d never want strings like “JOHN” and “JOHNSON” to be considered a perfect match, we’d never want to use a p value of 1/4. After much experimentation, Winkler recommends using a p value of 0.1, which is also what I use.
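
    To make that concrete, take the MARTHA vs MARHTA pair from the tests below. Part 1 computed a Jaro score of dj ≈ 0.944, the first l = 3 characters match, and with p = 0.1 the extension gives dw = 0.944 + (3 × 0.1) × (1 − 0.944) ≈ 0.961, exactly the value the martha test expects.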

    Now that we’ve covered how the math works, let’s take a look at an actual implementation. If you’d like to see an implementation of the Jaro distance algorithm take a look at the previous installment of Record Linkage Algorithms in F#.

    /// Calculate the Jaro-Winkler distance of s1 and s2
    let jaroWinkler s1 s2 = 
        let jaroScore = jaro s1 s2
        // Accumulate the number of matching initial characters
        let maxLength = (min s1.Length s2.Length) - 1
        let rec calcL i acc =
            if i > maxLength || s1.[i] <> s2.[i] then acc
            else calcL (i + 1) (acc + 1.0)
        let l = min (calcL 0 0.0) 4.0
        // Calculate the JW distance
        let p = 0.1
        jaroScore + (l * p * (1.0 - jaroScore))
    

    In my implementation I’ve allowed for some slight inefficiency in order to make it easier to play with the p value and the number of characters examined. So far I’ve found Winkler’s choice of four characters to be the best, which incidentally has been shown to be a good number of initial characters to examine for a variety of record linkage algorithms on English-language data. However, I suspect that other values may work better in other languages.
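
    If you do want to experiment, one way to expose both knobs as parameters (a sketch based on the implementation above; p = 0.1 and maxL = 4 reproduce its behavior):

    /// Jaro-Winkler with a configurable prefix weight and prefix length
    let jaroWinklerWith (p: float) (maxL: int) (s1: string) (s2: string) =
        let jaroScore = jaro s1 s2
        // Accumulate up to maxL matching initial characters
        let maxLength = (min s1.Length s2.Length) - 1
        let rec calcL i acc =
            if i > maxLength || s1.[i] <> s2.[i] then acc
            else calcL (i + 1) (acc + 1.0)
        let l = min (calcL 0 0.0) (float maxL)
        jaroScore + (l * p * (1.0 - jaroScore))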

    The math here is so simple that I don’t feel it’s worth breaking down further, but I’ve included my tests built from the examples in the Jaro-Winkler distance Wikipedia article. For a fuller understanding just break out the debugger and play a bit on your own.

    open Xunit
    
    [<Fact>]
    let ``Jaro-Winkler identity test`` () = 
        let result = jaroWinkler "RICK" "RICK"
        Assert.Equal("1.000", String.Format("{0:0.000}", result))
    
    [<Fact>]
    let ``Jaro-Winkler martha test`` () = 
        let result = jaroWinkler "MARTHA" "MARHTA"
        Assert.Equal("0.961", String.Format("{0:0.000}", result))
    
    [<Fact>]
    let ``Jaro-Winkler dwayne test`` () = 
        let result = jaroWinkler "DWAYNE" "DUANE"
        Assert.Equal("0.840", String.Format("{0:0.000}", result))
    
    [<Fact>]
    let ``Jaro-Winkler dixon test`` () =
        let result = jaroWinkler "DIXON" "DICKSONX"
        Assert.Equal("0.813", String.Format("{0:0.000}", result))
    

    In the next installment of Record Linkage Algorithms in F# we’ll take a look at a method for efficient token-based matching with Jaro-Winkler. Stay tuned!

    Edit: All current code for this series is now available on GitHub.

  • Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 1)

    When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is twofold. First, know that many are outdated and have newer and better variants, so they can be ignored. Second, among the modern extensions of these algorithms, each has a particular use depending on the kind of data you’re comparing.

    Jaro-Winkler distance is a particularly good choice when comparing short strings of human-entered data, such as names. This is due to its relative robustness against letter transpositions and its weighting of similarity toward the beginning of the string. When it comes to comparing names, a slightly customized version of tokenized Jaro-Winkler is one of the first things in my toolbox that I reach for.

    Jaro-Winkler distance also has the benefit of being a rather easy to understand algorithm. It’s composed of two parts: Jaro’s original algorithm and Winkler’s extension.  In this installment we’ll roll up our sleeves and dig into the first part of this algorithm, Jaro distance.

    The Jaro Algorithm

    This is the equation for Jaro distance. It’s made up of the average of three sub-calculations (written out in plain notation below the list):

    1. The ratio of matching characters to the length of the first string.
    2. The ratio of matching characters to the length of the second string.
    3. The ratio of non-transpositions to the number of matching characters.
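
    In plain notation, that is dj = 1/3 × (m/|s1| + m/|s2| + (m − t)/m), where m is the number of matching characters and t is the number of transpositions; this matches the comment near the bottom of the implementation below.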

    My implementation has been optimized a bit, but I think it’s still rather easy to understand.

    open System
    
    /// Given an offset and a radius from that offset, 
    /// does mChar exist in that part of str?
    let inline existsInWin (mChar: char) (str: string) offset rad =
        let startAt = max 0 (offset - rad)
        let endAt = min (offset + rad) (String.length str - 1)
        if endAt - startAt < 0 then false
        else
            let rec exists index =
                if str.[index] = mChar then true
                elif index = endAt then false
                else exists (index + 1)
            exists startAt
    
    /// The jaro distance between s1 and s2
    let jaro s1 s2 =
    
        // The radius is half of the lesser 
        // of the two string lengths rounded up.
        let matchRadius =
            let minLen =
                min (String.length s1) (String.length s2) in
                  minLen / 2 + minLen % 2
    
        // An inner function which recursively finds the number  
        // of matched characters within the radius.
        let commonChars (chars1: string) (chars2: string) =
            let rec inner i result =
                match i with
                | -1 -> result
                | _ -> if existsInWin chars1.[i] chars2 i matchRadius
                       then inner (i - 1) (chars1.[i] :: result)
                       else inner (i - 1) result
            inner (chars1.Length - 1) []
    
        // The sets of common characters and their lengths as floats 
        let c1 = commonChars s1 s2
        let c2 = commonChars s2 s1
        let c1length = float (List.length c1)
        let c2length = float (List.length c2)
    
        // The number of transpositions within 
        // the sets of common characters.
        let transpositions =
            let rec inner cl1 cl2 result =
                match cl1, cl2 with
                | [], _ | _, [] -> result
                | c1h :: c1t, c2h :: c2t ->
                    if c1h <> c2h
                    then inner c1t c2t (result + 1.0)
                    else inner c1t c2t result
            let mismatches = inner c1 c2 0.0
            // If one common string is longer than the other
            // each additional char counts as half a transposition
            (mismatches + abs (c1length - c2length)) / 2.0
    
        let s1length = float (String.length s1)
        let s2length = float (String.length s2)
        let tLength = max c1length c2length
    
        // The jaro distance as given by 
        // 1/3 ( m2/|s1| + m1/|s2| + (mc-t)/mc )
        let result = (c1length / s1length +
                      c2length / s2length +
                      (tLength - transpositions) / tLength)
                     / 3.0
    
        // This is for cases where |s1|, |s2| or m are zero 
        if Double.IsNaN result then 0.0 else result
    

    We’re quite lucky to have examples of the algorithm in action available right in the Wikipedia article to use as unit tests. When optimizing, you couldn’t ask for more.

    Take ‘DWAYNE’ and ‘DUANE’ for example. According to the article we should end up with the following results:

    DWAYNE vs DUANE

    So, let’s break it down. The shorter of the two strings, DUANE, is 5 characters long. With minLen set to 5, minLen / 2 + minLen % 2 gives 2 + 1, so we get a match radius of 3.

    Next we find the common characters in each direction. Comparing DWAYNE to DUANE with a radius of 3 we can see that D, A, N and E will match, giving us c1 = “DANE”. Similarly, comparing DUANE to DWAYNE we get exactly the same thing. That is, c2 = “DANE” as well.

    As you might have guessed by the fact both sets of common strings are the same, we have zero transpositions in this instance.

    So now, we just plug in the numbers and get 1/3 * (4/6 + 4/5 + (4-0)/4) = 0.822. Not too difficult, eh?

    For a better understanding of how transpositions end up working out, try walking through the debugger with the MARTHA vs MARHTA test below.

    open Xunit
    
    [<Fact>]
    let ``Jaro identity test`` () =
        let result = jaro "RICK" "RICK"
        Assert.Equal("1.000", String.Format("{0:0.000}", result))
    
    [<Fact>]
    let ``Jaro martha test`` () =
        let result = jaro "MARTHA" "MARHTA"
        Assert.Equal("0.944", String.Format("{0:0.000}", result))
    
    [<Fact>]
    let ``Jaro dwayne test`` () =
        let result = jaro "DWAYNE" "DUANE"
        Assert.Equal("0.822", String.Format("{0:0.000}", result))
    
    [<Fact>]
    let ``Jaro dixon test`` () =
        let result = jaro "DIXON" "DICKSONX"
        Assert.Equal("0.767", String.Format("{0:0.000}", result))
    

    I hope you enjoyed this installment of Record Linkage Algorithms in F#. Next time we’ll explore the Winkler extension to this algorithm and take a look at why weighting errors earlier in the string more heavily ends up giving significantly better results.

    Edit: All current code for this series is now available on GitHub.

  • F# and the DLR at Dev Day for NSWC Dahlgren

    Just yesterday I gave a presentation on F# and the DLR for the Naval Surface Warfare Center. Many thanks to Kevin Hazzard and Chris Bowen for recommending me. It was a fantastic opportunity to speak on the benefits of F# to an entirely new audience and I learned a few things about the DLR along the way.

    Despite disliking dynamic languages in practice because of their lack of safety, I must admit that the DLR allows you to do quite a few very interesting things.

    First, you can embed DLR language code right into html and so build Silverlight applications with no backing assembly. This is quite interesting as it makes the web development story for Silverlight much more attractive. Read more about this feature on the DLR section of the Silverlight Website.

    Second, it’s quite easy to embed DLR languages into your application. With just about three lines of code I was able to build an IronPython query window right into a ListView. The possibilities for live debugging and building query DSLs here are huge.
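
    The actual code is in the download below, but the heart of it really is about three lines: create an engine, create a scope, and execute user input against it. A rough sketch with IronPython (assuming the IronPython and DLR hosting assemblies are referenced; runQuery is a name made up for this example):

    open IronPython.Hosting

    // Create a Python engine and a scope to hold the query's state.
    let engine = Python.CreateEngine()
    let scope = engine.CreateScope()

    // Evaluate a snippet of user input against that scope.
    let runQuery (code: string) : obj = engine.Execute(code, scope)

    printfn "%O" (runQuery "2 + 2")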

    Click here to download my slides, as well as the code used in this presentation.

  • Some Recent Talks (with slides and code)

    I’ve given quite a few talks over the last couple of months and at each and every one I promised to post my content shortly afterwards here on my blog.

    However, due to some extreme laziness early on coupled with a crazy schedule and some unfortunate (but thankfully temporary) health problems more recently, I’ve failed to live up to those promises. Sorry guys, here’s the slides and code I promised. Better late than never, right?

    How you can get started with F# today

    I gave this talk at the local .NET meeting the Monday night of the MVP Summit.  It’s a short (15-minute) one-two punch of why you might care about F# and where you can find out more.  Very well received.

    Getting the MVVM Kicked Out of Your F#’n Monads

    Download The Code

    This presentation is from the most recent NYC Code Camp. The title started out as a joke with Steve Bohlen on Twitter about how all the submitted talks had to do with MVVM. I wasn’t really going to do it, but when he said he’d automatically accept something with such a title, I decided to use it as a chance to explain computation expressions to a room full of people new to F#. It actually went really well, all things considered; the post-talk reviews were fantastic.

    Fun and Games in F#

    My MVP2MVP talk from the MVP summit’s very first F# track. It’s a meta-talk about how to give a good F# talk, and how to build a stronger F# community.