Projects


22
Jul 13

The Promise of F# Language Type Providers

In most software domains you can safely stick with one or two languages and, because the tools you are using are fairly easy to replicate, you’ll find almost anything you might need to finish your project. This isn’t true in data science and data engineering however. Whether it be some hyper-optimized data structure or a cutting edge machine learning technique often you only have a single language or platform choice.

Even worse, when you want to build a system that uses one or more platform specific components, things can become quite an engineering mess. No matter what you do you can’t avoid the high cost of serialization and marshaling. This makes some combinations of tools non-options for some problems. You often make trade-offs that you shouldn’t need to make, for example using a worse algorithm just because the better option hasn’t been written for your platform.

In .NET this is a particularly bad problem. There are quite a few dedicated people working on open source libraries, but they are tiny in number compared to the Matlab, Python, R or Java communities. Meanwhile, Microsoft research has several fantastic libraries with overly restrictive licenses that make them impossible to use commercially. These libraries drive away academic competition, but at the same time can’t be used outside of academia. It’s a horrible situation.

Thankfully, there is a silver lining in this dark cloud. With the release of F# 3.0 in VS 2012 we were given a new language feature called Type Providers. Type Providers are compiler plugins that generate types at compile time and can run arbitrary code to do it. Initially, these were designed for access databases and getting types from the schema for free, but when Howard Mansell released the R Language Type Provider everything changed. We now realized that we had a way to build slick typed APIs on top of almost any other language.

This means that now it doesn’t matter if someone has written the algorithm or data structure for our platform as long as there’s a Type Provider for a platform where it has been done. The tedious work of building lots of little wrapped sub-programs is completely gone. It shouldn’t even matter if the kind of calculation you’d like to do is fast on your native platform, as you can just transparently push it to another. Of course, we still must pay the price of marshaling, but we can do it in a controlled way by dealing in handles to the other platform’s variables.

The language Type Providers themselves are a bit immature for the moment but the idea is sound and the list is growing. There is now the beginnings of an IKVM Type Provider (for Java) and I’m working on a Matlab Type Provider. The Matlab Provider doesn’t yet have all of the functionality I am aiming for, but I’ve been working on it for several months and it’s quite usable now. All that’s left is for someone to start in on a Python type provider and we’ll practically have all of the data science bases covered.

It’s an exciting time to be an F#’er.


27
Mar 13

Setting up F# Interactive for Machine Learning with Large Datasets

Before getting started with Machine Learning in F# Interactive it’s best to prepare it for large datasets and external 64-bit libraries so you don’t get blindsided with strange errors when you happen to cross the line. The good news is it’s a simple process that should only take a few minutes.

The first step is to go into the configuration and set fsi to 64-bit. It’s only matter of changing a boolean value buried deep in the Visual Studio settings. First, Go into Tools->Settings.

Tools->Settings

Then find the “F# Tools” section on the left and select the “F# Interactive” subsection.

tools_settings_fsi

Finally, set “64-bit F# Interactive” to true and click OK.

tools_settings_fsi_64-bit

What this does is set Visual Studio to use “FsiAnyCPU.exe” for the F# Interactive window instead of 32-bit “Fsi.exe”.

Now, after we restart Visual Studio, your F# Interactive is running with as many bits as your operating system can handle. However, if we want to support really big matrices we’re going to need to go a bit further. If we want really large arrays, that is greater than 2 gigabytes, we’re going to need to fiddle with the F# Interactive application config and enable the “gcAllowVeryLargeObjects” attribute.

For .NET 4.5 on Windows 7, Windows 8 and Windows Sever 2008R2 the standard directory for both the fsi exeuctables and their application configs is:

“C:\Program Files (x86)\Microsoft SDKs\F#\3.0\Framework\v4.0″

Navigate there and open “FsiAnyCPU.exe.config” in your favorite text editor. Then under the <runtime> tag add:

<gcAllowVeryLargeObjects enabled="true" />

When you’re done it should look like:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
    <legacyUnhandledExceptionPolicy enabled="true" />
    <assemblyBinding 
      xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity
          name="FSharp.Core"
          publicKeyToken="b03f5f7f11d50a3a"
          culture="neutral"/>
        <bindingRedirect
          oldVersion="2.0.0.0"
          newVersion="4.3.0.0"/>
        <bindingRedirect
          oldVersion="4.0.0.0"
          newVersion="4.3.0.0"/>
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>

Just save and restart Visual Studio and you’re done! Your F# Interactive can now handle large datasets and loading external 64-bit native libraries.


4
Sep 12

Levenshtein Distance and the Triangle Inequality

Levenshtein distance is one of my favorite algorithms. On the surface it seems so very simple, but when you spend some time thinking hard on it deep insights are waiting to be had.

The first and most important thing about Levenshtein distance is it’s actually a metric distance. That is, it obeys the triangle inequality. For most other string distance measurements this property doesn’t hold.

The Vector Triangle Inequality

The Vector Triangle Inequality

This might not seem like such a big deal, but this property gives the measurements meaning in a context larger than just the pair. For example, it allows you to embed your pair distance measurements into a higher dimensional space and so use it for things like clustering.

This was one of the first insights that Levenshtein distance gave me: A measurement doesn’t need to give you an absolute location in space to be useful for telling you where you are, it just has to tell you how far away everything else is. But what is it about Levenshtein distance that gives it this property? It’s not immediately obvious to most people, at least it wasn’t to me.

First, let’s consider a naive implementation of the Wagner-Fisher algorithm for Levenshtein distance. As stated above, here the triangle inequality holds.

 1: let wagnerFischer (s: string) (t: string) =
 2:     let m = s.Length
 3:     let n = t.Length
 4:     let d = Array2D.create (m + 1) (n + 1) 0
 5: 
 6:     for i = 0 to m do d.[i, 0] <- i
 7:     for j = 0 to n do d.[0, j] <- j    
 8: 
 9:     for j = 1 to n do
10:         for i = 1 to m do
11:             if s.[i-1] = t.[j-1] then
12:                 d.[i, j] <- d.[i-1, j-1]
13:             else
14:                 d.[i, j] <-
15:                     List.min
16:                         [
17:                             // a deletion
18:                             d.[i-1, j  ] + 1; 
19:                             // an insertion
20:                             d.[i  , j-1] + 1; 
21:                             // a substitution
22:                             d.[i-1, j-1] + 1; 
23:                         ]
24:     d.[m,n]

Now compare this with an incorrect version of an extension called Damerau–Levenshtein distance (or restricted edit distance). This change adds support for Jaro-Winkler like transpositions to the original algorithm. However, in the process of adding just this minor tweak we lose the triangle inequality.

26: let damerauLevenshtein (s: string) (t: string) =
27:     let m = s.Length
28:     let n = t.Length
29:     let d = Array2D.create (m + 1) (n + 1) 0
30: 
31:     for i = 0 to m do d.[i, 0] <- i
32:     for j = 0 to n do d.[0, j] <- j    
33: 
34:     for j = 1 to n do
35:         for i = 1 to m do
36:             // 1 if a substitution
37:             // 0 if no change
38:             let cost = if s.[i-1] = t.[j-1] then 0 else 1
39:             d.[i, j] <-
40:                 List.min
41:                     [
42:                         // a deletion
43:                         d.[i-1, j  ] + 1; 
44:                         // an insertion
45:                         d.[i  , j-1] + 1; 
46:                         // a substitution or nothing
47:                         d.[i-1, j-1] + cost;
48:                     ]
49:             if // boundary check
50:                i > 1 && j > 1 
51:                // transposition check
52:             && s.[i-1] = t.[j-2] && s.[i-2] = t.[j-1] 
53:             then // the lesser of a transposition or current cost
54:                 d.[i, j] <- min d.[i,j] (d.[i-2, j-2] + cost)
55:     d.[m,n]

val wagnerFischer : string -> string -> int

Full name: Snippet.wagnerFischer

val s : string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = System.String

Full name: Microsoft.FSharp.Core.string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

val t : string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

val m : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

property System.String.Length: int
val n : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val d : int [,]

  type: int [,]
  implements: System.ICloneable
  implements: System.Collections.IList
  implements: System.Collections.ICollection
  implements: System.Collections.IEnumerable
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.IStructuralEquatable
  inherits: System.Array

module Array2D

from Microsoft.FSharp.Collections

val create : int -> int -> 'T -> 'T [,]

Full name: Microsoft.FSharp.Collections.Array2D.create

val i : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val j : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface System.Collections.IEnumerable
    interface System.Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: System.Collections.IStructuralEquatable
  implements: System.IComparable<List<'T>>
  implements: System.IComparable
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.Generic.IEnumerable<'T>
  implements: System.Collections.IEnumerable

val min : 'T list -> 'T (requires comparison)

Full name: Microsoft.FSharp.Collections.List.min

val damerauLevenshtein : string -> string -> int

Full name: Snippet.damerauLevenshtein

val cost : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

It seems like such a simple and obvious addition to the algorithm. Just what is it about about the way we’ve added transpositions that ruins the magic? We’ve just put something like wormholes to our little universe. That’s right, the simple addition of transpositions in this way implies a universe where some combinations of characters treat space differently than everything else. The easiest way to prove this is the case is to give the definition of the triangle inequality for metric spaces a read.

From Wikipedia’s Triangle Inequality article:
In a metric space M with metric d, the triangle inequality is a requirement upon distance: d(x, z) <= d(x, y) + d(y, z)
for all x, y, z in M. That is, the distance from x to z is at most as large as the sum of the distance from x to y and the distance from y to z.

From this, it’s easy to construct a counterexample for our broken Damerau-Levenshtein distance simply by exploiting the transpositions.

Damerau-Levenshtein distance not satisfying the triangle inequality.

As you can see in this picture 4 is most certainly greater than 1 + 2, and so the triangle inequality is broken. Consider also the pathology that this example shows in the algorithm. Why didn’t it just go along the irkc –> rick –> rcik path when it’s obviously less expensive?

Levenshtein distance satisfying the triangle inequality.

For comparison, if we measure those same pairs with standard Levenshtein distance everything is just peachy.

So we now know of at least one case which causes the triangle inequality to fail, does this imply what causes it to succeed? I think yes, at least in a limited sense. We can see that with Levenshtein distance any given pair of characters is considered independently, changes at each only happen once and are exactly one character in size. As each pair in the strings is considered, small changes push them further and further apart, but in discrete and equal units for discrete and equal changes. While in our Damerau-Levenshtein distance implementation we greedily perform operations of a larger size and then never revisit their implications, standard Levenshtein is reversibly symmetric both in how it treats locations over the string as well as in how it treats the characters themselves due to its uniform granularity. The uniform granularity of the changes ensures all important paths are explored.

Can transpositions be reliably counted with a different approach? We’ll find out the answer to this question next time.


3
Oct 11

Advice for Getting Started with F#

I had a great time at NYC Code Camp this last weekend. About half the people in my talk already knew F# and were there to talk about Type Providers, the other half just came to see what this F# thing was all about. This post is to help those in the second half begin their F# learning adventure.

The first thing anyone who is looking to get started with F# should do is work through the tutorial examples on tryfsharp.org. Getting some hands-on time with the basics is extremely important, especially if you haven’t done much functional programming before.

Once you’ve got the basics down, the next step is thinking through some real problems in F#. For me, this was writing an Ant Colony simulation. With the version linked to here you can actually compile and then try your own ant behavior code against mine! Other great options include the Coding Dojo Katas or Project Euler (for the mathematically inclined).

It’s best if you do most of your play using .fsx files and F# interactive at first. Just highlighting code and hitting Alt-Enter makes the process of learning so much faster. However, if you insist on working with unit tests I suggest Xunit as it doesn’t require the use of classes for hosting your tests.

At first it’s also best to ignore most of F#’s features, it can be overwhelming otherwise. Try your best to focus on comprehensions. These may not always be the best way to solve a problem, but they allow you to write very imperative code right in the middle of a bunch of functional stuff and so are kind of an escape hatch for when you don’t know how to do what you want functionally. Eventually you’ll want to move on to recursion and folding though. When you’re ready to make the jump read this post which talks about how to go about it.

As you’re playing with F#, don’t be afraid to look things up along the way when you get stuck. There’s quite a few good resources around for F# including the F# Programming WikiBook, the code samples provided by the F# team, and a number of print books including Professional F# 2.0 (shameless plug). When all else fails the MSDN documentation for F# is actually very good!

I can say from experience that at first you will likely experience quite a few syntax errors. Don’t worry too much about it, your brain needs to build up a model of how the types and syntax work. Once you’ve gotten over that hurdle I promise you’ll be amazed by how different the programming world looks.


27
Sep 11

Record Linkage Algorithms in F# – Extensions to Jaro-Winkler Distance (Part 3)

While writing the previous article on tokenized matching I realized I left out some important background information on Jaro-Winkler distance.

The Vector Triangle Inequality

The Vector Triangle Inequality

First, there’s something important to know about the Jaro-Winkler distance: it’s not a metric distance and so does not obey the triangle inequality. That is, if you found the JW distance between strings A and B, and then found the JW distance between strings B and C, those results would have no relationship with JW distance between strings A and C. This may not seem like a big deal, but it means Jaro-Winkler distance can’t be used to embed strings in a metric space and so is a poor algorithm choice for many types of clustering. This will be an important point in future articles.

Second, it can be very helpful to extend the results of Jaro-Winkler based on the nature of your own data and your use of the algorithm. To better support my own use case I’ve made changes put the emphasis on better token alignment.

 1: let jaroWinklerMI (t1:string) (t2:string) = 
 2:     // Optimizations for easy to calculate cases
 3:     if t1.Length = 0 || t2.Length = 0 then 0.0
 4:     elif t1 = t2 then 1.0
 5:     else
 6:         // Even more weight for the first char
 7:         let score = jaroWinkler t1 t2
 8:         let p = 0.2 //percentage of score from new metric
 9:         let b = if t1.[0] = t2.[0] then 1.0 else 0.0
10:         ((1.0 - p) * score) + (p * b)

Beyond the optimization for empty strings and those which are exactly the same, you can see here that I weight the first character even more heavily. This is due to my data being very initial heavy.

To compensate for the frequent use of middle initials I count Jaro-Winkler distance as 80% of the score, while the remaining 20% is fully based on the first character matching. The value of p here was determined by the results of heavy experimentation and hair pulling. Before making this extension initials would frequently align incorrectly.

12: let scoreNamePairs (t1:string) (t2:string) =  
13:     //Raise jaro to a power in order to over-weight better matches        
14:     jaroWinklerMI t1 t2 ** 2.0

I also take the square of the result of jaroWinklerMI to weight better matches even more heavily. I found that in doing this I was able to get much more reliable matching. To understand how this works take a gander at this plot.

As you already know, multiplying any number greater than 0 but less than 1 by itself will give you a smaller number. However are you might intuit, the smaller the number the greater the proportional reduction. As you can see here, anything less than 1 takes a hit, but worse matches get dragged down significantly more.

Initially I was frustrated by bad alignments which would sometimes be chosen over better ones when two or more tokens were both fairly close, but not great. After seeing a variation on this squaring technique used for matrix convergence the thought occurred to me: why not see if it helps with token alignment? After implementing this I saw a huge improvement in results: incorrect alignments completely disappeared!

It’s often surprising where inspiration will come from.

Edit: The above code and it’s composition with Gale-Shapely is now available in my github repository.

val jaroWinklerMI : string -> string -> float

Full name: Snippet.jaroWinklerMI

val t1 : string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = System.String

Full name: Microsoft.FSharp.Core.string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

val t2 : string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

property System.String.Length: int
val score : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val p : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val b : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val scoreNamePairs : string -> string -> float

Full name: Snippet.scoreNamePairs


19
Sep 11

Imperative Pseudocode to Pure Functional Algorithm with Gale-Shapely and F#

Continuing from last time, let’s look at how one goes from imperative pseudocode to pure functional using Gale-Shapely as an example.

Overall, to convert an algorithm from imperative to functional is a fairly simple process once you understand how to convert from a while loop to recursion with accumulators. This post is just a more advanced version of what I talked about in my From Imperative to Computation Expressions post. In fact, if you haven’t already I suggest you read it first. I’ll be assuming you understand how to perform that conversion in this article.  If you’re still having trouble, another good article which is a bit easier is The Ted Neward F# Folding Challenge.

The conversion of an algorithm is a two stage process: first identify algorithmic cases, implied mutable data structures, and alternate immutable data structures.  Then convert the given loops into recursion while leveraging pattern matching.  I know it may seem difficult at first, but after reading this I think you’ll have a good idea of how to get there.

So, let’s identify the cases.

function stableMatching {   
    Initialize all m ? M and w ? W to free
    // Loop Termination Case on
    // check for “free m with unproposed w” and
 
    // implicit ranking of W for each m in M
   
    while ? free man m who still has a
            woman w to propose to {      
       w = m’s highest ranked such woman
               who he has not proposed to yet
      
// She’s Single Case 
       if w is free
         (m, w) become engaged
       // She’s Engaged Case 
       else some pair (m’, w) already exists
         // She Picks m Sub-Case 

         if w prefers m to m’
           (m, w) become engaged
           m’ becomes free
         // She Picks m’ Sub-Case 
         else
           (m’, w) remain engaged
    }
}

Starting from the top, we can just ignore the imperative initialization.  However, you’ll notice that a great deal is going on in that while clause. 

First, it defines our loop termination by looking for a “free m who still has a w to propose to”.  This is interesting because it seems to be begging for some kind of data structure lookup.  Second it defines a implicit ranking of each member of W by each member of M.  Note that this is not mentioned in the initialization phase.

The rest of this algorithm reads like a simple list of cases.  To make this plain, let’s list them out along with state changes.

  1. no single m in M with unproposed w 
    – execution terminates
  2. m proposes to unengaged w
    – w is no longer in m proposals
    – m is no longer single
    – m and w are engaged
  3. m proposes to w engaged to m’
    a.  She switches to m
          — w is no longer in m proposals
          — m is no longer single
          — m and w are engaged
          — m’ is single
    b.  She stays with m’
          — w is no longer in m proposals
          — m’ and w stay engaged
          — m remains single

The mechanics of what exactly we do in these cases depends on our data structures.  So we can’t get much further without figuring that out.  To identify what data structures are needed we’ll need to identify what exactly it is that needs to be kept track of.  Only once we know that can we pick our immutable data structures appropriately. 

Looking above, it appears that we’ll need to keep track of:

  1. Given a woman, is she engaged and, if so, who is she engaged to? (need to look it up)
  2. Which men are single? (just need one at a time, order doesn’t matter)
  3. For each man, which is his next most desirable match that he has not yet proposed to? (just need one at a time, order matters)

Note that we don’t have to keep track of which men are engaged to which women, the question is never asked.  This is a minimal set of requirements here.  No extra stuff allowed.  I’m serious.

Moving on, in an imperative mutable version you might use these data structures:

  1. An array indexed by w to the m index engaged to (-1 if single)
  2. An array indexed by m to the w index engaged to (-1 if single)
  3. An array indexed by m where each element is an ordered linked list of unproposed w indices.

However, what would satisfy these same requirements but be immutable?  Well, if we step out of the mutation mindset it becomes obvious pretty quickly:

  1. An engagement map of women to men
  2. An immutable linked list of single men
  3. Define a man to include his index as well as an ordered unproposed woman immutable linked list

Instead of scanning arrays, or querying mutable structures we simply decompose our data structures as much as needed and recompose them into the form we need on the next call.   This may sound difficult, but it’s really not.  The fact that we have pattern matching and our cases can be defined in terms of the shape of our data structures actually makes this very simple to do.  That is, once you get the hang of it.

To see this in action, let’s break up the code in the last article and discuss each chunk.

 1: open System
 2: 
 3: // a Bachelor is an identity index and an 
 4: // ordered list of women indicies to approach.
 5: type Bachelor = int * int list

Here we see the definition of a bachelor, it’s a tuple of an integer and a list of integers. This is effectively a pair containing an index and a list of woman indices. Note that in this example all men are of type Bachelor.

 7: // Some notation:
 8: // wi = woman index (int)
 9: // mi = man index (int)
10: // mi' = woman's current partner index (int)
11: // m = man with index and unapproached women indices (Bachelor)
12: // mSingle = men that are single (Bachelor list)
13: // wEngaged = engagements from women to men (int, Bachelor)

We’ll keep this notation around just so you have easy reference.

15: let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
16:   let Windices = [ 0 .. W.Length - 1 ]
17:   // List of men with women in order of desire  
18:   let Munproposed = 
19:     List.init M.Length 
20:       (fun mi -> 
21:            let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
22:            mi, Windices |> List.sortBy sortFun)

Windices is just a list of numbers from 0 to the number of woman tokens. Doing this first makes the calculation of Munproposed less expensive.

Munproposed is our initial set of all single men. Here List.init is used to call the given lambda to generate each element. As you can see by the last line within that lambda, each element of the list is a tuple containing the index of that man and a list of women indices sorted by the given desirability function, sortFun.

23:   // Recursively solve stable marriages
24:   let rec findMarriages mSingle wEngaged =
25:     match mSingle with
26:     // No single guys left with desired women, we're done
27:     | [] -> wEngaged
28:     // Guy is out of luck, remove from singles
29:     | (mi, []) :: bachelors -> findMarriages bachelors wEngaged

These are our first two cases, if the mSingle list pattern matches with [], we know it’s empty and so we are done.

In the second case we are pattern matching on the structure of singles list as well as it’s first element.  The syntax (mi, []) matches on a tuple of an index and an empty inner list.  This list is the list of yet unproposed to women and so if it’s empty we know that this guy is out of options and so it’s ok to drop him from the list of bachelors.  We do this by simply recursing on bachelors, which is the tail of the mSingle list.

30:     // He's got options!
31:     | (mi, wi :: rest) :: bachelors -> 
32:       let m = mi, rest

This is our general purpose match, and is the structure we expect to see in most cases.  Here the head is fully decomposed into a man index (mi), his next first choice woman index (wi), the rest of the women (rest) and the rest of the single men (bachelors).

Immediately afterward we define m to be that given man index (mi) and the rest of the women tokens (rest).  As two tokens are only ever compared once, m should no longer contain wi in his ordered list of proposals.  This is the form he will take until the algorithm ends or he is evaluated again after being placed in the mSingle pool.

33:       match wEngaged |> Map.tryFind wi with
34:       // She's single! m is now engaged!
35:       | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
36:       // She's already engaged, let the best man win!
37:       | Some (m') –> 
38: let mi', _ = m'

Now that we have fully matched out everything masculine, it’s time to turn to the feminine. 

Here we look in the wEngaged map to find out if wi is single.  This is done by a using Map.tryFind which returns None if the given key is not in the map and Some(value) if it is.

If the Map.tryFind call returns None we know she is single and so recurse with the rest of our single men (bachelors) and our wEngaged map is extended via Map.add to contain a new mapping from the woman index (wi) to the given man (m).

If she’s engaged our pattern match automatically decomposes the given option type and we know exactly who she is engaged to (m’).

39:         if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then 
40:           // Congrats mi, he is now engaged to wi
41:           // The previous suitor (mi') is bested 
42:           findMarriages 
43:             (m' :: bachelors) 
44:             (wEngaged |> Map.add wi m)
45:         else
46:           // The current bachelor (mi) lost, better luck next time
47:           findMarriages 
48:             (m :: bachelors) 
49:             wEngaged

This is the core comparison logic.  Here we determine which bachelor gets her hand.  This is pretty straightforward from an imperative perspective as we’re using the oh-so familiar if statement.

In the first case the new contender (m) wins.  We append the loser (m’) back on to the head of what will be mSingle next time around with the syntax (m’ :: bachelors).  Similar to what happens if she’s single, we add a mapping from the woman index (wi) to the man instance (m) via the call to Map.add.  Note that this will override the former mapping from wi to m’.

If the current bachelor loses we simply append him back to the head of the bachelors list and try again next time around.  You’ll get her next time bro.

50:   findMarriages Munproposed Map.empty
51:   // Before returning, remove unproposed lists from man instances  
52:   |> Map.map (fun wi m -> let mi, _ = m in mi)  

The final section is just how we set up our recursion.  We start with the full list of single men and their ranked lady lists (Munproposed) and an empty map of relationships (Map.empty).  When the algorithm exits we make one last pass through our returned wEngaged map in order to strip out all of the lists of unproposed to women.  It’s just baggage now anyway.

Well that’s how you get from imperative pseudocode to a pure functional implementation.  All of the code here, as well as a really nasty imperative version to compare it to is available on my GitHub page

If you liked this post, have any questions, or feel like I missed something important I hope you’ll comment here on it.  We’ll all benefit from the knowledge shared. 

namespace System
type Bachelor = int * int list

Full name: Snippet.Bachelor

Multiple items

val int : ‘T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–

type int<’Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

  type: int<’Measure>

  implements: IComparable

  implements: IConvertible

  implements: IFormattable

  implements: IComparable<int<’Measure>>

  implements: IEquatable<int<’Measure>>

  inherits: ValueType

——————–

type int = int32

Full name: Microsoft.FSharp.Core.int

  type: int

  implements: IComparable

  implements: IFormattable

  implements: IConvertible

  implements: IComparable<int>

  implements: IEquatable<int>

  inherits: ValueType

type ‘T list = List<’T>

Full name: Microsoft.FSharp.Collections.list<_>

  type: ‘T list

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<’T>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<’T>

  implements: Collections.IEnumerable

val funGS : (‘a -> ‘a -> float) -> ‘a array -> ‘a array -> Map<int,int>

Full name: Snippet.funGS

val comp : (‘a -> ‘a -> float)
Multiple items

val float : ‘T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<’Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<’Measure>

  implements: IComparable

  implements: IConvertible

  implements: IFormattable

  implements: IComparable<float<’Measure>>

  implements: IEquatable<float<’Measure>>

  inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float

  implements: IComparable

  implements: IFormattable

  implements: IConvertible

  implements: IComparable<float>

  implements: IEquatable<float>

  inherits: ValueType

Multiple items

val M : ‘a array

  type: ‘a array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’a>

  implements: Collections.Generic.ICollection<’a>

  implements: seq<’a>

  implements: Collections.IEnumerable

  inherits: Array

——————–

val M : ‘a array

  type: ‘a array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’a>

  implements: Collections.Generic.ICollection<’a>

  implements: seq<’a>

  implements: Collections.IEnumerable

  inherits: Array

type ‘T array = ‘T []

Full name: Microsoft.FSharp.Core.array<_>

  type: ‘T array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’T>

  implements: Collections.Generic.ICollection<’T>

  implements: seq<’T>

  implements: Collections.IEnumerable

  inherits: Array

Multiple items

val W : ‘a array

  type: ‘a array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’a>

  implements: Collections.Generic.ICollection<’a>

  implements: seq<’a>

  implements: Collections.IEnumerable

  inherits: Array

——————–

val W : ‘a array

  type: ‘a array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’a>

  implements: Collections.Generic.ICollection<’a>

  implements: seq<’a>

  implements: Collections.IEnumerable

  inherits: Array

val Windices : int list

  type: int list

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<int>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<int>

  implements: Collections.IEnumerable

val W : ‘a array

  type: ‘a array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’a>

  implements: Collections.Generic.ICollection<’a>

  implements: seq<’a>

  implements: Collections.IEnumerable

  inherits: Array

property Array.Length: int
val Munproposed : (int * int list) list

  type: (int * int list) list

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<int * int list>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<int * int list>

  implements: Collections.IEnumerable

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<’T> =

  | ( [] )

  | ( :: ) of ‘T * ‘T list

  with

    interface Collections.IEnumerable

    interface Collections.Generic.IEnumerable<’T>

    member Head : ‘T

    member IsEmpty : bool

    member Item : index:int -> ‘T with get

    member Length : int

    member Tail : ‘T list

    static member Cons : head:’T * tail:’T list -> ‘T list

    static member Empty : ‘T list

  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<’T>

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<’T>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<’T>

  implements: Collections.IEnumerable

val init : int -> (int -> ‘T) -> ‘T list

Full name: Microsoft.FSharp.Collections.List.init

val M : ‘a array

  type: ‘a array

  implements: ICloneable

  implements: Collections.IList

  implements: Collections.ICollection

  implements: Collections.IStructuralComparable

  implements: Collections.IStructuralEquatable

  implements: Collections.Generic.IList<’a>

  implements: Collections.Generic.ICollection<’a>

  implements: seq<’a>

  implements: Collections.IEnumerable

  inherits: Array

val mi : int

  type: int

  implements: IComparable

  implements: IFormattable

  implements: IConvertible

  implements: IComparable<int>

  implements: IEquatable<int>

  inherits: ValueType

val sortFun : (int -> float)
val wi : int

  type: int

  implements: IComparable

  implements: IFormattable

  implements: IConvertible

  implements: IComparable<int>

  implements: IEquatable<int>

  inherits: ValueType

val sortBy : (‘T -> ‘Key) -> ‘T list -> ‘T list (requires comparison)

Full name: Microsoft.FSharp.Collections.List.sortBy

val findMarriages : ((int * int list) list -> Map<int,(int * int list)> -> Map<int,(int * int list)>)
val mSingle : (int * int list) list

  type: (int * int list) list

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<int * int list>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<int * int list>

  implements: Collections.IEnumerable

val wEngaged : Map<int,(int * int list)>

  type: Map<int,(int * int list)>

  implements: IComparable

  implements: Collections.Generic.IDictionary<int,(int * int list)>

  implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<int,(int * int list)>>

  implements: seq<Collections.Generic.KeyValuePair<int,(int * int list)>>

  implements: Collections.IEnumerable

val bachelors : (int * int list) list

  type: (int * int list) list

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<int * int list>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<int * int list>

  implements: Collections.IEnumerable

val rest : int list

  type: int list

  implements: Collections.IStructuralEquatable

  implements: IComparable<List<int>>

  implements: IComparable

  implements: Collections.IStructuralComparable

  implements: Collections.Generic.IEnumerable<int>

  implements: Collections.IEnumerable

val m : int * int list
Multiple items

module Map

from Microsoft.FSharp.Collections

——————–

type Map<’Key,’Value (requires comparison)> =

  class

    interface Collections.IEnumerable

    interface IComparable

    interface Collections.Generic.IEnumerable<Collections.Generic.KeyValuePair<’Key,’Value>>

    interface Collections.Generic.ICollection<Collections.Generic.KeyValuePair<’Key,’Value>>

    interface Collections.Generic.IDictionary<’Key,’Value>

    new : elements:seq<’Key * ‘Value> -> Map<’Key,’Value>

    member Add : key:’Key * value:’Value -> Map<’Key,’Value>

    member ContainsKey : key:’Key -> bool

    override Equals : obj -> bool

    member Remove : key:’Key -> Map<’Key,’Value>

    member TryFind : key:’Key -> ‘Value option

    member Count : int

    member IsEmpty : bool

    member Item : key:’Key -> ‘Value with get

  end

Full name: Microsoft.FSharp.Collections.Map<_,_>

  type: Map<’Key,’Value>

  implements: IComparable

  implements: Collections.Generic.IDictionary<’Key,’Value>

  implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<’Key,’Value>>

  implements: seq<Collections.Generic.KeyValuePair<’Key,’Value>>

  implements: Collections.IEnumerable

val tryFind : ‘Key -> Map<’Key,’T> -> ‘T option (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.tryFind

union case Option.None: Option<’T>
val add : ‘Key -> ‘T -> Map<’Key,’T> -> Map<’Key,’T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.add

union case Option.Some: ‘T -> Option<’T>
val m’ : int * int list
val mi’ : int

  type: int

  implements: IComparable

  implements: IFormattable

  implements: IConvertible

  implements: IComparable<int>

  implements: IEquatable<int>

  inherits: ValueType

val empty<’Key,’T (requires comparison)> : Map<’Key,’T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.empty

val map : (‘Key -> ‘T -> ‘U) -> Map<’Key,’T> -> Map<’Key,’U> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.map

val alignJaroWinkler : (‘a array -> ‘a array -> Map<int,int>)

Full name: Snippet.alignJaroWinkler


13
Sep 11

Record Linkage in F# – Token Matching, Stable Marriages and the Gale-Shapley algorithm

Initially, one of the biggest problems I found when trying to marry records was the god awful quality of much of data I often have to work with. It’s mostly old mainframe and database data with truncated fields, limited character sets and fields with nonsensical contents. Even worse, much of the data is the result of years of merged account lists from different sources and so global statistical techniques often fail quite spectacularly.

While naive compared to doing something along the lines of clustering, I have found that bag-of-words token matching has proven to be quite robust when comparing names. The idea is that if you consider a name as an unordered set of tokens, you can then find the best global alignment with some target name and that this will indicate the proper ordering in the majority of cases. Finally you can score these ordered tokens with an algorithm (like Jaro-Winker for example), average them, and end up with a good representation of the similarity of the two given names.

(Of course, name is just one of many things to look at. After all, I’m pretty sure that all these John Smiths aren’t the same guy.)

At first you might think finding this alignment is rather simple. Just score every permutation of the two sets of tokens and return the maximum, right? Well, this will work great for most traditional American names as they have only 2-3 tokens. Three factorial is only 6 different token arrangements to try, not a big deal at all. However, this method has serious limitations in practice. What happens when you run up against a 9 token Arabic name? Nine factorial is 362,880 different permutations to try, that’s a lot of grinding for just one record. That’s also just considering real names, if someone were to accidentally dump a paragraph into your algorithm you may just want to grab a snickers.

Don’t despair though, the stable marriages problem and the Gale-Shapely algorithm are here to help.

In the stable marriages problem you have two sets of things you want to match up with each other (described in an unambiguously sexist way as men and women), and each has a way of rating the others. The question is, how might we match up all the men and women so that there’s no two people who would rather be with each other than who they are already with. Note that this does not mean everyone gets their first choice, what we are looking for is global optimality of the pairs.

This sounds familiar right? It’s a slightly more complex version of exactly what we’re trying to do with tokens! So, the solution to this problem is called the Gale-Shapley algorithm, and get this, it’s only worst case O(M*W) where M is the number of men tokens and W is the number of women!

To turn this all into plain English, we can use this algorithm to align that 9-token name in at most 9 * 9 = 81 steps, and we’re guaranteed to get a result that is just as good!

It works like this: Men take turns proposing to each woman in order of preference. If she’s single they are immediately “engaged”, if she’s already engaged the woman picks the man she likes best and the loser goes back into the singles pool. This happens over and over until all single men are engaged or out of options.

In our case we’re actually creating sorted lists for each man of each woman ahead of time, so it’s O(M*WLog(W)), but it’s only a very minor increase in complexity, and how would you do that lazily anyhow?

First, let’s take a look at the Wikipedia article’s very imperative pseudocode:

function stableMatching { Initialize all m ? M and w ? W to free while ? free man m who still has a woman w to propose to { w = m's highest ranked such woman who he has not proposed to yet if w is free (m, w) become engaged else some pair (m', w) already exists if w prefers m to m' (m, w) become engaged m' becomes free else (m', w) remain engaged } }

Now, I’m not a big fan of imperative code so I wrote a pure functional version instead. Anyhow, having only two immutable data structures makes for a much easier read. It’s pretty much a long list of cases.

 1: open System
 2: 
 3: // a Bachelor is an identity index and an 
 4: // ordered list of women indicies to approach.
 5: type Bachelor = int * int list
 6: 
 7: // Some notation:
 8: // wi = woman index (int)
 9: // mi = man index (int)
10: // mi' = woman's current partner index (int)
11: // m = man with index and unapproached women indices (Bachelor)
12: // mSingle = men that are single (Bachelor list)
13: // wEngaged = engagements from women to men (int, Bachelor)
14: 
15: let funGS (comp: _ -> _ -> float) (M: _ array) (W: _ array) =
16:   let Windices = [ 0 .. W.Length - 1 ]
17:   // List of men with women in order of desire  
18:   let Munproposed = 
19:     List.init M.Length 
20:       (fun mi -> 
21:            let sortFun wi = 1.0 - (comp M.[mi] W.[wi])
22:            mi, Windices |> List.sortBy sortFun)
23:   // Recursively solve stable marriages
24:   let rec findMarriages mSingle wEngaged =
25:     match mSingle with
26:     // No single guys left with desired women, we're done
27:     | [] -> wEngaged
28:     // Guy is out of luck, remove from singles
29:     | (mi, []) :: bachelors -> findMarriages bachelors wEngaged
30:     // He's got options!
31:     | (mi, wi :: rest) :: bachelors -> 
32:       let m = mi, rest
33:       match wEngaged |> Map.tryFind wi with
34:       // She's single! m is now engaged!
35:       | None -> findMarriages bachelors (wEngaged |> Map.add wi m)
36:       // She's already engaged, let the best man win!
37:       | Some (m') -> 
38:         let mi', _ = m'
39:         if comp W.[wi] M.[mi] > comp W.[wi] M.[mi'] then 
40:           // Congrats mi, he is now engaged to wi
41:           // The previous suitor (mi') is bested 
42:           findMarriages 
43:             (m' :: bachelors) 
44:             (wEngaged |> Map.add wi m)
45:         else
46:           // The current bachelor (mi) lost, better luck next time
47:           findMarriages 
48:             (m :: bachelors) 
49:             wEngaged
50:   findMarriages Munproposed Map.empty
51:   // Before returning, remove unproposed lists from man instances  
52:   |> Map.map (fun wi m -> let mi, _ = m in mi)  
53: 
54: // By the supreme power of partial application I give you 
55: // Jaro-Winkler Token Alignment with Gale-Shapely in one line!
56: 
57: let alignJaroWinkler = funGS jaroWinkler

namespace System
type Bachelor = int * int list

Full name: Snippet.Bachelor

Multiple items

val int : 'T -> int (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.int

——————–

type int<'Measure> = int

Full name: Microsoft.FSharp.Core.int<_>

  type: int<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<int<'Measure>>
  implements: IEquatable<int<'Measure>>
  inherits: ValueType

——————–

type int = int32

Full name: Microsoft.FSharp.Core.int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

type 'T list = List<'T>

Full name: Microsoft.FSharp.Collections.list<_>

  type: 'T list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val funGS : ('a -> 'a -> float) -> 'a array -> 'a array -> Map<int,int>

Full name: Snippet.funGS

val comp : ('a -> 'a -> float)
Multiple items

val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<float<'Measure>>
  implements: IEquatable<float<'Measure>>
  inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Multiple items

val M : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

——————–

val M : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

type 'T array = 'T []

Full name: Microsoft.FSharp.Core.array<_>

  type: 'T array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'T>
  implements: Collections.Generic.ICollection<'T>
  implements: seq<'T>
  implements: Collections.IEnumerable
  inherits: Array

Multiple items

val W : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

——————–

val W : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val Windices : int list

  type: int list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int>
  implements: Collections.IEnumerable

val W : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

property Array.Length: int
val Munproposed : (int * int list) list

  type: (int * int list) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int list>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int list>
  implements: Collections.IEnumerable

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface Collections.IEnumerable
    interface Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val init : int -> (int -> 'T) -> 'T list

Full name: Microsoft.FSharp.Collections.List.init

val M : 'a array

  type: 'a array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val mi : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val sortFun : (int -> float)
val wi : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val sortBy : ('T -> 'Key) -> 'T list -> 'T list (requires comparison)

Full name: Microsoft.FSharp.Collections.List.sortBy

val findMarriages : ((int * int list) list -> Map<int,(int * int list)> -> Map<int,(int * int list)>)
val mSingle : (int * int list) list

  type: (int * int list) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int list>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int list>
  implements: Collections.IEnumerable

val wEngaged : Map<int,(int * int list)>

  type: Map<int,(int * int list)>
  implements: IComparable
  implements: Collections.Generic.IDictionary<int,(int * int list)>
  implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<int,(int * int list)>>
  implements: seq<Collections.Generic.KeyValuePair<int,(int * int list)>>
  implements: Collections.IEnumerable

val bachelors : (int * int list) list

  type: (int * int list) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int list>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int list>
  implements: Collections.IEnumerable

val rest : int list

  type: int list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int>
  implements: Collections.IEnumerable

val m : int * int list
Multiple items

module Map

from Microsoft.FSharp.Collections

——————–

type Map<'Key,'Value (requires comparison)> =
  class
    interface Collections.IEnumerable
    interface IComparable
    interface Collections.Generic.IEnumerable<Collections.Generic.KeyValuePair<'Key,'Value>>
    interface Collections.Generic.ICollection<Collections.Generic.KeyValuePair<'Key,'Value>>
    interface Collections.Generic.IDictionary<'Key,'Value>
    new : elements:seq<'Key * 'Value> -> Map<'Key,'Value>
    member Add : key:'Key * value:'Value -> Map<'Key,'Value>
    member ContainsKey : key:'Key -> bool
    override Equals : obj -> bool
    member Remove : key:'Key -> Map<'Key,'Value>
    member TryFind : key:'Key -> 'Value option
    member Count : int
    member IsEmpty : bool
    member Item : key:'Key -> 'Value with get
  end

Full name: Microsoft.FSharp.Collections.Map<_,_>

  type: Map<'Key,'Value>
  implements: IComparable
  implements: Collections.Generic.IDictionary<'Key,'Value>
  implements: Collections.Generic.ICollection<Collections.Generic.KeyValuePair<'Key,'Value>>
  implements: seq<Collections.Generic.KeyValuePair<'Key,'Value>>
  implements: Collections.IEnumerable

val tryFind : 'Key -> Map<'Key,'T> -> 'T option (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.tryFind

union case Option.None: Option<'T>
val add : 'Key -> 'T -> Map<'Key,'T> -> Map<'Key,'T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.add

union case Option.Some: 'T -> Option<'T>
val m' : int * int list
val mi' : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val empty<'Key,'T (requires comparison)> : Map<'Key,'T> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.empty

val map : ('Key -> 'T -> 'U) -> Map<'Key,'T> -> Map<'Key,'U> (requires comparison)

Full name: Microsoft.FSharp.Collections.Map.map

val alignJaroWinkler : ('a array -> 'a array -> Map<int,int>)

Full name: Snippet.alignJaroWinkler

So break out your F# Interactive and try something like:

alignJaroWinkler [|"DUDE"; "DUDERSON"|] [|"D"; "RONNY"; "DUDERSON"|]

and you’ll hopefully get something back like this:

val it : Map = map [(0, 0); (2, 1)]

This is just a map from women tokens to men tokens. (0,0) indicates that “D” matched best with “DUDE” and (2,1) indicates that “DUDERSON” matched best with “DUDERSON”. Poor “RONNY” ended up being the old maid.

That’s it for today, but of course, there’s a whole lot more to cover here. We need to look at how to manipulate this output into something a bit nicer. I also have a bunch of tweaks and improvements to this pair of algorithms that I want to talk about. Finally, I want to sidetrack for a bit and tunnel into how one goes from from the horrible imperative pseudocode above, to the magnificent pure functional code underneath it.

As per usual, all of the code use in this post (and previous posts in the series) can be found on github.


6
Sep 11

Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 2)

Last time we dove into the Jaro distance algorithm and picked apart how each of its components are calculated. However, from a modern perspective Jaro alone is a rather weak method of string matching. It was Winkler’s extension that brought this algorithm into widespread modern use.

Matthew Jaro’s insight when inventing the Jaro distance algorithm was that humans tend to make certain kinds of errors. However, while working on the 1990 census William Winkler had an additional insight: these kinds of errors are much more likely to occur later in the string. By simply wrapping Jaro’s algorithm to forgive some penalty according to the similarity of the first few characters, Winkler was able to get significantly better results with little additional overhead.

Winker's Extension

Here we see that the Jaro-Winkler distance (dw) is equal to the result of the Jaro distance (dj) plus one minus that same value times some weighted metric (lp). This is a great mathematical trick for two reasons. First, as long as the weighted metric (lp) doesn’t exceed 1, the final result will stay within the 0-1 range of the Jaro metric. Second, it guarantees that the result of Jaro-Winkler will never be lower than the result of Jaro alone. It effectively lets Winkler enrich the results of Jaro by filling in the remaining gap.

The meaning of the weighted metric (lp) is even more simple. Here l is just the number of initial characters which match, up to a maximum of four. Meanwhile p is a weight which can be up to 1/4, or one over the maximum possible value of l.

As p gets larger and larger, more and more of the gap will be filled in by Winkler’s extension. When p is 1/4, strings where first four characters match will always be given a perfect score. Just as we’d never want strings like “JOHN” and “JOHNSON” to be considered a perfect match, we’d never want to use a p value of 1/4. After much experimentation, Winkler recommends using a p value of 0.1, which is also what I use.

Now that we’ve covered how the math works, let’s take a look at an actual implementation. If you’d like to see an implementation of the Jaro distance algorithm take a look at the previous installment of Record Linkage Algorithms in F#.

 1: /// Calculate the Jaro-Winkler distance of s1 and s2
 2: let jaroWinkler s1 s2 = 
 3:     let jaroScore = jaro s1 s2
 4:     // Accumulate the number of matching initial characters
 5:     let maxLength = (min s1.Length s2.Length) - 1
 6:     let rec calcL i acc =
 7:         if i > maxLength || s1.[i] <> s2.[i] then acc
 8:         else calcL (i + 1) (acc + 1.0)
 9:     let l = min (calcL 0 0.0) 4.0
10:     // Calculate the JW distance
11:     let p = 0.1
12:     jaroScore + (l * p * (1.0 - jaroScore))

In my implementation I’ve allowed for some slight inefficiency in order to make it easier to play with the p value and number of characters examined. So far I’ve found Winkler’s choice of four characters to be the best, which incidentally has been shown as a great number of initial characters to look at when using a number of record linkage algorithms on things in the English language. However, I suspect that other values may work better when working in other languages.

The math here is so simple that I don’t feel it’s worth breaking down further, but I’ve included my tests built from the examples in the Jaro-Winkler distance Wikipedia article. For a fuller understanding just break out the debugger and play a bit on your own.

14: open Xunit
15: 
16: [<Fact>]
17: let ``Jaro-Winkler identity test`` () = 
18:     let result = jaroWinkler "RICK" "RICK"
19:     Assert.Equal("1.000", String.Format("{0:0.000}", result))
20: 
21: [<Fact>]
22: let ``Jaro-Winkler martha test`` () = 
23:     let result = jaroWinkler "MARTHA" "MARHTA"
24:     Assert.Equal("0.961", String.Format("{0:0.000}", result))
25: 
26: [<Fact>]
27: let ``Jaro-Winkler dwayne test`` () = 
28:     let result = jaroWinkler "DWAYNE" "DUANE"
29:     Assert.Equal("0.840", String.Format("{0:0.000}", result))
30: 
31: [<Fact>]
32: let ``Jaro-Winkler dixon test`` () =
33:     let result = jaroWinkler "DIXON" "DICKSONX"
34:     Assert.Equal("0.813", String.Format("{0:0.000}", result))
35: 

val jaroWinkler : 'a -> 'b -> float

Full name: Snippet.jaroWinkler

Calculate the Jaro-Winkler distance of s1 and s2

val s1 : 'a
val s2 : 'b
val jaroScore : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val maxLength : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

val calcL : (int -> float -> float)
val i : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val acc : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val l : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val p : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

val result : float

  type: float
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<float>
  implements: System.IEquatable<float>
  inherits: System.ValueType

module String

from Microsoft.FSharp.Core

Multiple items

type Format<'Printer,'State,'Residue,'Result,'Tuple> = PrintfFormat<'Printer,'State,'Residue,'Result,'Tuple>

Full name: Microsoft.FSharp.Core.Format<_,_,_,_,_>

  type: Format<'Printer,'State,'Residue,'Result,'Tuple>
  inherits: PrintfFormat<'Printer,'State,'Residue,'Result>

——————–

type Format<'Printer,'State,'Residue,'Result> = PrintfFormat<'Printer,'State,'Residue,'Result>

Full name: Microsoft.FSharp.Core.Format<_,_,_,_>

In the next installment of Record Linkage Algorithms in F# we’ll take a look a method for efficient token based matching with Jaro-Winkler. Stay tuned!

Edit: All current code for this series is now available on github.


2
Sep 11

Record Linkage Algorithms in F# – Jaro-Winkler Distance (Part 1)

When first approaching the task of record linkage I was initially overwhelmed by the huge number of different algorithms available for comparing strings. Now I know that the secret to finding your way in this sea of algorithms is two fold. First, know that many are outdated and have newer and better implementations, so they can be ignored. Second, of the newest extension of each algorithm, each has a particular use depending on the kind of data you’re comparing.

Jaro-Winkler distance is a particularly good choice when comparing short strings of human entered data, such as names. This is due to it’s relative robustness against letter transpositions and it’s weighting of similarity toward the beginning of the string. When it comes to comparing names, a slightly customized version of tokenized Jaro-Winkler is one of the first things in my toolbox that I reach for.

Jaro-Winkler distance also has the benefit of being a rather easy to understand algorithm. It’s composed of two parts, Jaro’s original algorithm and Winkler’s extension.  In this installment we’ll roll up our sleeves and dig into the first part of this algorithm, Jaro distance.

The Jaro Algorithm

This is the equation for Jaro distance. It’s made up of the average of three sub-calculations.

  1. The ratio of matching characters to the length of the first string.
  2. The ratio of matching characters to the length of the second string.
  3. The ratio of non-transpositions to the number matching of characters.

My implementation has been optimized a bit, but I think it’s still rather easy to understand.

 1: open System
 2: 
 3: /// Given an offset and a radius from that offset, 
 4: /// does mChar exist in that part of str?
 5: let inline existsInWin (mChar: char) (str: string) offset rad =
 6:     let startAt = max 0 (offset - rad)
 7:     let endAt = min (offset + rad) (String.length str - 1)  
 8:     if endAt - startAt < 0 then false
 9:     else
10:         let rec exists index =
11:             if str.[index] = mChar then true
12:             elif index = endAt then false
13:             else exists (index + 1)
14:         exists startAt
15: 
16: /// The jaro distance between s1 and s2
17: let jaro s1 s2 =
18:     
19:     // The radius is half of the lesser 
20:     // of the two string lengths rounded up.
21:     let matchRadius = 
22:         let minLen = 
23:             min (String.length s1) (String.length s2) in
24:               minLen / 2 + minLen % 2
25: 
26:     // An inner function which recursively finds the number  
27:     // of matched characters within the radius.
28:     let commonChars (chars1: string) (chars2: string) =
29:         let rec inner i result = 
30:             match i with
31:             | -1 -> result
32:             | _ -> if existsInWin chars1.[i] chars2 i matchRadius
33:                    then inner (i - 1) (chars1.[i] :: result)
34:                    else inner (i - 1) result
35:         inner (chars1.Length - 1) []
36: 
37:     // The sets of common characters and their lengths as floats 
38:     let c1 = commonChars s1 s2
39:     let c2 = commonChars s2 s1
40:     let c1length = float (List.length c1)
41:     let c2length = float (List.length c2)
42:     
43:     // The number of transpositions within 
44:     // the sets of common characters.
45:     let transpositions = 
46:         let rec inner cl1 cl2 result = 
47:             match cl1, cl2 with
48:             | [], _ | _, [] -> result
49:             | c1h :: c1t, c2h :: c2t -> 
50:                 if c1h <> c2h
51:                 then inner c1t c2t (result + 1.0)
52:                 else inner c1t c2t result
53:         let mismatches = inner c1 c2 0.0
54:         // If one common string is longer than the other
55:         // each additional char counts as half a transposition
56:         (mismatches + abs (c1length - c2length)) / 2.0
57: 
58:     let s1length = float (String.length s1)
59:     let s2length = float (String.length s2)
60:     let tLength = max c1length c2length
61: 
62:     // The jaro distance as given by 
63:     // 1/3 ( m2/|s1| + m1/|s2| + (mc-t)/mc )
64:     let result = (c1length / s1length +
65:                   c2length / s2length + 
66:                   (tLength - transpositions) / tLength)
67:                  / 3.0
68: 
69:     // This is for cases where |s1|, |s2| or m are zero 
70:     if Double.IsNaN result then 0.0 else result

We’re quite lucky to have examples of the algorithm in action available right in the Wikipedia article to use as unit tests. When optimizing, you couldn’t ask for more.

Take ‘DWAYNE’ and ‘DUANE’ for example. According to the article we should end up with the following results:

DWAYNE vs DUANE

So, let’s break it down. The lesser of the two strings, DUANE, is 5 characters long. When we set minLen to 5 in (minLen / 2 + minLen % 2 is 3), we get a radius of 3.

Next we find the common characters in each direction. Comparing at DWAYNE to DUANE with a radius of 3 we can see that D, A, N and E will match, giving us a c1 = “DANE”. Similarly, comparing at DUANE to DWAYNE we get exactly the same thing. That is, c2 = “DANE” as well.

As you might have guessed by the fact both sets of common strings are the same, we have zero transpositions in this instance.

So now, we just plug in the numbers and get 1/3 * (4/6 + 4/5 + (4-0)/4) = 0.822. Not too difficult, eh?

For a better understanding of how transpositions end up working out, try walking through the debugger with the MARTHA vs MARHTA test below.

72: open Xunit
73: 
74: [<Fact>]
75: let ``Jaro identity test`` () = 
76:     let result = jaro "RICK" "RICK"
77:     Assert.Equal("1.000", String.Format("{0:0.000}", result))
78: 
79: [<Fact>]
80: let ``Jaro martha test`` () =
81:     let result = jaro "MARTHA" "MARHTA"
82:     Assert.Equal("0.944", String.Format("{0:0.000}", result))
83: 
84: [<Fact>]
85: let ``Jaro dwayne test`` () = 
86:     let result = jaro "DWAYNE" "DUANE"
87:     Assert.Equal("0.822", String.Format("{0:0.000}", result))
88: 
89: [<Fact>]
90: let ``Jaro dixon test`` () =
91:     let result = jaro "DIXON" "DICKSONX"
92:     Assert.Equal("0.767", String.Format("{0:0.000}", result))
93: 
94: 

namespace System
val existsInWin : char -> string -> 'a -> 'b -> bool (requires member ( – ) and member ( + ))

Full name: Snippet.existsInWin

Given an offset and a radius from that office,
 does mChar exist in that part of str?

val mChar : char

  type: char
  implements: IComparable
  implements: IConvertible
  implements: IComparable<char>
  implements: IEquatable<char>
  inherits: ValueType

Multiple items

val char : 'T -> char (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.char

——————–

type char = Char

Full name: Microsoft.FSharp.Core.char

  type: char
  implements: IComparable
  implements: IConvertible
  implements: IComparable<char>
  implements: IEquatable<char>
  inherits: ValueType

val str : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = String

Full name: Microsoft.FSharp.Core.string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val offset : 'a (requires member ( – ) and member ( + ))
val rad : 'b (requires member ( – ) and member ( + ))
val startAt : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val max : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.max

val endAt : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

type String =
  class
    new : char -> string
    new : char * int * int -> string
    new : System.SByte -> string
    new : System.SByte * int * int -> string
    new : System.SByte * int * int * System.Text.Encoding -> string
    new : char [] * int * int -> string
    new : char [] -> string
    new : char * int -> string
    member Chars : int -> char
    member Clone : unit -> obj
    member CompareTo : obj -> int
    member CompareTo : string -> int
    member Contains : string -> bool
    member CopyTo : int * char [] * int * int -> unit
    member EndsWith : string -> bool
    member EndsWith : string * System.StringComparison -> bool
    member EndsWith : string * bool * System.Globalization.CultureInfo -> bool
    member Equals : obj -> bool
    member Equals : string -> bool
    member Equals : string * System.StringComparison -> bool
    member GetEnumerator : unit -> System.CharEnumerator
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member IndexOf : char -> int
    member IndexOf : string -> int
    member IndexOf : char * int -> int
    member IndexOf : string * int -> int
    member IndexOf : string * System.StringComparison -> int
    member IndexOf : char * int * int -> int
    member IndexOf : string * int * int -> int
    member IndexOf : string * int * System.StringComparison -> int
    member IndexOf : string * int * int * System.StringComparison -> int
    member IndexOfAny : char [] -> int
    member IndexOfAny : char [] * int -> int
    member IndexOfAny : char [] * int * int -> int
    member Insert : int * string -> string
    member IsNormalized : unit -> bool
    member IsNormalized : System.Text.NormalizationForm -> bool
    member LastIndexOf : char -> int
    member LastIndexOf : string -> int
    member LastIndexOf : char * int -> int
    member LastIndexOf : string * int -> int
    member LastIndexOf : string * System.StringComparison -> int
    member LastIndexOf : char * int * int -> int
    member LastIndexOf : string * int * int -> int
    member LastIndexOf : string * int * System.StringComparison -> int
    member LastIndexOf : string * int * int * System.StringComparison -> int
    member LastIndexOfAny : char [] -> int
    member LastIndexOfAny : char [] * int -> int
    member LastIndexOfAny : char [] * int * int -> int
    member Length : int
    member Normalize : unit -> string
    member Normalize : System.Text.NormalizationForm -> string
    member PadLeft : int -> string
    member PadLeft : int * char -> string
    member PadRight : int -> string
    member PadRight : int * char -> string
    member Remove : int -> string
    member Remove : int * int -> string
    member Replace : char * char -> string
    member Replace : string * string -> string
    member Split : char [] -> string []
    member Split : char [] * int -> string []
    member Split : char [] * System.StringSplitOptions -> string []
    member Split : string [] * System.StringSplitOptions -> string []
    member Split : char [] * int * System.StringSplitOptions -> string []
    member Split : string [] * int * System.StringSplitOptions -> string []
    member StartsWith : string -> bool
    member StartsWith : string * System.StringComparison -> bool
    member StartsWith : string * bool * System.Globalization.CultureInfo -> bool
    member Substring : int -> string
    member Substring : int * int -> string
    member ToCharArray : unit -> char []
    member ToCharArray : int * int -> char []
    member ToLower : unit -> string
    member ToLower : System.Globalization.CultureInfo -> string
    member ToLowerInvariant : unit -> string
    member ToString : unit -> string
    member ToString : System.IFormatProvider -> string
    member ToUpper : unit -> string
    member ToUpper : System.Globalization.CultureInfo -> string
    member ToUpperInvariant : unit -> string
    member Trim : unit -> string
    member Trim : char [] -> string
    member TrimEnd : char [] -> string
    member TrimStart : char [] -> string
    static val Empty : string
    static member Compare : string * string -> int
    static member Compare : string * string * bool -> int
    static member Compare : string * string * System.StringComparison -> int
    static member Compare : string * string * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
    static member Compare : string * string * bool * System.Globalization.CultureInfo -> int
    static member Compare : string * int * string * int * int -> int
    static member Compare : string * int * string * int * int * bool -> int
    static member Compare : string * int * string * int * int * System.StringComparison -> int
    static member Compare : string * int * string * int * int * bool * System.Globalization.CultureInfo -> int
    static member Compare : string * int * string * int * int * System.Globalization.CultureInfo * System.Globalization.CompareOptions -> int
    static member CompareOrdinal : string * string -> int
    static member CompareOrdinal : string * int * string * int * int -> int
    static member Concat : obj -> string
    static member Concat : obj [] -> string
    static member Concat<'T> : System.Collections.Generic.IEnumerable<'T> -> string
    static member Concat : System.Collections.Generic.IEnumerable<string> -> string
    static member Concat : string [] -> string
    static member Concat : obj * obj -> string
    static member Concat : string * string -> string
    static member Concat : obj * obj * obj -> string
    static member Concat : string * string * string -> string
    static member Concat : obj * obj * obj * obj -> string
    static member Concat : string * string * string * string -> string
    static member Copy : string -> string
    static member Equals : string * string -> bool
    static member Equals : string * string * System.StringComparison -> bool
    static member Format : string * obj -> string
    static member Format : string * obj [] -> string
    static member Format : string * obj * obj -> string
    static member Format : System.IFormatProvider * string * obj [] -> string
    static member Format : string * obj * obj * obj -> string
    static member Intern : string -> string
    static member IsInterned : string -> string
    static member IsNullOrEmpty : string -> bool
    static member IsNullOrWhiteSpace : string -> bool
    static member Join : string * string [] -> string
    static member Join : string * obj [] -> string
    static member Join<'T> : string * System.Collections.Generic.IEnumerable<'T> -> string
    static member Join : string * System.Collections.Generic.IEnumerable<string> -> string
    static member Join : string * string [] * int * int -> string
  end

Full name: System.String

  type: String
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val length : string -> int

Full name: Microsoft.FSharp.Core.String.length

val exists : (int -> bool)
val index : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val jaro : string -> string -> float

Full name: Snippet.jaro

The jaro distance between s1 and s2

val s1 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val s2 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val matchRadius : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val minLen : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val commonChars : (string -> string -> char list)
val chars1 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val chars2 : string

  type: string
  implements: IComparable
  implements: ICloneable
  implements: IConvertible
  implements: IComparable<string>
  implements: seq<char>
  implements: Collections.IEnumerable
  implements: IEquatable<string>

val inner : (int -> char list -> char list)
val i : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val result : char list

  type: char list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<char>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<char>
  implements: Collections.IEnumerable

property String.Length: int
val c1 : char list

  type: char list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<char>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<char>
  implements: Collections.IEnumerable

val c2 : char list

  type: char list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<char>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<char>
  implements: Collections.IEnumerable

val c1length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Multiple items

val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–

type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<float<'Measure>>
  implements: IEquatable<float<'Measure>>
  inherits: ValueType

——————–

type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface Collections.IEnumerable
    interface Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val length : 'T list -> int

Full name: Microsoft.FSharp.Collections.List.length

val c2length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val transpositions : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val inner : ('a list -> 'a list -> float -> float) (requires equality)
val cl1 : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val cl2 : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val result : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val c1h : 'a (requires equality)
val c1t : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val c2h : 'a (requires equality)
val c2t : 'a list (requires equality)

  type: 'a list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'a>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'a>
  implements: Collections.IEnumerable

val mismatches : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val abs : 'T -> 'T (requires member Abs)

Full name: Microsoft.FSharp.Core.Operators.abs

val s1length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val s2length : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val tLength : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

type Double =
  struct
    member CompareTo : obj -> int
    member CompareTo : float -> int
    member Equals : obj -> bool
    member Equals : float -> bool
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member ToString : unit -> string
    member ToString : string -> string
    member ToString : System.IFormatProvider -> string
    member ToString : string * System.IFormatProvider -> string
    static val MinValue : float
    static val MaxValue : float
    static val Epsilon : float
    static val NegativeInfinity : float
    static val PositiveInfinity : float
    static val NaN : float
    static member IsInfinity : float -> bool
    static member IsNaN : float -> bool
    static member IsNegativeInfinity : float -> bool
    static member IsPositiveInfinity : float -> bool
    static member Parse : string -> float
    static member Parse : string * System.Globalization.NumberStyles -> float
    static member Parse : string * System.IFormatProvider -> float
    static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> float
    static member TryParse : string * float -> bool
    static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * float -> bool
  end

Full name: System.Double

  type: Double
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

Double.IsNaN(d: float) : bool
Multiple overloads

String.Format(format: string, args: obj []) : string

String.Format(format: string, arg0: obj) : string

String.Format(provider: IFormatProvider, format: string, args: obj []) : string

String.Format(format: string, arg0: obj, arg1: obj) : string

String.Format(format: string, arg0: obj, arg1: obj, arg2: obj) : string

I hope you enjoyed this installment of Record Linkage Algorithms in F#. Next time we’ll explore the Winkler extension to this algorithm and take a look at why weighting errors earlier in the string more heavily ends up giving significantly better results.

Edit: All current code for this series is now available on github.


11
Dec 10

An F# Ant Colony Simulation in Silverlight 4.0 with Dynamic AI Loading

I’ve been enviously watching Phillip Trelford publish excellent F# games all week and tonight I just couldn’t stand it anymore.  I stayed in, rolled up my sleeves and ported the very same ant colony simulation I used in my CUFP workshop to Silverlight 4.0.

Install Microsoft Silverlight

Wow, just look at those little guys go at it.  Silverlight sure is pretty!

As you might have noticed by the big white “Click To Load Custom AI” message, you can also compile your own AI and battle it out with what I’ve included here.  To make things easy I’ve provided a ready to go solution.  Simply load it up, compile it and select the compiled DLL.

Once you’ve loaded your AI the game will immediately start.  Best of luck to you against my little monsters, so far they’ve been undefeated.  If you happen to come up with an ant-dominating example I hope you’ll post it here in the comments.  There were some really creative ideas in the CUFP workshop and I’d love to see what could be done with more than just four hours.

If your interested in the gory details of the simulation itself, I’ve put up a github repository with the entirety of the code.  And just in case you’re wondering, I’m still planning on running that contest.  Expect a ton of cool new AI features and a big focus on combat.

Cheers!

Update:  I’ve mucked with a few things.  Mainly, it will be a bit nicer when handling AI exceptions now.  While I was at it I tweaked some of the game world parameters and so you’ll need to re-download the example solution if you already have it.