Posts Tagged: F#


7
Dec 14

Developing an Algorithm in F#: Fast Rotational Alignments with Gosper’s Hack

This post is for the 7th day of the 2014 F# Advent Calendar.

It has been said that functional languages can’t be as fast as their imperative cousins because of all of the allocation and garbage collection, this is patently false (as far as F# is concerned at least) largely because the CLR has value types.

Recently Louis Thrall and I ran into an interesting problem at work: how do you efficiently compare all rotations of one array to another to select the best? This is easy when the arrays are the same size, just shift the indices around:

Same Size Rotations

Which is equivalent to a nested for loop, the top one rotating and the inner doing comparisons.

But what about differently sized arrays? It turns out this is more complicated, even for comparing a size 2 array to size 3:

mixedrotations

Like comparing beads on two strings, there is much shifting as the indices are rotated around. The extra freedom from this shifting means that more comparisons must be made even though there total number of combined elements is smaller.

This problem also has an interesting structure, as it costs the most when one array is about half the size of the other. When the smaller array is size 1 then it simply needs to be moved around, and when the arrays are equal in size it’s a simple matter of rotations.

After some research Lou noted that this we were “counting the number of cyclic suborders of size k of a cyclic order of size n” which produces k(n choose k) different configurations.

knplot

Now that we had bounds and knew what we were doing was reasonable we had to figure out an algorithm for this process. It occurred to us that it would be ideal to pre-compute the comparison values to avoid repeated comparisons. Given this matrix of comparison values it wasn’t too difficult to determine a procedure for checking all possible configurations.

matrixpaths

Each time a pair is chosen, it further constrains what the rest of the assignments may be. The bottom level is somewhat obvious, just select each of the possible starting combinations down the y-axis. In the next column you potentially have a range of possible “downward” (using clock arithmetic) selections bounded by the difference in lengths and what has already been fixed.

Now it was time to implement, and the first stab was nice and concise and functional and worked pretty well.

 1: open System
 2: 
 3: // arr1 is always equal in size or smaller than arr2
 4: let bestRot (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
 5:     // Pre-calculate similarities
 6:     let scores = 
 7:         Array2D.init arr1.Length arr2.Length 
 8:                     (fun i j -> f arr1.[i] arr2.[j])
 9:     // inner function for recursively finding paths
10:     // col = previous column, prow = previous row
11:     // df = degrees of freedom, path = path accumulator
12:     let rec inner col prow df path =
13:         seq {
14:             // when we're out of columns to assign we're done
15:             if col >= arr1.Length then yield path |> List.rev
16:             else
17:                 // We have df "degrees of freedom" left
18:                 for d = 1 to df + 1 do 
19:                     // Clock arithmetic for the row
20:                     let nrow = (prow + d) % arr2.Length
21:                     // Recurse yielding out all further paths
22:                     yield! inner (col + 1) nrow (df + 1 - d) 
23:                                  ((col, nrow) :: path)           
24:         }
25:     let res, score =
26:         // the difference in length 
27:         // is the starting "degrees of freedom"
28:         let diff = arr2.Length - arr1.Length
29:         seq {
30:             // for each y-axis starting point
31:             for y = 0 to arr2.Length - 1 do
32:                 // 1 selected, r is the starting point (on y), 
33:                 //    starting is always 0 on x. 
34:                 // ((0, y) :: []) is the accumulator 
35:                 //    with the initial (0, y) coordinates
36:                 yield! inner 1 y diff ((0, y) :: [])
37:         } 
38:         // Sum each path to find total similarity
39:         |> Seq.map 
40:             (fun l -> l, l |> Seq.sumBy (fun (i,j) -> scores.[i,j]))
41:         // Get the path with the highest similarity
42:         |> Seq.maxBy snd
43:     // Create output array and copy in the results
44:     let out = Array.create arr1.Length Int32.MinValue
45:     for (i,j) in res do
46:         out.[i] <- j
47:     // Return results
48:     out, score
namespace System
val bestRot : ('a -> 'a -> float) -> 'a [] -> 'a [] -> int [] * float

Full name: Snippet.bestRot

val f : ('a -> 'a -> float)
Multiple items
val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–
type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<float<'Measure>>
  implements: IEquatable<float<'Measure>>
  inherits: ValueType

——————–
type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val arr1 : 'a []

  type: 'a []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val arr2 : 'a []

  type: 'a []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val scores : float [,]

  type: float [,]
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IEnumerable
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  inherits: Array

module Array2D

from Microsoft.FSharp.Collections

val init : int -> int -> (int -> int -> 'T) -> 'T [,]

Full name: Microsoft.FSharp.Collections.Array2D.init

property Array.Length: int
val i : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val j : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val inner : (int -> int -> int -> (int * int) list -> seq<(int * int) list>)
val col : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val prow : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val df : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val path : (int * int) list

  type: (int * int) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int>
  implements: Collections.IEnumerable

Multiple items
val seq : seq<'T> -> seq<'T>

Full name: Microsoft.FSharp.Core.Operators.seq

——————–
type seq<'T> = Collections.Generic.IEnumerable<'T>

Full name: Microsoft.FSharp.Collections.seq<_>

  type: seq<'T>
  inherits: Collections.IEnumerable

Multiple items
module List

from Microsoft.FSharp.Collections

——————–
type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface Collections.IEnumerable
    interface Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<'T>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<'T>
  implements: Collections.IEnumerable

val rev : 'T list -> 'T list

Full name: Microsoft.FSharp.Collections.List.rev

val d : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val nrow : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val res : (int * int) list

  type: (int * int) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int>
  implements: Collections.IEnumerable

val score : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val diff : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val y : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

module Seq

from Microsoft.FSharp.Collections

val map : ('T -> 'U) -> seq<'T> -> seq<'U>

Full name: Microsoft.FSharp.Collections.Seq.map

val l : (int * int) list

  type: (int * int) list
  implements: Collections.IStructuralEquatable
  implements: IComparable<List<int * int>>
  implements: IComparable
  implements: Collections.IStructuralComparable
  implements: Collections.Generic.IEnumerable<int * int>
  implements: Collections.IEnumerable

val sumBy : ('T -> 'U) -> seq<'T> -> 'U (requires member ( + ) and member get_Zero)

Full name: Microsoft.FSharp.Collections.Seq.sumBy

val maxBy : ('T -> 'U) -> seq<'T> -> 'T (requires comparison)

Full name: Microsoft.FSharp.Collections.Seq.maxBy

val snd : ('T1 * 'T2) -> 'T2

Full name: Microsoft.FSharp.Core.Operators.snd

val out : int []

  type: int []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<int>
  implements: Collections.Generic.ICollection<int>
  implements: seq<int>
  implements: Collections.IEnumerable
  inherits: Array

type Array =
  class
    member Clone : unit -> obj
    member CopyTo : System.Array * int -> unit
    member CopyTo : System.Array * int64 -> unit
    member GetEnumerator : unit -> System.Collections.IEnumerator
    member GetLength : int -> int
    member GetLongLength : int -> int64
    member GetLowerBound : int -> int
    member GetUpperBound : int -> int
    member GetValue : int [] -> obj
    member GetValue : int -> obj
    member GetValue : int64 -> obj
    member GetValue : int64 [] -> obj
    member GetValue : int * int -> obj
    member GetValue : int64 * int64 -> obj
    member GetValue : int * int * int -> obj
    member GetValue : int64 * int64 * int64 -> obj
    member Initialize : unit -> unit
    member IsFixedSize : bool
    member IsReadOnly : bool
    member IsSynchronized : bool
    member Length : int
    member LongLength : int64
    member Rank : int
    member SetValue : obj * int -> unit
    member SetValue : obj * int [] -> unit
    member SetValue : obj * int64 -> unit
    member SetValue : obj * int64 [] -> unit
    member SetValue : obj * int * int -> unit
    member SetValue : obj * int64 * int64 -> unit
    member SetValue : obj * int * int * int -> unit
    member SetValue : obj * int64 * int64 * int64 -> unit
    member SyncRoot : obj
    static member AsReadOnly<'T> : 'T [] -> System.Collections.ObjectModel.ReadOnlyCollection<'T>
    static member BinarySearch : System.Array * obj -> int
    static member BinarySearch<'T> : 'T [] * 'T -> int
    static member BinarySearch : System.Array * obj * System.Collections.IComparer -> int
    static member BinarySearch<'T> : 'T [] * 'T * System.Collections.Generic.IComparer<'T> -> int
    static member BinarySearch : System.Array * int * int * obj -> int
    static member BinarySearch<'T> : 'T [] * int * int * 'T -> int
    static member BinarySearch : System.Array * int * int * obj * System.Collections.IComparer -> int
    static member BinarySearch<'T> : 'T [] * int * int * 'T * System.Collections.Generic.IComparer<'T> -> int
    static member Clear : System.Array * int * int -> unit
    static member ConstrainedCopy : System.Array * int * System.Array * int * int -> unit
    static member ConvertAll<'TInput,'TOutput> : 'TInput [] * System.Converter<'TInput,'TOutput> -> 'TOutput []
    static member Copy : System.Array * System.Array * int -> unit
    static member Copy : System.Array * System.Array * int64 -> unit
    static member Copy : System.Array * int * System.Array * int * int -> unit
    static member Copy : System.Array * int64 * System.Array * int64 * int64 -> unit
    static member CreateInstance : System.Type * int -> System.Array
    static member CreateInstance : System.Type * int [] -> System.Array
    static member CreateInstance : System.Type * int64 [] -> System.Array
    static member CreateInstance : System.Type * int * int -> System.Array
    static member CreateInstance : System.Type * int [] * int [] -> System.Array
    static member CreateInstance : System.Type * int * int * int -> System.Array
    static member Exists<'T> : 'T [] * System.Predicate<'T> -> bool
    static member Find<'T> : 'T [] * System.Predicate<'T> -> 'T
    static member FindAll<'T> : 'T [] * System.Predicate<'T> -> 'T []
    static member FindIndex<'T> : 'T [] * System.Predicate<'T> -> int
    static member FindIndex<'T> : 'T [] * int * System.Predicate<'T> -> int
    static member FindIndex<'T> : 'T [] * int * int * System.Predicate<'T> -> int
    static member FindLast<'T> : 'T [] * System.Predicate<'T> -> 'T
    static member FindLastIndex<'T> : 'T [] * System.Predicate<'T> -> int
    static member FindLastIndex<'T> : 'T [] * int * System.Predicate<'T> -> int
    static member FindLastIndex<'T> : 'T [] * int * int * System.Predicate<'T> -> int
    static member ForEach<'T> : 'T [] * System.Action<'T> -> unit
    static member IndexOf : System.Array * obj -> int
    static member IndexOf<'T> : 'T [] * 'T -> int
    static member IndexOf : System.Array * obj * int -> int
    static member IndexOf<'T> : 'T [] * 'T * int -> int
    static member IndexOf : System.Array * obj * int * int -> int
    static member IndexOf<'T> : 'T [] * 'T * int * int -> int
    static member LastIndexOf : System.Array * obj -> int
    static member LastIndexOf<'T> : 'T [] * 'T -> int
    static member LastIndexOf : System.Array * obj * int -> int
    static member LastIndexOf<'T> : 'T [] * 'T * int -> int
    static member LastIndexOf : System.Array * obj * int * int -> int
    static member LastIndexOf<'T> : 'T [] * 'T * int * int -> int
    static member Resize<'T> : 'T [] * int -> unit
    static member Reverse : System.Array -> unit
    static member Reverse : System.Array * int * int -> unit
    static member Sort : System.Array -> unit
    static member Sort<'T> : 'T [] -> unit
    static member Sort : System.Array * System.Array -> unit
    static member Sort : System.Array * System.Collections.IComparer -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] -> unit
    static member Sort<'T> : 'T [] * System.Collections.Generic.IComparer<'T> -> unit
    static member Sort<'T> : 'T [] * System.Comparison<'T> -> unit
    static member Sort : System.Array * int * int -> unit
    static member Sort : System.Array * System.Array * System.Collections.IComparer -> unit
    static member Sort<'T> : 'T [] * int * int -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] * System.Collections.Generic.IComparer<'TKey> -> unit
    static member Sort : System.Array * System.Array * int * int -> unit
    static member Sort : System.Array * int * int * System.Collections.IComparer -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] * int * int -> unit
    static member Sort<'T> : 'T [] * int * int * System.Collections.Generic.IComparer<'T> -> unit
    static member Sort : System.Array * System.Array * int * int * System.Collections.IComparer -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] * int * int * System.Collections.Generic.IComparer<'TKey> -> unit
    static member TrueForAll<'T> : 'T [] * System.Predicate<'T> -> bool
  end

Full name: System.Array

  type: Array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IEnumerable
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable

val create : int -> 'T -> 'T []

Full name: Microsoft.FSharp.Collections.Array.create

type Int32 =
  struct
    member CompareTo : obj -> int
    member CompareTo : int -> int
    member Equals : obj -> bool
    member Equals : int -> bool
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member ToString : unit -> string
    member ToString : string -> string
    member ToString : System.IFormatProvider -> string
    member ToString : string * System.IFormatProvider -> string
    static val MaxValue : int
    static val MinValue : int
    static member Parse : string -> int
    static member Parse : string * System.Globalization.NumberStyles -> int
    static member Parse : string * System.IFormatProvider -> int
    static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> int
    static member TryParse : string * int -> bool
    static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * int -> bool
  end

Full name: System.Int32

  type: Int32
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

field Int32.MinValue = -2147483648

However, after real world trials we found that it was taking an astonishing 20% of our runtime. We had to find a better way. We soon found Gosper’s hack which did mostly what we wanted. It gave us the combinations, and we found we could just rotate each of those around to give us all of our cyclic suborders.

And so we went about rewriting our nice functional sample to use Gosper’s Hack to avoid the need to generate paths and also to be (nearly) minimally allocating to avoid garbage collection overhead.

 1: open System
 2: 
 3: // arr1 is always equal in size or smaller than arr2
 4: let bestRotGos (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
 5:     // Start Gosper's Hack Machinery Bootstrap
 6:     let bitsConfig = Array.create arr1.Length Int32.MinValue
 7:    
 8:     let inline setBits v =
 9:         let inline getBit x i = x &&& (1 <<< i) <> 0
10:         let mutable idx = 0
11:         for i = 0 to 31 do
12:             if getBit v i then
13:                 bitsConfig.[idx] <- i
14:                 idx <- idx + 1
15:  
16:     let inline nextGosper set =
17:         let c = set &&& -set
18:         let r = set + c
19:         r + (((r^^^set)/c)>>>2)
20:         
21:     let limit =
22:         let n = arr2.Length
23:         (1 <<< n)
24:  
25:     let mutable set =
26:         let k = arr1.Length
27:         (1 <<< k) - 1    
28:     // End Gosper's Hack Machinery Bootstrap
29: 
30:     // Pre-calculate sims and init placeholder variables
31:     let scores = Array2D.init arr1.Length arr2.Length 
32:                     (fun i j -> f arr1.[i] arr2.[j])    
33:                         
34:     let mutable maxScore = Double.NegativeInfinity
35:     let bestConfig = Array.create arr1.Length Int32.MinValue
36: 
37:     while (set < limit) do
38:         setBits set // Turn "set" bits into indices in "bitsConfig"
39: 
40:         // For each rotation
41:         for i = 0 to bitsConfig.Length - 1 do
42: 
43:             // calculate score and compare with previous best
44:             let mutable totalScore = 0.0
45:             for i = 0 to bitsConfig.Length - 1 do
46:                 let tokenScore = scores.[i,bitsConfig.[i]]
47:                 totalScore <- totalScore + tokenScore
48:             if totalScore > maxScore then
49:                 // and replace if better
50:                 maxScore <- totalScore
51:                 Array.blit bitsConfig 0 
52:                            bestConfig 0 
53:                            bitsConfig.Length 
54: 
55:             // Rotate the array
56:             let firstElement = bitsConfig.[0]
57:             for i = 0 to bitsConfig.Length - 2 do
58:                 bitsConfig.[i] <- bitsConfig.[i + 1]
59:             bitsConfig.[bitsConfig.Length - 1] <- firstElement
60:         
61:         set <- nextGosper set // Put the next combination in "set"
62:     bestConfig, maxScore
namespace System
val bestRotGos : ('a -> 'a -> float) -> 'a [] -> 'a [] -> int [] * float

Full name: Snippet.bestRotGos

val f : ('a -> 'a -> float)
Multiple items
val float : 'T -> float (requires member op_Explicit)

Full name: Microsoft.FSharp.Core.Operators.float

——————–
type float<'Measure> = float

Full name: Microsoft.FSharp.Core.float<_>

  type: float<'Measure>
  implements: IComparable
  implements: IConvertible
  implements: IFormattable
  implements: IComparable<float<'Measure>>
  implements: IEquatable<float<'Measure>>
  inherits: ValueType

——————–
type float = Double

Full name: Microsoft.FSharp.Core.float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val arr1 : 'a []

  type: 'a []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val arr2 : 'a []

  type: 'a []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<'a>
  implements: Collections.Generic.ICollection<'a>
  implements: seq<'a>
  implements: Collections.IEnumerable
  inherits: Array

val bitsConfig : int []

  type: int []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<int>
  implements: Collections.Generic.ICollection<int>
  implements: seq<int>
  implements: Collections.IEnumerable
  inherits: Array

type Array =
  class
    member Clone : unit -> obj
    member CopyTo : System.Array * int -> unit
    member CopyTo : System.Array * int64 -> unit
    member GetEnumerator : unit -> System.Collections.IEnumerator
    member GetLength : int -> int
    member GetLongLength : int -> int64
    member GetLowerBound : int -> int
    member GetUpperBound : int -> int
    member GetValue : int [] -> obj
    member GetValue : int -> obj
    member GetValue : int64 -> obj
    member GetValue : int64 [] -> obj
    member GetValue : int * int -> obj
    member GetValue : int64 * int64 -> obj
    member GetValue : int * int * int -> obj
    member GetValue : int64 * int64 * int64 -> obj
    member Initialize : unit -> unit
    member IsFixedSize : bool
    member IsReadOnly : bool
    member IsSynchronized : bool
    member Length : int
    member LongLength : int64
    member Rank : int
    member SetValue : obj * int -> unit
    member SetValue : obj * int [] -> unit
    member SetValue : obj * int64 -> unit
    member SetValue : obj * int64 [] -> unit
    member SetValue : obj * int * int -> unit
    member SetValue : obj * int64 * int64 -> unit
    member SetValue : obj * int * int * int -> unit
    member SetValue : obj * int64 * int64 * int64 -> unit
    member SyncRoot : obj
    static member AsReadOnly<'T> : 'T [] -> System.Collections.ObjectModel.ReadOnlyCollection<'T>
    static member BinarySearch : System.Array * obj -> int
    static member BinarySearch<'T> : 'T [] * 'T -> int
    static member BinarySearch : System.Array * obj * System.Collections.IComparer -> int
    static member BinarySearch<'T> : 'T [] * 'T * System.Collections.Generic.IComparer<'T> -> int
    static member BinarySearch : System.Array * int * int * obj -> int
    static member BinarySearch<'T> : 'T [] * int * int * 'T -> int
    static member BinarySearch : System.Array * int * int * obj * System.Collections.IComparer -> int
    static member BinarySearch<'T> : 'T [] * int * int * 'T * System.Collections.Generic.IComparer<'T> -> int
    static member Clear : System.Array * int * int -> unit
    static member ConstrainedCopy : System.Array * int * System.Array * int * int -> unit
    static member ConvertAll<'TInput,'TOutput> : 'TInput [] * System.Converter<'TInput,'TOutput> -> 'TOutput []
    static member Copy : System.Array * System.Array * int -> unit
    static member Copy : System.Array * System.Array * int64 -> unit
    static member Copy : System.Array * int * System.Array * int * int -> unit
    static member Copy : System.Array * int64 * System.Array * int64 * int64 -> unit
    static member CreateInstance : System.Type * int -> System.Array
    static member CreateInstance : System.Type * int [] -> System.Array
    static member CreateInstance : System.Type * int64 [] -> System.Array
    static member CreateInstance : System.Type * int * int -> System.Array
    static member CreateInstance : System.Type * int [] * int [] -> System.Array
    static member CreateInstance : System.Type * int * int * int -> System.Array
    static member Exists<'T> : 'T [] * System.Predicate<'T> -> bool
    static member Find<'T> : 'T [] * System.Predicate<'T> -> 'T
    static member FindAll<'T> : 'T [] * System.Predicate<'T> -> 'T []
    static member FindIndex<'T> : 'T [] * System.Predicate<'T> -> int
    static member FindIndex<'T> : 'T [] * int * System.Predicate<'T> -> int
    static member FindIndex<'T> : 'T [] * int * int * System.Predicate<'T> -> int
    static member FindLast<'T> : 'T [] * System.Predicate<'T> -> 'T
    static member FindLastIndex<'T> : 'T [] * System.Predicate<'T> -> int
    static member FindLastIndex<'T> : 'T [] * int * System.Predicate<'T> -> int
    static member FindLastIndex<'T> : 'T [] * int * int * System.Predicate<'T> -> int
    static member ForEach<'T> : 'T [] * System.Action<'T> -> unit
    static member IndexOf : System.Array * obj -> int
    static member IndexOf<'T> : 'T [] * 'T -> int
    static member IndexOf : System.Array * obj * int -> int
    static member IndexOf<'T> : 'T [] * 'T * int -> int
    static member IndexOf : System.Array * obj * int * int -> int
    static member IndexOf<'T> : 'T [] * 'T * int * int -> int
    static member LastIndexOf : System.Array * obj -> int
    static member LastIndexOf<'T> : 'T [] * 'T -> int
    static member LastIndexOf : System.Array * obj * int -> int
    static member LastIndexOf<'T> : 'T [] * 'T * int -> int
    static member LastIndexOf : System.Array * obj * int * int -> int
    static member LastIndexOf<'T> : 'T [] * 'T * int * int -> int
    static member Resize<'T> : 'T [] * int -> unit
    static member Reverse : System.Array -> unit
    static member Reverse : System.Array * int * int -> unit
    static member Sort : System.Array -> unit
    static member Sort<'T> : 'T [] -> unit
    static member Sort : System.Array * System.Array -> unit
    static member Sort : System.Array * System.Collections.IComparer -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] -> unit
    static member Sort<'T> : 'T [] * System.Collections.Generic.IComparer<'T> -> unit
    static member Sort<'T> : 'T [] * System.Comparison<'T> -> unit
    static member Sort : System.Array * int * int -> unit
    static member Sort : System.Array * System.Array * System.Collections.IComparer -> unit
    static member Sort<'T> : 'T [] * int * int -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] * System.Collections.Generic.IComparer<'TKey> -> unit
    static member Sort : System.Array * System.Array * int * int -> unit
    static member Sort : System.Array * int * int * System.Collections.IComparer -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] * int * int -> unit
    static member Sort<'T> : 'T [] * int * int * System.Collections.Generic.IComparer<'T> -> unit
    static member Sort : System.Array * System.Array * int * int * System.Collections.IComparer -> unit
    static member Sort<'TKey,'TValue> : 'TKey [] * 'TValue [] * int * int * System.Collections.Generic.IComparer<'TKey> -> unit
    static member TrueForAll<'T> : 'T [] * System.Predicate<'T> -> bool
  end

Full name: System.Array

  type: Array
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IEnumerable
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable

val create : int -> 'T -> 'T []

Full name: Microsoft.FSharp.Collections.Array.create

property Array.Length: int
type Int32 =
  struct
    member CompareTo : obj -> int
    member CompareTo : int -> int
    member Equals : obj -> bool
    member Equals : int -> bool
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member ToString : unit -> string
    member ToString : string -> string
    member ToString : System.IFormatProvider -> string
    member ToString : string * System.IFormatProvider -> string
    static val MaxValue : int
    static val MinValue : int
    static member Parse : string -> int
    static member Parse : string * System.Globalization.NumberStyles -> int
    static member Parse : string * System.IFormatProvider -> int
    static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> int
    static member TryParse : string * int -> bool
    static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * int -> bool
  end

Full name: System.Int32

  type: Int32
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

field Int32.MinValue = -2147483648
val setBits : (int -> unit)
val v : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val getBit : (int -> int32 -> bool)
val x : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val i : int32

  type: int32
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val mutable idx : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val i : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val nextGosper : ('b -> 'd) (requires member ( ~- ) and member ( &&& ) and member ( + ) and member ( ^^^ ) and member ( / ) and member ( + ) and member ( >>> ))
val set : 'b (requires member ( ~- ) and member ( &&& ) and member ( + ) and member ( ^^^ ) and member ( / ) and member ( + ) and member ( >>> ))
val c : 'b (requires member ( ~- ) and member ( &&& ) and member ( + ) and member ( ^^^ ) and member ( / ) and member ( + ) and member ( >>> ))
val r : 'b (requires member ( ~- ) and member ( &&& ) and member ( + ) and member ( ^^^ ) and member ( / ) and member ( + ) and member ( >>> ))
val limit : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val n : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val mutable set : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val k : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val scores : float [,]

  type: float [,]
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IEnumerable
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  inherits: Array

module Array2D

from Microsoft.FSharp.Collections

val init : int -> int -> (int -> int -> 'T) -> 'T [,]

Full name: Microsoft.FSharp.Collections.Array2D.init

val j : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

val mutable maxScore : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

type Double =
  struct
    member CompareTo : obj -> int
    member CompareTo : float -> int
    member Equals : obj -> bool
    member Equals : float -> bool
    member GetHashCode : unit -> int
    member GetTypeCode : unit -> System.TypeCode
    member ToString : unit -> string
    member ToString : string -> string
    member ToString : System.IFormatProvider -> string
    member ToString : string * System.IFormatProvider -> string
    static val MinValue : float
    static val MaxValue : float
    static val Epsilon : float
    static val NegativeInfinity : float
    static val PositiveInfinity : float
    static val NaN : float
    static member IsInfinity : float -> bool
    static member IsNaN : float -> bool
    static member IsNegativeInfinity : float -> bool
    static member IsPositiveInfinity : float -> bool
    static member Parse : string -> float
    static member Parse : string * System.Globalization.NumberStyles -> float
    static member Parse : string * System.IFormatProvider -> float
    static member Parse : string * System.Globalization.NumberStyles * System.IFormatProvider -> float
    static member TryParse : string * float -> bool
    static member TryParse : string * System.Globalization.NumberStyles * System.IFormatProvider * float -> bool
  end

Full name: System.Double

  type: Double
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

field Double.NegativeInfinity = -Infinity
val bestConfig : int []

  type: int []
  implements: ICloneable
  implements: Collections.IList
  implements: Collections.ICollection
  implements: Collections.IStructuralComparable
  implements: Collections.IStructuralEquatable
  implements: Collections.Generic.IList<int>
  implements: Collections.Generic.ICollection<int>
  implements: seq<int>
  implements: Collections.IEnumerable
  inherits: Array

val mutable totalScore : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val tokenScore : float

  type: float
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<float>
  implements: IEquatable<float>
  inherits: ValueType

val blit : 'T [] -> int -> 'T [] -> int -> int -> unit

Full name: Microsoft.FSharp.Collections.Array.blit

val firstElement : int

  type: int
  implements: IComparable
  implements: IFormattable
  implements: IConvertible
  implements: IComparable<int>
  implements: IEquatable<int>
  inherits: ValueType

A quick test shows that the new algorithm is about seven times faster across a variety of test inputs:

 1: let f x y = if x = y then 1.0 else 0.0
 2: 
 3: let cmp = [|"one"; "two"; "three"; "four"|]
 4: let cmps = 
 5:     [| for i = 0 to cmp.Length - 1 do 
 6:            for j = i to cmp.Length - 1 do 
 7:                // Longer array is second
 8:                yield cmp.[j..], cmp.[i..] 
 9:     |] 
10: 
11: #time
12: for i = 0 to 1000000 do
13:     for c1,c2 in cmps do 
14:         bestRot f c1 c2 |> ignore
15: //Real: 00:00:41.800, CPU: 00:00:41.812, 
16: // GC gen0: 10175, gen1: 6, gen2: 1
17: #time 
18: 
19: #time
20: for i = 0 to 1000000 do
21:     for c1,c2 in cmps do 
22:         bestRotGos f c1 c2 |> ignore
23: //Real: 00:00:06.378, CPU: 00:00:06.375, 
24: // GC gen0: 689, gen1: 1, gen2: 0
25: #time 
26: 
val f : 'a -> 'a -> float (requires equality)

Full name: Snippet.f

val x : 'a (requires equality)
val y : 'a (requires equality)
val cmp : string []

Full name: Snippet.cmp

  type: string []
  implements: System.ICloneable
  implements: System.Collections.IList
  implements: System.Collections.ICollection
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.IStructuralEquatable
  implements: System.Collections.Generic.IList<string>
  implements: System.Collections.Generic.ICollection<string>
  implements: seq<string>
  implements: System.Collections.IEnumerable
  inherits: System.Array

val cmps : (string [] * string []) []

Full name: Snippet.cmps

  type: (string [] * string []) []
  implements: System.ICloneable
  implements: System.Collections.IList
  implements: System.Collections.ICollection
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.IStructuralEquatable
  implements: System.Collections.Generic.IList<string [] * string []>
  implements: System.Collections.Generic.ICollection<string [] * string []>
  implements: seq<string [] * string []>
  implements: System.Collections.IEnumerable
  inherits: System.Array

val i : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

property System.Array.Length: int
val j : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val c1 : string []

  type: string []
  implements: System.ICloneable
  implements: System.Collections.IList
  implements: System.Collections.ICollection
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.IStructuralEquatable
  implements: System.Collections.Generic.IList<string>
  implements: System.Collections.Generic.ICollection<string>
  implements: seq<string>
  implements: System.Collections.IEnumerable
  inherits: System.Array

val c2 : string []

  type: string []
  implements: System.ICloneable
  implements: System.Collections.IList
  implements: System.Collections.ICollection
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.IStructuralEquatable
  implements: System.Collections.Generic.IList<string>
  implements: System.Collections.Generic.ICollection<string>
  implements: seq<string>
  implements: System.Collections.IEnumerable
  inherits: System.Array

val ignore : 'T -> unit

Full name: Microsoft.FSharp.Core.Operators.ignore

Of course, there are a few more things that could be done to make this even faster mostly related to further reducing allocations. For example, we could potentially use larger fixed sized arrays if we knew our input was bounded and pass in the arrays as references letting the calling function handle proper recycling.

However, at this point the function no longer appeared anywhere near significant when profiling our overall usage and so we found this was fast enough. It’s not worth making things uglier and more prone to breakage for minor gains when what we have right now is a referentially transparent function with predictable behavior.


30
Dec 13

My 2013 F# Year in Review

It’s been a great year for F# with the blossoming of the fsharp.org working groups. It’s been amazing watching the community come together to form a movement outside of Microsoft. This is most certainly the long term future of F#, protected from the whims of layer upon layer of management. Who knows, in the coming year we might even see community contributions to the F# Core libraries. Who would have thought that would ever have been possible?

I’m very happy to see that Sergey Tihon has maintained his wonderful weekly roundup of F# community goings on. It’s a big time investment week after week to keep the weekly news going. After leaving Atalasoft, and no longer being paid to blog on a regular basis, I found I couldn’t keep investing the time and felt very badly about not being able to continue my own weekly roundups. Sergey has picked up that mantle with a passion, and I’m so very glad for this extremely useful service he provides to the community.

Meanwhile Howard Mansell and Tomas Petricek (at his BlueMountain sabbatical), worked toward building a bunch of great new tools for data science in F#. The R Type Provider has become extremely polished and while Deedle may be fresh out of the oven, it already rivals pandas in its ability to easily manipulate data.

At Bayard Rock Paulmichael Blasucci, Peter Rosconi, and I have been working on a few small community contributions as well. iFSharp Notebook (An F# Kernel for iPython Notebook) is in a working and useful state, but is still missing intellisense and type information as the iPython API wasn’t really designed with that kind of interaction in mind. The Matlab Type Provider is also in a working state, but still missing some features (I would love to have some community contributions if anyone is interested). Also in the works is a nice set of F# bindings for the ACE Editor, I’m hoping we can release those early next year.

Finally, I wanted to mention what a great time I had at both the F# Tutorials both in London and in NYC this year. I also must say that the London F# culture is just fantastic; Phil is a thoughtful and warm community organizer and it shows in his community. I’ve been a bit lax in my bloggings but they were truly both wonderful events and are getting better with each passing year.

F# Tutorials NYC 2013 Group Photo

F# Tutorials NYC 2013

That right there was the highlight of my year. Just look at all of those smiling functional programmers.


13
Aug 13

All Machine Learning Platforms are Terrible (but some less so)

I recently took a medium sized feature set with labels at work and ran it through some of the most popular machine learning platforms. The goal was to get a feel for each of them via the standard battery of regressions and evaluate each for use in further experimentation.  This is a review of my journey.

Experimental Setup:
Features and Labels in a ~500mb CSV file
Labeled Records: ~140,000
Features: ~3500 binary, labels [0-100]
Hardware: 4 x 8 = 32 cores, 256gb of ram
OS: Windows Server 2008r2

- F# with Math.NET -
I used F# to build the features for the rest of these processes. It was quite nice using the SQL Type Provider to grab the data out of the database and then process it into binary features, even though it consisted of fourteen unoptimized tables across two SQL Server databases with rather odd relationships. I did this step-by-step while trying new features out on a hand written iterative linear regression function I wrote in a single line of F#. The syntax with Math.NET is almost exactly the same as Matlab and so it came quite easily. On top of that the linear algebra was quite fast using Math.NET’s MKL Linear Algebra provider.

While Math.NET is under constant work by some really smart folks, it currently only supports a few non-iterative linear solvers with MKL. Iterative linear regression was easy enough to do by hand, but I wanted to try some of the more complex regressions without worrying if I implemented them properly. Once I had my features sorted it was obvious that it was time to move on.

- R 2.14.? -
R was easy to install and get running. It was nice to have the package manager built right in to the console. However, from there on it was all down hill. Loading the data file took forever, approximately 10 minutes with the standard CSV machinery. Once it was loaded, it was just one out of memory exception after another. I tried to run several regressions but I wasn’t able to complete a single experiment and many took quite a long time to fail. All signs point to poor garbage collection in the R runtime.

Blinded by my frustration, I ended up buying Revolution R, but they skirt the problem by using their own file based format and have a limited handful of regressions on that format. I’m holding out hope that things will be better in R 3.0 as they’ve finally removed the 32-bit memory limitation. Still, given the state of Python (see below) I don’t think there’s any compelling reason to revisit R at all.

- Matlab 2013a -
I already own the base Matlab 2013a and use it on a regular basis, but I wanted to avoid shelling out the $5000+ for the toolkits needed for this project before making sure they could do what I wanted (and not choke on my data like R), so I requested a trial. It was quite an ordeal, I had to wait for an actual sales agent to call me on the phone, discuss what I was trying to do, and request that my license be sent multiple times via email (they kept forgetting?). I swear I’ve never had such a difficult software customer experience before. I still don’t understand why I couldn’t just download a trial from their site.

In any case, right when we finally got it all sorted we experienced some technical difficulties with our server room overheating and had to have the beastly box relocated. Two months or so later my hardware is back up and running at a better location but my toolbox trials have long since expired. I’m currently exploring other options before I go back groveling for an extended trial.

- Scikit-learn via WinPython 64-bit 2.7.5.1 -
The hardest part in getting started was picking a Scikit distribution, there’s at least three popular ones for Windows. I ended up going with WinPython because it was MIT licenced and I try not to bring the GPL into my workplace whenever I can avoid it. You’d never want GPL code to accidentally make its way into anything that leaves the building.

First impressions were great, the CSV file loaded in under 15 seconds with pandas, and it was quite a revelation that I could take a pandas table and just pass it in to these scikit functions as if it were a matrix, very slick. However it’s not all roses, I spent a lot of my first day trying to figure out why the basic linear regression was giving nonsensical results. After some inspection, it looks like an numerical overflow somewhere in the depths is causing a few weights to become extremely large negative values. The rest of the linear models worked great however.

Then, as I was full of momentum, I’d thought I’d give the SVM stuff a go, but it turns out that for some reason Scikit disables OpenMP for LibSVM and so it’s incredibly slow. So, after a 24-hours or so of LibSVM puttering away at 3% overall CPU usage, I thought I’d just load up another Spyder instance and keep working while this chugs along. No such luck, you can only have one Spyder window open at a time.

In fact, I think Spyder is by far the weakest part of the Scikit offering, it’s not only limited in terms of instances, it also has an odd tendency to lock up while the Python interpreter is busy and the variable explorer ignores some variables, I’m not sure what that’s about. Also in the box is IPython Notebook, but it doesn’t seem to like the Internet Explorer that’s on the machine and whatever solution we come up with has to eventually work in a completely locked down environment with no internet connection, and hopefully without any installed dependencies. Perhaps I’ll fare better with something like Sublime Text, but it is nice to have graphical variable inspection.

- Final Impressions - 
If I were going to recommend a setup to someone getting started today, I’d say by far and away the best choice is a Scikit distribution. It’s not without problems, but compared to the horrible mess that makes up the rest of the available options it shines. I’d also suggest trying to find a different GUI than Spyder. It’s fine for playing around, but it’s far too janky to be considered reasonable for professional day-to-day use.


22
Jul 13

The Promise of F# Language Type Providers

In most software domains you can safely stick with one or two languages and, because the tools you are using are fairly easy to replicate, you’ll find almost anything you might need to finish your project. This isn’t true in data science and data engineering however. Whether it be some hyper-optimized data structure or a cutting edge machine learning technique often you only have a single language or platform choice.

Even worse, when you want to build a system that uses one or more platform specific components, things can become quite an engineering mess. No matter what you do you can’t avoid the high cost of serialization and marshaling. This makes some combinations of tools non-options for some problems. You often make trade-offs that you shouldn’t need to make, for example using a worse algorithm just because the better option hasn’t been written for your platform.

In .NET this is a particularly bad problem. There are quite a few dedicated people working on open source libraries, but they are tiny in number compared to the Matlab, Python, R or Java communities. Meanwhile, Microsoft research has several fantastic libraries with overly restrictive licenses that make them impossible to use commercially. These libraries drive away academic competition, but at the same time can’t be used outside of academia. It’s a horrible situation.

Thankfully, there is a silver lining in this dark cloud. With the release of F# 3.0 in VS 2012 we were given a new language feature called Type Providers. Type Providers are compiler plugins that generate types at compile time and can run arbitrary code to do it. Initially, these were designed for access databases and getting types from the schema for free, but when Howard Mansell released the R Language Type Provider everything changed. We now realized that we had a way to build slick typed APIs on top of almost any other language.

This means that now it doesn’t matter if someone has written the algorithm or data structure for our platform as long as there’s a Type Provider for a platform where it has been done. The tedious work of building lots of little wrapped sub-programs is completely gone. It shouldn’t even matter if the kind of calculation you’d like to do is fast on your native platform, as you can just transparently push it to another. Of course, we still must pay the price of marshaling, but we can do it in a controlled way by dealing in handles to the other platform’s variables.

The language Type Providers themselves are a bit immature for the moment but the idea is sound and the list is growing. There is now the beginnings of an IKVM Type Provider (for Java) and I’m working on a Matlab Type Provider. The Matlab Provider doesn’t yet have all of the functionality I am aiming for, but I’ve been working on it for several months and it’s quite usable now. All that’s left is for someone to start in on a Python type provider and we’ll practically have all of the data science bases covered.

It’s an exciting time to be an F#’er.


18
Jul 13

Come join me at the SkillsMatter F# Tutorials NYC 2013

Last year was our first NYC F# tutorials and they were just amazing (you can read about them here) but this year’s are going to be even better. We’ve got a lineup including some of the most talented teachers in the F# community, and the tickets are extremely inexpensive as conferences and training events go.

Looking to learn F#? Our beginner track is jam packed with hands on exercises. It’s was amazing to see what just two days of training can do. A C# co-worker of mine was a beginner track attendee last year and delivered a project in F# just the next week.

Already have some serious F# skills? In our advanced track we’ve got a lineup that will push those skills to the limit. I personally am particularly excited to dig into the F# compiler with Don and Tomas.

Now that I’ve had my say, here’s the official spiel:

On the back of the success of the 2013 edition, the Progressive F# Tutorials return to New York in September – this time packing an even bigger punch! With F# UG lead Rick Minerich at the helm, we’ve put together a expert filled line-up – featuring Don Syme (creator of F#), Tomas Petricek, and Miguel de Icaza. The Tutorials will be split in two – a beginners track for those eager to unleash F#’s full power, and a ‘meaty track’ for those more experience f#pers amongst you! Each session will be a 4 hour hands-on deep dive, brushing aside the traditional format of conferences to allow you to truly immerse into the subject topic.

Want to get involved? We’re giving a special community 20% discount! Just go ahead and enter SkillsMatter_Community on the booking form and the team at Skills Matter will look forward to welcoming you to NYC this September!

- Check out our schedule.
- Purchase tickets.
- Read about last year’s tutorials.

Are you as excited as I am yet?


11
Jul 13

In Retrospect: QCon NYC 2013 (and a conversation with Rich Hickey on languages)

QCon NYC was the most refreshing conference I’ve been to in a very long time. Perhaps it’s partially because I’ve lingered too long in Microsoft circles, or maybe it’s just been too long since I went to a large conference. In any case, the speaker lineup was just chock full of brilliant minds from all over the world. I am honored to be counted among such an illustrious lineup.

Click for a video recording of the talk.

Click for a video recording of my talk.

My talk was well received, but the title wasn’t as descriptive of the content as I would have liked. It’s quite a challenge titling a talk six months in advance. Perhaps I should have called it something like “One language to rule them all”, or “Language de jour” but I’m not sure either of those would have gone over quite as well on the polyglot track.

Runar, Rich, and Rick

Left to right: Runar, Rick and Rich.
Paul Snively is behind the camera.

While the average quality of the talks was far and above what I’m used to at most of the conference I’ve attended, both in entertainment value and content, as usual the interspersed deep conversations were far and away the most rewarding. Of all of those deep conversations the one that stands out most in my mind was when Rich Hickey sat down with Runar Bjarnesson, Paul Snively and I for dinner. We talked quite a bit about his Datomic project, agreed on the power of immutability, and eventually discussed our differing philosophies on types.

I have immense respect for Rich Hickey. He’s a brilliant man and is almost solely responsible for kindling my interest in functional programming. He’s had a huge influence in creating the programmer that I am today, and I count him among my heroes. Now, the only case I’ve ever found myself disagreeing with him is his opinion on types and so I couldn’t help myself. With a bit of trepidation, I broached the subject. It’s funny that something so technical can be so difficult to talk about, but because we are all so passionate about our viewpoints I know we all had to be quite careful to phrase things so as not to inflame the tension.

What I learned is that Rich Hickey and I don’t disagree nearly as much as I thought. His main point was that the glue of a program shouldn’t know anything about what it’s gluing, much like a Fedex truck wasn’t designed with the contents of the boxes it carries in mind. I also tend to design programs in this way, but lean heavily on reflection to do it instead of using a dynamic language.

Even a month later, Runar’s main point of contention still seems unanswered: do generic types count as the truck being designed with the contents of the box in mind? You can argue either way here. On one hand, the code certainly knows about some of the properties of what’s in the box (for example, does it fit on the truck?), how tightly these properties constrain depends quite a bit on the language in question and its type features of course. This is actually quite useful because it keeps you from attempting to do something like putting a steamboat into your Fedex truck. The properties of the Fedex truck and the boxes it can hold must be respected.

On the other hand, you may often find yourself in a situation where your abstraction is overly limiting and the only recourse is to make changes to the type structure of the existing program in order to extend it. I think this is what Rich was getting at, and it’s true. For a true decoupled program (that is, no extra shared dependencies between sub-components) you need one of three things: 1) a meta reflection layer, 2) a dynamic language or 3) a very liberally defined type structure. In the third case it’s just extra work, with perhaps a negligible tangible benefit in terms of program safety.

In either case, the post-compilation/interpretation program eventually knows what’s in the box, it’s more of a question of when: at compile time, or when the box is first touched. Perhaps this is where the metaphor breaks down, or perhaps I’m just over thinking it. In any case it’s been a while since I reevaluated my hard-line views on types, and I’m grateful to Rich for sitting down with us and providing the impetus. After all, in my own words right from my QCon talk, it’s all about context.


3
Jul 13

On Type Safety, Representable States and Erlang

Close your eyes and imagine your program as a function that takes a set of inputs and produces a set of outputs. I know this may seem overly simple, but a set of actions in a GUI can be thought of as a set of inputs, and a set of resulting side effects to a database can be seen as a new state of the world being returned.

Now focus on its input space. This space is comprised as all possible combinations of all possible inputs. In this set some will be well defined for your program and some not. An example of a not well defined input could be as simple as an incorrect database connection string, as straightforward as an incorrect combination of flags on a console application, or as difficult to detect as a date with month and day transposed.

Input Space

A program thought of in this way is a fractal-like thing, a program made of little smaller programs, made of smaller programs yet. However, there’s no guarantee that each of these smaller programs will treat of a piece of data in exactly the same way as others. In addition to any initial validation, any top-level inputs which cause other inputs to be given to sub-programs where they are not properly handled are similarly considered not well defined. Consider these three approaches to making your program safer by reducing the size of incorrect input space:

First, you can increase the size of the blue circle with explicit input checking. This means numerous validations to ensure the program exits with proper notification when incorrect inputs are given. However, the program is fractal, and so if we want to be safe we’ll need to reproduce many of these checks fractally. A great example of this is handling null values and we all know how that turns out.

Another approach is to shrink the size of the red circle. We can do this by making fewer incorrect states representable with types. Because we know that all of the potential states are valid once encoded, we only need to do our checks once while marshalling our data into a well-typed representation. This eliminates almost all need for repeated validation, limited only by how far your type system will take you. Even better, with newer language features (such as F# type providers) we can eliminate much of this marshalling phase, this is however similarly limited by how far the schema of the data will take you.

A third approach, available only in some situations but which I find extremely fascinating, is to build everything in such a way as the entire program logs the error and resets its state when an incorrect input is found. Most paradoxically, in this case the more fragile you make your system, the safer it is (as long as you ensure that external state changes are the very last thing done, and that they’re done transactionally). This seems to be the Erlang philosophy and the only flaw I can find with it is shared with most type systems. That is, you can’t implicitly account for ambiguous inputs or state spaces that your type system can’t constrain.


27
Mar 13

Setting up F# Interactive for Machine Learning with Large Datasets

Before getting started with Machine Learning in F# Interactive it’s best to prepare it for large datasets and external 64-bit libraries so you don’t get blindsided with strange errors when you happen to cross the line. The good news is it’s a simple process that should only take a few minutes.

The first step is to go into the configuration and set fsi to 64-bit. It’s only matter of changing a boolean value buried deep in the Visual Studio settings. First, Go into Tools->Settings.

Tools->Settings

Then find the “F# Tools” section on the left and select the “F# Interactive” subsection.

tools_settings_fsi

Finally, set “64-bit F# Interactive” to true and click OK.

tools_settings_fsi_64-bit

What this does is set Visual Studio to use “FsiAnyCPU.exe” for the F# Interactive window instead of 32-bit “Fsi.exe”.

Now, after we restart Visual Studio, your F# Interactive is running with as many bits as your operating system can handle. However, if we want to support really big matrices we’re going to need to go a bit further. If we want really large arrays, that is greater than 2 gigabytes, we’re going to need to fiddle with the F# Interactive application config and enable the “gcAllowVeryLargeObjects” attribute.

For .NET 4.5 on Windows 7, Windows 8 and Windows Sever 2008R2 the standard directory for both the fsi exeuctables and their application configs is:

“C:\Program Files (x86)\Microsoft SDKs\F#\3.0\Framework\v4.0″

Navigate there and open “FsiAnyCPU.exe.config” in your favorite text editor. Then under the <runtime> tag add:

<gcAllowVeryLargeObjects enabled="true" />

When you’re done it should look like:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
    <legacyUnhandledExceptionPolicy enabled="true" />
    <assemblyBinding 
      xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity
          name="FSharp.Core"
          publicKeyToken="b03f5f7f11d50a3a"
          culture="neutral"/>
        <bindingRedirect
          oldVersion="2.0.0.0"
          newVersion="4.3.0.0"/>
        <bindingRedirect
          oldVersion="4.0.0.0"
          newVersion="4.3.0.0"/>
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>

Just save and restart Visual Studio and you’re done! Your F# Interactive can now handle large datasets and loading external 64-bit native libraries.


4
Sep 12

Levenshtein Distance and the Triangle Inequality

Levenshtein distance is one of my favorite algorithms. On the surface it seems so very simple, but when you spend some time thinking hard on it deep insights are waiting to be had.

The first and most important thing about Levenshtein distance is it’s actually a metric distance. That is, it obeys the triangle inequality. For most other string distance measurements this property doesn’t hold.

The Vector Triangle Inequality

The Vector Triangle Inequality

This might not seem like such a big deal, but this property gives the measurements meaning in a context larger than just the pair. For example, it allows you to embed your pair distance measurements into a higher dimensional space and so use it for things like clustering.

This was one of the first insights that Levenshtein distance gave me: A measurement doesn’t need to give you an absolute location in space to be useful for telling you where you are, it just has to tell you how far away everything else is. But what is it about Levenshtein distance that gives it this property? It’s not immediately obvious to most people, at least it wasn’t to me.

First, let’s consider a naive implementation of the Wagner-Fisher algorithm for Levenshtein distance. As stated above, here the triangle inequality holds.

 1: let wagnerFischer (s: string) (t: string) =
 2:     let m = s.Length
 3:     let n = t.Length
 4:     let d = Array2D.create (m + 1) (n + 1) 0
 5: 
 6:     for i = 0 to m do d.[i, 0] <- i
 7:     for j = 0 to n do d.[0, j] <- j    
 8: 
 9:     for j = 1 to n do
10:         for i = 1 to m do
11:             if s.[i-1] = t.[j-1] then
12:                 d.[i, j] <- d.[i-1, j-1]
13:             else
14:                 d.[i, j] <-
15:                     List.min
16:                         [
17:                             // a deletion
18:                             d.[i-1, j  ] + 1; 
19:                             // an insertion
20:                             d.[i  , j-1] + 1; 
21:                             // a substitution
22:                             d.[i-1, j-1] + 1; 
23:                         ]
24:     d.[m,n]

Now compare this with an incorrect version of an extension called Damerau–Levenshtein distance (or restricted edit distance). This change adds support for Jaro-Winkler like transpositions to the original algorithm. However, in the process of adding just this minor tweak we lose the triangle inequality.

26: let damerauLevenshtein (s: string) (t: string) =
27:     let m = s.Length
28:     let n = t.Length
29:     let d = Array2D.create (m + 1) (n + 1) 0
30: 
31:     for i = 0 to m do d.[i, 0] <- i
32:     for j = 0 to n do d.[0, j] <- j    
33: 
34:     for j = 1 to n do
35:         for i = 1 to m do
36:             // 1 if a substitution
37:             // 0 if no change
38:             let cost = if s.[i-1] = t.[j-1] then 0 else 1
39:             d.[i, j] <-
40:                 List.min
41:                     [
42:                         // a deletion
43:                         d.[i-1, j  ] + 1; 
44:                         // an insertion
45:                         d.[i  , j-1] + 1; 
46:                         // a substitution or nothing
47:                         d.[i-1, j-1] + cost;
48:                     ]
49:             if // boundary check
50:                i > 1 && j > 1 
51:                // transposition check
52:             && s.[i-1] = t.[j-2] && s.[i-2] = t.[j-1] 
53:             then // the lesser of a transposition or current cost
54:                 d.[i, j] <- min d.[i,j] (d.[i-2, j-2] + cost)
55:     d.[m,n]

val wagnerFischer : string -> string -> int

Full name: Snippet.wagnerFischer

val s : string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

Multiple items

val string : 'T -> string

Full name: Microsoft.FSharp.Core.Operators.string

——————–

type string = System.String

Full name: Microsoft.FSharp.Core.string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

val t : string

  type: string
  implements: System.IComparable
  implements: System.ICloneable
  implements: System.IConvertible
  implements: System.IComparable<string>
  implements: seq<char>
  implements: System.Collections.IEnumerable
  implements: System.IEquatable<string>

val m : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

property System.String.Length: int
val n : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val d : int [,]

  type: int [,]
  implements: System.ICloneable
  implements: System.Collections.IList
  implements: System.Collections.ICollection
  implements: System.Collections.IEnumerable
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.IStructuralEquatable
  inherits: System.Array

module Array2D

from Microsoft.FSharp.Collections

val create : int -> int -> 'T -> 'T [,]

Full name: Microsoft.FSharp.Collections.Array2D.create

val i : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val j : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

Multiple items

module List

from Microsoft.FSharp.Collections

——————–

type List<'T> =
  | ( [] )
  | ( :: ) of 'T * 'T list
  with
    interface System.Collections.IEnumerable
    interface System.Collections.Generic.IEnumerable<'T>
    member Head : 'T
    member IsEmpty : bool
    member Item : index:int -> 'T with get
    member Length : int
    member Tail : 'T list
    static member Cons : head:'T * tail:'T list -> 'T list
    static member Empty : 'T list
  end

Full name: Microsoft.FSharp.Collections.List<_>

  type: List<'T>
  implements: System.Collections.IStructuralEquatable
  implements: System.IComparable<List<'T>>
  implements: System.IComparable
  implements: System.Collections.IStructuralComparable
  implements: System.Collections.Generic.IEnumerable<'T>
  implements: System.Collections.IEnumerable

val min : 'T list -> 'T (requires comparison)

Full name: Microsoft.FSharp.Collections.List.min

val damerauLevenshtein : string -> string -> int

Full name: Snippet.damerauLevenshtein

val cost : int

  type: int
  implements: System.IComparable
  implements: System.IFormattable
  implements: System.IConvertible
  implements: System.IComparable<int>
  implements: System.IEquatable<int>
  inherits: System.ValueType

val min : 'T -> 'T -> 'T (requires comparison)

Full name: Microsoft.FSharp.Core.Operators.min

It seems like such a simple and obvious addition to the algorithm. Just what is it about about the way we’ve added transpositions that ruins the magic? We’ve just put something like wormholes to our little universe. That’s right, the simple addition of transpositions in this way implies a universe where some combinations of characters treat space differently than everything else. The easiest way to prove this is the case is to give the definition of the triangle inequality for metric spaces a read.

From Wikipedia’s Triangle Inequality article:
In a metric space M with metric d, the triangle inequality is a requirement upon distance: d(x, z) <= d(x, y) + d(y, z)
for all x, y, z in M. That is, the distance from x to z is at most as large as the sum of the distance from x to y and the distance from y to z.

From this, it’s easy to construct a counterexample for our broken Damerau-Levenshtein distance simply by exploiting the transpositions.

Damerau-Levenshtein distance not satisfying the triangle inequality.

As you can see in this picture 4 is most certainly greater than 1 + 2, and so the triangle inequality is broken. Consider also the pathology that this example shows in the algorithm. Why didn’t it just go along the irkc –> rick –> rcik path when it’s obviously less expensive?

Levenshtein distance satisfying the triangle inequality.

For comparison, if we measure those same pairs with standard Levenshtein distance everything is just peachy.

So we now know of at least one case which causes the triangle inequality to fail, does this imply what causes it to succeed? I think yes, at least in a limited sense. We can see that with Levenshtein distance any given pair of characters is considered independently, changes at each only happen once and are exactly one character in size. As each pair in the strings is considered, small changes push them further and further apart, but in discrete and equal units for discrete and equal changes. While in our Damerau-Levenshtein distance implementation we greedily perform operations of a larger size and then never revisit their implications, standard Levenshtein is reversibly symmetric both in how it treats locations over the string as well as in how it treats the characters themselves due to its uniform granularity. The uniform granularity of the changes ensures all important paths are explored.

Can transpositions be reliably counted with a different approach? We’ll find out the answer to this question next time.


31
Aug 12

What is good API design?

Some say that API design is one of the hardest things in programming. A few even go as far as to say you should have at least 10 years of experience to even attempt it. While I think this process can be sped up almost an order of magnitude by good mentorship, at one time or another we’ve all suffered under the API of an inexperienced programmer. Though, this does raise the question: what exactly is it about building libraries that can take up to 10 years to learn?

I was lucky in that I got a strict API education early on. Right out of college I joined Atalasoft, a company for which the API was the product and so was under the strictest of scrutiny. My mentor was Steve Hawley, a man who has spent much of his life solving difficult problems and wrapping them up in nice little packages. Steve had little patience for babysitting as he always had a lot on his plate and so under him I was forced to learn very quickly.

His philosophy, which was never explicitly stated, I call 90-9-.9. For 90% of the users you want the problem solved out of the box with just a couple of lines of code that can be cut and pasted. Here defaults matter the most. For the next 9% you’re aiming for simple configuration; something that can be easily figured out from the documentation or resolved in just a few minutes by the support team. Then there’s the .9% who will want to bend your libraries in all kinds of twisted ways, sometimes for performance or and other times some wacky (but workable) use case you never thought of. It’s completely fine to sacrifice the experience of the .9% for the sake of everyone else, just make sure it’s possible to get what they want done and that your documentation will show them the way.

Finally, there’s the unmentioned .1% who you’ll never make happy because they’re mistaken about the capabilities of your product. Better to either ignore them, or do market research to see if they’re worth the cost of extending your library to pull them in.

A great example of this is Atalasoft’s barcode product. A lot of effort went into carefully tuning it to preprocess most scanned documents without issue. After preprocessing it will by default go whole hog and try every kind of possible barcode type that you have a license for. This is still quite fast, fast enough for anyone with a small time scanning operation. Sometimes for folks doing large scale batch scanning on expensive equipment it’s just not fast enough though, so they can configure which barcode scanners are used by changing a simple enumeration property. Once in a while they get folks doing things that are a bit more difficult, like for example maybe trying to scan a barcode wrapped around a banana. For this there are events that let you interrupt, tweak and replace whole chunks of the barcode engine. But the guy who wants to read the barcodes he hand shaved into the side of the dogs in his pet store? Sorry pal, you’re better off finding another product.

When I first saw this it seemed like bad design. The whole component is like a frickin’ monolithic program with an event based do-it-yourself plugin system! You see though, aesthetic beauty as judged by an architecture astronaut isn’t what Atalasoft is optimizing for. They’re optimizing for reduction of the customer support burden. As much as I dislike object oriented programming for writing the internals of libraries like these, I think there’s no better paradigm for exposing a single simple interface that allows for manipulation at all of these levels.

Now, for the past two years I’ve been in charge of the APIs at Bayard Rock, a completely different kind of company. We do research and development primarily for anti-money laundering. This means lots of little experiments and the occasional medium-scale project which will later be integrated into our sister company’s larger infrastructure. In the vast majority of cases Atalasoft-style monolithic black-boxing wouldn’t be helpful at all. We only have one customer and we work with them to tailor our external APIs directly to their needs.

However, code reuse at a fine grained level is much more important at Bayard Rock than it was at Atalasoft. In this context what matters most is the construction of large libraries full of many small categorized functions which we can first use to put together experiments quickly (that is, through simple composition and without comprehensive unit tests) but later still feel confident about shipping in our product. We’re optimizing for experimentation and the rapid development of components that we trust enough to ship. It should come as no surprise that here typed functional programming wins hands down.

So, what is good API design? It depends, and that’s why it’s so hard.