This post is for the 7th day of the 2014 F# Advent Calendar.

It has been said that functional languages can’t be as fast as their imperative cousins because of all of the allocation and garbage collection, this is patently false (as far as F# is concerned at least) largely because the CLR has value types.

Recently Louis Thrall and I ran into an interesting problem at work: how do you efficiently compare all rotations of one array to another to select the best? This is easy when the arrays are the same size, just shift the indices around:

Which is equivalent to a nested for loop, the top one rotating and the inner doing comparisons.

But what about differently sized arrays? It turns out this is more complicated, even for comparing a size 2 array to size 3:

Like comparing beads on two strings, there is much shifting as the indices are rotated around. The extra freedom from this shifting means that more comparisons must be made even though there total number of combined elements is smaller.

This problem also has an interesting structure, as it costs the most when one array is about half the size of the other. When the smaller array is size 1 then it simply needs to be moved around, and when the arrays are equal in size it’s a simple matter of rotations.

After some research Lou noted that this we were “counting the number of cyclic suborders of size k of a cyclic order of size n” which produces k(n choose k) different configurations.

Now that we had bounds and knew what we were doing was reasonable we had to figure out an algorithm for this process. It occurred to us that it would be ideal to pre-compute the comparison values to avoid repeated comparisons. Given this matrix of comparison values it wasn’t too difficult to determine a procedure for checking all possible configurations.

Each time a pair is chosen, it further constrains what the rest of the assignments may be. The bottom level is somewhat obvious, just select each of the possible starting combinations down the y-axis. In the next column you potentially have a range of possible “downward” (using clock arithmetic) selections bounded by the difference in lengths and what has already been fixed.

Now it was time to implement, and the first stab was nice and concise and functional and worked pretty well.

```
open System
// arr1 is always equal in size or smaller than arr2
let bestRot (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
// Pre-calculate similarities
let scores =
Array2D.init arr1.Length arr2.Length
(fun i j -> f arr1.[i] arr2.[j])
// inner function for recursively finding paths
// col = previous column, prow = previous row
// df = degrees of freedom, path = path accumulator
let rec inner col prow df path =
seq {
// when we're out of columns to assign we're done
if col >= arr1.Length then yield path |> List.rev
else
// We have df "degrees of freedom" left
for d = 1 to df + 1 do
// Clock arithmetic for the row
let nrow = (prow + d) % arr2.Length
// Recurse yielding out all further paths
yield! inner (col + 1) nrow (df + 1 - d)
((col, nrow) :: path)
}
let res, score =
// the difference in length
// is the starting "degrees of freedom"
let diff = arr2.Length - arr1.Length
seq {
// for each y-axis starting point
for y = 0 to arr2.Length - 1 do
// 1 selected, r is the starting point (on y),
// starting is always 0 on x.
// ((0, y) :: []) is the accumulator
// with the initial (0, y) coordinates
yield! inner 1 y diff ((0, y) :: [])
}
// Sum each path to find total similarity
|> Seq.map
(fun l -> l, l |> Seq.sumBy (fun (i,j) -> scores.[i,j]))
// Get the path with the highest similarity
|> Seq.maxBy snd
// Create output array and copy in the results
let out = Array.create arr1.Length Int32.MinValue
for (i,j) in res do
out.[i] <- j
// Return results
out, score
```

However, after real world trials we found that it was taking an astonishing 20% of our runtime. We had to find a better way. We soon found Gosper’s hack which did mostly what we wanted. It gave us the combinations, and we found we could just rotate each of those around to give us all of our cyclic suborders.

And so we went about rewriting our nice functional sample to use Gosper’s Hack to avoid the need to generate paths and also to be (nearly) minimally allocating to avoid garbage collection overhead.

```
open System
// arr1 is always equal in size or smaller than arr2
let bestRotGos (f: 'a -> 'a -> float) (arr1: 'a []) (arr2: 'a []) =
// Start Gosper's Hack Machinery Bootstrap
let bitsConfig = Array.create arr1.Length Int32.MinValue
let inline setBits v =
let inline getBit x i = x &&& (1 <<< i) <> 0
let mutable idx = 0
for i = 0 to 31 do
if getBit v i then
bitsConfig.[idx] <- i
idx <- idx + 1
let inline nextGosper set =
let c = set &&& -set
let r = set + c
r + (((r^^^set)/c)>>>2)
let limit =
let n = arr2.Length
(1 <<< n)
let mutable set =
let k = arr1.Length
(1 <<< k) - 1
// End Gosper's Hack Machinery Bootstrap
// Pre-calculate sims and init placeholder variables
let scores = Array2D.init arr1.Length arr2.Length
(fun i j -> f arr1.[i] arr2.[j])
let mutable maxScore = Double.NegativeInfinity
let bestConfig = Array.create arr1.Length Int32.MinValue
while (set < limit) do
setBits set // Turn "set" bits into indices in "bitsConfig"
// For each rotation
for i = 0 to bitsConfig.Length - 1 do
// calculate score and compare with previous best
let mutable totalScore = 0.0
for i = 0 to bitsConfig.Length - 1 do
let tokenScore = scores.[i,bitsConfig.[i]]
totalScore <- totalScore + tokenScore
if totalScore > maxScore then
// and replace if better
maxScore <- totalScore
Array.blit bitsConfig 0
bestConfig 0
bitsConfig.Length
// Rotate the array
let firstElement = bitsConfig.[0]
for i = 0 to bitsConfig.Length - 2 do
bitsConfig.[i] <- bitsConfig.[i + 1]
bitsConfig.[bitsConfig.Length - 1] <- firstElement
set <- nextGosper set // Put the next combination in "set"
bestConfig, maxScore
```

A quick test shows that the new algorithm is about seven times faster across a variety of test inputs:

```
let f x y = if x = y then 1.0 else 0.0
let cmp = [|"one"; "two"; "three"; "four"|]
let cmps =
[| for i = 0 to cmp.Length - 1 do
for j = i to cmp.Length - 1 do
// Longer array is second
yield cmp.[j..], cmp.[i..]
|]
#time
for i = 0 to 1000000 do
for c1,c2 in cmps do
bestRot f c1 c2 |> ignore
//Real: 00:00:41.800, CPU: 00:00:41.812,
// GC gen0: 10175, gen1: 6, gen2: 1
#time
#time
for i = 0 to 1000000 do
for c1,c2 in cmps do
bestRotGos f c1 c2 |> ignore
//Real: 00:00:06.378, CPU: 00:00:06.375,
// GC gen0: 689, gen1: 1, gen2: 0
#time
```

Of course, there are a few more things that could be done to make this even faster mostly related to further reducing allocations. For example, we could potentially use larger fixed sized arrays if we knew our input was bounded and pass in the arrays as references letting the calling function handle proper recycling.

However, at this point the function no longer appeared anywhere near significant when profiling our overall usage and so we found this was fast enough. It’s not worth making things uglier and more prone to breakage for minor gains when what we have right now is a referentially transparent function with predictable behavior.