March, 2013


27 Mar 13

Setting up F# Interactive for Machine Learning with Large Datasets

Before getting started with machine learning in F# Interactive, it's best to prepare it for large datasets and external 64-bit libraries so you don't get blindsided by strange errors when you happen to cross the line. The good news is it's a simple process that should only take a few minutes.

The first step is to go into the configuration and set fsi to 64-bit. It's only a matter of changing a boolean value buried deep in the Visual Studio settings. First, go into Tools->Settings.

[screenshot: Tools->Settings]

Then find the “F# Tools” section on the left and select the “F# Interactive” subsection.

[screenshot: F# Interactive settings under F# Tools]

Finally, set “64-bit F# Interactive” to true and click OK.

[screenshot: "64-bit F# Interactive" set to True]

This tells Visual Studio to use "FsiAnyCPU.exe" for the F# Interactive window instead of the 32-bit "Fsi.exe".
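
Once you restart Visual Studio, you can confirm which executable you ended up with from the F# Interactive window itself; this one-liner should come back true when the 64-bit process is running:

// Evaluate in the F# Interactive window; true means the 64-bit FsiAnyCPU.exe is in use.
System.Environment.Is64BitProcess;;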

With that confirmed, F# Interactive is running with as many bits as your operating system can handle. However, if we want to support really big matrices we need to go a bit further. For really large arrays, that is, greater than 2 gigabytes, we have to fiddle with the F# Interactive application config and enable the "gcAllowVeryLargeObjects" setting.

For .NET 4.5 on Windows 7, Windows 8, and Windows Server 2008 R2, the standard directory for both the fsi executables and their application configs is:

“C:\Program Files (x86)\Microsoft SDKs\F#\3.0\Framework\v4.0”

Navigate there and open “FsiAnyCPU.exe.config” in your favorite text editor. Then under the <runtime> tag add:

<gcAllowVeryLargeObjects enabled="true" />

When you’re done it should look like:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
    <legacyUnhandledExceptionPolicy enabled="true" />
    <assemblyBinding 
      xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity
          name="FSharp.Core"
          publicKeyToken="b03f5f7f11d50a3a"
          culture="neutral"/>
        <bindingRedirect
          oldVersion="2.0.0.0"
          newVersion="4.3.0.0"/>
        <bindingRedirect
          oldVersion="4.0.0.0"
          newVersion="4.3.0.0"/>
      </dependentAssembly>
    </assemblyBinding>
  </runtime>
</configuration>

Just save, restart Visual Studio, and you're done! Your F# Interactive can now handle large datasets and load external 64-bit native libraries.
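
If you want a rough sanity check that the very-large-object setting took (and your machine has a few spare gigabytes of RAM), try allocating an array that crosses the 2 gigabyte line from the F# Interactive window; without the config change the allocation fails with a System.OutOfMemoryException:

// 300 million 8-byte floats is roughly 2.4 GB, past the default 2 GB object limit.
let big : float[] = Array.zeroCreate 300000000;;
printfn "Allocated %d elements" big.Length;;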


15 Mar 13

Bad Data is the Real Problem

Big data is the buzzword du jour, and why not? Companies like Google, with huge server farms, are doing amazing things leveraging enormous amounts of data and processing power. It's all very sexy, but these researchers get to pick and choose the data they work with. They can maximize their research gains by pushing the cutting edge with data that is amenable to their task.

Meanwhile, the reality for most companies is that they are being crushed under their own small mountains of legacy data.  Data that has been merged together over decades from different machines with different fields and formatting constraints.  Decades where new laws were passed about which data can and must be collected.  These companies are desperate for data science.  Not is-it-a-cat data science, but discover-things-and-the-links-between-them data science.

Big data is the exciting research frontier where gains come relatively easily, but the more daunting task is coming up with a solution for bad data. Unfortunately, dealing with bad data involves difficult tradeoffs all the way down, and this makes it intractable to build a one-size-fits-all solution. Even a one-size-fits-many solution is difficult. For example, the US Census focuses on aggregate statistics, so somewhat sloppy methods will suffice, but what about the casino trying to keep track of card counters? Or, even more poignant, the TSA computer trying to determine if you're on a terrorist watch list?

Simplifying things a bit, here are the four basic categories of tasks within entity resolution:

  1. Low risk where errors have a low cost as with similar products on a shopping site
  2. High false-positive risk where false-positives have a high cost but false-negatives have moderate to low cost as with merging customer databases
  3. High false-negative risk where false-negatives have a high cost but false-positives have moderate cost as with anti-money laundering
  4. High risk where both false-positives and false-negatives have a high cost, such as the FBI hunting down someone who is sending anthrax through the mail

With low risk the more data the merrier.  The good will drown out the bad with traditional machine learning techniques.

For high false-positive risk you need to be extremely careful and manually review any records that have even a moderate probability of representing different entities. Thankfully, as the cost of keeping duplicates around is low, you can start by combining the databases in a straightforward manner and then slowly merge records over time, working iteratively from the most probable matches to the least.
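
A minimal sketch of that kind of routing, in F#, might look like the following; matchProbability, the Decision type, and the thresholds are hypothetical stand-ins for whatever scoring model and risk tolerances your data actually calls for.

// Hypothetical routing for a high false-positive-risk merge: only near-certain
// pairs merge automatically, plausible pairs go to a human, and everything
// else stays as duplicates for a later pass.
type Decision =
    | AutoMerge
    | ManualReview
    | LeaveSeparate

// Placeholder scoring function; in practice this would be a string-similarity
// measure or a learned model returning a probability in [0, 1].
let matchProbability (a: string) (b: string) : float =
    if a = b then 1.0 else 0.0

let route a b =
    match matchProbability a b with
    | p when p >= 0.99 -> AutoMerge       // near-certain duplicates
    | p when p >= 0.70 -> ManualReview    // plausible, but too risky to merge blindly
    | _ -> LeaveSeparate                  // duplicates are cheap to keep around

// Review from the most probable candidates down, a batch at a time.
let reviewQueue (candidates: (string * string) list) =
    candidates
    |> List.map (fun (a, b) -> (a, b), matchProbability a b)
    |> List.filter (fun (_, p) -> p >= 0.70 && p < 0.99)
    |> List.sortBy (fun (_, p) -> -p)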

The high false-negative risk style of problem is much more challenging. Consider two datasets of size N and M: the number of possible record pairs is N * M, which is far too many for manual review, and low-cost Mechanical Turk-style reviewers do a poor job of finding needles in haystacks. So one approach, similar to the high false-positive risk case, is to order your matches by probability but also include a quantified measure of false-negative risk on a per-record-match basis. You must also go to extreme lengths to find matches here, for example by building algorithms that understand how different cultures use, write, and pass down names. It's a constant struggle to further refine the quality of your results while pushing down the false-negative rate.
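
To put the pair explosion in concrete terms, here is a quick back-of-the-envelope calculation; the dataset sizes are made up purely for illustration.

// Two hypothetical datasets of a million records each.
let n = 1000000L
let m = 1000000L

// In principle every record in one could match every record in the other.
let candidatePairs = n * m   // 1,000,000,000,000 pairs

// Even at one second per pair, a single reviewer would need roughly
// 31,700 years, which is why exhaustive manual review is off the table
// and matches have to be ranked by probability instead.
let reviewerYears = float candidatePairs / (3600.0 * 24.0 * 365.0)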

High risk is the hardest problem of all and is better left mostly unautomated at this point.  Perhaps you could use a high false-negative risk style approach to glean candidates, but you’ll still need a lot of intelligently applied elbow grease and a large team to get there.

These gross categories don't take into account other factors like data quantity, class size imbalances, lossy data formatting, input errors, and poor data management. I'm afraid that, until the singularity, no magic bullet will solve this problem. For getting started my best recommendation is John Talburt's Entity Resolution and Information Quality. Unlike many books on the subject it's quite accessible to non-academics.
