Jul 14

Review: Sony Digital Paper DPT-S1 at Lambda Jam 2014

I don’t usually review hardware here, but I think this device stands out as being particularly useful to people who take a lot of notes and/or read a lot of research papers.

I read about the Sony Digital Paper DPT-S1 for the first time about a year ago and couldn’t help but be impressed. It promises the easy reading of e-ink combined with a size that is amenable to academic papers, and on top of that it lets you actively annotate documents with a pen as you read them. It also sports the usual e-ink three weeks of battery life. Luckily enough, I managed to get one of my very own right before Lambda Jam 2014 and so had the perfect opportunity to give it a spin in a real-world setting.

Reading Papers

Reading and marking up papers in PDF format is where this device shines.

A Pristine Paper (not yet marked up)

You simply swipe to turn pages, and it works every time. There’s even pinch zoom. The screen is large enough that you can easily read an entire page without zooming, something I could never do on my first-gen e-ink Kindle (the DPT-S1 also weighs substantially less). You even get multiple tabs in a workspace so you can swap between different documents quickly for cross-reference.

In this context it’s a better Kindle DX (now discontinued) that you can take notes on. For me (and for many others I suspect) reading a paper is a very interactive experience. You want to be able to highlight the important parts and even scribble in the margins as you move through it. The DPT-S1 supports this better than any device I have seen yet.

Marking up a research paper.

As you can see here, you can not only write directly on the paper but also highlight. Both are done with the included stylus: its standard function is writing, but it switches to highlighting if you hold down the button on its side. You may also notice the little boxes in the margin of the text; these are collapsible notes.

Collapsible Notes on the DPT-S1

As you can see, the darker square in the top right margin is opened here and available for writing. Also please note that these notes were taken by me (with horrible handwriting in general) on the way to Lambda Jam, in a squished economy seat of an airplane, while there was some mild turbulence. While the handwriting isn’t paper-perfect, it’s much better than on other devices I’ve used in the past, including the iPad.

One of the best features of the DPT-S1 is also its most limiting: It’s designed to work only with PDF files. The big benefit of this is that all of these writing annotations actually turn into PDF annotations on the given file. This makes them extremely easy to export and use in other contexts.

Taking Notes

The other big use case I had in mind for the DPT-S1 was taking notes. I always carry a notebook of some form and over the last three years I’ve managed to create quite a lot of non-indexed content.

About half of my notebooks from the last three years.

I usually carry one notebook for notes/exploring ideas, another for planning things (like to-do lists and such), and finally one small one for writing down thoughts on the go. This stack doesn’t include notes from talks I’ve attended or my Coursera class notes. It also doesn’t include the giant stack of hand annotated papers in my office, but that’s more to do with the previous section.

I took pages and pages of notes on the DPT-S1 at Conal Elliott’s talk at Lambda Jam (great presentation by the way). Here’s a side by side comparison with some paper notes I’ve written in the past.

Some notes from Conal Elliott's talk at Lambda Jam

Some actual paper notes

As you can see, my handwriting isn’t great as I tend to go kind of fast and sloppy when not looking at the paper, but the DPT-S1 holds up rather well. I think it would do even better for someone with nicer handwriting than I.

There is one somewhat annoying downside: when you make a new notebook PDF to take notes, it only has 10 pages, and you have to give it a name with the software keyboard (it defaults to a date- and time-based name). This slowed me down big time in the talk because he was moving very fast toward the end, and that’s precisely when I ran out of pages. Still, given how well polished the rest of the device is, it’s something I can overlook.

Browsing the Web

The final use case for the DPT-S1 is web browsing. This isn’t something I really need as my phone usually does a pretty good job at this, but it could be nice to have for reading blogs and such so I’ll touch on it.

DataTau on the DPT-S1

As you can see here, DataTau is actually quite navigable in this format.

This blog, you’re reading it right now.

My blog actually renders quite well and is very readable. You can scroll by swiping up and down, and pinch-zoom works here too.

I went to several sites and they all worked well enough, but given that this device is WiFi only I don’t expect I’ll be using it much for reading blog posts on the go.


If you’re looking for a cheap consumer device that you can easily buy e-books for, you should look elsewhere. It’s expensive (~$1000 USD), hard to acquire (you have to email and talk to sales agents), has no store and no API (only the filesystem), and only supports PDF.

However, if you’re like me in that you take a lot of notes and you read a lot of papers, and you don’t mind spending a bit of money on something to solve a major problem in your life, this is by far the best device on the market for your needs.

Please note that while they are available on Amazon, that’s the imported Japanese-language version. Currently the only way to get an English-version DPT-S1 is by contacting the sales team at WorlDox.

Dec 13

My 2013 F# Year in Review

It’s been a great year for F# with the blossoming of the fsharp.org working groups. It’s been amazing watching the community come together to form a movement outside of Microsoft. This is most certainly the long term future of F#, protected from the whims of layer upon layer of management. Who knows, in the coming year we might even see community contributions to the F# Core libraries. Who would have thought that would ever have been possible?

I’m very happy to see that Sergey Tihon has maintained his wonderful weekly roundup of F# community goings on. It’s a big time investment week after week to keep the weekly news going. After leaving Atalasoft, and no longer being paid to blog on a regular basis, I found I couldn’t keep investing the time and felt very badly about not being able to continue my own weekly roundups. Sergey has picked up that mantle with a passion, and I’m so very glad for this extremely useful service he provides to the community.

Meanwhile Howard Mansell and Tomas Petricek (during his BlueMountain sabbatical) worked toward building a bunch of great new tools for data science in F#. The R Type Provider has become extremely polished, and while Deedle may be fresh out of the oven, it already rivals pandas in its ability to easily manipulate data.

At Bayard Rock, Paulmichael Blasucci, Peter Rosconi, and I have been working on a few small community contributions as well. iFSharp Notebook (an F# kernel for iPython Notebook) is in a working and useful state, but is still missing intellisense and type information, as the iPython API wasn’t really designed with that kind of interaction in mind. The Matlab Type Provider is also in a working state, but still missing some features (I would love to have some community contributions if anyone is interested). Also in the works is a nice set of F# bindings for the ACE Editor; I’m hoping we can release those early next year.

Finally, I wanted to mention what a great time I had at the F# Tutorials in both London and NYC this year. I also must say that the London F# culture is just fantastic; Phil is a thoughtful and warm community organizer and it shows in his community. I’ve been a bit lax in my blogging, but they were both truly wonderful events and are getting better with each passing year.

F# Tutorials NYC 2013

That right there was the highlight of my year. Just look at all of those smiling functional programmers.

Aug 13

All Machine Learning Platforms are Terrible (but some less so)

I recently took a medium-sized feature set with labels at work and ran it through some of the most popular machine learning platforms. The goal was to get a feel for each of them via the standard battery of regressions and evaluate each for use in further experimentation. This is a review of my journey.

Experimental Setup:
Features and Labels in a ~500 MB CSV file
Labeled Records: ~140,000
Features: ~3500 binary, labels [0-100]
Hardware: 4 x 8 = 32 cores, 256 GB of RAM
OS: Windows Server 2008 R2

- F# with Math.NET -
I used F# to build the features for the rest of these processes. It was quite nice using the SQL Type Provider to grab the data out of the database and then process it into binary features, even though it consisted of fourteen unoptimized tables across two SQL Server databases with rather odd relationships. I did this step-by-step while trying new features out on an iterative linear regression function I hand wrote in a single line of F#. The syntax with Math.NET is almost exactly the same as Matlab’s, and so it came quite easily. On top of that, the linear algebra was quite fast using Math.NET’s MKL Linear Algebra provider.

While Math.NET is under constant work by some really smart folks, it currently only supports a few non-iterative linear solvers with MKL. Iterative linear regression was easy enough to do by hand, but I wanted to try some of the more complex regressions without worrying if I implemented them properly. Once I had my features sorted it was obvious that it was time to move on.
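
For anyone wondering what "iterative linear regression" amounts to, here is a minimal sketch. It is written in Python rather than F# and is not the one-liner mentioned above (which isn't reproduced in this post); it assumes plain batch gradient descent on squared error, and the function name, learning rate and step count are purely illustrative.

```python
def fit_linear(xs, ys, rate=0.1, steps=2000):
    """Batch gradient descent for least-squares linear regression.

    xs: list of feature rows (lists of numbers); ys: list of targets.
    Returns (weights, bias). The hyperparameters are illustrative defaults.
    """
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Residuals of the current model on every row.
        errs = [sum(wj * xj for wj, xj in zip(w, x)) + b - y
                for x, y in zip(xs, ys)]
        # Gradient of the mean squared error (constant factor folded into rate).
        grad_w = [sum(e * x[j] for e, x in zip(errs, xs)) / n for j in range(d)]
        grad_b = sum(errs) / n
        w = [wj - rate * gj for wj, gj in zip(w, grad_w)]
        b -= rate * grad_b
    return w, b


# Toy binary features with targets generated by y = 10 + 20*x0 + 5*x1.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [10.0, 15.0, 30.0, 35.0]
weights, bias = fit_linear(rows, targets)
```

On this toy data the fit should recover weights close to 20 and 5 with a bias close to 10. A real feature matrix of the size described above would want a proper linear algebra library rather than nested lists.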

- R 2.14.? -
R was easy to install and get running. It was nice to have the package manager built right in to the console. However, from there on it was all downhill. Loading the data file took forever, approximately 10 minutes with the standard CSV machinery. Once it was loaded, it was just one out-of-memory exception after another. I tried to run several regressions, but I wasn’t able to complete a single experiment, and many took quite a long time to fail. All signs point to poor garbage collection in the R runtime.

Blinded by my frustration, I ended up buying Revolution R, but they skirt the problem by using their own file based format and have a limited handful of regressions on that format. I’m holding out hope that things will be better in R 3.0 as they’ve finally removed the 32-bit memory limitation. Still, given the state of Python (see below) I don’t think there’s any compelling reason to revisit R at all.

- Matlab 2013a -
I already own the base Matlab 2013a and use it on a regular basis, but I wanted to avoid shelling out the $5000+ for the toolkits needed for this project before making sure they could do what I wanted (and not choke on my data like R), so I requested a trial. It was quite an ordeal: I had to wait for an actual sales agent to call me on the phone, discuss what I was trying to do, and request that my license be sent multiple times via email (they kept forgetting?). I swear I’ve never had such a difficult software customer experience before. I still don’t understand why I couldn’t just download a trial from their site.

In any case, right when we finally got it all sorted we experienced some technical difficulties with our server room overheating and had to have the beastly box relocated. Two months or so later my hardware is back up and running at a better location but my toolbox trials have long since expired. I’m currently exploring other options before I go back groveling for an extended trial.

- Scikit-learn via WinPython 64-bit -
The hardest part in getting started was picking a Scikit distribution; there are at least three popular ones for Windows. I ended up going with WinPython because it was MIT licensed and I try not to bring the GPL into my workplace whenever I can avoid it. You’d never want GPL code to accidentally make its way into anything that leaves the building.

First impressions were great: the CSV file loaded in under 15 seconds with pandas, and it was quite a revelation that I could take a pandas table and just pass it in to these scikit functions as if it were a matrix. Very slick. However, it’s not all roses. I spent a lot of my first day trying to figure out why the basic linear regression was giving nonsensical results. After some inspection, it looks like a numerical overflow somewhere in the depths is causing a few weights to become extremely large negative values. The rest of the linear models worked great, however.

Then, as I was full of momentum, I thought I’d give the SVM stuff a go, but it turns out that for some reason Scikit disables OpenMP for LibSVM and so it’s incredibly slow. So, after 24 hours or so of LibSVM puttering away at 3% overall CPU usage, I thought I’d just load up another Spyder instance and keep working while it chugged along. No such luck: you can only have one Spyder window open at a time.

In fact, I think Spyder is by far the weakest part of the Scikit offering. It’s not only limited in terms of instances; it also has an odd tendency to lock up while the Python interpreter is busy, and the variable explorer ignores some variables (I’m not sure what that’s about). Also in the box is IPython Notebook, but it doesn’t seem to like the Internet Explorer that’s on the machine, and whatever solution we come up with has to eventually work in a completely locked-down environment with no internet connection, and hopefully without any installed dependencies. Perhaps I’ll fare better with something like Sublime Text, but it is nice to have graphical variable inspection.

- Final Impressions - 
If I were going to recommend a setup to someone getting started today, I’d say by far and away the best choice is a Scikit distribution. It’s not without problems, but compared to the horrible mess that makes up the rest of the available options it shines. I’d also suggest trying to find a different GUI than Spyder. It’s fine for playing around, but it’s far too janky to be considered reasonable for professional day-to-day use.

Jul 13

The Promise of F# Language Type Providers

In most software domains you can safely stick with one or two languages and, because the tools you are using are fairly easy to replicate, you’ll find almost anything you might need to finish your project. This isn’t true in data science and data engineering, however. Whether it be some hyper-optimized data structure or a cutting-edge machine learning technique, often you only have a single language or platform choice.

Even worse, when you want to build a system that uses one or more platform specific components, things can become quite an engineering mess. No matter what you do you can’t avoid the high cost of serialization and marshaling. This makes some combinations of tools non-options for some problems. You often make trade-offs that you shouldn’t need to make, for example using a worse algorithm just because the better option hasn’t been written for your platform.

In .NET this is a particularly bad problem. There are quite a few dedicated people working on open source libraries, but they are tiny in number compared to the Matlab, Python, R or Java communities. Meanwhile, Microsoft Research has several fantastic libraries with overly restrictive licenses that make them impossible to use commercially. These libraries drive away academic competition, but at the same time can’t be used outside of academia. It’s a horrible situation.

Thankfully, there is a silver lining in this dark cloud. With the release of F# 3.0 in VS 2012 we were given a new language feature called Type Providers. Type Providers are compiler plugins that generate types at compile time and can run arbitrary code to do it. Initially, these were designed for accessing databases and getting types from the schema for free, but when Howard Mansell released the R Language Type Provider everything changed. We now realized that we had a way to build slick typed APIs on top of almost any other language.

This means that now it doesn’t matter if someone has written the algorithm or data structure for our platform as long as there’s a Type Provider for a platform where it has been done. The tedious work of building lots of little wrapped sub-programs is completely gone. It shouldn’t even matter if the kind of calculation you’d like to do is fast on your native platform, as you can just transparently push it to another. Of course, we still must pay the price of marshaling, but we can do it in a controlled way by dealing in handles to the other platform’s variables.
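
To make the "handles" idea a little more concrete, here is a rough sketch in Python. It is not how any real Type Provider is implemented; `ForeignRuntime` and `Handle` are hypothetical names standing in for another platform's session, and the only point is that data crosses the boundary when you explicitly put or get it, while intermediate results stay on the other side.

```python
class ForeignRuntime:
    """Stand-in for another platform (R, Matlab, ...). Entirely hypothetical."""

    def __init__(self):
        self._vars = {}          # values living on the "other side"
        self.marshal_count = 0   # how many times data crossed the boundary

    def _store(self, value):
        name = f"v{len(self._vars)}"
        self._vars[name] = value
        return Handle(self, name)

    def put(self, values):
        """Marshal a sequence across once and return a handle to it."""
        self.marshal_count += 1
        return self._store(list(values))

    def call(self, func, *handles):
        """Run func on the foreign side; only handles are exchanged."""
        args = [self._vars[h.name] for h in handles]
        return self._store(func(*args))

    def get(self, handle):
        """Marshal a result back to the host."""
        self.marshal_count += 1
        return self._vars[handle.name]


class Handle:
    """Opaque reference to a variable held by the foreign runtime."""

    def __init__(self, runtime, name):
        self.runtime, self.name = runtime, name


rt = ForeignRuntime()
data = rt.put(range(1_000))
squared = rt.call(lambda xs: [x * x for x in xs], data)
total = rt.call(lambda xs: [sum(xs)], squared)
result = rt.get(total)   # only this small result is marshaled back
```

Here the thousand-element intermediate list never comes back to the host; two boundary crossings are paid in total, one in and one out.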

The language Type Providers themselves are a bit immature for the moment but the idea is sound and the list is growing. There are now the beginnings of an IKVM Type Provider (for Java) and I’m working on a Matlab Type Provider. The Matlab Provider doesn’t yet have all of the functionality I am aiming for, but I’ve been working on it for several months and it’s quite usable now. All that’s left is for someone to start in on a Python type provider and we’ll practically have all of the data science bases covered.

It’s an exciting time to be an F#’er.

Jul 13

Come join me at the SkillsMatter F# Tutorials NYC 2013

Last year was our first NYC F# tutorials and they were just amazing (you can read about them here) but this year’s are going to be even better. We’ve got a lineup including some of the most talented teachers in the F# community, and the tickets are extremely inexpensive as conferences and training events go.

Looking to learn F#? Our beginner track is jam packed with hands-on exercises. It was amazing to see what just two days of training can do. A C# co-worker of mine was a beginner track attendee last year and delivered a project in F# just the next week.

Already have some serious F# skills? In our advanced track we’ve got a lineup that will push those skills to the limit. I personally am particularly excited to dig into the F# compiler with Don and Tomas.

Now that I’ve had my say, here’s the official spiel:

On the back of the success of the 2013 edition, the Progressive F# Tutorials return to New York in September – this time packing an even bigger punch! With F# UG lead Rick Minerich at the helm, we’ve put together an expert-filled line-up – featuring Don Syme (creator of F#), Tomas Petricek, and Miguel de Icaza. The Tutorials will be split in two – a beginners track for those eager to unleash F#’s full power, and a ‘meaty track’ for those more experienced f#pers amongst you! Each session will be a 4 hour hands-on deep dive, brushing aside the traditional format of conferences to allow you to truly immerse yourself in the subject topic.

Want to get involved? We’re giving a special community 20% discount! Just go ahead and enter SkillsMatter_Community on the booking form and the team at Skills Matter will look forward to welcoming you to NYC this September!

- Check out our schedule.
- Purchase tickets.
- Read about last year’s tutorials.

Are you as excited as I am yet?

Jul 13

In Retrospect: QCon NYC 2013 (and a conversation with Rich Hickey on languages)

QCon NYC was the most refreshing conference I’ve been to in a very long time. Perhaps it’s partially because I’ve lingered too long in Microsoft circles, or maybe it’s just been too long since I went to a large conference. In any case, the speaker lineup was just chock full of brilliant minds from all over the world. I am honored to be counted among such an illustrious lineup.

Click for a video recording of my talk.

My talk was well received, but the title wasn’t as descriptive of the content as I would have liked. It’s quite a challenge titling a talk six months in advance. Perhaps I should have called it something like “One language to rule them all”, or “Language du jour”, but I’m not sure either of those would have gone over quite as well on the polyglot track.

Left to right: Runar, Rick and Rich.
Paul Snively is behind the camera.

While the average quality of the talks was far above what I’m used to at most of the conferences I’ve attended, both in entertainment value and content, as usual the interspersed deep conversations were far and away the most rewarding. Of all of those deep conversations, the one that stands out most in my mind was when Rich Hickey sat down with Runar Bjarnason, Paul Snively and me for dinner. We talked quite a bit about his Datomic project, agreed on the power of immutability, and eventually discussed our differing philosophies on types.

I have immense respect for Rich Hickey. He’s a brilliant man and is almost solely responsible for kindling my interest in functional programming. He’s had a huge influence in creating the programmer that I am today, and I count him among my heroes. Now, the only point on which I’ve ever found myself disagreeing with him is types, and so I couldn’t help myself. With a bit of trepidation, I broached the subject. It’s funny that something so technical can be so difficult to talk about, but because we are all so passionate about our viewpoints I know we all had to be quite careful to phrase things so as not to inflame the tension.

What I learned is that Rich Hickey and I don’t disagree nearly as much as I thought. His main point was that the glue of a program shouldn’t know anything about what it’s gluing, much like a FedEx truck wasn’t designed with the contents of the boxes it carries in mind. I also tend to design programs in this way, but lean heavily on reflection to do it instead of using a dynamic language.

Even a month later, Runar’s main point of contention still seems unanswered: do generic types count as the truck being designed with the contents of the box in mind? You can argue either way here. On one hand, the code certainly knows about some of the properties of what’s in the box (for example, does it fit on the truck?). How tightly these properties constrain depends quite a bit on the language in question and its type features, of course. This is actually quite useful because it keeps you from attempting to do something like putting a steamboat into your FedEx truck. The properties of the FedEx truck and the boxes it can hold must be respected.
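
One way to picture the "does it fit on the truck?" side of the argument is a bounded generic. This is only a sketch in Python with type hints; the names `Shippable`, `Truck`, `Box` and `Steamboat` are made up for illustration, and Python enforces the bound only through an external checker such as mypy, while the capacity check happens at runtime.

```python
from dataclasses import dataclass, field
from typing import Generic, List, Protocol, TypeVar


class Shippable(Protocol):
    """The only property the truck is allowed to know about its cargo."""
    volume: float


T = TypeVar("T", bound=Shippable)


@dataclass
class Truck(Generic[T]):
    capacity: float
    cargo: List[T] = field(default_factory=list)

    def load(self, item: T) -> bool:
        """Accept the item only if it still fits; never inspect its contents."""
        used = sum(c.volume for c in self.cargo)
        if used + item.volume > self.capacity:
            return False
        self.cargo.append(item)
        return True


@dataclass
class Box:
    volume: float
    contents: object  # opaque to the truck


@dataclass
class Steamboat:
    volume: float


truck = Truck[Shippable](capacity=10.0)
loaded_box = truck.load(Box(volume=2.0, contents="books"))
loaded_boat = truck.load(Steamboat(volume=500.0))
```

The truck code touches `volume` and nothing else, so in this sketch the glue knows one property of the box without knowing what is inside it. Whether that counts as the truck being designed with the contents in mind is exactly the open question.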

On the other hand, you may often find yourself in a situation where your abstraction is overly limiting and the only recourse is to make changes to the type structure of the existing program in order to extend it. I think this is what Rich was getting at, and it’s true. For a truly decoupled program (that is, no extra shared dependencies between sub-components) you need one of three things: 1) a meta reflection layer, 2) a dynamic language or 3) a very liberally defined type structure. In the third case it’s just extra work, with perhaps a negligible tangible benefit in terms of program safety.

In either case, the post-compilation/interpretation program eventually knows what’s in the box; it’s more a question of when: at compile time, or when the box is first touched. Perhaps this is where the metaphor breaks down, or perhaps I’m just overthinking it. In any case it’s been a while since I reevaluated my hard-line views on types, and I’m grateful to Rich for sitting down with us and providing the impetus. After all, in my own words right from my QCon talk, it’s all about context.

Jul 13

On Type Safety, Representable States and Erlang

Close your eyes and imagine your program as a function that takes a set of inputs and produces a set of outputs. I know this may seem overly simple, but a set of actions in a GUI can be thought of as a set of inputs, and a set of resulting side effects to a database can be seen as a new state of the world being returned.

Now focus on its input space. This space comprises all possible combinations of all possible inputs. In this set some will be well defined for your program and some not. An example of a not well defined input could be as simple as an incorrect database connection string, as straightforward as an incorrect combination of flags on a console application, or as difficult to detect as a date with month and day transposed.

Input Space

A program thought of in this way is a fractal-like thing, a program made of little smaller programs, made of smaller programs yet. However, there’s no guarantee that each of these smaller programs will treat a piece of data in exactly the same way as the others. In addition to any initial validation, any top-level inputs which cause other inputs to be given to sub-programs where they are not properly handled are similarly considered not well defined. Consider these three approaches to making your program safer by reducing the size of the incorrect input space:

First, you can increase the size of the blue circle with explicit input checking. This means numerous validations to ensure the program exits with proper notification when incorrect inputs are given. However, the program is fractal, and so if we want to be safe we’ll need to reproduce many of these checks fractally. A great example of this is handling null values, and we all know how that turns out.

Another approach is to shrink the size of the red circle. We can do this by making fewer incorrect states representable with types. Because we know that all of the potential states are valid once encoded, we only need to do our checks once while marshalling our data into a well-typed representation. This eliminates almost all need for repeated validation, limited only by how far your type system will take you. Even better, with newer language features (such as F# type providers) we can eliminate much of this marshalling phase. This is, however, similarly limited by how far the schema of the data will take you.
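
As a small illustration of checking once at the boundary, here is a sketch in Python (whose type system is much weaker than F#'s, so treat it as an analogy). The names `Mode`, `Job`, `parse_job` and `describe` are invented for the example, and in Python it is only a convention, not something the language enforces, that a `Job` is built through `parse_job`.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class Job:
    """A validated request; by convention only parse_job creates these."""
    mode: Mode
    retries: int


def parse_job(raw: dict) -> Job:
    """Check the untyped input once, at the boundary."""
    mode = Mode(raw["mode"])          # raises ValueError on unknown modes
    retries = int(raw["retries"])     # raises ValueError on non-numeric text
    if retries < 0:
        raise ValueError("retries must be non-negative")
    return Job(mode, retries)


def describe(job: Job) -> str:
    """Downstream code takes a Job and does no further validation."""
    return f"{job.mode.value} x{job.retries + 1}"


ok = describe(parse_job({"mode": "read", "retries": "2"}))

try:
    parse_job({"mode": "delete", "retries": "2"})
    rejected = False
except ValueError:
    rejected = True
```

Everything after `parse_job` works with `Mode` and `int` values rather than raw strings, so the "is this a valid mode?" question never has to be asked again further down.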

A third approach, available only in some situations but which I find extremely fascinating, is to build everything in such a way that the entire program logs the error and resets its state when an incorrect input is found. Most paradoxically, in this case the more fragile you make your system, the safer it is (as long as you ensure that external state changes are the very last thing done, and that they’re done transactionally). This seems to be the Erlang philosophy and the only flaw I can find with it is shared with most type systems. That is, you can’t implicitly account for ambiguous inputs or state spaces that your type system can’t constrain.
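
A toy version of that log-and-reset loop might look like the following. This is a Python sketch and not Erlang's actual supervision machinery; `handle` and `supervise` are hypothetical names, and the "external effect" is just an append to a list, done last and only after the step succeeded.

```python
import logging

log = logging.getLogger("worker")


def handle(state, message):
    """One step: returns (new_state, effect) or raises on any bad input."""
    amount = int(message)            # deliberately fragile: no defensive checks
    if amount <= 0:
        raise ValueError(f"bad amount: {message!r}")
    return state + amount, ("credit", amount)


def supervise(messages, initial_state=0):
    """Run handle() on each message; on failure, log and reset the state."""
    state, committed = initial_state, []
    for message in messages:
        try:
            state, effect = handle(state, message)
        except Exception as exc:
            log.error("resetting after %r: %s", message, exc)
            state = initial_state
            continue
        # External effects happen last, only after the step fully succeeded.
        committed.append(effect)
    return state, committed


final_state, effects = supervise(["5", "oops", "3", "-1", "4"])
```

The handler itself contains no recovery logic at all; any surprise simply throws, and the supervising loop is the one place that decides what a failure means.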

Mar 13

Setting up F# Interactive for Machine Learning with Large Datasets

Before getting started with Machine Learning in F# Interactive it’s best to prepare it for large datasets and external 64-bit libraries so you don’t get blindsided with strange errors when you happen to cross the line. The good news is it’s a simple process that should only take a few minutes.

The first step is to go into the configuration and set fsi to 64-bit. It’s only a matter of changing a boolean value buried deep in the Visual Studio settings. First, go into Tools->Options.


Then find the “F# Tools” section on the left and select the “F# Interactive” subsection.


Finally, set “64-bit F# Interactive” to true and click OK.


What this does is set Visual Studio to use “FsiAnyCPU.exe” for the F# Interactive window instead of 32-bit “Fsi.exe”.

Now, after you restart Visual Studio, your F# Interactive is running with as many bits as your operating system can handle. However, if you want to support really big matrices you’re going to need to go a bit further. For really large arrays, that is, greater than 2 gigabytes, you’ll need to fiddle with the F# Interactive application config and enable the “gcAllowVeryLargeObjects” element.

For .NET 4.5 on Windows 7, Windows 8 and Windows Server 2008 R2 the standard directory for both the fsi executables and their application configs is:

“C:\Program Files (x86)\Microsoft SDKs\F#\3.0\Framework\v4.0”

Navigate there and open “FsiAnyCPU.exe.config” in your favorite text editor. Then under the <runtime> tag add:

<gcAllowVeryLargeObjects enabled="true" />

When you’re done it should look like:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
    <legacyUnhandledExceptionPolicy enabled="true" />
  </runtime>
</configuration>

Just save and restart Visual Studio and you’re done! Your F# Interactive can now handle large datasets and loading external 64-bit native libraries.

Mar 13

Bad Data is the Real Problem

Big data is the buzzword du jour, and why not?  Companies like Google with huge server farms are doing amazing things leveraging huge amounts of data and processing power.  It’s all very sexy but these researchers get to pick and choose the data they work with.  They can maximize their research gains by pushing the cutting edge with data that is amenable to their task.

Meanwhile, the reality for most companies is that they are being crushed under their own small mountains of legacy data.  Data that has been merged together over decades from different machines with different fields and formatting constraints.  Decades where new laws were passed about which data can and must be collected.  These companies are desperate for data science.  Not is-it-a-cat data science, but discover-things-and-the-links-between-them data science.

Big data is the exciting research frontier where gains come relatively easily, but the more daunting task is coming up with a solution for bad data.  Unfortunately, dealing with bad data is difficult tradeoffs all the way down and this makes it intractable to build a one size fits all solution.  Even a one size fits many solution is difficult.  For example, at the US Census they focus on aggregate statistics and so somewhat sloppy methods will suffice, but what about the Casino trying to keep track of card counters?  Or even more poignant, the TSA computer trying to determine if you’re on a terrorist watch list?

Simplifying things a bit, here are the four basic categories of tasks within entity resolution:

  1. Low risk where errors have a low cost as with similar products on a shopping site
  2. High false-positive risk where false-positives have a high cost but false-negatives have moderate to low cost as with merging customer databases
  3. High false-negative risk where false-negatives have a high cost but false-positives have moderate cost as with anti-money laundering
  4. High risk where both false-positives and false-negatives have a high cost such as with the FBI hunting someone down who is sending anthrax in the mail

With low risk the more data the merrier.  The good will drown out the bad with traditional machine learning techniques.

For high false-positive risk you need to be extremely careful and manually review those records which have even a moderate probability of representing different entities.  Thankfully, as the cost of keeping duplicates around is low, you can start with combining databases in a straightforward manner and then slowly work on merging records over time as an iterative process from most probable matches to least.

The high false-negative risk style problem is much more challenging.  Consider that you have two datasets of size N and M.  The number of all possible pairs of records is then N * M, which would be far too many records for manual review, and low-cost Mechanical Turk style reviewers do a poor job at finding needles in haystacks.  So one approach, similar to high false-positive risk, is to order your matches by probability but also include a measure of quantified false-negative risk on a per-record match basis.  You must also go to extreme lengths to find matches here, for example by building algorithms which understand how different cultures use, write and pass down names.  It’s a constant struggle to further refine the quality of your results while pushing down the false-negative rate.
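
To give a feel for the N * M blow-up and the "most probable first" review queue, here is a deliberately naive Python sketch. Plain string similarity from the standard library's `difflib` stands in for a real match probability, the names and the 0.6 threshold are invented, and it makes no attempt to quantify false-negative risk or to understand naming conventions.

```python
from difflib import SequenceMatcher
from itertools import product


def candidate_pairs(left, right):
    """Score every cross-dataset pair: len(left) * len(right) comparisons."""
    for a, b in product(left, right):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        yield score, a, b


def review_queue(left, right, threshold=0.6):
    """Pairs at or above the threshold, most probable matches first."""
    scored = [p for p in candidate_pairs(left, right) if p[0] >= threshold]
    return sorted(scored, reverse=True)


customers = ["Jon Smith", "Maria Garcia"]
watchlist = ["John Smith", "Mario Garza", "Wei Chen"]

total_pairs = len(customers) * len(watchlist)
queue = review_queue(customers, watchlist)
```

Two records against three is only six comparisons, but the same loop over two datasets of a million records each would be 10^12, which is why exhaustive manual review is off the table and the ordering of the queue matters so much.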

High risk is the hardest problem of all and is better left mostly unautomated at this point.  Perhaps you could use a high false-negative risk style approach to glean candidates, but you’ll still need a lot of intelligently applied elbow grease and a large team to get there.

These gross categories don’t take into account other factors like data quantity, class size imbalances, lossy data formatting, input errors, and poor data management.  I’m afraid that until the singularity no magic bullet will solve this problem.  For getting started my best recommendation is John Talburt’s Entity Resolution and Information Quality.  Unlike many books on the subject it’s quite accessible to non-academics.

(experimental affiliate link to fund my book habit)

Dec 12

My Education in Machine Learning via Coursera, A Review So Far

As of today I’ve completed my fifth course at Coursera, all but one being directly related to Machine Learning. The fact that you can now take classes given by many of the most well known researchers in their field, who work at some of the most lauded institutions, for no cost at all is a testament to the ever growing impact that the internet has on our lives.  It’s quite a gift that these classes started to become available at right about the same time that Machine Learning demand started to skyrocket (and at right about the same time that I entered the field professionally).

Note that all effort estimations include the time spent watching lectures, reading related materials, taking quizzes and completing programming assignments.  Classes are listed in the order they were taken.

Machine Learning (Fall 2011)
Estimated Effort: 10-20 Hours a Week
Taught by Andrew Ng of Stanford University, this class gives a whirlwind tour of the traditional machine learning landscape. In taking this class you’ll build basic models for Regression, Neural Networks, Support Vector Machines, Clustering, Recommendation Systems, and Anomaly Detection. While this class doesn’t cover any one of these topics in depth, this is a great class to take if you want to get your bearings and learn a few useful tricks along the way. I highly recommend this class for anyone interested in Machine Learning who is looking for a good place to start.

Probabilistic Graphical Models (Spring 2012)
Estimated Effort: 20-30 Hours a Week
Taught by Daphne Koller of Stanford University, who de facto wrote The Book on Probabilistic Graphical Models (weighing in at 1280 small print pages). This class was a huge time investment, but well worth the effort. Probabilistic Graphical Models are the relational databases of the Machine Learning world in that they provide a structured way to represent, understand and infer statistical models. While Daphne couldn’t cover the entire book in a single class, she made an amazing effort of it. After you complete this course you will most definitely be able to leverage many different kinds of Probabilistic Graphical Models in the real world.

Functional Programming Principles in Scala (Fall 2012)
Estimated Effort: 3-5 Hours a Week
Taught by Martin Odersky, who is the primary author of the Scala programming language. I entered this class with an existing strong knowledge of functional programming and so I’d expect this class to be a bigger time investment for someone who isn’t quite as comfortable with the topic. While not directly related to Machine Learning, knowledge of Scala lets you leverage distributed platforms such as Hadoop and Spark, which can be quite useful in large scale Entity Resolution and Machine Learning efforts. Also related, one of the most advanced frameworks for Probabilistic Graphical Models is written in and designed to be used from Scala.  In taking this class you’ll most certainly become proficient in the Scala language, but not quite familiar with the full breadth of its libraries. As far as functional programming goes, you can expect to learn quite a bit about the basics such as recursion and immutable data structures, but nothing so advanced as co-recursion, continuation passing or metaprogramming. Most interesting to me was the focus on beta reduction rules for the various language constructs.  These together loosely form a primer for implementing functional languages.

Social Network Analysis (Fall 2012)
Estimated Effort: 5-10 Hours a Week
Taught by Lada Adamic of the University of Michigan. Social Network Analysis stands out from the others as I’ve never been exposed to anything quite like it before. In this class you learn to measure various properties of networks and several different methods for generating them. The purpose in this is to better understand the structure, growth and spread of information in real human social networks. The focus of the class was largely on intuition and I was a bit unhappy with the sparsity of the mathematics, but this certainly makes it a very accessible introduction to the topic. After completing this class I guarantee you’ll gain new insight into your corporate structure and will see your Twitter network in a whole new way. If I were to pick one class from this list to recommend to a friend no matter their background or interest, this would be it.

Neural Networks for Machine Learning (Fall 2012)
Estimated Effort: 10-30 Hours a Week
Taught by Geoffrey Hinton of the University of Toronto, who is a pioneer and one of the most well respected people in his field. Note that going in you’ll be expected to have a strong working knowledge of Calculus, which is not a prerequisite for any of the other classes listed here. I had hoped that this class would have been as worthwhile as the Probabilistic Graphical Models course given its instructor, but sadly it was not. Regretfully, I can only say that this class was poorly put together. It has a meager four programming assignments, the first two of which follow a simple multiple choice coding formula, and the following two of which are unexpectedly much more difficult in requiring you to both derive your own equations and implement the result of that derivation. It was extremely hard to predict how much time I would need to spend on any given assignment.  Having already learned to use Perceptrons and simple Backpropagation in Andrew Ng’s class, the only new hands-on skill I gained was implementing Restricted Boltzmann Machines. To be fair, I did acquire quite a bit of knowledge about the Neural Networks landscape, and Restricted Boltzmann Machines are a core component of Deep Belief Networks.  However, looking back at the sheer quantity of skills and knowledge I gained in the other classes listed here, I can’t help but feel this class could have been much better.