All Machine Learning Platforms are Terrible (but some less so)

I recently took a medium-sized feature set with labels at work and ran it through some of the most popular machine learning platforms. The goal was to get a feel for each of them via the standard battery of regressions and to evaluate each for use in further experimentation. This is a review of my journey.

Experimental Setup:
Features and labels in a ~500 MB CSV file
Labeled records: ~140,000
Features: ~3,500 binary; labels in [0, 100]
Hardware: 4 x 8 = 32 cores, 256 GB of RAM
OS: Windows Server 2008 R2

- F# with Math.NET -
I used F# to build the features for the rest of these processes. It was quite nice using the SQL Type Provider to grab the data out of the database and then process it into binary features, even though the data consisted of fourteen unoptimized tables across two SQL Server databases with rather odd relationships. I did this step by step, trying new features out on a hand-written iterative linear regression function that fit in a single line of F#. The syntax with Math.NET is almost exactly the same as Matlab's, so it came quite easily. On top of that, the linear algebra was quite fast using Math.NET's MKL linear algebra provider.
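The binary feature construction step can be sketched in Python with pandas (a hypothetical stand-in; the real pipeline used F# type providers against SQL Server, and the column names here are invented):

```python
import pandas as pd

# Hypothetical raw records; the real data came from fourteen SQL Server tables.
raw = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size":  ["S", "M", "S"],
    "label": [10, 55, 12],
})

# One-hot encode the categorical columns into binary (0/1) features.
features = pd.get_dummies(raw[["color", "size"]]).astype(int)
labels = raw["label"]
```

In practice each categorical value from the source tables becomes its own 0/1 column, which is how a handful of tables fans out into thousands of binary features.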

While Math.NET is under constant development by some really smart folks, it currently supports only a few non-iterative linear solvers with MKL. Iterative linear regression was easy enough to do by hand, but I wanted to try some of the more complex regressions without worrying about whether I had implemented them properly. Once I had my features sorted, it was obvious it was time to move on.
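For reference, the hand-rolled iterative scheme amounts to batch gradient descent on the squared error; a minimal NumPy sketch (variable names mine, data synthetic):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent for least-squares linear regression."""
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        # Step theta against the gradient of the mean squared error.
        theta -= (alpha / m) * (X.T @ (X @ theta - y))
    return theta

# Tiny sanity check: recover y = 2*x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(X, y, alpha=0.1, iters=500)
```

The update line is the whole algorithm, which is why it compresses to one line of F# with a decent linear algebra library.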

- R 2.14.? -
R was easy to install and get running, and it was nice to have the package manager built right into the console. From there on, however, it was all downhill. Loading the data file took forever: approximately 10 minutes with the standard CSV machinery. Once it was loaded, it was just one out-of-memory exception after another. I tried to run several regressions, but I wasn't able to complete a single experiment, and many took quite a long time to fail. All signs point to poor garbage collection in the R runtime.

Blinded by my frustration, I ended up buying Revolution R, but it skirts the problem by using its own file-based format and supports only a limited handful of regressions on that format. I'm holding out hope that things will be better in R 3.0, as they've finally removed the 32-bit memory limitation. Still, given the state of Python (see below), I don't think there's any compelling reason to revisit R at all.

- Matlab 2013a -
I already own base Matlab 2013a and use it on a regular basis, but I wanted to avoid shelling out the $5,000+ for the toolboxes needed for this project before making sure they could do what I wanted (and not choke on my data like R), so I requested a trial. It was quite an ordeal: I had to wait for an actual sales agent to call me on the phone, discuss what I was trying to do, and request multiple times that my license be sent via email (they kept forgetting?). I swear I've never had such a difficult software customer experience before. I still don't understand why I couldn't just download a trial from their site.

In any case, right when we finally got it all sorted, we experienced some technical difficulties with our server room overheating and had to have the beastly box relocated. Two months or so later, my hardware is back up and running at a better location, but my toolbox trials have long since expired. I'm currently exploring other options before I go back groveling for an extended trial.

- Scikit-learn via WinPython 64-bit 2.7.5.1 -
The hardest part of getting started was picking a Scikit distribution; there are at least three popular ones for Windows. I ended up going with WinPython because it is MIT-licensed, and I try not to bring the GPL into my workplace whenever I can avoid it. You'd never want GPL code to accidentally make its way into anything that leaves the building.

First impressions were great: the CSV file loaded in under 15 seconds with pandas, and it was quite a revelation that I could take a pandas table and just pass it into these scikit functions as if it were a matrix. Very slick. However, it's not all roses. I spent much of my first day trying to figure out why the basic linear regression was giving nonsensical results. After some inspection, it looks like a numerical overflow somewhere in the depths causes a few weights to become extremely large negative values. The rest of the linear models worked great, however.
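The pandas-to-scikit handoff looks roughly like this (synthetic data standing in for the real CSV, which would come from pd.read_csv; Ridge is shown since plain LinearRegression misbehaved here):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical stand-in for the ~500 MB file of binary features plus labels.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 10)),
                  columns=[f"f{i}" for i in range(10)])
df["label"] = rng.integers(0, 101, size=200)

# A DataFrame slots straight into scikit-learn as if it were a matrix.
model = Ridge(alpha=1.0)
model.fit(df.drop(columns="label"), df["label"])
preds = model.predict(df.drop(columns="label"))
```

No manual conversion to arrays is needed; scikit-learn treats the DataFrame as a plain numeric matrix.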

Then, full of momentum, I thought I'd give the SVM stuff a go, but it turns out that for some reason Scikit disables OpenMP for LibSVM, so it's incredibly slow. After 24 hours or so of LibSVM puttering away at 3% overall CPU usage, I thought I'd just load up another Spyder instance and keep working while it chugged along. No such luck: you can only have one Spyder window open at a time.
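For comparison, the LibSVM-backed estimator and a stochastic alternative look like this (SGDRegressor is my suggestion, not part of the original experiment; a linear model fit by SGD scales far better than kernel SVR, which is single-threaded and roughly quadratic in sample count):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor

# Small synthetic stand-in for the real binary feature matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 20)).astype(float)
y = rng.integers(0, 101, size=300).astype(float)

# LibSVM-backed kernel SVR: fine at this size, painful at 140k rows.
svr = SVR(kernel="rbf").fit(X, y)

# Linear model via stochastic gradient descent: one pass-friendly, large-data-friendly.
sgd = SGDRegressor(max_iter=1000, tol=1e-3).fit(X, y)
```

At the row counts in this experiment, trading the kernel for a linear model is usually the difference between hours and seconds.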

In fact, I think Spyder is by far the weakest part of the Scikit offering. It's not only limited in terms of instances; it also has an odd tendency to lock up while the Python interpreter is busy, and the variable explorer ignores some variables, I'm not sure what that's about. Also in the box is IPython Notebook, but it doesn't seem to like the Internet Explorer on the machine, and whatever solution we come up with has to eventually work in a completely locked-down environment with no internet connection, and hopefully without any installed dependencies. Perhaps I'll fare better with something like Sublime Text, but it is nice to have graphical variable inspection.

- Final Impressions - 
If I were going to recommend a setup to someone getting started today, I'd say far and away the best choice is a Scikit distribution. It's not without problems, but compared to the horrible mess that makes up the rest of the available options, it shines. I'd also suggest trying to find a different GUI than Spyder. It's fine for playing around, but it's far too janky to be considered reasonable for professional day-to-day use.

Enjoy this post? Continue the conversation with me on Twitter.


14 comments

  1. If you want to use the IPython Notebook but don’t want to use a browser they have a qt-based version: http://ipython.org/ipython-doc/stable/interactive/qtconsole.html.

  2. Hi Rick, thank you for this article – I’d really like to have a look at the code of the F# / Math.NET approach. Is it possible to post that setup here?

    • I’ll post it later today.

      • Vanilla iterative multiple linear regression looks like:

        let thetas X y alpha th =
            let iter (th: Vector) = th - ((X * th - y) * X * (alpha / float y.Count))
            th |> Seq.unfold (fun th -> let th' = iter th in Some (th', th'))

        Then just take Seq.nth n, where n is the number of iterations you'd like. You can easily fit it all on one line by moving iter into the unfold, but then it would look ugly in the comments.

  3. Hi,

    R is already at v3.0+, and there are at least 5 packages that deal successfully with exactly the issues you had.

    I would not write off R so quickly. have a look at

    ff
    bigmemory
    biglm
    biglars
    bigrf

    All the best

    • Oh, please don’t think I didn’t try some of those, but the options available are extremely limited, and they don’t take advantage of the R table format at all. I’m also fairly certain I gave it a stab and still got out of memory exceptions when I ran the regressions, but it was about three months ago and I didn’t keep notes that precise.

      Jumping through hoops for simple multiple linear regression is just silly though.

  4. Bohdan Szymanik

    Just curious, but how would SAS measure up here? I know very little about it but a number of my co-workers use it for risk modelling – unfortunately, they know virtually nothing about R, F# or python – so would be interested in other’s opinions.

  5. You can also try using Encog, Weka or RapidMiner.

  6. You might try http://accord-net.github.io/ The author is available for help as well. You can find some examples here: http://crsouza.blogspot.com/

  7. Great comparison. Would be even better if it included matlab but I guess you are making a good point there ;)

    In the Python / scikit-learn section, I think you wanted to say it was hard to pick a *Python* distribution in the first sentence, not a scikit-learn distribution, right?

    I never worked with Spyder, and I agree, having a Matlab-like IDE for this environment would be great. I think the IPython notebook is currently the best candidate, though it has a somewhat different goal.

    By the way: there is no OpenMP in upstream LibSVM. Having OpenMP in scikit-learn would be awesome, but hard, as it is highly platform-dependent (people build scikit-learn with gcc, clang, msvc, …)

    Cheers,
    Andy

  8. Please may I suggest that you try Mathematica too, if you have not yet tried that.

  9. Python simply has an insane number of packages doing machine learning. It's confusing at best. https://pypi.python.org/pypi?%3Aaction=search&term=machine+learning&submit=search

    Btw, did you try the Orange package in Python? It comes with a very nice GUI for machine learning. http://orange.biolab.si/

  10. [...] Richard Minerich blogged “All Machine Learning Platforms are Terrible (but some less so)“. [...]
