I recently took a medium-sized labeled feature set from work and ran it through some of the most popular machine learning platforms. The goal was to get a feel for each of them via the standard battery of regressions and to evaluate each for use in further experimentation. This is a review of my journey.
Features and Labels in a ~500 MB CSV file
Labeled Records: ~140,000
Features: ~3,500 binary; labels in the range [0, 100]
Hardware: 4 x 8 = 32 cores, 256 GB of RAM
OS: Windows Server 2008 R2
- F# with Math.NET -
I used F# to build the features for the rest of these processes. It was quite nice using the SQL Type Provider to grab the data out of the database and then process it into binary features, even though the data was spread across fourteen unoptimized tables in two SQL Server databases with rather odd relationships. I did this step by step, trying new features out on a hand-written iterative linear regression function that fit in a single line of F#. The syntax with Math.NET is almost exactly the same as Matlab, so it came quite easily. On top of that, the linear algebra was quite fast using Math.NET’s MKL Linear Algebra provider.
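For readers who haven’t seen one, an iterative linear regression of the kind mentioned above can be sketched as plain batch gradient descent. This is a Python illustration, not the original F# one-liner: the function name `fit_linear_gd`, the learning rate, the iteration count, and the toy data are all assumptions made for the example, and it uses plain lists rather than a linear algebra library so it runs without dependencies.

```python
def fit_linear_gd(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent for least-squares linear regression.

    X is a list of feature rows, y a list of targets.
    Returns (weights, bias). Hypothetical helper, for illustration only.
    """
    n_samples = len(X)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_iters):
        # Residual (prediction minus target) of the current model on every row.
        errors = [
            sum(w * x for w, x in zip(weights, row)) + bias - target
            for row, target in zip(X, y)
        ]
        # Gradient of half the mean squared error with respect to each weight.
        grad_w = [
            sum(e * row[j] for e, row in zip(errors, X)) / n_samples
            for j in range(n_features)
        ]
        grad_b = sum(errors) / n_samples
        # Step against the gradient.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Toy data (assumed): binary features, label = 10 + 30*x0 + 50*x1, no noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [10.0, 60.0, 40.0, 90.0]
weights, bias = fit_linear_gd(X, y)
```

On this noiseless toy problem the weights settle near 30 and 50 and the bias near 10. With thousands of features a matrix library would do the same updates far faster, which is where something like Math.NET’s MKL provider earns its keep.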
While Math.NET is under constant work by some really smart folks, it currently only supports a few non-iterative linear solvers with MKL. Iterative linear regression was easy enough to do by hand, but I wanted to try some of the more complex regressions without worrying about whether I had implemented them properly. Once I had my features sorted, it was obvious that it was time to move on.
- R 2.14.? -
R was easy to install and get running. It was nice to have the package manager built right into the console. However, from there on it was all downhill. Loading the data file took forever: approximately 10 minutes with the standard CSV machinery. Once it was loaded, it was just one out-of-memory error after another. I tried to run several regressions, but I wasn’t able to complete a single experiment, and many took quite a long time to fail. All signs point to poor garbage collection in the R runtime.
Blinded by my frustration, I ended up buying Revolution R, but it skirts the problem by using its own file-based format and supports only a handful of regressions on that format. I’m holding out hope that things will be better in R 3.0, as they’ve finally removed the 32-bit limit on vector length. Still, given the state of Python (see below), I don’t think there’s any compelling reason to revisit R at all.
- Matlab 2013a -
I already own the base Matlab 2013a and use it on a regular basis, but I wanted to avoid shelling out the $5000+ for the toolboxes needed for this project before making sure they could do what I wanted (and not choke on my data like R), so I requested a trial. It was quite an ordeal: I had to wait for an actual sales agent to call me on the phone, discuss what I was trying to do, and then ask multiple times for my license to be sent via email (they kept forgetting?). I swear I’ve never had such a difficult software customer experience before. I still don’t understand why I couldn’t just download a trial from their site.
In any case, right when we finally got it all sorted we experienced some technical difficulties with our server room overheating and had to have the beastly box relocated. Two months or so later my hardware is back up and running at a better location but my toolbox trials have long since expired. I’m currently exploring other options before I go back groveling for an extended trial.
- Scikit-learn via WinPython 64-bit 188.8.131.52 -
The hardest part of getting started was picking a Scikit distribution; there are at least three popular ones for Windows. I ended up going with WinPython because it was MIT licensed and I try not to bring the GPL into my workplace whenever I can avoid it. You’d never want GPL code to accidentally make its way into anything that leaves the building.
First impressions were great: the CSV file loaded in under 15 seconds with pandas, and it was quite a revelation that I could take a pandas table and just pass it into these scikit functions as if it were a matrix. Very slick. However, it’s not all roses. I spent a lot of my first day trying to figure out why the basic linear regression was giving nonsensical results. After some inspection, it looks like a numerical overflow somewhere in the depths is causing a few weights to become extremely large negative values. The rest of the linear models worked great, however.
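As a rough sketch of that workflow, the snippet below reads a CSV with pandas and hands the resulting table straight to a scikit-learn estimator. The column names, the tiny inline stand-in for the real file, and the choice of `Ridge` are all assumptions for illustration; it is not a record of which estimators were actually run.

```python
import io

import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical stand-in for the real file: binary feature columns plus a "label" column.
csv_text = """f0,f1,f2,label
0,0,1,12
0,1,0,55
1,0,0,41
1,1,1,93
0,1,1,61
1,0,1,44
"""

# For a file on disk this would be pd.read_csv("some_path.csv") instead.
frame = pd.read_csv(io.StringIO(csv_text))

# Split the table into a feature DataFrame and a label Series.
features = frame.drop(columns=["label"])
labels = frame["label"]

# The DataFrame goes into fit() and predict() directly, no manual conversion needed.
model = Ridge(alpha=1.0)
model.fit(features, labels)
predictions = model.predict(features)
```

`Ridge` is used here only because its L2 penalty keeps weights from growing without bound, which makes it a convenient regularized example next to the plain `LinearRegression`.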
Then, as I was full of momentum, I thought I’d give the SVM stuff a go, but it turns out that for some reason Scikit disables OpenMP for LibSVM, so it’s incredibly slow. After 24 hours or so of LibSVM puttering away at 3% overall CPU usage, I thought I’d just load up another Spyder instance and keep working while it chugged along. No such luck: you can only have one Spyder window open at a time.
In fact, I think Spyder is by far the weakest part of the Scikit offering. It’s not only limited to a single instance, it also has an odd tendency to lock up while the Python interpreter is busy, and the variable explorer ignores some variables; I’m not sure what that’s about. Also in the box is IPython Notebook, but it doesn’t seem to like the Internet Explorer that’s on the machine, and whatever solution we come up with has to eventually work in a completely locked-down environment with no internet connection, hopefully without any installed dependencies. Perhaps I’ll fare better with something like Sublime Text, but it is nice to have graphical variable inspection.
- Final Impressions -
If I were going to recommend a setup to someone getting started today, I’d say far and away the best choice is a Scikit distribution. It’s not without problems, but compared to the horrible mess that makes up the rest of the available options, it shines. I’d also suggest trying to find a different GUI than Spyder. It’s fine for playing around, but it’s far too janky to be considered reasonable for professional day-to-day use.