In most software domains you can safely stick with one or two languages and, because the tools you are using are fairly easy to replicate, you’ll find almost anything you might need to finish your project. This isn’t true in data science and data engineering, however. Whether it’s some hyper-optimized data structure or a cutting-edge machine learning technique, you often have only a single choice of language or platform.
Even worse, when you want to build a system that uses one or more platform-specific components, things can become quite an engineering mess. No matter what you do, you can’t avoid the high cost of serialization and marshaling. This makes some combinations of tools non-options for some problems. You often make trade-offs that you shouldn’t need to make, for example using a worse algorithm just because the better option hasn’t been written for your platform.
In .NET this is a particularly bad problem. There are quite a few dedicated people working on open source libraries, but they are tiny in number compared to the Matlab, Python, R or Java communities. Meanwhile, Microsoft Research has several fantastic libraries with overly restrictive licenses that make them impossible to use commercially. These libraries drive away academic competition, yet they can’t be used outside of academia. It’s a horrible situation.
Thankfully, there is a silver lining in this dark cloud. With the release of F# 3.0 in Visual Studio 2012 we were given a new language feature called Type Providers. Type Providers are compiler plugins that generate types at compile time and can run arbitrary code to do it. Initially, these were designed for accessing databases and getting types from the schema for free, but when Howard Mansell released the R Language Type Provider everything changed. We realized that we now had a way to build slick typed APIs on top of almost any other language.
This means that it no longer matters whether someone has written the algorithm or data structure for our platform, as long as there’s a Type Provider for a platform where it has been done. The tedious work of building lots of little wrapped sub-programs is completely gone. It shouldn’t even matter whether the kind of calculation you’d like to do is fast on your native platform, as you can just transparently push it to another. Of course, we must still pay the price of marshaling, but we can do it in a controlled way by dealing in handles to the other platform’s variables.
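To make the handle idea a little more concrete, here is a rough sketch. It does not use any real Type Provider. RemoteSession, Handle, Push, Map and Pull are all names I made up for illustration, and the "other platform" is just an in-memory dictionary. The point is only that data crosses the boundary once on the way in and once on the way out, while the intermediate steps work on handles.

```fsharp
// Hypothetical stand-in for a foreign runtime (R, Matlab, ...).
// None of these names come from a real Type Provider.
type Handle = Handle of string

type RemoteSession() =
    // Variables that live on the "other platform".
    let vars = System.Collections.Generic.Dictionary<string, float[]>()
    let mutable counter = 0
    let mutable marshaled = 0

    /// Number of array elements copied across the boundary so far.
    member this.ElementsMarshaled = marshaled

    /// Copy data to the remote side once and get a handle back.
    member this.Push(data: float[]) =
        counter <- counter + 1
        let name = sprintf "v%d" counter
        vars.[name] <- Array.copy data
        marshaled <- marshaled + data.Length
        Handle name

    /// Run a computation remotely; input and output both stay remote.
    member this.Map(f: float -> float, h: Handle) =
        let (Handle src) = h
        counter <- counter + 1
        let name = sprintf "v%d" counter
        vars.[name] <- Array.map f vars.[src]
        Handle name

    /// Copy a result back only when it is needed locally.
    member this.Pull(h: Handle) =
        let (Handle name) = h
        let data = Array.copy vars.[name]
        marshaled <- marshaled + data.Length
        data

let session = RemoteSession()
let input = session.Push([| 1.0; 2.0; 3.0 |])            // marshal in
let squared = session.Map((fun x -> x * x), input)       // stays remote
let shifted = session.Map((fun x -> x + 1.0), squared)   // stays remote
let result = session.Pull(shifted)                        // marshal out

printfn "%A (elements marshaled: %d)" result session.ElementsMarshaled
```

Two remote steps run here, but only the initial input and the final result are copied. A real provider would presumably hand you typed handles rather than an opaque string wrapper, but the cost model is the same.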
The language Type Providers themselves are a bit immature for the moment, but the idea is sound and the list is growing. There are now the beginnings of an IKVM Type Provider (for Java) and I’m working on a Matlab Type Provider. The Matlab Provider doesn’t yet have all of the functionality I am aiming for, but I’ve been working on it for several months and it’s quite usable now. All that’s left is for someone to start in on a Python Type Provider and we’ll practically have all of the data science bases covered.
It’s an exciting time to be an F#’er.