July, 2013


22
Jul 13

The Promise of F# Language Type Providers

In most software domains you can safely stick with one or two languages and, because the tools you are using are fairly easy to replicate, you’ll find almost anything you might need to finish your project. This isn’t true in data science and data engineering however. Whether it be some hyper-optimized data structure or a cutting edge machine learning technique often you only have a single language or platform choice.

Even worse, when you want to build a system that uses one or more platform specific components, things can become quite an engineering mess. No matter what you do you can’t avoid the high cost of serialization and marshaling. This makes some combinations of tools non-options for some problems. You often make trade-offs that you shouldn’t need to make, for example using a worse algorithm just because the better option hasn’t been written for your platform.

In .NET this is a particularly bad problem. There are quite a few dedicated people working on open source libraries, but they are tiny in number compared to the Matlab, Python, R or Java communities. Meanwhile, Microsoft research has several fantastic libraries with overly restrictive licenses that make them impossible to use commercially. These libraries drive away academic competition, but at the same time can’t be used outside of academia. It’s a horrible situation.

Thankfully, there is a silver lining in this dark cloud. With the release of F# 3.0 in VS 2012 we were given a new language feature called Type Providers. Type Providers are compiler plugins that generate types at compile time and can run arbitrary code to do it. Initially, these were designed for access databases and getting types from the schema for free, but when Howard Mansell released the R Language Type Provider everything changed. We now realized that we had a way to build slick typed APIs on top of almost any other language.

This means that now it doesn’t matter if someone has written the algorithm or data structure for our platform as long as there’s a Type Provider for a platform where it has been done. The tedious work of building lots of little wrapped sub-programs is completely gone. It shouldn’t even matter if the kind of calculation you’d like to do is fast on your native platform, as you can just transparently push it to another. Of course, we still must pay the price of marshaling, but we can do it in a controlled way by dealing in handles to the other platform’s variables.

The language Type Providers themselves are a bit immature for the moment but the idea is sound and the list is growing. There is now the beginnings of an IKVM Type Provider (for Java) and I’m working on a Matlab Type Provider. The Matlab Provider doesn’t yet have all of the functionality I am aiming for, but I’ve been working on it for several months and it’s quite usable now. All that’s left is for someone to start in on a Python type provider and we’ll practically have all of the data science bases covered.

It’s an exciting time to be an F#’er.


18
Jul 13

Come join me at the SkillsMatter F# Tutorials NYC 2013

Last year was our first NYC F# tutorials and they were just amazing (you can read about them here) but this year’s are going to be even better. We’ve got a lineup including some of the most talented teachers in the F# community, and the tickets are extremely inexpensive as conferences and training events go.

Looking to learn F#? Our beginner track is jam packed with hands on exercises. It’s was amazing to see what just two days of training can do. A C# co-worker of mine was a beginner track attendee last year and delivered a project in F# just the next week.

Already have some serious F# skills? In our advanced track we’ve got a lineup that will push those skills to the limit. I personally am particularly excited to dig into the F# compiler with Don and Tomas.

Now that I’ve had my say, here’s the official spiel:

On the back of the success of the 2013 edition, the Progressive F# Tutorials return to New York in September – this time packing an even bigger punch! With F# UG lead Rick Minerich at the helm, we’ve put together a expert filled line-up – featuring Don Syme (creator of F#), Tomas Petricek, and Miguel de Icaza. The Tutorials will be split in two – a beginners track for those eager to unleash F#’s full power, and a ‘meaty track’ for those more experience f#pers amongst you! Each session will be a 4 hour hands-on deep dive, brushing aside the traditional format of conferences to allow you to truly immerse into the subject topic.

Want to get involved? We’re giving a special community 20% discount! Just go ahead and enter SkillsMatter_Community on the booking form and the team at Skills Matter will look forward to welcoming you to NYC this September!

Check out our schedule.
Purchase tickets.
Read about last year’s tutorials.

Are you as excited as I am yet?


11
Jul 13

In Retrospect: QCon NYC 2013 (and a conversation with Rich Hickey on languages)

QCon NYC was the most refreshing conference I’ve been to in a very long time. Perhaps it’s partially because I’ve lingered too long in Microsoft circles, or maybe it’s just been too long since I went to a large conference. In any case, the speaker lineup was just chock full of brilliant minds from all over the world. I am honored to be counted among such an illustrious lineup.

Click for a video recording of the talk.

Click for a video recording of my talk.

My talk was well received, but the title wasn’t as descriptive of the content as I would have liked. It’s quite a challenge titling a talk six months in advance. Perhaps I should have called it something like “One language to rule them all”, or “Language de jour” but I’m not sure either of those would have gone over quite as well on the polyglot track.

Runar, Rich, and Rick

Left to right: Runar, Rick and Rich.
Paul Snively is behind the camera.

While the average quality of the talks was far and above what I’m used to at most of the conference I’ve attended, both in entertainment value and content, as usual the interspersed deep conversations were far and away the most rewarding. Of all of those deep conversations the one that stands out most in my mind was when Rich Hickey sat down with Runar Bjarnesson, Paul Snively and I for dinner. We talked quite a bit about his Datomic project, agreed on the power of immutability, and eventually discussed our differing philosophies on types.

I have immense respect for Rich Hickey. He’s a brilliant man and is almost solely responsible for kindling my interest in functional programming. He’s had a huge influence in creating the programmer that I am today, and I count him among my heroes. Now, the only case I’ve ever found myself disagreeing with him is his opinion on types and so I couldn’t help myself. With a bit of trepidation, I broached the subject. It’s funny that something so technical can be so difficult to talk about, but because we are all so passionate about our viewpoints I know we all had to be quite careful to phrase things so as not to inflame the tension.

What I learned is that Rich Hickey and I don’t disagree nearly as much as I thought. His main point was that the glue of a program shouldn’t know anything about what it’s gluing, much like a Fedex truck wasn’t designed with the contents of the boxes it carries in mind. I also tend to design programs in this way, but lean heavily on reflection to do it instead of using a dynamic language.

Even a month later, Runar’s main point of contention still seems unanswered: do generic types count as the truck being designed with the contents of the box in mind? You can argue either way here. On one hand, the code certainly knows about some of the properties of what’s in the box (for example, does it fit on the truck?), how tightly these properties constrain depends quite a bit on the language in question and its type features of course. This is actually quite useful because it keeps you from attempting to do something like putting a steamboat into your Fedex truck. The properties of the Fedex truck and the boxes it can hold must be respected.

On the other hand, you may often find yourself in a situation where your abstraction is overly limiting and the only recourse is to make changes to the type structure of the existing program in order to extend it. I think this is what Rich was getting at, and it’s true. For a true decoupled program (that is, no extra shared dependencies between sub-components) you need one of three things: 1) a meta reflection layer, 2) a dynamic language or 3) a very liberally defined type structure. In the third case it’s just extra work, with perhaps a negligible tangible benefit in terms of program safety.

In either case, the post-compilation/interpretation program eventually knows what’s in the box, it’s more of a question of when: at compile time, or when the box is first touched. Perhaps this is where the metaphor breaks down, or perhaps I’m just over thinking it. In any case it’s been a while since I reevaluated my hard-line views on types, and I’m grateful to Rich for sitting down with us and providing the impetus. After all, in my own words right from my QCon talk, it’s all about context.


3
Jul 13

On Type Safety, Representable States and Erlang

Close your eyes and imagine your program as a function that takes a set of inputs and produces a set of outputs. I know this may seem overly simple, but a set of actions in a GUI can be thought of as a set of inputs, and a set of resulting side effects to a database can be seen as a new state of the world being returned.

Now focus on its input space. This space is comprised as all possible combinations of all possible inputs. In this set some will be well defined for your program and some not. An example of a not well defined input could be as simple as an incorrect database connection string, as straightforward as an incorrect combination of flags on a console application, or as difficult to detect as a date with month and day transposed.

Input Space

A program thought of in this way is a fractal-like thing, a program made of little smaller programs, made of smaller programs yet. However, there’s no guarantee that each of these smaller programs will treat of a piece of data in exactly the same way as others. In addition to any initial validation, any top-level inputs which cause other inputs to be given to sub-programs where they are not properly handled are similarly considered not well defined. Consider these three approaches to making your program safer by reducing the size of incorrect input space:

First, you can increase the size of the blue circle with explicit input checking. This means numerous validations to ensure the program exits with proper notification when incorrect inputs are given. However, the program is fractal, and so if we want to be safe we’ll need to reproduce many of these checks fractally. A great example of this is handling null values and we all know how that turns out.

Another approach is to shrink the size of the red circle. We can do this by making fewer incorrect states representable with types. Because we know that all of the potential states are valid once encoded, we only need to do our checks once while marshalling our data into a well-typed representation. This eliminates almost all need for repeated validation, limited only by how far your type system will take you. Even better, with newer language features (such as F# type providers) we can eliminate much of this marshalling phase, this is however similarly limited by how far the schema of the data will take you.

A third approach, available only in some situations but which I find extremely fascinating, is to build everything in such a way as the entire program logs the error and resets its state when an incorrect input is found. Most paradoxically, in this case the more fragile you make your system, the safer it is (as long as you ensure that external state changes are the very last thing done, and that they’re done transactionally). This seems to be the Erlang philosophy and the only flaw I can find with it is shared with most type systems. That is, you can’t implicitly account for ambiguous inputs or state spaces that your type system can’t constrain.