Collection Pipelines - The Revenge Of C
What do Ruby’s Enumerable module, .NET LINQ, and the new Java Stream package have in common? They are all there to facilitate functional programming. Here’s an example, pulled from Java 8’s docs because its support is the most recent.
int sum = widgets.stream()
                 .filter(b -> b.getColor() == RED)
                 .mapToInt(b -> b.getWeight())
                 .sum();
Chances are, you’ve been seeing a lot of this sort of code recently. I’m sure you’ll see more of it.
To me, it’s a very attractive style. It nearly eliminates loops and explicit conditionals from programming. Most routine calculations can be done with a series of filters, maps, and reductions in a single chain. When you write code this way it looks declarative, and once you get a sense of the common operations, it reads very well. I’d argue, though, that this is not just functional programming - it’s something more severe.
Here’s some Ruby code that uses the collection pipeline style:
def class_months es, class_name
  es.select {|e| e.class_name == class_name }
    .map {|e| e.date.month_start }
    .uniq
    .sort
end

def percent_active es, class_name, upto_date = Time.now.month_start
  range = class_months(es, class_name)
  range.count.to_f / month_range(range.first, upto_date).count * 100.0
end
The class_months function has a nice pipeline of operations from the es variable - select, map, uniq, and then sort. The percent_active function calls class_months and saves its result to a variable called range. The range variable is used twice in the final expression of that function - it uses range.count and range.first.
Is there any way that we could write percent_active in pipeline style? There might be a way, but it isn’t coming to me right now. There are two issues. One is that the primary computation is a division of two scalars. The second is that those scalars both depend on a temporary variable - range. If we eliminate the temporary, we end up calling class_months twice.
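To make that cost concrete, here is a minimal sketch of the temp-free version. The Event struct, the sample data, and the bodies of month_start and month_range are assumptions made for illustration, since the original code doesn't show them; month_start is also written as a plain function here rather than a method on the date. Notice that class_months appears twice in percent_active.

```ruby
require 'date'

# Hypothetical event record; the original only implies class_name and date.
Event = Struct.new(:class_name, :date)

# Assumed helper: first day of the month containing the given date.
def month_start(date)
  Date.new(date.year, date.month, 1)
end

# Assumed helper: every month start from first through last, inclusive.
def month_range(first, last)
  (first..last).select { |d| d.day == 1 }
end

def class_months(es, class_name)
  es.select { |e| e.class_name == class_name }
    .map { |e| month_start(e.date) }
    .uniq
    .sort
end

# No `range` temporary, so the whole class_months pipeline runs twice.
def percent_active(es, class_name, upto_date = month_start(Date.today))
  class_months(es, class_name).count.to_f /
    month_range(class_months(es, class_name).first, upto_date).count * 100.0
end

events = [
  Event.new('Order',  Date.new(2014, 1, 15)),
  Event.new('Order',  Date.new(2014, 1, 20)),
  Event.new('Order',  Date.new(2014, 3, 2)),
  Event.new('Refund', Date.new(2014, 2, 9))
]

# Two active months (Jan, Mar) out of four (Jan through Apr).
puts percent_active(events, 'Order', Date.new(2014, 4, 1)) # prints 50.0
```

The result is the same as the version with the temporary; the only thing lost is the work of filtering, mapping, and sorting the events a second time.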
In my programming, I’ve pushed hard at the collection pipeline style and I’m continually surprised by how much I can do with it, but I occasionally find cases where I have to drop out of the style and use temps to avoid redundant computation. The thing is, I wish I didn’t have to. And it’s a totally irrational wish. The fact of the matter is that every language that supports this style gives us a facility for temporaries. In Haskell, you can write code in point-free style, but you can still bind temps from previous computations into your chain. In fact, it’s very common.
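As one illustration of binding a temp inside a chain, Ruby 2.6 and later provide Object#then (yield_self in 2.5), which passes the receiver to a block and returns the block's result. The sketch below is only that, a sketch: month_start and month_range are assumed helper implementations, and the function takes plain dates rather than events to stay short. The block parameter is still a temporary, just one scoped to a single link of the chain.

```ruby
require 'date'

# Assumed helper: first day of the month containing the given date.
def month_start(date)
  Date.new(date.year, date.month, 1)
end

# Assumed helper: every month start from first through last, inclusive.
def month_range(first, last)
  (first..last).select { |d| d.day == 1 }
end

# Hypothetical variant of percent_active that stays in one chain.
# `then` binds the sorted month list to `range` for the final division.
def percent_active_chained(dates, upto_date)
  dates.map { |d| month_start(d) }
       .uniq
       .sort
       .then { |range| range.count.to_f / month_range(range.first, upto_date).count * 100.0 }
end

dates = [Date.new(2014, 1, 15), Date.new(2014, 1, 20), Date.new(2014, 3, 2)]

puts percent_active_chained(dates, Date.new(2014, 4, 1)) # prints 50.0
```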
The one place that you can’t do this as naturally is in Unix pipelines, and they were one of the inspirations for pipeline style. When you work in a Unix shell, you pipe the output of one command into another, and the typical case is to line up all of your commands with pipes between them, then either view or save the output. Anyone who has done any pipelining in Unix knows that there are problems that lend themselves to one-line solutions and problems that don’t.
What I find intriguing is that collection pipeline style seduces us, or at least me, into going for it. In C-derived languages, we use a dot to chain together calls. It’s easy, and it’s tempting. Stopping one chain to create a temp and start another chain doesn’t feel like it should be necessary. I don’t feel this way in Clojure or Haskell, or in ML-derived languages like OCaml and F#. In those languages, bindings, with or without let-expressions, are natural and common.
Richard Gabriel has an essay in Patterns of Software called ‘The End of History and the Last Programming Language’ in which he and Guy Steele argued that programming language evolution follows this rule - each new wildly successful language must look like C and be more dynamic than its predecessor. Incredibly, they had this insight in the 90s, before Java and JavaScript were invented(!). Syntax is viral and we’re pretty much stuck with C syntax in this industry. Yes, there are outliers like Ruby and Python, but which languages have the largest developer bases? C, C++, Java, C#, JavaScript, and a whole host of minor players that have aped C syntax.
I’m going to keep writing my pipelines and I’m going to keep wondering why I have to break their flow to introduce variables for reasons other than readability. I’m going to wonder why all programs can’t be single pipelines with calls out to functions that are pipelines as well. While I’m doing that, I’ll also be thinking about how close this style is to vector functional languages like APL and J. But that will be the subject of another blog post.
For those concerned that C++ is being left behind in this, here's a link to some work to add collection pipeline support there as well: http://jscheiny.github.io/Streams/