Literate Chains for Functional Programming
One of the clear benefits of functional programming is composability. It’s easy to take fundamental operations like maps, filters, and folds, and build up computations that are both powerful and concise. But sometimes concision can come at the expense of readability.
This Ruby code creates a data structure that is used to generate the next word in a Markov Chain based upon input text. It’s short and sweet, but also more than a little opaque.
stems = ARGF.read.split.each_cons(2).group_by {|word_pair| word_pair[0] }
I’ll walk you through it.
- The first thing that we do is take a list of files from the command line and concatenate their contents into a single string as they are being read. The read function on ARGF does this.
- The split function on strings produces an array of whitespace-delimited strings (let’s call them words) from the string we just produced.
- Our use of the each_cons function gives us each word paired with its successor. For example, if we start with ["The", "rain", "in", "spain."], an each_cons of 2 will produce a list of those pairs: [["The", "rain"], ["rain", "in"], ["in", "spain."]].
- Our call to group_by produces a hash with the first element of the pairs as the key. The value for each key is a list of all of the pairs that have that key as the first element.
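To make the steps above concrete, here is a small sketch that runs the same chain on an in-memory string rather than on ARGF. The sample sentence is my own, and the last line shows one way the resulting hash might be used to pick a next word; that usage is an assumption, since the post does not show the generation step.

```ruby
# Same chain as the one-liner, split into stages so each value is visible.
text = "The rain in spain. The rain falls."   # assumed sample input

words = text.split
# => ["The", "rain", "in", "spain.", "The", "rain", "falls."]

word_pairs = words.each_cons(2).to_a
# => [["The", "rain"], ["rain", "in"], ["in", "spain."],
#     ["spain.", "The"], ["The", "rain"], ["rain", "falls."]]

stems = word_pairs.group_by { |word_pair| word_pair[0] }
# stems["The"]  => [["The", "rain"], ["The", "rain"]]
# stems["rain"] => [["rain", "in"], ["rain", "falls."]]

# Assumed usage: choose a random pair for the current word and take its
# second element. "in" has a single candidate here, so the result is "spain."
next_word = stems["in"].sample.last
```

Note that the last word of the input ("falls.") never becomes a key, because it has no successor.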
We are doing a lot of work in that single line of code. If we understand how each of these functions works, we can decipher it easily, but it would be nice if we had a bit more documentation.
In Haskell and F#, programmers often use type signatures to help readers “see” the data structures they are producing and consuming. Every function can have an annotation that shows the structure of its interface. The only downside is that we have to break up concise chains of computation in order to have those documentable functions.
Another avenue we can take is to introduce explaining variables so that we can name our intermediate results.
text = ARGF.read
words = text.split
word_pairs = words.each_cons(2)
stems = word_pairs.group_by {|word_pair| word_pair[0] }
This is ok but we are still breaking up our computation far too finely.
Recently, I’ve been adopting another tack. I have a function in Ruby that I call c. I’ve added it to the Kernel module. Essentially, it is a no-op that merely returns self and swallows a string. Here’s how I use it.
stems = ARGF
  .read .c('all text')
  .split .c('space delimited "words"')
  .each_cons(2) .c('successive word pairs')
  .group_by {|e| e[0]} .c('lists of word pairs by leading words')
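The post describes c but does not list its definition. A minimal sketch that matches the description (a method on Kernel that swallows a string and returns self) might look like the following; the parameter name and the small usage example are my own.

```ruby
# Hypothetical definition of c, based only on the description in the text.
module Kernel
  # Accept a comment string, ignore it, and return the receiver so the
  # surrounding method chain continues unchanged.
  def c(_comment)
    self
  end
end

# Usage on an in-memory array instead of ARGF.
pairs = %w[The rain in spain.]
  .each_cons(2) .c('successive word pairs')
  .to_a .c('materialised as an array')
# pairs => [["The", "rain"], ["rain", "in"], ["in", "spain."]]
```

Because Kernel is mixed into Object, the method is available on strings, arrays, enumerators, and hashes alike, which is what lets it sit after every link in the chain.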
Superficially, it doesn’t seem that there is any advantage to doing things this way over having explaining variables, but there is. We can use c as a hook to log or display intermediate data structures for a piece of code. I do this by introducing a wrapper when I want the insight:
show do
  stems = ARGF
    .read .c('all text')
    .split .c('space delimited "words"')
    .each_cons(2) .c('successive word pairs')
    .group_by {|e| e[0]} .c('lists of word pairs by leading words')
end
The show method temporarily overrides c so that it shows the comment text we pass it as well as the current value of self (usually an array or an enumeration). I’m essentially using c as a lightweight Object#tap. Insight achieved.
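The implementation of show is not given, so what follows is only one way it could be written. It assumes c lives on Kernel as described, swaps in a verbose version for the duration of the block, and restores the quiet one afterwards. The optional out parameter is my own addition so that output can be redirected.

```ruby
# Hypothetical sketch of show; the post describes it but does not list it.
module Kernel
  # Quiet default: swallow the comment and return the receiver.
  def c(_comment)
    self
  end

  # Run the block with a verbose c that prints each comment next to the
  # current value, then put the quiet c back (even if the block raises).
  def show(out = $stdout)
    quiet = Kernel.instance_method(:c)
    Kernel.send(:define_method, :c) do |comment|
      out.puts "#{comment}: #{inspect}"
      self
    end
    yield
  ensure
    Kernel.send(:define_method, :c, quiet)
  end
end

# Usage: prints one line per annotated step of the chain.
show do
  "The rain in spain."
    .c('all text')
    .split .c('space delimited "words"')
end
```

Replacing the method globally is simple but not thread-safe; that trade-off seems acceptable for a debugging aid that is only wrapped around code while you are inspecting it.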
I don’t claim that this is a perfect scheme or something that would fit everyone’s taste, but I’ve found it useful.