In the last post we saw how to stage parser combinators. We added alternation and concatenation, and CPS-encoded the parse result datas tructure to avoid intermediate allocations for it.
One important thing we didn’t do yet is recursion. Indeed, many data formats (JSON, programming languages, serialization formats) have a nested, self-similar structure. And as we are staging parser combinators, we should be able to stage recursive parsers too. Which is what we will do in this post. For once, we will slightly steer away from our principle of doing everything as a library, and use some minimal LMS internals.
But first, a cute little exercise, thanks to Alex, in which we explore recursion a bit. We all know how to write the Fibonacci function:
The problem with this implementation is that it computes many values more than
once. For example, here is the trace for fib(4)
fib(4)
↪ fib(3) + fib(2)
↪ fib(2) + fib(1) + fib(2)
↪ fib(1) + fib(0) + fib(1) + fib(2)
↪ ...
We compute fib(2)
twice, which is not necessary. We could for instance memo-ize
results as we go instead.
Exercise 1: Suppose you are given a cache in scope. Reimplement fib
so that
it memo-izes intermediate results.
Now we may have more than one function that we would like to memo-ize. It is tedious to have to create a cache for each such function. Could we do better?
Exercise 2: Define a function memo
that takes a function (Int => Int
) and
returns a memo-ized version of it. You can use caches if need be. Reimplement fib
using memo
.
It’s probably easier to implement backwards. The real trick lies in implementing
fib
using memo
. Following the types, we can get to something like:
And then we realize that we need to refer to fib
in the expression we pass to
memo
: after all, we are trying to implement a recursive function. But, and here
lies the trick, we need to refer to the same fib
that we declare. And for that
we cannot declare fib
using a def
. It has to be the same function value,
and def
creates a new value at every invocation. And we therefore have few options
other than resorting to laziness:
Indeed, thanks to laziness, the first time we use fib
a reference for it will
be created and then properly bound to further uses of fib
. As a result any
internal structures that memo
uses will be bound to that reference. And that
should definitely give you enough info to implement memo
now!
So what is the problem with recursion and staging? Seems that it should work out of the box? Well, consider the following recursive, staged parser, which parses digits and adds them up:
In a staged setting, adder
is a code generator for a parser. So every time we
encounter it we generate code for it. Being recursive, this process will last forever,
theoretically. In practice we get stack overflows. Neither are desirable, of course.
For recursive parsers, we need to generate two different types of code for the same parser:
We need to remember whether we have encountered a recursive parser, and we can use
memo-ization for that! Just like the memo
function above, we define a rec
combinator which wraps a recursive parser. It is similar to the fix combinator.
Our example now becomes:
One can see the use of lazy vals as a drawback, but it turns out that it is a common pattern even for standard parsers. For example packrat parsers need to be declared as lazy vals to function correctly [1].
The implementation of rec
turns out to be very close to that of memo
:
Some important points:
Parser
instances to LMS IR symbols. The idea is
to generate a fresh function symbol the first time we see a parser, and bind
the generated function to it.Rep[Input => ParseResult]
, i.e. it is
a staged function (exactly what we need). We also use a struct representation
for parse results, as functions also act as join points.doLambdaDef
function reifies an unstaged function: it converts a
Rep[T] => Rep[U]
to a Rep[T => U]
, and is available my mixing in the Functions
traits.createDefinition
function is also LMS IR code that binds a symbol to a
Rep
.So that’s it! We have now added recursion to our staged parser combinators. Before we conclude, there are a few points worth discussing.
Alternatives to staging recursion
It might be a bit disappointing that we have to use an explicit rec
combinator
because the information about recursion seems directly available anyway. A possibility
would be to use macros or reflection to inspect the structure of a parser, and add
the rec
combinator at compile time. But if we use macros, we could directly do
partial evaluation with them, and just leave recursive functions untouched. This
is what Eric does in his parser combinators and
macros work[2].
A note about laziness
In this post we saw that lazy vals are very useful for late binding of recursive constructs. This is another classic functional programming trick. In a language like Haskell, because everything is lazy by default we don’t realize it as much, but in Scala we are required to actively think about it.
We also need to be a bit careful with definitions of our combinators, in order
to propage the laziness correctly. This is why, for example, the sequencing combinator
~
takes a by-name parameter:
If it didn’t, the value of the right hand side would be computed when the expression is created, and in a recursive case this blows up.
The code used in this post can be accessed through the following files:
If you want to run this code locally, please follow instructions for setup on
the readme here. The sbt
command to
run the test is test-only stagedparsec.RecParsersSuite