We previously observed that combining partial evaluation and CPS-encoded data structures enables us to achieve fusion: composing such structures at staging time evaluates away the intermediate data structures altogether.
In this post, we will explore another interesting application of this principle: parser combinators [1]. Essentially, through the use of staging, we will turn parser combinators into parser generators. The generated parsers are rid of the composition overhead of combinators, and therefore have good runtime performance [2].
This post borrows heavily from the lessons learnt in our previous adventure with fold-based fusion. I will therefore assume that the reader is familiar with the staging techniques we have used so far. I'd recommend reading the paragraph on LMS over here to get an idea.
What is a parser?
A parser is a program that reads some input (usually text) and converts it into a structured representation of this input. For example, suppose you are given a text file containing some information about tennis players:
firstname: "Roger", lastname: "Federer", grandslams: 18, currentranking: 2
firstname: "Rafael", lastname: "Nadal", grandslams: 14, currentranking: 8
firstname: "Novak", lastname: "Djokovic", grandslams: 8, currentranking: 1
If we want to operate on this data, we must first read it and convert it into a format that a program can handle. This is where a parser comes in. A possible conversion of the above data, in Scala, would be to a list of case classes that represent players:
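For instance, a `Player` case class along these lines could serve as the target representation (a sketch; the names are illustrative):

```scala
case class Player(
  firstName: String,
  lastName: String,
  grandSlams: Int,
  currentRanking: Int
)

// parsing the file above would then yield something like
// List(Player("Roger", "Federer", 18, 2), Player("Rafael", "Nadal", 14, 8), ...)
```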
Another classic, although meta, example of a parser is a program that reads another program and converts it into a tree representation.
How to write a parser
There are three traditional ways of writing parsers:

1. Write the parser by hand, as a program that directly reads input and builds the structured representation. This gives full control, but is tedious and error-prone for large grammars.
2. Use a parser generator. You may have noticed that even the text file above respects a structured formatting. In other words, it adheres to a certain grammar. An elegant way to write a parser is therefore to write down the grammar that corresponds to the structured format, and generate the "by hand" version from it. There are many popular parser generators available, such as yacc, ANTLR, and Happy.
3. Use parser combinators, where elementary parsers are composed into bigger ones using library functions. This is the approach we explore in this post.
Grammars and parsers do not always have a one-to-one correspondence. Some grammars are more powerful/general than others, and some parsing techniques are better suited to certain grammars. In this post we consider grammars that can be parsed in a left-to-right, recursive-descent manner.
At its heart, a parser is a function that takes an input state and produces a result. For simplicity, we consider the input to be an array of characters. The input state also contains some information regarding where in the text we currently are. Here is a possible type alias for a parser that parses elements of type `T`:
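A minimal sketch, in terms of the `Input` and `ParseResult` types introduced just below:

```scala
// a parser takes an input state and produces a parse result
type Parser[T] = Input => ParseResult[T]
```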
Before we go further, we need to take a quick look at what `Input` and `ParseResult` are.
Input
At its core, an input state knows whether it is at the end of the input, how to get the current element, and how to access the remaining input (excluding the current element). This API is typically known as a `Reader`:
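A sketch of such a `Reader` interface:

```scala
trait Reader[T] {
  def first: T          // the current element
  def rest: Reader[T]   // the remaining input, current element excluded
  def atEnd: Boolean    // have we consumed all the input?
}
```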
As we are working with arrays of characters as input sources, our base element is a `Char`.
Exercise 1: Implement the `Reader` API for a `StringReader`:
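A possible solution, backed by an array of characters and an offset into it:

```scala
case class StringReader(
  source: Array[Char],
  offset: Int = 0
) extends Reader[Char] {
  def first: Char = source(offset)
  def rest: StringReader = StringReader(source, offset + 1)
  def atEnd: Boolean = offset >= source.length
}
```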
ParseResult
A `ParseResult`, as its name suggests, encapsulates the result of parsing an input. It can either be a successful parse, in which case we have a result, or a failure, in which case we have some basic information as to where the parsing failed. Many parser combinator frameworks add different types of failures and extra information, such as error messages, to make the library more user-friendly. We will focus on the basics here:
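A minimal sketch, assuming a type alias `type Input = StringReader`:

```scala
sealed abstract class ParseResult[+T] {
  def isEmpty: Boolean
  def next: Input // where to continue parsing from
}

// a successful parse carries a result and the remaining input
case class Success[+T](result: T, next: Input) extends ParseResult[T] {
  def isEmpty = false
}

// a failed parse only remembers where it failed
case class Failure(next: Input) extends ParseResult[Nothing] {
  def isEmpty = true
}
```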
Exercise 2: Implement some usual suspects for `ParseResult`. You can use object-oriented style or pattern matching:
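For instance, `map`, in pattern-matching style, as a method on `ParseResult[T]` (a sketch):

```scala
// transform the result of a successful parse, leave failures untouched
def map[U](f: T => U): ParseResult[U] = this match {
  case Success(res, next) => Success(f(res), next)
  case Failure(next)      => Failure(next)
}
```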
Elementary Parser Combinators
Now that we understand what `Input` and `ParseResult` are, we can build our first parser, which succeeds if an element satisfies a certain predicate and fails if it does not. We take a similar design route to the one we took in the past:
- A `Parsers` trait, which acts as the context.
- A `Parser` class, which extends a function.
- A `Parser` companion object, which defines an `apply` method that acts as syntactic sugar.

Packed together, we get this code:
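Here is a sketch of how the pieces could fit together, in terms of the types defined above:

```scala
trait Parsers {

  type Input = StringReader
  type Elem = Char

  // a parser is a function from Input to ParseResult
  abstract class Parser[T] extends (Input => ParseResult[T])

  // syntactic sugar: Parser { in => ... }
  object Parser {
    def apply[T](f: Input => ParseResult[T]) = new Parser[T] {
      def apply(in: Input) = f(in)
    }
  }

  // succeeds if the current element satisfies the predicate `p`
  def acceptIf(p: Elem => Boolean): Parser[Elem] = Parser { in =>
    if (in.atEnd) Failure(in)
    else if (p(in.first)) Success(in.first, in.rest)
    else Failure(in)
  }
}
```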
So `acceptIf` is the most basic parser we can build. We of course first need to make sure that we haven't reached the end of the input.
Exercise 3: Implement some more elementary parsers that parse single characters, namely `digit` and `letter`, which successfully parse digits and letters. You are only allowed to use `acceptIf`. Then implement `digit2Int`, which converts a parsed digit into its integer representation.
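A possible solution, in a `CharParsers` trait:

```scala
trait CharParsers extends Parsers {

  def digit: Parser[Char] = acceptIf(c => c >= '0' && c <= '9')
  def letter: Parser[Char] = acceptIf(_.isLetter)

  // convert a parsed digit character into its integer value;
  // this becomes a one-liner once `map` is available (next section)
  def digit2Int: Parser[Int] = Parser { in =>
    digit(in) match {
      case Success(c, next) => Success(c - '0', next)
      case Failure(next)    => Failure(next)
    }
  }
}
```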
Note that these parsers live in a `CharParsers` trait, in which we know the elements we read to be characters.
The Parser API
It turns out that parser combinators are monads [3]! Apart from making us categorically happy, this means we can once again implement some usual suspects.
Exercise 4: Implement `map` and `flatMap` for `Parser`. You may want, for simplicity's sake, to implement a method `flatMapWithNext` for `ParseResult` first:
The solution really lies in implementing the corresponding functions for `ParseResult`. We want `flatMapWithNext` to propagate the failure state, but apply the function to its result otherwise:
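A sketch, as a method on `ParseResult[T]`:

```scala
// propagate failures; otherwise apply `f` to the result and
// run the resulting parser on the remaining input
def flatMapWithNext[U](f: T => Input => ParseResult[U]): ParseResult[U] =
  this match {
    case Success(res, next) => f(res)(next)
    case Failure(next)      => Failure(next)
  }
```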
The Parser API then becomes obvious:
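Both methods simply delegate to `ParseResult` (a sketch):

```scala
abstract class Parser[T] extends (Input => ParseResult[T]) { self =>

  def map[U](f: T => U): Parser[U] =
    Parser { in => self(in) map f }

  // a Parser[U] is an Input => ParseResult[U], which is
  // exactly what flatMapWithNext expects
  def flatMap[U](f: T => Parser[U]): Parser[U] =
    Parser { in => self(in) flatMapWithNext f }
}
```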
Sequencing and Alternation
We finally come to the meat of parser combinators, sequencing and alternation:
The sequencing combinator `~` takes two parsers `l` and `r`, and creates a new parser that succeeds if `l` succeeds, and `r` then succeeds on the remaining input. The result is a pair containing the results of both `l` and `r`. There are additional variants of the combinator, `~>` and `<~`, which respectively discard the left and the right part after successfully parsing both of them.
The alternation combinator `|` takes two parsers `l` and `r`, and succeeds if either `l` or `r` succeeds. It first tries `l`, and tries `r` only if `l` fails. This is as per the PEG notation for grammars.
Exercise 5: Given the above specs, implement `~` and its cousins, as well as `|`. Hint: you've already done most of the work before, and can implement sequencing using for-comprehensions only.
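A possible implementation, as methods on `Parser[T]` (a sketch; the sequencing combinators use only for-comprehensions, i.e. `map` and `flatMap`):

```scala
def ~[U](that: => Parser[U]): Parser[(T, U)] = for {
  l <- this
  r <- that
} yield (l, r)

def ~>[U](that: => Parser[U]): Parser[U] =
  this flatMap { _ => that }

def <~[U](that: => Parser[U]): Parser[T] = for {
  l <- this
  _ <- that
} yield l

// try `this` first; only if it fails, try `that` on the same input
def |(that: => Parser[T]): Parser[T] = Parser { in =>
  this(in) match {
    case s @ Success(_, _) => s
    case Failure(_)        => that(in)
  }
}
```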
So there we are: we have built the basics of a parser combinator library. By now, you will have realised that bigger parsers are created from elementary ones through function composition; this is because parsers are represented as functions. This means that if we partially evaluate function composition, we get rid of a lot of overhead!
And so we use our favorite staging framework, LMS, to achieve this. The trick, as usual, is to use the `Rep` type judiciously.
Exercise 6: Given the original API for `Parser` below, add `Rep` types where required.
If you have followed previous blog posts, you will find this rather easy: as we want to partially evaluate function composition, all functions are 'non-Repped', while input and output types are 'Repped':
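A sketch of the staged signature (as before, `Rep[T]` marks a staged expression of type `T`):

```scala
// the parser function itself is a present-stage value: composing
// parsers happens while generating code, and is evaluated away.
// Inputs and results, on the other hand, live in the generated code.
abstract class Parser[T: Typ] extends (Rep[Input] => Rep[ParseResult[T]]) {
  def map[U: Typ](f: Rep[T] => Rep[U]): Parser[U]
  def flatMap[U: Typ](f: Rep[T] => Parser[U]): Parser[U]
}
```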
Once again, the tricky method to type correctly is `flatMap`. And once again, remembering that `Parser` is now a code generator helps ease all concerns. The implementation of the methods themselves does not change much, as we mostly delegate the logic to the underlying types.
Note: for those who have been keeping up to date with the posts, you may have noticed that the `Manifest` typeclass is now replaced by the `Typ` typeclass. This does not fundamentally affect the behaviour of the partial evaluation engine that we use, but if you are interested you can still read all about it here.
CPS encoding ParseResult
To achieve fold-based fusion before, we first had to come up with a late representation of lists, in the form of the fold function. Only after that could we partially evaluate function composition, and get fusion.
For parser combinators, we already start with a function representation for parsers. Yet we can still benefit from our good old trick of CPS-encoding data structures, similarly to what we did with the `Either` type.
Indeed, we still create quite a few intermediate structures for every single parser: we box the `Input` after every parsing attempt, and we create a `ParseResult` structure at the same time.
These can result in considerable performance overhead. At the time of writing, I do not know exactly how considerable, and therefore whether we should really invest effort in deforesting these structures. Moreover, if we generate C code, for instance, we can generate stack-allocated structs, which are considered zero-overhead. In Scala, we could generate value classes.
Nonetheless, it is a good exercise to try it. And so let's just do it, for `ParseResult` at least.
Exercise 7: Can you find a CPS-encoding for `ParseResult`?
Yes you can! A `ParseResult` is either a success or a failure, so it's but a glorified instance of the `Option` type, which itself is a basic version of the `Either` type. Here is a math-ish notation:

```
type ParseResultCPS[T] = forall X. ((T, Input) => X, Input => X) => X
```
Adding `Rep` types as appropriate, we get the following:
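A sketch:

```scala
// a CPS-encoded parse result: instead of building a data structure,
// we take continuations telling us what to do on success and failure
abstract class ParseResultCPS[T: Typ] { self =>
  def apply[X: Typ](
    success: (Rep[T], Rep[Input]) => Rep[X],
    failure: Rep[Input] => Rep[X]
  ): Rep[X]
}

// friendly constructors: each one simply applies
// the corresponding continuation
def Success[T: Typ](t: Rep[T], next: Rep[Input]) = new ParseResultCPS[T] {
  def apply[X: Typ](
    success: (Rep[T], Rep[Input]) => Rep[X],
    failure: Rep[Input] => Rep[X]
  ) = success(t, next)
}

def Failure[T: Typ](next: Rep[Input]) = new ParseResultCPS[T] {
  def apply[X: Typ](
    success: (Rep[T], Rep[Input]) => Rep[X],
    failure: Rep[Input] => Rep[X]
  ) = failure(next)
}
```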
We define `Success` and `Failure` functions for the sake of a friendly API. They respectively apply the `success` and `failure` continuations.
Exercise 8: Implement `map`, `flatMapWithNext`, and `orElse` for `ParseResultCPS`. Let me give you the implementation for the second one:
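A sketch, as a method on `ParseResultCPS[T]` (with the `self` alias from above):

```scala
// on success, run `f` and keep going with the same continuations;
// on failure, propagate it directly
def flatMapWithNext[U: Typ](
  f: (Rep[T], Rep[Input]) => ParseResultCPS[U]
): ParseResultCPS[U] = new ParseResultCPS[U] {
  def apply[X: Typ](
    success: (Rep[U], Rep[Input]) => Rep[X],
    failure: Rep[Input] => Rep[X]
  ) = self.apply(
    (t, in) => f(t, in).apply(success, failure),
    failure
  )
}
```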
Note that we have slightly changed the signature of the function `f`, but Mr. Schoenfinkel tells us that both forms are equivalent.
As a result, the signature of `Parser` now becomes:
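In sketch form:

```scala
// a staged parser now produces a CPS-encoded result
abstract class Parser[T: Typ] extends (Rep[Input] => ParseResultCPS[T])
```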
Of course, we eventually need to materialize our result, i.e. call the `apply` function on an instance of `ParseResultCPS`. This is analogous to our need to eventually materialize a staged `FoldLeft`, either as a list or as another aggregate (for example an `Int` if we were summing elements). We'll define a handy `toOption` function, which returns a `Rep[Option]`:
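A sketch, where `zeroVal` stands for an assumed helper producing a default staged value, and `none`/`some` for staged `Option` constructors:

```scala
def toOption: Rep[Option[T]] = {
  var isEmpty = unit(true)  // did the parse fail?
  var result  = zeroVal[T]  // the result, if the parse succeeded

  // materialize: the continuations just assign into the variables
  self.apply(
    (t, _) => { isEmpty = unit(false); result = t },
    _      => unit(())
  )

  if (readVar(isEmpty)) none[T]() else some(readVar(result))
}
```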
Note that we can use variables by mixing in the `LiftVariables` trait.
We have one final issue to take care of. We encountered it once, a while ago, but it's worth taking a more in-depth look now.
Exercise 9: What does the following conditional expression generate?
```scala
(if (c1) Success(t1, in1) else Success(t2, in2))
  .flatMapWithNext(someF).apply(success, failure)
```
This is the evaluation trail (where `longPipeline` stands for whatever combinators are chained after the `flatMapWithNext`):

```scala
(if (c1) Success(t1, in1) else Success(t2, in2))
  .flatMapWithNext(someF).longPipeline.apply(success, failure)

↪ (if (c1) Success(t1, in1).flatMapWithNext(someF).longPipeline
   else Success(t2, in2).flatMapWithNext(someF).longPipeline).apply(success, failure)

↪ if (c1) Success(t1, in1).flatMapWithNext(someF).longPipeline.apply(success, failure)
  else Success(t2, in2).flatMapWithNext(someF).longPipeline.apply(success, failure)
```
Now if `someF` also introduces a conditional expression, then `longPipeline` gets generated in those branches as well. This leads to code explosion, exponential in the number of conditionals. But does such code occur in real life?
Exercise 10: Can you find an example of a parser which has a shape similar to the conditional expression above?
Here is a possible answer to that question:
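An alternation followed by more parsing, for instance (a sketch; the name is illustrative):

```scala
// `digit | letter` generates a conditional: try `digit`, and only
// if it fails, try `letter`. The pipeline chained after it is then
// generated inside both branches, and so on, exponentially.
def myParser = (digit | letter) ~ digit ~ digit
```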
A perfectly normal-looking parser indeed. The crux of the problem lies in the fact that conditional expressions act as split points. If we partially evaluate function composition naively, code gets generated in both branches. We therefore need to create deliberate join points. In a non-CPS-encoded representation of parse results, the boxing acts as a join point. We can instead introduce temporary variables for the relevant fields of a parse result. These variables are assigned through the use of the `success` and `failure` continuations, and are then passed on further, thereby acting as join points.
For this, we special-case the construction of a `ParseResultCPS` with a conditional expression:
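A sketch (the class name is illustrative):

```scala
// a parse result guarded by a staged condition
class ParseResultCPSCond[T: Typ](
  cond: Rep[Boolean],
  thenp: ParseResultCPS[T],
  elsep: ParseResultCPS[T]
) extends ParseResultCPS[T] { self =>

  // the naive implementation: both branches receive the continuations
  def apply[X: Typ](
    success: (Rep[T], Rep[Input]) => Rep[X],
    failure: Rep[Input] => Rep[X]
  ) = if (cond) thenp.apply(success, failure)
      else elsep.apply(success, failure)
}
```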
The `apply` function is implemented in the naive way, but we need to override all other methods of the API to use temporary variables. I will just show `map` here; I am sure you can figure out the rest of them!
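A sketch, inside `ParseResultCPSCond` (again assuming a `zeroVal` helper for default staged values):

```scala
override def map[U: Typ](f: Rep[T] => Rep[U]) = new ParseResultCPS[U] {
  def apply[X: Typ](
    success: (Rep[U], Rep[Input]) => Rep[X],
    failure: Rep[Input] => Rep[X]
  ) = {
    // temporary variables: the join point for both branches
    var isEmpty = unit(true)
    var result  = zeroVal[T]
    var next    = zeroVal[Input]

    // evaluate the conditional pipeline once, with trivial
    // continuations that only assign into the variables
    self.apply(
      (t, in) => { isEmpty = unit(false); result = t; next = in },
      in      => { next = in }
    )

    // a single continuation call after the join point
    if (readVar(isEmpty)) failure(readVar(next))
    else success(f(readVar(result)), readVar(next))
  }
}
```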
As you can see, we first evaluate `self`, which is the pipeline before the call to `map`, by passing trivial continuations that assign results to variables. The `success` and `failure` continuations are then called on these temporary results.
We have seen in this post what parser combinators are, and how to stage them. We also saw how to deforest parse results, through the use of our good friend CPS-encoding. For conditional expressions, we realised the need to create join points. And we did all this by staying true to our principle of only using library-style coding.
In the coming posts, we will add a few more features to this staged combinator library, so stay tuned!
Note: Much of the work in this post has been previously documented in a paper by my colleagues and me [2]. This post acts as a revival of that work, along with some cleanup and fresh ideas (e.g. CPS-encoding `ParseResult`). If you have been through that paper, you will notice that we use a different trick for alternation there, namely the introduction of functions (Section 3.3). It turns out that all we need is some temporary variables, as seen above, so you can conveniently ignore that section.
The code used in this post can be accessed through the following files:
- `ParseResultCPS` here.
- `Reader` here. As you will see, it uses the struct representation available in LMS. If you dig into the lms.util folder, you can find other CPS encodings, because why not.
- `flatMapTilde`, `orElseAlternation`, `orElseSeq`.

If you want to run this code locally, please follow the setup instructions in the readme here. The `sbt` commands to run the particular tests are:

```
test-only stagedparsec.CharParsersSuite
test-only stagedparsec.ParseResultCPSSuite
test-only stagedparsec.StringReaderSuite
```