Writing Parsers with PetitParser

After the announcement in the Moose mailing list and after various people have asked me to provide some introduction to PetitParser I decided to write short tutorial.

Originally I have written PetitParser as part of my work on the Helvetia system. PetitParser is a parsing framework different to many other popular parser generators. For example, it is not table based such as SmaCC or ANTLR. Instead it uses a unique combination of four alternative parser methodologies: scannerless parsers, parser combinators, parsing expression grammars and packrat parsers. As such PetitParser is more powerful in what it can parse and it arguably fits better the dynamic nature of Smalltalk. Let’s have a quick look at these four parser methodologies:

  1. Scannerless Parsers combine what is usually done by two independent tools (scanner and parser) into one. This makes writing a grammar much simpler and avoids common problems when grammars are composed.
  2. Parser Combinators are building blocks for parsers modeled as a graph of composable objects; they are modular and maintainable, and can be changed, recomposed, transformed and reflected upon.
  3. Parsing Expression Grammars (PEGs) provide ordered choice. Unlike in parser combinators, the ordered choice of PEGs always follows the first matching alternative and ignores other alternatives. Valid input always results in exactly one parse-tree, the result of a parse is never ambiguous.
  4. Packrat Parsers give linear parse time guarantees and avoid common problems with left-recursion in PEGs.

Loading PetitParser

Enough theory, let’s get started. PetitParser is developed in Pharo, but is also available on other Smalltalk platforms. A ready made image can be downloaded here. To load PetitParser into an existing image evaluate the following Gofer expression:

Gofer new
renggli: 'petit';
package: 'PetitParser';
package: 'PetitTests';
load.

There are other packages in the same repository that provide additional features, for example PetitSmalltalk is a Smalltalk grammar, PetitXml is an XML grammar, PetitJson is a JSON grammar, PetitAnalyzer provides functionality to analyze and transform grammars, and PetitGui is a Glamour IDE for writing complex grammars. We are not going to use any of these packages for now.

More information on how to get PetitParser can be found on the website of the project.

Writing a Simple Grammar

Writing grammars with PetitParser is simple as writing Smalltalk code. For example to write a grammar that can parse identifiers that start with a letter followed by zero or more letter or digits is defined as follows. In a workspace we evaluate:

identifier := #letter asParser , #word asParser star.

If you inspect the object identifier you’ll notice that it is an instance of a PPSequenceParser. This is because the #, operator created a sequence of a letter and a zero or more word character parser. If you dive further into the object you notice the following simple composition of different parser objects:

PPSequenceParser (this parser accepts a sequence of parsers)
    PPPredicateObjectParser (this parser accepts a single letter)
    PPRepeatingParser (this parser accepts zero or more instances of another parser)
       PPPredicateObjectParser (this parser accepts a single word character)

Parsing Some Input

To actually parse a string (or stream) we can use the method #parse::

identifier parse: 'yeah'.          " --> #($y #($e $a $h)) "
identifier parse: 'f12'. " --> #($f #($1 $2)) "

While it seems odd to get these nested arrays with characters as a return value, this is the default decomposition of the input into a parse tree. We’ll see in a while how that can be customized.

If we try to parse something invalid we get an instance of PPFailure as an answer:

identifier parse: '123'.           " --> letter expected at 0 "

Instances of PPFailure are the only objects in the system that answer with true when you send the message #isPetitFailure. Alternatively you can also use #parse:onError: to throw an exception in case of an error:

identifier
parse: '123'
onError: [ :msg :pos | self error: msg ].

If you are only interested if a given string (or stream) matches or not you can use the following constructs:

identifier matches: 'foo'.         " --> true "
identifier matches: '123'. " --> false "

Furthermore to find all matches in a given input string (or stream) you can use:

identifier matchesIn: 'foo 123 bar12'.

Similarly, to find all the matching ranges in the given input string (or stream) you can use:

identifier matchingRangesIn: 'foo 123 bar12'.

Different Kinds of Parsers

PetitParser provide a large set of ready-made parser that you can compose to consume and transform arbitrarily complex languages. The terminal parsers are the most simple ones. We’ve already seen a few of those:

Terminal Parsers Description
$a asParser Parses the character $a.
'abc' asParser Parses the string 'abc'.
#any asParser Parses any character.
#digit asParser Parses the digits 0..9.
#letter asParser Parses the letters a..z and A..Z.

The class side of PPPredicateObjectParser provides a lot of other factory methods that can be used to build more complex terminal parsers.

The next set of parsers are used to combine other parsers together:

Parser Combinators Description
p1 , p2 Parses p1 followed by p2 (sequence).
p1 / p2 Parses p1, if that doesn’t work parses p2 (ordered choice).
p star Parses zero or more p.
p plus Parses one or more p.
p optional Parses p if possible.
p and Parses p but does not consume its input.
p not Parses p and succeed when p fails, but does not consume its input.
p end Parses p and succeed at the end of the input.

So instead of using the #word predicated we could have written our identifier parser like this:

identifier := #letter asParser , (#letter asParser / #digit asParser) star.

To attach an action or transformation to a parser we can use the following methods:

Action Parsers Description
p ==> aBlock Performs the transformation given in aBlock.
p flatten Creates a string from the result of p.
p token Creates a token from the result of p.
p trim Trims whitespaces before and after p.

To return a string of the parsed identifier, we can modify our parser like this:

identifier := (#letter asParser , (#letter asParser / #digit asParser) star) flatten.

These are the basic elements to build parsers. There are a few more well documented and tested factory methods in the operations protocol of PPParser. If you want browse that protocol.

Writing a More Complicated Grammar

Now we are able to write a more complicated grammar for evaluating simple arithmetic expressions. Within a workspace we start with the grammar for a number (actually an integer):

number :=  #digit asParser plus token trim ==> [ :token | token value asNumber ].

Then we define the productions for addition and multiplication in order of precedence. Note that we instantiate the productions as PPUnresolvedParser upfront, because they recursively refer to each other. The method #def: resolves this recursion using the reflective facilities of the host language:

term := PPUnresolvedParser new.
prod := PPUnresolvedParser new.
prim := PPUnresolvedParser new.

term def: (prod , $+ asParser trim , term ==> [ :nodes | nodes first + nodes last ])
/ prod.
prod def: (prim , $* asParser trim , prod ==> [ :nodes | nodes first * nodes last ])
/ prim.
prim def: ($( asParser trim , term , $) asParser trim ==> [ :nodes | nodes second ])
/ number.

To make sure that our parser consumes all input we wrap it with the end parser into the start production:

start := term end.

That’s it, now we can test our parser and evaluator:

start parse: '1 + 2 * 3'.       " --> 7 "
start parse: '(1 + 2) * 3'. " --> 9 "

As an exercise we could extend the parser to also accept negative numbers and floating point numbers, not only integers. Furthermore it would be useful to add support subtraction and division as well. All these features can be added with a few lines of PetitParser code.

Posted by Lukas Renggli at 25 February 2010, 5:34 pm with tags tutorial, petitparser, smalltalk, pharo link

Comments

How about changing asParser to just plain "parser"? Yes, I know messages in the converting protocol begin with "as," but sending asParser to #digit doesn’t really "convert" anything in the traditional sense. Please consider at least adding a "parser" message to object that just sends "self asParser" if nothing else.

Also, I believe the ==> message is not portable, as ST-80 limits the length of binary messages to two characters. => would work just as well, I think.

Other than that, PetitParser looks really cool!

Posted by anon at 28 February 2010, 2:24 am link

Replacing asParser with parser is probably a good idea. I will look into that.

The draft of the ANSI Smalltalk Standard from December 1997 defines arbitrary length binary selectors, see Section 3.5.5.

Posted by Lukas Renggli at 28 February 2010, 10:28 am link

PetitParser seem relatively similar to OMeta/Ometa2. What are the differnences ?

Posted by Alain Fischer at 16 March 2010, 4:18 pm link
  1. The most visible difference is that PetitParser uses plain Smalltalk to specify grammars. This allows one to arbitrarily use little parsers anywhere in your code to do string processing or other graph transformations.
  2. Another difference is that parsers are first-class objects and not static code. This is an important property if you want to do on-the-fly grammar transformation as I do in Helvetia.
  3. After reading the OMeta papers I was surprised that PetitParser is much more efficient for well defined grammars. I published some preliminary benchmarks here.
Posted by Lukas Renggli at 16 March 2010, 5:04 pm link