Meta talk about Smalltalk, Seaside, Magritte, Pier and related things.

PetitParser for Smalltalk, Java and Dart

I am happy to announce initial ports of PetitParser to Java and Dart. The repositories contain complete and usable ports of the original Smalltalk implementation, unit tests and example grammars.

The three implementations are very similar in their API and the supported features. If possible, I tried to adapt common practices and features of the target language. In this blog post I like to point out some differences and design decisions I took.

Production rules

In Smalltalk, the parser that accepts identifiers looks like this:

parser := #letter asParser , (#letter asParser / #digit asParser) star.

In Java the same parser looks like this:

Parser parser = letter().seq(letter().or(digit()).star());

Instead of the custom operators I use nested function calls. It is visually not as readable as the Smalltalk version, but still it is quite to the point.

In Dart the definition of the identifier parser is similar:

var parser = letter().seq(letter().or(digit()).star());

Alternatively, you can use the overloaded operators of Dart to make it more concise and look like the Smalltalk version:

var parser = letter() & (letter() | digit()).star();

Production actions

To attach a production action to a number parser in Smalltalk one writes:

parser := #digit asParser plus flatten ==> [ :input | input asNumber ].

Due to the lack of closures the Java solution gets a bit bulky. Hopefully this issue will finally be fixed with Java 8. The type declarations are quite nice as documentation of input and output values of the production action:

Parser parser = digit().plus().flatten().map(new Function<String, Integer>() {
    @Override
    public Integer apply(String input) {
      return Integer.valueOf(input);
    }
});

Luckily Dart has closures, and furthermore we can skip the type declarations:

var parser = digit().plus().flatten().map((input) => Math.parseInt(input));

Composite grammars

For the definition of complex grammars with possibly recursive rules you subclass CompositeParser in all three implementations:

The details of the Smalltalk implementation are explained in this blog post. Production rules are defined as methods and instance-variables of the same name. The parser constructor initializes all production rules to dummy parsers that are eventually replaced by the actual definitions using reflection. As a real-world example have a look at this JSON grammar.

The Java solution is very similar to the one in Smalltalk. To make things a bit more Java-ish, methods are annotated with @Production to mark them as a production. In the future, I plan to avoid the use of inner classes for production actions by introducing new annotation called @Action. Again there is a JSON grammar to look at.

Due to the (current) lack of reflection in Dart I had to come up with something else. For Dart I introduced a DSL that allows one to define productions, reference productions, redefine productions and assign production actions to existing productions. The implementation is simple and I like it much better than the others. The JSON grammar in Dart is pretty concise.

Performance

Comparing the performance of the implementations is interesting. For this preliminary evaluation I took the XML parser and parsed 2000 times a medium sized XML file. This is the ranking:

Java: 815ms (Java HotSpot 64-Bit 1.7)
Smalltalk: 4069ms (Pharo 1.3 on Cog)
Dart: 6156ms (Dart SDK version 13851, M1)
~~Dart: 7048ms (Dart SDK version 8638)~~
~~Dart: 14332ms (Dart SDK version 8344)~~

Interestingly Java is doing really well, even if I use auto-boxed characters everywhere and didn’t care to optimize the implementation yet.

The Smalltalk implementation is about 4 times slower. With this implementation I worked the longest and I doubt that it can be optimized further.

Dart lost this evaluation, but I am sure that the code can be further improved. I am suspecting that a problem is that Dart has no notion of characters and needs to represent them as integers or single character strings. Also, in Dart there is no efficient way to figure out if a character is a whitespace, a letter or a digit. ~~To do this correctly I had to fall back to regular expressions, which likely kills an otherwise fast VM.~~ The latest implementation works directly with character codes.

Availability

To sum up, the three implementations are open-source and available in their respective public repositories:

I am happy to take patches or suggestions on how to improve the Java or Dart version.

Posted by Lukas Renggli at 10 June 2012, 6:11 pm with tags petitparser, smalltalk, java, dart 1 comment link

Composite Grammars with PetitParser

In a previous post I described the basic principles of PetitParser and gave some introductory examples. In this blog post I am going to present a way to define more complicated grammars. We continue where we left off the last time, with the expression grammar.

Writing parsers as a script as we did in the previous post can be cumbersome, especially if grammar productions that are mutually recursive and refer to each other in complicated ways. Furthermore a grammar specified in a single script makes it unnecessary hard to reuse specific parts of that grammar. Luckily there is PPCompositeParser to the rescue.

Defining the Grammar

As an example let’s create a composite parser using the same expression grammar we built in the last blog post:

PPCompositeParser subclass: #ExpressionGrammar
   instanceVariableNames: ''
   classVariableNames: ''
   poolDictionaries: ''
   category: 'PetitTutorial'

Again we start with the grammar for an integer number. Define the method number in ExpressionGrammar as follows:

ExpressionGrammar>>number
   ^ #digit asParser plus token trim ==> [ :token | token value asNumber ]

Every production in ExpressionGrammar is specified as a method that returns its parser. Productions refer to each other by reading the respective instance variable of the same name. This is important to be able to create recursive grammars. The instance variables themselves are typically not written to as PetitParser takes care to initialize them for you automatically.

Next we define the productions term, prod, and prim. Contrary to our previous implementation we do not define the production actions yet; and we factor out the parts for addition (add), multiplication (mul), and parenthesis (parens) into separate productions. This will give us better reusability later on. We let Pharo automatically add the necessary instance variables as we refer to them for the first time.

ExpressionGrammar>>term
   ^ add / prod

ExpressionGrammar>>add
   ^ prod , $+ asParser trim , term

ExpressionGrammar>>prod
   ^ mul / prim

ExpressionGrammar>>mul
   ^ prim , $* asParser trim , prod

ExpressionGrammar>>prim
   ^ parens / number

ExpressionGrammar>>parens
   ^ $( asParser trim , term , $) asParser trim

Last but not least we define the starting point of the expression grammar. This is done by overriding start in the ExpressionGrammar class:

ExpressionGrammar>>start
   ^ term end

Instantiating the ExpressionGrammar gives us an expression parser that returns a default abstract-syntax tree:

parser := ExpressionGrammar new.
parser parse: '1 + 2 * 3'.       " --> #(1 $+ #(2 $* 3)) "
parser parse: '(1 + 2) * 3'.     " --> #(#($( #(1 $+ 2) $)) $* 3) "

Defining the Evaluator

Now that we have defined a grammar we can reuse this definition to implement an evaluator. To do this we create a subclass of ExpressionGrammar called ExpressionEvaluator

ExpressionGrammar subclass: #ExpressionEvaluator
   instanceVariableNames: ''
   classVariableNames: ''
   poolDictionaries: ''
   category: 'PetitTutorial'

and we redefine the implementation of add, mul and parens with our evaluation semantics:

ExpressionEvaluator>>add
   ^ super add ==> [ :nodes | nodes first + nodes last ]

ExpressionEvaluator>>mul
   ^ super mul ==> [ :nodes | nodes first * nodes last ]

ExpressionEvaluator>>parens
   ^ super parens ==> [ :nodes | nodes second ]

The evaluator is now ready to be tested:

parser := ExpressionEvaluator new.
parser parse: '1 + 2 * 3'.       " --> 7 "
parser parse: '(1 + 2) * 3'.     " --> 9 "

Similarly — as an exercise — a pretty printer can be defined by subclassing ExpressionGrammar and by redefining a few of its productions:

parser := ExpressionPrinter new.
parser parse: '1+2 *3'.          " --> '1 + 2 * 3' "
parser parse: '(1+ 2 )* 3'.      " --> '(1 + 2) * 3' "

Posted by Lukas Renggli at 27 November 2010, 3:14 pm with tags tutorial, petitparser, smalltalk, pharo comment link

Writing Parsers with PetitParser

After the announcement in the Moose mailing list and after various people have asked me to provide some introduction to PetitParser I decided to write short tutorial.

Originally I have written PetitParser as part of my work on the Helvetia system. PetitParser is a parsing framework different to many other popular parser generators. For example, it is not table based such as SmaCC or ANTLR. Instead it uses a unique combination of four alternative parser methodologies: scannerless parsers, parser combinators, parsing expression grammars and packrat parsers. As such PetitParser is more powerful in what it can parse and it arguably fits better the dynamic nature of Smalltalk. Let’s have a quick look at these four parser methodologies:

Scannerless Parsers combine what is usually done by two independent tools (scanner and parser) into one. This makes writing a grammar much simpler and avoids common problems when grammars are composed.
Parser Combinators are building blocks for parsers modeled as a graph of composable objects; they are modular and maintainable, and can be changed, recomposed, transformed and reflected upon.
Parsing Expression Grammars (PEGs) provide ordered choice. Unlike in parser combinators, the ordered choice of PEGs always follows the first matching alternative and ignores other alternatives. Valid input always results in exactly one parse-tree, the result of a parse is never ambiguous.
Packrat Parsers give linear parse time guarantees and avoid common problems with left-recursion in PEGs.

Loading PetitParser

Enough theory, let’s get started. PetitParser is developed in Pharo, but is also available on other Smalltalk platforms. A ready made image can be downloaded here. To load PetitParser into an existing image evaluate the following Gofer expression:

Gofer new
   renggli: 'petit'; 
   package: 'PetitParser';
   package: 'PetitTests';
   load.

There are other packages in the same repository that provide additional features, for example PetitSmalltalk is a Smalltalk grammar, PetitXml is an XML grammar, PetitJson is a JSON grammar, PetitAnalyzer provides functionality to analyze and transform grammars, and PetitGui is a Glamour IDE for writing complex grammars. We are not going to use any of these packages for now.

More information on how to get PetitParser can be found on the website of the project.

Writing a Simple Grammar

Writing grammars with PetitParser is simple as writing Smalltalk code. For example to write a grammar that can parse identifiers that start with a letter followed by zero or more letter or digits is defined as follows. In a workspace we evaluate:

identifier := #letter asParser , #word asParser star.

If you inspect the object identifier you’ll notice that it is an instance of a PPSequenceParser. This is because the #, operator created a sequence of a letter and a zero or more word character parser. If you dive further into the object you notice the following simple composition of different parser objects:

PPSequenceParser (this parser accepts a sequence of parsers)
    PPPredicateObjectParser (this parser accepts a single letter)
    PPRepeatingParser (this parser accepts zero or more instances of another parser)
       PPPredicateObjectParser (this parser accepts a single word character)

Parsing Some Input

To actually parse a string (or stream) we can use the method #parse::

identifier parse: 'yeah'.          " --> #($y #($e $a $h)) "
identifier parse: 'f12'.           " --> #($f #($1 $2)) "

While it seems odd to get these nested arrays with characters as a return value, this is the default decomposition of the input into a parse tree. We’ll see in a while how that can be customized.

If we try to parse something invalid we get an instance of PPFailure as an answer:

identifier parse: '123'.           " --> letter expected at 0 "

Instances of PPFailure are the only objects in the system that answer with true when you send the message #isPetitFailure. Alternatively you can also use #parse:onError: to throw an exception in case of an error:

identifier
   parse: '123'
   onError: [ :msg :pos | self error: msg ].

If you are only interested if a given string (or stream) matches or not you can use the following constructs:

identifier matches: 'foo'.         " --> true "
identifier matches: '123'.         " --> false "

Furthermore to find all matches in a given input string (or stream) you can use:

identifier matchesIn: 'foo 123 bar12'.

Similarly, to find all the matching ranges in the given input string (or stream) you can use:

identifier matchingRangesIn: 'foo 123 bar12'.

Different Kinds of Parsers

PetitParser provide a large set of ready-made parser that you can compose to consume and transform arbitrarily complex languages. The terminal parsers are the most simple ones. We’ve already seen a few of those:

Terminal Parsers	Description
`$a asParser`	Parses the character `$a`.
`'abc' asParser`	Parses the string `'abc'`.
`#any asParser`	Parses any character.
`#digit asParser`	Parses the digits 0..9.
`#letter asParser`	Parses the letters a..z and A..Z.

The class side of PPPredicateObjectParser provides a lot of other factory methods that can be used to build more complex terminal parsers.

The next set of parsers are used to combine other parsers together:

Parser Combinators	Description
`p1 , p2`	Parses `p1` followed by `p2` (sequence).
`p1 / p2`	Parses `p1`, if that doesn’t work parses `p2` (ordered choice).
`p star`	Parses zero or more `p`.
`p plus`	Parses one or more `p`.
`p optional`	Parses `p` if possible.
`p and`	Parses `p` but does not consume its input.
`p not`	Parses `p` and succeed when `p` fails, but does not consume its input.
`p end`	Parses `p` and succeed at the end of the input.

So instead of using the #word predicated we could have written our identifier parser like this:

identifier := #letter asParser , (#letter asParser / #digit asParser) star.

To attach an action or transformation to a parser we can use the following methods:

Action Parsers	Description
`p ==> aBlock`	Performs the transformation given in `aBlock`.
`p flatten`	Creates a string from the result of `p`.
`p token`	Creates a token from the result of `p`.
`p trim`	Trims whitespaces before and after `p`.

To return a string of the parsed identifier, we can modify our parser like this:

identifier := (#letter asParser , (#letter asParser / #digit asParser) star) flatten.

These are the basic elements to build parsers. There are a few more well documented and tested factory methods in the operations protocol of PPParser. If you want browse that protocol.

Writing a More Complicated Grammar

Now we are able to write a more complicated grammar for evaluating simple arithmetic expressions. Within a workspace we start with the grammar for a number (actually an integer):

number :=  #digit asParser plus token trim ==> [ :token | token value asNumber ].

Then we define the productions for addition and multiplication in order of precedence. Note that we instantiate the productions as PPUnresolvedParser upfront, because they recursively refer to each other. The method #def: resolves this recursion using the reflective facilities of the host language:

term := PPUnresolvedParser new.
prod := PPUnresolvedParser new.
prim := PPUnresolvedParser new.
 
term def: (prod , $+ asParser trim , term ==> [ :nodes | nodes first + nodes last ])
   / prod.
prod def: (prim , $* asParser trim , prod ==> [ :nodes | nodes first * nodes last ])
   / prim.
prim def: ($( asParser trim , term , $) asParser trim ==> [ :nodes | nodes second ])
   / number.

To make sure that our parser consumes all input we wrap it with the end parser into the start production:

start := term end.

That’s it, now we can test our parser and evaluator:

start parse: '1 + 2 * 3'.       " --> 7 "
start parse: '(1 + 2) * 3'.     " --> 9 "

As an exercise we could extend the parser to also accept negative numbers and floating point numbers, not only integers. Furthermore it would be useful to add support subtraction and division as well. All these features can be added with a few lines of PetitParser code.

Posted by Lukas Renggli at 25 February 2010, 5:34 pm with tags tutorial, petitparser, smalltalk, pharo 4 comments link

Lukas Renggli

Talking Meta

Archive

Tags