PetitParser for Smalltalk, Java and Dart

I am happy to announce initial ports of PetitParser to Java and Dart. The repositories contain complete and usable ports of the original Smalltalk implementation, unit tests and example grammars.

The three implementations are very similar in their API and the supported features. If possible, I tried to adapt common practices and features of the target language. In this blog post I like to point out some differences and design decisions I took.

Production rules

In Smalltalk, the parser that accepts identifiers looks like this:

parser := #letter asParser , (#letter asParser / #digit asParser) star.

In Java the same parser looks like this:

Parser parser = letter().seq(letter().or(digit()).star());

Instead of the custom operators I use nested function calls. It is visually not as readable as the Smalltalk version, but still it is quite to the point.

In Dart the definition of the identifier parser is similar:

var parser = letter().seq(letter().or(digit()).star());

Alternatively, you can use the overloaded operators of Dart to make it more concise and look like the Smalltalk version:

var parser = letter() & (letter() | digit()).star();

Production actions

To attach a production action to a number parser in Smalltalk one writes:

parser := #digit asParser plus flatten ==> [ :input | input asNumber ].

Due to the lack of closures the Java solution gets a bit bulky. Hopefully this issue will finally be fixed with Java 8. The type declarations are quite nice as documentation of input and output values of the production action:

Parser parser = digit().plus().flatten().map(new Function<String, Integer>() {
    @Override
    public Integer apply(String input) {
      return Integer.valueOf(input);
    }
});

Luckily Dart has closures, and furthermore we can skip the type declarations:

var parser = digit().plus().flatten().map((input) => Math.parseInt(input));

Composite grammars

For the definition of complex grammars with possibly recursive rules you subclass CompositeParser in all three implementations:

The details of the Smalltalk implementation are explained in this blog post. Production rules are defined as methods and instance-variables of the same name. The parser constructor initializes all production rules to dummy parsers that are eventually replaced by the actual definitions using reflection. As a real-world example have a look at this JSON grammar.

The Java solution is very similar to the one in Smalltalk. To make things a bit more Java-ish, methods are annotated with @Production to mark them as a production. In the future, I plan to avoid the use of inner classes for production actions by introducing new annotation called @Action. Again there is a JSON grammar to look at.

Due to the (current) lack of reflection in Dart I had to come up with something else. For Dart I introduced a DSL that allows one to define productions, reference productions, redefine productions and assign production actions to existing productions. The implementation is simple and I like it much better than the others. The JSON grammar in Dart is pretty concise.

Performance

Comparing the performance of the implementations is interesting. For this preliminary evaluation I took the XML parser and parsed 2000 times a medium sized XML file. This is the ranking:

  1. Java: 815ms (Java HotSpot 64-Bit 1.7)
  2. Smalltalk: 4069ms (Pharo 1.3 on Cog)
  3. Dart: 6156ms (Dart SDK version 13851, M1)
  4. Dart: 7048ms (Dart SDK version 8638)
  5. Dart: 14332ms (Dart SDK version 8344)

Interestingly Java is doing really well, even if I use auto-boxed characters everywhere and didn’t care to optimize the implementation yet.

The Smalltalk implementation is about 4 times slower. With this implementation I worked the longest and I doubt that it can be optimized further.

Dart lost this evaluation, but I am sure that the code can be further improved. I am suspecting that a problem is that Dart has no notion of characters and needs to represent them as integers or single character strings. Also, in Dart there is no efficient way to figure out if a character is a whitespace, a letter or a digit. To do this correctly I had to fall back to regular expressions, which likely kills an otherwise fast VM. The latest implementation works directly with character codes.

Availability

To sum up, the three implementations are open-source and available in their respective public repositories:

I am happy to take patches or suggestions on how to improve the Java or Dart version.

Posted by Lukas Renggli at 10 June 2012, 6:11 pm with tags petitparser, smalltalk, java, dart link

Comments

So, in Dart, charCodeAt() for Unicode codepoint integer comparison is pretty much what you’ll have to do, and port the lexical queries into this.

Posted by Brian T. Rice at 10 June 2012, 6:59 pm link