Prototype of a Scannerless, Generalized Left-to-right Rightmost (SGLR) derivation parser for JavaScript

Need a Scannerless, Generalized Left-to-right Rightmost (SGLR) derivation parser for JavaScript? Then my JSGLR-GWT proof-of-concept might interest you. (If you don’t know what a scannerless GLR parser is, then you are probably not interested in writing modular grammars for your programming language implementation anyway – or you might want to read a few sections down).

A word of warning: I’m not claiming it is written in JavaScript, merely for JavaScript. I simply took our old Java-based SGLR implementation and massaged it quite heavily until GWT could spit out a working version for the JavaScript interpreters.

Performance

The performance is, quite frankly, beyond my wildest expectations. I’ve not made a single attempt at performance tweaking for the JavaScript engines yet, and still I manage to parse through 830 lines of Stratego code in ~3 seconds (Chromium 7.0.519.0). For smaller fragments, even Firefox 3.5 manages to keep up: it manages ~20 lines in ~1s (Chromium does that test in 74ms).

To most people, this probably doesn’t sound very fast at all, but I was much worse off when I initially wrote JSGLR back in 2006 – and that used a term(tree) library that wasn’t hacked together over night. All in all, I think there are plenty of performance tweaking opportunities, and the JavaScript engines are getting faster almost by the minute.

Now, another reason for my optimism, is my use case. I’m interested only interested in JavaScript port of SGLR for interactive use. Think web IDE. My aim is to attain interactive speeds when editing text. A few optimization ideas off the top of my head:

  • Tweak the parser to only reparse the lines that have changed. This alone might be enough to survive with the current level of performance.
  • Reduce the extreme amount of garbage created currently by using object pools.
  • Do the initial parse on the server-side, at least for larger files.

What is a Scannerless GLR parser anyway?

(optional reading)

SGLR does not separate your parser into the usual two steps: (1) lexer/tokenizer/scanner followed by (2) LL- or LR-style parsing of the token stream. Instead, SGLR runs the parser on every character. This is why it’s scannerless.

The next point, GLR, is tricker to explain briefly. Remember that a parser takes a string of text and creates a (parse) tree for it. There might, at some stage during the parsing, be multiple possible trees that might fit the text seen so far. This is called an ambiguity. We don’t yet know what is meant by the text. Only by continue reading, will it (hopefully) become apparent. However, this is where the k comes in. LL(k)/LALR(k) lets you look only k tokens into the future. GLR allows you to look all tokens into the future (for SGLR, I should probably say all characters into the future.).

Typically, LL and LALR-parsers, such as the archaic Yacc and Bison parsers, force you to contort your grammar a lot to meet the k-token lookahead. If you have ever run into the dreaded “shift/reduce” conflict, you know what I’m talking about.

The combination of scannerless and GLR gives us something wonderful: grammars are closed under composition. So, if you write grammars for languages L1 and L2, you can just make a new module L3, include L1 and L2, et voila – you have L3 = L1 ∪ L2.

Trying such a stunt with LALR will almost invariably push you into the “shift/reduce”-conflict corner. Admittedly, there is a significant trade-off or gotcha with the (S)GLR approach. And it is very similar to the trade-offs/gotchas with early vs late binding, or static vs dynamic typing.

If, after processing the entire text, there still are multiple possible parse trees, GLR will give you all of them. So, you catch the ambiguity at runtime. But you can use GLR to say more: the set of grammars you can express with GLR is larger than that of LL(k) and LALR(k).

The trade-off is simple: with LL(k)/LALR(k), you are guaranteed to only have one, unambiguous, tree at the end – but the you might not be able to express the language you want to design. With SGLR, you are free to express the full class of context-free grammars, but the parsing result might be a forest of trees. It’s up to you to prune the forest afterwards, finding the tree you want.

I’m not claiming that SGLR is the best solution for all kinds for parsing needs. With great power comes great responsibility. You might not always need great power, and great responsibilty is usually just a liability, anyway;)

Implementation Details

To interested parties, this is what I did:

  • I ripped out the dependence on the CWI ATerm library and (yet again) wrote my own. Mostly hacked together based from scratch, but with the Textual ATerm Format (TAF) parser stolen from the BasicStrategoTerm implementation.
  • Ripped out all references to java.io and replaced with my own stuff, based on strings.
  • Wrote a tiny RemoteServiceServlet to allow the client side to access the parse tables from server side.
  • Added a tiny hack to also fetch parse tables as TAFs and load these on the client side.
  • Various other minor fixes all over the place.