Wednesday, December 23, 2009

Banging Heads with ANTLR: Changing Token Text

I'm hacking my way through the ANTLR parser generator documentation and finding that a lot of things have to be discovered by trial and error.

Here's a simple way to change the text of a token. The example tackles a common problem: converting escape sequences in string literals.
tokens {
   BACKSLASH   = '\\';
   DOUBLEQUOTE = '"';
   STRING;      // imaginary token used as the root in the string rule's rewrite
   }
// A complete token, so each setText() replaces only the two-character escape.
ESCAPE:
   BACKSLASH (
        'n' { setText("\n");}
      | 'r' { setText("\r");}
      | 't' { setText("\t");}
      | DOUBLEQUOTE { setText("\"");}
      );

stringQuote:
   q=DOUBLEQUOTE { $q.setText("");};

string:
   s=stringLiteral ->^(STRING[$s.text]);

stringLiteral:
   stringQuote ( ESCAPE | ~( DOUBLEQUOTE | BACKSLASH))* stringQuote;
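
To make the intended result concrete, here is a rough plain-Java sketch of the same conversion done by hand, with no ANTLR involved. The class UnescapeSketch and its unescape method are hypothetical names used only for illustration, and throwing on an unknown escape is an assumption; the grammar would simply fail to match such input.

```java
class UnescapeSketch {
    /** Hypothetical helper: converts a quoted literal the way the grammar's rules do. */
    static String unescape(String literal) {
        // Drop the delimiting quotes, as stringQuote does with setText("").
        String body = literal.substring(1, literal.length() - 1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char next = body.charAt(++i);
                // Same four alternatives as the ESCAPE rule.
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    case '"': out.append('"'); break;
                    default:
                        // Assumption: reject escapes the grammar does not list.
                        throw new IllegalArgumentException("unknown escape: \\" + next);
                }
            } else {
                // Ordinary character (or a trailing lone backslash): copy as-is.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Source text "a\tb\"c" becomes: a, TAB, b, double quote, c.
        System.out.println(unescape("\"a\\tb\\\"c\"").equals("a\tb\"c")); // prints true
    }
}
```

The grammar gets the same effect declaratively: each switch case corresponds to one alternative of ESCAPE.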

The salient point is that setText() replaces the text of the whole token as it is ultimately presented to a parser rule. ESCAPE must therefore be a complete token, not a fragment and not a rule referenced from another token. That in turn forces stringLiteral to be a parser rule rather than a token; otherwise a single setText() call would overwrite the text of the entire string.
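
The whole-token behaviour can be modelled in a few lines of plain Java. This is only a sketch of the idea, not ANTLR's implementation: PendingToken and both methods are hypothetical, and the model assumes that a text override, once set, wins over whatever characters the token matched.

```java
import java.util.ArrayList;
import java.util.List;

class SetTextModel {
    /** Hypothetical stand-in for a token under construction; not an ANTLR class. */
    static final class PendingToken {
        private final StringBuilder matched = new StringBuilder();
        private String override; // stays null until setText() is called

        void match(String chars) { matched.append(chars); }

        // Replaces the text of the WHOLE token, not just the most recent match.
        void setText(String text) { override = text; }

        String emit() { return override != null ? override : matched.toString(); }
    }

    /** Escape buried inside one big string token: setText() clobbers everything. */
    static String escapeInsideLargerToken() {
        PendingToken whole = new PendingToken();
        whole.match("ab");
        whole.match("\\n");   // backslash followed by 'n'
        whole.setText("\n");  // meant to convert only the escape
        whole.match("cd");
        return whole.emit();
    }

    /** Escape as its own token: only the escape's text is replaced. */
    static String escapeAsOwnToken() {
        List<PendingToken> tokens = new ArrayList<>();
        for (String piece : new String[] {"a", "b", "\\n", "c", "d"}) {
            PendingToken t = new PendingToken();
            t.match(piece);
            if (piece.equals("\\n")) {
                t.setText("\n");
            }
            tokens.add(t);
        }
        StringBuilder text = new StringBuilder();
        for (PendingToken t : tokens) {
            text.append(t.emit());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeInsideLargerToken().equals("\n"));  // prints true
        System.out.println(escapeAsOwnToken().equals("ab\ncd"));     // prints true
    }
}
```

In the first method the surrounding "ab" and "cd" are lost, which is the failure that keeping ESCAPE a separate token avoids.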

The stringQuote production also strips the delimiting double quotes from the text of stringLiteral. If stringQuote were a token, my grammar would be ambiguous, because it would match the same character as DOUBLEQUOTE.

The string production tidies up the tree by condensing the glob of token children of stringLiteral into one node. If stringLiteral were a token, the fragments that compose it would combine into one node automatically, but making it a parser rule produces a node with every token as a child. I said this was simple?
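
For what the condensing step amounts to, here is a small plain-Java sketch. Node, perTokenTree, and condensed are hypothetical names, not ANTLR's tree classes, and the sketch assumes that $s.text is the concatenation of the (already rewritten) text of every token stringLiteral matched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CondenseSketch {
    /** Hypothetical tree node: some text plus child nodes. Not ANTLR's CommonTree. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }
    }

    // Token texts for the source "hi\n" in quotes, after the lexer actions ran:
    // both quotes are now empty and the escape is a real newline.
    static final List<String> SAMPLE = Arrays.asList("", "h", "i", "\n", "");

    /** Without the rewrite: one child per token under a placeholder root. */
    static Node perTokenTree(List<String> tokenTexts) {
        Node root = new Node("(tokens)");
        for (String t : tokenTexts) {
            root.children.add(new Node(t));
        }
        return root;
    }

    /** With the rewrite: a single node carrying the joined text. */
    static Node condensed(List<String> tokenTexts) {
        return new Node(String.join("", tokenTexts));
    }

    public static void main(String[] args) {
        System.out.println(perTokenTree(SAMPLE).children.size());   // prints 5
        System.out.println(condensed(SAMPLE).text.equals("hi\n"));  // prints true
    }
}
```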