Grammatical Actions: further thoughts on cooperative Raku grammars

Raku's grammars aren't always the right tool for the job – but when they are, they're so powerful that they feel almost like cheating. And one of the not-so-secret weapons that gives them that power is the ability to supply an action object that defines parse-time behavior. (Methods on this object are called whenever a token with the same name in the grammar matches, which lets you manipulate the syntax tree as you build it and avoids having to navigate a (potentially very deep!) match tree after parsing is complete.)
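If you haven't used action objects before, here is a minimal sketch of the idea. The Sum grammar and SumActions class are made up purely for illustration and don't appear anywhere else in this post; the point is only that each action method runs when the token of the same name matches, and that make/made pass results up the tree.

```raku
# Hypothetical toy grammar: two numbers joined by a plus sign.
grammar Sum {
    token TOP    { <number> '+' <number> }
    token number { \d+ }
}

class SumActions {
    # Runs every time the `number` token matches.
    method number($/) { make +$/ }
    # Runs when TOP matches; by then each <number> already has a .made value.
    method TOP($/)    { make [+] $<number>».made }
}

say Sum.parse('2+40', :actions(SumActions.new)).made;
# OUTPUT: 42
```

Note that nothing here walks the match tree after the fact: by the time .parse returns, the answer is already sitting in .made.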

Over the past few weeks, there's been a fair bit of discussion about different ways to combine multiple grammars. Most notably, Mike Clark had an excellent blog post titled Multiple Co-operating Grammars in Raku. In that post, Mike showed two ways to combine two grammars – a simple way for times when you're fine with the grammars sharing an action object, and a more complex way to let each grammar have its own actions. And then, building on that post, Matthew Stuckwisch (guifa) opened a Problem Solving issue discussing different ways Raku could one day add syntax that better supports combining grammars (maybe in version 6.f).

Now, both of those posts were very interesting. But, being lazy, I wasn't too sure about Mike's second approach – it seemed like a lot of work, and a lot of bookkeeping. And, being impatient, I also wasn't too keen on waiting for a syntax-level solution in v6.f. After all, we haven't even finalized a release date for v6.e!

So I started thinking to myself, "self, would there be a way to do this with roles somehow?". And, after a bit of playing around, I replied in the affirmative. The rest of this post shows an example of using a Subgrammar role to easily combine multiple grammars, each with its own action object.

Getting set up

To show this in action, we'll need a motivating example. Unlike Mike, I don't have a production need to combine grammars, so let's work up a toy example. Fortunately, Mike's post provided the idea for one:

[combining grammars] might sound like a very strange thing to want to do, but in fact embedded languages crop up with surprising regularity in computing. As an obvious example, imagine you’re trying to parse a web page. Rather than trying to describe all the content in the page in a single grammar, it makes sense to have a grammar for HTML, another for CSS, a third for JS, and so on.

Ok, so let's quickly write the world's worst HTML and CSS parsers and then see how we can combine them with a role.

Css (or a subset thereof)

Writing a grammar for CSS is almost embarrassingly easy – you can pretty much just translate literally from the CSS definition, which is provided as a BNF. But it's also way more verbose than we need for a toy example (there are a lot of CSS properties!), so let's restrict ourselves to a tiny subset.

Specifically, let's implement just enough to let us parse this snippet of CSS, borrowed from a rude-but-not-wrong website:

body {
  margin: 40px auto;
  max-width: 650px;
  line-height: 1.6;
  font-size: 18px;
  color: #444;
  padding: 0 10px;
}

To parse that, we'll say that this snippet consists of one style rule, which consists of the selector "body", and a series of properties inside curly brackets. Each property, in turn, has a key followed by a colon and then a value followed by a semicolon.

Or, in Raku:

grammar Css {
    rule TOP        { <style-rule>+ }
    rule style-rule { <selector>+ '{' ~ '}' <property>+       }
    rule selector   { 'body' | 'p' | 'div' | 'span' | 'h1'    }
    rule property   { $<key>=<-[:]>+ ':' $<value>=<-[;]>+ ';' }
}

I suspect that most of the code is pretty self-explanatory, but pay particular attention to the style-rule line. It uses the ~ regex operator to specify that it's looking for <property>+ (one or more properties) inside { and }. I'm drawing your attention to this ~ because you'll be seeing a lot more of them soon.
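In case the ~ operator is new to you, here's a tiny sketch of it in isolation. The Parens grammar is a made-up example, not part of the CSS parser: the two delimiters are written on either side of the ~, and the thing that goes between them comes afterwards.

```raku
# Hypothetical example grammar, only here to show the ~ operator.
grammar Parens {
    # Reads as: '(' then <digits> then ')'.
    token TOP    { '(' ~ ')' <digits> }
    token digits { \d+ }
}

say ~Parens.parse('(123)')<digits>;
# OUTPUT: 123
```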

Taking our first action

That grammar parses the CSS snippet into a Match, but let's make it a bit easier to work with by using an action object to create a nested Hash. In particular, let's create a hash with keys for each selector and values that are hashes of property key-value pairs.

Here's the code:

class CssActions {
    method TOP($/)        { make %($<style-rule>».made)  }
    method style-rule($/) { make ~$<selector>.join.trim
                                 => %($<property>».made) }
    method property($/)   { make ~$<key> => ~$<value>    }
}

And then here's our grammar and action in action together:

my $match = Css.parse: $css-snippet,
                       :actions(CssActions.new);
say $match.made.&pretty;
# OUTPUT:
# { body => {
#       padding     => "0 10px",
#       max-width   => "650px",
#       margin      => "40px auto",
#       line-height => "1.6",
#       font-size   => "18px",
#       color       => "#444"}}

So, we have a Css grammar capable of parsing a tiny snippet of CSS. What's next?

An even worse HTML grammar

If our CSS parser was a toy, the HTML one is going to be duplo – but it'll give us something to discuss. We'd like to be able to parse some HTML markup like the following:

<html>
    <head>
      <style></style>
    </head>
    <body>
        <p> Hello, <em>world</em>! </p>
        <p> Welcome to a test website 🦋 </p>
    </body>
</html>

The good thing about HTML (well, there are a lot of good things! But the one I'm thinking about right now) is that it's heavily recursive and uses pairs of opening and closing tags. This means that our grammar can basically just be comprised of 'opening-tag' ~ 'closing-tag' <inner-stuff> rules (I told you we'd see ~ again!). Here's what that looks like:

grammar Html {
  proto rule tag   {*}
  rule TOP         { '<html>' ~ '</html>' [<head><body>] }
  rule head        { '<head>' ~ '</head>' <style>*       }
  rule style       { '<style></style>'                   }
  rule body        { '<body>' ~ '</body>' <inner>*       }
  rule tag:sym<p>  { '<p>'    ~ '</p>'    <inner>*       }
  rule tag:sym<em> { '<em>'   ~ '</em>'   <inner>*       }
  rule inner       { <tag>|<text>                        }
  rule text        { <-[<]>+                             }
}

HTML in action

Let's do something a little different with our HTML action: instead of parsing the HTML into structured data the way we did with the CSS, let's directly parse it into a Markdown-inspired plaintext output format. This action object should do the trick:

class HtmlActions {
  method TOP($/)         { make $<body>.made                }
  method head($/)        { }
  method body($/)        { make $<inner>».made.join         }
  method tag:sym<em>($/) { make "**$<inner>».made.join()**" }
  method tag:sym<p>($/)  { make "\n$<inner>».made.join()\n" }
  method inner($/)       { make $/.caps».value
                                        .map({.made || $_})
                                        .join               }
}

And the actions and grammar combined:

say made Html.parse( $html-snippet,
                     :actions(HtmlActions.new)):;
# OUTPUT:
#
# Hello, **world**!
#
# Welcome to a test website 🦋
#

Putting it all together

Ok, we can parse CSS. And we can parse HTML. But can we parse CSS in our HTML?

That is, can we modify our HTML grammar/actions to parse this combination of our previous input?

<html>
    <head>
        <style>
          body {
            margin: 40px auto;
            max-width: 650px;
            line-height: 1.6;
            font-size: 18px;
            color: #444;
            padding: 0 10px;
          }
        </style>
    </head>
    <body>
        <p> Hello, <em>world</em>! </p>
        <p> Welcome to a test website 🦋 </p>
    </body>
</html>

In fact, let's be a bit more specific about what we want to achieve. First, we'll be parsing this into a hash with css and html keys, each of which has our previous output. And we need to do this without modifying the CSS grammar or actions in any way – after all, we'd like an approach that we could use with a real CSS grammar in place of our little toy. Finally, we'd want to make as few changes as possible to our HTML grammar/actions and to keep those changes as tiny as possible. In particular, we want to avoid doing anything that involves tracking state. To achieve all this, we'll be using a role.

To figure out what we need from that role, let's imagine for a minute that we could just use a <TODOcss> token that would automatically use the right grammar/actions for CSS. How would we change our code so far if we had a token like that?

Well, we'd need to make one change to our grammar: our <style> rule would need to use the <TODOcss> token to process the contents of the style tags:

  rule style { '<style>' ~ '</style>' <TODOcss>? }

And we'd also need to make a few changes to our actions: our style method would need to make our CSS, our head method would need to aggregate that CSS, and our TOP method would need to generate a hash with css and html keys.

  method TOP($/)   { make %(html => $<body>.made,
                            |$<head>.made)            }
  method head($/)  { make %(css => %($<style>».made)) }
  method style($/) { make $<TODOcss>.made // Empty    }

Ok, now for the fun part: how do we build something like TODOcss – but non-imaginary?

We start with a role, which we'll call Subgrammar. We'll also give it a method slag (here, "slag" doesn't refer to the waste product from iron smelting; rather, it stands for "Sub LAnguage Grammar").

The slag method has three tasks:

  1. update the current action object
  2. execute the match using the tokens/rules of the correct grammar
  3. restore the original action object.

Each of these is slightly tricky, so let's walk through them.

The hard part of step 1 is that we need to modify the action object associated with an existing instance of our grammar without disturbing our existing state or needing to manage the bookkeeping that comes from .parseing with a separate grammar. (We'll need to do the same thing again, just in reverse, when we get to step 3). Unfortunately, performing this modification requires us to get slightly tricky and discover the undocumented .set_actions method.

Once we've discovered .set_actions and made our peace with using it, we can handle step 1. And, as I mentioned, step 3 is just step 1, but backwards. So that just leaves step 2: execute our match using the correct tokens/rules.

This step, in turn, decomposes into two problems: first, we need to make sure that we have access to all of the tokens/rules that we need; second, we'll need to ensure that we use the correct token/rule (e.g., in the case of any name conflicts). Because tokens and rules are just methods, we can get access to them in exactly the same way that we'd get access to any other method: inherit from the class that defines the method. In our case, that means saying that grammar Html is Css.
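Here's a minimal sketch of that "tokens are just methods" point, using two made-up grammars (Base and Derived are illustrative names, not part of the HTML/CSS code). Derived never defines a number token, yet it can use one because it inherits it from Base.

```raku
# Hypothetical grammars, only to illustrate token inheritance.
grammar Base {
    token number { \d+ }
}

grammar Derived is Base {
    # <number> isn't defined here; it's found on Base via inheritance.
    token TOP { <number> '!' }
}

say ~Derived.parse('42!')<number>;
# OUTPUT: 42
```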

So we've solved the access half, but we've done so in a way that slightly complicates the make-sure-we-use-the-right-token half of our problem. Specifically, because we're giving Html access to Css's methods via inheritance, any methods (i.e., any tokens/rules) defined in both grammars will pose a problem: by default, we'll always end up using the token/rule from Html, even when we need to be using the one from Css. This problem is most obvious with TOP, since both grammars are very likely to have a TOP token. But we need a solution that'll handle any name clashes, not just TOP.
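To make the clash concrete, here's a small sketch with two invented grammars (Inner and Outer are placeholder names). Outer inherits TOP from Inner, but because it also defines its own word token, Inner's TOP ends up calling Outer's word whenever we parse with Outer.

```raku
# Hypothetical grammars, only to illustrate the name-clash problem.
grammar Inner {
    token TOP  { <word> }
    token word { <[a..z]>+ }
}

grammar Outer is Inner {
    token word { <[A..Z]>+ }   # same name as Inner's token
}

say Outer.parse('ABC').so;   # True:  Outer's word is the one that gets called
say Outer.parse('abc').so;   # False: Inner's word is never tried
say Inner.parse('abc').so;   # True:  Inner on its own still works
```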

There are a few ways we could address this issue, but here's the one I've gone with: first, whenever slag is called, we'll check the list of methods that are local to the subgrammar (so, Css's local methods, in our example). These are the methods that we want to be calling whenever we match against a token or a rule.
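As a rough sketch of what "local methods" means here (Inner and Outer are once again invented names): .^methods(:local) only reports methods declared in the grammar itself, not ones it inherits. I'm assuming that filtering on the Regex type is a reasonable way to keep just the tokens/rules and skip anything the compiler generated.

```raku
# Hypothetical grammars, only to illustrate listing local tokens.
grammar Inner {
    token TOP  { <word> }
    token word { \w+ }
}

grammar Outer is Inner {
    token TOP { '[' <word> ']' }
}

# Names of the tokens declared directly in Inner.
say Inner.^methods(:local).grep(Regex)».name.sort;

# Outer declares only TOP itself; `word` is inherited, so it isn't local.
say Outer.^methods(:local).grep(Regex)».name.sort;
```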

Once we have that list, we can ask ourselves what method we'd call based on the normal method resolution order. If the MRO has us calling a method other than the one we want to call, then we have a bit of a problem. We can solve that problem by .wraping the method that the MRO will send us to; all our wrapper needs to do is to divert our call to the correct method.
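Stripped of all the grammar machinery, the wrap-and-divert trick looks roughly like the sketch below. Parent, Child and who are made-up names for illustration; the idea is just that the method the MRO picks gets wrapped so that it forwards to the method we actually want, and that the handle returned by .wrap lets us undo the wrapping afterwards.

```raku
# Hypothetical classes, only to illustrate wrapping and restoring a method.
class Parent          { method who { 'parent' } }
class Child is Parent { method who { 'child'  } }

# The method we actually want to run, even when called on Child.
my &wanted = Parent.^lookup('who');

# Wrap the method the MRO would pick so it diverts to the wanted one.
my $handle = Child.^lookup('who').wrap: method (|c) { wanted(self, |c) };
say Child.who;   # parent

# Undo the wrapping; normal dispatch is back.
$handle.restore;
say Child.who;   # child
```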

And then it's on to step 3: cleaning up after ourselves. We restore the actions object to its initial state and make sure we unwrap each of the wrapped methods. All done!

Despite how many words this takes to explain in English, the Raku version isn't bad at all – only 15 lines of largely self-explanatory code, and we have our role!

role Subgrammar[::G :$grammar, :$actions] {
    multi method slag(G) {
        my $old-actions = self.actions;
        self.set_actions: $actions;

        my @wrapped = G.^methods(:local).map: -> &m {
            with self.^methods.first({.name eq &m.name}) {
                next if $_ === &m or .name eq 'BUILDALL';
                .wrap: method (|c) { m self, |c } }
        }

        LEAVE { .restore for @wrapped;
                self.set_actions: $old-actions }
        self.TOP
    }
}

To use Subgrammar, we change our Html grammar definition to use that role, parameterized with the grammar and action object that we'll be using internally (and to inherit from the Css grammar):

grammar Html
    is Css does Subgrammar[ :grammar(Css),
                            :actions(CssActions.new) ] {...}

and then replace our calls to TODOcss with calls to slag, passing in the grammar that we want to treat as sub-grammar:

# in HtmlActions:
  method style($/) { make $<slag>.made // Empty    }
# in Html
  rule style { '<style>' ~ '</style>' <slag(Css)>? }

Now that we have that Subgrammar role, we can parse the full enchilada, CSS and HTML at the same time:

my $match = Html.parse: $combined-text,
                        :actions(HtmlActions.new);
say $match.made.&pretty;
# OUTPUT:
# { html => "\nHello, **world**! \n\nWelcome to a test website 🦋 \n",
#   css  => {
#       body => {
#           padding     => "0 10px",
#           max-width   => "650px",
#           margin      => "40px auto",
#           line-height => "1.6",
#           font-size   => "18px",
#           color       => "#444"}}}

Bonus round!

We made Subgrammar parametric on the grammar and were careful to handle method-name conflicts all the way up the MRO. This means that there's absolutely nothing stopping our class from doesing (doing?) more than one instance of Subgrammar, parameterized on different grammars. And no matter how many subgrammars we have in play, we know that we'll always use the tokens/rules from whatever grammar we're trying to parse with.

Let's take a look at using multiple subgrammars at once. If we have a Js grammar and JsActions action object, we could change our grammar declaration to

grammar Html
    is Css does Subgrammar[ :grammar(Css),
                            :actions(CssActions.new) ]
    is Js  does Subgrammar[ :grammar(Js),
                            :actions(JsActions.new)  ] {...}

Then we'd just need to add a script rule and use it in our head rule,

  rule head   { '<head>'   ~ '</head>'   [<style>|<script>]* }
  rule script { '<script>' ~ '</script>' <slag(Js)>?         }

and suddenly we're parsing Javascript too!

Conclusion and full code

With a relatively straightforward, ~15 line role, we've been able to let our grammars call each other and have avoided all of the bookkeeping that comes with manually coordinating two separate parses. This seems like something that could come in handy in all sorts of contexts where one language is embedded in another – something that happens quite a bit. I can use multiple grammars without all the work of tracking my position and updating it as I change from one grammar to another. And saving those ~5ish lines of code only took us 3,488 words – laziness successful!

I know that I've blathered on about grammars for a while now – certainly for many more words than I expected to write when I saw a link to Mike's post on r/rakulang. But, despite my verbosity, I'm not claiming any particular expertise with grammars – Mike Clark mentioned getting some help with his code from moritz – who literally wrote the book on Raku grammars. Whereas I haven't finished reading that book yet (but I have it here, right next to my desk☺).

So I'd be very interested in any thoughts any of you might have – especially if anyone has reason to believe that relying on set_actions is more dangerous than it looks. I'd also be interested to hear any thoughts people have about the performance cost/general advisability of wrapping grammar methods as I've done above – I know that wrapping code can sometimes mean sacrificing significant optimization opportunities, but there are a few reasons why that penalty might be less applicable for grammars. More generally, I'd just be interested in any related thoughts you might have: Grammars are a fascinating part of Raku, and one that I haven't explored as deeply (yet), so I'd love to keep this discussion going and all learn from one another.

Here's the full code. I've tried to keep the formatting as blog/phone-friendly as possible, but it's also available as a gist in case that renders better for you.

role Subgrammar[::G :$grammar, :$actions] {
    multi method slag(G) {
        my $old-actions = self.actions;
        self.set_actions: $actions;

        my @wrapped = G.^methods(:local).map: -> &m {
            with self.^methods.first({.name eq &m.name}) {
                next if $_ === &m or .name eq 'BUILDALL';
                .wrap: method (|c) { m self, |c } }
        }

        LEAVE { .restore for @wrapped;
                self.set_actions: $old-actions }
        self.TOP
    }
}

class CssActions {
    method TOP($/)        { make %($<style-rule>».made)  }
    method style-rule($/) { make ~$<selector>.join.trim
                                 => %($<property>».made) }
    method property($/)   { make ~$<key> => ~$<value>    }
}
grammar Css {
    rule TOP        { <style-rule>+  }
    rule style-rule { <selector>+ '{' ~ '}' <property>+ }
    rule selector   { 'body' | 'p' | 'h1'   }
    rule property   { <key> ':' <value> ';' }
    rule key        { <-[:]>+  }
    rule value      { <-[;]>+  }
}

class JsActions {
    method TOP($/) { make 'Javascript parser NYI' }}
grammar Js {
    rule TOP { <-[<]>+ #`[ let's just imagine this one] } #`[ > hlfix ]}

grammar Html
    is Css does Subgrammar[:grammar(Css),
                           :actions(CssActions.new)]
    is Js  does Subgrammar[:grammar(Js),
                           :actions(JsActions.new)] {
    proto rule tag   {*}
    rule head        { '<head>'   ~ '</head>'   [<style>|<script>]* }
    rule script      { '<script>' ~ '</script>' <slag(Js)>?         }
    rule TOP         { '<html>'   ~ '</html>'   [<head> <body>?]    }
    rule style       { '<style>'  ~ '</style>'  <slag(Css)>?        }
    rule body        { '<body>'   ~ '</body>'   <inner>*            }
    rule tag:sym<p>  { '<p>'      ~ '</p>'      <inner>*            }
    rule tag:sym<em> { '<em>'     ~ '</em>'     <inner>*            }
    rule text        { <-[<]>+       } # > hlfix
    rule inner       { <tag>|<text>  }
}
class HtmlActions {
    method TOP($/)         { make %(html => $<body>.made,
                                            |$<head>.made)    }
    method head($/)        { make %(css => %($<style>».made),
                                    js  => ~$<script>».made)  }
    method script($/)      { make $<slag>.made                }
    method style($/)       { make $<slag>.made // Empty       }
    method body($/)        { make $<inner>».made.join         }
    method tag:sym<em>($/) { make "**{$<inner>».made.join}**" }
    method tag:sym<p>($/)  { make "\n{$<inner>».made.join}\n" }

    method inner($/)       { make $/.caps».value
                                          .map: {.made || $_} }
}

my $combined-text
  = q:to/§html/;
  <html>
      <head>
          <style>
            body {
              margin: 40px auto;
              max-width: 650px;
              line-height: 1.6;
              font-size: 18px;
              color: #444;
              padding: 0 10px;
            }
          </style>
          <script>
            console.log('Hello, world!');
          </script>
      </head>
      <body>
          <p> Hello, <em>world</em>! </p>
          <p> Welcome to a test website 🦋 </p>
      </body>
  </html>
  §html

my $match
  = Html.parse: $combined-text,
                :actions(HtmlActions.new);

sub pretty($_, :$nl=False) {
    when Pair { .key, pretty(.value, :nl) }
    when Str  { .raku }
    when Map  -> :@_ = .map(&pretty) {
        my $n = @_»[0]».chars.max+1;
        with [ @_.map({.fmt: "\%-{$n}s", '=> '})\
                ».trim.sort(&[lt]).join(",\n")
                 .indent: $nl ?? 4 !! 2 ] {
            '{'~($nl ?? "\n$_" !! " {.trim}")~ '}' }}}

say $match.made.&pretty;

# OUTPUT:
# { js   => "Javascript parser NYI",
#   html => "\nHello, **world**! \n\nWelcome to a test website 🦋 \n",
#   css  => {
#       body => {
#           padding     => "0 10px",
#           max-width   => "650px",
#           margin      => "40px auto",
#           line-height => "1.6",
#           font-size   => "18px",
#           color       => "#444"}}}