Diving deep on Raku regexes, and coming back with a better way for grammars to cooperate

After some excellent feedback on my previous post on coordinating multiple grammars in Raku, I realized that I'd had fundamentally the wrong mental model of regexes in Raku(do). This flawed model hadn't stopped me from using regexes, but was nevertheless wrong – and it prevented me from correctly grasping some of the more complex behaviors involved in grammars. Now that I've corrected that misconception, everything makes so much more sense!

In this post, I'm going to briefly present the flawed view I had (hopefully in enough of an outline that you'll understand where I was coming from but not be tempted into the same confusion) and then present the correct (or at least, less wrong) mental model. Exploring this model will lead us to a significantly deeper understanding of how Raku's regexes work under the hood (or, at least, deeper than I had a week ago; YMMV, as they say). Next, I'll explain how this new understanding allowed me to build a trait – one that I believe will make composing multiple grammars much easier. And, finally, I'll quickly walk you through the ~100 lines of code involved in implementing that trait.

Let's learn a bit of Raku together   »ö«

Getting the right mental model for Raku regexes

It's certainly possible to use Raku's regexes for years without any understanding of how they work under the hood – they're a good abstraction and, like all good abstractions, much of their power comes from the fact that you don't need to understand their implementation. That said, I've always found that having a good general model of the details beneath the abstraction is tremendously helpful when the going gets tough, and I think Raku's regexes are a great example of this principle in action.

Regex.isa(Method). Ok, so what does that mean?

I've known that Regexes are Methods for a while; the docs are pretty clear on that point, and I've seen it mentioned in multiple posts. (For one thing, the first line in the Regex type documentation is literally class Regex is Method { }!). But, even though I knew it, I hadn't considered – much less deeply internalized – the implications.

To be concrete: what does the fact that a regex is a method mean in practice? Can we call a regex as a subroutine? If so, how? And it's a method, but a method on what object? What is its signature? What does it return? Like most rakoons, I always interact with regexes via syntax like 'foo' ~~ /<word>/, not by treating the regex as a Routine. And, just to be clear, that's absolutely the way we should use regexes virtually all of the time – but, in order to more deeply understand the consequences of regexes being routines, let's investigate how, hypothetically, we would interact with a regex in a more Routine way.

A bad and wrong mental model

Without having deeply thought about it, I'd imagined regexes as more-or-less a classic class, in the Object Oriented Programming sense. That is, I was picturing a Regex as a class that created new objects for any particular search and where each object stored and updated internal state based on whatever text it was searching.

To make this model more concrete, I was imagining that the line my regex Rx { foo \s bar} created something vaguely along the lines of pseudocode below (though, sadly, my not-fully-thought-out mental models don't come with code samples, and so I hadn't thought it out in nearly this much detail). Again, note that this is totally wrong:

# Broken, do not try
my class Rx is Regex {
    my $rx-body = ' foo \s bar ';

    has     @!input-chars;
    has Int $!pos = 0;

    method CALL-ME(Rx:U: Str() $input --> Match) {
       my $rx = self.bless: :input-chars($input.comb);
       $rx!match
    }

    method !match {
    method !match {
        while 0 ≤ $!pos ≤ @!input-chars.elems {
            my $c := @!input-chars[$!pos];
            if MATCHER($c, $rx-body) === PartMatch { $!pos++ }
            #  ^^^ somehow handles backtracking?
            elsif MATCHER($c, $rx-body) === FullMatch {
                return Match(FullMatch) }
            else { $!pos-- }
        }
        if $!pos < 0 { return Match('#<failed match>') }
    }
}

# Hypothetical usage:
Rx('foo bar');

And extending this OOP-based mental model to grammars felt pretty simple: A Grammar just needs to store multiple %rx-bodies and be able to translate between them and the token/rule names described in that Grammar's declaration (oh, and I guess a bit of logic for setting and calling action objects).

As we'll see, this model turned out to be comprehensively wrong. But, as I hope you can see from the outline above, it was plausibly wrong. And, in fact, it was plausible enough that I've been able to use regexes and grammars for quite a while without realizing just how wrong it was.

The unsound foundation tumbles

That flawed mental model, however plausible, couldn't stand up to the evidence that came up in the discussion following my previous post. In particular, it couldn't stand up to something I learned from Matthew Stuckwisch (guifa on the #raku IRC channel, alatennaub on r/rakulang and GitHub; I'll go with "guifa" here).

Guifa's comment related to the post that kicked off this whole discussion. In that post, Mike Clark presented a grammar that parses a main language, and then uses a second grammar to parse a lisp-like language nested inside the main language.

The code below shows a simplified version with Mike's comments removed; see the original post if you'd like more details.

grammar InnerLang {
    rule TOP { \s+ | ['(' ~ ')' .* ] }
}
grammar MainLang  {
    rule TOP       { [<.text-like> <lisp-like>*]* }
    rule text-like { [ <.alpha>+ ]+}

    rule lisp-like {
        :my $inner;
        <?{ $inner
              = InnerLang.subparse: $/.orig, :pos($/.to) }>
        .**{$inner.to - $/.pos}
    #   ^^^^^^^^^^^^^^^^^^^^^^^ I really dislike this part
    }
}

As my comment above noted, I really don't like the idea of needing to manage state like that inside our outer grammar – it's not too bad in a simple case but, as the grammars get bigger, it strikes me as a lot of error-prone, fiddly work. And, even if done correctly, I'd still have concurrency-related concerns – having two parts of our code with different views of how much input we've parsed seems to invite the sort of bugs that are a nightmare to debug. (Maybe that's being paranoid, but I've been burned before.)

But guifa's reply pointed something out about the code above that really surprised me and which, once I'd thought through the implications, demolished the mental model I described above. Guifa's comment didn't put it exactly like this, but the main takeaway is that we could rewrite the lisp-like rule above into:

    method lisp-like {
        InnerLang.subparse: $.orig, :pos($.to)
    }

That is, if we change rule lisp-like to method lisp-like and our $/. calls to $., then we can cut our code in half and remove all of the bookkeeping that bothered me in the previous version. When I realized this, I immediately had two reactions: "This is fantastic!" and "Wait, but how‽"

In case you're not having the same "but how?" reaction, here's the version of the question that puzzled me most: "If we're updating the current match position in MainLang, where does that updated state live, given that lisp-like never assigns to anything?"

After banging my head against that question for a while, I realized that the answer was right there in the first sentence: "If we're updating the current match position in MainLang…" – well, it turns out that we are not updating any MainLang state at all; in fact, MainLang is very nearly stateless from our point of view.

A paradigm shift

We're not updating the position state in MainLang because – contrary to my assumption and despite the presence of the $.pos method – MainLang doesn't store any (mutable) state. And that, in turn, is because the OOP-based stateful mental model I presented above is wrong. And not just for grammars – it's wrong for regexes too.

The first clue to this is that we can't actually call regexes, tokens, or rules with the Rx('some text') calling syntax I imagined above. If we try, we'll get the following error:

my regex Rx { . }
say Rx('some text');
# OUTPUT: «No such method '!cursor_start' for invocant of type 'Str'»

This error message isn't an example of Raku at its clearest; if Raku were up to its usual standards of Awesome Error Messages, it would have read

Type check failed in binding to parameter 'topic'; expected Match but got Str ("some text")

Ok, so Rx('text') doesn't work because it doesn't typecheck; we need to provide a Match. But why? If I already have a match, why would I need to pass it to a regex? Don't regexes search strings and return matches?

No, as it turns out, regexes (or, rather, Regexes) don't search strings – because to do so, they'd have to track and mutate state along the lines I was imagining. Instead, the best way to think of a Regex is as a stateless function with the signature method (Match:D --> Match:D): Regexes take a Match and return a Match, and it's the Match's job to contain data about the existing state.

This means that the actual way to call regexes with Routine syntax is the following:

my rule word { <alpha>+ }
say &word.WHAT;            # OUTPUT: «(Regex)»
#   ^ & sigil required b/c it does Callable
try word('a');
#        ^^^ Regexes don't take Str arguments
say $!.^name;              # OUTPUT: «X::Method::NotFound»

my $match = Match.new: :orig('Raku is -Ofun');
# call with ^^^^^^ a Match:D with the Str in :orig
say word($match);          # OUTPUT: «「Raku」»

# We can build a non-zero match using :to and :from
my $m2 = Match.new: :orig('Raku is -Ofun'), :from(0), :to(8);
say $m2;                   # OUTPUT: «「Raku is 」»
# And we can use that Match normally:
say $m2.&(/'-' \w**4/);    # OUTPUT: «「-Ofun」»
#         ^^^^^^^^^^^ regex-literal syntax also works

# A Regex is a Method and a Routine
say &word.^mro[1..4];      # OUTPUT: «((Method) (Routine) (Block) (Code))»
say $match.&word;          # OUTPUT: «「Raku」»
# so using ^^^^^^ method syntax might be more fitting

# A regex also returns a Match:
my $res = word $match;
say $res.WHAT;             # OUTPUT: «(Match)»

# But *not* the same Match it got:
say $match.WHICH;          # OUTPUT: «Match|94080907590240»
say $res.WHICH;            # OUTPUT: «Match|94080907590384»
say $match, $res;          # OUTPUT: «(「」 「Raku」)»
#   ^^^^^^ the Match we started with is unchanged

# The returned Match records where we are in the input string:
say $res.raku; #`[ OUTPUT: «Match.new( :orig("Raku is -Ofun"),
                                       :from(0), :pos(5) )» ]
# Which lets us use it as input for a new match:
say my $r2 = $res.&word; # OUTPUT: «「is 」»
say $r2.raku;  #`[ OUTPUT: «Match.new( :orig("Raku is -Ofun"),
                                       :from(5), :pos(8) )» ]

One point worth emphasizing from the code above: not only are regexes (pure) functions from Match --> Match, they also return a different Match than they were given. That is, regexes interact with Matches as though the latter were immutable data containers.

The new paradigm solves old problems

Now that we're on a firmer foundation with regexes, let's return to grammars and the question that so puzzled me before: why did changing our lisp-like rule into a method remove the need to perform bookkeeping tasks?

Let's answer that question at a slightly higher level of generality: what method would a rule declaration desugar to? That is, we know that rules (like all Regexes) really are methods under the hood. This means that if Raku hadn't given us the rule declarator, we could have written our rules as methods with a bit more work.

And indeed, writing rule methods is still very possible. To replace a rule, we'll just write a method that behaves like a Regex; this means that it will need to have our by-now-familiar Match --> Match signature. This method only needs to do the following three tasks:

  1. Declare a ratcheting, whitespace-significant Regex
  2. Call that regex with the grammar as an argument
  3. (If an action object has been set) call the action method with the same name as the rule

Or, to put that in code, we can replace this code

grammar G {
    token TOP { <word>   }
    rule word { <alpha>+ }
}

with this code

grammar G {
    token TOP { <word>   }
    method word(--> Match:D) {
        my Match $new := regex {:r:s <alpha>+ }(self);
        $.actions.?word($new) if $new;
        $new
    #   ^^^^ NOTE: returns $new, **not** self. Here, $new is
    # a Grammar (which isa Match), but could be any Match:D
    }
}

Do you see why we return $new rather than self? We pass self in to the regex, which treats it as immutable. So nothing about self is ever mutated or updated – returning it would be a no-op, so of course we return the newly created Match instead.

This means that the rule lisp-like declared above desugars into something like:

method lisp-like {
    my Match $new := regex {:r:s
        :my $inner;
        <?{ $inner
            = InnerLang.subparse: $/.orig, :pos($/.to) }>
        .**{$inner.to - $/.pos}
    }(self);
    $.actions.?lisp-like($new) if $new;
    $new
}

whereas the method lisp-like remains as it was:

    method lisp-like {
        InnerLang.subparse: $.orig, :pos($.to)
    }

Once we're looking at this desugared form, the answer to that previously-confounding question is extremely clear: the method version never updates any state on self at all – it simply receives the current Match (complete with $.orig and $.to) as its invocant and returns a new Match, so all of the bookkeeping disappears.

I'm not sure about you, but when I shifted from the first (incorrect) mental model to the second, I had the wonderful feeling that a whole bunch of formerly confusing things suddenly made sense. And since that shift was so helpful (at least to my understanding), I'm going to reiterate it here, in a special yellow box:

    A Regex is a stateless Method with (in effect) the signature
    (Match:D --> Match:D): it takes an immutable Match that records
    the input string and the current position, and returns a new
    Match that records the result – it never searches a string or
    mutates any state.

Making the mental model pay rent

That mental model certainly feels like it fits with the behavior I've previously observed from Raku's grammars and regexes. But the real test of any model is whether it can help us have more accurate expectations for the future. So let's try this model out by considering how we could improve the MainLang grammar we saw above. Here's where we left that code:

grammar InnerLang {
    rule TOP { \s+ | ['(' ~ ')' .* ] }
}
grammar MainLang  {
    rule TOP { [<.text-like> <lisp-like>*]* }
    rule text-like { [<.alpha>+ ]+ }

    method lisp-like {
        InnerLang.subparse: $.orig, :pos($.to)
    }
}

say MainLang.parse($input);

That's pretty good – certainly admirably concise. But it has a pretty big omission: it doesn't make any use of action objects. That's partly for a simple reason: I left the action objects out to keep the code focused. In Mike Clark's original post, both the MainLang.parse and InnerLang.parse calls had action objects specified. But I don't view that as a particularly satisfying solution.

In particular, it's passing the action object to InnerLang.subparse that bothers me. In the reddit discussion of my previous post, P6steve raised an important point: that using multiple action objects with one grammar is a big source of grammars' power. For example, in the language-parsing use case above, we might want to check the syntax without actually executing the code – and passing in a different action object would let us do that without needing to make any changes to the grammar's source code (which, after all, could be in a different module and/or maintained by someone else).
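To make that point concrete, here's a toy example (the grammar and action-class names are hypothetical, not from the original discussion) of one grammar driven by two different action objects:

```raku
# One grammar, two action objects, two different results
grammar Digits {
    token TOP { <num>+ %% \s+ }
    token num { \d+ }
}
class SumActions {
    method TOP($/) { make [+] $<num>».made }  # sum the numbers
    method num($/) { make +$/ }
}
class CountActions {
    method TOP($/) { make $<num>.elems }      # just count them
    method num($/) { }
}
say Digits.parse('1 22 333', :actions(SumActions)).made;   # OUTPUT: «356␤»
say Digits.parse('1 22 333', :actions(CountActions)).made; # OUTPUT: «3␤»
```

Same grammar source, completely different behavior – which is exactly the power we'd like to preserve for the nested grammar.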

However, our current design sacrifices a huge chunk of the power we normally get from action objects: putting a specific action object in the InnerLang.parse call in the lisp-like method has effectively hard-coded that object, at least from the point of view of a MainLang caller. To return to the syntax-checking case, a MainLang caller could pass in a CheckMainLangSyntax action object, and get the desired syntax-checking behavior for the main language. But as soon as the grammar got to the inner language, it'd be right back to using the ExecuteInnerLang (or whatever) action object listed in the lisp-like method. And there'd be no way for the MainLang caller to fix that situation without cracking open the MainLang source code. Let's fix that.

So, how do we change our API? The most obvious (but still not great) option is to pass an action object in via parse's :args parameter. Here's what that would look like:

grammar MainLang  {
    rule TOP(:$lisp-like-actions)  {
        [<.text-like> <lisp-like(:$lisp-like-actions)>*]* }

    rule text-like {  [<.alpha>+ ]+ }

    method lisp-like(:lisp-like-actions($actions)) {
        InnerLang.subparse: $.orig, :pos($.to), :$actions
    }
}

MainLang.parse: $input,
            :args(\(:lisp-like-actions(CheckLispSyntax)));

We've solved our problem, but I'm still not thrilled. Why? Two problems. First, we need to thread the action objects through TOP and then on to lisp-like – not that big a deal here, but something that quickly gets out of hand if lisp-like is deeply nested. This problem is widely known enough that solving it has its own subsection in the docs. The solution presented there is to use dynamic variables, which will be scoped to the calling context; this is a good solution, and it's the one we'll use.

The second issue with our MainLang code is that we're using :args, and :args feels like something designed to customize the behavior of specific rules. But we're using it to set an action object – which seems like it fits much more naturally as parse-time configuration. To solve this issue, we can override the parse method that MainLang inherits from Grammar with a method that accepts an additional argument for the inner action object.

Here's what our code looks like with both of those solutions:

grammar MainLang  {
    method parse(:$lisp-like-actions, |) {
        my $*lisp-like-actions = $lisp-like-actions;
        nextsame
    }

    rule TOP       { [<.text-like> <lisp-like>*]* }
    rule text-like {  [<.alpha>+ ]+ }

    method lisp-like {
        InnerLang.subparse: $.orig, :pos($.to),
                    :actions($*lisp-like-actions)
    }
}

Note the use of nextsame in the parse method – which makes it trivially easy to insert our method and grab the $lisp-like-actions argument without needing to reimplement Grammar's parse method or otherwise break our dependency on Grammar.
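For anyone who hasn't used this pattern outside of grammars, here's a minimal, self-contained sketch (class names are hypothetical) of the same move: intercept a method, do some extra work, and then hand control back with nextsame:

```raku
class Base {
    method greet($name) { "Hello, $name" }
}
class Wrapped is Base {
    method greet($name) {
        note "greet was called";  # do our extra work…
        nextsame                  # …then resume dispatch to Base's greet
    }
}
say Wrapped.new.greet('Raku');    # OUTPUT: «Hello, Raku␤»
```

Because nextsame re-dispatches with the same arguments and passes the result back up, the wrapping method is invisible to callers.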

At this point, I'm pretty happy with our code: callers of MainLang can pass in actions for the InnerLang using an API that's very similar to the one they use for passing in MainLang actions – the only difference is that the named parameter is :lisp-like-actions instead of :actions. Thinking back to the syntax-checking use case, a user could now pass in an action object that checks Lisp syntax and get the behavior they want for both the main language and the nested language. We've accomplished what we set out to do.

But why stop there?

Thinking about the use case we're addressing – an inner programming language nested inside an outer one – it seems pretty likely that we might want to support nesting languages other than Lisp inside our outer language. We've already added a lot of flexibility – what would it take to add that too?

Not much, it turns out: we just tweak our API to have callers pass in a hash rather than an action object and then store the contents of that hash. We can even set it up so InnerLang and InnerLangActions are the default grammar and action object, assuming we expect those to be the most commonly used ones.

Anything else we should add? Oh, well, we've been focusing on the parse method, but grammars also have subparse and parsefile methods. I suppose we should wrap those as well, to provide a consistent API. Doing so is trivially easy, though it does require a bit more copy-and-paste than I'd prefer.

With all those changes, here's our final MainLang class, along with an example call:

grammar MainLang  {
    method parse(:%nested-lang, |) {
        my %*nested-lang = (:grammar(InnerLang),
                            :actions(InnerLangActions),
                            |%nested-lang);
        nextsame
    }
    method subparse(:%nested-lang, |) {
        my %*nested-lang = (:grammar(InnerLang),
                            :actions(InnerLangActions),
                            |%nested-lang);
        nextsame
    }
    method parsefile(:%nested-lang, |) {
        my %*nested-lang = (:grammar(InnerLang),
                            :actions(InnerLangActions),
                            |%nested-lang);
        nextsame
    }

    rule TOP       { [<text-like> <nested-lang>*]* }
    rule text-like { [<.alpha>+ ]+ }

    method nested-lang {
        %*nested-lang<grammar>
            .subparse: $.orig, :pos($.to),
                       :actions(%*nested-lang<actions>)
    }
}

say MainLang.parse: $input,
                    :nested-lang{ :grammar(OtherLang),
                                  :actions(OtherLangActions)};

Looking at this, I think it's fair to say that our mental model is paying rent – we used that model to significantly enhance MainLang, making it simultaneously more powerful and more flexible.

From a mental model to a production module

Our MainLang grammar adds a fair amount of functionality and, imo at least, would be much easier for callers to use. But this added functionality comes at a cost on the declaration side: we've grown a simple 8-line grammar into a 32-line one – and 15 of those lines are basically boilerplate that doesn't help readers understand the purpose of the grammar. This seems like a perfect opportunity to abstract away some boilerplate into a module.

Let's make a module that would let us write a grammar like MainLang much more concisely. Specifically, our module will let one grammar delegate regex/token/rule calls to a different grammar that knows how to handle those calls and will allow users to pass in appropriate actions objects at runtime.

As this framing suggests, our module will basically be a Grammar version of Raku's handles trait. If you haven't come across it before, handles lets you delegate a method call to another object (just as we'll be delegating to another grammar). From the perspective of anyone calling your code, a delegated method is exactly like a method you manually defined in your class; the only difference is that the actual execution is, er, handled by the object you delegated to – again, just the behavior we want for our grammar. In fact, since the functionality we're building is so similar to that of handles, but for grammars, that's what we'll call our module: Grammar::Handles.
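If you haven't seen the ordinary handles trait in action, here's a minimal sketch (the Engine/Car names are hypothetical) showing both plain delegation and the Pair-based renaming that the trait supports:

```raku
class Engine {
    method start { 'vroom'   }
    method stop  { 'silence' }
}
class Car {
    # Delegate `stop` under its own name, and `start`
    # under the new name `go` (a Pair renames):
    has Engine $.engine handles('stop', :go<start>) = Engine.new;
}
my $car = Car.new;
say $car.stop;  # OUTPUT: «silence␤»
say $car.go;    # OUTPUT: «vroom␤»
```

From a caller's perspective, Car simply has stop and go methods – the delegation to Engine is an implementation detail.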

To help us implement this, Raku helpfully defines the handles trait as a multi, so we can implement Grammar::Handles by adding a new candidate to the existing handles trait. Doing so will give our users an API that fits in naturally with the rest of Raku and that, personally, I like a lot – something along the lines of:

grammar MainLang handles(OtherLang) {...}

Our handles candidate needs to accomplish the following three tasks:

  1. Let the user supply token names and the grammars that each token should delegate to (e.g., in MainLang, that the nested-lang token should delegate to the OtherLang grammar)
  2. Let users pass action objects for the delegated-to grammars via the [sub]? parse [file]? methods.
  3. Set up the actual delegation/install tokens under the user-supplied names

As mentioned in point 1, we'll allow users to provide a token name that's different from the grammar's name, though they'll often want to use the same name (e.g., a LispLang grammar that handles calls to the LispLang token). Happily, the handles API already accommodates both use cases by accepting Pairs for renaming; we'll do the same.

&trait_mod:<handles>

Our decision to implement a handles trait that operates on a grammar has a couple of knock-on effects that we should discuss before getting to the code.

First, unlike the (perhaps more familiar?) traits that operate on Subs or Variables, our first parameter is not a defined value – in fact, it's not even a fully initialized undefined value. It's a still-being-created object that doesn't even know who its .^parents are. This, in turn, means that it doesn't know that it's a Grammar – and thus that we can't use a type constraint in &trait_mod's signature. Fortunately, Raku saves us again here, because grammars have their own metaobject, so we can test against that rather than the type.

Second, because we're declaring a handles trait, we'll get a slightly unusual second argument: &thunk. &thunk is a piece of not-yet-executed code; we'll need to call this code to access whatever arguments the user called handles with. This isn't a big deal; it just means that we'll have to work slightly harder to match against the different inputs callers may provide (we can't, for example, use multiple dispatch based on the type of the second parameter).

Now that we're clear on why our signature needs to be (Mu:U $grammar, &thunk), we're ready for the multi trait_mod:<handles> code – which is actually very short, at least if you ignore the calls to not-yet-defined helper functions. (So maybe it's more of a todo list than an actual implementation at this point…). Anyway, here it is:

multi trait_mod:<handles>(Mu:U $grammar, &thunk) {
    import Grammar::Handles::Helpers;
    # Ensure we don't mess w/ non-grammar &handles candidates
    when $grammar.HOW
           .get_default_parent_type !=:= Grammar { nextsame }

                     # vvv The name for our new token
    my Grammar %tokens{Str} = build-token-hash &thunk;
    #  ^^^^^^^ the Grammar the token delegates to

    my %delegee-args;
    #  ^^^^^^^^^^^^^ where [sub]?parse[file]? methods save
    #  args for the delegee Grammar (keyed by token name)
    $grammar.&wrap-parse-methods: :%delegee-args,
                                  :token-names(%tokens.keys);
    $grammar.&install-tokens:     :%tokens, :%delegee-args;
}

Based on this code, we just need to implement &build-token-hash (which maps the user-supplied &thunk into $token-name => Grammar pairs), &wrap-parse-methods (which overrides Grammar's parse, subparse, and parsefile methods with versions that store the :actions argument and such for each delegee grammar), and &install-tokens (which installs methods with the specified $token-names that delegate to the correct grammars). Let's take them one by one, in order.

&build-token-hash

As we just saw, this function gets the &thunk as its single argument and needs to return $token-name => Grammar pairs (or raise an error if the &thunk isn't a value from which we can build such a pair). Specifically, we need to process Grammars, Strs (which we expect to be the name of a Grammar), and Pairs (which we expect to have a Str key to use as our $token-name and a value that's either a Grammar or a Str that's the name of a grammar).

Handling each case is fairly straightforward, with thanks once again due to Raku's pattern matching:

#| Transforms the &thunk passed to `handles` into a hash
#| where the keys provide token names to install and the
#| values are the delegee Grammars
sub build-token-hash(&thunk --> Map()) {
    proto thunk-mapper(| --> Pair)   {*}
    multi thunk-mapper(Grammar $g)   {
        $g.^name => $g }
    multi thunk-mapper(Pair $renamed (Grammar :$value, |)) {
        $renamed }
    multi thunk-mapper(Str $name) {
        my Grammar $gram = try ::($name);
        $! ?? pick-err($!, :$name) !! $name => $gram }
    multi thunk-mapper(Mu $type) {
        pick-err (try my Grammar $ = $type) // $!}

    thunk().map: &thunk-mapper
}

[Not pictured: the additional ~25 lines of error-handling code hidden behind &pick-err. All &pick-err does is decide which of Grammar::Handles' custom exceptions to throw and pass on the relevant arguments. But, as I've unfortunately come to expect, error handling ends up being far more verbose than the happy path in Raku.]

Other than using multis to handle our various cases, this code's only slightly exotic feature is its use of runtime interpolation to look up the grammar in my Grammar $gram = try ::($name) – I don't often need to look up a class or other symbol without being able to type its name in the source code, but it's nice to have the option. And it's exactly what we need here, since it lets us translate user-supplied Strs into the actual Grammars we need.
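In case you haven't used runtime interpolation before, here's what that lookup looks like on its own (a standalone sketch; the missing-symbol name is made up):

```raku
# Dynamic symbol lookup in a nutshell:
my $name = 'Int';
my $type = ::($name);      # look the symbol up at runtime
say $type.^name;           # OUTPUT: «Int␤»
say 42 ~~ $type;           # OUTPUT: «True␤»

# A missing symbol throws – which is why the real code
# wraps its lookup in `try`:
say (try ::('NoSuchGrammar')) // 'not found';  # OUTPUT: «not found␤»
```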

Ok, we now have our token hash; on to our next step.

&wrap-parse-methods

One of our main goals is to provide users with the ability to specify action objects when they call .parse and friends. Let's add that ability now.

Our basic approach is the same one we used when adding a parse method to MainLang: check for the named argument we're interested in, save it somehow, and then use nextsame to continue the dispatch process – a powerful pattern that's only possible thanks to *%_. The only real difference is that this time we aren't handling a single, hard-coded named argument but any named argument that matches a $token-name. This means that we'll need to slurp all the named arguments .parse got into an %args hash and then search that hash for any named arguments we care about.

The other difference is that we'll be a bit more comprehensive about which arguments we accept: in MainLang, we only cared about :actions, but .parse also takes :args and :rule (and, of course, *%_). To be thorough, we'll just pass all of the appropriate pairs on to the delegee grammar. Once we've done so, we resume dispatch with nextsame, making it almost like our wrapper method was never called at all.

Here's the code:

#| Overrides the &parse, &subparse, and &parsefile methods with
#| a method that loads %delegee-args with named arguments whose
#| name matches a known $token-name
my method wrap-parse-methods(Mu: :@token-names,
                             :%delegee-args) is export {
    # despite the |, without vv, this sig rejects positionals
    my multi method wrapper ($?, *%args, |)
                             is hidden-from-backtrace {
        for @token-names -> $name {
            next unless %args{$name}:exists;
            if %args{$name}.first({$_ !~~ Map|Pair}, :p) {
                die X::TypeCheck::Binding::Parameter.new:
                        :symbol($name), :expected(Hash()),
                        got => %args{$name} }
            %delegee-args{$name}
              = %args{$name}.Hash;
        }
        nextsame }

    for |<parse subparse parsefile> -> $meth-name {
        self.^add_multi_method: $meth-name, &wrapper }
}

That leaves us with only one function to implement.

&install-tokens

&install-tokens is a lot like &wrap-parse-methods, but in reverse. Just as in &wrap-parse-methods, we'll declare a new method and add that method to our grammar. And, again, we'll rely on our %delegee-args hash in doing so – the only difference is that, this time, we're not adding new entries to the hash; we're checking existing ones to find the correct arguments. While we're here, we'll also give the user the option of defining default :actions, :args, and :rule values when declaring their grammar. These defaults can still be overridden at runtime by passing values in to .parse, but their existence can make the typical use case significantly more ergonomic.

#| Install a method for each known token-name that delegates
#| to the correct Grammar delegee and passes the arguments
#| that the user supplied in their .parse call
my method install-tokens(Mu: :%tokens,
                   :%delegee-args) is export {
    for %tokens.kv -> $name, Grammar $delegee {
        my method TOKEN(:$actions, :$rule='TOP',
                        :$args) is hidden-from-backtrace {
            given %delegee-args{$name} {
                .<actions> = $actions unless .<actions>:exists;
                .<args>    = $args    unless .<args>:exists;
                .<rule>    = $rule    unless .<rule>:exists }
            $delegee.subparse: $.orig, :pos($.to),
                         :from($.from), |%delegee-args{$name}
        }
        self.^add_method: $name, &TOKEN }
}

At this point, the code above isn't too surprising (or, at least, I hope not!). Nevertheless, it's worth focusing on how much this code (and all the Grammar::Handles code, really) depends on the corrected mental model we developed in the first section of this post. The only reason we know that we can install a token that will do the right thing when we call $delegee.subparse is that we understand what Raku is doing under the hood and the wonderfully functional design that underpins Raku's Regex and Grammar classes.

Ok, but that's enough of a look back – we've implemented all of our functions, so it's time to see our trait in action!

Comparative demo & conclusion

To save you from scrolling up, here's what the definition and usage of our MainLang grammar looked like without Grammar::Handles (with InnerLang, InnerLangActions, OtherLang, OtherLangActions, and $input defined off screen):

grammar MainLang  {
    method parse(:%nested-lang, |) {
        my %*nested-lang = (:grammar(InnerLang),
                            :actions(InnerLangActions),
                            |%nested-lang);
        nextsame
    }
    method subparse(:%nested-lang, |) {
        my %*nested-lang = (:grammar(InnerLang),
                            :actions(InnerLangActions),
                            |%nested-lang);
        nextsame
    }
    method parsefile(:%nested-lang, |) {
        my %*nested-lang = (:grammar(InnerLang),
                            :actions(InnerLangActions),
                            |%nested-lang);
        nextsame
    }

    rule TOP       { [<text-like> <nested-lang>*]* }
    rule text-like { [<.alpha>+ ]+ }

    method nested-lang {
        %*nested-lang<grammar>
            .subparse: $.orig, :pos($.to),
                       :actions(%*nested-lang<actions>)
    }
}

say MainLang.parse: $input,
                    :nested-lang{ :grammar(OtherLang),
                                  :actions(OtherLangActions)};

And here's the equivalent definition and usage with Grammar::Handles:

grammar MainLang handles(:nested-lang(OtherLang))  {
    rule TOP       { [<text-like> <nested-lang>*]* }
    rule text-like { [<.alpha>+ ]+ }
}

say MainLang.parse: $input,
                    :nested-lang{:actions(OtherLangActions)};

From 32 lines to 4 – roughly an 87% reduction. I'd say that qualifies as a successful de-boiler-plating. And, more importantly, I hope that those of you who made it this far learned at least a few things about Raku and, just maybe, came away with a slightly improved mental model.

The full code for Grammar::Handles is below and in a gist. I also plan to release it as a module in a few days, once I've had a chance to add some additional tests and incorporate any suggestions that emerge from the discussion surrounding this post. I look forward to hearing any thoughts/questions you might have – and, in particular, I look forward to comparing the approach in Grammar::Handles with the one in guifa's Token::Foreign, which comes at the same general problem from a different angle (er, or maybe three different angles?).


# Grammar::Handles
my module Grammar::Handles::Helpers {

class X::Grammar::Can'tHandle is Exception {
    # extra ' to fix my blog’s syntax highlighter (aka hlfix)
    has $.type is required;
    multi method CALL-ME(|c) { die self.new(|c)}
    method message { q:to/§err/.trim.indent(2);
      The `handles` grammar trait expects a Grammar, the name
      of a Grammar, a Pair with a Grammar value, or a list of
      any of those types.  But `handles` was called with:
          \qq[{$!type.raku} of type ({$!type.WHAT.raku})]
      §err
}}

class X::Grammar::NotFound is Exception {
    has $.name;
    multi method CALL-ME(|c) { die self.new(|c)}
    method message { qq:to/§err/.trim.indent(2);
      The `handles` grammar trait tried to handle a grammar
      named '$!name' but couldn't find a grammar by that name
      §err
}}

#| A helper to select the right error more concisely on the happy path
sub pick-err($_, :$name, |c) {
    when X::TypeCheck::Assignment { X::Grammar::Can'tHandle(:type(.got))  } # hlfix '
    when X::NoSuchSymbol          { X::Grammar::NotFound(:$name) }}

#| Install a method for each known token-name that delegates
#| to the correct Grammar delegee and passes the arguments
#| that the user supplied in their .parse call
my method install-tokens(Mu: :%tokens,
                   :%delegee-args) is export {
    for %tokens.kv -> $name, Grammar $delegee {
        my method TOKEN(:$actions, :$rule='TOP',
                        :$args) is hidden-from-backtrace {
            given %delegee-args{$name} {
                .<actions> = $actions unless .<actions>:exists;
                .<args>    = $args    unless .<args>:exists;
                .<rule>    = $rule    unless .<rule>:exists }
            $delegee.subparse: $.orig, :pos($.to),
                         :from($.from), |%delegee-args{$name}
        }
        self.^add_method: $name, &TOKEN }
}

#| Transforms the &thunk passed to `handles` into a hash
#| where the keys provide token names to install and the
#| values are the delegee Grammars
sub build-token-hash(&thunk --> Map()) is export {
    proto thunk-mapper(| --> Pair)   {*}
    multi thunk-mapper(Grammar $g)   { $g.^name => $g }
    multi thunk-mapper(Str $name) {
        my Grammar $gram = try ::($name);
        $! ?? pick-err($!, :$name)
           !! $name => $gram }
    multi thunk-mapper(Pair (:key($name), :value($_), |)) {
        when Grammar { $name => $_ }
        when Str     { $name => thunk-mapper($_).value }
        default      { #`[type err] thunk-mapper $_ }}
    multi thunk-mapper(Mu $invalid-type) {
        pick-err (try my Grammar $ = $invalid-type) // $! }

    thunk().map: &thunk-mapper
}

#| Overrides the &parse, &subparse, and &parsefile methods with
#| a method that loads %delegee-args with named arguments whose
#| name matches a known $token-name
my method wrap-parse-methods(Mu: :@token-names,
                             :%delegee-args) is export {
    # despite the |, without vv, this sig rejects positionals
    my multi method wrapper ($?, *%args, |)
                             is hidden-from-backtrace {
        for @token-names -> $name {
            next unless %args{$name}:exists;
            if %args{$name}.first({$_ !~~ Map|Pair}, :p) {
                die X::TypeCheck::Binding::Parameter.new:
                        :symbol($name), :expected(Hash()),
                        got => %args{$name} }
            %delegee-args{$name}
              = %args{$name}.Hash;
        }
        nextsame }

    for |<parse subparse parsefile> -> $meth-name {
        self.^add_multi_method: $meth-name, &wrapper }
}

#`[end module Grammar::Handles::Helpers] }

multi trait_mod:<handles>(Mu:U $grammar, &thunk) {
    import Grammar::Handles::Helpers;
    # Ensure we don't mess w/ non-grammar &handles candidates
    when $grammar.HOW
           .get_default_parent_type !=:= Grammar { nextsame }

                     # vvv The name for our new token
    my Grammar %tokens{Str} = build-token-hash &thunk;
    #  ^^^^^^^ the Grammar the token delegates to

    my %delegee-args;
    #  ^^^^^^^^^^^^^ where [sub]?parse[file]? methods save
    #  args for the delegee Grammar (keyed by token name)
    $grammar.&wrap-parse-methods: :%delegee-args,
                                  :token-names(%tokens.keys);
    $grammar.&install-tokens:     :%tokens, :%delegee-args;
}