Practical F# Parsing: Recursion and Predicate Functions

To prep for my Lang.NET talk, I went back and reviewed my PEG parser. One thing I was not happy with was that all the recursion was handled in a one-off manner. When I needed to match multiple characters in the comment rule, I wrote a special function to recursively process the comment until it reached an EOL. When I needed to parse a series of ranges, characters, or definitions, I wrote more special-purpose functions to handle that recursion. Obviously, that's not the best approach. So, I wrote the following active pattern functions to handle recursion generically.

//ZOM == Zero Or More: apply f until it fails, collecting the results
let rec (|ZOM|) f input =
    match f input with
    | Some(i,input) ->
        let j,input = (|ZOM|) f input
        (i :: j, input)
    | None -> [], input

//OOM == One Or More: like ZOM, but fail if f never matched at all
let (|OOM|_|) f input =
    match (|ZOM|) f input with
    | [], input -> None
    | v, input -> Some(v,input)

//ZOO == Zero Or One: always succeeds, wrapping the optional match
let (|ZOO|) f input =
    match f input with
    | Some(i,input) -> Some(i), input
    | None -> None,input
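To see how these compose, here's a hypothetical example that isn't part of the parser itself: a matcher for a single decimal digit over a char list, threaded through ZOM and OOM. The definitions above are repeated (in modern F# spelling) so the sketch stands alone; the `Digit`, `leadingDigits`, and `number` names are mine, invented for illustration.

```fsharp
//ZOM and OOM repeated from above so this sketch is self-contained
let rec (|ZOM|) f input =
    match f input with
    | Some(i, input) ->
        let j, input = (|ZOM|) f input
        (i :: j, input)
    | None -> [], input

let (|OOM|_|) f input =
    match (|ZOM|) f input with
    | [], _ -> None
    | v, input -> Some(v, input)

//Hypothetical single-item matcher: one decimal digit from a char list
let (|Digit|_|) input =
    match input with
    | c :: rest when System.Char.IsDigit c -> Some(c, rest)
    | _ -> None

//Collect as many leading digits as possible; ZOM never fails
let leadingDigits input =
    match input with
    | ZOM (|Digit|_|) (ds, rest) -> ds, rest

//Require at least one digit; OOM fails on non-digit input
let number input =
    match input with
    | OOM (|Digit|_|) (ds, rest) -> Some(ds, rest)
    | _ -> None
```

Note that the single-item matcher does all the domain-specific work; ZOM and OOM only supply the recursion.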

With these functions at the ready, I can stop writing one-off recursion functions. Instead, I write a function that matches a single item, which I pass as an argument to one of the three functions above. For example, here is the original and new version of the top level Grammar function.

//Original version
let (|Grammar|_|) input =
    let rec ParseDefinitions dl input =
        match input with
        | Definition (d, input) -> ParseDefinitions (dl @ [d]) input
        | _ -> Some(dl, input)
    let (|OneOrMoreDefinitions|_|) input =
        match input with
        | Definition (d, input) -> ParseDefinitions [d] input
        | _ -> None
    match input with
    | Spacing (OneOrMoreDefinitions (dl, EndOfFile)) ->
          Some(List.to_array dl)
    | _ -> None

//New Version
let (|Grammar|_|) = function
    | Spacing (OOM (|Definition|_|) (dl, EndOfFile)) ->
          Some(List.to_array dl)
    | _ -> None

The new version is much shorter because there's already a function to match a single definition, which we can pass into OneOrMore (aka OOM). Note that when I pass an active pattern function as a parameter, I have to use its real name (pipes and all). Having to use the real name is pretty ugly, but F# needs to be able to differentiate between using a function as an active pattern and using it as a function parameter. If you could just write OOM Definition (dl, EndOfFile), how would F# know that Definition was a parameter rather than a nested pattern?
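A toy illustration of that real-name requirement (hypothetical, not part of the PEG parser): a Word rule built from a single-letter matcher, mirroring the shape of the Grammar rule. The `Letter` and `Word` names are invented for this sketch, and ZOM/OOM are repeated in modern F# spelling so it stands alone.

```fsharp
//ZOM/OOM repeated so the sketch is self-contained
let rec (|ZOM|) f input =
    match f input with
    | Some(i, input) ->
        let j, input = (|ZOM|) f input
        (i :: j, input)
    | None -> [], input

let (|OOM|_|) f input =
    match (|ZOM|) f input with
    | [], _ -> None
    | v, input -> Some(v, input)

//Hypothetical single-item matcher for one letter
let (|Letter|_|) = function
    | c :: rest when System.Char.IsLetter c -> Some(c, rest)
    | _ -> None

//A word is one or more letters; note the real name (|Letter|_|) passed as
//the argument -- writing "OOM Letter (...)" would not compile, because
//Letter is not bound as an ordinary identifier
let (|Word|_|) = function
    | OOM (|Letter|_|) (cs, rest) -> Some(System.String(List.toArray cs), rest)
    | _ -> None
```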

I also defined syntactic predicate functions. If you'll recall, syntactic predicates attempt a match but automatically backtrack, consuming no input either way; they return success or failure depending on which predicate you call.

//FP == Failure Predicate: succeed (consuming nothing) only if f fails
let (|FP|_|) f input =
    match f input with
    | Some(_) -> None
    | None -> Some(input)

//SP == Success Predicate: succeed (consuming nothing) only if f succeeds
let (|SP|_|) f input =
    match f input with
    | Some(_) -> Some(input)
    | None -> None
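To make the predicates concrete, here's a hypothetical lookahead for a semicolon, not taken from the parser. FP succeeds exactly when its argument fails, SP exactly when it succeeds, and in both cases the original input falls through unconsumed. The `Semi`, `notSemi`, and `atSemi` names are invented for this sketch.

```fsharp
//FP and SP repeated so the sketch is self-contained
let (|FP|_|) f input =
    match f input with
    | Some(_) -> None
    | None -> Some(input)

let (|SP|_|) f input =
    match f input with
    | Some(_) -> Some(input)
    | None -> None

//Hypothetical single-item matcher for a leading semicolon
let (|Semi|_|) input =
    match input with
    | ';' :: rest -> Some((), rest)
    | _ -> None

//Negative lookahead: succeed only if the input does NOT start with ';'
let notSemi input =
    match input with
    | FP (|Semi|_|) (rest) -> Some(rest)
    | _ -> None

//Positive lookahead: succeed if it DOES, leaving the ';' unconsumed
let atSemi input =
    match input with
    | SP (|Semi|_|) (rest) -> Some(rest)
    | _ -> None
```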

To see this in action, here’s the original and updated Primary function. Only the first rule is relevant, so I’ve omitted the others.

//Original version
let (|Primary|_|) input =
    let (|NotLEFTARROW|_|) input =
        match input with
        | LEFTARROW (_) -> None
        | _ -> Some(input)
    match input with
    | Identifier (id, NotLEFTARROW (input)) ->
        Some(Primary.Identifier(id), input)
    //rest of function omitted for clarity

//new version
let (|Primary|_|) = function
    | Identifier (id, FP (|LEFTARROW|_|) (input)) ->
          Some(Primary.Identifier(id), input)
    //rest of function omitted for clarity

Instead of writing a special function to match “not left arrow”, I just pass the left arrow function as a parameter to Failure Predicate (aka FP). With these recursion and syntactic predicate functions, I was able to remove all the one-off recursion functions from my parser. (Note, I posted an updated version of PegParser on my SkyDrive so you can see this in action.)

These five functions significantly reduced the complexity of the code. Unfortunately, I’m not sure it’s much easier to read. The conciseness is offset IMO by the ugliness of using the active pattern’s true names. Also, I would have liked to use custom operators for these five functions, but operators aren’t allowed to be active pattern functions. Hopefully, that will change at some point in the future, though if we’re going to dream of better syntax, can we do something about all the parens? Personally, I’d love to be able to write the following:

//This doesn't work, but I can dream, can't I?
let (|Primary|_|) = function
    | Identifier (id) !!LEFTARROW (input) ->
        Some(Primary.Identifier(id), input)
    //rest of function omitted for clarity

let (|Grammar|_|) = function
    | Spacing ++Definition (dl) EndOfFile ->
        Some(List.to_array dl)
    | _ -> None

Note to self, talk to F# team members who come to LangNET about this…

Comments:

I am very interested in what you are saying here, but I don't know enough yet about either F# or PEG parsers to be clear WHAT you've done. I want to write a parser for [some grammar]. How do I build on what you have done? Is the grammar you've written a parser for just a specific sample grammar from that whitepaper, or are you saying that it parses grammars written in the notation used in that whitepaper? Or are you building up a combinator library of functions that I would use in writing my own parser directly in F#? To take a concrete example, suppose I wanted to write a parser for F# itself, and I'm staring at the F# language spec .. how would I proceed to write such a parser, beginning with your work? Also, what would be the likely speed of such a parser, versus using the lex/yacc available with F#? And since I'm barraging you with questions, I will also ask if there is source available for an F# parser written in F#, using any parsing approach? (What I'm really interested in is translating a subset of ECMAScript 4 to F#, relying on F#'s type inference to get a high performance CodeDOM based implementation running, but I can't claim to know enough about either ES4 or F# to know whether that makes any sense at all...)
Found your post 'F# PEG Parser Next Steps' which answers my question above about what your code does ('it simply builds a PEG AST from a PEG grammar.'). Re 'figure out where to take it next', I'd love to see a technique for mixing domain-specific-language source with generic F# source. As an anti-example of what I mean, consider how scripting languages are currently embedded into HTML pages. Horrible syntactic kludges. Even worse is stuff often done in Javascript or PHP to output page fragments that are composed into a page. Often totally ad hoc and unstructured. Suppose one's goal was to create web pages or a browser-based web application, and we re-thought this from scratch. What might an approach be? One interesting approach to structured mixing of logic and presentation in an XML document can be seen at whitebeam.org. Given F#, and parser technology, what would be a clean way to express what we are trying to say? In some situations, a good answer would be a DOM-based one. That is, I'd like a less verbose way to type the equivalent to the XML document one would create if using Whitebeam. F# logic snippets would be embedded in the presentation document hierarchy in a clean way. In other cases, it would make more sense to INVERT this: Write an application in F#, but with a preprocessor that allows presentation nodes to be expressed readably.
Minor comment re name choices: on page http://devhawk.net/2007/12/11/Practical+F+Parsing+The+Parse+Buffer.aspx you present two different versions, one reusing the name "token", the other using an unambiguous name "ttail", which you find less readable. I agree, but the reuse of "token" in the first one caused me to mis-read its functioning the first time. A better solution than EITHER of these is to use variations on "token"; e.g. call the second use "token1" or some such. That way, the semantics are clear ("this is a case where the tail is being used as another token"), AND there is no confusing the two uses.
Here is my version of your code from Jan 2008:

let (|TK|_|) token input =
    let rec ParseToken token input =
        match token, (|NC|_|) input with
        | t :: [], Some(i, input1) when i = t -> Some(input1)
        | t :: token1, Some(i, input1) when i = t -> ParseToken token1 input1
        | _ -> None
    ParseToken (List.of_seq token) input
Another comment on naming: Being new to F#, and active pattern matching, it took me a long time to understand exactly how (|TK|_|) works, even after reading The Whitepaper you linked to [Thanks for that link -- I would not have grokked active patterns without it]. It would have been somewhat easier if something like "remainder" or "unconsumed" had been used instead of "input" for the RESULT of each rule. That would have clued me in that the active pattern syntax is not referring to a left hand side (input param) term here, but a right hand side (result). Here is my version of EndOfLine:

let (|EndOfLine|_|) = function
    | TK "\r\n" (unconsumed) -> Some(unconsumed)
    | TK "\n" (unconsumed) -> Some(unconsumed)
    | TK "\r" (unconsumed) -> Some(unconsumed)
    | _ -> None
http://devhawk.net/2007/12/19/Practical+F+Parsing+The+Abstract+Syntax+Tree.aspx would benefit from an example showing a simple Abstract Syntax Tree being built up using these functions. In general, these talks would benefit from more examples of how all this looks to someone using what you are building up. This would give clear context to the reader: "black box" explain from the outside, THEN open the box to explain the insides. At this point, perhaps show the nested rule calls recognizing a single low level construct, wrapped in successively higher constructs:

(|Char|_|) !!"5" --> Some ('5', [])
(|Range|_|) !!"5-9" --> Some (Dual ('5', '9'), [])
(|Class|_|) !!"[5-9]" --> Some ([Dual ('5', '9')], [])
(|Primary|_|) !!"[5-9]" --> Some (Class [Dual ('5', '9')], [])

A talk or two later, once you have enough of the functions, describe a small PEG grammar and show the AST for some sample inputs.
"These five functions significantly reduced the complexity of the code. Unfortunately, I'm not sure it's much easier to read. The conciseness is offset IMO by the ugliness of using the active pattern's true names." I consider the new version MUCH easier to read. To be precise, as I get used to the split-banana syntax by experimenting with active patterns, it is becoming a familiar visual entity with a known meaning, that I can rapidly scan and associate with that meaning. Contrast that with the original, which requires examining several lines of code to recognize a usage pattern. As far as how to make this less ugly, I am concerned that removing that visual cue would take away meaning; remove that immediate knowledge "it's a split-banana active pattern". I suggest simply having the code editor recognize bananas and splits, and color them blue. It's not like full names of banana entities are going to be heavily used in ordinary code. Keeping them distinctive is a good thing!