F# PEG Parser Next Steps

There are still a couple of posts to go in my Practical Parsing in F# series. But with Christmas and my parents on their way, I’m taking the rest of the year off.

I’ve stuck the code as it currently stands up on my SkyDrive. Conveniently enough, xUnit.net released their RC1 build yesterday, which includes support for static test methods. I’ve included the RC1 build in the zip file on SkyDrive, as well as a simple batch file so you can run the tests yourself.

Taking a break from this project will give me a good opportunity to figure out where to take it next. As the code stands, it’s not very useful – it simply builds a PEG AST from a PEG grammar. That’s just the first phase of a typical compiler. Without those other phases (you know, like “generate binary code”) this is just an interesting sample.

Since I’m in the “future pondering” phase, now’s the time to make your opinion known. What do you, dear reader, think I should do with this code? Bonus points for wanting to get involved.

Practical F# Parsing: Semantic Productions (2)

Now that I’ve explained the AST, there are several more semantic productions to go. I’m not going to describe them all in detail, just hit a few important highlights.

Many of the semantic productions return lists of other productions. Class returns a list of Ranges, Literal and Identifier return lists of characters, etc. As you would expect, these multiples are encoded in the grammar. For example, here’s the implementation of Literal:

///Literal <- ['] (!['] Char)* ['] Spacing
///         / ["] (!["] Char)* ["] Spacing
let (|Literal|_|) input =

    let rec (|LitChars|_|) delimiter chars input =
        match input with
        | TOKEN delimiter (_) -> Some(L2S chars, input)
        | Char (c, input) ->  
            (|LitChars|_|) delimiter (chars @ [c]) input
        | _ -> None

    match input with
    | TOKEN "'"  (LitChars "'"  [] (lit, TOKEN "'"  (Spacing(input)))) ->  
        Some(lit, input)
    | TOKEN "\"" (LitChars "\"" [] (lit, TOKEN "\"" (Spacing(input)))) ->
        Some(lit, input)
    | _ -> None

I’m using a local recursive function LitChars to retrieve the characters between the quote delimiters. The delimiter – i.e. single or double quote – is passed in as a parameter, along with an initially empty list of chars. Remember that functional programs keep their data on the stack, so a list parameter is a common way to keep state in a recursive function. When I match a single non-delimiter character, I add it to the list with the chars @ [c] expression. [c] converts a single value c into a list of one element, while the @ operator concatenates two lists. I’m not sure adding the value to the end like that is a good idea perf-wise. Programming Erlang recommends only adding items to the head then reversing the list when you’re done matching. But F# isn’t Erlang, so I’m not sure what the guidance is here.
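For what it's worth, the prepend-then-reverse approach is a common idiom in F# too: chars @ [c] has to copy the whole left-hand list each time, while c :: chars is a constant-time operation. Here is a minimal sketch of that alternative. The takeUntil helper is hypothetical and not part of the parser; it only illustrates the accumulation pattern.

```fsharp
// Hypothetical helper, not from the original parser: collect characters up to
// (but not including) a delimiter, returning the text and the remaining input.
let takeUntil (delimiter: char) (input: char list) =
    let rec loop (acc: char list) (rest: char list) =
        match rest with
        | c :: _ when c = delimiter ->
            // Reverse once at the end to restore source order.
            Some (System.String(acc |> List.rev |> List.toArray), rest)
        | c :: tail -> loop (c :: acc) tail   // constant-time prepend
        | [] -> None                          // ran out of input before the delimiter

    loop [] input

let result = takeUntil '\'' (List.ofSeq "abc'rest")
// result = Some ("abc", ['\''; 'r'; 'e'; 's'; 't'])
```

The accumulator is built backwards and reversed exactly once, so the total work stays linear in the length of the literal.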

Another thing you’ll find in productions is backtracking syntactic predicates. We saw an example of them in the implementation of Comment. Often, they’re used to indicate the end of a list of other productions, such as Literal, above. However, sometimes, they’re used to ensure the correct production is matched. For example, a Primary can be an Identifier, as long as it’s not followed by a left arrow. An identifier followed by a left arrow indicates a Definition.

///Primary <- Identifier !LEFTARROW
///         / OPEN Expression CLOSE
///         / Literal / Class / DOT
let rec (|Primary|_|) input =

    let (|NotLEFTARROW|_|) input =
        match input with
        | LEFTARROW (_) -> None
        | _ -> Some(input)

    match input with
    | Identifier (id, NotLEFTARROW (input)) ->  
        Some(Primary.Identifier(id), input)
    | OPEN ( Expression (exp, CLOSE (input))) ->
        Some(Primary.Expression(exp), input)
    | Literal (lit, input) -> Some(Primary.Literal(lit), input)
    | Class (cls, input) -> Some(Primary.Class(cls), input)
    | DOT (input) -> Some(Primary.Dot, input)
    | _ -> None

Here, I need a way to match the absence of LEFTARROW, so I’ve built a simple local function called NotLEFTARROW. This isn’t very clean IMO – I’d rather have used custom operators like !!! and &&& for my backtracking predicates. But I haven’t figured out how to use custom operators as Active Patterns. I was able to write a standard non-operator AP function, but then I have to use the full AP function name. Here’s a version of Primary written that way:

///Backtracking failure predicate
let (|NotPred|_|) f input =
    match f input with
    | Some (_) -> None
    | _ -> Some(input)

let rec (|Primary|_|) input =
    match input with
    | Identifier (id, NotPred (|LEFTARROW|_|) (input)) ->  
        Some(Primary.Identifier(id), input)
    //Other matches omitted

Frankly, I don’t think that’s very readable, so I didn’t implement it that way. If I can figure out how to use custom operators and pass around AP functions without using their full ugly name, I’ll change it.

Finally, there are a few things about F#’s scoping rules that you need to understand. F# uses linear scoping, which is to say there’s no way to use a type or function that hasn’t been declared, sort of like C/C++. The difference is that while C/C++ have a way to declare a type or function separately from its implementation, F# has no such capacity. This becomes an issue when you have circular references. For example, Primary can be an Expression, which is a list of SequenceItems, each of which is a Primary with an optional prefix and suffix. In order to declare those in F#, you have to use a special “and” syntax to link the types/functions together.

//ToString and Exp2Str omitted for clarity
type Primary =
| Identifier of string
| Expression of Expression
| Literal of string
| Class of Range list
| Dot  

//ToString omitted for clarity
and SequenceItem =
    {  
        primaryItem: Primary;
        itemPrefix: Prefix option;
        itemSuffix: Suffix option;
    }

and Sequence = SequenceItem list

and Expression = Sequence list

Likewise, the AP functions to recognize Primary, SequenceItem, Sequence and Expression are anded together. For me, this is one of the hardest things to get used to about F#. But as you can see from the expressiveness of the code, it’s well worth the trouble.
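Here is a small self-contained sketch of the same linking technique, applied to both types and functions. The Item and Expr types and the showItem and showExpr functions are simplified, hypothetical stand-ins, not the parser's real AST.

```fsharp
// Simplified, hypothetical AST: not the parser's real types.
type Item =
    | Name of string
    | Group of Expr          // refers to a type declared further down
and Expr = Item list

// Mutually recursive functions are linked the same way, with "let rec ... and ...".
let rec showItem item =
    match item with
    | Name n -> n
    | Group e -> "(" + showExpr e + ")"
and showExpr (expr: Expr) =
    expr |> List.map showItem |> String.concat " "

let text = showExpr [ Name "a"; Group [ Name "b"; Name "c" ] ]
// text = "a (b c)"
```

Without the and keyword, Item could not mention Expr, and showItem could not call showExpr, because neither would be in scope yet.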

Practical F# Parsing: The Abstract Syntax Tree

In the last post, I showed two semantic productions, Char and Range. Char returns an option tuple of a native char and the parse buffer. Range returns a tuple of either a single character or a character range and the parse buffer. Certainly, I could have written Range to always return a char * char tuple, passing in the same character for both in the case of a single character range. However, this provides an opportunity to introduce F#’s discriminated unions (or simply union for short).

The F# Manual describes a discriminated union as “a new type composed of a fixed number of distinct alternatives”. Many of the semantic productions return “a fixed number of distinct alternatives” so I find a union is a good way to model the return value of semantic production functions. Here’s the definition of Range:

///AST Type for Range production
type Range =
| Single of char
| Dual of char * char
    with
    override this.ToString() =  
        match this with
        | Single x -> sprintf "Range.Single (%A)" x
        | Dual (x,y) -> sprintf "Range.Dual (%A,%A)" x y

So Range is either a single character, or a tuple of two characters. As you saw in the last post, you create an instance of a union with the type.alternative syntax. You can also use simply the alternative name, assuming F# can determine the correct union type. Personally, I like using the full name – it helps me remember what the type really is.
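As a quick illustration of both construction styles, here is a sketch that repeats the Range union (without the ToString override) and adds a hypothetical describe function to show how the alternatives are taken apart again with pattern matching.

```fsharp
type Range =
    | Single of char
    | Dual of char * char

let a = Range.Single 'x'      // fully qualified alternative name
let b = Dual ('a', 'z')       // short name; F# infers the Range type

// Hypothetical helper: pattern matching recovers the alternative and its data.
let describe r =
    match r with
    | Single c -> sprintf "one char %c" c
    | Dual (lo, hi) -> sprintf "%c through %c" lo hi

let da = describe a   // "one char x"
let db = describe b   // "a through z"
```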

Notice that the AP function and the union type appear to have the same name. Actually, they don’t, since the AP function’s name includes the bananas – i.e. (|Range|_|). However, if you want you can define a function called simply Range and still have a type named Range as well – as long as you’re not interested in language interop. F# can tell the difference between the Range function and the Range union, but C# can’t. So I’d say we’re best off avoiding overloading the names entirely.

If you look at the compiled union in Reflector, you’ll see the Range type, with public nested classes named _Single and _Dual that inherit from Range. In other words, F# implements union types as an inheritance tree. Range also provides static constructors for the various disparate types in the union.

One last thing I want to point out about the Range type is how I overrode ToString. This is primarily for unit testing – if you don’t override ToString, you only get the type name which isn’t very useful when trying to figure out why a given unit test failed. I’m using the F# native sprintf function rather than string.Format, so the format string is a little different.

The other major F# types we’ll use in the AST are record types. These are conceptually similar to structs in C#. Basically, they’re tuples with named fields. For example, here’s the Definition record type (though we haven’t seen any functions that use this type yet).

///AST Type for Definition production
type Definition =
    {  
        name: string;  
        exp: Expression;  
    }  
    with  
    override this.ToString() =
        sprintf "Definition (name: %A, exp: %A)" this.name (Primary.Exp2Str this.exp)

I could have simply defined this type as (string * Expression), but having the fields named makes it crystal clear what the semantic meaning of each field is. The only place where I used an anonymous tuple in the AST instead of a record is in the Range union above – I figured that was simple enough not to warrant named fields.
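To make the comparison concrete, here is a sketch that puts a record next to the equivalent tuple. The Definition type below is a simplified, hypothetical version whose exp field is just a list of strings, so the example stands alone.

```fsharp
// Hypothetical, simplified record: the real Definition holds an Expression.
type Definition = { name: string; exp: string list }

let def = { name = "Spacing"; exp = [ "Space"; "Comment" ] }
let asTuple : string * string list = ("Spacing", [ "Space"; "Comment" ])

let n1 = def.name      // field access by name
let n2 = fst asTuple   // access by position only

// Records also support copy-and-update, which tuples don't.
let renamed = { def with name = "Whitespace" }
```

Both values carry the same data, but only the record tells the reader which string is the name.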

I also have a couple of type aliases. For example, I have a record type called SequenceItem. A list of SequenceItems is a Sequence and a list of Sequences is an Expression (which we saw in the Definition type above).

///AST Type for Sequence Item production
type SequenceItem =
    {
        primaryItem: Primary;
        itemPrefix: Prefix option;
        itemSuffix: Suffix option;
    }

///AST Type for Sequence production
type Sequence = SequenceItem list

///AST Type for Expression production
type Expression = Sequence list

Note, unlike unions and records, type aliases can’t override base class methods like ToString. This is because there are no actual Sequence or Expression types in the compiled code – F# compiles away type aliases completely. Looking at the implementation of Definition in Reflector confirms that the exp member is of type List<List<SequenceItem>>. Since I need to convert Expressions to strings in two different places, I wrote a static Exp2Str method on the Primary type (not shown). It feels a bit hacky to stick Expression’s ToString implementation on the Primary type, but I had little choice given F#’s scoping rules.

Technically, since they get compiled away anyway, I could have skipped the Sequence and Expression declarations and simply defined the exp field of Definition as “SequenceItem list list”. But the “list list” syntax throws me a bit. I mean, I understand it, but I found using the terms Sequence and Expression far more readable. Also, I used the definition of Expression in the definition of Primary, so it makes sense for it to have its own name.
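The claim that aliases are compiled away can be checked from F# itself. In this sketch SequenceItem is a one-field, hypothetical stand-in for the real record, and exp2Str is an illustrative free function rather than the static Exp2Str method mentioned above.

```fsharp
type SequenceItem = { primaryItem: string }   // simplified stand-in for the real record
type Sequence = SequenceItem list
type Expression = Sequence list

// The alias and the spelled-out type are the same runtime type.
let sameType = (typeof<Expression> = typeof<SequenceItem list list>)
// sameType = true

// Since an alias can't override ToString, formatting lives in an ordinary function.
let exp2Str (exp: Expression) =
    exp
    |> List.map (fun s -> s |> List.map (fun i -> i.primaryItem) |> String.concat " ")
    |> String.concat " / "

let shown = exp2Str [ [ { primaryItem = "a" }; { primaryItem = "b" } ]; [ { primaryItem = "c" } ] ]
// shown = "a b / c"
```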

Practical F# Parsing: Semantic Productions (1)

All the syntactic productions in my PEG parser, save one, have the exact same signature. They take in a char list and return a char list option. Which is to say, they take a parse buffer in and return either the remaining parse buffer on a successful match or nothing on a failed match. The only exception is EndOfFile which doesn’t return the remaining parse buffer because there isn’t any buffer left to parse.

Now we’re moving on to look at the productions with semantic implications. In Parsing Expression Grammars, there are eleven: Char, Range, Class, Literal, Identifier, Primary, Sequence Item, Sequence, Expression, Definition and Grammar. Like their syntactic brethren, these semantic productions will all have a single char list input parameter. However, they will all return some semantic value along with the remaining parse buffer.
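To make the two signatures concrete, here is a self-contained sketch. The NC and TOKEN helpers were defined in earlier posts and their real definitions aren't shown here, so the versions below are assumptions about their shape. Digit is a hypothetical semantic production invented for this example; it is not part of the PEG grammar.

```fsharp
// Assumed shapes for two helpers from earlier posts (real definitions not shown here).

/// NC ("next char"): split the buffer into its first character and the rest.
let (|NC|_|) (input: char list) =
    match input with
    | c :: rest -> Some (c, rest)
    | [] -> None

/// TOKEN: succeed if the buffer starts with the given string, returning the rest.
let (|TOKEN|_|) (token: string) (input: char list) =
    let rec loop t rest =
        match t, rest with
        | [], _ -> Some rest
        | tc :: ts, c :: cs when tc = c -> loop ts cs
        | _ -> None
    loop (List.ofSeq token) input

/// A syntactic production: char list -> char list option.
let (|DOT|_|) input =
    match input with
    | TOKEN "." rest -> Some rest
    | _ -> None

/// A hypothetical semantic production: returns a value along with the remaining buffer.
let (|Digit|_|) input =
    match input with
    | NC (c, rest) when System.Char.IsDigit c -> Some (int c - int '0', rest)
    | _ -> None

let r1 = (|DOT|_|) (List.ofSeq ".x")     // Some ['x']
let r2 = (|Digit|_|) (List.ofSeq "7a")   // Some (7, ['a'])
```

The only difference between the two kinds of production is what sits inside the Some: just the remaining buffer, or a tuple of a value and the remaining buffer.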

We’ll start with Char, since it’s the only semantic production that doesn’t return a custom type:

///Char <- '\\' [nrt'"\[\]\\]
/// / '\\' [0-2][0-7][0-7]
/// / '\\' [0-7][0-7]
/// / '\\' [0-7]
/// / !'\\' .
let (|Char|_|) input =

    let (|InRange|_|) upper input =
        let i2c value = Char.chr(Char.code '0' + value)
        let c2i value = Char.code value - Char.code '0'

        match input with
        | NC (c, input) when (i2c 0) <= c && c <= (i2c upper) ->
            Some((c2i c), input)
        | _ -> None

    match input with
    | TOKEN @"\" (NC(c, input))
    when List.exists (fun x -> x=c) ['n';'r';'t';'\'';'"';'[';']';'\\'] ->
        match c with
        | 'n' -> Some('\n', input)
        | 'r' -> Some('\r', input)
        | 't' -> Some('\t', input)
        | _ -> Some(c, input)
    | TOKEN @"\" (InRange 2 (i1, InRange 7 (i2, InRange 7 (i3, input)))) ->
        Some(Char.chr (i1 * 64 + i2 * 8 + i3), input)
    | TOKEN @"\" (InRange 7 (i1, InRange 7 (i2, input))) ->
        Some(Char.chr (i1 * 8 + i2), input)
    | TOKEN @"\" (InRange 7 (i1, input)) ->
        Some(Char.chr (i1), input)

    | NC(c, input) when c <> '\\' -> Some(c, input)
    | _ -> None

Note, this production is slightly different from the one in the PEG whitepaper. This way was easier to pattern match. Also, I typically don’t wrap my when guards onto the next line, but this way it doesn’t wrap funny on my blog.

While long, Char is fairly straight-forward. There are five ordered choices that can match this production. The first is for escaped characters, the next three are for character codes, and the last one is matching any character except the backslash escape character. Note, tracking F#’s escape characters and PEG’s escape characters can get tricky. I’ve used verbatim strings for all my TOKEN parameters to help keep it straight.

The escape character match clause uses a when guard to narrow down the selection criteria. I use the built-in List.exists function to see if the character is in a hard-coded list of special characters. List.exists takes in a function parameter, and returns true if that function returns true for any of the values in the list. Since I’m just matching a value, my function parameter is a trivial equality test. If List.exists returns true, I return that special character as part of the return tuple. Of all the escape characters in PEG, only three (n, r and t) stand for a different character than the one written, so I use a second match clause to return the correct char value. There’s probably a way to do that more elegantly, but since there were just three clauses, I figured it was easier to type them out manually.
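Pulled out of the parser, the membership test and the translation step look like this. The names specials, isSpecial and unescape are hypothetical and only exist for this sketch.

```fsharp
// Characters that may follow a backslash in a PEG escape sequence.
let specials = ['n'; 'r'; 't'; '\''; '"'; '['; ']'; '\\']

// List.exists returns true if the predicate holds for any element of the list.
let isSpecial c = List.exists (fun x -> x = c) specials

// Translate the escape letter into the character it stands for;
// everything else stands for itself.
let unescape c =
    match c with
    | 'n' -> '\n'
    | 'r' -> '\r'
    | 't' -> '\t'
    | other -> other
```

In current versions of FSharp.Core, List.contains c specials is a shorter way to write the same membership test.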

For the character code clauses, I wrote a special local AP function called InRange to determine if the specified character was within a specified range and to convert it from a char to an int. Note, the way the production is written, the largest character code you can specify is octal 277 (decimal 191), which means you can encode slightly more than the standard 7-bit ASCII character set. Honestly, this should be updated to support full UTF-16, but I’m not here to critique the grammar, so I didn’t try to fix this issue.
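The character codes are octal, so the arithmetic behind that upper limit is easy to check. The octalToInt helper below is hypothetical; it applies the same digit-by-digit weighting (64, 8, 1) that the three-digit match clause uses.

```fsharp
// Hypothetical helper: fold octal digits into an int, most significant digit first.
let octalToInt (digits: string) =
    digits |> Seq.fold (fun acc d -> acc * 8 + (int d - int '0')) 0

let largest = octalToInt "277"   // 2*64 + 7*8 + 7 = 191
let asChar = char largest        // the character with code 191
```

Since the first digit is limited to 0 through 2, the highest reachable code is 191, which is above the 127 limit of 7-bit ASCII but far short of the full range of a .NET char.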

Note, all the results (save None) return a tuple of the matched character value and the remaining input buffer. Again, all the remaining productions will work like that. For example, here’s the Range production:

///Range <- Char '-' Char / Char
let (|Range|_|) input =
    match input with
    | Char (c1, TOKEN "-" (Char (c2, input))) ->  
        Some(Range.Dual (c1, c2), input)
    | Char (c1, input) ->  
        Some(Range.Single (c1), input)
    | _ -> None

Compared to Char, Range is fairly simple. It’s either two chars, separated by a hyphen (for example: a-z) or it’s a single char. Again, being able to use Active Patterns to build on lower level productions is a huge helper.

But what does this function return? What do Range.Single and Range.Dual mean? Those refer to a special F# construct called a discriminated union. Before we can continue writing semantic productions, we need to define these types to hold the results of these productions.

Practical F# Parsing: Syntactical Productions (2)

Now that I’ve moved over to Active Patterns, I want to go back and finish the syntactic productions for my PEG parser. Most of the syntactic productions are very straightforward when implemented in AP. We’ve seen EndOfFile, EndOfLine and Space already. There is also a series of symbol identifiers that have only a single match clause. For example, here’s DOT:

///DOT <- '.' Spacing
let (|DOT|_|) input =
    match input with
    | TOKEN "." (Spacing(input)) -> Some(input)
    | _ -> None

I’m not going to go thru all the symbol AP functions since they’re all basically like this one. However, you’ll notice that this function references an AP we haven’t seen yet – Spacing. I want to close out the section on Syntactical Productions by looking at the Spacing and Comment productions. Since Spacing depends on Comment, I’ll start with Comment.

Comments in PEG grammars are single lines that start with a # symbol, similar to the // line comments in F# and C#. This is the PEG grammar rule for Comment:

///Comment <- '#' (!EndOfLine .)* EndOfLine

Basically, this says that a comment starts with a #, then matches zero or more characters that are not EndOfLine, and ends with an EndOfLine. The exclamation point is a syntactic predicate, which means that we unconditionally backtrack after attempting to match. PEG has both a success and failure syntactic predicate – the ! is the failure predicate while & is the success predicate. So inside the parens, this production rule says to test the current point in the parse buffer for EndOfLine. If it finds it, the match fails and we exit out of the parens (where we match EndOfLine again, without backtracking this time). If it doesn’t find it, the parser backtracks, consumes the next character regardless of what it is, then repeats.
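Because the parse buffer is an immutable list, backtracking costs nothing: a predicate just hands back the input it was given. Here is a sketch of both predicates as plain functions. The Matcher alias and the notP, andP, endOfLine and untilEol names are all hypothetical, and endOfLine only recognizes '\n' to keep the example short.

```fsharp
// A matcher takes the parse buffer and returns the remaining buffer on success.
type Matcher = char list -> char list option

/// !p : succeed, consuming nothing, only when p fails.
let notP (p: Matcher) : Matcher =
    fun input ->
        match p input with
        | Some _ -> None
        | None -> Some input

/// &p : succeed, consuming nothing, only when p succeeds.
let andP (p: Matcher) : Matcher =
    fun input ->
        match p input with
        | Some _ -> Some input
        | None -> None

// Hypothetical stand-in for EndOfLine: just '\n'.
let endOfLine : Matcher =
    function
    | '\n' :: rest -> Some rest
    | _ -> None

/// (!EndOfLine .)* : consume characters until EndOfLine or the buffer runs out.
let rec untilEol (input: char list) =
    match notP endOfLine input, input with
    | Some _, _ :: rest -> untilEol rest
    | _ -> input

let remaining = untilEol (List.ofSeq "ab\ncd")
// remaining = ['\n'; 'c'; 'd']
```

Note that untilEol stops in front of the newline without consuming it, which is exactly what the ! predicate is for.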

Unfortunately, there’s a bug in this production. If the parse buffer ends in a comment, the production will fail since it hasn’t reached the EndOfLine and there are no more characters to consume. So I changed the production to:

///Comment <- '#' (!(EndOfLine / EndOfFile) .)* EndOfLine?

This rule now ends the comment if it reaches an EndOfLine or EndOfFile. Additionally, it makes the final EndOfLine match optional. So if the comment ends with a new line, the new line is consumed as part of the grammar production. If the comment ends with the end of file, the EndOfFile is not consumed as part of the production. If you’ll recall, EndOfFile returns Some(unit) rather than Some(char list). In F#, the various branches of a match expression have to have the same return type, so you can’t return Some() from one branch and Some(input) from another. It’s no big deal – you use the EndOfFile production at the top-level grammar to ensure you’ve consumed the entire file anyway.

Here’s the F# implementation of Comment. It defines a local AP function called CommentContent, which implements the part of the grammar production inside the parens.

///Comment <- '#' (!(EndOfLine / EndOfFile) .)* EndOfLine?
let (|Comment|_|) input =  
    let rec (|CommentContent|_|) input =  
        match input with
        | EndOfLine (input) -> Some(input)
        | EndOfFile -> Some(input)
        | NC (_,input) -> (|CommentContent|_|) input
        | _ -> None
    match input with
    | TOKEN "#" (CommentContent (input)) -> Some(input)
    | _ -> None

Local AP function CommentContent recurses thru the input buffer after the pound sign, looking for EndOfLine or EndOfFile. This function should never match the final default clause, but I put it in to keep the compiler from complaining. Notice that I use symbol redefinition here so both EndOf match clauses return Some(input). For EndOfLine, I’m re-defining input to mean what is returned by EndOfLine. For EndOfFile, I’m not re-defining, so input still means the list that is passed into the pattern match statement.
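To see how the end-of-file case behaves, here is a self-contained sketch of the same idea. The EndOfFile and EndOfLine definitions are assumptions about the helpers from earlier posts, and the '#' is matched with a plain list pattern instead of TOKEN so that the block stands alone.

```fsharp
// Assumed helper shapes; the originals come from earlier posts.
let (|EndOfFile|_|) (input: char list) =
    match input with
    | [] -> Some ()
    | _ -> None

let (|EndOfLine|_|) (input: char list) =
    match input with
    | '\r' :: '\n' :: rest -> Some rest
    | '\n' :: rest -> Some rest
    | '\r' :: rest -> Some rest
    | _ -> None

let (|Comment|_|) (input: char list) =
    let rec (|CommentContent|_|) input =
        match input with
        | EndOfLine input -> Some input   // input now means the buffer after the newline
        | EndOfFile -> Some input         // input is still the (empty) buffer passed in
        | _ :: input -> (|CommentContent|_|) input
        | [] -> None                      // unreachable: EndOfFile already covers []
    match input with
    | '#' :: CommentContent rest -> Some rest
    | _ -> None

let withNewline = (|Comment|_|) (List.ofSeq "# hi\nx")   // Some ['x']
let atEndOfFile = (|Comment|_|) (List.ofSeq "# hi")      // Some []
let notAComment = (|Comment|_|) (List.ofSeq "x")         // None
```

A comment that runs into the end of the buffer now matches and leaves an empty buffer behind, which the top-level EndOfFile check can then consume.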

Compared to Comment, Spacing is pretty trivial:

///Spacing <- (Space / Comment)*
let rec (|Spacing|) input =  
    match input with
    | Space (input) -> (|Spacing|) input
    | Comment (input) -> (|Spacing|) input
    | _ -> input

There are two things I want to call out about spacing. First, it’s a recursive function, so it’s defined with let rec. AP functions can be recursive, just like normal functions. Also, note the lack of an underscore in the name of this AP function. Spacing is defined as zero or more spaces or comments, so it’s perfectly valid to match nothing. Thus, Spacing is always a successful match. In this case, we don’t put the underscore in the AP function name and we don’t wrap the return result in Some(). You’ll notice the last match clause simply returns input, rather than Some(input).
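The difference between the two kinds of AP function is easier to see side by side. In this sketch SkipBlanks, Letter and firstLetter are hypothetical names; SkipBlanks is a total pattern in the style of Spacing, while Letter is an ordinary partial pattern.

```fsharp
/// Total pattern: skips leading blanks and always succeeds, so no option is returned.
let rec (|SkipBlanks|) (input: char list) =
    match input with
    | (' ' | '\t') :: rest -> (|SkipBlanks|) rest
    | _ -> input

/// Partial pattern for contrast: it may fail, so it returns an option.
let (|Letter|_|) (input: char list) =
    match input with
    | c :: rest when System.Char.IsLetter c -> Some (c, rest)
    | _ -> None

// A total pattern nests inside a match exactly like a partial one.
let firstLetter (text: string) =
    match List.ofSeq text with
    | SkipBlanks (Letter (c, _)) -> Some c
    | _ -> None

let found = firstLetter "   x1"   // Some 'x'
let missing = firstLetter "  1"   // None
```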

That’s all the syntactic productions. Next up, the meat of the grammar: semantic productions.