Practical F# Parsing: Syntactical Productions (2)

Now that I’ve moved over to Active Patterns, I want to go back and finish the syntactic productions for my PEG parser. Most of the syntactic productions are very straightforward when implemented in AP. We’ve seen EndOfFile, EndOfLine and Space already. There is also a series of symbol identifiers that have only a single match clause. For example, here’s DOT:

///DOT <- '.' Spacing
let (|DOT|_|) input =
    match input with
    | TOKEN "." (Spacing(input)) -> Some(input)
    | _ -> None
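
For illustration, here's a sketch of what another symbol production, SLASH, would presumably look like, following exactly the same shape (this particular one isn't shown in the post):

///SLASH <- '/' Spacing
let (|SLASH|_|) input =
    match input with
    | TOKEN "/" (Spacing(input)) -> Some(input)
    | _ -> None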

I’m not going to go through all of the symbol AP functions, since they’re all basically like these. However, you’ll notice that these functions reference an AP we haven’t seen yet – Spacing. I want to close out the section on syntactical productions by looking at the Spacing and Comment productions. Since Spacing depends on Comment, I’ll start with Comment.

Comments in PEG grammars are single lines that start with a # symbol, similar to the // line comments in F# and C#. This is the PEG grammar rule for Comment:

///Comment <- '#' (!EndOfLine .)* EndOfLine

Basically, this says that a comment starts with a #, then matches zero or more characters that are not EndOfLine, and ends with an EndOfLine. The exclamation point is a syntactic predicate, which means that we unconditionally backtrack after attempting the match. PEG has both a success and a failure syntactic predicate – ! is the failure predicate, while & is the success predicate. So inside the parens, this production rule says to test the current point in the parse buffer for EndOfLine. If it finds one, the match fails and we exit out of the parens (where we match EndOfLine again, without backtracking it this time). If it doesn’t find one, the parser backtracks, consumes the next character regardless of what it is, then repeats.
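
To make the “match, then unconditionally backtrack” idea concrete, here's a quick sketch (not part of the original grammar or code) of what a success predicate like &EndOfLine could look like as an AP, using the EndOfLine pattern from the earlier posts – it tests whether an EndOfLine is next but hands back the original, un-consumed input:

//Hypothetical sketch of a success predicate (&EndOfLine); not one of the PEG grammar productions
let (|AndEndOfLine|_|) input =
    match input with
    | EndOfLine (_) -> Some(input)  //EndOfLine matched, but hand back the original, un-consumed input
    | _ -> None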

Unfortunately, there’s a bug in the Comment production. If the parse buffer ends in a comment, the production will fail, since it hasn’t reached an EndOfLine and there are no more characters to consume. So I changed the production to:

///Comment <- '#' ((!EndOfLine / !EndOfFile) .)* EndOfLine?

This rule now ends the comment if it reaches an EndOfLine or EndOfFile. Additionally, it makes the final EndOfLine match optional. So if the comment ends with a new line, the new line is consumed as part of the grammar production. If the comment ends with the end of file, the EndOfFile is not consumed as part of the production. If you’ll recall, EndOfFile returns Some(unit) rather than Some(char list). In F#, the various clauses of a match expression have to have the same return type, so you can’t return Some() from one branch and Some(input) from another. It’s no big deal – you use the EndOfFile production at the top-level grammar to ensure you’ve consumed the entire file anyway.
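
If you're curious how that plays out, here's a hypothetical sketch (not from the original series) of such a top-level check: since EndOfFile returns Some(unit), it can't hand back any remaining input – it just confirms there's nothing left once, say, Spacing has run.

//Hypothetical top-level check: matches only if nothing but spacing and comments remain
let isFullyConsumed input =
    match input with
    | Spacing (EndOfFile) -> true
    | _ -> false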

Here’s the F# implementation of Comment. It defines a local AP function called CommentContent, which implements the part of the grammar production inside the parens:

///Comment <- '#' ((!EndOfLine / !EndOfFile) .)* EndOfLine?
let (|Comment|_|) input =  
    let rec (|CommentContent|_|) input =  
        match input with
        | EndOfLine (input) -> Some(input)
        | EndOfFile -> Some(input)
        | NC (_,input) -> (|CommentContent|_|) input
        | _ -> None
    match input with
    | TOKEN "#" (CommentContent (input)) -> Some(input)
    | _ -> None

The local AP function CommentContent recurses through the input buffer after the pound sign, looking for EndOfLine or EndOfFile. This function should never match the final default clause, but I put it in to keep the compiler from complaining. Notice that I use symbol redefinition here so that both the EndOfLine and EndOfFile match clauses return Some(input). For EndOfLine, I’m re-defining input to mean what is returned by EndOfLine. For EndOfFile, I’m not re-defining it, so input still means the list that is passed into the pattern match statement.
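
As a quick sanity check in the style of the earlier unit tests – assuming the !! operator from the parse buffer posts, which converts a string into the char list parse buffer – a hypothetical test for Comment might look like this:

[<Fact>]
let test_Comment_consumes_through_end_of_line () =
    //"# a comment\n" should be consumed, leaving "rest" in the parse buffer
    let ret = (|Comment|_|) !!"# a comment\nrest"
    Assert.Equal(Some(!!"rest"), ret)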

Compared to Comment, Spacing is pretty trivial:

///Spacing <- (Space / Comment)*
let rec (|Spacing|) input =  
    match input with
    | Space (input) -> (|Spacing|) input
    | Comment (input) -> (|Spacing|) input
    | _ -> input

There are two things I want to call out about Spacing. First, it’s a recursive function, so it’s defined with let rec. AP functions can be recursive, just like normal functions. Also, note the lack of an underscore in the name of this AP function. Spacing is defined as zero or more spaces or comments, so it’s perfectly valid for it to match nothing. Thus, Spacing always matches successfully. In this case, we don’t put the underscore in the AP function name and we don’t wrap the return result in Some(). You’ll notice the last match clause simply returns input, rather than Some(input).
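
Calling it directly shows the difference (again a hypothetical test, assuming the !! operator from the earlier posts): because Spacing is a total pattern, it returns the remaining input itself rather than an option.

[<Fact>]
let test_Spacing_skips_spaces_and_comments () =
    //leading spaces and the comment line are skipped, leaving "rest";
    //no option wrapper, since Spacing always matches
    let ret = (|Spacing|) !!"  # skip me\nrest"
    Assert.Equal<char list>(!!"rest", ret)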

That’s all the syntactical productions. Next up, the meat of the grammar: the semantic productions.

Morning Coffee 131

  • On a recommendation from my mother-in-law, I’ve been watching Torchwood. It’s sort of Men in Black, the series, but set in Cardiff. Since it’s made in Britain, it’ll be one of the few shows still running in the new year due to the WGA strike.
  • A while back I pointed out that many DotNetKicks articles were submitted by their authors. I submitted a few of my own, just for kicks (har har), with mixed results. Today, I discovered that the parse buffer post from my Practical Parsing in F# series was submitted, picked up some kicks, and made it to the home page. That’s pretty cool. I guess writing more dev-focused articles is the way to go to get attention on DNK.
  • Amazon has rolled out a limited beta of SimpleDB, which appears to be S3 + query support. Cost is based on usage: 14¢/hour for machine utilization, 10¢/GB upload, 13-18¢/GB download and $1.50/GB storage/month. I’d love to see SimpleDB software that I could download and install, rather than hosted only. Even if I was going to use the hosted service, I’d like to develop against a non-hosted instance.
  • Research for sale! I was checking out the MS Research download feed and discovered a link to the Automatic Graph Layout (MSAGL) library. This was previously called GLEE (Graph Layout Execution Engine) and was “free for non-commercial use”. Now, you can buy it for $295 from Windows Marketplace (though the previous free version is still available). The idea of directly commercializing research like this strikes me as pretty unusual. It must be a really good library.
  • Scott Guthrie shows off the new Dynamic Data Support that will ship as part of the ASP.NET Extensions. I’m like, whatever. Scaffolding wasn’t all that interesting to me in RoR, so it’s no surprise that it’s not that interesting in ASP.NET.
  • Jeff “Party With” Palermo blogs about the IoC support in the new MVC Contrib project. Also looks like they’re porting RoR’s simply_restful. (via Scott Guthrie)
  • I need to try out some of Tomas Restrepo’s VS color schemes. (also via Scott Guthrie)

Practical F# Parsing: Active Patterns

In the last post, I gave you a sneak preview of what the EndOfLine production would look like using Active Patterns. But before we get to how to build that, let me give you a little background on why. If you want the full explanation, check out the whitepaper.

Basically, Active Patterns (aka AP) are a way to use the pattern matching of functional languages with abstractions rather than native language types. If you’ll recall, I built functions to abstract the parse buffer so I could later change its implementation if I needed to. The problem is that since the parse buffer is an abstraction, you can’t use it in the match clauses. For example, here’s a version of EndOfLine that uses a native char list.

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match input with
    | '\r' :: '\n' :: input -> Some(input)
    | '\n' :: input -> Some(input)
    | '\r' :: input -> Some(input)
    | _ -> None

That’s straightforward like the AP preview at the end of the last post, but I’ve lost the benefit of the parse buffer abstraction. In other words, if I wanted to change the implementation of the parse buffer to a string, or some other type, I’d be screwed if I wrote EndOfLine this way. Traditionally, functional language developers had an either/or choice when it came to abstractions vs. pattern matching. AP lets you use both.

Using a special syntax, you can indicate that an F# function is an AP by surrounding the name in what Don calls “bananas”. Here’s the AP version of NC:

let (|NC|_|) input =
    match input with
    | i :: input -> Some(i, input)
    | [] -> None

This function is identical to the one defined in the first post, except for the name. By surrounding the actual name in paren/pipe “bananas”, you’re indicating the function can be used in match clauses, not just the match input. The trailing underscore in the name indicates this is a partial pattern, which means it returns an option value (aka Some(_) or None).

There’s no reason why you can’t use an AP function like any other function. I find I do this often in my unit tests. Here’s an updated version of an NC unit test.

[<Fact>]
let test_NC_empty_string () =
    let ret = (|NC|_|) !!""
    Assert.Equal(None, ret)

While you can still call the function like this, the primary benefit of Active Patterns is that you can use the function in pattern match clauses directly, which allows the production clauses to mirror the actual grammar rules. For simple productions like EndOfFile and EndOfLine, the AP F# implementation isn’t much more complex than the grammar rule itself:

///EndOfFile <- !.
let (|EndOfFile|_|) input =
    match input with
    | NC (_) -> None
    | _ -> Some()  

///EndOfLine <- '\r\n' / '\n' / '\r'
let (|EndOfLine|_|) input =
    match input with
    | TOKEN "\r\n" (input) -> Some(input)
    | TOKEN "\n" (input) -> Some(input)
    | TOKEN "\r" (input) -> Some(input)
    | _ -> None

You can see that in these functions, the calls to NC and TOKEN appear in the match clauses (i.e. after the pipe) rather than in the match input (i.e. between match and with). Note that when used in a match clause, you just use the name directly, without the bananas.

You’ll notice that for TOKEN, the token string to match goes outside the parentheses. This is because "\r\n" is an input parameter to the TOKEN AP function. Alternatively, I could have written EndOfLine using only the NC function, though I find the TOKEN version easier to read.

///EndOfLine <- '\r\n' / '\n' / '\r'
let (|EndOfLine|_|) input =
    match input with
    | NC ('\r', NC ('\n', (input))) -> Some(input)
    | NC ('\n', input) -> Some(input)
    | NC ('\r', input) -> Some(input)
    | _ -> None

In this version, the values of '\r' and '\n' are pattern matched against the result of calling NC, so they go inside the parentheses. In other words, the TOKEN clauses are matched if TOKEN returns some value, while the NC clauses are only matched if the returned result matches the value specified inside the parentheses. TOKEN has two parameters, the token string and the parse buffer, while NC only has the parse buffer. When you write an AP function, the last parameter gets bound to the match clause input. Additional parameters, like TOKEN’s token string, must be specified in the match clause.
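
The TOKEN function itself was written back in the earlier parse buffer posts, so it doesn’t get repeated here. As a rough sketch of what a parameterized partial AP along those lines might look like (just a sketch, not necessarily the original implementation):

//Sketch of a TOKEN-like parameterized AP, assuming the parse buffer is a char list.
//The extra token parameter comes before the input, which is why the token string
//sits outside the parens in a match clause like TOKEN "\r\n" (input).
let (|TOKEN|_|) (token: string) (input: char list) =
    let rec consume expected remaining =
        match expected, remaining with
        | [], rest -> Some(rest)  //matched the whole token, return the rest of the buffer
        | e :: expected', r :: remaining' when e = r -> consume expected' remaining'
        | _ -> None
    consume (List.ofSeq token) input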

Notice I’ve defined these grammar productions as active patterns as well, which will make them compose nicely with higher-order productions. For example, here’s the Space grammar production, which reuses EndOfLine:

///Space <- ' ' / '\t' / EndOfLine
let (|Space|_|) input =
    match input with
    | TOKEN " " (input) -> Some(input)
    | TOKEN "\t" (input) -> Some(input)
    | EndOfLine (input) -> Some(input)
    | _ -> None

It’s DSL-esque, wouldn’t you say? Active Patterns are a little parens-heavy – the NC version of EndOfLine has nested APs that aren’t exactly easy on the eyes. However, the concept is very solid, and it makes the parsing code almost easier to write by hand than it would be to use a parser generator like yacc. Almost.

Practical F# Parsing: Syntactical Productions (1)

Now that I have my parse buffer functions and unit test framework taken care of, it’s time to write some parsing expression grammar productions. In any grammar, there are productions that have semantic meaning and productions that are only used to specify syntax. For example, in C#, statements end with a semicolon and code blocks are delineated with curly braces. Syntactical elements don’t end up in the abstract syntax tree, so they’re easier to implement.

Most of PEG’s grammar productions are syntactical in nature. Let’s start with the simplest production, EndOfFile.

///EndOfFile <- !.
let EndOfFile input =
    match NC input with
    | None -> Some()
    | _ -> None

Similar to the underscore in F#, the period represents any character in the PEG grammar, and the exclamation point is a backtracking not. In other words, EndOfFile succeeds if NC fails, which only happens when the parse buffer is empty. Since the parse buffer is empty on a successful match, there’s no value to return, so I return Some(). Some() is kinda like a Nullable that carries no value, which doesn’t make much sense in C#. But here it means I have a successful match but no data to pass back to the caller.

Here’s a slightly more complex grammar production, EndOfLine:

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match NC input with
    | Some('\r', input) ->
        match NC input with
        | Some('\n', input) -> Some(input)
        | _ -> Some(input)
    | Some('\n', input) -> Some(input)
    | _ -> None

EndOfLine has three possible matches: \r\n, \n and \r. Here, you see the use of nested match statements in the \r match clause. If the top character is \r, the function checks the top of the remaining text buffer for \n. F#’s ability to reuse symbols comes in very handy here.

Unfortunately, this code isn’t very readable. In particular, the match priority order, which matters in PEGs, is totally lost. For this production, it’s no big deal, but that won’t always be the case. I’m also not using the handy TOKEN function I wrote a few posts ago. Here’s an alternative version that uses TOKEN and preserves the match priority order.

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match TOKEN "\r\n" input with
    | Some(input) -> Some(input)
    | _ ->
        match TOKEN "\n" input with
        | Some(input) -> Some(input)
        | _ ->
            match TOKEN "\r" input with
            | Some(input) -> Some(input)
            | _ -> None

This time, I use the TOKEN function and nested match statements to chain the results together. The previous method is probably faster, but this approach is more readable and explicitly preserves the match order. However, it sure is a pain to type. Wouldn’t it be great if we could write something like this:

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match input with
    | TOKEN "\r\n" (input) -> Some(input)
    | TOKEN "\n" (input) -> Some(input)
    | TOKEN "\r" (input) -> Some(input)
    | _ -> None

Turns out, F#’s Active Patterns feature lets me implement EndOfLine exactly like this. We’ll look at how it works in the next post.

Morning Coffee 130

  • Michael Klucher announces the release of XNA Game Studio 2.0 and Major Nelson points to the press release announcing the release. You can get the bits from XNA Creators Club Online (the XNA dev center has yet to be updated).
  • Speaking of XNA, David Weller points out the warm-up challenge for Dream-Build-Play 2008. I assume networking will be a big part of this year’s entries, but the warm-up challenge is to “Create a new and innovative use of Artificial Intelligence in a game”.
  • Still speaking of XNA, Gamasutra has an interview with XNA GM Chris Satchell where he hints at a publishing channel for XNA games on the Xbox 360, with “full details” coming sometime in the new year.
  • The Capitals beat the Rangers in overtime last night. Since changing coaches on Thanksgiving, they’re 6-3-1. That’s great, but they’re still five games under .500. The good news is that even though the Caps tied for last in the league, they’re only six points out of a playoff spot with about fifty games left in the season.
  • My old team puts on an event every year called the Strategic Architects Forum. It’s invite-only, but they’ve posted some of the videos, slides and transcripts from this year’s event.
  • J.D. Meier discusses the new Guidance Explorer release. They’re now up to 3,000 “nuggets” of guidance and they’ve moved the guidance store to MSDN. (via Sam Gentile)
  • Arnon Rotem-Gal-Oz explains further why arbitrary tier-splitting is bad. I’d also suggest reading Chapter 7 of PoEAA which provides another version of the same story: You can’t take an object that’s designed for fine-grained local access and make it remote without really screwing yourself up.
  • Eric Lippert thinks immutable data structures are “the way of the future in C#” so he’s written a series on immutability. Posts include kinds of immutability, an immutable stack, an immutable covariant stack and an immutable queue. As I’ve discussed, immutable data structures are HUGE in functional programming. Eric’s immutable stacks and queues are similar to F#’s native list type. (via Jomo Fisher)