Practical F# Parsing: Recursion and Predicate Functions

To prep for my Lang.NET talk, I went back and reviewed my PEG parser. One thing I was not happy with was that all the recursion was handled in a one-off manner. When I needed to match multiple characters in the comment rule, I wrote a special one-off function to recursively process the comment until it reached an EOL. When I needed to parse a series of ranges, characters or definitions, I wrote special one-off functions to handle that recursion. Obviously, that’s not the best approach. So, I wrote the following active pattern functions to handle recursion.

//ZOM == Zero Or More
let rec (|ZOM|) f input =
    match f input with
    | Some(i,input) ->
        let j,input = (|ZOM|) f input
        (i :: j, input)
    | None -> [], input

//OOM == One Or More
let (|OOM|_|) f input =
    match (|ZOM|) f input with
    | [], input -> None
    | v, input -> Some(v,input)

//ZOO == Zero Or One
let (|ZOO|) f input =
    match f input with
    | Some(i,input) -> Some(i), input
    | None -> None,input
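
To make the usage concrete, here's a quick sketch with a throwaway matchDigit function. It's not part of the parser, and it assumes a plain char list as the parse buffer, but it shows how a single-item matcher plugs into these combinators:

//Hypothetical single-item matcher, for illustration only:
//consume one digit from the front of a char list
let matchDigit input =
    match input with
    | c :: rest when System.Char.IsDigit(c) -> Some(c, rest)
    | _ -> None

//(|ZOM|) keeps applying matchDigit until it fails, collecting the results.
//Here digits = ['1'; '2'] and rest = ['a'].
let digits, rest = (|ZOM|) matchDigit ['1'; '2'; 'a']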

With these functions at the ready, I can stop writing one-off recursion functions. Instead, I write a function that matches a single item, which I pass as an argument to one of the three functions above. For example, here are the original and new versions of the top-level Grammar function.

//Original version
let (|Grammar|_|) input =
    let rec ParseDefinitions dl input =
        match input with
        | Definition (d, input) -> ParseDefinitions (dl @ [d]) input
        | _ -> Some(dl, input)
    let (|OneOrMoreDefinitions|_|) input =
        match input with
        | Definition (d, input) -> ParseDefinitions [d] input
        | _ -> None
    match input with
    | Spacing (OneOrMoreDefinitions (dl, EndOfFile)) ->
          Some(List.to_array dl)
    | _ -> None

//New Version
let (|Grammar|_|) = function
    | Spacing (OOM (|Definition|_|) (dl, EndOfFile)) ->
          Some(List.to_array dl)
    | _ -> None

The new version is much shorter, because there’s already a function to match a single definition, which we can pass into OneOrMore (aka OOM). Note that when I pass an active pattern function as a parameter, I have to use its real name (pipes, underscore and all). Having to use the real name is pretty ugly, but F# needs to be able to differentiate between using a function as an active pattern and using it as a function parameter. If you could just write OOM Definition (dl, EndOfFile), how would F# know whether Definition was a parameter or a nested pattern?
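
One possible way to hide the pipes at the call site (untested on my part, so treat it as a sketch rather than something I've verified against the compiler) is to bind the active pattern to an ordinary value first and pass that:

//Untested sketch: alias the active pattern as a plain value so the match
//clause doesn't need the (|...|_|) syntax
let parseDefinition = (|Definition|_|)

let (|Grammar|_|) = function
    | Spacing (OOM parseDefinition (dl, EndOfFile)) ->
          Some(List.to_array dl)
    | _ -> None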

I also defined syntactic predicate functions. If you’ll recall, these syntactic predicates will try to match but automatically backtrack, returning success or failure depending on which function you called.

//FP == Failure Predicate
let (|FP|_|) f input =
    match f input with
    | Some(_) -> None
    | None -> Some(input)

//SP == Success Predicate
let (|SP|_|) f input =
    match f input with
    | Some(_) -> Some(input)
    | None -> None
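
The backtracking is easy to sanity-check with the throwaway matchDigit matcher from the sketch above: whether the inner match succeeds or fails, the predicate hands back the original, unconsumed input.

//SP succeeds because a digit is on top, but consumes nothing: Some(['1'; 'x'])
let sp = (|SP|_|) matchDigit ['1'; 'x']

//FP fails here, because a digit *was* matched: None
let fp = (|FP|_|) matchDigit ['1'; 'x']

//FP succeeds on non-digit input, again consuming nothing: Some(['x'])
let fp2 = (|FP|_|) matchDigit ['x']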

To see these in action, here are the original and updated versions of the Primary function. Only the first rule is relevant, so I’ve omitted the others.

//Original version
let (|Primary|_|) input =
    let (|NotLEFTARROW|_|) input =
        match input with
        | LEFTARROW (_) -> None
        | _ -> Some(input)
    match input with
    | Identifier (id, NotLEFTARROW (input)) ->
        Some(Primary.Identifier(id), input)
    //rest of function omitted for clarity

//new version
let (|Primary|_|) = function
    | Identifier (id, FP (|LEFTARROW|_|) (input)) ->
          Some(Primary.Identifier(id), input)
    //rest of function omitted for clarity

Instead of writing a special function to match “not left arrow”, I just pass the left arrow function as a parameter to Failure Predicate (aka FP). With these recursion and syntactic predicate functions, I was able to remove all the one-off recursion functions from my parser. (Note, I posted an updated version of PegParser on my SkyDrive so you can see this in action.)

These five functions significantly reduced the complexity of the code. Unfortunately, I’m not sure it’s much easier to read. The conciseness is offset IMO by the ugliness of using the active patterns’ true names. Also, I would have liked to use custom operators for these five functions, but operators aren’t allowed to be active pattern functions. Hopefully, that will change at some point in the future, though if we’re going to dream of better syntax, can we do something about all the parens? Personally, I’d love to be able to write the following:

//This doesn't work, but I can dream, can't I?
let (|Primary|_|) = function
    | Identifier (id) !!LEFTARROW (input) ->
        Some(Primary.Identifier(id), input)
    //rest of function omitted for clarity

let (|Grammar|_|) = function
    | Spacing ++Definition (dl) EndOfFile ->
        Some(List.to_array dl)
    | _ -> None

Note to self: talk to the F# team members who come to Lang.NET about this…

Morning Coffee 137

  • Note, I somehow duplicated Morning Coffee 135. So I’ve skipped 136 to make up for it.
  • Congrats to Hillary Clinton for her unexpected win in the New Hampshire primary. As I said last week, I think Obama has a better chance of winning in November, but I’ve got nothing against Clinton or her politics.
  • Speaking of winning, congrats to LSU on winning the BCS. Are they the best team in college football? Personally, I don’t think so – there are at least three other teams (Georgia, West VA and of course USC) who can make a persuasive argument that they should be #1. But after losing to teams like Pitt and Stanford, neither WVA nor USC has much of an argument that they should have been in the championship game. But that’s what makes the BCS such BS. If nothing else, at least the “we need a playoff” meme is picking up steam.
  • This is sort of cool: Eye-fi is a wireless enabled SD card so you can wirelessly upload pictures from your camera to your PC or favorite photo service. However, I think the price needs to come down a bit. I recently bought a 2GB SD card for my wife’s new camera for $20. A 2GB Eye-fi card is $99. Not sure wireless upload is worth 5x per card.
  • With all the focus on LINQ providing type-safe queries, it’s easy to forget that some apps do need to build their queries at run time. Scott Guthrie points at a Dynamic LINQ C# sample (also available for VB) that builds LINQ expression trees from strings. It kinda takes you back to the bad-old-days of embedding SQL strings in your code, but there are scenarios – especially BI scenarios – where you need this capability.
  • Soma announces the VC++ 2008 Feature Pack Beta. This is the long-awaited (by who?) MFC update as well as support for the C++ TR1. TR1 provides some FP-esque support like function objects and tuples, so maybe this is worth a look. On the other hand, given that much (all?) of TR1 is lifted from Boost, maybe we should just use that.
  • Speaking of cool libraries, check out C5 (aka the Copenhagen Comprehensive Collection Classes for C#). It’s basically a complete redesign of System.Collections.Generic (or SCG as they call it). I’ve read thru their online book and I’m very impressed. Of course, with me focused on F# of late, I’m primarily using immutable collections, so I’m not sure how much use I have for C5 right now.
  • There was a free CoDe magazine in my DevTeach bag back in November with a fascinating article on where LINQ goes from here – LINQ 2.0 if you will. One of the things the article discusses is tier-splitting, which has seen the light of day in Volta. Will Volta also deliver External Relationships, Reshaping Combinators and Join Patterns, or will those come from different projects?
  • I had to pave my workstation yesterday. I was running an interim build of Vista x64 SP1 and I couldn’t make Virtual Server work with it. As part of the repave, I discovered I needed to update the firmware of my SCSI controller, but the update had to run under DOS. Freaking DOS? My workstation doesn’t even have a floppy drive to boot DOS from! However, I was able to boot from a USB thumb disk instead. That’s damn useful.

Morning Coffee 135

  • Congrats to Barack Obama for walking away with the Iowa Democratic Caucus, which set turnout records. Frankly, I’m pretty cool with any of the Democratic front runners, but I think Obama has the best chance of winning in November. I’m not sure Edwards’ second time around will be any more successful than the last, and I believe Clinton would drive the GOP GOTV campaign better than any of the actual GOP candidates would.
  • Obviously, I like to play M-rated games like Bioshock and Mass Effect. But I also like games I can play with my kids like Lego Star Wars. There are two new Lego games coming out this year: Lego Indiana Jones and Lego Batman. I can’t wait.
  • Speaking of gaming, Xbox LIVE had some issues over the holiday break, due to record-setting sign-ups and concurrent users. Record-setting numbers are a nice problem to have if you’re on the business side, but not so nice if you’re a customer or work in operations. The XBL GM announced they’re offering a “token of appreciation” for everyone’s patience – a free XBLA game. Assuming it’s not a crappy game, it’s a classy move.
  • I watched Transformers on HD-DVD last night. Fun movie with lots of action, but man is it dumb. John Turturro is the only real stand-out.
  • Dustin Campbell implements cons, cdr and car from Scheme in C# and VB. While of limited production value (Dustin specifically warns readers not to use any of his code), it really demonstrates how different the functional world is from the object/imperative one, right down to the concept of type. Cons doesn’t return a tuple, it returns a function with two bound variables. (via DNK)

Practical F# Parsing: Syntactical Productions (1)

Now that I have my parse buffer functions and unit test framework taken care of, it’s time to write some parsing expression grammar productions. In any grammar, there are productions that have semantic meaning and productions that are only used to specify syntax. For example, in C#, statements end with a semicolon and code blocks are delineated with curly braces. Syntactical elements don’t end up in the abstract syntax tree, so they’re easier to implement.

Most of PEG’s grammar productions are syntactical in nature. Let’s start with the simplest production, EndOfFile.

///EndOfFile <- !.
let EndOfFile input =
    match NC input with
    | None -> Some()
    | _ -> None

Similar to the underscore in F#, the period represents any character in the PEG grammar, and the exclamation point is a backtracking not. In other words, EndOfFile succeeds if NC fails, which only happens when the parse buffer is empty. Since the parse buffer is empty on a successful match, there’s no value to return, so I return Some(). Some() is kinda like a Nullable with no underlying value, which doesn’t make much sense in C#. But here it means I have a successful match but no data to pass back to the caller.
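
To see why Some() is enough, here’s a throwaway helper (mine, not the parser’s) that consumes the result. The caller only cares whether the match succeeded; there’s nothing to bind:

//For illustration only: report whether we've reached the end of the buffer
let reportEnd input =
    match EndOfFile input with
    | Some() -> printfn "at end of input"
    | None -> printfn "more to parse"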

Here’s a slightly more complex grammar production, EndOfLine:

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match NC input with
    | Some('\r', input) ->
        match NC input with
        | Some('\n', input) -> Some(input)
        | _ -> Some(input)
    | Some('\n', input) -> Some(input)
    | _ -> None

EndOfLine has three possible matches: \r\n, \n and \r. Here, you see the use of nested match statements in the \r match clause. If the top character is \r, the function checks the top of the remaining text buffer for \n. F#’s ability to reuse symbols comes in very handy here.

Unfortunately, this isn’t code very readable. In particular, the match priority order, which matters in PEGs, is totally lost. For this production, it’s no big deal, but that won’t always be the case. I’m also not using the handy TOKEN function I wrote in a few posts ago. Here’s an alternative version that uses TOKEN and preserves the match priority order.

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match TOKEN "\r\n" input with
    | Some(input) -> Some(input)
    | _ ->
        match TOKEN "\n" input with
        | Some(input) -> Some(input)
        | _ ->
            match TOKEN "\r" input with
            | Some(input) -> Some(input)
            | _ -> None

This time, I use the TOKEN function and nested match statements to chain the results together. The previous method is probably faster, but this approach is more readable and explicitly preserves the match order. However, it sure is a pain to type. Wouldn’t it be great if we could write something like this:

///EndOfLine <- '\r\n' / '\n' / '\r'
let EndOfLine input =
    match input with
    | TOKEN "\r\n" (input) -> Some(input)
    | TOKEN "\n" (input) -> Some(input)
    | TOKEN "\r" (input) -> Some(input)
    | _ -> None

Turns out, F#’s Active Patterns feature lets me implement EndOfLine exactly like this. We’ll look at how it works in the next post.
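
As a teaser, here’s my guess at the shape of that change (a sketch on my part, not necessarily the code the next post will use): wrapping TOKEN in a partial active pattern is all it takes to make the match expression above legal F#.

//Sketch: expose TOKEN as a partial active pattern so it can be used
//directly in match clauses like the ones above
let (|TOKEN|_|) token input = TOKEN token input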

Morning Coffee 130

  • Michael Klucher announces the release of XNA Game Studio 2.0 and Major Nelson points to the press release announcing the release. You can get the bits from XNA Creators Club Online (the XNA dev center has yet to be updated).
  • Speaking of XNA, David Weller points out the warm-up challenge for Dream-Build-Play 2008. I assume networking will be a big part of this year’s entries, but the warm-up challenge is to “Create a new and innovative use of Artificial Intelligence in a game”.
  • Still speaking of XNA, Gamasutra has an interview with XNA GM Chris Satchell where he hints at a publishing channel for XNA games on the Xbox 360, with “full details” coming sometime in the new year.
  • The Capitals beat the Rangers in overtime last night. Since changing coaches on Thanksgiving, they’re 6-3-1. That’s great, but they’re still five games under .500. The good news is that even though the Caps are tied for last in the league, they’re only six points out of a playoff spot with about fifty games left in the season.
  • My old team puts on an event every year called the Strategic Architects Forum. It’s invite-only, but they’ve posted some of the videos, slides and transcripts from this year’s event.
  • J.D. Meier discusses the new Guidance Explorer release. They’re now up to 3,000 “nuggets” of guidance and they’ve moved the guidance store to MSDN. (via Sam Gentle)
  • Arnon Rotem-Gal-Oz explains further why arbitrary tier-splitting is bad. I’d also suggest reading Chapter 7 of PoEAA which provides another version of the same story: You can’t take an object that’s designed for fine-grained local access and make it remote without really screwing yourself up.
  • Eric Lippert thinks immutable data structures are “the way of the future in C#” so he’s written a series on immutability. Posts include kinds of immutability, an immutable stack, an immutable covariant stack and an immutable queue. As I’ve discussed, immutable data structures are HUGE in functional programming. Eric’s immutable stacks and queues are similar to F#’s native list type. (via Jomo Fisher)