Passion * Technology * Ruthless Competence

Monday, July 07, 2008

Morning Coffee 166

Yes, I realize it’s been a while. I tried in vain to catch up with my blog reading after my Hawaii vacation and finally just gave up and hit “mark all as read”.

Dynamic Languages

  • There's a new version of the DLR hosting spec available (doc, pdf). The DLR implementation is still in motion, so there are some inconsistencies between the spec and the code, but the spec should give you the high level overview you need if you want to host DLR languages inside your app.
  • Oleg Tkachenko recently joined the dynamic languages team. He's the creator of the Interactive IronRuby Web Shell, an IronRuby version of Try Ruby. Of course, it’s not as cool as using SL2to execute the code directly in the browser. Michael Foord has his Python in the Browser and my teammates John and Jimmy demoed a Silverlight version of Try Ruby @ TechEd.
  • Jim Deville, also of the dynamic languages team, recently started blogging.
  • I have a new boss, Dave Remy. He doesn't have a blog - yet - but you can follow him on Twitter as daveremy. When Twitter is actually working that is.
  • There's a new homepage/wiki for IronRuby though I’m not sure why there's a picture of Matz wearing a Python shirt on the home page.
  • My teammate Jimmy Schementi provides some "continued hope" for a better (heck, I'll take current) ASP.NET and ASP.NET MVC story for DLR languages.
  • Via Michael Foord, sounds like IronClad is making good progress. V0.4 can run the bz2 module "in its entirity" (maybe run a spellcheck on your site, guys?) and now apparently, it's now able to load numpy.core. Very exciting!

Other Stuff

  • Pat Helland, who has blogged even less than me for the past few months, has a post up about controller and doers in the IT department. After 18 months in MSIT, put me in the doer camp, please.
  • The F# team has pushed out a spec for v1.9.4 of the language. Don Syme says it's not official, but it's a huge improvement over the old informal spec
  • Speaking of F#, my friend Matthew Podwysocki recently published FsTest, a testing DSL for F#. I wrote about F# unit testing as part of my PEG parsing series, and I really like the direction Matthew has taken this project. You can pull it down from CodePlex.
  • When I did my PEG talk @ Lang.NET, Gilad Bracha mentioned I should check out oMeta. It looks really cool, though with the job change I haven’t had the time to play with it. Now I discover that Jeff Moser is working on a version for CLR called oMeta# that I’ve got to spend some time with. And in the comments to that post, I discovered pyMeta from Allen Short, though it apparently doesn’t work on IronPython (must investigate why).
  • James Kovacs introduces psake, a PowerShell based build automation tool which uses a rake-inspired internal DSL syntax similar to one I blogged last year. I'd love to see this take off, but given MSBuild's tool integration, I wonder if that's feasible.
  • I upgraded my home wireless network almost exactly a year ago. I've been happy with the range and coverage, but not so happy with the Buffalo Tech firmware. The built-in DHCP server is pretty flaky. So I upgraded to the open-source Tomato firmware. Upgrade was smooth, though I did need to reset my cable modem. But even that was smooth - Comcast has an automated service for that now,
Posted By Harry Pierson at 9:30 AM Pacific Daylight Time

Friday, February 22, 2008

Morning Coffee 149

Posted By Harry Pierson at 10:34 AM Pacific Standard Time

Thursday, February 07, 2008

Morning Coffee 144

  • I finished Mass Effect last night. I definitely need to play thru that one again, though I'll probably wait until the new Bring Down the Sky DLC ships next month.
  • Caps won again last night, improving to 20-10-4 since changing coaches at Thanksgiving. They're now at 57 points, taking the lead in the SE division with a full game on Carolina, Atlanta and Florida. Still a ways to go - 27 games left in the regular season - and things are far from "sewn up" but we're a damn sight better off than we were in November.
  • Speaking of a horserace, looks like Clinton and Obama are in one after Super Tuesday. Their estimated delegate counts are basically tied. On the other side of the aisle, McCain opened up what is probably insurmountable lead - even though he has the right-wing media stars and Christian leaders against him. Money quote of the day:

“The real story of the night, when you look at their rallies and their turn-out numbers, is that the Dems have two strong candidates either of whom could lead a united party to victory. Forget the gaseous platitudes: in Dem terms, their choice on Super Duper Tuesday was deciding which candidate was Super Duper and which was merely Super. Over on the GOP side, it was a choice between Weak & Divisive or Weaker & Unacceptable. Doesn’t bode well for November.”
- Mark Steyn, National Review 
(via Carpetbagger Report, lest you think I regularly read National Review)

  • Charlie Calvert is starting a new series on the future of C#. First up: Dynamic Lookup. Probably most interesting is the news that the DLR "will be the infrastructure on which the C# team implements dynamic lookup". Does this mean C# will target the DLR? Sure sounds like it. I think it's a good addition, but I'm not a fan of the proposed syntax. (via Bitter Coder)
  • Brian McNamara saw me present @ LangNET and sent me a link to his blog. He's building up a monadic parser combinator library in C# 3.0. This is basically the same concept that FParsec implements, though C#'s syntax is much less attractive than F#'s for this kind of code. However, Brian does a very good job explaining why monadic parser combinators are useful and making the idea accessible to the C# programmer (i.e. you don't have to learn F# or Haskell to understand what he's talking about). He also points to Luke Hoban's C# 3.0 monadic parser implementation.
Posted By Harry Pierson at 10:05 AM Pacific Standard Time

Wednesday, February 06, 2008

Morning Coffee 143

  • I've been sick for three days, hence the lack of posting around here.
  • As a Redskins fan, it's hard to root for any other NFC East team. On the other hand, it sure was easy to root against the Patriots. Congrats to the Giants on their Super Bowl victory. Favorite headline: 18 and uh-oh!
  • Sounds like there's cause for optimism regarding the writer's strike. But is it already too late? Will the 9% drop in viewers ever come back? Personally, I think the studios have hastened their own irrelevance.
  • With last night's win, the Caps are one game above .500. In and of itself, that's nothing to be proud of - Coach Boudreau remarked when we reached .500 that the Caps had "officially reached mediocrity". However, the Caps are the only team in the SE conference that's above .500. If hockey used baseball standings, Carolina, Atlanta and Florida would each be 1/2 game back of the Caps. It's going to be a fight to the finish.
  • In fairly big managed Ruby news, Wayne Kelly has decided to contribute to the IronRuby effort, effectively walking away from the Ruby.NET which helped get off the ground. One the one hand, obviously this is great for IronRuby. On the other hand, I liked the idea of multiple managed implementations of Ruby, so here's hoping Ruby.NET doesn't fade away.
  • Speaking of the DLR, I know I mentioned Martin Maly's blog in my Lang.NET Morning Coffee Post, but I didn't actually get to read his posts on targeting the DLR until I unexpectedly had several days off sick. If you are at all interested in writing your own language for the .NET platform: Go. Read. Now. You should also check out Tomas Restrepo's blog, he has also started writing about targeting the DLR.
  • Larry O'Brien's blog is currently offline, but he commented that he doubted my ToyScript F# parser would be more than 600 lines of code. Currently, the parser is clocking in at 287 lines of code plus about 50 more for the AST. It's not done yet - see earlier statement about being sick - but I'm fixing bugs not writing additional code at this point. To be completely accurate, that's 287 lines of FParsec code. It's taken me a little bit to learn FParsec, but so far I'm pretty happy with it.
  • Scott Hanselman points to the new MS Deploy project, a tool for managing content and configuration on web servers. I've never understood why this wasn't a standard part of IIS. It seems every hosting company I've used has rolled their own web-based management tool like DotNetPanel.
  • Oh yeah, Vista SP1 and Windows Server 2008 shipped Monday. Congrats!
  • I fired up Inside Xbox the other day, and there was a page about the new Disney Channel show "Phineas and Ferb". Of course, with two kids under five, anything new on the Disney Channel is notable in my house. What made this blog-worthy is the fact that it's directed and written by Dan Povenmire, who I knew from my USC days. I used to go see his band Keep Left and groan loudly at the bad puns in their song "PSA". Dan, if you found this searching for yourself online: Awesome work, my kids love the show!
Posted By Harry Pierson at 11:41 AM Pacific Standard Time

Thursday, January 31, 2008

Morning Coffee 141 - Lang.NET '08 Edition

header I was hoping to blog my thoughts on Lang.NET as the event went along. Obviously, that didn't happen, though I was pretty good about dumping links into my del.icio.us feed. The talks were all recorded, and should be up on the website in a week or two. Rather than provide a detailed summary of everything that happened, here are my highlights:

  • The coolest thing about conferences like this is what John Rose called "N3" aka "Nerd-to-Nerd Networking". It was great to meet in person, drink with and geek out with folks who's blogs I read like Tomas Petricek, Wesner Moise and Larry O'Brien. Plus, I got to meet a bunch of other cool folks like Gilad Bracha, Stefan Wenig and Wez Furlong. That's worth the price of admission (which was admittedly nothing) right there.
  • Coolest MSFT talk: Martin Maly "Targeting DLR". I was wholly unaware that the DLR includes an entire compiler back end. Martin summarized the idea of DLR trees on his blog, but the short version is "you parse the language, DLR generates the code". That's pretty cool, and should dramatically lower the bar for language development. Of course, I want to write my parser in F#, so I'm going to port the DLR ToyScript sample to F#.
  • Runner-up, Coolest MSFT talk: Erik Meijer "Democratizing the Cloud with Volta". Erik is a great speaker and he really set the tone of his session with the comment "Division by zero is the goal, not an error". He was referring to an idea from The Change Function that user's measure of success is a function of perceived crisis divided by perceived pain of adoption. Erik wants to drive that adoption pain to zero. It's a laudable goal, but I remain unconvinced on Volta.
  • Coolest Non-MSFT talk: Gilad Bracha "Newspeak". Newspeak is a new language from one of the co-authors of Java. It's heavily smalltalk influenced, and runs on Squeak. He showed developing PEGs in Newspeak, and they were very compact and easy to read, easier even than F#. He calls them Executable grammar, and you can read his research paper or review his slides on the topic. Unfortunately, Newspeak isn't generally available at this time.
  • Runner-up, Coolest Non-MSFT talk: Miguel de Icaza "Moonlight and Mono". The talk was kinda all-over-the-place, but It's great to see how far Mono has come. Second Life just started beta testing a Mono-based script runner for their LSL language (apparently, Mono breaks many LSL scripts because it runs them so fast). He also showed off Unity, a 3D game development tool, also running on Mono.
  • Resolver One is a product that bridges the gap between spreadsheets and applications, entirely written in IronPython (around 30,000 lines of app code and 110,000 lines of test code, all in IPy). Creating a spread-sheet based app development environment is one of those ideas that seems obvious in retrospect, at least to me. If you do any kind of complicated spreadsheet based analysis, you should check out their product.
  • If you're a PowerShell user, you should check out PowerShell+. It's a free console environment designed for PowerShell and a damn sight better than CMD.exe. If you're not a PowerShell user, what the heck is wrong with you?
  • Other projects to take a deeper look at: C# Mixins and Cobra Language.
  • I thought my talk went pretty well. It's was a 15 minute version of my Practical Parsing in F# series. Several folks were surprised I've been coding F# for less than a year.
Posted By Harry Pierson at 10:25 AM Pacific Standard Time

Tuesday, January 29, 2008

Practical F# Parsing: Recursion and Predicate Functions

To prep for my Lang.NET talk, I went back and reviewed my PEG parser. One thing I was not happy with was that all the recursion was handled in a one-off manner. When I needed to match multiple characters in the comment rule, I wrote a special one-off function to recursively process the comment until it reached an EOL. When I needed to parse a series of ranges, characters or definitions, I wrote special one-off functions to handle that recursion. Obviously, that's not the best approach. So, I wrote the following active pattern functions to handle recursion.

//ZOM == Zero Or More   
let rec (|ZOM|) f input = 
    match f input with 
    | Some(i,input) -> 
        let j,input = (|ZOM|) f input
        (i :: j, input)
    | None -> [], input

//OOM == One Or More       
let (|OOM|_|) f input = 
    match (|ZOM|) f input with
    | [], input -> None
    | v, input -> Some(v,input)
    
//ZOO == Zero Or One
let (|ZOO|) f input = 
    match f input with 
    | Some(i,input) -> Some(i), input
    | None -> None,input

With these functions at the ready, I can stop writing one-off recursion functions. Instead, I write a function that matches a single item, which I pass as an argument to one of the three functions above. For example, here is the original and new version of the top level Grammar function.

//Original version
let (|Grammar|_|) input =
    let rec ParseDefinitions dl input =
        match input with
        | Definition (d, input) -> 
            ParseDefinitions (dl @ [d]) input
        | _ -> Some(dl, input)
    let (|OneOrMoreDefintions|_|) input = 
        match input with
        | Definition (d, input) -> ParseDefinitions [d] input
        | _ -> None
    match input with
    | Spacing (OneOrMoreDefintions (dl, EndOfFile)) -> 
        Some(List.to_array dl)
    | _ -> None
    
//New Version   
let (|Grammar|_|) = function
    | Spacing (OOM (|Definition|_|) (dl, EndOfFile)) ->
        Some(List.to_array dl)
    | _ -> None

The new version is much shorter, because there's already a function to match a single definition, which we can pass into OneOrMore (aka OOM). Note, when I pass an active pattern function as a parameter, I have to use it's real name (with the pipes and parameters). Having to use the real name is pretty ugly, but F# need to be able to differentiate between using a function as an active pattern vs using it as a function parameter. If you could just call "OOM Definition (dl, EndOfFile)", would F# realize Definition is a parameter?

I also defined syntactic predicate functions. If you'll recall, these syntactic predicates will try to match but automatically backtrack, returning success or failure depending on which function you called.

//FP == Failure Predicate
let (|FP|_|) f input =
    match f input with
    | Some(_) -> None
    | None -> Some(input)

//SP == Success Predicate
let (|SP|_|) f input =
    match f input with
    | Some(_) -> Some(input)
    | None -> None

To see this in action, here's the original and updated Primary function. Only the first rule is relevant, so I've omitted the others.

//Original version
let (|Primary|_|) input = 
    let (|NotLEFTARROW|_|) input = 
        match input with 
        | LEFTARROW (_) -> None 
        | _ -> Some(input) 
    match input with 
    | Identifier (id, NotLEFTARROW (input)) ->  
        Some(Primary.Identifier(id), input) 
    //rest of function omitted for clarity

//new version
let (|Primary|_|) = function 
    | Identifier (id, FP (|LEFTARROW|_|) (input)) -> 
        Some(Primary.Identifier(id), input) 
    //rest of function omitted for clarity

Instead of writing a special function to match "not left arrow", I just pass the left arrow function as a parameter to Failure Predicate (aka FP). With these recursion and syntactic predicate functions, I was able to remove all the one-off recursion functions from my parser. (Note, I posted an updated version of PegParser on my SkyDrive so you can see this in action.)

These five functions significantly reduced the complexity of the code. Unfortunately, I'm not sure it's much easier to read. The conciseness is offset IMO by the ugliness of using the active pattern's true names. Also, I would have liked to use custom operators for these five functions, but operators aren't allowed to be active pattern functions. Hopefully, that will change at some point in the future, though if we're going to dream of better syntax, can we do something about all the parens? Personally, I'd love to be able to write the following:

//This doesn't work, but I can dream, can't I?
let (|Primary|_|) = function  
    | Identifier (id) !!LEFTARROW (input) -> 
        Some(Primary.Identifier(id), input)  
    //rest of function omitted for clarity
   
let (|Grammar|_|) = function 
    | Spacing ++Definition (dl) EndOfFile -> 
        Some(List.to_array dl) 
    | _ -> None

Note to self, talk to F# team members who come to LangNET about this...

Posted By Harry Pierson at 9:40 AM Pacific Standard Time

Thursday, December 20, 2007

F# PEG Parser Next Steps

There are still a couple of posts to go in my Practical Parsing in F# series. But with Christmas and my parents on their way, I'm taking the rest of the year off.

I've stuck the code as it currently stands up on my SkyDrive. Conveniently enough, xUnit.net released their RC1 build yesterday, which includes supports for static test methods. I've included the RC1 build in the zip file on SkyDrive, as well as simple batch file so you can run the tests yourself.

Taking a break from this project will give me a good opportunity to figure out where to take it next. As the code stands, it's not very useful - it simply builds a PEG AST from a PEG grammar. That's just the first phase of a typical compiler. Without those other phases (you know, like "generate binary code") this is just an interesting sample.

Since I'm in the "future pondering" phase, now's the time to make your opinion known. What do you, dear reader, think I should do with this code? Bonus points for wanting to get involved.

Posted By Harry Pierson at 10:03 AM Pacific Standard Time

Practical F# Parsing: Semantic Productions (2)

Now that I've explained the AST, there are several more semantic productions to go. I'm not going to describe them all in detail, just hit a few important highlights.

Many of the semantic productions return lists of other productions. Class returns a list of Ranges, Literal and Identifier returns lists of characters, etc. As you would expect, these multiples are encoded in the grammar. For example, here's the implementation of Literal:

///Literal <- ['] (!['] Char)* ['] Spacing
///         / ["] (!["] Char)* ["] Spacing
let (|Literal|_|) input =

    let rec (|LitChars|_|) delimiter chars input =
        match input with
        | TOKEN delimiter (_) -> Some(L2S chars, input)
        | Char (c, input) -> 
            (|LitChars|_|) delimiter (chars @ [c]) input
        | _ -> None
    
    match input with
    | TOKEN "'"  (LitChars "'"  [] (lit, TOKEN "'"  (Spacing(input)))) -> 
        Some(lit, input)
    | TOKEN "\"" (LitChars "\"" [] (lit, TOKEN "\"" (Spacing(input)))) -> 
        Some(lit, input)
    | _ -> None

I'm using a local recursive function LitChars to retrieve the characters between the quote delimiters. The quote parameter - i.e. single or double quote - is passed in as a parameter. I also pass in an empty list of chars as a parameter. Remember that functional programs keep their data on the stack, a list parameter is a common way to keep state in a recursive function. When I match a single non-delimiter character, I add it to the list with the chars @ [c] expression. [c] converts a single value c into a list of one element while the @ operator concatenates to lists. I'm not sure adding the value to he end like that is a good idea perf wise. Programming Erlang recommends only adding items to the head then reversing the list when you're done matching. But F# isn't Erlang, so I'm not sure what the guidance is here.

Another thing you find in productions is the backtracking syntactic predicates. We saw an example of them in the implementation of Comment. Often, their used to indicate the end of a list of other productions, such as Literal, above. However, sometimes, they're used to ensure the correct production is matched. For example, a Primary can be an Identifier, as long as it's not followed by a left arrow. An identifier followed by a left arrow indicates a Definition.

///Primary <- Identifier !LEFTARROW
///         / OPEN Expression CLOSE
///         / Literal / Class / DOT
let rec (|Primary|_|) input =

    let (|NotLEFTARROW|_|) input =
        match input with
        | LEFTARROW (_) -> None
        | _ -> Some(input)

    match input with
    | Identifier (id, NotLEFTARROW (input)) -> 
        Some(Primary.Identifier(id), input)
    | OPEN ( Expression (exp, CLOSE (input))) -> 
        Some(Primary.Expression(exp), input)
    | Literal (lit, input) -> Some(Primary.Literal(lit), input)
    | Class (cls, input) -> Some(Primary.Class(cls), input)
    | DOT (input) -> Some(Primary.Dot, input)
    | _ -> None

Here, I need a way to match the absence of LEFTARROW, so I've build a simple local function called NotLEFTARROW. This isn't very clean IMO - I'd rather have a used a custom operator like !!! and &&& for my backtracking predicates. But I haven't figured out how to use custom operators as Active Patterns. I was able to write a standard non-operator AP function, but then I have to use the full AP function name. Here's a version of Primary written that way:

///Backtracking failure predicate
let (|NotPred|_|) f input =
    match f input with
    | Some (_) -> None
    | _ -> Some(input) 
    
let rec (|Primary|_|) input =
    match input with
    | Identifier (id, NotPred (|LEFTARROW|_|) (input)) -> 
        Some(Primary.Identifier(id), input) 
    //Other matches omited

Frankly, I don't think that's very readable, so I didn't implement it that way. If I can figure out how to use custom operators and pass around AP functions without using their full ugly name, I'll change it.

Finally, there are a few things about F#'s scoping rules that you need to understand. F# uses linear scoping, which is to say there's no way to use a type or function that hasn't been declared, sort of like C/C++. The difference is that while C/C++ have a way to declare a type or function separately from its implementation, F# has no such capacity. This becomes an issue when you have circular references. For example, Primary can be an Expression, which is a list of SequenceItems, each of which is a Primary with an optional prefix and suffix. In order to declare those in F#, you have to use a special "and" syntax to link the types/functions together.

//ToString and Exp2Str omitted for clarity
type Primary =
| Identifier of string
| Expression of Expression
| Literal of string
| Class of Range list
| Dot 

//ToString omitted for clarity
and SequenceItem =
    { 
        primaryItem: Primary;
        itemPrefix: Prefix option;     
        itemSuffix: Suffix option;
    }

and Sequence = SequenceItem list

and Expression = Sequence list

Likewise, the AP functions to recognize Primary, SequenceItem, Sequence and Expression are anded together. For me, this is one of the hardest things to get used to about F#. But as you can see from the expressiveness of the code, it's well worth the trouble

Posted By Harry Pierson at 10:00 AM Pacific Standard Time

Wednesday, December 19, 2007

Practical F# Parsing: The Abstract Syntax Tree

In the last post, I showed two semantic productions, Char and Range. Char returns an option tuple of a native char and the parse buffer. Range returns a tuple of either a single character or a character range and the parse buffer. Certainly, I could have written Range to always return a char * char tuple, passing in the same character for both in the case of a single character range. However, this provides an opportunity to introduce F#'s discriminated unions (or simply union for short).

The F# Manual describes a discriminated union as "a new type composed of a fixed number of distinct alternatives". Many of the semantic productions return "a fixed number of distinct alternatives" so I find a union is a good way to model the return value of semantic production functions. Here's the definition of Range:

///AST Type for Range production
type Range =
| Single of char
| Dual of char * char
    with
    override this.ToString()
        match this with
        | Single x -> sprintf "Range.Single (%A)" x
        | Dual (x,y) -> sprintf "Range.Dual (%A,%A)" x y

So Range is either a single character, or a tuple of two characters. As you saw in the last post, you create an instance of a union with the type.alternative syntax. You can also use simply the alternative name, assuming F# can determine the correct union type. Personally, I like using the full name - it helps me remember what the type really is.

Notice that the AP function and the union type appear to have the same name. Actually, they don't since the name of the AP function's name includes the bananas - i.e. (|Range|_). However, if you want you can define a function called simply Range and still have a type named Range as well - as long as you're not interested in language interop. F# can tell the difference between the Range function and the Range union, but C# can't. So I'd say we're best off avoiding overloading the names entirely.

If you look at the compiled union in Reflector, you'll see the Range type, with public internal classes named _Single and _Dual that inherit from Range. In other words, F# implements union types as an inheritance tree.  Range also provides static constructors for the various disparate types in the union.

One last thing I want to point out about the Range type is how I overrode ToString. This is primarily for unit testing - if you don't override ToString, you only get the type name which isn't very useful when trying to figure out why a given unit test failed. I'm using the F# native sprintf function rather than string.Format, so the format string is a little different.

The other major F# type we'll use in the AST are record types. These are similar conceptually to structs in C#. Basically, they're a tuple with names. For example, here's the Definition record type (though we haven't seen any functions that use this type yet).

///AST Type for Definition production
type Definition =  
    { 
        name: string
        exp: Expression; 
    } 
    with 
    override this.ToString() =  
        sprintf "Definition (name: %A, exp: %A)" this.name (Primary.Exp2Str this.exp)

I could have simply defined this type as (string * Expression), but having the fields named makes it crystal clear what the semantic meaning of each field is. The only place where I used an anonymous tuple in the AST instead of a record is in the Range union above - I figured that was simple enough not to warrant named fields.

I also have a couple of type aliases. For example, I have a record type called SequenceItem. An array of SequenceItems is a Sequence and an array of Sequences is an Expression (which we saw in the Definition type above).

///AST Type for Sequence Item production
let SequenceItem =
    { 
        primaryItem: Primary;
        itemPrefix: Prefix option;     
        itemSuffix: Suffix option;
    }

///AST Type for Sequence production
let Sequence = SequenceItem list

///AST Type for Expression production
let Expression = Sequence list

Note, unlike unions and records, type aliases can't override base class methods like ToString. This is because there is no actual Sequence or Expression types in the compiled code - F# compiles away type aliases completely. Looking at the implementaiton of Definition in reflector confirms that the exp member is of type List<List<SequenceItem>>. Since I need to convert Expressions to strings in two different places, I wrote a static Exp2Str method on the Primary type (not shown). It feels a bit hacky to stick Expression's ToString implementation on the Primary type, but I had little choice given F#'s scoping rules.

Technically, since they get compiled away anyway, I could have skipped the Sequence and Expression declarations and simply defined the exp field of Definition as "SequenceItem list list". But the "list list" syntax throws me a bit. I mean, I understand it, but I found using the terms Sequence and Expression far more readable. Also, I used the definition of Expression in the definition of Primary, so it makes sense for it to have it's own name.

Posted By Harry Pierson at 11:00 AM Pacific Standard Time

Tuesday, December 18, 2007

Practical F# Parsing: Semantic Productions (1)

All the syntactic productions in my PEG parser, save one, have the exact same signature. They take in a char list and return a char list option. Which is to say, they take a parse buffer in and return either the remaining parse buffer on a successful match or nothing on a failed match. The only exception is EndOfFile which doesn't return the remaining parse buffer because there isn't any buffer left to parse.

Now we're moving on to look at the productions with semantic implications. In Parsing Expression Grammars, there are eleven: Char, Range, Class, Literal, Identifier, Primary, Sequence Item, Sequence, Expression, Definition and Grammar. Like their syntactic brethren, these semantic productions will all have a single char list input parameter. However, they will all return some semantic value along with the remaining parse buffer.

We'll start with Char, since it's the only semantic production that doesn't return a custom type:

///Char <- '\\' [nrt'"\[\]\\]
/// / '\\' [0-2][0-7][0-7]
/// / '\\' [0-7][0-7]
/// / '\\' [0-7]
/// / !'\\' .   
let (|Char|_|) input = 
       
    let (|InRange|_|) upper input =
        let i2c value = Char.chr(Char.code '0' + value)
        let c2i value = Char.code value - Char.code '0'
        
        match input with
        | NC (c, input) when (i2c 0) <= c && c <= (i2c upper) ->
            Some((c2i c), input)
        | _ -> None
        
    match input with
    | TOKEN @"\" (NC(c, input)) 
    when List.exists (fun x -> x=c) ['n';'r';'t';'\'';'"';'[';']';'\\'] -> 
        match c with
        | 'n' -> Some('\n', input)
        | 'r' -> Some('\r', input)
        | 't' -> Some('\t', input)
        | _ -> Some(c, input)
    | TOKEN @"\" (InRange 2 (i1, InRange 7 (i2, InRange 7 (i3, input)))) ->
        Some(Char.chr (i1 * 64 + i2 * 8 + i3), input)
    | TOKEN @"\" (InRange 7 (i1, InRange 7 (i2, input))) ->
        Some(Char.chr (i1 * 8 + i2), input)
    | TOKEN @"\" (InRange 7 (i1, input)) ->
        Some(Char.chr (i1), input)

    | NC(c, input) when c <> '\\' -> Some(c, input)
    | _ -> None 

Note, this production is slightly different from the one in the PEG whitepaper. This way was easier to pattern match. Also, I typically don't wrap my when guards onto the next line, but this way it doesn't wrap funny on my blog.

While long, Char is fairly straight-forward. There are five ordered choices that can match this production. The first is for escaped characters, the next three are for character codes, and the last one is matching any character except the backslash escape character. Note, tracking F#'s escape characters and PEG's escape characters can get tricky. I've used verbatim strings for all my TOKEN parameters in order to help try and keep it straight.

The escape character match clause uses a when guard to narrow down the selection criteria. I use the built-in List.exists method to see if the character is in a hard-coded list of special characters. List.exists takes in a function parameter, and returns true if that function returns true for any of the value is the list. Since I'm just matching a value, my function parameter is a trivial equality test. If List.exists returns true, I return that special character as part of the return tuple. Of all the escape characters in PEG, only three are also escape characters in F#, so I use a second match clause to return the correct char value. There's probably a way to do that more elegantly, but since there were just three clauses, I figured it was easier to type them out manually.

For the character code clauses, I wrote a special local AP function called InRange to determine if the specified character was within a specified range and to convert it from a char to an int. Note, the way the production is written, the largest character code you can specify is 277, which means you can encode slightly more than the standard UTF-8 character set. Honestly, this should be updated to support full UTF-16, but I'm not here to critique the grammar, so I didn't try to fix this issue.

Note, all the results (save None) return a tuple of the matched character value and the remaining input buffer. Again, all the remaining productions will work like that. For example, here's the Range production:

///Range <- Char '-' Char / Char
let (|Range|_|) input =
    match input with
    | Char (c1, TOKEN "-" (Char (c2, input))) -> 
        Some(Range.Dual (c1, c2), input)
    | Char (c1, input) -> 
        Some(Range.Single (c1), input)
    | _ -> None

Compared to Char, Range is fairly simple. It's either two chars, separated by a hyphen (for example: a-z) or it's a single char. Again, being able to use Active Patterns to build on lower level productions is a huge helper.

But what does this function return? What does Range.Single and Range.Dual mean? Those are refer to a special F# construct called a discriminated union. Before we can continue writing semantic productions, we need to define these types to hold the results of these productions.

Posted By Harry Pierson at 12:53 PM Pacific Standard Time

Monday, December 17, 2007

Practical F# Parsing : Syntactical Productions (2)

Now that I've moved over to Active Patterns, I want to go back and finish the syntactic productions for my PEG parser. Most of the syntactic productions are very straightforward when implemented in AP. We've seen EndOfFile, EndOfLine and Space already. There is also a series of symbol identifiers that have only a single match clause. For example, here's DOT:

///DOT <- '.' Spacing
let (|DOT|_|) input =
    match input with
    | TOKEN "." (Spacing(input)) -> Some(input)
    | _ -> None

I'm not going to go thru all the symbol AP functions since their all basically like this one. However, you'll notice that this function references an AP we haven't seen yet - Spacing. I want to close out the section on Syntactical Productions by looking at the Spacing and Comment productions. Since Spacing depends on Comment, I'll start with Comment.

Comments in PEG grammars are single lines that start with a # symbol, similar to the // line comments in F# and C#. This is the PEG grammar rule for Comment:

///Comment <- '#' (!EndOfLine .)* EndOfLine

Basically, this says that a comment starts with a #, then matches zero or more characters that are not EndOfLine, and ends with an EndOfLine. The exclamation point is a syntactic predicate, which means that we unconditionally backtrack after attempting to match. PEG has both a success and failure syntactic predicate - the ! is the failure predicate while & is the success predicate. So inside the parens, this production rule says to test the current point in the parse buffer for EndOfLine. If it finds it, the match fails and we exit out of the parens (where we match EndOfLine again without backtracking it this time). If it doesn't find it, the parser backtracks, consumes the next character regardless what it is, then repeats.

Unfortunately, there's a bug in this production. If the parse buffer ends in a comment, the production will fail since it hasn't reached the EndOfLine and there are no more characters to consume. So I changed the production to:

///Comment <- '#' ((!EndOfLine / !EndOfFile) .)* EndOfLine?

This rule now ends the comment if it reaches an EndOfLine or EndOfFile. Additionally, it makes the final EndOfLine match optional. So if the comment ends with a new line, the new line is consumed as part of the grammar production. If the comment ends with the end of file, the EndOfFile is not consumed as part of the production. If you'll recall, EndOfFile returns Some(unit) rather than Some(char list). In F#, the various branches of a match clause have to have the same return type, so you can't return Some() from one branch and Some(input) from another. It's no big deal - you use the EndOfFile production at the top-level grammar to ensure you've consumed the entire file anyway.

Here's the F# implementation of Comment:

Comment defines a local AP function called CommentContent, which implements the part of the grammar production inside the parens.

///Comment <- '#' ((!EndOfLine / !EndOfFile) .)* EndOfLine?
let (|Comment|_|) input = 
    let rec (|CommentContent|_|) input = 
        match input with
        | EndOfLine (input) -> Some(input)
        | EndOfFile -> Some(input)
        | NC (_,input) -> (|CommentContent|_|) input
        | _ -> None
    match input with
    | TOKEN "#" (CommentContent (input)) -> Some(input)
    | _ -> None

Local AP function CommentContent recurses thru the input buffer after the pound sign, looking for  EndOfLine or EndOfFile. This function should never match the final default clause, but I put it in to keep the compiler from complaining. Notice that I use symbol redefinition here so both EndOf match clauses return Some(input). For EndOfLine, I'm re-defining input to mean what is returned by EndOfLine. For EndOfFile, I'm not re-defining, so input still means the list that is passed into the pattern match statement.

Compared to Comment, Spacing is pretty trivial:

///Spacing <- (Space / Comment)*
let rec (|Spacing|) input = 
    match input with
    | Space (input) -> (|Spacing|) input
    | Comment (input) -> (|Spacing|) input
    | _ -> input

There are two things I want to call out about spacing. First, it's a recursive function, so it's defined with let rec. AP functions can be recursive, just like normal functions. Also, note the lack of an underscore in the name of this AP function. Spacing is defined as zero or more spaces or comments, so it's perfectly valid to match nothing. Thus, Spacing is always a successful match. In this case, we don't put the underscore in the AP function name and we don't wrap the return result in Some(). You'll notice the last match clause simply returns input, rather than Some(input).

That's all the syntactic predicates. Next up, the meat of the grammar: semantic predicates.

Posted By Harry Pierson at 11:38 AM Pacific Standard Time

Friday, December 14, 2007

Practical F# Parsing: Active Patterns

In the last post, I gave you a sneak preview of what the EndOfLine production would look like using Active Patterns. But before we get to how to build that, let me give you a little background on why. If you want the full explanation, check out the whitepaper.

Basically, Active Patterns (aka AP) are a way to use the pattern matching of functional languages with abstractions rather than native language types. If you'll recall, I built functions to abstract the parse buffer so I could later change it's implementation if I needed to. The problem is that since the parse buffer is an abstraction, you can't use it in the match clauses. For example, here's a version of EndOfLine that uses a native char list.

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input = 
    match input with
    | '\r' :: '\n' :: input -> Some(input)
    | '\n' :: input -> Some(input)
    | '\r' :: input -> Some(input)
    | _ -> None

That's straightforward like the AP preview at the end of the last post, but I've lost the benefit of the parse buffer abstraction. In other words, if I wanted to change the implementaiton of the parse buffer to a string, or some other type, I'd be screwed if I wrote EndOfLine this way. Traditionally, functional language developers had an either/or choice when it came to abstractions vs. pattern matching. AP let's you use both.

Using a special syntax, you can indicate that an F# function is an AP by surrounding the name in what Don calls "bananas". Here's the AP version of NC:

let (|NC|_|) input = 
    match input with 
    | i :: input -> Some(i, input)
    | [] -> None

This function is identical to the one defined in the first post, except for the name. By surrounding the actual name in paren/pipe "bananas", you're indicating the function can be used in match clauses, not just the match input. The trailing underscore in the name indicates this is a partial pattern, which means it returns an option value (aka Some(_) or None).

There's no reason why you can't use an AP function like any other function. I find I do this often in my unit tests. Here's an updated version of an NC unit test.

[<Fact>]  
let test_NC_empty_string () =   
    let ret = (|NC|_|) !!"" 
    Assert.Equal(None, ret)

While you can still call the function like this, the primary benefit of using Active Patterns is so you can use the function in pattern match clauses directly. This allows the production clauses to mirror the actual grammar rules directly. For simple productions like EndOfFile and EndOfLine, the AP F# implementation isn't much more complex than the grammar rule itself:

///EndOfFile <- !.
let (|EndOfFile|_|) input =  
    match input with 
    | NC (_) -> None 
    | _ -> Some() 

///EndOfLine <- '\r\n' / '\n' / '\r'    
let (|EndOfLine|_|) input =  
    match input with 
    | TOKEN "\r\n" (input) -> Some(input) 
    | TOKEN "\n" (input) -> Some(input) 
    | TOKEN "\r" (input) -> Some(input) 
    | _ -> None

You see in these functions, the calls to NC and TOKEN are used in the match clauses (i.e. after the pipe) rather than the match input (i.e. between match and with). Note, when used in a match clause, you just use the name directly without the bananas.

You'll notice that for TOKEN, the token string to match goes outside the parentheses. This is because "\r\n" is an input parameter to the TOKEN AP function. Alternatively, I could have written EndOfLine using only the NC function, though I find TOKEN version easier to read.

///EndOfLine <- '\r\n' / '\n' / '\r'   
let (|EndOfLine|_|) input = 
    match input with
    | NC ('\r', NC ('\n', (input))) -> Some(input)
    | NC ('\n', input) -> Some(input)
    | NC ('\r', input) -> Some(input)
    | _ -> None

In this version, the values of '\r' and '\n' are pattern matched against the result of calling NC, so they go inside the parentheses. In other words, the TOKEN clauses are matched if TOKEN returns some value. However, the NC clauses are only matched if the returned result matches the value specified in the match clause. inside the parentheses. TOKEN has two parameters, the token string and the parse buffer, while NC only has the parse buffer. When you write an AP function, the last parameter gets bound to the match clause input. Additional parameters, like TOKEN's token, much be specified in the match clause.

Notice I've defined these grammar productions as active patterns as well, which will make them compose nicely with higher-order productions. For example, here's the Space grammar production, which reuses EndOfLine:

///Space <- ' ' / '\t' / EndOfLine
let (|Space|_|) input = 
    match input with
    | TOKEN " " (input) -> Some(input)
    | TOKEN "\t" (input) -> Some(input)
    | EndOfLine (input) -> Some(input)
    | _ -> None

It's DSL-esque, wouldn't you say? Active Patterns is a little parens-heavy - the NC version of EndOfLine has three nested APs which isn't exactly easy on the eyes. However, the concept is very solid and it make the parsing code almost easier to write by hand than it would be to use a parser generator like yacc. Almost.

Posted By Harry Pierson at 10:11 AM Pacific Standard Time

Thursday, December 13, 2007

Practical F# Parsing: Syntactical Productions (1)

Now that I have my parse buffer functions and unit test framework taken care of, it's time to write some parsing expression grammar productions [1]. In any grammar, there are productions that have semantic meaning and productions that are only used to specify syntax. For example, in C# statements end with a semicolon and code blocks are delineated with curly braces. Syntactical elements don't end up in the abstract syntax tree, so they're easier to implement.

Most of PEG's grammar productions are syntactical in nature. Let's start with the simplest production, EndOfFile.

///EndOfFile <- !.
let EndOfFile input = 
    match NC input with
    | None -> Some()
    | _ -> None

Similar to the underscore in F#, the period represents any character In the PEG grammar and the exclamation point is a backtracking not. In other words, EndOfFile succeeds if NC fails, which only happens when the parse buffer is empty. Since the parse buffer is empty on a successful match, there's no value to return so Ireturn Some(). Some() is kinda like Nullable<void>, which doesn't make much sense in C#. But here it means I have a successful match but no data to pass back to the caller.

Here's a slightly more complex grammar production, EndOfLine:

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input = 
    match NC input with
    | Some('\r', input) ->
        match NC input with
        | Some ('\n', input) -> Some(input)
        | _ -> Some(input)
    | Some('\n', input) -> Some(input)
    | _ -> None

EndOfLine has three possible matches: \r\n, \n and \r. Here, you see the use of nested match statements in the \r match clause. If the top character is \r, the function checks the top of the remaining text buffer for \n. F#'s ability to reuse symbols comes in very handy here.

Unfortunately, this isn't code very readable. In particular, the match priority order, which matters in PEGs, is totally lost. For this production, it's no big deal, but that won't always be the case. I'm also not using the handy TOKEN function I wrote in a few posts ago. Here's an alternative version that uses TOKEN and preserves the match priority order.

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input = 
    match TOKEN "\r\n" input with
    | Some(input) -> Some(input)
    | _ -> 
        match TOKEN "\n" input with
        | Some(input) -> Some(input)
        | _ -> 
            match TOKEN "\r" input with
            | Some(input) -> Some(input)
            | _ -> None

This time, I use the TOKEN function and nested match statements to chain the results together. The previous method is probably faster, but this approach is more readable and explicitly preserves the match order. However, it sure is a pain to type. Wouldn't it be great if we could write something like this:

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input =  
    match input with 
    | TOKEN "\r\n" (input) -> Some(input) 
    | TOKEN "\n" (input) -> Some(input) 
    | TOKEN "\r" (input) -> Some(input) 
    | _ -> None 

Turns out, F#'s Active Patterns feature let's me implement EndOfLine exactly like this. We'll look at how it works in the next post.


[1] The full PEG grammar is listed in "Parsing Expression Grammars: A Recognition-Based Syntactic Foundation" by Brian Ford.

Posted By Harry Pierson at 11:08 AM Pacific Standard Time

Wednesday, December 12, 2007

Practical F# Parsing: Unit Testing

Now that I have functions to access the parse buffer, I better write some tests for them. Yes, I realize I should have written the tests first, but the articles flow better this way.

I've written before that one of the benefits to side-effect free functional programming is that it makes unit testing a breeze. But I still need some type of xUnit testing framework. I could write my own native F# xUnit framework, but given the availability of mature xUnit frameworks on .NET, I'd really rather just use one of them.

Traditionally, I've used NUnit or Visual Studio's unit testing framework, but they're designed to work with OO languages like C#. In order to use them from F#, we have to use the OO features of F#. Here's an example of some F# unit tests using NUnit.

type [<TestFixture>] parser_tests =   
    class   
        new () = {}   
          
        [<Test>]  
        member this.test_NC() =   
            let Some(c,text) = NC !!"test" 
            Assert.Equal(c, 't')   
            Assert.Equal(text, !!"est")

        [<Test>]  
        member this.test_NC_empty_string() =   
            let ret = NC !!"" 
            Assert.Equal(None, ret)
    end

While this works, there's an awful lot of extraneous text needed to make this work. Test functions need to be methods on a Test Fixture class (note, F# uses [< >] to indicate attributes) and that class needs a default constructor. F# doesn't add one by default, so we have to do so manually. And every test function needs to be marked with "member this".

I'd really rather write tests that looks like this:

[<Test>]   
let test_NC =    
    let Some(c,text) = NC !!"test" 
    Assert.Equal(c, 't')   
    Assert.Equal(text, !!"est")

[<Test>]   
let test_NC_empty_string =    
    let ret = NC !!"" 
    Assert.Equal(None, ret)

That's a lot more straightforward. If only I could write my test code like that...

It turns out there's a new kid on the .NET unit testing block. xUnit.net is the brainchild of Jim Newkirk (one of the original NUnit developers) and Brad Wilson (aka the .NET Guy). Among other things, xUnit.net does away with the TestFixture attribute. All public methods in all public classes are checked for tests in xUnit.net.

Since we don't need the TestFixture, does that mean I can write the tests as F# functions if I use xUnit.net? Not quite. xUnit.net only checks for public instance methods on the public classes in a test assembly. But F# functions get compiled as static methods. Luckily, xUnit.net is simple enough to change. I submitted a patch to xUnit.net that enables it to find both static and instance test methods (and to skip creating and disposing an object instance for static test methods). I'm hoping it will be accepted and ship as part of their next release. Until then, I'm running my own private build with those changes included.

Now that I've settled on a unit test framework, let's look at some tests. For my parser solution, I have two projects: PegParser and PegParser.Tests. The tests project depends both on the PegParser assembly as well as xunit.dll, so I need to set a reference to both in my project. F# projects in VS don't have the References node in the project tree, you have to either add the references on the project property page or directly within the code. Not sure which is better, but it's easier to show the code syntax:

#R @"..\..\xUnit.net\Main\xunit\bin\Debug\xunit.dll"
#R @"..\PegParser\pegparser.dll"

open Xunit
open Parser

The #R compiler directive is used to reference an external assembly. F#'s open statement acts like C#'s using statement, so I can reference types without specifying their full namespace. You'll notice that the parser is implemented in a dll called pegparser.dll while the namespace is simply Parser. Actually, it's not really a namespace. If you open PegParser.dll in Reflector, you'll notice that Parser is actually a type, and the functions are all implemented as static methods. F# hides that from you, though you'd have to know that if you wanted to invoke the parser from C# or VB.net. By default, F# uses the filename as the namespace/class name and I haven't changed that default in my parser code (though I probably should).

Once we've referenced the correct assemblies, I need to write the tests. Here are two tests for NC (aka Next Char) function I wrote in the last post.

[<Fact>]   
let test_NC_empty_string () =    
    let ret = NC !!""  
    Assert.Equal(None, ret) 

[<Fact>]   
let test_NC_simple_string () =    
    let Some(c,text) = NC !!"test"  
    Assert.Equal(c, 't')    
    Assert.Equal(text, !!"est") 

You'll notice this code is almost identical to my wish test code above. Almost. There are a few syntax changes - In xUnit.net, tests are called facts and Assert.AreEqual is simply Assert.Equal. I've also had to add empty parentheses after each test function name. Remember that functions in FP are like math functions. If there's no independent value to pass in, the result of a math function is is constant. F# compiles parameter-less functions as static properties instead of a static methods. In order to make the test functions compile as static methods, we have to pass in at least one parameter. In F#, the unit type is the equivalent of the void type in C#. Unit has exactly one valid value - the empty parens. Adding the empty parens to the parameter list of the test functions ensures they get compiled into static methods.

Note, it's really really easy to forget to add those empty parens. If you don't add them, the code will still compile, but the tests won't be found or run. I've been bit by that once already, so if you have a bunch of tests not running, make sure they have the empty parens!

So each test method feeds a parse buffer (converted from a string with the custom !! operator) into the NC function and makes assertions about the result. In the first test, NC should return None if the parse buffer is empty, so I simply compare the function result to None via Assert.Equals. In the second test, I use F#'s pattern matching capability to assign the result of NC to the value Some(c,text). Basically, this is doing simple pattern matching to bind the two value names to the two parts of the tuple returned from NC. Those two values are then asserted to be a specific values as you would expect.

Note, in the current version of F#, the line let Some(c,text) = NC !!"test" yields two warnings. The first (FS0062) warns that a future version of the language will require parens around Some(c,text). I sure hope they change their minds on this, since active patterns are already so parens-heavy. The second (FS0025) warns that this is an incomplete pattern match. If NC returns None, the pattern wont match and the function will throw a MatchFailureException. Of course, since these are unit tests, that's exactly the behavior I want! Given the nature of these warnings, personally, I turn them both off (only in my unit tests, mind you). This is done via the #nowarn compiler directives at the top of the file.

#nowarn "25" //Turn off Incomplete Pattern Match warning
#nowarn "62" //Turn off Some contruct needs parens warning

Obviously, there are more tests than just these (though my total code coverage is pretty poor, shame on me) but they're all pretty similar. As you can see, there's tests are very straight forward. The nature of FP functions makes them fairly simple to test, and xUnit.net (with a couple of minor changes) makes it easy to write your unit tests in F#.

Posted By Harry Pierson at 11:40 AM Pacific Standard Time

Tuesday, December 11, 2007

Practical F# Parsing: The Parse Buffer

The first thing I need for my F# parser is a type to represent the buffer of characters being parsed. For now, I'm using the simplest thing that could possibly work: an intrinsic F# character list. Lists are heavily used in functional languages, so they tend to have very efficient native list types. F# is no exception. Along with a native list type, F# has a native operation for retrieving the head of a list. For example, If you execute the following code, head will be bound to '1' and tail will be bound to the list ['2';'3';'4']

let head :: tail = ['1';'2';'3';'4']

The problem using the native list head operator is that my parsing code will be explicitly bound to work on character lists only. However, I'm not sure an in-memory character list is the best way to read long files of text off the disk for processing, so I'd rather not limit future options like this. So instead, I'll define my own function to retrieve the head of the parsing buffer.

let NC input =  
    match input with     
    | head :: tail -> Some(head, tail)    
    | _ -> None

Note, in all my F# code I use the #light syntax which means code blocks are indicated by significant whitespace indentation similar to Python.

NC stands for Next Character, though I gave it a short name since it's used often. It's basically wraps the native list head operator. If there's at least one element in the list, the first clause matches and the function returns Some(head,tail). If the list is empty, the second clause matches and the function returns None. The use of Some and None means this function returns an F# option type, which is similar to .NET's Nullable type. (head,tail) is a tuple - in this case, combining the head and the tail of the list together. Finally, the underscore in the second match clause is a wildcard, so that clause matches anything. I'm using it here like the default clause of a switch statement in C#.

The F# type for this function is 'a list -> ('a * 'a list) option. The 'a is a generic type parameter and the asterisk indicates a tuple. So NC takes a generic list, and returns an option tuple with a single generic element and a generic list. Even though the function is named Next Character, it would work with a list of any type.

Now that I have my own list head function, the rest of my parsing code can it. If I later decide to change the type of the parse buffer, I can change the implementation of NC - including the input and return types - without having to change the parsing functions that use NC. For example, here's a different implementation where the input parameter is a .NET string instead of a char list.

let NC (input : string)
    if input.Length > 0     
        then Some(input.Chars(0), input.Substring(1))     
        else None

Since I'm calling methods on the input parameter, I need to explicitly tell F# what type it is. The F# type for this function is string -> (char * string) option, which is obviously different from the type definition of the original NC version. But F#'s type inference automatically adjusts to handle the change in the type so functions that call NC don't have to change. FP languages like F# handle list operations extremely efficiently, so this version of NC is probably a step in the wrong direction. However, it's good to know I can easily encapsulate the details our the parse buffer type away from the rest of my parsing code.

Here's another function I'll use in parsing, defined entirely in terms of NC.

let TOKEN token input = 
    let rec ParseToken token input = 
        match token, NC input with  
        | t :: [], Some(i, input) when i = t ->
            Some(input) 
        | t :: token, Some(i, input) when i = t ->
            ParseToken token input 
        | _ -> None 
    ParseToken (List.of_seq token) input

The TOKEN function checks to see if the token string is at the front of the input parse buffer. If so, it returns the remainder of the buffer after the token. If not, it returns None. It's defined entirely in terms of NC, so it works the same with both the versions of NC I've written so far. However, depending on the implementation of NC, I might rewrite TOKEN. For example, if I was using the string version of NC, I'd probably rewrite this function to use String.StartsWith instead of recursion.

TOKEN defines a local recursive function called ParseToken. It's very common to see local functions defined inside other functions in FP, similar to how classes can define local classes in OO languages like C#. ParseToken recursively compares the characters in the token string with the characters in the input buffer, finishing when either it reaches the end of the token string or there's a mismatch. By default, functions in F# can't call themselves recursively by default, so ParseToken is declared to be recursive by using "let rec" instead of simply "let".

ParseToken shows off something interesting about the way symbols are used in F#. Remember that F# values are immutable by default. Yet, the symbols "token" and "input" appear to change. In the match statement, token and input represent the values passed into the function. However, in the match clauses, token and input represent the tail of those two lists. Technically, the values didn't change, F# allows you to reuse the symbols. If I wanted to avoid reusing symbols, I could define ParseToken this way (though I find this much less readable):

let rec ParseToken token input = 
    match token, NC input with  
    | t :: [], Some(i, itail) when i = t ->  
        Some(itail) 
    | t :: ttail, Some(i, itail) when i = t ->  
        ParseToken ttail itail 
    | _ -> None

Other than declaring ParseToken, the TOKEN function is a single line of code. It simply calls ParseToken, converting the token parameter into a char list along the way. While lists are very efficient, which would you rather type?

let token = TOKEN ['f';'o';'o'] input
let token = TOKEN "foo" input

Personally, I like the second version. I'm sure there's a slight perf hit to convert from a string to a list, but frankly I value readability over performance so I used strings for tokens. TOKEN uses the List.of_seq function to do the conversion. Seq is F#'s name for IEnumerable. Technically, TOKEN would work with any IEnumerable<char> type. However, in my source code, it's always going to be a string.

I also used List.of_seq to define a helper function String to Parse Buffer (aka S2PB) that converts a string into a character list. I use function often in the test code.

let S2PB input = List.of_seq input

If I were to change the input type of NC, I'd also change the implementation of S2PB so that it still took in a string but returned whatever the correct input type for NC.

The one problem with S2PB function is that you have to use it with parentheses almost all the time. If I write NC S2PB "foo", F# can't tell if I'm calling NC with two parameters or passing the result of S2PB "foo" to NC. Since NC is strongly typed to have only one input parameter, you might have thought it could automatically determine the calling order, but it doesn't.

I could make the function calls explicit with parenthesis by writing NC (S2PB "foo"). F# also provides a pipe operator, so I could pipe the result of S2PB into NC by writing S2PB "foo" |> NC. I didn't like either of those solutions, so instead I defined a custom unary operator !! as an alias. The parameter binding rules are different for custom operators because I can write NC !! "foo" without piping or parenthesis.

let (!!) input = S2PB input

So now I've got three functions and a custom operator that completely abstract away the parse buffer type. As long a my parsing functions only uses these functions to access the parse buffer, I'll be able to later change the buffer type without affecting my parsing code at all.

Posted By Harry Pierson at 1:18 PM Pacific Standard Time

Monday, December 10, 2007

Practical Parsing in F#

I'm interested in parsing because I'm interested in Domain Specific Languages. F# is pretty good for internal DSLs, but internal DSLs are obviously limited by the syntax of the host language. If you want complete control over the language, you've got to build your own parser.

The defacto standard for parser development is Yet Another Compiler Compiler, or yacc. There's a version of yacc for .NET as well as one specifically for F#. However, I'm not a fan of yacc. Yacc parsers are specified using context-free grammar (aka CFG). But CFG's can be ambiguous - actually, it's nearly impossible to build an unambiguous CFG. Personally, I'm a big fan of Parsing Expression Grammars (or PEGs) which among other advantages makes it impossible to develop ambiguous grammars. Furthermore, PEGs don't require a separate lexical analyzer like lex, so I think they're more suitable for building modular compilers.

Since I like PEGs and F# so much, I developed a parser for the PEG grammar from the original PEG whitepaper using F#. The grammar is much simpler than a language like C#, but with twenty nine grammar productions it's certainly not trivial. The F# implementation is fairly straightforward backtracking recursive decent parser, which makes it easy to understand even if you're not a parser guru. It's also small - around 400 lines of code including comments. But I think the code illustrates both the general value of Functional Programming as well as the specific value of F#. Here's how the series is shaping up (though this is subject to change):

I was originally planning to post the code for the parser itself with this post. However, i find that I'm revising the code as I write the articles in this series, so I'm going to hold off for now. If you're really desperate, drop me a line and I'll see what I can do.

Update - Almost forgot, if you're going to follow along at home, I'm using the latest version of F#, v1.9.3.7. Note, the F# Downloads page on the MS Research is woefully out of date, so go to the MS Research Downloads page. Currently, it's the most recent release. It snaps into VS 2005 and 2008 plus has command line tools. If you're an VS Express user, Douglas Stockwell explained how to roll your own F# Express.

Much Later Update - The code is now available on my Skydrive.

Posted By Harry Pierson at 2:40 PM Pacific Standard Time
Change Congress
Recent Bookmarks
Tags .NET Framework (2) __clrtype__ (9) ADO.NET (5) Agile (7) AJAX (3) Architecture (288) Guidance (6) Interop (2) Modelling (61) Patterns (7) Process (4) SOA (94) Web Services (5) ASP.NET (25) Async Messaging (2) Azure (1) Battlestar Galactica (3) BI (2) BizTalk (4) Blogging (117) dasBlog (11) Podcasting (4) BPM (1) C# (11) C++ (4) Capitals (5) CardSpace (3) CLR (2) CodePlex (1) College Football (10) Comedy Central (1) Community (81) Concurrency (6) Consumer Electronics (1) Database (13) Debugger (23) Dependency Injection (2) Development (122) C Plus Plus (1) Embedded (5) Lanugages (42) Media (2) P2P (11) Rotor (1) SharePoint (6) SOP (3) DIY (1) DLR (25) Domain Specific Languages (15) Durable Messaging (5) Dynamic Languages (12) Dynamic Silverlight (1) Education (3) Enterprise 2.0 (1) Entertainment (14) ETech (15) F# (51) Functional Programming (17) Game Development (2) Guidance Automation (3) Hardware (8) HawkCodeBox (1) HawkEye (3) Health (1) Hockey (31) Home Electronics (1) Home Network (5) Hosting API (1) Humor (5) IASA (1) Idempotence (3) infrastructure (5) Instrumentation (4) Integration (2) IronPython (112) IronRuby (16) Java (2) Job (3) Kodu (1) LangNET (2) Lightweight Debugger (5) LINQ (23) Live Framework (3) Live Mesh (2) Lost (1) Master Data Management (1) Media 2.0 (6) Microsoft (31) MIX06 (2) Mobile Phone (1) Monads (5) Morning Coffee (172) Object Oriented (4) Office (5) Open Source (8) Open Space (2) Operations (3) Other (135) Art (1) Books (1) Family (33) Games (18) General Geekery (27) Home Theater (1) Movies (23) Music (20) Politics (3) Society (1) Sports (37) Working at MSFT (19) Parallel Programming (3) Parsing Expression Grammar (16) patterns & practices (2) PDC08 (5) Politics (48) Polyglot (3) PowerPoint (2) PowerShell (39) Presentation (7) Projects (1) HawkWiki (1) Pygments (5) Python (6) Quote of the Day (4) Refactoring (1) Research (2) REST (18) Reuse (5) Robotics (2) Rock Band (4) Rome (5) Ruby (23) Ruby on Rails (1) Sci-Fi (2) Scripting (4) Security (3) Service Broker (14) SharePoint (2) Silverlight (20) Social Software (1) Software + Services (2) Software Design (2) Software Engineering (1) Software Factories (11) Software Industry (1) Space Elevator (1) Spark (1) SQL Server (2) Stephen Colbert (1) TechEd (7) TechEd06 (1) TechRec League (1) Television (6) Travel (7) Unified Client (1) Unit Testing (4) USC (1) UX (1) Virtual PC (2) Visual Basic (3) Visual Studio (20) Volta (2) Washington Capitals (37) WCF (31) Web 2.0 (67) Web Services (7) WF (21) Windows (3) Windows Live (29) Windows Live Writer (3) WPF (8) Xbox (1) Xbox 360 (54) XML (11) XNA (15) Zune (4)
Disclaimer: The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion. Inappropriate comments will be deleted at the authors discretion.