Practical F# Parsing: Unit Testing

Now that I have functions to access the parse buffer, I better write some tests for them. Yes, I realize I should have written the tests first, but the articles flow better this way.

I’ve written before that one of the benefits to side-effect free functional programming is that it makes unit testing a breeze. But I still need some type of xUnit testing framework. I could write my own native F# xUnit framework, but given the availability of mature xUnit frameworks on .NET, I’d really rather just use one of them.

Traditionally, I’ve used NUnit or Visual Studio’s unit testing framework, but they’re designed to work with OO languages like C#. In order to use them from F#, we have to use the OO features of F#. Here’s an example of some F# unit tests using NUnit.

type [<TestFixture>] parser_tests =
    class
        new () = {}

        [<Test>]
        member this.test_NC() =
            let Some(c,text) = NC !!"test"  
            Assert.Equal(c, 't')
            Assert.Equal(text, !!"est")

        [<Test>]
        member this.test_NC_empty_string() =
            let ret = NC !!""  
            Assert.Equal(None, ret)
    end

While this works, there’s an awful lot of extraneous text needed to make this work. Test functions need to be methods on a Test Fixture class (note, F# uses [< >] to indicate attributes) and that class needs a default constructor. F# doesn’t add one by default, so we have to do so manually. And every test function needs to be marked with “member this”.

I’d really rather write tests that looks like this:

[<Test>]
let test_NC =
    let Some(c,text) = NC !!"test"  
    Assert.Equal(c, 't')
    Assert.Equal(text, !!"est")

[<Test>]
let test_NC_empty_string =
    let ret = NC !!""  
    Assert.Equal(None, ret)

That’s a lot more straightforward. If only I could write my test code like that…

It turns out there’s a new kid on the .NET unit testing block. xUnit.net is the brainchild of Jim Newkirk (one of the original NUnit developers) and Brad Wilson (aka the .NET Guy). Among other things, xUnit.net does away with the TestFixture attribute. All public methods in all public classes are checked for tests in xUnit.net.

Since we don’t need the TestFixture, does that mean I can write the tests as F# functions if I use xUnit.net? Not quite. xUnit.net only checks for public instance methods on the public classes in a test assembly. But F# functions get compiled as static methods. Luckily, xUnit.net is simple enough to change. I submitted a patch to xUnit.net that enables it to find both static and instance test methods (and to skip creating and disposing an object instance for static test methods). I’m hoping it will be accepted and ship as part of their next release. Until then, I’m running my own private build with those changes included.

Now that I’ve settled on a unit test framework, let’s look at some tests. For my parser solution, I have two projects: PegParser and PegParser.Tests. The tests project depends both on the PegParser assembly as well as xunit.dll, so I need to set a reference to both in my project. F# projects in VS don’t have the References node in the project tree, you have to either add the references on the project property page or directly within the code. Not sure which is better, but it’s easier to show the code syntax:

#R @"....xUnit.netMainxunitbinDebugxunit.dll"
#R @"..PegParserpegparser.dll"

open Xunit
open Parser

The #R compiler directive is used to reference an external assembly. F#’s open statement acts like C#’s using statement, so I can reference types without specifying their full namespace. You’ll notice that the parser is implemented in a dll called pegparser.dll while the namespace is simply Parser. Actually, it’s not really a namespace. If you open PegParser.dll in Reflector, you’ll notice that Parser is actually a type, and the functions are all implemented as static methods. F# hides that from you, though you’d have to know that if you wanted to invoke the parser from C# or VB.net. By default, F# uses the filename as the namespace/class name and I haven’t changed that default in my parser code (though I probably should).

Once we’ve referenced the correct assemblies, I need to write the tests. Here are two tests for NC (aka Next Char) function I wrote in the last post.

[<Fact>]
let test_NC_empty_string () =
    let ret = NC !!""
    Assert.Equal(None, ret)  

[<Fact>]
let test_NC_simple_string () =
    let Some(c,text) = NC !!"test"
    Assert.Equal(c, 't')
    Assert.Equal(text, !!"est")

You’ll notice this code is almost identical to my wish test code above. Almost. There are a few syntax changes – In xUnit.net, tests are called facts and Assert.AreEqual is simply Assert.Equal. I’ve also had to add empty parentheses after each test function name. Remember that functions in FP are like math functions. If there’s no independent value to pass in, the result of a math function is is constant. F# compiles parameter-less functions as static properties instead of a static methods. In order to make the test functions compile as static methods, we have to pass in at least one parameter. In F#, the unit type is the equivalent of the void type in C#. Unit has exactly one valid value – the empty parens. Adding the empty parens to the parameter list of the test functions ensures they get compiled into static methods.

Note, it’s really really easy to forget to add those empty parens. If you don’t add them, the code will still compile, but the tests won’t be found or run. I’ve been bit by that once already, so if you have a bunch of tests not running, make sure they have the empty parens!

So each test method feeds a parse buffer (converted from a string with the custom !! operator) into the NC function and makes assertions about the result. In the first test, NC should return None if the parse buffer is empty, so I simply compare the function result to None via Assert.Equals. In the second test, I use F#’s pattern matching capability to assign the result of NC to the value Some(c,text). Basically, this is doing simple pattern matching to bind the two value names to the two parts of the tuple returned from NC. Those two values are then asserted to be a specific values as you would expect.

Note, in the current version of F#, the line let Some(c,text) = NC !!"test" yields two warnings. The first (FS0062) warns that a future version of the language will require parens around Some(c,text). I sure hope they change their minds on this, since active patterns are already so parens-heavy. The second (FS0025) warns that this is an incomplete pattern match. If NC returns None, the pattern wont match and the function will throw a MatchFailureException. Of course, since these are unit tests, that’s exactly the behavior I want! Given the nature of these warnings, personally, I turn them both off (only in my unit tests, mind you). This is done via the #nowarn compiler directives at the top of the file.

#nowarn "25" //Turn off Incomplete Pattern Match warning
#nowarn "62" //Turn off Some contruct needs parens warning

Obviously, there are more tests than just these (though my total code coverage is pretty poor, shame on me) but they’re all pretty similar. As you can see, there’s tests are very straight forward. The nature of FP functions makes them fairly simple to test, and xUnit.net (with a couple of minor changes) makes it easy to write your unit tests in F#.

Practical F# Parsing: The Parse Buffer

The first thing I need for my F# parser is a type to represent the buffer of characters being parsed. For now, I’m using the simplest thing that could possibly work: an intrinsic F# character list. Lists are heavily used in functional languages, so they tend to have very efficient native list types. F# is no exception. Along with a native list type, F# has a native operation for retrieving the head of a list. For example, If you execute the following code, head will be bound to '1' and tail will be bound to the list ['2';'3';'4']

let head :: tail = ['1';'2';'3';'4']

The problem using the native list head operator is that my parsing code will be explicitly bound to work on character lists only. However, I’m not sure an in-memory character list is the best way to read long files of text off the disk for processing, so I’d rather not limit future options like this. So instead, I’ll define my own function to retrieve the head of the parsing buffer.

let NC input =
    match input with
    | head :: tail -> Some(head, tail)
    | _ -> None

Note, in all my F# code I use the #light syntax which means code blocks are indicated by significant whitespace indentation similar to Python.

NC stands for Next Character, though I gave it a short name since it’s used often. It’s basically wraps the native list head operator. If there’s at least one element in the list, the first clause matches and the function returns Some(head,tail). If the list is empty, the second clause matches and the function returns None. The use of Some and None means this function returns an F# option type, which is similar to .NET’s Nullable type. (head,tail) is a tuple – in this case, combining the head and the tail of the list together. Finally, the underscore in the second match clause is a wildcard, so that clause matches anything. I’m using it here like the default clause of a switch statement in C#.

The F# type for this function is a list -> ('a * 'a list) option. The 'a is a generic type parameter and the asterisk indicates a tuple. So NC takes a generic list, and returns an option tuple with a single generic element and a generic list. Even though the function is named Next Character, it would work with a list of any type.

Now that I have my own list head function, the rest of my parsing code can it. If I later decide to change the type of the parse buffer, I can change the implementation of NC – including the input and return types – without having to change the parsing functions that use NC. For example, here’s a different implementation where the input parameter is a .NET string instead of a char list.

let NC (input : string) =
    if input.Length > 0
        then Some(input.Chars(0), input.Substring(1))
        else None

Since I’m calling methods on the input parameter, I need to explicitly tell F# what type it is. The F# type for this function is string -> (char * string) option, which is obviously different from the type definition of the original NC version. But F#’s type inference automatically adjusts to handle the change in the type so functions that call NC don’t have to change. FP languages like F# handle list operations extremely efficiently, so this version of NC is probably a step in the wrong direction. However, it’s good to know I can easily encapsulate the details our the parse buffer type away from the rest of my parsing code.

Here’s another function I’ll use in parsing, defined entirely in terms of NC.

let TOKEN token input =
    let rec ParseToken token input =
        match token, NC input with
        | t :: [], Some(i, input) when i = t ->
            Some(input)
        | t :: token, Some(i, input) when i = t ->
            ParseToken token input
        | _ -> None
    ParseToken (List.of_seq token) input

The TOKEN function checks to see if the token string is at the front of the input parse buffer. If so, it returns the remainder of the buffer after the token. If not, it returns None. It’s defined entirely in terms of NC, so it works the same with both the versions of NC I’ve written so far. However, depending on the implementation of NC, I might rewrite TOKEN. For example, if I was using the string version of NC, I’d probably rewrite this function to use String.StartsWith instead of recursion.

TOKEN defines a local recursive function called ParseToken. It’s very common to see local functions defined inside other functions in FP, similar to how classes can define local classes in OO languages like C#. ParseToken recursively compares the characters in the token string with the characters in the input buffer, finishing when either it reaches the end of the token string or there’s a mismatch. By default, functions in F# can’t call themselves recursively by default, so ParseToken is declared to be recursive by using “let rec” instead of simply “let”.

ParseToken shows off something interesting about the way symbols are used in F#. Remember that F# values are immutable by default. Yet, the symbols “token” and “input” appear to change. In the match statement, token and input represent the values passed into the function. However, in the match clauses, token and input represent the tail of those two lists. Technically, the values didn’t change, F# allows you to reuse the symbols. If I wanted to avoid reusing symbols, I could define ParseToken this way (though I find this much less readable):

let rec ParseToken token input =
    match token, NC input with
    | t :: [], Some(i, itail) when i = t ->
        Some(itail)
    | t :: ttail, Some(i, itail) when i = t ->
        ParseToken ttail itail
    | _ -> None

Other than declaring ParseToken, the TOKEN function is a single line of code. It simply calls ParseToken, converting the token parameter into a char list along the way. While lists are very efficient, which would you rather type?

let token = TOKEN ['f';'o';'o'] input
let token = TOKEN "foo" input

Personally, I like the second version. I’m sure there’s a slight perf hit to convert from a string to a list, but frankly I value readability over performance so I used strings for tokens. TOKEN uses the List.of_seq function to do the conversion. Seq is F#’s name for IEnumerable. Technically, TOKEN would work with any IEnumerable type. However, in my source code, it’s always going to be a string.

I also used List.of_seq to define a helper function String to Parse Buffer (aka S2PB) that converts a string into a character list. I use function often in the test code.

let S2PB input = List.of_seq input

If I were to change the input type of NC, I’d also change the implementation of S2PB so that it still took in a string but returned whatever the correct input type for NC.

The one problem with S2PB function is that you have to use it with parentheses almost all the time. If I write NC S2PB "foo", F# can’t tell if I’m calling NC with two parameters or passing the result of S2PB "foo" to NC. Since NC is strongly typed to have only one input parameter, you might have thought it could automatically determine the calling order, but it doesn’t.

I could make the function calls explicit with parenthesis by writing NC (S2PB "foo"). F# also provides a pipe operator, so I could pipe the result of S2PB into NC by writing S2PB "foo" |> NC. I didn’t like either of those solutions, so instead I defined a custom unary operator !! as an alias. The parameter binding rules are different for custom operators because I can write NC !! "foo" without piping or parenthesis.

let (!!) input = S2PB input

So now I’ve got three functions and a custom operator that completely abstract away the parse buffer type. As long a my parsing functions only uses these functions to access the parse buffer, I’ll be able to later change the buffer type without affecting my parsing code at all.

ADO.NET Data Services and Idempotence

I was reading thru the ADO.NET Data Services Quickstart, because I wanted to understand how it support data updates. The quickstart uses the Customers table from the Northwind sample database, which unlike most of the other tables uses an nchar(5) value as the primary key. Categories, Employees, Orders, Products, Shippers and Suppliers all use an auto-increment integer field for their primary keys.

I only point this out because inserting into Customers is idempotent, while inserting into those other listed tables is not. Is it a coincidence that the ADO.NET data services team chose the Customers table for their quickstart? Maybe, but I doubt it. Regardless, making a non-idempotent insert call from the browser is a bad idea, if you care about Exactly Once.

Practical Parsing in F#

I’m interested in parsing because I’m interested in Domain Specific Languages. F# is pretty good for internal DSLs, but internal DSLs are obviously limited by the syntax of the host language. If you want complete control over the language, you’ve got to build your own parser.

The defacto standard for parser development is Yet Another Compiler Compiler, or yacc. There’s a version of yacc for .NET as well as one specifically for F#. However, I’m not a fan of yacc. Yacc parsers are specified using context-free grammar (aka CFG). But CFG’s can be ambiguous – actually, it’s nearly impossible to build an unambiguous CFG. Personally, I’m a big fan of Parsing Expression Grammars (or PEGs) which among other advantages makes it impossible to develop ambiguous grammars. Furthermore, PEGs don’t require a separate lexical analyzer like lex, so I think they’re more suitable for building modular compilers.

Since I like PEGs and F# so much, I developed a parser for the PEG grammar from the original PEG whitepaper using F#. The grammar is much simpler than a language like C#, but with twenty nine grammar productions it’s certainly not trivial. The F# implementation is fairly straightforward backtracking recursive decent parser, which makes it easy to understand even if you’re not a parser guru. It’s also small – around 400 lines of code including comments. But I think the code illustrates both the general value of Functional Programming as well as the specific value of F#. Here’s how the series is shaping up (though this is subject to change):

I was originally planning to post the code for the parser itself with this post. However, i find that I’m revising the code as I write the articles in this series, so I’m going to hold off for now. If you’re really desperate, drop me a line and I’ll see what I can do.

Update: Almost forgot, if you’re going to follow along at home, I’m using the latest version of F#, v1.9.3.7. Note, the F# Downloads page on the MS Research is woefully out of date, so go to the MS Research Downloads page. Currently, it’s the most recent release. It snaps into VS 2005 and 2008 plus has command line tools. If you’re an VS Express user, Douglas Stockwell explained how to roll your own F# Express.

Much Later Update: The code is now available on my Skydrive.

Morning Coffee 129

  • Short coffee this morning, as I’m home with a tweaked ankle.
  • I started playing Indigo Prophecy over the weekend. It’s an original Xbox game, released as part of the new Xbox Originals program. It has a good metacritic score (84), though apparently it wasn’t much of a retail success. I’m enjoying it, though it’s not very challenging. It’s more an interactive movie than a game. Good story, though.
  • The ASP.NET MVC preview dropped today, Scott Guthrie has the details. Scott Hanselman has a 40 minute how-to video and Phil Haack has severalarticles up already.
  • Speaking of ASP.NET MVC and Scott Guthrie, he’s got another post in his series on ASP.NET MVC. This time, he’s covering how to handle form input / POST data.
  • Erik Meijer has posted some of his thoughts on Volta. He’s one of the guys behind Volta, so it’s worth a good look. (via Dare Obasanjo)
  • Late Addition – the ASP.NET Extensions is more than just the MVC stuff. It also includes AJAX improvements, Silverlight support, ADO.NET Data Services and ASP.NET Dynamic Data Support. Data Services (formerly Astoria) let’s you easily expose your database via RESTful services. I think Dynamic Data Support used to be code named Jasper. It’s a “rich scaffolding framework” for ASP.NET. I assume that’s to compete w/ Ruby on Rails.