Passion * Technology * Ruthless Competence

Thursday, December 13, 2007

Practical F# Parsing: Syntactical Productions (1)

Now that I have my parse buffer functions and unit test framework taken care of, it's time to write some parsing expression grammar productions [1]. In any grammar, there are productions that have semantic meaning and productions that are only used to specify syntax. For example, in C# statements end with a semicolon and code blocks are delineated with curly braces. Syntactical elements don't end up in the abstract syntax tree, so they're easier to implement.

Most of PEG's grammar productions are syntactical in nature. Let's start with the simplest production, EndOfFile.

///EndOfFile <- !.
let EndOfFile input = 
    match NC input with
    | None -> Some()
    | _ -> None

Similar to the underscore in F#, the period represents any character In the PEG grammar and the exclamation point is a backtracking not. In other words, EndOfFile succeeds if NC fails, which only happens when the parse buffer is empty. Since the parse buffer is empty on a successful match, there's no value to return so Ireturn Some(). Some() is kinda like Nullable<void>, which doesn't make much sense in C#. But here it means I have a successful match but no data to pass back to the caller.

Here's a slightly more complex grammar production, EndOfLine:

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input = 
    match NC input with
    | Some('\r', input) ->
        match NC input with
        | Some ('\n', input) -> Some(input)
        | _ -> Some(input)
    | Some('\n', input) -> Some(input)
    | _ -> None

EndOfLine has three possible matches: \r\n, \n and \r. Here, you see the use of nested match statements in the \r match clause. If the top character is \r, the function checks the top of the remaining text buffer for \n. F#'s ability to reuse symbols comes in very handy here.

Unfortunately, this isn't code very readable. In particular, the match priority order, which matters in PEGs, is totally lost. For this production, it's no big deal, but that won't always be the case. I'm also not using the handy TOKEN function I wrote in a few posts ago. Here's an alternative version that uses TOKEN and preserves the match priority order.

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input = 
    match TOKEN "\r\n" input with
    | Some(input) -> Some(input)
    | _ -> 
        match TOKEN "\n" input with
        | Some(input) -> Some(input)
        | _ -> 
            match TOKEN "\r" input with
            | Some(input) -> Some(input)
            | _ -> None

This time, I use the TOKEN function and nested match statements to chain the results together. The previous method is probably faster, but this approach is more readable and explicitly preserves the match order. However, it sure is a pain to type. Wouldn't it be great if we could write something like this:

///EndOfLine <- '\r\n' / '\n' / '\r'   
let EndOfLine input =  
    match input with 
    | TOKEN "\r\n" (input) -> Some(input) 
    | TOKEN "\n" (input) -> Some(input) 
    | TOKEN "\r" (input) -> Some(input) 
    | _ -> None 

Turns out, F#'s Active Patterns feature let's me implement EndOfLine exactly like this. We'll look at how it works in the next post.


[1] The full PEG grammar is listed in "Parsing Expression Grammars: A Recognition-Based Syntactic Foundation" by Brian Ford.

Posted By Harry Pierson at 11:08 AM Pacific Standard Time
Comments are closed.

PDC08

patterns & practices
Summit 2008

Øredev

Change Congress
Recent Bookmarks
Tags .NET Framework (2) ADO.NET (5) Agile (7) AJAX (3) Architecture (284) Guidance (6) Interop (2) Modelling (61) Patterns (7) Process (4) SOA (93) Web Services (5) ASP.NET (24) Battlestar Galactica (3) BI (2) BizTalk (4) Blogging (115) dasBlog (11) Podcasting (4) BPM (1) C# (10) C++ (4) Capitals (5) CardSpace (3) CLR (2) College Football (10) Comedy Central (1) Community (81) Concurrency (6) Consumer Electronics (1) Database (13) Dependency Injection (2) Development (117) C Plus Plus (1) Embedded (5) Lanugages (37) Media (2) P2P (11) Rotor (1) SharePoint (6) SOP (3) DIY (1) DLR (14) Domain Specific Languages (13) Durable Messaging (5) Dynamic Languages (10) Dynamic Silverlight (1) Education (3) Enterprise 2.0 (1) Entertainment (14) ETech (15) F# (51) Functional Programming (17) Game Development (2) Guidance Automation (3) Hardware (8) HawkEye (3) Hockey (29) Home Electronics (1) Home Network (5) Humor (5) IASA (1) Idempotence (3) infrastructure (5) Instrumentation (4) Integration (2) IronPython (27) IronRuby (11) Java (2) Job (3) LINQ (19) Live Mesh (2) Lost (1) Master Data Management (1) Media 2.0 (6) Microsoft (29) MIX06 (2) Mobile Phone (1) Monads (5) Morning Coffee (172) Object Oriented (4) Office (5) Open Source (5) Open Space (2) Operations (3) Other (135) Art (1) Books (1) Family (31) Games (18) General Geekery (26) Home Theater (1) Movies (23) Music (20) Politics (3) Society (1) Sports (37) Working at MSFT (15) Parsing Expression Grammar (16) patterns & practices (2) PDC08 (2) Politics (42) PowerPoint (2) PowerShell (33) Presentation (5) Projects (1) HawkWiki (1) Python (4) Quote of the Day (4) Refactoring (1) Research (2) REST (18) Reuse (5) Robotics (2) Rome (5) Ruby (23) Ruby on Rails (1) Sci-Fi (2) Scripting (4) Security (3) Service Broker (14) SharePoint (2) Silverlight (18) Social Software (1) Software + Services (2) Software Design (1) Software Factories (11) Software Industry (1) Spark (1) SQL Server (2) Stephen Colbert (1) TechEd (7) TechEd06 (1) TechRec League (1) Television (6) Travel (6) Unified Client (1) Unit Testing (4) UX (1) Virtual PC (2) Visual Basic (1) Visual Studio (20) Volta (2) Washington Capitals (34) WCF (31) Web 2.0 (65) Web Services (5) WF (21) Windows Live (23) Xbox (1) Xbox 360 (53) XML (7) XNA (14)
Disclaimer: The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my opinion. Inappropriate comments will be deleted at the authors discretion.