All the syntactic
productions
in my PEG parser, save one, have the exact same signature. They take in
a char list and return a char list option. Which is to say, they take a
parse buffer in and return either the remaining parse buffer on a
successful match or nothing on a failed match. The only exception is
EndOfFile which doesn’t return the remaining parse buffer because there
isn’t any buffer left to parse.
Now we’re moving on to look at the productions with semantic
implications. In Parsing Expression Grammars, there are eleven: Char,
Range, Class, Literal, Identifier, Primary, Sequence Item, Sequence,
Expression, Definition and Grammar. Like their syntactic brethren, these
semantic productions will all have a single char list input parameter.
However, they will all return some semantic value along with the
remaining parse buffer.
We’ll start with Char, since it’s the only semantic production that
doesn’t return a custom type:
///Char <- '\' [nrt'"[]\]
/// / '\' [0-2][0-7][0-7]
/// / '\' [0-7][0-7]
/// / '\' [0-7]
/// / !'\' .
let (|Char|_|) input =
let (|InRange|_|) upper input =
let i2c value = Char.chr(Char.code '0' + value)
let c2i value = Char.code value - Char.code '0'
match input with
| NC (c, input) when (i2c 0) <= c && c <= (i2c upper) ->
Some((c2i c), input)
| _ -> None
match input with
| TOKEN @"" (NC(c, input))
when List.exists (fun x -> x=c) ['n';'r';'t';''';'"';'[';']';'\'] ->
match c with
| 'n' -> Some('n', input)
| 'r' -> Some('r', input)
| 't' -> Some('t', input)
| _ -> Some(c, input)
| TOKEN @"" (InRange 2 (i1, InRange 7 (i2, InRange 7 (i3, input)))) ->
Some(Char.chr (i1 * 64 + i2 * 8 + i3), input)
| TOKEN @"" (InRange 7 (i1, InRange 7 (i2, input))) ->
Some(Char.chr (i1 * 8 + i2), input)
| TOKEN @"" (InRange 7 (i1, input)) ->
Some(Char.chr (i1), input)
| NC(c, input) when c <> '\' -> Some(c, input)
| _ -> None
Note, this production is slightly different from the one in the PEG
whitepaper. This way was easier to pattern match. Also, I typically
don’t wrap my when guards onto the next line, but this way it doesn’t
wrap funny on my blog.
While long, Char is fairly straight-forward. There are five ordered
choices that can match this production. The first is for escaped
characters, the next three are for character codes, and the last one is
matching any character except the backslash escape character. Note,
tracking F#’s escape characters and PEG’s escape characters can get
tricky. I’ve used verbatim strings for all my TOKEN parameters in order
to help try and keep it straight.
The escape character match clause uses a when guard to narrow down the
selection criteria. I use the built-in List.exists method to see if the
character is in a hard-coded list of special characters. List.exists
takes in a function parameter, and returns true if that function returns
true for any of the value is the list. Since I’m just matching a value,
my function parameter is a trivial equality test. If List.exists returns
true, I return that special character as part of the return tuple. Of
all the escape characters in PEG, only three are also escape characters
in F#, so I use a second match clause to return the correct char value.
There’s probably a way to do that more elegantly, but since there were
just three clauses, I figured it was easier to type them out manually.
For the character code clauses, I wrote a special local AP function
called InRange to determine if the specified character was within a
specified range and to convert it from a char to an int. Note, the way
the production is written, the largest character code you can specify is
277, which means you can encode slightly more than the standard UTF-8
character set. Honestly, this should be updated to support full UTF-16,
but I’m not here to critique the grammar, so I didn’t try to fix this
issue.
Note, all the results (save None) return a tuple of the matched
character value and the remaining input buffer. Again, all the remaining
productions will work like that. For example, here’s the Range
production:
///Range <- Char '-' Char / Char
let (|Range|_|) input =
match input with
| Char (c1, TOKEN "-" (Char (c2, input))) ->
Some(Range.Dual (c1, c2), input)
| Char (c1, input) ->
Some(Range.Single (c1), input)
| _ -> None
Compared to Char, Range is fairly simple. It’s either two chars,
separated by a hyphen (for example: a-z) or it’s a single char. Again,
being able to use Active Patterns to build on lower level productions is
a huge helper.
But what does this function return? What does Range.Single and
Range.Dual mean? Those are refer to a special F# construct called a
discriminated union. Before we can continue writing semantic
productions, we need to define these types to hold the results of these
productions.