Modular Compilers

During Lang.NET, I ended up sitting next to Hua Ming, who’s been working on the .NET Classbox project I wrote about previously. .NET Classbox introduces a new “using” syntax to C# – basically, you can use individual classes as well as whole namespaces, and you can extend the individual classes you use. Obviously, that meant building a custom compiler that was 99% vanilla C# plus the extra classbox syntax. Rather than building a C# compiler from scratch, the Classbox project extended the Mono Project’s C# compiler. Hua described the process as taking a “huge amount of time” and called the compiler “a monster”. Now, I’m not trying to knock Mono here – I imagine our C# compiler is just as hard to work with. SSCLI’s C# compiler directory alone is 5.5MB of source code spread across 126 .h and 68 .cpp files.

Is it just me, or does it seem crazy to have to muck about with such a large code base in order to add a relatively simple language feature? What I’d like to see is a more modular way of building compilers, so that integrating a small language feature like classbox would be a small amount of effort.

Of course, there is some work that’s been done in this space. MS Research had a Research C# compiler paper, but it’s three years old and one of the two authors has moved on to a cool product group job. I also discovered SUIF and the National Compiler Infrastructure Project, but these don’t look like they’ve been updated in a while.

I like the model that the Research C# compiler proposes. Basically, it looks like this:

  1. Specify the grammar in a modular way. In the paper, the grammar is specified in an Excel file, and you can use multiple files in a modular fashion – e.g. one file for the core language and another for the extensions.
  2. Late bind a grammar production to an action. Typically, in a lex/yacc-style scenario, you embed the action code for a given production directly into the grammar, which makes it extremely hard to extend the existing syntax. In the paper, each production is linked with an instance of a type, so swapping in a new type would seem to be possible.
  3. Generate an abstract syntax tree that gets processed by multiple visitors. Per the paper, the compiler breaks the “traditional” compiler steps – bind, typecheck, rewrite and generate binary (in this case IL) – into separate visitors. That makes adding extra steps or changing existing steps fairly straightforward.
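As a sketch of that third point (class and method names invented for illustration), separate visitor passes over a shared AST let you add or swap compiler stages without touching the tree classes or the other passes:

```python
# Hypothetical sketch: compiler stages as independent visitors over one AST.

class Num:
    def __init__(self, value): self.value = value
    def accept(self, visitor): return visitor.visit_num(self)

class Add:
    def __init__(self, left, right): self.left, self.right = left, right
    def accept(self, visitor): return visitor.visit_add(self)

class TypeCheck:
    """One pass: verify every node is well-typed."""
    def visit_num(self, node):
        assert isinstance(node.value, int), "expected an integer literal"
        return "int"
    def visit_add(self, node):
        assert node.left.accept(self) == node.right.accept(self) == "int"
        return "int"

class Evaluate:
    """A later pass, added without modifying TypeCheck or the AST classes."""
    def visit_num(self, node): return node.value
    def visit_add(self, node): return node.left.accept(self) + node.right.accept(self)

tree = Add(Num(1), Add(Num(2), Num(3)))
tree.accept(TypeCheck())              # passes silently if well-typed
result = tree.accept(Evaluate())      # evaluates the same tree
```

Adding a rewrite or IL-generation step is just another class with the same `visit_*` methods – nothing existing needs to change.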

The only thing I don’t like about this specific approach is their Excel-file-based parser generator. It’s a huge step beyond the LEX/YACC approach as it is scanner-less (having separate scanner and parser steps kills any chance of modularity), but it still has to deal with ambiguous grammars. Personally, I’ve been looking at Parsing Expression Grammars in part because they aren’t ambiguous. For programming languages, supporting ambiguity in the grammar is a bug, not a feature.
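To illustrate why PEGs can’t be ambiguous (a toy sketch, not any particular PEG tool): ordered choice commits to the first alternative that matches, so there is never more than one parse of a given input:

```python
# Minimal PEG combinators (invented for illustration). Each parser takes
# (text, pos) and returns the new position on success, or None on failure.

def literal(s):
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def choice(*alts):
    # Ordered choice: try alternatives left to right, commit to the first match.
    def parse(text, pos):
        for alt in alts:
            result = alt(text, pos)
            if result is not None:
                return result
        return None
    return parse

# "ab" / "a": a context-free grammar with both alternatives is ambiguous on
# "ab"; a PEG deterministically picks "ab" because it is tried first.
rule = choice(literal("ab"), literal("a"))
```

A yacc-style tool would report a conflict here; the PEG simply defines the answer by ordering.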

Lang.NET Is Helping Game Developers

Back at POPL 06, Tim Sweeney of Epic Games delivered a talk titled “The Next Mainstream Programming Language: A Game Developer’s Perspective”. I imagine he was a little too busy getting Gears of War out the door to attend the Lang.NET Symposium. Too bad, as there were interesting solutions presented that solved two of the issues Tim identified in his POPL talk.

One of the issues Tim identified is one of Modularity. Gears of War uses the Unreal Engine 3. In other words, UE3 is a game framework and GoW uses that framework. As you might expect, this framework is exposed as a hierarchy of objects. Tim’s example had “Actor” as the base class in the framework hierarchy, with classes like “Player”, “Enemy” and “InventoryItem” inheriting from “Actor”. Then he had game-specific classes like “Dragon” and “Sword” inheriting from the generic “Enemy” class. The problem is that game developers also need to extend the functionality of the framework’s base classes. That is, they need a game-specific version of “Actor” or “InventoryItem” in addition to game-specific subclasses like “Dragon” and “Sword”. Unfortunately, the current generation of languages doesn’t support this, so game developers often clone the entire framework, which is error-prone and hard to support.

At Lang.NET, Professor Markus Lumpe demonstrated an implementation of the Classbox concept for .NET. Classbox is essentially a solution to the modularity problem Tim identified. They’ve modified C#’s using syntax to apply to individual classes. When using a class in this fashion, you can add extensions to it like new methods and new fields. I’m not sure of the scope of these extensions – whether it’s the file with the using clause or the containing assembly – but it’s key to realize this is a local extension. The original framework isn’t modified at all. Within your assembly, the metadata for the extended classes is re-written to include the new extensions. So to use Tim’s example, if you extended the framework’s “Actor” class, it would create a YourGame.Actor class that inherited from Framework.Actor and contained your extensions. Then it would re-write the inheritance metadata (again, only for your assembly) so classes that inherited from Framework.Actor, such as Framework.Enemy and Framework.InventoryItem, now inherit from YourGame.Actor.
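As a rough analogy – not the actual Classbox implementation, and all names invented – the local rewrite can be pictured in Python: the framework classes stay untouched, and a local copy of a framework subclass is rebased onto the extended class:

```python
# Framework classes (untouched throughout):
class Actor:                      # stands in for Framework.Actor
    def tick(self): return "tick"

class Enemy(Actor):               # stands in for Framework.Enemy
    def attack(self): return "attack"

# Your local extension, i.e. YourGame.Actor:
class ExtendedActor(Actor):
    def save_state(self): return "saved"

# The "metadata rewrite": rebuild Enemy locally so that, within your code,
# it inherits from ExtendedActor instead of plain Actor.
LocalEnemy = type("Enemy", (ExtendedActor,), dict(vars(Enemy)))
```

A `LocalEnemy` now has both the framework’s `attack` and your `save_state`, while the original `Enemy` class is unchanged – which mirrors the assembly-local rewrite described above.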

Now I’m sure there are some nefarious uses of this type of inheritance tree hacking. But there are scenarios such as Tim’s Gaming Framework example where this behavior is exactly what you want. I spoke briefly to Markus and at length with Hua Ming, one of Markus’ grad students, about perhaps having a keyword indicating that a class is “classbox enabled” rather than allowing any class to be classboxed in this way. Looking forward to their future work.

Another issue Tim identified was Reliability. He called this problem “If the compiler doesn’t beep, my program should work”. He showed a very simple method that iterates an index array and transforms the associated vertex from a vertex array by a provided matrix. A simple function – four lines of code. Yet the compiler can’t detect null pointer dereferences or out-of-bound array accesses. Adding code to check those runtime conditions would easily double or triple the length of the function. While modern managed languages (C#/VB/Java) have made great strides in eliminating “random memory overwrites” (via type safety) and “memory leaks” (via garbage collection), they don’t help you with these other types of dynamic failures.
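Tim’s exact function isn’t in my notes, so here’s a hypothetical reconstruction in Python of the shape he described – a few lines of real work buried under the defensive checks the compiler can’t prove away:

```python
# Hypothetical reconstruction (details invented): transform each indexed
# vertex by a 3x3 matrix, with the runtime checks written out by hand.

def transform(indices, vertices, matrix):
    # --- defensive checks the compiler cannot verify statically ---
    if indices is None or vertices is None or matrix is None:
        raise ValueError("arguments must not be None")
    for i in indices:
        if not 0 <= i < len(vertices):
            raise IndexError(f"index {i} out of range")
    # --- the actual work: matrix * vertex for each indexed vertex ---
    return [
        tuple(sum(matrix[r][c] * vertices[i][c] for c in range(3))
              for r in range(3))
        for i in indices
    ]
```

The checking code is as long as the work itself, which is exactly the bloat Tim was complaining about.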

At Lang.NET, Microsoft researcher Mike Barnett demonstrated Spec#. Spec# is a superset of C# that catches these and other types of dynamic errors. The idea, in Mike’s words, is to better bridge developer intent and code by embedding certain specifications into the code itself. Furthermore, it uses a combination of static and data-flow analysis to detect the types of dynamic errors Tim described in his talk. So if you took Tim’s simple transform function and fed it into the Spec# compiler, it would warn you of the possible null pointer dereferences. Furthermore, you can eliminate this warning by specifying that callers may never pass you a null pointer. This is accomplished simply by adding an exclamation point to the type declaration. In other words, the vertex array method parameter would be declared “Vertex[]! vertices” to indicate you can’t pass in a null array. With Spec#, you can also specify method pre- and post-conditions, which can solve the out-of-bound array access issue, as well as object invariants, which specify the valid states an object instance can be in.
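Spec# enforces non-null types at compile time; as a loose runtime analogy only (the decorator and names below are invented for illustration, not Spec#’s mechanism), a non-null precondition might look like this:

```python
import inspect

def requires_non_null(*param_names):
    """Invented contract decorator: reject None for the named parameters."""
    def decorate(fn):
        def wrapper(*args, **kwargs):
            bound = inspect.signature(fn).bind(*args, **kwargs)
            for name in param_names:
                if bound.arguments.get(name) is None:
                    raise ValueError(f"precondition violated: {name} is None")
            return fn(*args, **kwargs)
        return wrapper
    return decorate

# Analogous to declaring the parameter "Vertex[]! vertices" in Spec#:
@requires_non_null("vertices")
def scale(vertices):
    return [v * 2 for v in vertices]
```

The crucial difference is that Spec# rejects the violating call before the program ever runs, whereas this sketch only fails at runtime.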

I didn’t see Tim give this presentation, I only saw the slides after the fact. But I get the feeling that one of Tim’s points is that game development is extremely cutting edge, and the issues they’re running into now will be mainstream issues in a few years. Good to see language researchers are already well on their way to solving these issues.

The only thing I worry about is when will these ideas make it into mainstream languages? And will they be extensions to existing languages – i.e. will C# 4.0 and VB 10 include classboxing and specifications – or will they be entirely new languages? How much can you improve a language by adding features until it collapses under its own weight?

More on Lang.NET

Jason Bock left me a comment that he’s covering Lang.NET over at his .NET Languages site. His coverage of day one is here. Looking forward to his coverage of day two and three!

Lang .NET 2006 Symposium

Yesterday, I attended the Lang .NET 2006 Symposium – basically a public version of the CLR Compiler Lab I went to back in March. Unfortunately, with my new job, I couldn’t attend all three days, but I did attend day one. Here are my thoughts on the various sessions.

Anders Hejlsberg – LINQ and C# 3.0

  • This was basically a rehash of his talk from the March Compiler lab. Makes sense as it was a new audience, but the “Query the Running Processes” demo is getting pretty old. Check out my notes from March for more details.

John Gough – Dealing with Ruby on the CLR

John is a professor from the Programming Languages and Systems group at Queensland University of Technology. They’re the ones building Ruby.NET. He’s also the author of Compiling for the .NET Common Language Runtime, a great if somewhat dated (i.e. .NET 1.0) book.

Much of John’s talk covered the ground that Jim Hugunin covered back in March around the difficulties of mapping dynamic languages to the static CLR. For example, most Ruby.NET objects are instances of Ruby.Object, with their link to a class – a Ruby.Class – managed by the Ruby.NET runtime rather than leveraging the CLR’s built-in class structure.
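A loose Python sketch (all names invented – this is not the actual Ruby.NET code) of that object model: a generic host object whose class link is data managed by the runtime, with method lookup walking the Ruby class chain instead of the host’s:

```python
class RubyClass:
    """Stands in for Ruby.Class: a runtime-managed method table with a superclass."""
    def __init__(self, name, methods, superclass=None):
        self.name, self.methods, self.superclass = name, methods, superclass
    def lookup(self, name):
        # Method lookup walks the Ruby class chain, not the host type system.
        if name in self.methods:
            return self.methods[name]
        if self.superclass is not None:
            return self.superclass.lookup(name)
        raise AttributeError(name)

class RubyObject:
    """Stands in for Ruby.Object: every Ruby value is one of these."""
    def __init__(self, ruby_class):
        self.ruby_class = ruby_class     # the runtime-managed class link
    def send(self, name, *args):
        return self.ruby_class.lookup(name)(self, *args)

animal = RubyClass("Animal", {"speak": lambda self: "..."})
dog = RubyClass("Dog", {"speak": lambda self: "woof"}, superclass=animal)
```

Since classes are plain runtime data here, Ruby features like reopening a class become simple dictionary updates – exactly the flexibility the static CLR class structure doesn’t give you directly.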

He didn’t spend much time talking about the really hard problems like continuations, which I was really hoping he would.

There are a series of “allied” tools coming out of this project which look really interesting in their own right:

  • PE File Reader/Writer – a managed component for reading and writing DLL and EXE files.
  • Gardens Point Parser Generator (GPPG) – a Yacc/Bison style parser generator, written in and generating C#
  • Gardens Point LEX (GPLEX) – companion to GPPG for generating C# scanners, a la LEX or Flex. Not released yet, but John indicated it would be available in the next couple of weeks.

Christopher Diggins – Cat Programming Language: A Functional Optimization Framework for the CLI

  • I’m fairly sure Christopher doesn’t present often. Otherwise he would have known that there’s no way to present 107 slides in 30 minutes.
  • Christopher had a hard time expressing why someone would use Cat, even when asked point blank by an audience member. Most of his 107 slides were describing various features of the language. I don’t know about the rest of the audience, but I got lost pretty quickly.
  • It’s too bad Christopher was so obtuse as a speaker, as Cat seemed pretty interesting. If you skip the first 78 slides (!) of his deck, you get to a slide named “Transformation Engine” which seems to be the primary reason for Cat’s existence. The idea seems to be to build a large number (Chris said potentially thousands) of little optimization transformations which are used to “prune” the tree during the binary generation stage of a compiler.
  • The only problem with this (other than the difficulty of following the presentation) is that I don’t think compiler optimization is a particularly useful area of study. I subscribe to “Proebsting’s Law” on this one: “Advances in compiler optimizations double computing power every 18 years.” This implies that programmer productivity is far more important than compiler optimization. Ruby is the latest great example of this phenomenon.

Mark Cooper – Page XML : An XML based domain specific language for developing web applications

  • Page XML is a DSL for building web apps. Unfortunately, it isn’t released yet and it was hard to get a sense of what a solution built with Page XML would look like from the individual features described on slides. But I was certainly intrigued.
  • As a DSL, Page XML needs to encode domain-specific abstractions. One example they provided that I thought was cool was their URL handling. Good URL design is an important usability feature. URLs in Page XML are split into constant and variable parts, so in a URL like mysite.com/news/somechannel/4, the “somechannel” and the “4” would be variable parts that would map into parameters that are passed to a page handler. Very cool.
  • There were a large number of what felt like small and simple yet eminently usable features. Too many for me to write down.
  • The only thing I didn’t like is the use of XML. Not only are domain-specific concepts like URLs encoded in XML, but also relatively mundane things like loops and if statements. This gets ugly really quickly. I imagine the creators of Page XML did this so they wouldn’t have to build their own parser, but it really hurts the usability of the language.
  • The last point really points to the need for a simple meta-language – a language for building languages. Lex/Yacc and their derivatives just don’t cut it. Ruby is good for building internal DSLs, but I’d like something faster and amenable to static typing, as well as something more lightweight for building external DSLs.
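Page XML isn’t released, so as a generic illustration only of the constant/variable URL-splitting idea described above (the pattern syntax here is invented, not Page XML’s):

```python
import re

def compile_route(pattern):
    # Turn "/news/{channel}/{id}" into a regex with a named group per
    # variable part; everything else is matched as a constant.
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern)
    return re.compile(f"^{regex}$")

def match(pattern, url):
    """Return the variable parts as a dict, or None if the URL doesn't fit."""
    m = compile_route(pattern).match(url)
    return m.groupdict() if m else None
```

With a route like `/news/{channel}/{id}`, the URL `/news/somechannel/4` yields the parameters `channel` and `id` for the page handler, matching the mysite.com example above.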

This post is long enough and I have “real” work to do (the downside of leaving evangelism! 😄 ). I’ll post about the afternoon sessions later.

Dynamics of Software Development

Speaking of Sam, he talks today about one-week iterations and having a “shippable” version of the code every week. Reminds me of Jim McCarthy’s classic book “Dynamics of Software Development”. One of the many rules from that book is “Get to a Known State and Stay There” – I’m sure that sounds familiar to Sam. Other classic rules include “Don’t Flip the Bozo Bit” and “When You Slip, Don’t Fall”.

I just noticed MS Press is putting Dynamics of Software Development back in print with a new cover and a DVD of the original “21 Rules” presentation that begat the book. It’s a great book – I highly recommend it as well as his other book “Software For Your Head“. Also, Jim and his wife Michelle are doing a podcast called The McCarthy Show. Subscribed.