The SQL Complexity Problem

I mentioned on the first day of the Compiler Dev Lab that Brian Beckman is a hoot. He’s also wicked smart. He posted about his demo from Monday where he demonstrates building indexes for use in LINQ queries. In his words:

In the terminology of relational databases, a “join” is, semantically, like a nested loop over a pair of lists (or tables) of records, saving only those where some certain fields match. Unless we do something smart, this could be very expensive. Imagine searching a database of a million DNA profiles for the closest match to a sample that has 10,000 DNA features (I have no idea whether those are real numbers: I just made them up, but they sound ballpark to me). A dumb join would search all 1 million profiles for each of the 10,000 features, resulting in 10 billion match tests, almost all of which will fail – by design, of course. That’s going to hurt.

The “something smart” is to build an index and search through that. Your database doesn’t have to be large at all for this to pay off. In fact, even with just a few records, it’s cheaper to build an index, use it, and throw it away than it is to do a nested loop.

He goes on to prove out his point about building an index. For his full dataset (joining 3053 cities with 195 countries) it is literally 65x slower not to build a one-off index. Even for smaller datasets, the time difference is less dramatic but still significant. For example, with 89 cities instead of 3053, it’s 3x slower not to build the index.

The reason I’m so interested in Brian’s post is because of my experiments with Ning. As you might recall, in trying to build a .NET version of Partisan Hacks, I found ASP.NET 2.0 to be significantly simpler than PHP (which Ning uses). However, building even the trivial SQL Express database for Partisan Hacks was a non-trivial exercise. Sure, I’ve done it many times before, but it seems strange that ASP.NET makes it so easy to build a site while SQL Server makes it so complex to build a database. If I was a novice user, I would never be able to build a database for my web site.

Why is this? I think that the simple app or amateur developer is simply not the target audience for SQL Server (even SQL Express). If you don’t know the difference between nvarchar(100) and varchar(max) you’re pretty much out in the cold when it comes to SQL Server. Their target audience appears to be enterprise databases that are cared for by enterprise database administrators. Databases with scores of tables and millions of rows. Great for them, bad for novice users who just want to persist their data somewhere quickly and easily.

Why can’t building my database be as simple as building my site?

Ning makes it easy to use their Content Store. You create an instance of a content object, you set properties (dynamic ones), you hit save. No fuss, no muss, no db schema. Sure is an easier model to understand and program to. In that regard, it blows away everything, even Ruby on Rails. RoR is pretty sweet, but it needs a real database schema on the back end in order to drive RoR’s guiding principle of “convention over configuration*“*. If there’s no DB schema to discover, I think much of the RoR model would break down. (but that may just be my lack of RoR experience talking)

I not sure what a simpler database system would look like, but one idea of mine is to use a schemaless database. Much of the complexity comes from having to define both an in memory as well as perseistant schema, as well as the translation between them. If you just stored managed .NET objects, you would eliminate the redundant schema specification. It’s not a fully fleshed out concept, but it is a start of an idea.

What other ideas would make persistant data significantly easier to work with?

Compiler Dev Lab – Scripting

Day Two of the Compiler Dev Lab was all about scripting. Iron Python was the primary focus of the day, but they also had Phalanger (Managed PHP) and Monad folks there as well.

  • I hadn’t realized just how performant these dynamic languages are on the CLR when compared to their native versions. The original version of Iron Python was 1.7x faster than the standard C implementation back in the summer of ’04. Now with CLR 2.0, that version is now 2x faster with out any code changes. The Phalanger folks said they are 2.5x faster than the native version of PHP (1.7x faster than PHP + the Zend Optimizer). That’s pretty impressive performance.
  • The IronPython folks are heavy users of the new DynamicMethod class from .NET 2.0. Otherwise known as Lightweight Code Generation, DynamicMethod allows you emit a static function but have it get garbage collected when it’s no longer needed. IP almost never generates new classes, since new types can’t be garbage collected. The only times they generate actual classes are when you inherit from an existing .NET class or when you generate a new delegate type.
  • It’s really hard to serve the dual masters of both the existing language community and the .NET community. Jim Hugunin used the example of String.Trim(). A .NET developer would expect String.Trim() to “just work”. A Python developer would expect that to throw an AttributeError exception (the Python equivalent of Trim is strip). How do you handle this? In IP, it defaults to pure Python mode, but if you enter “import clr”, you move into .NET hybrid mode.
  • One of the typical features of dynamic languages is the ability to change the base class of an object on the fly. Jim demoed this with WPF. He created a class that inherited from one type of panel and then set the __class__ property of the object to a different panel and the display changed immediately. Freaky, but cool.
  • Jim showed a demo of a WPF app that hosted Python for extensibility. One of the scripts in turn hosted Python to create an interactive console for the app. Having a scripting engine that can host itself is awesome.
  • The VSIP SDK CTP (reg required) includes an sample lanugage integration project for Iron Python. So you can get both the source into IP language itself as well as the source to the integration into Visual Studio.
  • I got an email yesterday from someone asking about the possibility of Visual Ruby.NET. I haven’t heard anything about it, but it would be cool to see Ruby on Rails runing under CLR. John Lam is working on RubyCLR, but my understanding is that is a bridge between the CLR and the Ruby runtime, not a CLR implemenation of the Ruby runtime. (IP is a CLR implementation of the Python runtime.) I’m thinking that there are some similarities between Ruby and Python, so having the source of IronPython would be a huge help in building a Visual Ruby implementation. For example, both Ruby and Python have closures. IP has a FunctionEnvironment class which is used to lift stack variables onto the heap in a variety of scenarios, including closures. So if I was building Visual Ruby, having access to the FunctionEnvironment class would be a good start.
  • I said yesterday that I need to learn more about F#. They showed a video of an internal F# presentation, but I spent most of my time cracking jokes with Sam Gentile who’s in town for an SC-BAT workshop.
  • I didn’t pay enough attention to the Monad presentation. 😦

Compiler Dev Lab – LINQ

Even though I haven’t finished my ETech postings, I’m already onto another event. This week, thanks to an invite from Michael Lehman, I’m sitting in on a Compiler Lab discussing implementing other languages for CLR. The first day was about LINQ. Much of the info is rehashed from PDC or the docs up on MSDN. However, I have learned a few new things.

  • One of the standard features of LINQ is Extension Methods. That enables you to declare a static method like “static void Foo(this string source)” and then use it like “stringvar.Foo()”. Apparently, they are considering adding other types of extension members including properties and fields. The idea of extension fields is somewhat scary but powerful.
  • LINQ uses something Anders called deferred query execution. The query isn’t executed until the values are asked for (typically by calling foreach on the query). That means you can compose queries to your hearts content with no perf impact until you actually invoke the query.
  • Query Comprehensions in C# and VB is a pattern implementation in a similar vein to foreach. Foreach is relatively simple shorthand for iterating through an collection by calling IEnumerator.MoveNext until it returns false. While LINQ enables arbitrary composition of queries, there is obvious gravitational pull towards the SELECT / FROM / WHERE / ORDER BY / GROUP BY approach favored by SQL. So if you build your own query operator, you can include it in a LINQ query, but C# and VB won’t be able to include it in the Query Comprehension syntax. Probably not a big deal, given the breadth of standard query operators as well as the deferred query execution, but it’s good to understand how the abstraction works.
  • I want to know more about how DLinq is implemented. I’ve been refining my thinking about data since working with Ning’s content store and I’m convinced of the need for a simplified datastore. SQL is designed for significantly complex database schemas, which means a significantly complex development environment.
  • I’m looking much more closely at VB, given the new features in VB 9.0. Not only the LINQ stuff from C# like type inference, extension methods and anonymous types but also VB specific stuff like XML Literals and Duck Typing. Combined with VB’s existing support for late binding, there are compelling features to make VB attractive over C#.
  • I’ve been hanging out with Brian Beckman. He’s a hoot.
  • I think I need to take a deeper look at F#.

ASP.NET Trust Levels

For reasons to be named later, I’m experimenting with the various trust levels of ASP.NET. While “most things” work fine when you ratchet down the security, if finding that the things that break aren’t well documented. For example, at anything other than Full Trust you can’t use the Response.OutputStream.Write() method to write binary information to the response buffer. So that means that using ASHX Handlers for images doesn’t work. That means that, among others, the Personal Web Site starter kit breaks on any photo related features.

Also, does anyone know what happened to permview in .NET 2.0?

The Next Mainstream Programming Language

Terra Nova is not the usual place I go to get news around programming language improvements. But they linked to a great presentation from POPL 2006 by Tim Sweeney of Epic Games. Tim’s talk is called The Next Mainstream Programming Language: A Game Developer’s Perspective and it talks at great length the major issues facing game developers today. As Nate Combs at Terra Nova remarked, most of these issues are not specific to the game industry, but will likely be seen there first.

Most interesting (to me) was the issue of concurrency. Tim uses Gears of War for all his examples. Of course, Gears of War is an Xbox 360 exclusive. Xbox 360, as many of you probably know, has three hyper-threaded CPUs for a total capactiy of six hardware threads. Herb Sutter talked about this in his DDJ article The Free Lunch Is Over. Tim points out – rightly so – that “C++ is ill-equipped for concurrency”. C#, Java and VB aren’t much better. Tim conculdes that we’ll need a combination of effects-free non-imperative code (which can safely be executed in parallel) and software transactional memory (to manage parallel modifications to system state).

Tim also touches on topics of performance, modularity and reliability. And he has an eye on the practical at all times. For example, he points out that even a four times performance overhead of software transactional memory is acceptable, if it allows the code to scale to many threads.

Anyway, it’s a great read so check it out. Also, MS Research has a software transactional memory project you can download if you’re so inclined.