ActiveRecord::Migration

When I wrote about the Dual Schema problem a few weeks ago, I specifically wrote that the Rails model is backwards because it derives the in-memory schema from the database schema. While I still believe that, Rails’ ActiveRecord::Migration library does make it significantly easier to manage the database from Ruby code. For those not familiar, migrations are a series of Ruby script files that define the database schema. Each migration script contains an up and a down method, so you can migrate forward and backward through the history of your project. The library also provides easy-to-use abstractions such as create_table and add_column so you don’t have to geek out on SQL syntax (unless you want to). Once you have a collection of these scripts, simply calling rake migrate will bring your database instance up to the current schema (rake is Ruby’s equivalent of make). Or you can set your database to a specific version of the schema by running rake migrate VERSION=X.
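To make this concrete, here’s a minimal sketch of what a migration script might look like (the people table and its columns are invented for illustration):

class CreatePeople < ActiveRecord::Migration
  # up runs when migrating forward (rake migrate)
  def self.up
    create_table :people do |t|
      t.column :name,    :string
      t.column :address, :string
      t.column :salary,  :integer
    end
  end

  # down undoes up, so rake migrate VERSION=X can roll backward
  def self.down
    drop_table :people
  end
end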

I wonder why the Rolling on Rails tutorial uses the database tools directly instead of ActiveRecord::Migration. I’m thinking it wasn’t available when the tutorial was written. Whatever the reason, they really should update the tutorial to reflect the current state of Rails.

The Dual Schema Problem

A few months ago, Ted Neward wrote a great article about the history of the Object Relational Impedance Mismatch problem and how LINQ addresses it in a new way. Basically, LINQ introduces new language abstractions and complementary libraries that make queries a first-class concept within the language. However, I don’t believe the O/R Impedance Mismatch is the whole problem. More specifically, it’s a follow-on to what I would call the Dual Schema problem.

In a nutshell, the Dual Schema problem is that you have to design and implement two separate versions of your persistent entities. There’s the in-memory version, typically written in an OO language like C# or Java. Then there’s the on-disk version, typically written in SQL. Regardless of the difficulties translating between the two versions (i.e. the aforementioned impedance mismatch), you first have to deal with the complexity of keeping the two versions in sync. While LINQ does a great job eliminating much of the friction of translating between on-disk and in-memory formats, it could go much further by eliminating the need for translation in the first place.

A variety of solutions to the Dual Schema problem have evolved, primarily outside the hallowed halls of enterprise vendors (i.e. MS and others like us). One such solution is Ruby on Rails. In a Rails environment, I simply declare the existence of a given persistent entity:

class Person < ActiveRecord::Base
end

The ActiveRecord base class (a standard part of Rails) will dynamically create methods and attributes on the Person object at runtime, based on the schema of the People table in the database. (Rails is smart enough to understand English plurals, hence the automatic connection of Person and People.) So technically there are still two schemas, but the in-memory version is automatically derived from the on-disk version.
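To see what that buys you, here’s a hedged usage sketch, assuming the People table has name and address columns:

p = Person.new
p.name = 'John Smith'          # accessor generated at runtime from the name column
p.address = '1 Guildhall St.'  # likewise for the address column
p.save                         # ActiveRecord issues the INSERT for you

someone = Person.find_by_name('John Smith')  # dynamic finder derived from the column

Note that nothing in the Person class itself mentions name or address; the accessors and the finder both fall out of the database schema.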

(Note, DLinq provides a conceptually similar tool – SqlMetal – that can generate the static types from a given database schema. However, as static types they have to be defined at compile time. So while SqlMetal reduces the effort to keep schemas in sync, it doesn’t eliminate it the way Rails does.)

By slaving the object schema to the database schema, Rails essentially solves the Dual Schema problem. The problem with the Rails approach is that defining a database schema requires a significant amount of skill and effort; defining classes is typically trivial in comparison. The fact that Rails allows you to implement a persistent entity with almost no code doesn’t help you much if you have to write and maintain a ton of SQL code to define your database schema.

I believe the Rails model is actually backwards. It would be much better if developers could define their persistent entities in code and slave the database schema to the object model, instead of the other way around.

Of course, this approach isn’t exactly news. In his article, Ted writes of the rise and fall of OO database management systems, which were supposed to solve the Dual Schema and Impedance Mismatch problems. I’m certainly not suggesting a return to the heyday of the OODBMS. However, one of the reasons Ted points out for the failure of OODBMS was that big companies were already wedded to RDBMS. But those big companies are the short head. As you move down the long tail of software, the relational database as the primary storage paradigm makes less and less sense. For the vast majority of applications, relational databases are overkill.

Ted’s other point about OODBMS is that loose coupling between the data store and the in-memory representation is a feature, not a flaw. He’s totally right. But can’t we advance the state of the art in database typing to the level of modern-day OO languages? How about eliminating anachronisms like fixed-length strings? What if we derived the database schema from the object model (Rails in reverse, if you will) while keeping it loosely coupled enough to allow for schema evolution?
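To illustrate what Rails in reverse might look like, here’s a toy sketch in Ruby. Everything in it is hypothetical (no such library exists), and the type mapping is deliberately naive:

# Hypothetical: declare attributes and their types on the class...
class Person
  ATTRIBUTES = [[:name, :string], [:address, :string], [:salary, :integer]]
  attr_accessor(*ATTRIBUTES.map { |name, _type| name })
end

# ...then derive the DDL from the class instead of the other way around.
SQL_TYPES = { :string => 'varchar(255)', :integer => 'integer' }

def schema_for(klass)
  cols = klass::ATTRIBUTES.map { |name, type| "#{name} #{SQL_TYPES[type]}" }
  "create table #{klass.name.downcase}s (#{cols.join(', ')})"
end

puts schema_for(Person)
# prints: create table persons (name varchar(255), address varchar(255), salary integer)

A real version would pluralize Person to People the way Rails does, and would diff the generated DDL against the live database to support the loosely coupled schema evolution Ted argues for.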

An example of this code-centric model for data storage is Consus. It’s the work of Konstantin Knizhnik, who has written a bunch of open source, object-oriented and object-relational databases across a wide variety of languages and execution environments, including the CLR. Consus is actually written in Java, but he provides a version compiled for .NET using Visual J#. Consus lets you define your data either as tables or as objects. So you can do this:

Statement st = db.createStatement();
st.executeUpdate("create table Person (name string, address string, salary bigint)");
st.executeUpdate("insert into Person values ('John Smith', '1 Guildhall St.', 75000)");
ResultSet rs = st.executeQuery(
    "select name, address, salary from Person where salary > 100000");

Or you can do this:

class Person {
    String name;
    String address;
    long salary;
    Person(String aName, long aSalary, String aAddress) {
        name = aName;
        salary = aSalary;
        address = aAddress;
    }
}

Person p = new Person("John Smith", 75000, "1 Guildhall St.");
ConsusStatement st = db.createStatement();
st.insert(p);
ConsusResultSet cursor = (ConsusResultSet)st.executeQuery(
    "select from Person where salary > 100000");

Consus also handles OO concepts like derivation and containment. Of course, the embedded queries are ugly, but you could imagine DLinq-style support for Consus. In fact, one of the primary issues with Consus is that it supports both object and tuple style queries. When you explicitly request columns (i.e. “select name, address, salary from Person”), you’ve got a tuple-style query. When you don’t (i.e. “select from Person”), you’ve got an object-style query. Of course, the issues with tuple-style queries are well documented in Ted’s article, and they are exactly the problem that LINQ is designed to solve.

(Konstantin, if you’re reading this, drop me a line and I’ll look into getting you hooked up with the LINQ folks if you’re interested in adding LINQ support to Consus.NET.)

The tradeoff between the Rails approach and the Consus approach is one of performance. I have a ton of respect for Konstantin and the work he’s done on Consus and the other OO and OR databases available from his site. However, I’m sure the combined developer forces at major database vendors like Microsoft (and other DB companies) mean that SQL Server (and the like) will outperform Consus by a significant margin, especially on large-scale databases. So if execution performance is your primary criterion, the Ruby on Rails approach is better (leaving aside discussion of the Ruby runtime itself). However, in the long run execution performance matters much less than developer productivity. So for all the current interest in Rails, I believe a Consus-style model will eventually become dominant.

The SQL Complexity Problem

I mentioned on the first day of the Compiler Dev Lab that Brian Beckman is a hoot. He’s also wicked smart. He posted about his demo from Monday where he demonstrates building indexes for use in LINQ queries. In his words:

In the terminology of relational databases, a “join” is, semantically, like a nested loop over a pair of lists (or tables) of records, saving only those where some certain fields match. Unless we do something smart, this could be very expensive. Imagine searching a database of a million DNA profiles for the closest match to a sample that has 10,000 DNA features (I have no idea whether those are real numbers: I just made them up, but they sound ballpark to me). A dumb join would search all 1 million profiles for each of the 10,000 features, resulting in 10 billion match tests, almost all of which will fail – by design, of course. That’s going to hurt.

The “something smart” is to build an index and search through that. Your database doesn’t have to be large at all for this to pay off. In fact, even with just a few records, it’s cheaper to build an index, use it, and throw it away than it is to do a nested loop.

He goes on to prove out his point about building an index. For his full dataset (joining 3053 cities with 195 countries) it is literally 65x slower not to build a one-off index. Even for smaller datasets, the time difference is less dramatic but still significant. For example, with 89 cities instead of 3053, it’s 3x slower not to build the index.
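You can see the shape of Brian’s point in a few lines of Ruby; this sketch (with invented data) contrasts the dumb nested loop with the throwaway index:

countries = [['US', 'United States'], ['FR', 'France']]
cities    = [['Seattle', 'US'], ['Paris', 'FR'], ['Portland', 'US']]

# Dumb join: scans the full countries list once per city
dumb = cities.map do |city, code|
  pair = countries.find { |ccode, _name| ccode == code }
  [city, pair[1]]
end

# Smart join: build a one-off hash index, then do one lookup per city
index = {}
countries.each { |code, name| index[code] = name }
smart = cities.map { |city, code| [city, index[code]] }

At Brian’s full scale (3,053 cities by 195 countries), the nested loop performs roughly 595,000 match tests, while the indexed version pays 195 inserts up front and then does one cheap lookup per city.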

The reason I’m so interested in Brian’s post is my experiments with Ning. As you might recall, in trying to build a .NET version of Partisan Hacks, I found ASP.NET 2.0 to be significantly simpler than PHP (which Ning uses). However, building even the trivial SQL Express database for Partisan Hacks was a non-trivial exercise. Sure, I’ve done it many times before, but it seems strange that ASP.NET makes it so easy to build a site while SQL Server makes it so complex to build a database. If I were a novice user, I would never be able to build a database for my web site.

Why is this? I think the simple app or amateur developer is simply not the target audience for SQL Server (even SQL Express). If you don’t know the difference between nvarchar(100) and varchar(max), you’re pretty much out in the cold when it comes to SQL Server. The target audience appears to be enterprise databases cared for by enterprise database administrators. Databases with scores of tables and millions of rows. Great for them, bad for novice users who just want to persist their data somewhere quickly and easily.

Why can’t building my database be as simple as building my site?

Ning makes it easy to use their Content Store. You create an instance of a content object, you set properties (dynamic ones), you hit save. No fuss, no muss, no db schema. Sure is an easier model to understand and program to. In that regard, it blows away everything, even Ruby on Rails. RoR is pretty sweet, but it needs a real database schema on the back end in order to drive RoR’s guiding principle of “convention over configuration”. If there’s no DB schema to discover, I think much of the RoR model would break down. (But that may just be my lack of RoR experience talking.)

I’m not sure what a simpler database system would look like, but one idea of mine is to use a schemaless database. Much of the complexity comes from having to define both an in-memory and a persistent schema, as well as the translation between them. If you just stored managed .NET objects, you would eliminate the redundant schema specification. It’s not a fully fleshed out concept, but it’s the start of an idea.
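As it happens, Ruby’s standard library ships a tiny version of this idea: PStore persists plain Ruby objects to a file with no schema at all. A minimal sketch (the Hack class is invented):

require 'pstore'

class Hack
  attr_accessor :title, :slant
  def initialize(title, slant)
    @title, @slant = title, slant
  end
end

store = PStore.new('hacks.pstore')
store.transaction do
  store[:first] = Hack.new('Some partisan hack', 'left')  # no schema, just save the object
end

store.transaction(true) do  # true means read-only
  puts store[:first].title
end

It obviously doesn’t scale or support real queries, but it shows how little ceremony object persistence requires when there’s no second schema to maintain.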

What other ideas would make persistent data significantly easier to work with?