The Durable Messaging Debate Continues

Last week, Nick Malik responded to Libor Soucek’s advice to avoid durable messaging. Nick points out that while both durable and non-durable messaging require some type of compensation logic (nothing is 100% foolproof because fools are so ingenious), the durable messaging compensation logic is significantly simpler.
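
To make that difference concrete, here’s a minimal sketch (my own Python illustration, not anything from Nick’s post, and every name in it is hypothetical) of the delivery bookkeeping a non-durable sender has to carry around:

```python
import time
import uuid

# Non-durable transport: the sender owns delivery compensation. It has
# to track every in-flight message, time out, resend, and count on the
# receiver to deduplicate.

pending = {}  # msg_id -> (message, last_send_time)

def send_nondurable(transport_send, message):
    msg_id = str(uuid.uuid4())
    pending[msg_id] = (message, time.time())
    transport_send(msg_id, message)  # the transport may silently drop this
    return msg_id

def on_ack(msg_id):
    # Only an explicit ack ends the sender's responsibility.
    pending.pop(msg_id, None)

def resend_timed_out(transport_send, timeout_secs=30.0):
    # Anything unacknowledged past the timeout gets resent, which is
    # why the receiver also needs duplicate detection.
    now = time.time()
    for msg_id, (message, sent_at) in list(pending.items()):
        if now - sent_at > timeout_secs:
            pending[msg_id] = (message, now)
            transport_send(msg_id, message)

# Durable transport: the queue persists the message before the send
# completes, so all of the above disappears from application code.
# What's left to compensate for are business failures, not delivery.
```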

This led to a very long conversation over on Libor’s blog. Libor started by clarifying his original point, and then the two of them went back and forth chatting in the comments. It’s been very respectful; Libor calls both Nick and me “clever and influential”, though he also thinks we’re wrong on this durable messaging thing. In my private emails with Libor, he’s been equally respectful and his opinion is very well thought out, though obviously I think he’s the one who’s wrong. 😄

I’m not sure how much is clear from Libor’s public posts, but it looks like most of his recent experience comes from building trading exchanges. According to his about page, he’s been building electronic trading systems since 2002. While I have very little experience in that domain, I can see very clearly how the highly redundant, reliable multicast approach he describes would be a very good, if not the best, solution.

But there is no system inside Microsoft IT that looks even remotely like a trading exchange. Furthermore, I don’t think approaches for building a trading exchange generalize well. So that means Nick and I have very different priorities than Libor, something that seems to materialize as a significant amount of talking past each other. As much as I respect Libor, I can’t shake the feeling that he doesn’t “get” my priorities and I wouldn’t be at all surprised if he felt the same way about me.

The biggest problem with his highly redundant approach is the sheer cost when you consider the large number of systems involved. According to Nick, MSIT has “over 2700 applications in 82 solution domains”. When you consider the cost for taking a highly redundant approach across that many applications, the cost gets out of control very quickly. Nick estimates that the support staff cost alone for tripling our hardware infrastructure to make it highly redundant would be around half a billion dollars a year. And that doesn’t include hardware acquisition costs, electricity costs, real-estate costs (gotta put all those servers somewhere) or any other costs. The impact to Microsoft’s bottom line would be enormous, for what Nick calls “negligible or invisible” benefit.

There’s no question that high availability costs big money. I just asked Dale about it, and he said that in his opinion going above 99.9% availability increases costs “nearly exponentially”. He estimates just going from 99% to 99.9% doubles the cost. 99% availability is almost 15 minutes of downtime per day (on average). 99.9% is about 90 seconds downtime per day (again, on average).
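
A quick back-of-the-envelope check on Dale’s numbers:

```python
# Average downtime per day implied by a given availability level.
def downtime_per_day_secs(availability):
    return (1 - availability) * 24 * 60 * 60

print(downtime_per_day_secs(0.99) / 60)   # 14.4 -> almost 15 minutes/day
print(downtime_per_day_secs(0.999))       # 86.4 -> about 90 seconds/day
```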

How much is that 13 extra minutes of uptime per day worth? I would say “depends on the application”. How many of the 2700 applications Nick mentions need even 99% availability? Certainly some do, but I would guess that less than 10% of those systems need better than 99% availability. What pretty much all of them actually need is high reliability, which is to say they need to work even in the face of “hostile or unexpected circumstances” (like system failures and downtime).

High availability implies high reliability. However, the reverse is not true. You can build systems to gracefully handle failures without the cost overhead of highly redundant infrastructure intended to avoid failures. Personally, I think the best way to build such highly reliable yet not highly available systems is to use durable messaging, though I’m sure there are other ways.
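
To be clear about what I mean by durable messaging, here’s a minimal sketch of the pattern (again my own illustration; SQLite just stands in for whatever durable store you’d actually use, like MSMQ, and the table and function names are made up). The point is that the message is persisted before the send completes, so a receiver outage delays the work instead of losing it:

```python
import sqlite3

conn = sqlite3.connect("outbox.db")  # hypothetical local message store
with conn:
    conn.execute("""CREATE TABLE IF NOT EXISTS outbox (
                        id INTEGER PRIMARY KEY AUTOINCREMENT,
                        body TEXT NOT NULL,
                        delivered INTEGER NOT NULL DEFAULT 0)""")

def send(body):
    # Persist first: once this commits, a crash of the sender, the
    # receiver, or the network no longer loses the message.
    with conn:
        conn.execute("INSERT INTO outbox (body) VALUES (?)", (body,))

def pump(deliver):
    # Push pending messages at the receiver. If it's down, messages
    # simply wait in the store: reliability without high availability.
    rows = conn.execute(
        "SELECT id, body FROM outbox WHERE delivered = 0").fetchall()
    for msg_id, body in rows:
        try:
            deliver(body)
        except Exception:
            continue  # leave it queued; the next pump retries it
        with conn:
            conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?",
                         (msg_id,))
```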

This is probably the biggest difference between Libor and me. I am actively looking to trade away availability (not reliability) in return for lowering the cost of building and running a system. To someone who builds electronic trading systems like Libor, that probably sounds completely wrongheaded. But an electronic trading system would fall into the minority of systems that need high availability (ultra high five nines of availability in this case). For the systems that actually do need high availability, you have to invest in redundancy to get it. But for the rest of the systems, there’s a less costly way to get the reliability you need: Durable Messaging.

*Now* How Much Would You Pay For This Code?

Via Larkware and InfoQ, I discovered a great post on code reuse by Dennis Forbes: Internal Code Reuse Considered Dangerous. I’ve written about reuse before, albeit in the context of services. But where I wrote about the impact of context on reuse (high context == low or no reuse), Dennis focused on the idea of accidental reuse. Here’s the money quote from Dennis:

Code reuse doesn’t happen by accident, or as an incidental – reusable code is designed and developed to be generalized reusable code. Code reuse as a by-product of project development, usually the way organizations attempt to pursue it, is almost always detrimental to both the project and anyone tasked with reusing the code in the future. [Emphasis in original]

I’ve seen many initiatives of varying “officialness” to identify and produce reusable code assets over the years, both inside and outside Microsoft. Dennis’ point that code has to be specifically designed to be reusable is right on the money. Accidental code (or service) reuse just doesn’t happen. Dennis goes so far as to describe such efforts as “almost always detrimental to both the project and anyone tasked with reusing the code in the future”.

One of the more recent code reuse efforts I’ve seen went so far as to identify a reusable asset lifecycle model. While it was fairly detailed about the lifecycle steps that came after said asset came into existence, it was maddeningly vague as to how these reusable assets got built in the first place. The lifecycle said that a reusable asset “comes into existence during the planning phases”. That’s EA-speak for “and then a miracle happens”.

Obviously, the hard part about reusable assets is designing and building them in the first place. So the fact that they skimped on this part of the lifecycle made it very clear they had no chance of success with the project. I shot back the following questions, but never got a response. If you are attempting such a reuse effort, I’d strongly advise answering these questions first:

  • How does a project know a given asset is reusable?
  • How does a project design a given asset to be reusable?
  • How do you incent (incentivize?) a project to invest the extra resources (time, people, money) it takes to generalize an asset to be reusable?

And to steal one from Dennis:

  • What, realistically, would competitors and new entrants in the field offer for a given reusable asset?

Carl Lewis wonders Is your code worthless? As a reusable asset, probably yes.

Early Afternoon Coffee 105

  • My two sessions on Rome went very well. Sort of like what I did @ TechEd last month, but with a bit more kimono opening since it was an internal audience. The best thing about doing these types of talks is the questions and post-session conversation. I’ve missed that since moving over to MSIT.
  • Late last week, I got my phone switched over to the new Office Communications Server 2007 beta. In my old office, I used the Office Communicator PBX phone integration features extensively. However, when we moved we got new IP phones that didn’t integrate with Communicator. So when a chance to get on the beta came along, I jumped. I’ll let you know my impressions after a few weeks, in the meantime you can read about Mark Deakin’s experience.
  • Matevz Gacnik figures out how to build a transactional web service that interacts with the new transactional file system in Vista and Server 08. Interesting, but personally I don’t believe in using transactional web services. The whole point of service orientation is to reduce the coupling between services. Tying two services (technically, a service consumer and provider) together in an atomic transaction seems like going in the wrong direction. Still, good on Matevz for digging into the transactional file system.
  • Udi Dahan gives us 6 simple steps to being a “top” IT consultant. I notice that getting well known, speaking, and publishing are at the top of the list, but actually being good at what you’re well known for comes in at #5. I’m sure Udi thinks that’s implicit in becoming a “top” consultant, but I’m not so sure.
  • Pat Helland thinks Normalization is for Sissies. Slide #6 has the key take away: “For God’s Sake, Don’t Normalize Immutable Data”.
  • Larry O’Brien bashes the new binary efficient XML working group and working draft. I agree 100% w/ Larry. These aren’t the droids we’re looking for.
  • John Evdemon points to a new e-book from my old team called SOA in the Real World. I flipped thru it (figuratively) and it appears to drill into the Foundations of Solution Architecture as well as provide real-world case studies for each of the pillars’ recurring logical capabilities. Need to give it a deeper read.

Morning Coffee 104

  • I’m presenting at an internal training conference today and tomorrow, so my Morning Coffee roundup posts will be lighter than usual. On the other hand, I’m taking a bus downtown to the convention center, so I might write something more substantial on the way there and back. Or maybe I’ll just read.
  • My wife’s blogging will also be light, because she’s got her nose buried in a book. If I do read something to or from the conference, it’s not that book because she won’t let me near it until she’s done! 😄
  • Speaking of “that book”, Werner Vogels drops a few details about how well Amazon handled 1.3 million pre-orders that were delivered on Saturday (including our copy).
  • First drop of IronRuby is available. For now, you can get it from John Lam’s blog. Unlike IronPython, IronRuby will be hosted at RubyForge, not CodePlex, but the site isn’t set up yet. Other big news is that the IronRuby team will be accepting external contributions. Are these encouraging signs for the Ruby community?
  • More MS Research goodness: a new drop of Spec# is available. I’ve written about Spec# before, but haven’t had the time to dig into it. (via Larkware)
  • Scott Hanselman takes the red pill. Congrats!
  • Speaking of Scott, he forwards on advice to remove a programmatic crutch. Good advice. Not to go all Petzold on Visual Studio, but I would guess the IDE is the biggest crutch out there. As for giving up compulsively checking email, if that’s a goal, Scott, I think you might have joined the wrong company…

Morning Coffee 103